[Linux-cluster] I give up

Scott Becker scottb at bxwa.com
Wed Nov 28 21:27:14 UTC 2007



Kevin Anderson wrote:
> Not sure what you mean by 3 to 1 using IP tie breaker.  How are you 
> maintaining quorum without qdisk as a voting entity?
>
I have three nodes. If one fails, the other two are expected to maintain 
quorum and continue. If a second node then fails, I would really like the 
last node to keep going on its own (last man standing). For this to work I 
would need to set expected votes to 1 and make sure the correct node wins 
the ensuing fencing race.

Case two: I remove one node from the cluster for maintenance. Now I have 
a two-node cluster, with the same issues as above. Luci wants to set 
two_node = 1 in this case instead of just dealing with expected votes = 1. 
I haven't tested this because I'm doing all of this testing with node 2 and 
node 3 while the future node 1 is currently our production server.

The ping gateway test/IP tie-breaker was my way of reliably running down 
to last man standing.
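The way I've usually seen an IP tie-breaker written down in cluster.conf is 
as a qdiskd heuristic that pings the gateway. A sketch only, with a made-up 
gateway address and illustrative timing values; my own setup differs, since 
I'm not running qdisk as a voting entity:

    <quorumd interval="1" tko="10" votes="1" label="qdisk">
        <heuristic program="ping -c1 -w1 192.168.1.1"
                   score="1" interval="2" tko="3"/>
    </quorumd>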


>> During a network partition test, expecting a fencing race whose outcome 
>> I controlled, one node would not fence the other and did not take over 
>> the service until the other node attempted to rejoin the cluster (way 
>> too late).
>
> Is this resolved with the 5.1 release we did a few weeks ago?
>
I'm using the latest release.
>>
>> Another poster stated that he could not get the cluster to function 
>> properly since the switch to Openais. Hence I'm speculating that they 
>> are related.
> Doubtful.  There have been issues with Cisco switch configurations 
> not passing multicast traffic properly.  All of those have been resolved 
> with a switch configuration setting change.
>
I don't know why it just "stared at me" instead of recovering the service, 
because debugging output is lacking. I really think that even if "verbose 
debugging" were only a compile-time option and users had to install 
"testing" rpms, all the problems would have been flushed out long ago.


...
> Both of these are part of the bigger-picture resource monitoring work 
> that Lon and some of the Linux-HA guys are jointly working on, 
> converging on a single code base.  See this page -
> http://people.redhat.com/lhh/cluster-integration.html
>
> Which again, not very visible :-(.
From a distance, it seems that 5.0 and 5.1 are less stable than 4.4 and 
4.5 (I've only tried the current ones). If big changes were made and 
released prematurely, they're being shaken out by production clusters 
instead of test clusters.

How much of this "not very visible" work is being tested by a larger group?

>
> 3. Time for Cluster Summit again - location preferences, timeframe, 
> funding, etc?
>
Summits are better than closed development, but users like me are never 
going to attend. A community-based site is a good foundation.

By the way, I am a C programmer. (From Windows land, though we use RH on 
all of our servers.) I've spent a month trying to get this to work. It's 
open source, and given enough time I could make it go, but I don't have 
any more time. It's supposed to be production quality.

I have a failure case staring at me, but debugging output is lacking, so I 
have to look elsewhere for a solution. I can't sit here dangling my feet 
waiting, and I can't spend weeks fixing it myself.
