[Linux-cluster] really reliable?

Mon Apr 20 20:55:35 UTC 2009

How long is too long when waiting for the fencing part to complete?  For you guys, what is the normal amount of time that it takes for "Starting fencing..." to complete?  For me, it takes anywhere from 30 to 45 seconds.  Could this be because I'm on Cisco switches?  If this is normal, then I'm just going to leave it be, because it does complete and my cluster forms just fine.  It just sits there for 30-45 seconds with these messages repeated:

Apr 16 21:01:47 oilfish openais[4648]: [TOTEM] entering GATHER state from 11. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] entering GATHER state from 0. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] Creating commit token because I am the rep. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] Saving state aru 37 high seq received 37 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] Storing new sequence id for ring 13c 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] entering COMMIT state. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] entering RECOVERY state. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] position [0] member 172.31.37.2: 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] previous ring seq 312 rep 172.31.37.2 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] aru 37 high delivered 37 received flag 1 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] Did not need to originate any messages in recovery. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] Sending initial ORF token 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] CLM CONFIGURATION CHANGE 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] New Configuration: 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] 	r(0) ip(172.31.37.2)  
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] Members Left: 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] Members Joined: 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] CLM CONFIGURATION CHANGE 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] New Configuration: 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] 	r(0) ip(172.31.37.2)  
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] Members Left: 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] Members Joined: 
Apr 16 21:01:52 oilfish openais[4648]: [SYNC ] This node is within the primary component and will provide service. 
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] entering OPERATIONAL state. 
Apr 16 21:01:52 oilfish openais[4648]: [CLM  ] got nodejoin message 172.31.37.2 
Apr 16 21:01:52 oilfish openais[4648]: [CPG  ] got joinlist message from node 1

Then eventually:

Apr 16 21:02:19 oilfish openais[4648]: [TOTEM] entering OPERATIONAL state. 
Apr 16 21:02:19 oilfish openais[4648]: [CLM  ] got nodejoin message 172.31.37.2 
Apr 16 21:02:19 oilfish openais[4648]: [CLM  ] got nodejoin message 172.31.37.4 
Apr 16 21:02:19 oilfish openais[4648]: [CPG  ] got joinlist message from node 2 
Apr 16 21:02:19 oilfish openais[4648]: [CPG  ] got joinlist message from node 1 
Apr 16 21:02:23 oilfish kernel: dlm: connecting to 2
Apr 16 21:02:23 oilfish kernel: dlm: got connection from 2
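
For anyone who wants to compare their own startup delay against the 30-45 seconds described here, the wait can be read off the syslog timestamps. The following is only a rough sketch: the helper names (parse_stamp, elapsed_seconds) are made up for illustration, the year is assumed to be 2009 because syslog lines do not carry one, and the first GATHER line is assumed to roughly mark the start of the wait. The sample lines are copied from the excerpts above.

```python
import re
from datetime import datetime

# A few lines copied from the log excerpts in this message.
LOG = """\
Apr 16 21:01:47 oilfish openais[4648]: [TOTEM] entering GATHER state from 11.
Apr 16 21:01:52 oilfish openais[4648]: [TOTEM] entering OPERATIONAL state.
Apr 16 21:02:19 oilfish openais[4648]: [TOTEM] entering OPERATIONAL state.
Apr 16 21:02:23 oilfish kernel: dlm: got connection from 2
"""

# Classic syslog prefix: month name, day of month, HH:MM:SS.
STAMP = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) ")


def parse_stamp(line, year=2009):
    """Return the line's timestamp, or None if it has no syslog prefix.

    The year is an assumption; month names are parsed with %b, which
    expects an English/C locale.
    """
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the last
    line containing end_marker, or None if either marker is missing."""
    start = end = None
    for line in log_text.splitlines():
        stamp = parse_stamp(line)
        if stamp is None:
            continue
        if start is None and start_marker in line:
            start = stamp
        if end_marker in line:
            end = stamp
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    print(elapsed_seconds(LOG, "entering GATHER state",
                          "entering OPERATIONAL state"))
```

On the lines above this gives 32 seconds between the first GATHER message and the final OPERATIONAL message, which is in line with the 30-45 second wait.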

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vu Pham
Sent: Tuesday, April 14, 2009 6:17 PM
To: golharam at umdnj.edu; linux clustering
Subject: Re: [Linux-cluster] really reliable?

Ryan Golhar wrote:
> I'm running RHEL 5.3 64-bit.  So far, I only want to see that the
> cluster can run.  I'll worry about getting GFS after I'm confident this
> works.
> 
> I've got three nodes: pico, vail, and whistler.  They each have two NIC
> cards, one that provides a public IP address, and another that provides
> private communications.  All cluster traffic will go over the private
> network, 192.168.20.0.
> 
> I've installed only the following components:
> system-config-cluster-1.0.52-1.1, cman-2.0.98-1, and rgmanager-2.0.38-2.
> 
> I've created my cluster.conf file to include these three nodes and
> fence them using a brocade fibre switch (for GFS).
> 
> When I start the cluster services on all 3 nodes using the manual 
> method of:
> 
> /sbin/ccsd; /usr/sbin/cman_tool join
> 
> The nodes successfully form a cluster.  I am able to leave the cluster 
> and kill ccsd as well.
> 
> If I try to start the cman service I see:
> 
> [root at pico cluster]# /sbin/service cman start
> Starting cluster:
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... done
>    Starting daemons... done
>    Starting fencing...
> 
> 
> And it just hangs.  I know my fencing is set up correctly because I've 
> had nodes fence other nodes before (when I was trying with 6 members). 
> If I let it sit for long enough, sometimes it finishes successfully.  I'm 
> not sure what it's doing because fence_tool is called and it's a binary...
> 

Ryan,

Is there anything suspicious in the log when it hangs at fencing?
Could you show your cluster.conf?

Vu
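
Before posting a cluster.conf, a quick static check can catch one common kind of fencing misconfiguration: a node with no fence device, or a node that refers to a fence device that is never defined. The sketch below is only an illustration. It assumes the usual RHEL 5 layout (clusternode/fence/method/device entries that refer by name to fencedevices/fencedevice entries). The sample configuration is hypothetical: the node names come from this thread, but the device names, address, and credentials are placeholders, and fencing_problems is a made-up helper name. It does not test whether the fence agent actually works.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf: node names are from this thread; device
# names, address, and credentials are placeholders, not real settings.
SAMPLE_CONF = """\
<cluster name="example" config_version="1">
  <clusternodes>
    <clusternode name="pico" nodeid="1">
      <fence>
        <method name="1">
          <device name="brocade1" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vail" nodeid="2">
      <fence>
        <method name="1">
          <device name="brocade2" port="2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="whistler" nodeid="3"/>
  </clusternodes>
  <fencedevices>
    <fencedevice name="brocade1" agent="fence_brocade"
                 ipaddr="192.0.2.10" login="admin" passwd="changeme"/>
  </fencedevices>
</cluster>
"""


def fencing_problems(conf_xml):
    """Return a list of human-readable problems found in the fencing setup."""
    root = ET.fromstring(conf_xml)
    # Names of all fence devices declared in the <fencedevices> section.
    defined = {fd.get("name")
               for fd in root.findall("./fencedevices/fencedevice")}
    problems = []
    for node in root.findall("./clusternodes/clusternode"):
        name = node.get("name")
        devices = node.findall("./fence/method/device")
        if not devices:
            problems.append(f"{name}: no fence device configured")
        for dev in devices:
            if dev.get("name") not in defined:
                problems.append(
                    f"{name}: fence device {dev.get('name')!r} is not defined")
    return problems


if __name__ == "__main__":
    for problem in fencing_problems(SAMPLE_CONF):
        print(problem)
```

A clean result here only means the file is internally consistent; whether the switch really cuts off a node still has to be verified by fencing a node for real.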

> Ryan
> 
> 
> Gordan Bobic wrote:
>> What distro are you using? I've found that:
>>
>> 1) Distros other than RHEL/CentOS can be quirky when it comes to using
>> RHCS. I've even run into problems on Fedora more than once (not to mention
>> that FC hasn't shipped GFS1 since FC5 and GFS2 wasn't deemed
>> production stable until last month - and we're up to FC10 now).
>>
>> 2) Starting RHCS components using anything except the intended init scripts
>> tends to cause problems.
>>
>> 3) The source of 99% of problems in the remaining cases (i.e. not covered by
>> 1) and 2) above) is incorrectly configured fencing.
>>
>> Does your setup fall under either of the first two categories?
>> Have you verified beyond doubt that your fencing is configured correctly
>> and that the fencing script gets verification upon success?
>>
>> Gordan
>>
>> On Tue, 14 Apr 2009 12:17:44 -0400, Ryan Golhar <golharam at umdnj.edu> wrote:
>>> Hi all,
>>>
>>> Is Red Hat Cluster Suite really reliable?  I've been having so much 
>>> trouble getting a cluster up and running, I'm beginning to second-guess 
>>> my decision to use this software stack.
>>>
>>> I have 3 nodes (eventually 10) running and set up.  The fencing 
>>> method is by a brocade fibre switch.  The ultimate goal of this 
>>> cluster is to share a SAN connected by fibre.
>>>
>>> I've installed just the bare minimum (before even getting to GFS) to 
>>> test the cluster software.  Just starting cman cluster services fails 
>>> on two of the nodes.
>>>
>>> Even when I try to reboot the nodes, I can't because the whole system 
>>> hangs on various processes that don't ever shut down.  I have to 
>>> physically reboot these boxes.
>>>
>>> The logs fill up with errors about not being able to connect to cman, etc.
>>> I've been at it for a while now and am not sure this is the best route 
>>> anymore.
>>>
>>> Ryan
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 