[Linux-cluster] Odd cluster problems

Thu Aug 2 20:19:17 UTC 2007

On Thu, Aug 02, 2007 at 02:00:13PM -0500, Jay Leafey wrote:
> Lon Hohberger wrote:
> >On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote:
> >>I've got a 3-node cluster running CentOS 4.5 and I cannot communicate 
> >>with the resource group manager.  When I use the clustat command I get a 
> >>timeout:
> >>
> >>>[root@rapier ~]# clustat
> >>>Timed out waiting for a response from Resource Group Manager
> >>>Member Status: Quorate
> >>>
> >>> Member Name                              Status
> >>> ------ ----                              ------
> >>> rapier.utmem.edu                         Online, Local, rgmanager
> >>> thorax.utmem.edu                         Offline
> >>> cyclops.utmem.edu                        Online, rgmanager
> >
> >>>Fence Domain:    "default"                           2   2 recover 4 -
> >>>[1 2]
> >
> >Until fencing completes, rgmanager won't respond.
> >
> >fence_ack_manual needs to be run.
> >
> >>><SNIP>
> >>>
> >>>User:            "usrm::manager"                    10  10 recover 2 -
> >>>[1 2]
> >>>
> >
> 
> Your reply was a bit confusing at first, but looking deeper showed you 
> were right on the mark.  The systems (using HP ILO fencing) were unable 
> to communicate with each other very well or with the ILO ports at all. 
> Turns out some of the ports they were configured on had been moved to a 
> different VLAN, so the network was split between the ILOs and the host 
> ports.

Sorry, I just assumed you were using manual fencing rather than iLO,
since manual fencing accounts for roughly 90% of the cases where fencing
is stuck in the 'recover' state.

I guess we all know what happens when you assume... :)  Or maybe, when I
assume?
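
For anyone who finds this thread later: a quick way to spot the condition is to look for service groups sitting in 'recover'. The snippet below is only a rough sketch. The helper name stuck_groups is made up for illustration, and the column layout it relies on is assumed from the two lines quoted above rather than taken from any documented format.

```shell
#!/bin/sh
# Hypothetical helper: list service groups whose state column reads "recover".
# Input is expected on stdin in the layout quoted in this thread, e.g.
#   Fence Domain:    "default"    2   2 recover 4 -
# (layout assumed from that quoted output, not from a format specification).
stuck_groups() {
    # Split each line on double quotes so that field 2 is the group name.
    # Only lines that contain a quoted name and the word "recover" are printed.
    awk -F'"' '/ recover / && NF >= 3 { print $2 }'
}

# Example run using the two lines quoted earlier in the thread.
printf '%s\n' \
    'Fence Domain:    "default"                           2   2 recover 4 -' \
    'User:            "usrm::manager"                    10  10 recover 2 -' \
    | stuck_groups
```

On a live node the input would presumably come from the command that produced the quoted lines (something like `cman_tool services | stuck_groups`). If the fence domain shows up in the list, fencing has not completed and rgmanager will keep timing out.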

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.