[Linux-cluster] Cluster failing after rebooting a standby node

Ben J bjlist at westnet.com.au
Wed Apr 23 03:33:19 UTC 2008


Hello all,

We've been encountering an issue with RHCS4 U6 (using the U5 version of
system-config-cluster, as the U6 version is broken) in which the
cluster fails after rebooting one of the standby nodes, with CMAN
dying after too many transition restarts.

We have a 7-node cluster, with 5 active nodes and 2 standby nodes.  We
are running the cluster in broadcast mode for cluster communication
(the default for CS4); changing to multicast isn't an option at the
moment because of our Cisco switching infrastructure.  The hardware
we're running the cluster on is IBM HS21 blades housed in two IBM
BladeCenter H chassis (3 blades in one chassis, 4 in the other).  Each
chassis network switch module has dual gigabit uplinks to a Cisco
switch.
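
For reference, our cluster.conf is laid out roughly as below (node
names other than server01, and the config_version, are illustrative).
Note that there is no <multicast> element under <cman>, so CMAN uses
its default broadcast communication:

  <?xml version="1.0"?>
  <cluster name="testcluster" config_version="10">
    <!-- no <multicast> child here, so CS4 defaults to broadcast -->
    <cman/>
    <clusternodes>
      <clusternode name="server01" votes="1">
        <fence>
          <!-- per-node fencing omitted for brevity -->
        </fence>
      </clusternode>
      <!-- ...six more <clusternode> entries, seven in total... -->
    </clusternodes>
    <fencedevices>
      <!-- fence device definitions omitted -->
    </fencedevices>
    <rm>
      <!-- failover domains and services omitted -->
    </rm>
  </cluster>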

We have done extensive analysis of our network to ensure that the
problem is not being caused by the underlying network preventing the
cluster nodes from talking to one another, and we have ruled this out
as a cause.
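
(As one example of that verification: assuming the default CMAN port
of 6809, watching the heartbeat traffic on each node with something
like

  tcpdump -i eth0 -n udp port 6809

shows the broadcast packets from all seven nodes arriving on every
member.)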

The cluster is currently a pre-production system that we are testing
before putting it into production, so the nodes are basically sitting
idle at the moment while we test the cluster itself.

What we have seen is that after the cluster has been operational for
several days, initiating a reboot of one of the standby nodes (one
that isn't running any clustered services at the time) causes the
other cluster nodes to start filling their logs with:

Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65

With the generation number increasing until CMAN dies with:

Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts - will die
Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down uncleanly
Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown.  Attemping to reconnect...
Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate.  Refusing connection.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing connect: Connection refused
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-21).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect: Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate.  Refusing connection.

The interesting thing is that immediately after rebooting all of the
nodes in the cluster and restarting the cluster services, the problem
cannot be replicated.  Typically the cluster has to have been running
untouched for 3-4 days before we can replicate the problem again
(i.e. I reboot one of the standby nodes and it fails as above).

Yesterday I made a change to cluster.conf to increase the logging
facility and logging level (set it to debug, level 7), and after
using ccs_tool to apply the change to the cluster online, once again
I can't replicate the problem (even though immediately before the
change I could).
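
For the record, the change was along these lines (the attributes are
the rgmanager <rm> logging settings; the facility and config_version
values here are illustrative):

  <!-- in /etc/cluster/cluster.conf: bump config_version and set
       rgmanager logging to debug -->
  <cluster name="testcluster" config_version="11">
    ...
    <rm log_facility="local4" log_level="7">
      ...
    </rm>
  </cluster>

which I then pushed out to the running cluster with:

  ccs_tool update /etc/cluster/cluster.conf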

Has anyone experienced anything even remotely similar to this (I 
couldn't see anything similar reported in the list archives) and/or have 
any suggestions as to what might be causing the issue?

Cheers,

Ben



