[Linux-cluster] Cluster failing after rebooting a standby node

Christine Caulfield ccaulfie at redhat.com
Wed Apr 23 13:53:52 UTC 2008


Ben J wrote:
> Hello all,
> 
> We've been encountering an issue with RHCS4 U6 (using the U5 version of
> system-config-cluster, as the U6 version is broken) that results in the
> cluster failing after rebooting one of the standby nodes, with CMAN
> dying after too many transition restarts.
> 
> We have a 7-node cluster, with 5 active nodes and 2 standby nodes.  We
> are running the cluster with broadcast mode for cluster communication
> (the default for CS4); changing to multicast isn't an option at the
> moment because we use Cisco switching infrastructure.  The hardware
> we're running the cluster on is IBM HS21 blades within 2 IBM H-series
> blade chassis (3 within one chassis, 4 in the other).  Each blade
> chassis network switch module has dual gigabit uplinks to a Cisco switch.
> 
> We have done a lot of analysis of our network to ensure that the
> underlying network is not preventing the cluster nodes from talking to
> one another, so we have ruled the network out as a cause of the problem.
> 
> The cluster is currently a pre-production system that we are testing
> before putting it into production, so the nodes are basically sitting
> idle at the moment while we test the cluster itself.
> 
> What we have seen is that after the cluster has been operational for
> several days, initiating a reboot of one of the standby nodes (one that
> isn't running any clustered services at the time) causes the other
> cluster nodes to start filling their logs with:
> 
> Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
> Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65
> 
> The generation number keeps increasing until CMAN dies with:
> 
> Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts -
> will die
> Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster.
> Inconsistent cluster view
> Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
> Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
> Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down
> uncleanly
> Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown. Attemping
> to reconnect...
> Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate.  Refusing
> connection.
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing connect:
> Connection refused
> Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
> Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something
> evil.
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid
> request descriptor
> Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
> Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something
> evil.
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid
> request descriptor
> Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-21).
> Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something
> evil.
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect:
> Invalid request descriptor
> Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate.  Refusing
> connection.
> 
> The interesting thing is that immediately after rebooting all of the
> nodes within the cluster and restarting the cluster services, the
> problem cannot be replicated.  Typically the cluster system has to have
> been running for 3-4 days untouched before we can then replicate the
> problem again (i.e. I reboot one of the standby nodes and it fails again).
> 
> Yesterday I made a change to cluster.conf to increase the logging
> facility and logging level (setting it to debug, level 7), and after
> using ccs_tool to apply the change to the cluster online, once again I
> can't replicate the problem (even though I could replicate it
> immediately before making the change).
> 
> Has anyone experienced anything even remotely similar to this (I
> couldn't see anything similar reported in the list archives) and/or have
> any suggestions as to what might be causing the issue?

I have heard of similar incidents but we have never managed to pin down
just what is happening. If you can reproduce it, could you send me a
tcpdump of the conversations on port 6809 when it happens, please?

You might like to set up tcpdump to do a rolling capture so it doesn't
fill up a disk while you're waiting for it to happen!

The command is:
tcpdump -C 10 -W10 -w /tmp/port6809.dmp -xs0 port 6809
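
For reference, a rough breakdown of those options, and one way you might
leave the capture running unattended (the nohup and log file paths here
are only examples):

# -C 10     : start a new capture file roughly every 10 million bytes
# -W 10     : keep at most 10 files, overwriting the oldest (a rolling buffer)
# -s0       : capture full packets rather than the default snap length
# port 6809 : restrict the capture to the cman traffic we're interested in
nohup tcpdump -C 10 -W 10 -w /tmp/port6809.dmp -s0 port 6809 \
    > /tmp/tcpdump-6809.log 2>&1 &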

Every time we have seen this, there has been a potential for networking
troubles at the site. If you are confident that your network is fully
stable, then it would be really helpful to get some debugging for this
problem.
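
Incidentally, if I remember the CS4 procedure right, the online update
you describe normally amounts to bumping config_version in
/etc/cluster/cluster.conf, raising the logging attributes, and then
running something like the following (the version number here is just an
example, so check it against your own configuration):

# after editing /etc/cluster/cluster.conf and incrementing config_version:
ccs_tool update /etc/cluster/cluster.conf   # push the new file to all members via ccsd
cman_tool version -r 43                     # tell cman the new config version (43 is an example)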

Thanks,

Chrissie



