[Linux-cluster] Cluster failing after rebooting a standby node
Lon Hohberger
lhh at redhat.com
Wed Apr 23 13:43:07 UTC 2008
On Wed, 2008-04-23 at 11:33 +0800, Ben J wrote:
> What we have seen happening, is that we have the cluster operational for
> several days and when initiating a reboot of one of the standby nodes
> (that isn't running any clustered services at the time), the other
> cluster nodes start filling the logs with:
>
> Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
> Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65
>
> With the generation number increasing until CMAN dies with:
>
> Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts -
> will die
> Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster.
> Inconsistent cluster view
^^^^ This is the problem.
vvvv Everything below is a consequence of that problem and will go away
once it is resolved.
> Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
> Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
> Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down
> uncleanly
> Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown.
> Attemping to reconnect...
> <snip>...
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect:
> Invalid request descriptor
> Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate. Refusing
> connection.
> The interesting thing is that immediately after rebooting all of the
> nodes within the cluster and restarting the cluster services, the
> problem cannot be replicated. Typically the cluster system has to have
> been running for 3-4 days untouched before we can then replicate the
> problem again (i.e. I reboot one of the standby nodes and it fails again).
>
> I made a change yesterday to cluster.conf to increase the logging
> facility and logging level (set it to debug level - 7) and after using
> ccs_tool to apply the changes to the cluster online, once again I can't
> replicate the problem (even though immediately before this I could
> replicate the problem).
On RHEL4, there's an ugly, arcane extra step you need to do after
updating the config with ccs_tool: tell cman about the new config version.
cman_tool version -r <new_config_version>
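As a rough sketch of that step, the version number can be read from cluster.conf rather than typed by hand. This assumes the root <cluster> element carries a config_version attribute; the function name and the sample cluster name are made up for illustration, and the snippet only builds the command line without running it.

```python
import xml.etree.ElementTree as ET


def cman_version_command(cluster_conf_xml):
    """Build the cman_tool argument list for the config_version in the given XML text.

    Assumption: the root element is <cluster> and has a config_version attribute.
    """
    root = ET.fromstring(cluster_conf_xml)
    version = root.get("config_version")
    if root.tag != "cluster" or version is None:
        raise ValueError("no config_version attribute on a <cluster> root element")
    # int() rejects anything that is not a plain integer version number.
    return ["cman_tool", "version", "-r", str(int(version))]


# Hypothetical sample; a real file has many more elements inside <cluster>.
SAMPLE = '<cluster name="example" config_version="12"></cluster>'
print(" ".join(cman_version_command(SAMPLE)))  # prints: cman_tool version -r 12
```

On a real node you would pass in the contents of the cluster.conf that was just pushed with ccs_tool, then run the printed command by hand after checking the number.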
I'm not sure that skipping this step is the cause of the 'too many
transitions' problem you hit. (Unfortunately, I'm not one of the people
who fully understands what causes 'too many transitions'...)
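Since the cause is unclear, it may help to record how the generation numbers climb before CMAN gives up. The following is only a sketch that matches the kernel messages quoted above; the function name is made up, and the patterns assume the log wording is exactly as in those excerpts.

```python
import re

# Patterns taken from the kernel log lines quoted in this thread.
TRANSITION = re.compile(r"CMAN: Initiating transition, generation (\d+)")
GAVE_UP = "CMAN: too many transition restarts"


def summarize_transitions(lines):
    """Return (generation numbers seen in order, whether CMAN gave up)."""
    generations = []
    gave_up = False
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            generations.append(int(match.group(1)))
        elif GAVE_UP in line:
            gave_up = True
    return generations, gave_up


LOG = [
    "Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64",
    "Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65",
    "Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts -",
]
print(summarize_transitions(LOG))  # prints: ([64, 65], True)
```

Feeding it the relevant stretch of /var/log/messages from each node would show whether all nodes see the same run of generations around the reboot.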
-- Lon