[Linux-cluster] CS5 Problem

Alain Moulle Alain.Moulle at bull.net
Thu Apr 24 11:12:39 UTC 2008


Hi

I am facing a problem.

When testing a two-node cluster with a quorum disk, if I power off
node 1, node 2 fences node 1 correctly and fails over the service.
However, both before and after the fence success messages, the log
on node 2 contains many messages like these:
Apr 24 11:30:04 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:04 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:05 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:05 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:06 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:06 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:07 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:07 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:08 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.

The problem is that when, after the reboot, I try to start CS5 again
on node 1, cman fails with these messages in syslog:
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Copyright (C) Red Hat, Inc.  2004  All
rights reserved.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: cluster.conf (cluster name = A0ha2,
version = 1) found.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Remote version #: 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Remote version #: 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Remote version #: 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]:  Remote version #: 1
Apr 24 11:47:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 30 seconds.
Apr 24 11:48:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 60 seconds.
Apr 24 11:48:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 90 seconds.
Apr 24 11:48:37 s_sys at xn4 ntpd[6179]: synchronized to 192.168.64.99, stratum 11
Apr 24 11:48:37 s_sys at xn4 ntpd[6179]: kernel time sync enabled 0001
Apr 24 11:49:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 120 seconds.
Apr 24 11:49:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 150 seconds.
Apr 24 11:50:01 s_sys at xn4 crond[11455]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 24 11:50:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 180 seconds.
Apr 24 11:50:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 210 seconds.
Apr 24 11:51:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 240 seconds.
Apr 24 11:51:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 270 seconds.
Apr 24 11:52:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 300 seconds.
Apr 24 11:52:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 330 seconds.
Apr 24 11:53:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 360 seconds.
Apr 24 11:53:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 390 seconds.
Apr 24 11:54:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 420 seconds.
Apr 24 11:54:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 450 seconds ...
etc.

Or, alternatively, with these:
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Cluster is not quorate.  Refusing connection.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing connect:
Connection refused
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Invalid descriptor specified (-111).
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Someone may be attempting something evil.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing get: Invalid
request descriptor
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Invalid descriptor specified (-111).
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Someone may be attempting something evil.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing get: Invalid
request descriptor
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Invalid descriptor specified (-21).
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Someone may be attempting something evil.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing disconnect:
Invalid request descriptor
Apr 24 10:17:37 s_sys at xn4 rgmanager: [11331]: <notice> Cluster Service Manager
is stopped.


And I can't start it again unless I first stop the CS on both nodes.

The quorumd section of my cluster.conf is as follows:
<quorumd label="QDISK_2_0" interval="1" tko="10" votes="1" min_score="1">
     <heuristic interval="10" tko="3" program="ping -t1 -c1 192.168.64.99" score="1"/>
     <heuristic interval="10" program="ping -t3 -c1 192.168.64.99" score="1"/>
</quorumd>
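
To make the timings in this section easier to read, here is a minimal
Python sketch that parses the quorumd block above and prints the derived
values. It assumes the usual qdisk semantics, where a node is declared
dead after missing tko cycles of interval seconds each; that reading is
an assumption and should be checked against qdisk(5) for the installed
version. The function name summarize_quorumd and the dictionary keys are
hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The quorumd section quoted above, copied verbatim.
QUORUMD_XML = """
<quorumd label="QDISK_2_0" interval="1" tko="10" votes="1" min_score="1">
     <heuristic interval="10" tko="3" program="ping -t1 -c1 192.168.64.99" score="1"/>
     <heuristic interval="10" program="ping -t3 -c1 192.168.64.99" score="1"/>
</quorumd>
"""


def summarize_quorumd(xml_text):
    """Return the timing and score values implied by a quorumd element.

    Assumption: a node is evicted after interval * tko seconds without
    a qdisk update (see qdisk(5)).
    """
    root = ET.fromstring(xml_text)
    interval = int(root.get("interval"))
    tko = int(root.get("tko"))

    heuristics = []
    for h in root.findall("heuristic"):
        h_tko = h.get("tko")  # optional: the second heuristic does not set it
        heuristics.append({
            "program": h.get("program"),
            "interval": int(h.get("interval")),
            "tko": int(h_tko) if h_tko is not None else None,
            "score": int(h.get("score")),
        })

    return {
        "eviction_seconds": interval * tko,
        "min_score": int(root.get("min_score")),
        "max_score": sum(h["score"] for h in heuristics),
        "heuristics": heuristics,
    }


if __name__ == "__main__":
    summary = summarize_quorumd(QUORUMD_XML)
    print("eviction after (s):", summary["eviction_seconds"])
    print("min_score / max_score:", summary["min_score"], "/", summary["max_score"])
    for h in summary["heuristics"]:
        print(h)
```

With this configuration the sketch reports an eviction window of 10
seconds and a maximum score of 2 against a min_score of 1. It also makes
visible that both heuristics ping the same address, 192.168.64.99, and
that only the first one sets tko.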

I urgently need help. Does anyone have any idea what causes this problem?

Thanks a lot
Regards.
Alain Moullé