[Linux-cluster] CS5 Problem
Alain Moulle
Alain.Moulle at bull.net
Thu Apr 24 11:12:39 UTC 2008
Hi,
I am facing a problem when testing a two-node cluster with a quorum disk.
When I power off node 1, node 2 fences node 1 correctly and
fails over the service, but in the log of node 2, both before and after
the fence success messages, I get many messages like these:
Apr 24 11:30:04 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:04 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:05 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:05 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:06 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:06 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:07 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
Apr 24 11:30:07 s_sys at xn3 qdiskd[13740]: <alert> Writing eviction notice for node 2
Apr 24 11:30:08 s_sys at xn3 qdiskd[13740]: <crit> Node 2 is undead.
The problem is that when, after the reboot, I try to start CS5 again
on node 1, cman fails with these messages in syslog:
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Copyright (C) Red Hat, Inc. 2004 All
rights reserved.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: cluster.conf (cluster name = A0ha2,
version = 1) found.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote version #: 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote version #: 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote version #: 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote copy of cluster.conf is from
quorate node.
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Local version # : 1
Apr 24 11:47:02 s_sys at xn4 ccsd[11099]: Remote version #: 1
Apr 24 11:47:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 30 seconds.
Apr 24 11:48:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 60 seconds.
Apr 24 11:48:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 90 seconds.
Apr 24 11:48:37 s_sys at xn4 ntpd[6179]: synchronized to 192.168.64.99, stratum 11
Apr 24 11:48:37 s_sys at xn4 ntpd[6179]: kernel time sync enabled 0001
Apr 24 11:49:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 120 seconds.
Apr 24 11:49:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 150 seconds.
Apr 24 11:50:01 s_sys at xn4 crond[11455]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 24 11:50:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 180 seconds.
Apr 24 11:50:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 210 seconds.
Apr 24 11:51:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 240 seconds.
Apr 24 11:51:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 270 seconds.
Apr 24 11:52:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 300 seconds.
Apr 24 11:52:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 330 seconds.
Apr 24 11:53:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 360 seconds.
Apr 24 11:53:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 390 seconds.
Apr 24 11:54:01 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 420 seconds.
Apr 24 11:54:31 s_sys at xn4 ccsd[11099]: Unable to connect to cluster
infrastructure after 450 seconds ...
etc.
or sometimes with these:
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Cluster is not quorate. Refusing connection.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing connect:
Connection refused
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Invalid descriptor specified (-111).
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Someone may be attempting something evil.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing get: Invalid
request descriptor
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Invalid descriptor specified (-111).
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Someone may be attempting something evil.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing get: Invalid
request descriptor
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Invalid descriptor specified (-21).
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Someone may be attempting something evil.
Apr 24 10:17:37 s_sys at xn4 ccsd[11023]: Error while processing disconnect:
Invalid request descriptor
Apr 24 10:17:37 s_sys at xn4 rgmanager: [11331]: <notice> Cluster Service Manager
is stopped.
And I can't start it again unless I first stop the CS on both nodes.
The qdisk section of my cluster.conf looks like this:
<quorumd label="QDISK_2_0" interval="1" tko="10" votes="1" min_score="1">
  <heuristic interval="10" tko="3" program="ping -t1 -c1 192.168.64.99" score="1"/>
  <heuristic interval="10" program="ping -t3 -c1 192.168.64.99" score="1"/>
</quorumd>
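As a rough sanity check of the timings this section implies, here is a minimal Python sketch. It is not part of the cluster software: the helper name qdisk_timeouts is hypothetical, and it assumes that a node or heuristic is given up on after interval * tko seconds and that a heuristic without a tko attribute defaults to 1 cycle.

```python
import xml.etree.ElementTree as ET

# The quorumd fragment from cluster.conf, copied from above.
QUORUMD_XML = """
<quorumd label="QDISK_2_0" interval="1" tko="10" votes="1" min_score="1">
  <heuristic interval="10" tko="3" program="ping -t1 -c1 192.168.64.99" score="1"/>
  <heuristic interval="10" program="ping -t3 -c1 192.168.64.99" score="1"/>
</quorumd>
"""


def qdisk_timeouts(xml_text, default_heuristic_tko=1):
    """Return (node_timeout, [(program, heuristic_timeout), ...]) in seconds.

    Assumption: a timeout is simply interval * tko, and a heuristic with
    no tko attribute uses default_heuristic_tko.
    """
    root = ET.fromstring(xml_text)
    node_timeout = int(root.get("interval")) * int(root.get("tko"))
    heuristics = []
    for h in root.findall("heuristic"):
        interval = int(h.get("interval"))
        tko = int(h.get("tko", default_heuristic_tko))
        heuristics.append((h.get("program"), interval * tko))
    return node_timeout, heuristics


node_timeout, heuristics = qdisk_timeouts(QUORUMD_XML)
print("node declared dead after", node_timeout, "s")
for program, timeout in heuristics:
    print("heuristic", repr(program), "fails after", timeout, "s")
```

Under these assumptions the quorum disk timeout is much shorter than the timeout of the first heuristic, which may be worth comparing against the cman timeouts.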
I urgently need help. Does anyone have an idea what causes this problem?
Thanks a lot.
Regards,
Alain Moullé