[Linux-cluster] RHEL5.3 / cman-2.0.98-1.el5 / Problem loop on "Node x is undead"

Wed Feb 25 15:35:44 UTC 2009

Alain.Moulle wrote:

>> > Hi,
>> > 
>> > I am again facing the problem of a node being evicted and then reported
>> > as "undead", and I really don't know what to do. The syslog traces are below.
>> > My version is: RHEL5.3 / cman-2.0.98-1.el5
>> > 
>> > Feb 25 14:33:33 s_sys at xn3 qdiskd[27582]: <notice> Writing eviction
>> > notice for node 2
>> > Feb 25 14:33:34 s_sys at xn3 qdiskd[27582]: <notice> Node 2 evicted
>> > Feb 25 14:33:35 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > ... etc.
>> > Feb 25 14:33:45 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > Feb 25 14:33:45 s_sys at xn3 qdiskd[27582]: <alert> Writing eviction notice
>> > for node 2
>> > Feb 25 14:33:46 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > Feb 25 14:33:46 s_sys at xn3 qdiskd[27582]: <alert> Writing eviction notice
>> > for node 2
>> > Feb 25 14:33:47 s_kernel at xn3 kernel: dlm: closing connection to node 2
>> > Feb 25 14:33:47 s_sys at xn3 fenced[27785]: xn4 not a cluster member after
>> > 0 sec post_fail_delay
>> > Feb 25 14:33:47 s_sys at xn3 fenced[27785]: fencing node "xn4"
>> > Feb 25 14:33:47 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > ...etc.
>> > Feb 25 14:33:52 s_sys at xn3 qdiskd[27582]: <alert> Writing eviction notice
>> > for node 2
>> > Feb 25 14:33:52 s_sys at xn3 fenced[27785]: fence "xn4" success
>> > Feb 25 14:33:53 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > Feb 25 14:33:53 s_sys at xn3 qdiskd[27582]: <alert> Writing eviction notice
>> > for node 2
>> > Feb 25 14:33:54 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > Feb 25 14:33:54 s_sys at xn3 qdiskd[27582]: <alert> Writing eviction notice
>> > for node 2
>> > Feb 25 14:33:54 s_sys at xn3 clurgmgrd[27990]: <notice> Taking over service
>> > service:lustre_xn4 from down member xn4
>> > Feb 25 14:33:55 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
>> > .. etc.
>> > 
>> > And then, after xn4 reboots, when we try to start the CS on xn4 it
>> > cannot join the cluster, and we must stop the CS on both nodes and
>> > start it again on both sides.
>> > 
>> > Where could this problem come from? How can I avoid this node
>> > eviction?
>> > 
>> > Any help would be much appreciated.
>>     
>
> You haven't posted any cman/openais messages but it's quite possible
> you've hit this bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=485026
>
> There's a patch included and some links to fixed RPMs.
>
>
> Chrissie

Thanks Chrissie, but I have checked this bugzilla and, unless I am
misunderstanding it, it is about a second node being started too late
relative to the first node, so that the second node can no longer join
the cluster. There are no "Node is undead" messages in the syslog in that
case (I have checked the syslog attached to the bugzilla).

My problem occurs after a poweroff -f on one node of an HA pair with a
quorum disk, while both nodes are up and running their services. In this
case, powering off the second node makes the first one loop on "Node 2
evicted" and "Node 2 is undead" in syslog. This starts immediately after
the poweroff, not when the second node tries to start the CS again.
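
To make the timing concrete, here is a minimal sketch (not part of cman or qdiskd) that counts the "is undead" messages and checks how many were logged after fenced reported success. The function name `summarize`, the regular expression, and the `SAMPLE` excerpt (a few lines copied from the traces above, with wrapped lines left out) are illustrative assumptions. It also assumes all lines come from the same day, so timestamps can be compared as strings.

```python
import re

# A few lines copied from the syslog traces above (assumption: same day,
# one log entry per line, wrapped entries omitted).
SAMPLE = """\
Feb 25 14:33:34 s_sys at xn3 qdiskd[27582]: <notice> Node 2 evicted
Feb 25 14:33:35 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
Feb 25 14:33:47 s_sys at xn3 fenced[27785]: fencing node "xn4"
Feb 25 14:33:52 s_sys at xn3 fenced[27785]: fence "xn4" success
Feb 25 14:33:53 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
Feb 25 14:33:54 s_sys at xn3 qdiskd[27582]: <crit> Node 2 is undead.
"""

# Hypothetical pattern: "Mon DD HH:MM:SS <source> daemon[pid]: message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s+.*?\s(\w+)\[\d+\]:\s+(.*)$"
)


def summarize(text):
    """Count qdiskd 'undead' messages before and after the fence success."""
    undead_times = []
    fence_success_at = None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a daemon[pid] tag, e.g. kernel lines
        timestamp, daemon, message = match.groups()
        if daemon == "qdiskd" and "is undead" in message:
            undead_times.append(timestamp)
        elif daemon == "fenced" and message.endswith("success"):
            fence_success_at = timestamp
    # HH:MM:SS strings sort chronologically within a single day.
    after_fence = [
        t for t in undead_times
        if fence_success_at is not None and t > fence_success_at
    ]
    return {
        "undead_total": len(undead_times),
        "fence_success_at": fence_success_at,
        "undead_after_fence": len(after_fence),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On this excerpt the summary reports three "undead" messages, two of them after the fence success at 14:33:52, which matches what the traces show: the loop continues even after xn4 has been fenced.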

Regards,
Alain