[Linux-cluster] CS5 / about loop "Node is undead"

Mon Jun 9 20:25:40 UTC 2008

On Wed, 2008-06-04 at 14:47 +0200, Alain Moulle wrote:
> Hi
> 
> About my problem of a node entering a loop:
> Jun  3 15:54:49 s_sys at xn2 qdiskd[22256]: <notice> Writing eviction notice for node 1
> Jun  3 15:54:50 s_sys at xn2 qdiskd[22256]: <notice> Node 1 evicted
> Jun  3 15:54:51 s_sys at xn2 qdiskd[22256]: <crit> Node 1 is undead.
> 
> I notice that just before entering this loop, I get the messages:
> Jun  3 15:54:47 s_sys at xn2 fenced[22327]: fencing node "xn1"
> Jun  3 15:54:48 s_sys at xn2 qdiskd[22256]: <info> Assuming master role
> 
> but never the message:
> Jun  3 15:54:47 s_sys at xn2 fenced[22327]: fence "xn1" success
> 
> Nevertheless, the service of xn1 is correctly failed over to xn2, but
> after xn1 reboots, we can't start CS5 again because of the
> infernal "Node is undead" loop on xn2.
> 
> whereas when it works correctly, both messages:
> fencing node "xn1"
> fence "xn1" success
> appear in succession (about 30s apart)
> 
> So my question is: could this infernal "Node is undead" loop
> always be caused by a failed fencing of xn1 by xn2?
> 
> PS: note that I have applied the patch:
> http://sources.redhat.com/git/?p=cluster.git;a=commit;h=b2686ffe984c517110b949d604c54a71800b67c9

Yes.  If qdiskd thinks the node is dead and the node then starts writing
to the quorum disk again (which is what fencing should prevent), qdiskd
will display those messages.
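
To check whether a fence operation was ever confirmed, one option is to
scan the syslog for a "fencing node" line that has no matching "fence ...
success" line. The following is only a minimal sketch: the two patterns
are copied from the fenced messages quoted above and may differ in other
cluster versions, and the function name unconfirmed_fences is hypothetical.

```python
import re

# Patterns based on the fenced log lines quoted in this thread.
# Assumption: other cluster versions may word these messages differently.
FENCE_START = re.compile(r'fenced\[\d+\]: fencing node "([^"]+)"')
FENCE_OK = re.compile(r'fenced\[\d+\]: fence "([^"]+)" success')


def unconfirmed_fences(lines):
    """Return nodes for which fencing started but no success was logged."""
    pending = []
    for line in lines:
        started = FENCE_START.search(line)
        if started:
            node = started.group(1)
            if node not in pending:
                pending.append(node)
            continue
        succeeded = FENCE_OK.search(line)
        if succeeded and succeeded.group(1) in pending:
            pending.remove(succeeded.group(1))
    return pending
```

The function accepts any iterable of log lines, for example an open
syslog file (its location depends on the syslog configuration). A
non-empty result for a node that qdiskd then reports as undead would be
consistent with the failed-fencing explanation.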

-- Lon