[Linux-cluster] cluster latest cvs does not fence dead nodes automatically

Fajar A. Nugraha fajar at telkom.co.id
Tue Feb 15 06:52:21 UTC 2005


Hi,

I'm building a two-node cluster using today's CVS checkout from
sources.redhat.com:/cvs/cluster.
Shared storage is on an FC shared disk.
Everything works as expected, up to using GFS.

When I simulated a node crash (I ran ifconfig eth0 down on node 2),
node 1 simply logged this in syslog:

Feb 15 13:33:35 hosting-cl02-01 CMAN: removing node hosting-cl02-02 from the cluster : Missed too many heartbeats

However, NO fencing occurred, not even a "fence failed" message. I use
fence_ibmblade.
After that, access to the GFS device blocks (df -k still works, though),
and /proc/cluster/nodes shows

Node  Votes Exp Sts  Name
   1    1    1   M   node-01
   2    1    1   X   node-02

There's an "X" for node 2, but /proc/cluster/service shows

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2]

DLM Lock Space:  "data"                              3   4 run       -
[1 2]

DLM Lock Space:  "config"                            5   6 run       -
[1 2]

GFS Mount Group: "data"                              4   5 run       -
[1 2]

GFS Mount Group: "config"                            6   7 run       -
[1 2]

which is the same content as before node 2 died.
AFAIK, the state should be "recover" or "waiting to recover" instead of
"run", and node 2 should no longer be listed as a member.

If I reboot node 2 (which has the same effect as executing
fence_ibmblade manually) and restart the cluster services on that node,
everything goes back to normal, and these messages show up in syslog:

Feb 15 13:38:40 node-01 CMAN: node node-02 rejoining
Feb 15 13:38:40 node-01 fenced[25486]: node-02 not a cluster member after 0 sec post_fail_delay
Feb 15 13:38:42 node-01 GFS: fsid=node:config.0: jid=1: Trying to acquire journal lock...
Feb 15 13:38:42 node-01 GFS: fsid=node:data.0: jid=1: Trying to acquire journal lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Looking at journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Looking at journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Acquiring the transaction lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replaying journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replayed 0 of 0 blocks
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: replays = 0, skips = 0, sames = 0
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Acquiring the transaction lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replaying journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replayed 0 of 0 blocks
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: replays = 0, skips = 0, sames = 0
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Journal replayed in 1s
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Done
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Journal replayed in 1s
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Done
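
To spot this situation in a longer syslog, one could look for a CMAN
"removing node" line that is never followed by any fenced message naming
the same node. The sketch below is deliberately crude and rests on two
assumptions: each log entry is on a single line, and any fenced[pid]
line whose first word is the node name counts as fenced having reacted
(the exact wording of a successful fence message is not shown in this
post, so it is not matched here). The name unfenced_removals is
hypothetical.

```python
import re

# Patterns taken from the two message shapes quoted in this post.
REMOVED = re.compile(r"CMAN: removing node (\S+) from")
FENCED = re.compile(r"fenced\[\d+\]: (\S+)")

def unfenced_removals(log_lines):
    """Return the nodes CMAN removed for which no later fenced
    message mentions the node, in order of removal."""
    pending = []
    for line in log_lines:
        removed = REMOVED.search(line)
        if removed:
            node = removed.group(1)
            if node not in pending:
                pending.append(node)
            continue
        fenced = FENCED.search(line)
        if fenced and fenced.group(1) in pending:
            # fenced said something about this node after its removal.
            pending.remove(fenced.group(1))
    return pending
```

A non-empty result means fenced stayed silent after a removal, as in the
13:33:35 entry above.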

Any idea what's wrong?

Regards,

Fajar






