[Linux-cluster] Recovering from "telling LM to withdraw"
Jeff Sturm
jeff.sturm at eprize.com
Wed Jul 1 13:50:36 UTC 2009
Recently we had a cluster node fail with an assertion failure:
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal:
assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)"
failed
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function =
gfs_trans_add_gl
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file =
/builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, line
= 237
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time =
1246022619
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to
withdraw from the cluster
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM
to withdraw
This is with CentOS 5.2, GFS1. The cluster had been operating
continuously for about 3 months.
My challenge isn't preventing assertion failures entirely; I recognize
that lurking software bugs and hardware anomalies can lead to a failed
node. Rather, I want to prevent one failed node from freezing the
cluster. When the message above was logged, every node in the cluster
that accesses the tb2data filesystem also froze and did not recover. We
recovered with a rolling cluster restart and a precautionary gfs_fsck.
Most cluster problems are handled quickly by the fence agents, but the
"telling LM to withdraw" event does not trigger a fence operation or
any other automated recovery. I need a deployment strategy to fix that.
Should I write an agent to scan the syslog, match on the message above,
and fence the node?
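As a starting point, here is a minimal sketch of the matcher such an agent
would need. Everything beyond the log message itself is an assumption: the
idea would be to feed it lines from `tail -F /var/log/messages` and call a
fence agent (e.g. fence_node, shown only as a commented-out hypothetical)
when the withdraw message appears. A real agent would also need rate
limiting, log-rotation handling, and a guard against fencing the node it
runs on.

```python
#!/usr/bin/env python3
"""Sketch: detect the GFS "telling LM to withdraw" syslog message."""
import re

# Match the GFS withdraw message and capture the filesystem id.
# The fsid in the post is "dc1rhc:tb2data.1.8" (cluster:fs.journal).
WITHDRAW_RE = re.compile(r"GFS: fsid=(?P<fsid>\S+): telling LM to withdraw")

def check_line(line):
    """Return the fsid if the line is a GFS withdraw message, else None."""
    m = WITHDRAW_RE.search(line)
    return m.group("fsid") if m else None

def fence(node):
    # Hypothetical action: invoke the cluster's fence agent for the node,
    # e.g. subprocess.run(["fence_node", node], check=True).
    # Left unimplemented in this sketch.
    raise NotImplementedError

if __name__ == "__main__":
    sample = ("Jun 26 09:23:39 mqc02 kernel: GFS: "
              "fsid=dc1rhc:tb2data.1.8: telling LM to withdraw")
    print(check_line(sample))
```

In practice the loop would read stdin line by line (piped from tail -F)
rather than a hard-coded sample, and the fence call would need care so
that a transient log replay cannot re-fence a healthy node.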
Has anyone else encountered the same problem? If so, how did you get
around it?
-Jeff