[Linux-cluster] Recovering from "telling LM to withdraw"

Abhijith Das adas at redhat.com
Wed Jul 1 16:43:26 UTC 2009


Jeff Sturm wrote:
>
> Recently we had a cluster node fail with a failed assertion:
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal:
> assertion "gfs_glock_is_locked_by_me(gl) &&
> gfs_glock_is_held_excl(gl)" failed
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function =
> gfs_trans_add_gl
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file =
> /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c,
> line = 237
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time =
> 1246022619
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to
> withdraw from the cluster
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM
> to withdraw
>
> This is with CentOS 5.2, GFS1. The cluster had been operating
> continuously for about 3 months.
>
> My challenge isn't in preventing assertion failures entirely—I
> recognize lurking software bugs and hardware anomalies can lead to a
> failed node. Rather, I want to prevent one node from freezing the
> cluster. When the above was logged, all nodes in the cluster which
> access the tb2data filesystem also froze and did not recover. We
> recovered with a rolling cluster restart and a precautionary gfs_fsck.
>
> Most cluster problems can be quickly handled by the fence agents. The
> "telling LM to withdraw" does not trigger a fence operation, or any
> other automated recovery. I need a deployment strategy to fix that.
> Should I write an agent to scan the syslog, match on the message
> above, and fence the node?
>
> Has anyone else encountered the same problem? If so, how did you get
> around it?
>
> -Jeff
>
https://bugzilla.redhat.com/show_bug.cgi?id=471258

The assert and withdraw you're seeing look like the bug linked above. I've
tried to reproduce it on my cluster without success. If you have a recipe
to reproduce it, could you please post it to the bugzilla?

Meanwhile, I'll look at the code again to see if I can spot anything.
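
On the syslog-scanning idea: here is a minimal sketch of just the matching
step, assuming each message arrives as a single unwrapped syslog line in the
same layout as the log quoted above. The names WITHDRAW_RE and find_withdraws
are made up for illustration, and what to do with a match (fencing or
otherwise) is deliberately left out.

```python
import re

# Matches the final GFS withdraw message from the quoted log.
# Assumption: classic syslog layout "Mon DD HH:MM:SS host kernel: ...".
WITHDRAW_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+"
    r"GFS:\s+fsid=(?P<fsid>\S+):\s+telling LM\s+to withdraw"
)


def find_withdraws(lines):
    """Yield (host, fsid) for each log line that reports a GFS withdraw."""
    for line in lines:
        match = WITHDRAW_RE.search(line)
        if match:
            yield match.group("host"), match.group("fsid")
```

Feeding it the lines of /var/log/messages (or wherever kernel messages land
on your nodes) would give the host and fsid of each withdraw; the earlier
"about to withdraw from the cluster" line is intentionally not matched, so
each event is reported once.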

Thanks!
--Abhi



