<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
 /* Font Definitions */
 @font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Verdana","sans-serif";
        color:windowtext;
        font-weight:normal;
        font-style:normal;}
.MsoChpDefault
        {mso-style-type:export-only;}
@page Section1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
        {page:Section1;}
-->
</style>
<!--[if gte mso 9]><xml>
 <o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
 <o:shapelayout v:ext="edit">
  <o:idmap v:ext="edit" data="1" />
 </o:shapelayout></xml><![endif]-->
</head>

<body lang=EN-US link=blue vlink=purple>

<div class=Section1>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Recently
we had a cluster node fail with a failed assertion:<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'><o:p> </o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Jun
26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal: assertion
"gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)"
failed<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Jun
26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8:   function =
gfs_trans_add_gl<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Jun
26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8:   file =
/builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, line = 237<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Jun
26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8:   time =
1246022619<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Jun
26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to withdraw from
the cluster<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Jun
26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM to withdraw<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'><o:p> </o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>This
is with CentOS 5.2, GFS1.  The cluster had been operating continuously for
about 3 months.<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'><o:p> </o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>My
challenge isn't in preventing assertion failures entirely—I recognize
lurking software bugs and hardware anomalies can lead to a failed node. 
Rather, I want to prevent one node from freezing the cluster.  When the
above was logged, all nodes in the cluster which access the tb2data filesystem
also froze and did not recover.  We recovered with a rolling cluster
restart and a precautionary gfs_fsck.<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'><o:p> </o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Most
cluster problems can be quickly handled by the fence agents.  The "telling
LM to withdraw" does not trigger a fence operation, or any other automated
recovery.  I need a deployment strategy to fix that.  Should I write
an agent to scan the syslog, match on the message above, and fence the node?<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'><o:p> </o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>Has
anyone else encountered the same problem?  If so, how did you get around
it?<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'><o:p> </o:p></span></p>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif"'>-Jeff<br>
<br>
<o:p></o:p></span></p>

</div>

</body>

</html>