[Linux-cluster] GFS2 interesting death with error

Allen Belletti allen at isye.gatech.edu
Thu Nov 5 19:36:27 UTC 2009


Saw an interesting and different GFS2 death this morning that I wanted 
to pass along in case anyone has insights.  We have not seen any of the 
"hanging in dlm_posix_lock" since fsck'ing early Sunday morning.  In any 
case I'm pretty confident that's being triggered by the creation & 
deletion of ".lock" files within Dovecot.  This was something completely 
different and it left some potentially useful debug info in the logs.
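
For context, the kind of lock churn I mean is sketched below.  This is only 
an illustration, not Dovecot's actual code, and the file names are made up: 
a dotlock file created with O_CREAT|O_EXCL and unlinked again on every 
delivery, plus a POSIX fcntl() lock on the mailbox.  My understanding is that 
it's the fcntl() side that gets serviced through dlm_posix_lock on GFS2.

/*
 * Illustrative sketch only -- not Dovecot's code.  Shows the dotlock
 * create/unlink cycle plus an fcntl() POSIX lock of the sort that goes
 * through the cluster plock path (dlm_posix_lock) on GFS2.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* 1. Dotlock: atomically create "INBOX.lock" (made-up name). */
    int lockfd = open("INBOX.lock", O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (lockfd < 0) {
        perror("dotlock create");
        return 1;
    }

    /* 2. POSIX (fcntl) lock on the mailbox itself -- the call that is
     *    handled by dlm_posix_lock when the file lives on GFS2. */
    int mbox = open("INBOX", O_RDWR | O_CREAT, 0644);
    if (mbox >= 0) {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,          /* 0 = lock the whole file */
        };
        if (fcntl(mbox, F_SETLKW, &fl) == 0) {
            /* ... append the new message here ... */
            fl.l_type = F_UNLCK;
            fcntl(mbox, F_SETLK, &fl);
        }
        close(mbox);
    }

    /* 3. Drop the dotlock again.  This whole cycle repeats for every
     *    delivery, which adds up to a lot of lock traffic on a busy
     *    mail server. */
    close(lockfd);
    unlink("INBOX.lock");
    return 0;
}
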

Things were running fine when the machine "post2" abruptly died.  The 
following was found to have been inscribed upon its stone logs:

Nov  5 10:56:28 post2 kernel: original: gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov  5 10:56:28 post2 kernel: pid : 27197
Nov  5 10:56:28 post2 kernel: lock type: 2 req lock state : 3
Nov  5 10:56:28 post2 kernel: new: gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov  5 10:56:28 post2 kernel: pid: 27197
Nov  5 10:56:28 post2 kernel: lock type: 2 req lock state : 3
Nov  5 10:56:28 post2 kernel:  G:  s:SH n:2/2053b f:s t:SH d:EX/0 l:0 a:0 r:4
Nov  5 10:56:28 post2 kernel:   H: s:SH f:H e:0 p:27197 [procmail] gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov  5 10:56:28 post2 kernel:   I: n:23/132411 t:8 f:0x00000010
Nov  5 10:56:28 post2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Nov  5 10:56:32 post2 kernel: Kernel BUG at ...ir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/glock.c:950

The fact that it died in procmail indicates that the failure occurred 
while writing mail to someone's Inbox.  The system wasn't heavily loaded 
-- the load averages were a little below 1.0 at the time of the crash.

Also interesting is what happened next.  The load average on post1 (the 
only other node) shot up to over 100 as numerous processes became blocked.  
It spent several minutes with an administrative process using 100% of a 
CPU -- I believe it was dlm_recoverd, though I'm not 100% certain.  Then, 
just as the load average had come back down to 15-20 and functionality 
was returning, it abruptly hung.  At that point I reset both cluster 
nodes and all was well.

Anyway, if you've seen anything like this or have a clue as to the 
cause, I'd love to hear it.  It looks like more lock-related glitchiness in 
our relatively lock-intensive environment.

Thanks,
Allen

-- 
Allen Belletti
allen at isye.gatech.edu                             404-894-6221 Phone
Industrial and Systems Engineering                404-385-2988 Fax
Georgia Institute of Technology
