[Linux-cluster] GFS2 processes getting stuck in WCHAN=dlm_posix_lock
allen at isye.gatech.edu
Mon Nov 2 19:27:24 UTC 2009
On 11/02/2009 06:42 AM, Steven Whitehouse wrote:
> On Fri, 2009-10-30 at 19:27 -0400, Allen Belletti wrote:
>> Hi All,
>> As I've mentioned before, I'm running a two-node clustered mail server
>> on GFS2 (with RHEL 5.4). Nearly all of the time, everything works
>> great. However, going all the way back to GFS1 on RHEL 5.1 (I think it
>> was), I've had occasional locking problems that force a reboot of one or
>> both cluster nodes. Lately I've paid closer attention since it's been
>> happening more often.
>> I'll notice the problem when the load average starts rising. It's
>> always tied to "stuck" processes, and I believe always tied to IMAP
>> clients (I'm running Dovecot). It seems like a file belonging to user
>> "x" (in this case, "jforrest") will become locked in some way, such that
>> every IMAP process tied to that user will get stuck on the same thing.
>> Over time, as the user keeps trying to read that file, more and more
>> processes accumulate. They're always in state "D" (uninterruptible
>> sleep), and always on "dlm_posix_lock" according to WCHAN. The only way
>> I'm able to get out of this state is to reboot. If I let it persist for
>> too long, I/O generally stops entirely.
>> This certainly seems like it ought to have a definite solution, but I've
>> no idea what it is. I've tried a variety of things using "find" to
>> pinpoint a particular file, but everything belonging to the affected
>> user seems just fine. At least, I can read and copy all of the files,
>> and do a stat via ls -l.
>> Is it possible that this is a bug, not within GFS at all, but within
>> Dovecot IMAP?
>> Any thoughts would be appreciated. It's been getting worse lately and
>> thus no fun at all.
> Do you know if dovecot IMAP uses signals at all? That would be the first
> thing that I'd look at. The other thing to check is whether it makes use
> of F_GETLK and in particular the l_pid field. strace should be able to
> answer both of those questions (except the l_pid field of course, but
> the chances are that if it calls F_GETLK and then sends a signal, it's
> also using the l_pid field).
I've checked via both strace and grepping the source, and found no
evidence of either F_GETLK or the l_pid field being referenced. Signals
don't appear to play a significant role either; I've managed to snag an
"strace -f -p" of a "healthy" imap session (i.e., dlm_posix_lock briefly
appearing in WCHAN but going away as expected) and I see no signals being
used.
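
For anyone who wants to repeat the strace check, here is a minimal sketch of
how a saved trace could be tallied for lock calls and signal activity. It is
not the exact procedure used above. The function name summarize_strace and
the key prefixes "signal:" and "send:" are made up for illustration, and the
regular expressions assume the usual strace line layout (optionally prefixed
with "[pid N]"), which can differ between strace versions.

```python
import re
from collections import Counter

# fcntl lock commands as strace prints them, e.g.
#   [pid 4242] fcntl(7, F_SETLKW, {...}) = 0
# Assumption: the command name is the second argument, printed symbolically.
LOCK_CALL = re.compile(
    r"\bfcntl(?:64)?\([^,]+, (F_(?:GETLK|SETLKW|SETLK)(?:64)?)\b"
)
# Signal deliveries are printed as "--- SIGNAME ... ---".
SIGNAL_DELIVERY = re.compile(r"--- (SIG[A-Z0-9]+)\b")
# Syscalls that send a signal to another process or thread.
SIGNAL_SEND = re.compile(r"\b(kill|tkill|tgkill)\(")


def summarize_strace(lines):
    """Count lock commands, delivered signals, and signal-sending calls."""
    counts = Counter()
    for line in lines:
        lock = LOCK_CALL.search(line)
        if lock:
            counts[lock.group(1)] += 1
        delivered = SIGNAL_DELIVERY.search(line)
        if delivered:
            counts["signal:" + delivered.group(1)] += 1
        sent = SIGNAL_SEND.search(line)
        if sent:
            counts["send:" + sent.group(1)] += 1
    return counts
```

Given a trace saved with "strace -f -p <pid> -o trace.log", calling
summarize_strace(open("trace.log")) would show at a glance whether F_GETLK
or any signals appear. The l_pid field itself cannot be seen this way, only
the calls around it.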
By the way, I took advantage of a quiet period early Sunday morning and
ran fsck.gfs2 (version 3.0.4) on the two GFS2 filesystems. Both had a
variety of errors, although no evidence of major corruption. Since that
completed, I've seen no additional "stuck" locks, but the sample period is
far too short to tell. Sometimes things work for weeks without issue.
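
To keep an eye out for a recurrence over a longer period, something along
these lines could flag processes stuck in state "D" on dlm_posix_lock,
grouped by user. This is only a sketch. It assumes ps output with the
columns pid, stat, wchan, user, comm in that order (for example from
"ps -eo pid,stat,wchan:32,user,comm"), and the function name stuck_on_wchan
is hypothetical.

```python
from collections import defaultdict


def stuck_on_wchan(ps_output, wchan="dlm_posix_lock"):
    """Group PIDs in uninterruptible sleep on `wchan` by user.

    Assumes each line holds: PID STAT WCHAN USER COMMAND.
    """
    stuck = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        # Skip the header line and anything that does not parse.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, stat, chan, user, _comm = fields
        # STAT starts with "D" for uninterruptible sleep.
        if stat.startswith("D") and chan == wchan:
            stuck[user].append(int(pid))
    return dict(stuck)
```

The text could come from subprocess.run(["ps", "-eo",
"pid,stat,wchan:32,user,comm"], capture_output=True, text=True).stdout, run
from cron every few minutes. A user whose list of PIDs keeps growing between
runs would match the pattern described earlier, whereas a PID that shows up
once and is gone on the next run is probably just a healthy session caught
mid-lock.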
Thanks for your suggestions!
allen at isye.gatech.edu 404-894-6221 Phone
Industrial and Systems Engineering 404-385-2988 Fax
Georgia Institute of Technology