[Linux-cluster] GFS2 processes getting stuck in WCHAN=dlm_posix_lock
Steven Whitehouse
swhiteho at redhat.com
Mon Nov 2 11:42:49 UTC 2009
Hi,
On Fri, 2009-10-30 at 19:27 -0400, Allen Belletti wrote:
> Hi All,
>
> As I've mentioned before, I'm running a two-node clustered mail server
> on GFS2 (with RHEL 5.4). Nearly all of the time, everything works
> great. However, going all the way back to GFS1 on RHEL 5.1 (I think it
> was), I've had occasional locking problems that force a reboot of one or
> both cluster nodes. Lately I've paid closer attention since it's been
> happening more often.
>
> I'll notice the problem when the load average starts rising. It's
> always tied to "stuck" processes, and I believe always tied to IMAP
> clients (I'm running Dovecot.) It seems like a file belonging to user
> "x" (in this case, "jforrest") will become locked in some way, such that
> every IMAP process tied to that user will get stuck on the same thing.
> Over time, as the user keeps trying to read that file, more & more
> processes accumulate. They're always in state "D" (uninterruptible
> sleep), and always on "dlm_posix_lock" according to WCHAN. The only way
> I'm able to get out of this state is to reboot. If I let it persist for
> too long, I/O generally stops entirely.
>
> This certainly seems like it ought to have a definite solution, but I've
> no idea what it is. I've tried a variety of things using "find" to
> pinpoint a particular file, but everything belonging to the affected
> user seems just fine. At least, I can read and copy all of the files,
> and do a stat via ls -l.
>
> Is it possible that this is a bug, not within GFS at all, but within
> Dovecot IMAP?
>
> Any thoughts would be appreciated. It's been getting worse lately and
> thus no fun at all.
>
> Cheers,
> Allen
>
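For reference, the check described above (processes in state "D" whose
WCHAN is dlm_posix_lock) can be scripted. The following is only a
minimal sketch in Python, assuming a Linux /proc that exposes
/proc/<pid>/stat and /proc/<pid>/wchan; the function names parse_state
and stuck_pids are made up for this example.

```python
import os


def parse_state(stat_text):
    """Extract the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is parenthesised and may itself contain spaces or
    # parentheses, so take whatever follows the last ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def stuck_pids(wchan="dlm_posix_lock", proc_root="/proc"):
    """Return the PIDs in state D whose wait channel equals wchan."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                state = parse_state(f.read())
            with open(os.path.join(base, "wchan")) as f:
                channel = f.read().strip()
        except (OSError, IndexError):
            continue  # process exited meanwhile, or file unreadable
        if state == "D" and channel == wchan:
            found.append(int(entry))
    return sorted(found)


if __name__ == "__main__":
    for pid in stuck_pids():
        print(pid)
```

Run periodically, this would show whether the set of stuck PIDs keeps
growing. Depending on kernel configuration and privileges, the wchan
file may not contain a symbol name, in which case nothing is reported.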
Do you know if Dovecot IMAP uses signals at all? That would be the first
thing that I'd look at. The other thing to check is whether it makes use
of F_GETLK, and in particular the l_pid field. strace should be able to
answer both of those questions (except for the l_pid field, of course, but
the chances are that if it calls F_GETLK and then sends a signal, it's also
using the l_pid field).
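To illustrate the pattern in question, here is a minimal Python sketch
of an F_GETLK query that reads l_pid. It is not taken from Dovecot. The
names lock_holder_pid and demo are made up for this example, and the
"hhqqi" layout is an assumption that matches struct flock on 64-bit
Linux only.

```python
import fcntl
import os
import struct
import tempfile

# Assumed layout of struct flock on 64-bit Linux:
# short l_type, short l_whence, off_t l_start, off_t l_len, pid_t l_pid
FLOCK_FMT = "hhqqi"


def lock_holder_pid(fd):
    """Return the PID that would block a whole-file write lock, or None."""
    probe = struct.pack(FLOCK_FMT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
    reply = fcntl.fcntl(fd, fcntl.F_GETLK, probe)
    l_type, _whence, _start, _length, l_pid = struct.unpack(FLOCK_FMT, reply)
    # The kernel sets l_type to F_UNLCK when nothing conflicts.
    return None if l_type == fcntl.F_UNLCK else l_pid


def demo():
    """Fork a child that holds a lock, then ask the kernel who holds it."""
    with tempfile.NamedTemporaryFile() as tmp:
        ready_r, ready_w = os.pipe()
        done_r, done_w = os.pipe()
        child = os.fork()
        if child == 0:
            # Child: take an exclusive POSIX lock, signal readiness, then
            # keep the lock until the parent has finished its query.
            os.close(ready_r)
            os.close(done_w)
            fcntl.lockf(tmp.fileno(), fcntl.LOCK_EX)
            os.write(ready_w, b"x")
            os.read(done_r, 1)
            os._exit(0)
        os.close(ready_w)
        os.close(done_r)
        os.read(ready_r, 1)       # wait until the child holds the lock
        holder = lock_holder_pid(tmp.fileno())
        os.write(done_w, b"x")    # let the child exit, dropping the lock
        os.waitpid(child, 0)
        os.close(ready_r)
        os.close(done_w)
        return child, holder


if __name__ == "__main__":
    child, holder = demo()
    print("child pid:", child, "reported l_pid:", holder)
```

On a local filesystem the reported l_pid is the child's PID. A program
that then passes that value to kill() is relying on l_pid naming a
local process, which may not hold when the lock holder is on another
cluster node. Tracing with something like strace -f -e trace=fcntl,kill
should show whether both calls occur.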
Steve.