[Linux-cluster] GFS 2 node hang in rm test
Daniel Phillips
phillips at redhat.com
Thu Dec 9 21:52:42 UTC 2004
On Tuesday 07 December 2004 04:38, Patrick Caulfield wrote:
> On Mon, Dec 06, 2004 at 04:13:50PM -0800, Daniel McNeil wrote:
> > On Mon, 2004-12-06 at 11:45, Ken Preslan wrote:
> > > On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> >
> > Looking at the stack trace above and disassembling dlm.ko,
> > it looks like dlm_lock+0x319 is the call to dlm_lock_stage1().
> > Looking at dlm_lock_stage1(), it looks like it is sleeping on
> > down_write(&rsb->res_lock).
> >
> > So now I have to find who is holding the res_lock.
>
> That's consistent with the hang you reported before - in fact it's
> almost certainly the same thing. My guess is that there is a deadlock
> on res_lock somewhere. In which case I suspect it's going to be
> easier to find that one by reading code rather than running tests.
> res_lock should never be held for any extended period of time, but in
> your last set of tracebacks there was nothing obviously holding it -
> so I suspect something is sleeping with it.
Hi Patrick,
Last week I had a bug in the cluster snapshot failover code that exposed
what I think is a bug in dlm or libdlm. My code inadvertently acquired a
lock twice, first in PW mode, then later in CR mode (because it wasn't
checking whether it already held the PW lock). This caused
dlm_unlock_wait to wait forever. Are these locks supposed to be
recursive or not? In any event, waiting forever has got to be a bug.
It might have something to do with an lkid tangle, since I never provided
separate lkids for the unlock.
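For reference, the check my code was missing could look roughly like the
sketch below. It is only an illustration of the caller-side guard, not
the actual failover code and not libdlm: the names held_lock,
acquire_once, release_once and fake_lock_request are made up, the mode
values are illustrative, and fake_lock_request merely stands in for
whatever blocking lock call the caller really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two modes discussed here; the numeric values are illustrative. */
enum lock_mode { MODE_NONE = 0, MODE_CR = 1, MODE_PW = 4 };

enum acquire_result { ACQ_NEW, ACQ_ALREADY_HELD, ACQ_NEEDS_CONVERT };

/* Hypothetical bookkeeping kept by the caller; not a libdlm structure. */
struct held_lock {
    enum lock_mode mode;  /* MODE_NONE while the lock is not held */
    uint32_t lkid;        /* id handed back by the lock manager */
    unsigned requests;    /* number of requests actually sent */
};

/* Stand-in for a blocking lock request (assumption: a real caller would
 * go through libdlm here). It just hands out increasing ids. */
static uint32_t fake_lock_request(enum lock_mode mode)
{
    static uint32_t next_lkid = 1;
    (void)mode;
    return next_lkid++;
}

/* Take the lock only if it is not already held by this caller. */
enum acquire_result acquire_once(struct held_lock *h, enum lock_mode want)
{
    assert(h != NULL && want != MODE_NONE);

    if (h->mode == MODE_NONE) {
        h->lkid = fake_lock_request(want);
        h->mode = want;
        h->requests++;
        return ACQ_NEW;
    }
    /* Already held: never send a second, independent request. */
    if (want <= h->mode)
        return ACQ_ALREADY_HELD;
    /* A stronger mode is wanted; the caller has to convert explicitly. */
    return ACQ_NEEDS_CONVERT;
}

/* Forget the lock and return the single lkid the caller should unlock,
 * or 0 if nothing was held. */
uint32_t release_once(struct held_lock *h)
{
    assert(h != NULL);
    uint32_t lkid = (h->mode == MODE_NONE) ? 0 : h->lkid;
    h->mode = MODE_NONE;
    h->lkid = 0;
    return lkid;
}
```

With a guard like this, the CR request that follows the PW request is
answered locally and there is exactly one lkid to unlock. It only avoids
the trigger on my side; it says nothing about why the unlock hangs.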
This should be easily reproducible.
Regards,
Daniel