[Linux-cluster] GFS 2 node hang in rm test

Daniel McNeil daniel at osdl.org
Tue Dec 7 16:53:14 UTC 2004


On Tue, 2004-12-07 at 01:38, Patrick Caulfield wrote:
> On Mon, Dec 06, 2004 at 04:13:50PM -0800, Daniel McNeil wrote:
> > On Mon, 2004-12-06 at 11:45, Ken Preslan wrote:
> > > On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> > 
> > 
> > Looking at the stack trace above and disassembling dlm.ko,
> > it looks like dlm_lock+0x319 is the call to dlm_lock_stage1().
> > Looking at dlm_lock_stage1(), it looks like it is sleeping on
> > 	 down_write(&rsb->res_lock)
> > 
> > So now I have to find who is holding the res_lock.
> 
> That's consistent with the hang you reported before - in fact it's almost
> certainly the same thing. My guess is that there is a deadlock on res_lock
> somewhere. In which case I suspect it's going to be easier to find that one by
> reading code rather than running tests. res_lock should never be held for any
> extended period of time, but in your last set of tracebacks there was nothing
> obviously holding it - so I suspect something is sleeping with it.
> 
> 

I looked through the stack traces and did not see any other
processes that might be holding the lock.  There were only
3 other processes with stack traces in the dlm module and
they do not look like they are holding it.  That is
confusing.  I can think of 3 possibilities:

	1. some code path forgets to up the semaphore
	2. a process spinning in the kernel is holding it
	3. the structure containing the res_lock was freed

All of these seem unlikely to me.  I reviewed the code
last evening; the up's and down's are close together
and nothing looked obviously wrong.

I'll think about adding more debug output.

I ran it again last night and it ran 27 loops until 7am this
morning before hanging.  I'm still collecting info from this
hang.  At least it is reproducible.

Daniel




