[Linux-cluster] umount hang and assert failure

Tue Nov 23 16:43:04 UTC 2004

On Tue, 2004-11-23 at 03:14, Patrick Caulfield wrote:
> On Tue, Nov 23, 2004 at 11:50:23AM +0800, David Teigland wrote:
> > 
> > On Mon, Nov 22, 2004 at 12:44:07PM -0800, Daniel McNeil wrote:
> > 
> > > The full stack traces are available here:
> > > http://developer.osdl.org/daniel/gfs_umount_hang/
> > 
> > Thanks, it's evident that the dlm became "stuck" on the node that's not
> > doing the umount.  All the hung processes are blocked on the dlm's
> > "in_recovery" lock. 
> 
> There also seems to be a GFS process with a failed "down_write" in dlm_unlock
> which might be a clue. It's not the in_recovery lock because that's only held
> for read during normal locking operations so it must be either the res_lock or
> the ls_unlock_sem. odd as those are normally only held for very short time
> periods.

More info.  I rebooted cl031, the node that was not doing the umount
but was hung doing the cat of /proc/cluster/services.  The 1st node
saw cl031 go away, but the umount was still hung.  I was expecting
the recovery from the death of that node to clean up any locking
problem.

I rebooted the 2nd node and started the tests over again last night.

This morning one node (cl030) got this:

cur_state = 2, new_state = 2
Kernel panic - not syncing: GFS: Assertion failed on line 69 of file
/Views/redhat-cluster/cluster/gfs-kernel/src/gfs/bits.c
GFS: assertion: "valid_change[new_state * 4 + cur_state]"
GFS: time = 1101174691
GFS: fsid=gfs_cluster:stripefs.0: RG = 65530
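
For anyone reading the assertion above: the expression indexes a flat
table by new_state * 4 + cur_state, so with cur_state = 2 and
new_state = 2 it is looking at entry 10, and that entry evaluated to
false.  Below is a minimal sketch of that kind of lookup.  The table
contents, NUM_STATES, change_index() and change_is_valid() are my own
hypothetical stand-ins, not copied from bits.c; I have only assumed
that a change to the state a block is already in is rejected, which
is what the log suggests.

```c
#include <stdbool.h>

#define NUM_STATES 4   /* assumption: four block states, hence the "* 4" */

/*
 * HYPOTHETICAL transition table, NOT the real one from bits.c.
 * Rows are the requested new state, columns are the current state.
 * The only assumption made here is that "no change" is invalid.
 */
static const bool valid_change[NUM_STATES * NUM_STATES] = {
    /* cur:       0      1      2      3  */
    /* new 0 */ false, true,  true,  true,
    /* new 1 */ true,  false, true,  true,
    /* new 2 */ true,  true,  false, true,
    /* new 3 */ true,  true,  true,  false,
};

/* Flatten (new_state, cur_state) the same way the assertion does. */
static unsigned change_index(unsigned new_state, unsigned cur_state)
{
    return new_state * NUM_STATES + cur_state;
}

/* Return true if the table allows cur_state -> new_state. */
static bool change_is_valid(unsigned new_state, unsigned cur_state)
{
    if (new_state >= NUM_STATES || cur_state >= NUM_STATES)
        return false;   /* out-of-range states are never valid */
    return valid_change[change_index(new_state, cur_state)];
}
```

With this sketch, change_index(2, 2) is 10 and change_is_valid(2, 2)
is false, which matches the shape of the failure in the log.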

I'll upgrade to the latest CVS and start the tests over.
Is there anything I can do to get more info when
this kind of thing happens?

Thanks,

Daniel