jamesc at exa.com
Tue Sep 2 14:15:25 UTC 2008
On Tue, 2 Sep 2008, David Teigland wrote:
> On Mon, Sep 01, 2008 at 07:55:48PM -0400, James Chamberlain wrote:
>> Hi all,
>> Since I sent the message below, the aforementioned cluster crashed. Now I
>> can't mount the scratch112 filesystem. Attempts to do so crash the
>> node trying to mount it. If I run gfs_fsck against it, I see the following:
>> # gfs_fsck -nv /dev/s12/scratch112
>> Initializing fsck
>> Initializing lists...
>> Initializing special inodes...
>> Validating Resource Group index.
>> Level 1 check.
>> 5834 resource groups found.
>> Setting block ranges...
>> Can't seek to last block in file system: 4969529913
>> Unable to determine the boundaries of the file system.
>> Freeing buffers.
>> Not being able to determine the boundaries of the file system seems
>> like a very bad thing. However, LVM didn't complain in the slightest
>> when I expanded the logical volume. How can I recover from this?
> Looks like the killed gfs_grow left your fs in a bad condition.
> I believe Bob Peterson has addressed that recently.
I think it was in a bad condition before I hit ^C rather than because I
did so. As I mentioned, I was getting the lm_dlm_cancel messages before I
hit ^C. But I'd agree that, one way or another, the gfs_grow operation
left the fs in a bad state.
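
For what it's worth, a quick sanity check here (a rough sketch, assuming
the filesystem uses the default 4 KiB GFS block size) is to compare the
byte offset gfs_fsck tried to seek to against the actual size of the
logical volume:

# blockdev --getsize64 /dev/s12/scratch112
# echo $((4969529913 * 4096))

If the second number exceeds the first, the resource group index claims
blocks past the end of the device, which would explain why gfs_fsck
can't establish the filesystem boundaries.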
>>> I'm trying to grow a GFS filesystem. I've grown this filesystem
>>> before and everything went fine. However, when I issued gfs_grow
>>> this time, I saw the following messages in my logs:
>>> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
>>> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17
>>> flags 100
>>> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
>>> Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock:
>>> 10241 busy 2
>>> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17
>>> flags 40080
>>> The last three lines of these log entries repeat themselves once a
>>> second until I hit ^C. The filesystem still appears to be up and
>>> accessible. Any thoughts on what's going on here and what I can do
>>> about it?
> Should be fixed by
Thanks Dave. Any idea if there's a corresponding patch for RHEL 4?