[Linux-cluster] no version for "gfs2_unmount_lockproto"

Ferenc Wagner wferi at niif.hu
Fri Feb 15 02:56:31 UTC 2008


Hi Bob,

You can skip to the middle; I'm leaving my ramblings in since I've
typed them already...

Bob Peterson <rpeterso at redhat.com> writes:

> On Wed, 2008-02-13 at 17:22 +0100, Ferenc Wagner wrote:
>> *Here* comes something possibly interesting, after fence_tool join:
>> 
>> fenced[4543]: fencing deferred to prior member
>> 
>> Though it doesn't look like node3 (which has the filesystem mounted)
>> would want to fence node1 (which has this message in its syslog).  Is
>> there a command available to find out the current fencing status or
>> history?
>
> Not as far as I know.  You should probably look in the /var/log/messages
> on all the nodes to see which node decided it needed to fence the other
> and why.  Perhaps it's not letting you mount gfs because of a pending
> fence operation.  You could do cman_tool services to see the status
> of the cluster from all nodes.

On node3, which has the GFS mounted:

# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 3]
dlm              1     clvmd    00030001 none        
[1 3]
dlm              1     test     00050003 none        
[3]
gfs              2     test     00040003 none        
[3]

On node1, which can't mount the filesystem, the first half is the
same, but the second half (the dlm "test" and gfs "test" entries) is
missing.

There are failed fence attempts in the logs of node3 from a couple of
days ago.  Since then, node1 has been rebooted a couple of times; I
also stopped the cluster infrastructure on node3 (back to killing
ccsd).  Then I brought up the two nodes simultaneously, starting
fenced with the -c option; activation of the clustered VG propagated
to the other node as expected.  After all this, node1 still can't
mount the filesystem.  Node3 can, but that seemingly doesn't
influence node1...
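
For the record, the bring-up sequence I used on both nodes was
roughly the following (from memory, so the exact invocations may be
slightly off):

# ccsd
# cman_tool join
# fenced -c
# fence_tool join
# clvmd
# vgchange -ay gfs

where "gfs" is the clustered VG holding the test LV.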

>> Well, it helps on the node which has the filesystem mounted.  Of
>> course not on the other.  Is gfs_tool supposed to work on mounted
>> filesystems only?  Probably so.
>
> Some of the gfs_tool commands like "gfs_tool sb" need a device
> and others like "gfs_tool lockdump" need a mount point.  The man
> page says which requires which.

Exactly.  Too bad I didn't think to check earlier.  Sorry about that.
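
For the archives, on node3 (where the filesystem is mounted) the two
flavours would look something like this, the first taking the device
and the second the mount point, just as the man page says:

# gfs_tool sb /dev/gfs/test all
# gfs_tool lockdump /mnt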

> Another thing you can try is mounting it with the mount helper
> manually, with verbose mode, by doing something like this:
>
> /sbin/mount.gfs -v -o users -t gfs /dev/your/device /your/mount/point
>
> And see what information it gives you.  The straight mount command
> won't put the helper into verbose mode, so invoking it directly may
> give you more information.

Not only more info, but... see:

# mount -t gfs /dev/gfs/test /mnt
mount: wrong fs type, bad option, bad superblock on /dev/gfs/test,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# /sbin/mount.gfs -v -o users -t gfs /dev/gfs/test /mnt
bash: /sbin/mount.gfs: No such file or directory
# /usr/sbin/mount.gfs -v -o users -t gfs /dev/gfs/test /mnt
/usr/sbin/mount.gfs: mount /dev/mapper/gfs-test /mnt
/usr/sbin/mount.gfs: parse_opts: opts = "users"
/usr/sbin/mount.gfs:   set flag 0 for "users", flags = 0
/usr/sbin/mount.gfs: parse_opts: flags = 0
/usr/sbin/mount.gfs: parse_opts: extra = ""
/usr/sbin/mount.gfs: parse_opts: hostdata = ""
/usr/sbin/mount.gfs: parse_opts: lockproto = ""
/usr/sbin/mount.gfs: parse_opts: locktable = ""
/usr/sbin/mount.gfs: message to gfs_controld: asking to join mountgroup:
/usr/sbin/mount.gfs: write "join /mnt gfs lock_dlm pilot:test users /dev/mapper/gfs-test"
/usr/sbin/mount.gfs: message from gfs_controld: response to join request:
/usr/sbin/mount.gfs: lock_dlm_join: read "0"
/usr/sbin/mount.gfs: message from gfs_controld: mount options:
/usr/sbin/mount.gfs: lock_dlm_join: read "hostdata=jid=1:id=65539:first=0"
/usr/sbin/mount.gfs: lock_dlm_join: hostdata: "hostdata=jid=1:id=65539:first=0"
/usr/sbin/mount.gfs: lock_dlm_join: extra_plus: "hostdata=jid=1:id=65539:first=0"
/usr/sbin/mount.gfs: mount(2) ok
/usr/sbin/mount.gfs: lock_dlm_mount_result: write "mount_result /mnt gfs 0"
/usr/sbin/mount.gfs: read_proc_mounts: device = "/dev/mapper/gfs-test"
/usr/sbin/mount.gfs: read_proc_mounts: opts = "rw,hostdata=jid=1:id=65539:first=0"

Which to me means that the mount tool couldn't find the helper, as I
put it into /usr/sbin instead of /sbin (probably some unfortunate
configure option or installation choice -- it's in /sbin on node3,
which runs the cluster suite compiled against the previous kernel).
I will investigate this after getting some good sleep.
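
Until then I guess a symlink would do as a stopgap, if mount really
only looks under /sbin for its helpers (I haven't verified that):

# ln -s /usr/sbin/mount.gfs /sbin/mount.gfs
# ln -s /usr/sbin/umount.gfs /sbin/umount.gfs

with the proper fix presumably being the right sbindir setting at
configure time.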

Since the overly general error message comes from mount itself, not
from the mount.gfs helper, you probably can't do much to improve it.
But explicitly invoking the helper in verbose mode looks like a very
powerful troubleshooting trick!

> Perhaps you should also try "group_tool dump gfs" to see if there are
> error messages from the gfs control daemon pertaining to gfs and
> why the mount failed.

That's another interesting source of information.  Still, it wasn't
enough to get me out of this trap: after the above successful mount,
I unmounted the GFS with the stock umount, which again couldn't find
the helper but did the job nevertheless -- except that it didn't
leave the mount group, so now I can't mount the filesystem again...
The helper says:

message to gfs_controld: asking to join mountgroup:
write "join /mnt gfs lock_dlm pilot:test users /dev/mapper/gfs-test"
mount point already used or other mount in progress
error mounting lockproto lock_dlm

and umount.gfs can't help as it doesn't find the mount in
/proc/mounts...  Is it possible to fix this or do I have to reboot?
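
For what it's worth, I suspect the stale mountgroup would still show
up in something like

# group_tool ls

(assuming group_tool has a listing subcommand next to the dump one
you mentioned), and that its presence is what makes gfs_controld
reject the new join.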

>> Thanks for the clarification.  And what does that deferred fencing
>> mean?
>
> That means some node decided it was necessary to fence another node
> and it is waiting for that fence to complete.  If there's a pending
> fence of a node that's not completing, check to make sure your
> fence device is configured and working properly.

How could I find out about a pending fence?  Only by periodic messages
in the syslog?
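
(My naive guess would be something like

# group_tool dump fence

by analogy with "group_tool dump gfs", or watching the state column
of cman_tool services, but I don't know whether that's the intended
way.)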
-- 
Thanks,
Feri.



