[Linux-cluster] gfs2 mount hangs

Thu Aug 13 06:22:11 UTC 2009

Hi,

I'm hitting a problem where mounting a gfs2 volume hangs.

ENV:
1) cluster.conf
[wwg@cool gfs2]$ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="testgfs2" config_version="1">

<clusternodes>
<clusternode name="cool" nodeid="1">
        <fence>
                <method name="1">
                        <device name="manual" nodename="cool"/>
                </method>
        </fence>
</clusternode>

<clusternode name="desk" nodeid="2">
         <fence>
                 <method name="1">
                         <device name="manual" nodename="desk"/>
                 </method>
         </fence>
</clusternode>

</clusternodes>

<fencedevices>
         <fencedevice name="manual" agent="fence_manual"/>
</fencedevices>

<cman two_node="1" expected_votes="2"/>

</cluster>

2) I formatted the volume /dev/sdb with
mkfs.gfs2 -t testgfs2:1 -j 8 /dev/sdb
# -j 2 would be enough; testing against a volume with 2 journals gives the
# same result.

3) kernel version
[wwg@cool gfs2]$ uname -r
2.6.31-rc5
I also tested with the mainline git 2.6.30.4

4) gfs2 module
Tested with the one included in 2.6.31-rc5, the one in 2.6.30.4, and the one in
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw.git
They all give the same result.
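
For anyone who wants to double-check the cluster.conf listed under 1), here is a minimal Python sketch that parses the file and summarizes the node names, node IDs and the attributes of the <cman> element. The function name summarize_cluster_conf and the SAMPLE string are made up for illustration; only the standard library xml.etree.ElementTree module is used, and the sample simply repeats the configuration above.

```python
import xml.etree.ElementTree as ET

# Same content as the cluster.conf above, without the XML declaration.
SAMPLE = """<cluster name="testgfs2" config_version="1">
<clusternodes>
<clusternode name="cool" nodeid="1">
  <fence><method name="1"><device name="manual" nodename="cool"/></method></fence>
</clusternode>
<clusternode name="desk" nodeid="2">
  <fence><method name="1"><device name="manual" nodename="desk"/></method></fence>
</clusternode>
</clusternodes>
<fencedevices>
  <fencedevice name="manual" agent="fence_manual"/>
</fencedevices>
<cman two_node="1" expected_votes="2"/>
</cluster>
"""


def summarize_cluster_conf(xml_text):
    """Return the cluster name, (name, nodeid) pairs and <cman> attributes."""
    root = ET.fromstring(xml_text)
    nodes = [
        (node.get("name"), int(node.get("nodeid")))
        for node in root.findall("./clusternodes/clusternode")
    ]
    cman = root.find("cman")
    return {
        "cluster": root.get("name"),
        "nodes": nodes,
        "cman": dict(cman.attrib) if cman is not None else {},
    }


summary = summarize_cluster_conf(SAMPLE)
print(summary["cluster"], summary["nodes"], summary["cman"])
```

The cluster name reported here should match the part before the colon in the -t argument given to mkfs.gfs2 (testgfs2 in this setup).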

The hang happens when the other node has already mounted the volume. If
the other node unmounts the volume, the mount on this node succeeds.
In dmesg, the successful mount produces log lines like:

GFS2: fsid=testgfs2:1.0: Joined cluster. Now mounting FS...
GFS2: fsid=testgfs2:1.0: jid=0, already locked for use
GFS2: fsid=testgfs2:1.0: jid=0: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=0: Done
GFS2: fsid=testgfs2:1.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=1: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=1: Done
GFS2: fsid=testgfs2:1.0: jid=2: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=2: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=2: Done
GFS2: fsid=testgfs2:1.0: jid=3: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=3: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=3: Done
GFS2: fsid=testgfs2:1.0: jid=4: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=4: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=4: Done
GFS2: fsid=testgfs2:1.0: jid=5: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=5: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=5: Done
GFS2: fsid=testgfs2:1.0: jid=6: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=6: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=6: Done
GFS2: fsid=testgfs2:1.0: jid=7: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=7: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=7: Done

and the hanging mount stops after:

GFS2: fsid=testgfs2:1.0: Joined cluster. Now mounting FS...

so I guess it's waiting for the other node to release locks on journals...
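
To compare the two cases mechanically, a small Python sketch like the following could list which journal IDs reached "Done" for each fsid in a dmesg capture. This is only an illustration: the function name journal_progress is hypothetical, and the regular expressions assume the exact message wording shown in the logs above.

```python
import re

# Patterns assume the GFS2 message format quoted in this post.
JOINED = re.compile(r"GFS2: fsid=(?P<fsid>\S+): Joined cluster")
JID_DONE = re.compile(r"GFS2: fsid=(?P<fsid>\S+): jid=(?P<jid>\d+): Done")


def journal_progress(dmesg_text):
    """Map each fsid to the sorted journal IDs whose check finished."""
    progress = {}
    for line in dmesg_text.splitlines():
        joined = JOINED.search(line)
        if joined:
            # Record the fsid even if no journal was ever processed.
            progress.setdefault(joined.group("fsid"), [])
            continue
        done = JID_DONE.search(line)
        if done:
            progress.setdefault(done.group("fsid"), []).append(int(done.group("jid")))
    return {fsid: sorted(jids) for fsid, jids in progress.items()}


GOOD = """GFS2: fsid=testgfs2:1.0: Joined cluster. Now mounting FS...
GFS2: fsid=testgfs2:1.0: jid=0, already locked for use
GFS2: fsid=testgfs2:1.0: jid=0: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=0: Done
GFS2: fsid=testgfs2:1.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=testgfs2:1.0: jid=1: Looking at journal...
GFS2: fsid=testgfs2:1.0: jid=1: Done
"""
HUNG = "GFS2: fsid=testgfs2:1.0: Joined cluster. Now mounting FS...\n"

print(journal_progress(GOOD))
print(journal_progress(HUNG))
```

An empty list for an fsid means the mount joined the cluster but never got as far as finishing any journal, which is what the hanging case looks like.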

sysrq-t shows the mount stack:

mount.gfs2    D 000000e2     0  2889   2888 0x00000080
  ed4adc6c 00000082 a24475bb 000000e2 ed4e4f74 c09a5260 c09a9b60 ed4e4f74
  ed4adc3c c09a9b60 c09a9b60 f00019c0 dac8a400 ed4adc58 ed4adc44 00000001
  a2443568 000000e2 ed4e4ce0 ed4adc5c ed4adc5c 00000246 c1f98cd4 c1f98cd8
Call Trace:
  [<f8c5a6e6>] gfs2_glock_holder_wait+0xd/0x11 [gfs2]
  [<c0742897>] __wait_on_bit+0x39/0x60
  [<f8c5a6d9>] ? gfs2_glock_holder_wait+0x0/0x11 [gfs2]
  [<f8c5a6d9>] ? gfs2_glock_holder_wait+0x0/0x11 [gfs2]
  [<c074295e>] out_of_line_wait_on_bit+0xa0/0xa8
  [<c044779d>] ? wake_bit_function+0x0/0x3c
  [<f8c5d487>] wait_on_bit.clone.1+0x1c/0x28 [gfs2]
  [<f8c5d4fe>] gfs2_glock_wait+0x31/0x37 [gfs2]
  [<f8c5d774>] gfs2_glock_nq+0x270/0x278 [gfs2]
  [<f8c5d8a0>] gfs2_glock_nq_num+0x4c/0x6c [gfs2]
  [<f8c66567>] init_journal+0x2b2/0x675 [gfs2]
  [<c04c76bb>] ? wake_up_inode+0x1c/0x1e
  [<c04c76fa>] ? unlock_new_inode+0x3d/0x40
  [<c04c6fe1>] ? d_alloc+0x23/0x15e
  [<c04dd3a0>] ? inotify_d_instantiate+0x17/0x3a
  [<f8c65e5f>] ? gfs2_glock_nq_init+0x13/0x31 [gfs2]
  [<f8c6694f>] init_inodes+0x25/0x152 [gfs2]
  [<f8c6750d>] fill_super+0xa91/0xc10 [gfs2]
  [<f8c5d899>] ? gfs2_glock_nq_num+0x45/0x6c [gfs2]
  [<c04bade0>] get_sb_bdev+0xdc/0x119
  [<c04b4aea>] ? pcpu_alloc+0x352/0x38b
  [<f8c65ad8>] gfs2_get_sb+0x18/0x1a [gfs2]
  [<f8c66a7c>] ? fill_super+0x0/0xc10 [gfs2]
  [<c04baacc>] vfs_kern_mount+0x82/0xf0
  [<c04bab89>] do_kern_mount+0x38/0xc3
  [<c04cc03f>] do_mount+0x68c/0x6e4
  [<c0492dc1>] ? __get_free_pages+0x24/0x26
  [<c04cc0fd>] sys_mount+0x66/0x98
  [<c0402a28>] sysenter_do_call+0x12/0x27
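
When reading a trace like this one, the entries marked with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain, so it can help to filter them out. The sketch below is a minimal Python illustration of that filtering; the function name reliable_frames is hypothetical, and the regular expression assumes the x86 trace format shown above.

```python
import re

# Matches lines such as:
#   [<f8c5d774>] gfs2_glock_nq+0x270/0x278 [gfs2]
#   [<c04c76bb>] ? wake_up_inode+0x1c/0x1e
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match and not match.group("unreliable"):
            frames.append((match.group("sym"), match.group("mod")))
    return frames


TRACE = """  [<f8c5a6e6>] gfs2_glock_holder_wait+0xd/0x11 [gfs2]
  [<c0742897>] __wait_on_bit+0x39/0x60
  [<f8c5a6d9>] ? gfs2_glock_holder_wait+0x0/0x11 [gfs2]
  [<f8c5d487>] wait_on_bit.clone.1+0x1c/0x28 [gfs2]
  [<f8c5d774>] gfs2_glock_nq+0x270/0x278 [gfs2]
  [<f8c66567>] init_journal+0x2b2/0x675 [gfs2]
"""

for sym, mod in reliable_frames(TRACE):
    print(sym, mod or "(kernel)")
```

Applied to the full trace, the remaining gfs2 frames run from fill_super through init_inodes and init_journal down to gfs2_glock_nq and gfs2_glock_wait, i.e. the mount is blocked waiting for a glock requested from init_journal.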

Could anybody please help me out?

regards,
wengang.

-- 
--just begin to learn, you are never too late...