[Linux-cluster] GFS2 mount hangs for some disks

B.Baransel BAĞCI bagcib at itu.edu.tr
Tue Jan 5 18:37:10 UTC 2016


Hi list,

I have a problem with GFS2 after a node failure. After one of the
cluster nodes is fenced and rebooted, it cannot mount some of the GFS2
file systems: the mount operation hangs with no output. I've waited
nearly 10 minutes for a single disk to mount, but it never responded.
The only solution is to shut down all nodes and do a clean start of the
cluster. I suspect the journal size or the file system quotas.
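
For context, the file systems are mounted via /etc/fstab entries along
these lines, which "service gfs2 start" picks up (device paths and
mount points below are made-up examples, not the real names):

     /dev/VG_of_TYPE_A/lv_typeA1  /mnt/typeA1  gfs2  async,quota=off,nodiratime,noatime  0 0
     /dev/VG_of_TYPE_B/lv_typeB1  /mnt/typeB1  gfs2  async,quota=on,nodiratime,noatime   0 0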

I have an 8-node RHEL 6 cluster with GFS2-formatted disks, all of which
are mounted by all nodes.
There are two types of disk (a rough mkfs sketch follows the lists):
     Type A:
         ~50 GB disk capacity
         8 journals, 512 MB each
         block size: 1024
         very small files (avg. 50 bytes; symlinks)
         ~500,000 files (inodes)
         Usage: 10%
         Nearly no write IO (under 1,000 files per day)
         No user quota (quota=off)
         Mount options: async,quota=off,nodiratime,noatime

     Type B:
         ~1 TB disk capacity
         8 journals, 512 MB each
         block size: 4096
         relatively small files (avg. 20 KB)
         ~5,000,000 files (inodes)
         Usage: 20%
         Write IO of ~50,000 files per day
         User quota is on (some users have exceeded their quota)
         Mount options: async,quota=on,nodiratime,noatime
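
For reference, the file systems were created roughly like this (the
volume group and LV names are placeholders, not the real ones):

     # Type A: 1 KB blocks, 8 journals of 512 MB each
     mkfs.gfs2 -p lock_dlm -t TESTCLS:typeA1 -b 1024 -j 8 -J 512 /dev/VG_of_TYPE_A/lv_typeA1

     # Type B: 4 KB blocks, 8 journals of 512 MB each
     mkfs.gfs2 -p lock_dlm -t TESTCLS:typeB1 -b 4096 -j 8 -J 512 /dev/VG_of_TYPE_B/lv_typeB1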

To improve performance, I set the journal size to 512 MB instead of the
128 MB default. All disks are connected over fiber to SAN storage, and
all of them sit on clustered LVM. All nodes are connected to each other
through a private gigabit switch.
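
To confirm the journal count and size, I can check a mounted instance
with gfs2_tool (the mount point below is just an example):

     gfs2_tool journals /mnt/typeB1
     # expected output is one line per journal, e.g.:
     #   journal0 - 512MB
     #   ...
     #   journal7 - 512MB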

For example, after "node5" failed and was fenced, it can re-enter the
cluster. When I run "service gfs2 start", it mounts the "Type A" disks
but hangs on the first "Type B" disk. The log stops at the "Trying to
join cluster lock_dlm" message:

     ...
     Jan 05 00:01:52 node5 lvm[4090]: Found volume group "VG_of_TYPE_A"
     Jan 05 00:01:52 node5 lvm[4119]: Activated 2 logical volumes in volume group VG_of_TYPE_A
     Jan 05 00:01:52 node5 lvm[4119]: 2 logical volume(s) in volume group "VG_of_TYPE_A" now active
     Jan 05 00:01:52 node5 lvm[4119]: Wiping internal VG cache
     Jan 05 00:02:26 node5 kernel: Slow work thread pool: Starting up
     Jan 05 00:02:26 node5 kernel: Slow work thread pool: Ready
     Jan 05 00:02:26 node5 kernel: GFS2 (built Dec 12 2014 16:06:57) installed
     Jan 05 00:02:26 node5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "TESTCLS:typeA1"
     Jan 05 00:02:26 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: Joined cluster. Now mounting FS...
     Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5, already locked for use
     Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5: Looking at journal...
     Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5: Done
     Jan 05 00:02:27 node5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "TESTCLS:typeA2"
     Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: Joined cluster. Now mounting FS...
     Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5, already locked for use
     Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5: Looking at journal...
     Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5: Done
     Jan 05 00:02:28 node5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "TESTCLS:typeB1"


I've waited nearly 10 minutes in this state with no response and no new
log output. While the mount is stuck, I cannot even run `ls` on that
file system from the other nodes. Any idea what causes this? How does
the journal size or count affect the cluster?
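
If it would help, I can also grab blocked-task backtraces on the stuck
node while the mount hangs, e.g.:

     echo w > /proc/sysrq-trigger    # dump backtraces of D-state (blocked) tasks to the kernel log
     dmesg | tail -n 200             # mount.gfs2 and dlm/gfs2 kernel threads should show up here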
-- 
B.Baransel BAĞCI



