[Linux-cluster] GFS2 mount hangs for some disks
emmanuel segura
emi2fast at gmail.com
Tue Jan 5 18:51:21 UTC 2016
1: Share your config; nobody has a crystal ball!
2: You need to be sure that your fencing is actually working.
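A minimal sketch of what "share your config and verify fencing" means in practice on the RHEL 6 cman/DLM stack the poster describes (tool and path names are assumed from that era; verify on your distribution):

```shell
# Hedged admin fragment, not from the original post.
cat /etc/cluster/cluster.conf   # the config worth sharing with the list
cman_tool status                # quorum, expected votes, node count
cman_tool nodes                 # membership as cman sees it
fence_tool ls                   # fence domain state; a node stuck in a
                                # "wait state" here blocks DLM and GFS2
                                # journal recovery cluster-wide
# fence_node node5              # actually fences (reboots) the node --
                                # only run this in a maintenance window
```

If fencing silently fails, the DLM never considers the dead node's locks recoverable, and every subsequent GFS2 mount or journal replay will hang exactly as described below.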
2016-01-05 19:37 GMT+01:00 B.Baransel BAĞCI <bagcib at itu.edu.tr>:
> Hi list,
>
> I have some problems with GFS2 and failed nodes. After one of the cluster
> nodes is fenced and rebooted, it cannot mount some of the GFS2 file systems;
> it just hangs on the mount operation with no output. I've waited nearly 10
> minutes for a single disk to mount, but it didn't respond. The only solution
> is to shut down all nodes and do a clean start of the cluster. I suspect the
> journal size or the file system quotas.
>
> I have an 8-node RHEL 6 cluster with GFS2-formatted disks, all of which are
> mounted by all nodes.
> There are two types of disk:
> Type A:
> ~50 GB disk capacity
> 8 journals, 512 MB each
> block size: 1024
> very small files (avg. 50 bytes, symlinks)
> ~500,000 files (inodes)
> usage: 10%
> nearly no write IO (under 1,000 files per day)
> no user quota (quota=off)
> mount options: async,quota=off,nodiratime,noatime
>
> Type B:
> ~1 TB disk capacity
> 8 journals, 512 MB each
> block size: 4096
> relatively small files (avg. 20 KB)
> ~5,000,000 files (inodes)
> usage: 20%
> write IO of ~50,000 files per day
> user quota is on (some users have exceeded their quota)
> mount options: async,quota=on,nodiratime,noatime
>
> To improve performance, I set the journal size to 512 MB instead of the
> 128 MB default. All disks are connected over fibre to SAN storage, and all
> are on clustered LVM. The nodes are connected to each other through a
> private Gb switch.
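Worth noting how much space and recovery work those larger journals imply. A bit of arithmetic (not from the post's tooling, just the numbers given above):

```shell
# 8 journals (one per node) at 512 MB each, as described in the post.
journals=8
jsize_mb=512
total_mb=$((journals * jsize_mb))
echo "${total_mb}"   # 4096 MB of on-disk journal space per filesystem
```

That is 4 GB per filesystem, roughly 8% of a ~50 GB Type A volume, and after a fence the recovering mount may have to replay a 512 MB journal, four times the work of the 128 MB default.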
>
> For example, after "node5" fails and is fenced, it can re-enter the cluster.
> When I run "service gfs2 start", it mounts the "Type A" disks but hangs on
> the first "Type B" disk. The log stops at the "Trying to join cluster
> lock_dlm" message:
>
> ...
> Jan 05 00:01:52 node5 lvm[4090]: Found volume group "VG_of_TYPE_A"
> Jan 05 00:01:52 node5 lvm[4119]: Activated 2 logical volumes in volume
> group VG_of_TYPE_A
> Jan 05 00:01:52 node5 lvm[4119]: 2 logical volume(s) in volume group
> "VG_of_TYPE_A" now active
> Jan 05 00:01:52 node5 lvm[4119]: Wiping internal VG cache
> Jan 05 00:02:26 node5 kernel: Slow work thread pool: Starting up
> Jan 05 00:02:26 node5 kernel: Slow work thread pool: Ready
> Jan 05 00:02:26 node5 kernel: GFS2 (built Dec 12 2014 16:06:57)
> installed
> Jan 05 00:02:26 node5 kernel: GFS2: fsid=: Trying to join cluster
> "lock_dlm", "TESTCLS:typeA1"
> Jan 05 00:02:26 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: Joined
> cluster. Now mounting FS...
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5,
> already locked for use
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5:
> Looking at journal...
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5: Done
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=: Trying to join cluster
> "lock_dlm", "TESTCLS:typeA2"
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: Joined
> cluster. Now mounting FS...
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5,
> already locked for use
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5:
> Looking at journal...
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5: Done
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=: Trying to join cluster
> "lock_dlm", "TESTCLS:typeB1"
>
>
> I've waited nearly 10 minutes in this state with no response and no further
> log output. While it is stuck, I cannot even `ls` this file system from the
> other nodes. Any idea what is causing the problem? How does the journal
> size or journal count affect the cluster?
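While the mount is hung like this, the DLM and group state usually tell you which lockspace is stuck in recovery. A hedged sketch of where to look (RHEL 6 cluster tools; exact subcommands and output formats may differ on your version):

```shell
# Hedged diagnostic fragment, not from the original post.
dlm_tool ls                 # lockspaces; look for the typeB1 lockspace
                            # sitting in a recovery/change state
group_tool ls               # fence, dlm and gfs groups; a group stuck in
                            # a transition points at incomplete fencing
gfs2_tool journals /mnt/typeB1   # journal count/size, run on a node
                                 # that still has the fs mounted
```

If the typeB1 lockspace never leaves recovery, the new mount cannot acquire its journal locks, which matches the hang right after "Trying to join cluster".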
> --
> B.Baransel BAĞCI
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
.~.
/V\
// \\
/( )\
^`~'^