[Linux-cluster] gfs2 resource not mounting

Neale Ferguson neale at sinenomine.net
Fri Oct 3 19:32:34 UTC 2014

Using the same two-node configuration I described in an earlier post to this forum, I'm having trouble getting a gfs2 resource started on one of the nodes. The resource in question:

 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime 
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)
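For reference, a definition equivalent to the one above can be created with pcs roughly as follows; the `--clone interleave=true` part is an assumption based on the clone and meta attribute shown later in this post:

```shell
# Sketch only: recreates the clusterfs resource shown above.
# --clone interleave=true is assumed from the cib.xml fragment below.
pcs resource create clusterfs ocf:heartbeat:Filesystem \
    device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo \
    fstype=gfs2 options=noatime \
    op monitor interval=10s on-fail=fence \
    --clone interleave=true
```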

pcs status shows:

Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn1.devlab.sinenomine.net ]
     Stopped: [ rh7cn2.devlab.sinenomine.net ]

Failed actions:
    clusterfs_start_0 on rh7cn2.devlab.sinenomine.net 'unknown error' (1): call=46, status=complete, last-rc-change='Fri Oct  3 14:41:26 2014', queued=4702ms, exec=0ms

Using pcs resource debug-start I see:

Operation start for clusterfs:0 (ocf:heartbeat:Filesystem) returned 1
 >  stderr: INFO: Running start for /dev/vg_cluster/ha_lv on /mnt/gfs2-demo
 >  stderr: mount: permission denied
 >  stderr: ERROR: Couldn't mount filesystem /dev/vg_cluster/ha_lv on /mnt/gfs2-demo
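For what it's worth, "mount: permission denied" on a GFS2 mount usually means the dlm lockspace join was refused rather than an actual filesystem permission problem, which would fit the cpg_dispatch error below (corosync error 9 is CS_ERR_BAD_HANDLE, i.e. dlm_controld's connection to corosync went bad). A sketch of checks I would run on the failing node; all names are taken from the post:

```shell
# List dlm lockspaces; vol1 should appear on both nodes once mounted anywhere
dlm_tool ls
# dlm_controld's view of membership and fencing state
dlm_tool status
# Corosync ring status on this node
corosync-cfgtool -s
# Confirm both nodes appear in the corosync membership map
corosync-cmapctl | grep members
# Context around the cpg_dispatch error
journalctl -u dlm
```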

The log on the node shows:

Oct  3 14:57:37 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
Oct  3 14:57:38 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
Oct  3 14:57:38 rh7cn2 dlm_controld[5857]: 1564 cpg_dispatch error 9

On the other node:

Oct  3 15:09:47 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 14 done
Oct  3 15:09:48 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 15 done

I'm assuming I didn't define the gfs2 resource such that it could be used concurrently by both nodes. Here's the cib.xml definition for it:

      <clone id="clusterfs-clone">
        <primitive class="ocf" id="clusterfs" provider="heartbeat" type="Filesystem">
          <instance_attributes id="clusterfs-instance_attributes">
            <nvpair id="clusterfs-instance_attributes-device" name="device" value="/dev/vg_cluster/ha_lv"/>
            <nvpair id="clusterfs-instance_attributes-directory" name="directory" value="/mnt/gfs2-demo"/>
            <nvpair id="clusterfs-instance_attributes-fstype" name="fstype" value="gfs2"/>
            <nvpair id="clusterfs-instance_attributes-options" name="options" value="noatime"/>
          </instance_attributes>
          <operations>
            <op id="clusterfs-start-timeout-60" interval="0s" name="start" timeout="60"/>
            <op id="clusterfs-stop-timeout-60" interval="0s" name="stop" timeout="60"/>
            <op id="clusterfs-monitor-interval-10s" interval="10s" name="monitor" on-fail="fence"/>
          </operations>
        </primitive>
        <meta_attributes id="clusterfs-clone-meta">
          <nvpair id="clusterfs-interleave" name="interleave" value="true"/>
        </meta_attributes>
      </clone>
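The interleave=true meta attribute looks right, but the fragment doesn't show any constraints tying clusterfs to clvmd and dlm. If those are missing, the usual chain for a GFS2 filesystem under Pacemaker looks like this (resource names taken from the pcs status output above; whether these constraints already exist in my CIB is the open question):

```shell
# dlm must start before clvmd, and clvmd before the filesystem,
# on each node (hence the clones and interleave=true)
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs constraint order start clvmd-clone then clusterfs-clone
pcs constraint colocation add clusterfs-clone with clvmd-clone
```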


Unrelated (I believe) to the above, I also note the following messages in /var/log/messages, which appear to be related to pacemaker and http (another resource I have defined):

Oct  3 15:05:06 rh7cn2 systemd: pacemaker.service: Got notification message from PID 6036, but reception only permitted for PID 5575

I'm running systemd-208-11.el7_0.2. A Bugzilla search matches one report, but that fix was already included in -11.
