[Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing

nrbwpi at gmail.com nrbwpi at gmail.com
Wed Jun 6 23:27:42 UTC 2007


Hello,

Installed RHEL5 on a new two-node cluster with shared FC storage.  The two
shared storage boxes are each split into 6.9TB LUNs, for a total of four
6.9TB LUNs.  Each machine is connected via a single 100Mb connection to a
switch and a single FC connection to an FC switch.

The 4 LUNs have LVM on them with GFS2.  The file systems are mountable from
each box.  When performing a scripted dd write of zeros in 250MB file sizes
to the file systems, from each box to different LUNs, one of the nodes in
the cluster is fenced by the other.  File size does not seem to matter.
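The write test is essentially the following sketch (a hypothetical reconstruction; the function name, mount point, and file count are assumptions — each node ran it against a different LUN's mount point):

```shell
#!/bin/sh
# Loop a dd of zeros in fixed-size files onto a mounted GFS2 filesystem.
# write_zeros <target dir> <file size in MB> <number of files>
write_zeros() {
    target=$1
    size_mb=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # bs=1M count=$size_mb writes a file of exactly $size_mb megabytes
        dd if=/dev/zero of="$target/zero.$i" bs=1M count="$size_mb" 2>/dev/null
        i=$((i + 1))
    done
}

# e.g. on node 1:  write_zeros /mnt/001vg_gfs 250 20   (paths assumed)
```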

My first guess at the problem was the heartbeat timeout in openais.  In the
cluster.conf below I added the totem line to raise the token timeout to
10 seconds.  This, however, did not resolve the problem.  Both boxes are
running the latest updates as of two days ago from up2date.

Below are the cluster.conf and what shows up in the logs.  Any suggestions
would be greatly appreciated.

Thanks!

Neal



##########################################

Cluster.conf

##########################################


<?xml version="1.0"?>
<cluster alias="storage1" config_version="4" name="storage1">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="fu1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="apc4" port="1" switch="1"/>
                                </method>
                        </fence>
                        <multicast addr="224.10.10.10" interface="eth0"/>
                </clusternode>
                <clusternode name="fu2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="apc4" port="2" switch="1"/>
                                </method>
                        </fence>
                        <multicast addr="224.10.10.10" interface="eth0"/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
                <multicast addr="224.10.10.10"/>
                <totem token="10000"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.14.193" login="apc" name="apc4" passwd="apc"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>


#####################################################

/var/log/messages

#####################################################

Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from 2.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from 0.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token because I am the rep.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high seq received 6e
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member 192.168.14.195:
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep 192.168.14.195
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e received flag 0
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate any messages in recovery.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for ring 14
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec post_fail_delay
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.197)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message 192.168.14.195
Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from node 1
Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replayed 0 of 0 blocks
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Found 0 revoke tags
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Done
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replayed 0 of 0 blocks
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Found 0 revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Done
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replayed 222 of 223 blocks
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Found 1 revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Done
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replayed 438 of 439 blocks
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Found 1 revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Done