[Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing

Steven Whitehouse swhiteho at redhat.com
Thu Jun 7 07:28:53 UTC 2007


Hi,

The version of GFS2 in RHEL5 is rather old. Please use Fedora or the
upstream kernel, or wait until RHEL 5.1 is out. This should solve the
problem that you are seeing.

Steve.
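
For anyone wanting to reproduce the write load described in the quoted
message below, here is a minimal sketch. The original test script was
not posted, so the function name, the file names and the directory
argument are assumptions; only the use of dd to write zeros in 250MB
files comes from the report.

```shell
# Hypothetical helper: write COUNT files of FILE_MB megabytes of zeros
# into TARGET_DIR using dd. Not the script from the original report.
write_zero_files() {
    target_dir=$1   # directory on the file system under test
    file_mb=$2      # size of each file in MB (the report used 250)
    count=$3        # number of files to write

    mkdir -p "$target_dir" || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        # bs is given in bytes (1 MB) so it does not depend on GNU suffixes
        dd if=/dev/zero of="$target_dir/zero_$i.bin" \
           bs=1048576 count="$file_mb" 2>/dev/null || return 1
        i=$((i + 1))
    done
}
```

Run on each node against a different mount point, for example
`write_zero_files /mnt/gfs2_a 250 10` (the mount point is a placeholder).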

On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote:
> Hello,
> 
> I installed RHEL5 on a new two-node cluster with shared FC storage.  The
> two shared storage boxes are each split into two 6.9TB LUNs, for a total
> of four 6.9TB LUNs.  Each machine has a single 100Mb connection to a
> switch and a single FC connection to an FC switch.
> 
> The 4 LUNs have LVM on them with GFS2.  The file systems are mountable
> from each box.  When a script uses dd to write zeros in 250MB files to
> the file systems, with each box writing to a different LUN, one of the
> nodes in the cluster is fenced by the other.  File size does not seem
> to matter.
> 
> My first guess was the heartbeat timeout in openais.  In the
> cluster.conf below I added the totem line in the hope of raising the
> timeout to 10 seconds.  This, however, did not resolve the problem.
> Both boxes are running the latest updates from up2date as of 2 days
> ago.
> 
> Below is the cluster.conf and what is seen in the logs.  Any
> suggestions would be greatly appreciated.
> 
> Thanks!
> 
> Neal
> 
> 
> 
> ##########################################
> 
> Cluster.conf
> 
> ##########################################
> 
> 
> <?xml version="1.0"?>
> <cluster alias="storage1" config_version="4" name="storage1">
>         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="fu1" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="apc4" port="1" switch="1"/>
>                                 </method>
>                         </fence>
>                         <multicast addr=" 224.10.10.10" interface="eth0"/>
>                 </clusternode>
>                 <clusternode name="fu2" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="apc4" port="2" switch="1"/>
>                                 </method>
>                         </fence>
>                         <multicast addr="224.10.10.10" interface="eth0"/>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1">
>                 <multicast addr="224.10.10.10"/>
>                 <totem token="10000"/>
>         </cman>
>         <fencedevices>
>                 <fencedevice agent="fence_apc" ipaddr="192.168.14.193" login="apc" name="apc4" passwd="apc"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
> </cluster>
> 
> 
> #####################################################
> 
> /var/log/messages
> 
> #####################################################
> 
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from 2.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from 0.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token because I am the rep.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high seq received 6e
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member 192.168.14.195:
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep 192.168.14.195
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e received flag 0
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate any messages in recovery.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for ring 14
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
> Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec post_fail_delay
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.197)
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message 192.168.14.195
> Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from node 1
> Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying to acquire journal lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Trying to acquire journal lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Trying to acquire journal lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Trying to acquire journal lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Acquiring the transaction lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replaying journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replayed 0 of 0 blocks
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Found 0 revoke tags
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Journal replayed in 1s
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Done
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Acquiring the transaction lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replaying journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replayed 0 of 0 blocks
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Found 0 revoke tags
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Journal replayed in 1s
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Done
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Acquiring the transaction lock...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Acquiring the transaction lock...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replaying journal...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replayed 222 of 223 blocks
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Found 1 revoke tags
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Journal replayed in 1s
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Done
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replaying journal...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replayed 438 of 439 blocks
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Found 1 revoke tags
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Journal replayed in 1s
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Done
> 
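
To pull just the token-loss and fencing events out of a log like the
one quoted above, a small filter is often enough. This is only a
sketch: the function name is made up here, and the two patterns are
taken from the lines in this excerpt, so other openais or fenced
versions may word their messages differently.

```shell
# Hypothetical helper: print totem token-loss lines and all fenced
# daemon lines from a syslog-style file.
fence_events() {
    log_file=$1
    # "The token was lost" comes from the openais [TOTEM] message;
    # "fenced[PID]" matches any line logged by the fenced daemon.
    grep -E 'The token was lost|fenced\[[0-9]+\]' "$log_file"
}
```

For example, `fence_events /var/log/messages` on the surviving node
shows how long after the token loss the fence was issued.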
> 



