[Linux-cluster] GFS hanging on 3 node RHEL4 cluster

Shawn Hood shawnlhood at gmail.com
Tue Oct 7 17:33:45 UTC 2008


Problem:
It seems that I/O on one machine in the cluster (not always the same
machine) hangs, and all processes accessing clustered LVs block.  The
other machines follow suit shortly thereafter, until the machine that
first exhibited the problem is rebooted (fenced manually via
fence_drac).  There are no messages in dmesg, syslog, etc.  The
filesystems were recently fsck'd.

Hardware:
Four Dell 1950s (similar except memory -- 3x 16GB RAM, 1x 8GB RAM),
running RHEL4 ES U7
Onboard gigabit NICs (machines use little bandwidth, and all network
traffic, including DLM, shares the NICs)
QLogic 2462 PCI-Express dual channel FC HBAs
QLogic SANBox 5200 FC switch
Apple XRAID which presents as two LUNs (~4.5TB raw aggregate)
Cisco Catalyst switch

Simple four machine RHEL4 U7 cluster running kernel 2.6.9-78.0.1.ELsmp
x86_64 with the following packages:
ccs-1.0.12-1
cman-1.0.24-1
cman-kernel-smp-2.6.9-55.13.el4_7.1
cman-kernheaders-2.6.9-55.13.el4_7.1
dlm-kernel-smp-2.6.9-54.11.el4_7.1
dlm-kernheaders-2.6.9-54.11.el4_7.1
fence-1.32.63-1.el4_7.1
GFS-6.1.18-1
GFS-kernel-smp-2.6.9-80.9.el4_7.1

One clustered VG.  Striped across two physical volumes, which
correspond to each side of an Apple XRAID.
Clustered volume group info:
  --- Volume group ---
  VG Name               hq-san
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  50
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               4.55 TB
  PE Size               4.00 MB
  Total PE              1192334
  Alloc PE / Size       905216 / 3.45 TB
  Free  PE / Size       287118 / 1.10 TB
  VG UUID               hfeIhf-fzEq-clCf-b26M-cMy3-pphm-B6wmLv
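
As a quick cross-check of the figures above, here is a minimal Python sketch that converts the extent counts to sizes. It assumes the "TB" in this output is binary (2^40 bytes); the helper name pe_to_tb is made up for illustration and is not part of LVM.

```python
# Extent counts and PE size copied from the volume group info above.
PE_SIZE_MB = 4.00
TOTAL_PE = 1192334
ALLOC_PE = 905216
FREE_PE = 287118


def pe_to_tb(extents, pe_size_mb=PE_SIZE_MB):
    """Convert a number of physical extents to TB (assumed binary: 1 TB = 1024 * 1024 MB)."""
    return extents * pe_size_mb / 1024 / 1024


# Allocated and free extents should add up to the total.
assert ALLOC_PE + FREE_PE == TOTAL_PE

for label, extents in (("VG Size", TOTAL_PE), ("Alloc", ALLOC_PE), ("Free", FREE_PE)):
    print(f"{label:8s} {pe_to_tb(extents):.2f} TB")
```

The three values agree with the VG Size, Alloc and Free lines reported above when rounded to two decimals.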

Logical volumes contained within the hq-san VG:
  cam_development   hq-san                          -wi-ao 500.00G
  qa                hq-san                          -wi-ao   1.07T
  svn_users         hq-san                          -wi-ao   1.89T

All four machines mount svn_users, two machines mount qa, and one
mounts cam_development.

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster alias="tungsten" config_version="31" name="qualia">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="odin" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="odin-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="hugin" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="hugin-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="munin" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="munin-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="zeus" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="zeus-drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="0"/>
        <fencedevices>
                <resources/>
                <fencedevice name="odin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
                <fencedevice name="hugin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
                <fencedevice name="munin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
                <fencedevice name="zeus-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
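
For anyone wanting to sanity-check a config like the one above, here is a minimal Python sketch that reads a cluster.conf, sums the per-node votes, reports the expected_votes value from the cman element, and lists any fence device a node references that is not defined under fencedevices. The function name check_cluster_conf is made up for illustration, and treating a missing votes attribute as 1 is an assumption. It only reports what the file says; it makes no claim about how cman derives quorum from these values or whether any of this relates to the hang.

```python
import sys
import xml.etree.ElementTree as ET


def check_cluster_conf(xml_text):
    """Return basic consistency facts about a cluster.conf document (str or bytes)."""
    root = ET.fromstring(xml_text)

    # Sum the votes of every <clusternode>; a missing attribute is assumed to mean 1.
    nodes = root.findall("./clusternodes/clusternode")
    total_votes = sum(int(n.get("votes", "1")) for n in nodes)

    # expected_votes as written on <cman>, or None if it is absent.
    cman = root.find("./cman")
    raw_expected = cman.get("expected_votes") if cman is not None else None
    expected_votes = int(raw_expected) if raw_expected else None

    # Fence devices defined globally vs. those referenced by the nodes.
    defined = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    referenced = {
        d.get("name")
        for d in root.findall("./clusternodes/clusternode/fence/method/device")
    }

    return {
        "nodes": [n.get("name") for n in nodes],
        "total_votes": total_votes,
        "expected_votes": expected_votes,
        "undefined_fence_devices": sorted(referenced - defined),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_conf.py /etc/cluster/cluster.conf
    with open(sys.argv[1], "rb") as handle:
        for key, value in check_cluster_conf(handle.read()).items():
            print(f"{key}: {value}")
```

Run against the config above, this would report four nodes with one vote each alongside expected_votes="1", which may be worth a second look.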




--
Shawn Hood
910.670.1819 m



