[Linux-cluster] GFS locking up

Brian Marsden brian.marsden at sgc.ox.ac.uk
Tue Sep 7 18:39:29 UTC 2004


Hi,

 I have two machines, hestia and hroth1, which are running Red Hat
Enterprise Linux 3.0 AS. The two machines are connected via fibre
channel to the same storage group on an EMC CX300 array. I have
compiled GFS from the latest src.rpm available, along with the
2.4.21-15 kernel patches. All works fine on both nodes for a while
(locking is fine, no corruption, manual fencing works if a machine
dies), but then processes that access any of the mounted GFS
filesystems lock up. This is hard to reproduce reliably and may occur
at any time. Classic examples are ls /scratch (where /scratch is a GFS
filesystem), or even mount/umount. Once one process has locked up, no
other GFS filesystem, nor any command that touches one, works. Only a
reboot solves the problem - restarting lock_gulm does not help (and on
one occasion actually gave me a kernel panic).
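
 (As a rough sketch of how to look at a hung process - standard
commands, nothing GFS-specific - something like the following lists
anything stuck in uninterruptible sleep and dumps kernel stacks into
the kernel log:

   # list processes in D state and what they are waiting on
   ps axo pid,stat,wchan:30,comm | awk '$2 ~ /D/'

   # dump kernel stacks of all tasks to the log (needs sysrq enabled)
   echo 1 > /proc/sys/kernel/sysrq
   echo t > /proc/sysrq-trigger
   dmesg | tail -n 200
)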

 At first I thought this was a fencing issue, but looking at both
machines' /var/log/messages shows no GFS messages at all (when a
machine crashes and the manual fence is activated, I always see
messages telling me to acknowledge the fence). In addition, gulm_tool
shows both nodes as logged in and the heartbeat working fine.
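
 For what it is worth, the log check is nothing more exotic than
something like this on both machines (and around the time of a lockup
it turns up nothing):

   grep -iE 'gfs|gulm|fence' /var/log/messages

and the gulm_tool check is along these lines - the exact subcommand is
from memory, so treat it as a sketch:

   gulm_tool nodelist hestia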

 Another interesting behaviour is the delay in statting the filesystems
for the first time after they are mounted - e.g. running df -l can take
up to 5 seconds per GFS filesystem.
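
 To be concrete (using /scratch as the example mount point, timings as
described above):

   mount /scratch
   time df -l /scratch     # first stat after mounting: several seconds
   time df -l /scratch     # repeat: returns almost immediately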

 Has anyone heard of these problems before? As it stands, my current
setup is pretty much unusable(!).

 For reference, my CCS configuration files look like this:

nodes.ccs:

 nodes {
        hestia {
                ip_interfaces {
                        eth1 = "192.168.1.253"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.1.253"
                                }
                        }
                }
        }
        hroth1 {
                ip_interfaces {
                        eth1 = "192.168.1.1"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.1.1"
                                }
                        }
                }
        }
}

cluster.ccs:
cluster {
        name = "SAN1"
        lock_gulm {
                servers = ["hestia", "hroth1"]
        }
}

fence.ccs:
fence_devices {
        admin {
                agent = "fence_manual"
        }
}

Any advice would be very gratefully received.

Regards,

Brian Marsden

--
Dr. Brian Marsden                Email: brian.marsden at sgc.ox.ac.uk
Head of Research Informatics
Structural Genomics Consortium
University of Oxford
Botnar Research Centre           Phone: +44 (0)1865 227723 
OX3 7LD, Oxford, UK              Fax:   +44 (0)1865 737231




