[Linux-cluster] Service Failover/Fencing Problems
Nick.Couchman at seakr.com
Fri Aug 24 22:12:28 UTC 2007
I'm attempting to set up a clustered file system using the cluster suite, GFS, Samba, and NFS. I'm having a couple of issues with the configuration.
Right now I have two virtual machines in the cluster. They're sharing some iSCSI storage and are configured in a two-node cluster. I used Conga to do the configuration. I have a fairly simple setup so far: two nodes, two "virtual IP" resources, two Samba resources, and a single file share (for now). Since I'm in VMware VMs, I don't really have any fencing devices right now - maybe that's part of the problem. The physical configuration that this is going in will have two identical Dell PE2650s with DRACs, so I can use the DRAC controllers for fencing. I have two failover domains set up - each node is the primary (priority = 1) machine in one of the domains and the backup in the other domain.
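To make the intended failover-domain behaviour concrete, here is a rough Python sketch of what I expect to happen. It is only a model of my configuration, not how rgmanager is actually implemented, and the node and domain names (node1, node2, domain-a, domain-b) are made up for illustration.

```python
from typing import Dict, Optional, Set

# Hypothetical names: each domain maps member node -> priority (1 = preferred).
FAILOVER_DOMAINS: Dict[str, Dict[str, int]] = {
    "domain-a": {"node1": 1, "node2": 2},
    "domain-b": {"node2": 1, "node1": 2},
}


def preferred_node(domain: str, online: Set[str]) -> Optional[str]:
    """Return the online member with the lowest priority number, or None."""
    members = FAILOVER_DOMAINS[domain]
    candidates = [node for node in members if node in online]
    if not candidates:
        return None  # no member of the domain is up
    return min(candidates, key=lambda node: members[node])


# Both nodes up: each domain's service sits on its priority-1 node.
print(preferred_node("domain-a", {"node1", "node2"}))
# node1 lost: I expect the domain-a service to move to node2.
print(preferred_node("domain-a", {"node2"}))
```

In other words, with node1 gone I would expect both services to end up on node2, which is not what I'm seeing.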
Back to the problem, though...I'm attempting to test scenarios where one of the servers becomes unavailable. So far, my tests have caused the cluster to behave erratically and fail to move services to any host. Here's what I've tried for my tests, so far:
1) Disconnecting the virtual network interface on one of the VMs that's part of the cluster. This caused the other machine to recognize that the machine was no longer in the cluster, but the other machine never did anything about restarting the service - it just left it as "dead." I had to manually start the service, even though I have it configured to auto-start and I have the policy set to "relocate."
2) Shutting down one of the VMs (forcibly - hard shutdown - like a power supply failure). In this instance, I started getting messages on the second node that it was attempting to fence the first node, and the fencing kept failing. Conga showed the node as "unknown" but still showed the original service running on the first node, and when I attempted to migrate the service manually, Conga hung trying to migrate it (probably waiting on the fencing to occur).
So those are the problems I'm currently encountering. I'd really like to get this setup figured out so that I can get a cluster that fails IPs and Samba/NFS services over successfully between nodes. Do I have to have a fencing device in order for this to work correctly, or is there some other way to configure the cluster to tell it to just fail the services over if it detects a problem? I know that could cause some bad scenarios, e.g. if you lose connectivity between the servers but the servers still have connectivity to other networks/clients, but for now I just need a simple setup that will allow me to have a node fail and everything migrate successfully.