[Linux-cluster] Node fencing problem
jobot at wmdata.com
Wed Aug 22 09:07:21 UTC 2007
We're having some problems getting fencing to work as expected on our two-node cluster.
Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5
When I simulate a network failure on a node, I expect it to be fenced by the other node, but for some reason that doesn't happen.
Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a network failure on test-db1

Expected results:
1. Node test-db2 detects that test-db1 failed
2. test-db1 gets fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Services fail over from test-db1 to test-db2

Actual results:
1. Node test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and is not failed
over to test-db2, even though test-db2 thinks test-db1 is "offline".
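For reference, this is roughly how I simulate the network failure in step 3 (a sketch; eth0 is an assumption, substitute whatever interface cman actually uses for cluster traffic):

```shell
# On test-db1: drop all traffic on the cluster interface.
iptables -A INPUT -i eth0 -j DROP
iptables -A OUTPUT -o eth0 -j DROP

# Later, to restore connectivity (remove the same rules):
iptables -D INPUT -i eth0 -j DROP
iptables -D OUTPUT -o eth0 -j DROP
```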
Log files and debug output from test-db2:
/var/log/messages after the failure: http://pastebin.com/m2fe4ce36
"group_tool dump fence" output: http://pastebin.com/m79d21ed9
clustat output: http://pastebin.com/m4d1007c2
And if I restore network connectivity on test-db1, the filesystem becomes writable on that node as well, which would probably result in filesystem corruption.
I think the fencedevice section of cluster.conf is correct, since nodes are sometimes fenced when the cluster is started and one node doesn't join fast enough.
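For comparison, here is the general shape I'd expect a working two-node setup to have in cluster.conf. This is only a hypothetical sketch, not our actual config (which is in the pastebin above): the fence agent, device name, IP, and credentials are placeholders, though the two_node/expected_votes settings are the documented requirement for two-node cman clusters.

```
<cman two_node="1" expected_votes="1"/>
<clusternodes>
  <clusternode name="test-db1" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="ipmi-db1"/>   <!-- placeholder device name -->
      </method>
    </fence>
  </clusternode>
  <!-- test-db2 configured analogously -->
</clusternodes>
<fencedevices>
  <!-- agent/ipaddr/login/passwd are placeholders -->
  <fencedevice agent="fence_ipmilan" name="ipmi-db1" ipaddr="10.0.0.1" login="admin" passwd="secret"/>
</fencedevices>
```

Independent of the config, running `fence_node test-db1` from test-db2 should exercise the same fencing path the cluster uses, which might help narrow down whether the problem is the fence device or the failure detection.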
What am I doing wrong?