[Linux-cluster] CLVM and Cluster Service Migration issues
James Chamberlain
jamesc at exa.com
Wed Dec 19 21:14:16 UTC 2007
Hi all,
I've got a three-node CentOS 5 x86-64 CS/GFS cluster running kernel
2.6.18-53.el5. Last night, I tried to grow two of the file systems on it.
I ran lvextend and then gfs_grow on node3, with node2 serving the file
systems out to the local network. While gfs_grow was running, the service
failed on node2 and I couldn't get it to restart. It looked to me like
neither node1 nor node2 was aware of the lvextend I had run on node3. I had
to reboot the entire cluster to bring everything back online.
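For what it's worth, one sanity check before extending an LV on a shared VG is to confirm the VG actually has its clustered bit set, since a non-clustered VG is exactly the kind of thing that would leave the other nodes unaware of an lvextend. A rough sketch (the VG name "clustervg" and the sample vgs output here are made up for illustration):

```shell
#!/bin/sh
# Hypothetical output of: vgs --noheadings -o vg_name,vg_attr clustervg
vgs_out="  clustervg  wz--nc"

# vg_attr is six flag characters; the 6th is 'c' when the VG uses
# cluster-wide (clvmd) locking.
attr=$(echo "$vgs_out" | awk '{print $2}')
case "$attr" in
  ?????c) vg_state="clustered" ;;
  *)      vg_state="local" ;;
esac
echo "VG is $vg_state"

# Only once the VG is clustered (and clvmd is running on every node)
# should something like this be safe, with sizes/paths adjusted:
#   lvextend -L +100G /dev/clustervg/fs1
#   gfs_grow /mnt/fs1        # on a node with the fs mounted
```

If the VG shows up as local, lvextend metadata changes made on one node won't be propagated to the others.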
This afternoon, node2 fenced node3. Nothing migrated, and the entire
cluster had to be rebooted again to recover. What I noticed after the
full reboot is that I seem to be getting initial ARP responses from the
wrong nodes, as shown below:
[root@workstation ~]# arping cluster-fs1
ARPING 10.1.1.142 from 10.1.1.101 eth0
Unicast reply from 10.1.1.142 [00:1B:78:D1:88:C2] 0.624ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66] 0.666ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66] 0.621ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs2
ARPING 10.1.1.143 from 10.1.1.101 eth0
Unicast reply from 10.1.1.143 [00:1B:78:D1:88:C2] 0.695ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66] 0.734ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66] 0.680ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs3
ARPING 10.1.1.144 from 10.1.1.101 eth0
Unicast reply from 10.1.1.144 [00:1C:C4:81:9F:66] 0.734ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2] 0.913ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2] 0.640ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping node1
ARPING 10.1.1.131 from 10.1.1.101 eth0
Unicast reply from 10.1.1.1 [00:1B:78:D1:88:C2] 0.771ms
[...]
[root@workstation ~]# arping node2
ARPING 10.1.1.132 from 10.1.1.101 eth0
Unicast reply from 10.1.1.2 [00:1C:C4:81:AD:72] 0.681ms
[...]
[root@workstation ~]# arping node3
ARPING 10.1.1.133 from 10.1.1.101 eth0
Unicast reply from 10.1.1.3 [00:1C:C4:81:9F:66] 0.631ms
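To make the pattern above easier to see, here's a quick pipeline I put together that flags any IP answered by more than one MAC, which is what a duplicate-address / stale-ARP situation looks like. The reply list is just the IP/MAC pairs condensed from the arping output above:

```shell
#!/bin/sh
# IP/MAC reply pairs condensed from the arping runs above.
replies='10.1.1.142 00:1B:78:D1:88:C2
10.1.1.142 00:1C:C4:81:9F:66
10.1.1.143 00:1B:78:D1:88:C2
10.1.1.143 00:1C:C4:81:9F:66
10.1.1.144 00:1C:C4:81:9F:66
10.1.1.144 00:1B:78:D1:88:C2'

# De-duplicate identical pairs, then list any IP that still appears
# more than once, i.e. is claimed by two different MACs.
dups=$(echo "$replies" | sort -u | awk '{print $1}' | uniq -d | sort -u)
echo "$dups"
```

All three service IPs show up, meaning two hosts are answering ARP for each of them.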
At the time, node1 was supposed to be serving fs1, fs2, and fs3. I'll note
that I did forget to run "lvmconf --enable-cluster" when I first set the
volume group up, though I did make that change before putting the cluster
into production.
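In case it matters, here's how I verified that the locking change actually took: "lvmconf --enable-cluster" should leave locking_type = 3 in /etc/lvm/lvm.conf on every node. This sketch checks a sample config fragment rather than the real file:

```shell
#!/bin/sh
# Sample fragment standing in for /etc/lvm/lvm.conf on each node.
lvm_conf='global {
    locking_type = 3
}'

# locking_type = 3 means cluster-wide locking via clvmd.
locking=$(echo "$lvm_conf" | awk '/^[[:space:]]*locking_type/ {print $3}')
if [ "$locking" = "3" ]; then
    echo "cluster locking enabled"
else
    echo "cluster locking NOT enabled (locking_type=$locking)"
fi
```

This needs to be true on all three nodes, and clvmd needs a restart after the change for it to take effect.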
Anyone have any thoughts on what's going on and what to do about it?
Thanks,
James