[Linux-cluster] CLVM and Cluster Service Migration issues

James Chamberlain jamesc at exa.com
Wed Dec 19 21:14:16 UTC 2007


Hi all,

I've got a three-node CentOS 5 x86-64 CS/GFS cluster running kernel 
2.6.18-53.el5.  Last night, I tried to grow two of the file systems on it. 
I ran lvextend and then gfs_grow on node3, with node2 serving the file 
systems out to the local network.  While gfs_grow was running, node2 failed 
the service, and I couldn't get it to restart.  It looked to me as though 
neither node1 nor node2 was aware of the lvextend I had run on node3, and I 
had to reboot the entire cluster to bring everything back online.
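
For reference, the grow sequence I ran was essentially the following; the 
volume group, logical volume, and mount point names here are placeholders, 
as is the size:

[root@node3 ~]# lvextend -L +50G /dev/myvg/fs1
[root@node3 ~]# gfs_grow /mnt/fs1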

This afternoon, node2 fenced node3.  Nothing migrated, and the entire 
cluster had to be rebooted again to recover.  What I noticed after the 
full reboot is that I seem to be getting initial ARP responses from the 
wrong nodes, as shown below:

[root@workstation ~]# arping cluster-fs1
ARPING 10.1.1.142 from 10.1.1.101 eth0
Unicast reply from 10.1.1.142 [00:1B:78:D1:88:C2]  0.624ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.666ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.621ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs2
ARPING 10.1.1.143 from 10.1.1.101 eth0
Unicast reply from 10.1.1.143 [00:1B:78:D1:88:C2]  0.695ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66]  0.734ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66]  0.680ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs3
ARPING 10.1.1.144 from 10.1.1.101 eth0
Unicast reply from 10.1.1.144 [00:1C:C4:81:9F:66]  0.734ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2]  0.913ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2]  0.640ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)

[root@workstation ~]# arping node1
ARPING 10.1.1.131 from 10.1.1.101 eth0
Unicast reply from 10.1.1.1 [00:1B:78:D1:88:C2]  0.771ms
[...]
[root@workstation ~]# arping node2
ARPING 10.1.1.132 from 10.1.1.101 eth0
Unicast reply from 10.1.1.2 [00:1C:C4:81:AD:72]  0.681ms
[...]
[root@workstation ~]# arping node3
ARPING 10.1.1.133 from 10.1.1.101 eth0
Unicast reply from 10.1.1.3 [00:1C:C4:81:9F:66]  0.631ms

At the time, node1 was supposed to be serving fs1, fs2, and fs3.
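
Given the duplicate replies above, I'm wondering whether more than one node 
still has those service IPs plumbed.  On each node, something along these 
lines should show it (the interface name is a guess on my part), and clustat 
should say which node rgmanager thinks owns each service:

[root@node1 ~]# ip -4 addr show dev eth0
[root@node1 ~]# clustat
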
I'll note that I did forget to run "lvmconf --enable-cluster" when I first 
set up the volume group, though I did make that change before putting the 
cluster into production.
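
In case it's relevant, here's how I've been double-checking that the volume 
group really is clustered now; the sixth character of vg_attr should be 'c', 
and locking_type in lvm.conf should be 3:

[root@node1 ~]# vgs -o vg_name,vg_attr
[root@node1 ~]# grep locking_type /etc/lvm/lvm.conf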

Anyone have any thoughts on what's going on and what to do about it?

Thanks,

James



