[Linux-cluster] CLVM and Cluster Service Migration issues

James Chamberlain jamesc at exa.com
Wed Dec 19 21:14:16 UTC 2007


Hi all,

I've got a three-node CentOS 5 x86-64 CS/GFS cluster running kernel 
2.6.18-53.el5.  Last night, I tried to grow two of the file systems on it. 
I ran lvextend and then gfs_grow on node3, with node2 serving the file 
systems out to the local network.  While gfs_grow was running, node2 failed 
the service, and I couldn't get it to restart.  It looked to me as though 
neither node1 nor node2 was aware of the lvextend I had run on node3, and I 
had to reboot the entire cluster to bring everything back online.
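
For reference, the grow sequence I ran was essentially the following; the 
volume group, logical volume, and mount point names here are placeholders, 
as is the size:

[root@node3 ~]# lvextend -L +50G /dev/myvg/fs1
[root@node3 ~]# gfs_grow /mnt/fs1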

This afternoon, node2 fenced node3.  Nothing migrated, and the entire 
cluster had to be rebooted again to recover.  What I noticed after the 
full reboot is that I seem to be getting initial ARP responses from the 
wrong nodes, as shown below:

[root@workstation ~]# arping cluster-fs1
ARPING 10.1.1.142 from 10.1.1.101 eth0
Unicast reply from 10.1.1.142 [00:1B:78:D1:88:C2]  0.624ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.666ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.621ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs2
ARPING 10.1.1.143 from 10.1.1.101 eth0
Unicast reply from 10.1.1.143 [00:1B:78:D1:88:C2]  0.695ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66]  0.734ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66]  0.680ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs3
ARPING 10.1.1.144 from 10.1.1.101 eth0
Unicast reply from 10.1.1.144 [00:1C:C4:81:9F:66]  0.734ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2]  0.913ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2]  0.640ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)

[root@workstation ~]# arping node1
ARPING 10.1.1.131 from 10.1.1.101 eth0
Unicast reply from 10.1.1.1 [00:1B:78:D1:88:C2]  0.771ms
[...]
[root@workstation ~]# arping node2
ARPING 10.1.1.132 from 10.1.1.101 eth0
Unicast reply from 10.1.1.2 [00:1C:C4:81:AD:72]  0.681ms
[...]
[root@workstation ~]# arping node3
ARPING 10.1.1.133 from 10.1.1.101 eth0
Unicast reply from 10.1.1.3 [00:1C:C4:81:9F:66]  0.631ms

At the time, node1 was supposed to be serving fs1, fs2, and fs3.
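
Given the duplicate replies above, I'm wondering whether more than one node 
still has those service IPs plumbed.  On each node, something along these 
lines should show it (the interface name is a guess on my part), and clustat 
should say which node rgmanager thinks owns each service:

[root@node1 ~]# ip -4 addr show dev eth0
[root@node1 ~]# clustat
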
I'll note that I did forget to run "lvmconf --enable-cluster" when I first 
set up the volume group, though I did make that change before putting the 
cluster into production.
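
In case it's relevant, here's how I've been double-checking that the volume 
group really is clustered now; the sixth character of vg_attr should be 'c', 
and locking_type in lvm.conf should be 3:

[root@node1 ~]# vgs -o vg_name,vg_attr
[root@node1 ~]# grep locking_type /etc/lvm/lvm.conf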

Anyone have any thoughts on what's going on and what to do about it?

Thanks,

James



