[Linux-cluster] problem with rejoining a node
Patrick Caulfield
pcaulfie at redhat.com
Mon Aug 8 15:35:14 UTC 2005
Javi Polo wrote:
> Hi there (again :P)
>
> I'm still fighting with all this, sorry to bother so much (hope some day
> when I understand it all better I'll write some article on how to set this up)
>
> Well, I already have the cluster up and the gfs filesystem mounted on 3
> machines, and if one of those goes down, it's correctly fenced. The FC
> port is also disconnected, so I suppose everything is ok at this point.
>
> The problem is with the recovery. My understanding is that when a node
> rejoins it is automatically unfenced, and then it can rejoin the fence
> domain and mount the filesystem again.
>
> I've blocked all input and output traffic on the node I want to test
> with iptables.
>
> The node gets fenced ok:
> Aug 8 16:00:48 gfstest2 fenced[2594]: fencing node "gfstest1"
> Aug 8 16:00:56 gfstest2 fenced[2594]: fence "gfstest1" success
What sort of fencing are you using? If it's a power-switch fence then the
node should be hard rebooted. If it's SAN fencing then you'll have to get the
node out of the cluster - the remaining two nodes /should/ tell it to leave the
cluster.
A node can't just "rejoin" a cluster after being SAN fenced. It must be removed
from the cluster and rejoin from scratch. There's far too much state involved
for it to merge seamlessly back into a cluster.
> Now I can access the GFS filesystem safely from my other 2 nodes, as the
> FC port for gfstest1 is disabled now, but if I enable traffic for the
> node, it does not rejoin the cluster. Shouldn't this happen automatically?
>
> Anyway, I cannot rejoin/leave/whatever the cluster from gfstest1:
> gfstest1:~# cman_tool services
> Service Name GID LID State Code
> Fence Domain: "default" 1 2 run -
> [1 2 3]
>
> DLM Lock Space: "primer_fs" 2 3 run -
> [1 2 3]
>
> GFS Mount Group: "primer_fs" 3 4 run -
> [1 2 3]
>
> gfstest1:~# cman_tool join
> cman_tool: Node is already active
> gfstest1:~# cman_tool leave
> cman_tool: Can't leave cluster while there are 5 active subsystems
"cman_tool leave force" will force it to leave, but you might find it still
needs a reboot to clear the filesystems.
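
As a rough sketch of that sequence on the fenced node, assuming SAN fencing: the
helper names (run, leave_and_rejoin) and the DRY_RUN switch are hypothetical, and
only the cman_tool subcommands already shown in this thread are used. Re-enabling
the FC port depends on the switch and fence agent, so it is left as a comment.

```shell
#!/bin/sh
# Hypothetical helper, not part of the cluster suite.
# DRY_RUN defaults to 1, so the commands are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

leave_and_rejoin() {
    # Show the service groups the node still believes it belongs to.
    run cman_tool services
    # Force the node out even though subsystems are still active.
    run cman_tool leave force || return 1
    # Site-specific step, not shown: re-enable the node's FC port.
    # If the stale GFS mount cannot be cleared, reboot here instead and
    # join after the node comes back up.
    run cman_tool join
}
```

Calling leave_and_rejoin with the default DRY_RUN=1 just lists the three
cman_tool commands in order; set DRY_RUN=0 only once the storage side has been
sorted out.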
> Also, I cannot umount /dev/sdc1 as I have no access to the SAN
> (and in any case DLM should prevent it from doing so). So I get a totally
> screwed-up system that I can only fix by hard-rebooting (if I do a
> clean reboot, the system "hangs" while "umounting filesystems").
>
> Also, when the system boots up, the SAN is still inaccessible, as the
> fencing script does not run to re-enable the port ...
>
> I'm loooooost diving into google queries ... and it's certainly hard to
> find accurate info about all this :/
>
> Could someone shed some light?
> (I probably don't understand well how the fencing system works, but I also
> haven't found anywhere that explains it :/)
>
> thx in advance :)
--
patrick