[Linux-cluster] Graceful recover after connectivity failure

Mon Jan 14 13:33:32 UTC 2008

Patrick Caulfield wrote:
> Cliff Hones wrote:
>> I am using Centos5.1 with GNBD and GNBD fencing.
>>
>> Following the failure of a cluster member - eg a temporary
>> loss of connectivity - which results in the node being
>> fenced, is there a clean way to re-join the cluster without
>> having to reboot the affected node?
> 
> Basically, no.
> 
> If a node is apart from the cluster for any period of time, it can't
> tell whether the state of that cluster has changed while it was
> disconnected. So it must be fenced and restart the cluster software from
> the beginning to rebuild its state from scratch.

Yes - I realised that.  If one of the power fencing mechanisms
is needed the node will reboot and restart its cluster - hopefully
automatically.  If gnbd fencing is used, the node is left up and
running, but locked out of the shared storage.  What I would like
is to find a "clean" way to restart the node's cluster s/w.
A manual reboot is one way - either by power-cycling or (provided
it doesn't hang) by a shutdown or reboot command.  I would prefer
something less drastic than a complete reboot - eg a controlled
shutdown and then restart of the cluster software, but I have not
managed to achieve this.  Has anyone any idea if this is possible?

If rebooting is the only (or best) option then it needs to be possible
remotely - eg using ssh access.  Unfortunately, when the node has
been fenced, the normal "shutdown" hangs during the cluster shutdown
scripts, and by the time it hangs the ssh daemon has already been
stopped, so it is impossible to force a reboot remotely.  While a forced
reboot bypassing the init scripts is possible, it is hardly a "clean"
way to shut down.

Fajar A. Nugraha wrote:
> By "force a reboot", do you mean "reboot -f"?
> I had some situation where "reboot" simply hangs (mostly related with
> I/O problems), but "reboot -f" works every time.

My understanding of reboot is that it acts as "reboot -f" if run
during a hung shutdown (actually, if run when the init runlevel is 0 or 6).
So it seems safest to try a normal shutdown, and then if it hangs
do a reboot from the system console.  This normally works, but on at
least two occasions the system has still locked, this time with
no login capability at all - even changing to another console session
using ctrl/f<n> on the system console fails.
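
The "try a normal shutdown, then force it if it hangs" sequence above
can be sketched as a small watchdog.  This is only an illustration
under stated assumptions, not a tested recovery procedure: the
function name reboot_with_fallback, the grace period, and the idea of
launching it on the node before the ssh daemon is stopped are all
hypothetical.  It relies on the clean command returning immediately
while init runs the shutdown scripts; if the clean reboot succeeds,
the process is killed during the wait and the forced command never runs.

```python
import subprocess
import time


def reboot_with_fallback(clean_cmd, forced_cmd, grace_s,
                         run=subprocess.call, sleep=time.sleep):
    """Issue clean_cmd, wait grace_s seconds, then issue forced_cmd.

    `run` and `sleep` are injectable so the logic can be exercised
    without touching the real system (an assumption of this sketch).
    """
    # Ask init for an orderly reboot; this returns while the
    # shutdown scripts are still running.
    run(clean_cmd)
    # If the orderly reboot completes, this process dies here.
    sleep(grace_s)
    # Still alive after the grace period: assume the shutdown hung
    # and bypass the init scripts.
    run(forced_cmd)


# Hypothetical invocation on the fenced node (requires root):
# reboot_with_fallback(["shutdown", "-r", "now"], ["reboot", "-f"], 300)
```

Such a script would need to be started detached from the ssh session
(for example with nohup) so that it survives the ssh daemon being
stopped.  It does not help in the case where the system locks up
completely.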

While I can see that in a controlled production environment the
loss of a server node may be best handled by a forced (power)
reboot, it does seem unfortunate if this has to be done in,
say, an office environment when communication has been temporarily
lost, eg while rearranging n/w cabling.

-- Cliff