[Linux-cluster] qdisk WITHOUT fencing

Mon Jun 21 13:02:22 UTC 2010

From my experience, and from some good practices (IMHO) I've seen in a
lot of production environments, a cluster must never be autostarted.

This is to prevent a power-flapping node from accessing the storage
almost randomly.

The second thing is to have a reliable suicide procedure, generally
based on a hardware watchdog mechanism.
Almost all the known vendors provide reliable hardware that can be
used for that. This implies that the autofence mechanism would be
supported only on certified hardware.

A simple watchdog agent would monitor the cluster state; if it goes
inquorate, the node is hard reset without any further consideration.
When coupled with autostart off, there is no longer any risk.
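
To make the watchdog agent idea concrete, here is a minimal Python
sketch, not a tested or supported agent. It assumes that clustat prints
a "Member Status: Quorate" / "Member Status: Inquorate" line, that the
hardware watchdog is exposed as /dev/watchdog, and that a sysrq "b"
reboot is an acceptable hard reset. All function names and the polling
interval are made up for illustration.

```python
import subprocess
import time


def is_quorate(clustat_output):
    """Return True only if the output explicitly reports a quorate member."""
    for line in clustat_output.splitlines():
        if line.strip().lower().startswith("member status:"):
            return line.split(":", 1)[1].strip().lower() == "quorate"
    # No status line at all (clustat failed, cman not running): fail safe.
    return False


def read_cluster_state():
    """Run clustat and return its stdout, or an empty string on failure."""
    try:
        result = subprocess.run(["clustat"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


def make_petter(device_path="/dev/watchdog"):
    """Open the watchdog device (this arms it) and return a keepalive function."""
    device = open(device_path, "wb", buffering=0)

    def pet():
        # Any write restarts the hardware countdown.
        device.write(b"\0")

    return pet


def hard_reset():
    """Reboot immediately, without syncing or unmounting (needs root)."""
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("b")


def watch(get_state, pet, reset, interval=5.0, sleep=time.sleep,
          max_checks=None):
    """Pet the watchdog while quorate; reset the node as soon as it is not."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if is_quorate(get_state()):
            pet()
        else:
            reset()
            return
        sleep(interval)
        checks += 1
```

On a real node this would be started as root with something like
watch(read_cluster_state, make_petter(), hard_reset). If the agent
itself hangs or dies, the keepalives stop and the hardware watchdog
resets the node anyway, which is the point of basing the suicide
procedure on hardware rather than on the agent alone.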

>> GPRS or the good old modem over a phone line?
In the datacenters I manage, mobile communications are inoperative;
they're practically Faraday cages.
I thought about POTS lines, but it made me feel like I was going back
to the 90's....

> The problem is that although you don't need to fence anything, you need to:
> 1) Verify that the site is properly down
> 2) Make sure it stays down

1 --> Best case, it is an electrical problem and all the nodes and
storage are off. If they are not (interconnect failure, for instance),
the watchdog mechanism described above has done its job (it needs to be
coupled to a 3rd-site tie breaker).
2 --> Forbid cluster autostart to avoid this kind of problem.

2010/6/21 Gordan Bobic <gordan at bobich.net>:
> On 06/21/2010 11:28 AM, Kaloyan Kovachev wrote:
>>
>> On Mon, 21 Jun 2010 10:20:34 +0100, Gordan Bobic<gordan at bobich.net>
>> wrote:
>>>
>>> On 06/21/2010 08:52 AM, Kaloyan Kovachev wrote:
>>>>
>>>> On Fri, 18 Jun 2010 18:15:09 +0200, brem belguebli
>>>> <brem.belguebli at gmail.com>   wrote:
>>>>>
>>>>> How do you deal with fencing when the intersite interconnects (SAN and
>>>>> LAN) are the cause of the failure ?
>>>>>
>>>>
>>>> GPRS or the good old modem over a phone line?
>>>
>>> That isn't going to work if the whole site is down for whatever reason
>>> (unlikely as it may be).
>>>
>>
>> If the whole site is down because of a power failure - yes (well, then you
>> don't need to actually fence anything), but if the failure is just in the
>> intersite connection - alternative low speed connection to simply fence
>> the remote nodes and tell the remote SAN to block its access should be
>> enough.
>
> The problem is that although you don't need to fence anything, you need to:
> 1) Verify that the site is properly down
> 2) Make sure it stays down
>
> Otherwise you are risking resource clashes.
>
>>> To protect yourself from the 100% outage of a remote site, the only sane
>>> way of approaching it I can think of is to do something like the
>>> following:
>>>
>>> 1) Make each node fence itself off from the failed node using iptables
>>> or some other firewalling method. The SAN should also be prevented from
>>> allowing the booted out node back onto it.
>>>
>>
>> then each node should do that kind of fencing, but if a single node blocks
>> the port(s) on the switch (to the remote location) should be easier to do
>> as fencing agent. Again having additional communication channel will help -
>> "if it's just the link, then fence the remote nodes and don't block the
>> port(s)" this would avoid manual intervention to restore the link after
>> the outage is fixed
>
> There is no reason why you couldn't fire off the iptables fencing command to
> each node via SSH, so that whichever node does the fencing, covers it for
> all nodes.
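
The SSH fan-out described just above could look roughly like the Python
sketch below. It is only an illustration under assumptions: key-based
root SSH access to every surviving node, DROP rules in both directions
as the chosen iptables fencing, and hypothetical function and node
names.

```python
import ipaddress
import subprocess


def fence_commands_for(node, failed_ip):
    """Build the ssh command lines that make `node` drop traffic to and from failed_ip."""
    # Validate the address so nothing unexpected reaches the remote shell.
    ip = str(ipaddress.ip_address(failed_ip))
    ssh = ["ssh", "-o", "BatchMode=yes", "root@" + node]
    return [
        ssh + ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
        ssh + ["iptables", "-I", "OUTPUT", "-d", ip, "-j", "DROP"],
    ]


def fence_everywhere(nodes, failed_ip, run=subprocess.run):
    """Apply the rules on every node; return the nodes where a command failed."""
    unfenced = []
    for node in nodes:
        for argv in fence_commands_for(node, failed_ip):
            if run(argv).returncode != 0:
                unfenced.append(node)
                break
    return unfenced
```

Whichever node wins the fencing race would call something like
fence_everywhere(["node1", "node2"], "10.0.0.5") and must treat a
non-empty result as a failed fence, since any node left in that list
can still talk to the evicted one.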
>
> Gordan