[Linux-cluster] qdisk WITHOUT fencing

brem belguebli brem.belguebli at gmail.com
Fri Jun 18 09:27:10 UTC 2010


2010/6/18 Jankowski, Chris <Chris.Jankowski at hp.com>:
> Brem,
>
> I love this analogy.
>
I saw the light after a few beers discussing RHCS with colleagues ;-)

> Using the analogy you gave, the problem with a mafioso is that he cannot kill all other mafiosos in the gang when they are all sitting in solitary confinement cells (:-)).

Indeed, this is why fencing does not fit stretched (geo, metro)
clusters without hacking the setup, at the risk of no longer being
supported.
>
> I would like to remark that this STONITH business causes endless problems in clusters within a single data centre too. For example, a temporary hiccup on the network that causes a short heartbeat failure triggers all nodes of the cluster to kill the other nodes. And boy, do they succeed with typical HP iLO fencing. You can see all your nodes going down. Then they come back, and the shootout continues essentially indefinitely if fencing works. If not, then they all block.
>
Timers (qdisk TKO, DM-Multipath, ...) are very important when setting
up a cluster, as is network protection (bonding; multi-ring is not
supported yet). A temporary network hiccup shouldn't last more than a
few seconds (5 at most); beyond that it has to be considered an outage.
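
For illustration, here is the kind of timer tuning I mean (a minimal,
hypothetical cluster.conf excerpt; the cluster name, label and values
are examples only, to be adapted to your environment). The usual
recommendation, if I remember the qdisk(5) guidance correctly, is to
keep the CMAN/totem token timeout at least twice the qdiskd timeout
(interval * tko):

  <cluster name="mycluster" config_version="42">
    <!-- CMAN membership timeout in ms, kept well above the qdiskd timeout -->
    <totem token="44000"/>
    <!-- qdiskd timeout = interval * tko = 2 s * 11 = 22 s -->
    <quorumd interval="2" tko="11" votes="1" label="myqdisk"/>
    ...
  </cluster>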

> And all of that is so unnecessary, as a combination of a properly implemented quorum disk and SCSI reservations with local boot disks and data disks on shared storage could provide quorum maintenance, split-brain avoidance and protection of the integrity of the filesystem. DEC ASE cluster on Ultrix and MIPS hardware had that in 1991. You do not even need GFS2, although it is very nice to have a real cluster filesystem.
>
In my geo cluster setup (2 sites), I cannot rely on SCSI reservations:
if the interconnect (both SAN and MAN) goes down, the nodes of one
site won't be able to clear the other site's LUN reservations, ending
up in a split-brain situation.

Ideally, a tie-breaker should be located on a 3rd site: an iSCSI LUN
accessible from both production sites, acting as the quorum disk.
If one of the 2 sites gets isolated, its nodes won't be able to access
this LUN, and qdisk should instruct them to commit suicide (panic or
hard reset).
This is combined with a watchdog mechanism that monitors whether the
cluster is quorate and, when it no longer is, hard-resets the faulty nodes.
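
A rough sketch of what I have in mind, assuming the RHEL 5 qdisk
syntax (the label, the heuristic gateway IP and the timer values are
all hypothetical): the quorum disk lives on the 3rd-site iSCSI LUN, a
ping heuristic checks reachability of a 3rd-site gateway, and the
reboot/allow_kill settings make qdiskd reset a node that loses the
tie-breaker:

  <quorumd interval="2" tko="10" votes="1" min_score="1"
           label="site3_qdisk" reboot="1" allow_kill="1">
    <!-- hypothetical 3rd-site gateway; a node that cannot reach it
         drops below min_score and loses its quorum disk vote -->
    <heuristic program="ping -c1 -w1 10.0.3.1" score="1" interval="2" tko="3"/>
  </quorumd>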

> By the way, I believe that commercial stretched cluster on Linux is not possible if you rely on LVM for distributed storage. Linux LVM is architecturally incapable of providing any resilience over distance, IMHO.

You mean LVM mirroring? If so, as with all mirroring mechanisms (and
even synchronous replication ones), most vendors (Veritas, storage
vendors) tend to say 100 km max between the two sites, i.e. less than
2 or 3 ms of latency (light in fibre takes roughly 1 ms round trip per
100 km, plus the equipment overhead).
I'm seeing some new features coming to LVM mirroring, e.g. mirror log
redundancy, the already existing mirror log cluster awareness, partial
synchronization, device-mapper cluster awareness, etc.
Plus the most awaited feature, dm-replicator, which will bring a new
"era" to managing DR situations (but it is still incompatible with the
fencing constraint!).

> It is missing the plex and subdisk layers as in Veritas LVM and has no notion of location, so it cannot tell which piece of storage is in which data centre. The only volume manager that I know of that has this feature is in OpenVMS.  Perhaps the latest Veritas has it too.
It (the location thing) is a design choice of Symantec SF, but it is
not absolutely necessary for building stretched clusters. Look at
HP-UX Serviceguard based on HP-UX LVM, for instance.
Concerning the plex and subdisk layers, I think it's just a matter of
terminology (PV, mirror leg, VG and LV), not a real difference IMHO.

>
> One could use distributed storage arrays of the type of HP P4000 (bought with Left Hand Networks). This shifts the problem from the OS to the storage vendor.
>
How do you address remote-site replication?
> What distributed storage would you use in a hypothetical stretched cluster?
>
In our environment we use HP high-end frames (XP24000), and some
pairs have CA (Continuous Access) enabled.
> Regards,
>
> Chris Jankowski
>
Brem
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brem Belguebli
> Sent: Friday, 18 June 2010 16:13
> To: linux clustering
> Subject: Re: [Linux-cluster] qdisk WITHOUT fencing
>
> If I may make this comparison:
> - All the other known cluster stacks (Linux/Unix/Windows...) have the Japanese (harakiri) sense of honor, i.e. if a node goes wrong and commits suicide, all the remaining nodes blindly trust the fact that the node committed suicide.
> - RHCS has the Italian sense of honor (mafioso): when a node goes wrong, even if some cluster process makes this node commit suicide (qdisk for instance), the remaining nodes do not trust it until some node of the cluster "shoots the sick node in the head".
>
> It's clear that geo-clustering RHCS is normally impossible due to this constraint, though some scripting logic could allow the fencing to be bypassed completely while still guaranteeing the integrity of the cluster.
>
> Brem
>
> On Thu, 2010-06-17 at 23:31 +0000, Jankowski, Chris wrote:
>> Jim,
>>
>> You hit an architectural limitation of Linux Cluster, one that is specific to the Linux Cluster design and that other clusters tend not to have.
>>
>> Linux Cluster assumes that you will *always* be able to execute fencing of *all* other nodes.  In fact, this is a stated *prerequisite* for correct operation of the cluster.
>>
>> This is all very well when you have two PCs under your desk and a power switch.
>>
>> However, this model completely fails when any network more complex than a power switch is present. Your network fails and you have a partitioned cluster that cannot fence. It all gets stuck. From the practical, operational point of view of an IT department, this is a disaster worse than not having a cluster.
>>
>> Having come to Linux Cluster with a TruCluster background, I always had a problem with the STONITH approach used by Linux Cluster. I deem it harmful. But I see no inclination anywhere in the Linux Cluster world to remove it.
>>
>> I believe that there is a major philosophical chasm dividing the design stances of Linux Cluster and the others. Linux Cluster seems to be saying: "A node is the centre of the world and can control it."  Other clusters take the opposite stance: "A node is a part of the world, cannot control it, and may have very limited visibility of the world in some circumstances."
>>
>> Regards,
>>
>> Chris Jankowski
>>
>>
>>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of jimbob palmer
>> Sent: Friday, 18 June 2010 01:59
>> To: linux-cluster at redhat.com
>> Subject: [Linux-cluster] qdisk WITHOUT fencing
>>
>> Dear distinguished linux-cluster members!
>>
>> I have two data centers linked by physical fibre. Everything goes over this physical route: everything.
>>
>> I would like to set up a high-availability NFS server with DRBD:
>> * drbd to replicate storage
>> * nfsd running
>> * floating ip
>>
>> If the physical link between the two data centers is lost, I would like the primary data center to win.
>>
>> I've set up a qdisk, and this works well: the node which can access the qdisk wins, i.e. the primary data center, which is the data center where the SAN holding the qdisk also lives, wins.
>>
>> Unfortunately for me, I get pages and pages of errors about being unable to fence the secondary node.
>>
>> The docs tell me that I absolutely must use power fencing, but in this case fencing makes no sense: it won't work when the link between the data centers is severed. The network and the qdisk are the deciders of who "wins".
>>
>> So what should I do?
>>
>> Many thanks in advance.
>>