[Linux-cluster] qdisk WITHOUT fencing

Jankowski, Chris Chris.Jankowski at hp.com
Fri Jun 18 10:28:04 UTC 2010


Gordan,

>>>Do you have a better idea? How do you propose to ensure that there is no resource clash when a node becomes intermittent or half-dead? How do you prevent its interference from bringing down the service? What do you propose? More importantly, how would you propose to handle this when ensuring consistency is of paramount importance, e.g. when using a cluster file system?

I believe that SCSI reservations are the key to protection.  One can form a group of hosts that are allowed to access storage and exclude those that have had their membership revoked. Note that this is a protective mechanism - the stance here is: "This is ours and we protect it".  A node that has been ejected cannot do any damage anymore.  This is the philosophically opposite approach to fencing, which is: "I'll go out and shoot everybody whom I consider suspect, and I am not coming back until I've successfully shot every one of them."
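
As an aside, RHCS does already expose something close to this protective model through the fence_scsi agent, which does not power anything off - it preempts the evicted node's SCSI-3 persistent reservation key, so that node can no longer write to the shared LUNs. Below is a rough cluster.conf sketch; the node name, the device name "scsi_reserve" and the layout are only illustrative, and the exact attributes differ between versions, so treat it as an assumption rather than a recipe:

  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="1">
          <!-- "fencing" here means: preempt this node's registration key
               on the shared LUNs, not power-cycle the server -->
          <device name="scsi_reserve"/>
        </method>
      </fence>
    </clusternode>
    <!-- node2, node3, ... declared the same way -->
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_reserve"/>
  </fencedevices>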

A properly implemented quorum disk is the key to managing cluster membership. Based on access to the quorum disk one can then establish who is a member. Ejected nodes are configured to commit suicide, reboot and try to rejoin the cluster. Then, based on membership, one can set up SCSI reservations on the shared storage.  This will protect the integrity of the filesystems, including a shared cluster filesystem.
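
In RHCS terms the quorum-disk half of this is what qdiskd provides. A minimal sketch, assuming a three-node cluster and a small shared LUN that has been labelled with mkqdisk - the vote counts and timings below are placeholders, not tuning advice:

  <!-- three nodes at one vote each, plus one vote from the quorum disk -->
  <cman expected_votes="4"/>
  <quorumd interval="1" tko="10" votes="1" label="qdisk"/>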

Note that there is a natural affinity between the quorum disk on shared storage and the shared cluster file system on the same storage. Whoever has access to the quorum disk has access to the shared storage and can stay a member. Whoever does not should be ejected. Whether such a node is dead, half-dead or actively looking for mischief is irrelevant, because it no longer has access to the storage once the SCSI reservations have been set to exclude it. It won't get anywhere without access to storage.

The cluster will re-form after a failure and won't need fencing.

This is how DEC/Compaq/HP TruCluster V5.x works, and it does support a shared cluster filesystem.  In fact, that is the only filesystem it supports, apart from UFS for CD-ROMs. It also supports a shared root: there is only one password file, one group file, and one set of binaries and libraries, all shared in CFS. And it has rolling upgrades. It works reliably and there is not a trace of fencing in it.  So it can be done - this is living proof that it works. Those clusters used to run multi-terabyte Oracle RAC databases when Alpha was still actively marketed.

Here is an excellent Technical Overview of it:

http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_ACRO_DUX/ARHGVETE.PDF

For a long time there was hesitance about relying on SCSI reservations, because shared parallel SCSI was rare and exotic. Then we had FC, and it was expensive - not for everybody, it was said. But today one can buy a $500 iSCSI storage array that supports all the required protocols. This is now commodity hardware. If a system is important enough to be clustered, then a block-mode array (iSCSI, FC, switched SAS and shortly FCoE) should not be a cost problem.

-------------

I would also like to remark that, from a practical operations point of view, the great amount of effort that is expended on trying to do something with the network in the cluster, or on checking server interfaces, is at best useless and mostly harmful.  The pragmatic stance in today's data centre environment is:

- we have bonded interfaces connected to different switches - this takes care of redundancy of the local link. If the switches are properly configured, it will even propagate an upstream switch failure to the local link and force a failover in the bond.
- cluster nodes cannot fix the network - no matter what they think about it.  Therefore services should ride through network failures.  Failing over a service because the network went down does not help when network redundancy is correctly implemented.  Actually, it hurts. If you fail over a database, users have to log in again; they lose their sessions, their context and often their data. You also lose the warm database cache (the Oracle SGA) with all the right blocks in it.

All this business of trying to ping your default gateway is plain silly - as if each member of the cluster had a different gateway. And trying to marry the quorum disk with heuristics that ping gateways seems sillier still.
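
For those who have not seen one, the kind of heuristic I am objecting to looks roughly like this in cluster.conf (the gateway address, scoring and timings are, of course, made up):

  <quorumd interval="1" tko="10" votes="1" label="qdisk">
    <!-- the node only keeps its quorum-disk vote while it can ping the default gateway -->
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2" tko="3"/>
  </quorumd>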

Regards,

Chris Jankowski



-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
Sent: Friday, 18 June 2010 18:38
To: linux clustering
Subject: Re: [Linux-cluster] qdisk WITHOUT fencing

On 06/18/2010 07:57 AM, Jankowski, Chris wrote:

> Using the analogy you gave, the problem with a mafioso is that he 
> cannot kill all other mafiosos in the gang when they are all sitting 
> in solitary confinement cells (:-)).

Do you have a better idea? How do you propose to ensure that there is no resource clash when a node becomes intermittent or half-dead? How do you prevent its interference from bringing down the service? What do you propose? More importantly, how would you propose to handle this when ensuring consistency is of paramount importance, e.g. when using a cluster file system?

> I would like to remark that this STONITH business causes endless 
> problems in clusters within a single data centre too. For example a 
> temporary hiccup on the network that causes a short heartbeat failure
> triggers all nodes of the cluster to kill the other nodes. And boy, do 
> they succeed with a typical HP iLO fencing. You can see all your nodes 
> going down. Then they come back and the shootout continues essentially 
> indefinitely if fencing works. If not, then they all block.

If your network is that intermittent, you have bigger problems. But you can adjust your cman timeout values (<totem token="[timeout in milliseconds]"/>) to something more appropriate to the quality of your network.
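
To make that concrete, the relevant fragment of cluster.conf would look something like this - the 30-second value is purely illustrative and needs to be tuned to your own network:

  <cluster name="example" config_version="1">
    <!-- totem token timeout in milliseconds; raise it so brief network
         hiccups do not trigger membership changes and fencing -->
    <totem token="30000"/>
    <!-- clusternodes, fencedevices, rm sections omitted -->
  </cluster>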

> And all of that is so unnecessary, as a combination of a properly 
> implemented quorum  disk and SCSI reservations with local boot disks 
> and data disks on shared storage  could provide quorum maintenance, 
> split-brain avoidance and protection of the integrity  of the 
> filesystem.

I disagree. If a node starts to go wrong, it cannot be trusted not to trash the file system, ignoring quorums and suchlike. Data integrity is too important to take that risk.

> DEC ASE cluster on Ultrix and MIPS hardware had that in 1991. You do 
> not  even need GFS2, although it is very nice to have a real cluster 
> filesystem.

If you want something that's looser than a proper cluster FS without the need for fencing (and you are happy to live with the fact that when a split-brain occurs, one copy of each file will win and the other copies _will_ get trashed), you may want to look into GlusterFS if you haven't already.

> By the way, I believe that commercial stretched cluster on Linux is 
> not possible if you rely on LVM for distributed storage. Linux LVM is 
> architecturally incapable of providing any resilience over distance, 
> IMHO. It is missing the plex and subdisk layers as in Veritas LVM and 
> has no notion of location, so it cannot tell which piece of
> storage is in which data centre. The only volume manager that I know 
> that has this feature is in OpenVMS.  Perhaps the latest Veritas has 
> it too.

I never actually found a use for LVM that could not be avoided with a modicum of forward planning (something that seems to be becoming quite rare in most industries these days). There are generally better ways than LVM to achieve the things that LVM is supposed to do.

> One could use distributed storage arrays of the type of HP P4000 
> (acquired with LeftHand Networks). This shifts the problem from the OS
> to the storage vendor.
>
> What distributed storage would you use in a hypothetical stretched 
> cluster?

Depends on what exactly your use-case is. In most use-cases, properly distributed storage (a la CleverSafe) comes with too much of a performance penalty to be useful when geographically dispersed. The single most defining measure of a system's performance is its access-time latency. When caching gets difficult and your ping times move from LAN (slow) to WAN (ridiculous), performance generally becomes completely unworkable.

Gordan

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



