[Linux-cluster] GFS 6.0 node without quorum tries to fence

Schumacher, Bernd bernd.schumacher at hp.com
Wed Aug 4 06:12:51 UTC 2004


So, what I have learned from all the answers is very bad news for me. It
seems that what happened is exactly what most of you expected. But this means:

-----------------------------------------------------------------------
--- A single point of failure in one node can stop the whole GFS.   ---
-----------------------------------------------------------------------

The single point of failure is:
The LAN card specified in "nodes.ccs:ip_interfaces" stops working on one
node, no matter whether that node was master or slave.

The whole GFS is stopped:
The rest of the cluster needs some time to form a new cluster. The bad
node switches to Arbitrating mode much faster, so it has enough time to
fence all the other nodes before it would itself be fenced by the new
master.
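
To make the race explicit, here is a toy model of the two timers involved
(this is not GULM code; the delay values are invented, only the comparison
matters):

  # Toy model of the fencing race (Python). Whoever is ready to call
  # fence_node first wins; the delay numbers are made-up assumptions.
  ARBITRATE_DELAY = 1.0   # bad node loses slave quorum, starts arbitrating
  REFORM_DELAY = 5.0      # healthy nodes elect a new master among themselves

  if ARBITRATE_DELAY < REFORM_DELAY:
      print("bad node runs fence_node against the healthy nodes first")
  else:
      print("new master fences the bad node first")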

The bad node stays alive, but it cannot form a cluster on its own. GFS is not working.
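
For reference, the "Have 1, need 2" message in the logfile below is just
simple majority voting over the configured lock servers. A minimal sketch of
that arithmetic (assuming plain majority quorum, which is what the log
suggests):

  # Majority quorum over the lock_gulm server list from cluster.ccs.
  servers = ["oben", "mitte", "unten"]
  reachable = ["mitte"]             # mitte only sees itself after eth0 fails
  need = len(servers) // 2 + 1      # 3 servers -> need 2
  have = len(reachable)             # -> have 1
  print("Have %d, need %d" % (have, need))   # matches the lock_gulmd message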

Now all the other nodes reboot. But after the reboot they cannot rejoin
the cluster, because they cannot contact the bad node; its LAN card is
still broken. GFS is not working.

Did I miss something?
Please tell me that I am wrong!


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Schumacher, Bernd
> Sent: Dienstag, 3. August 2004 13:56
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence
> 
> 
> Hi,
> I have three nodes: oben, mitte and unten.
> 
> Test:
> I have disabled eth0 on mitte, so that mitte will be excluded. 
> 
> Result:
> Oben and unten are trying to fence mitte and build a new 
> cluster. OK! But mitte tries to fence oben and unten. PROBLEM!
>  
> Why can this happen? Mitte knows that it cannot build a 
> cluster. See the logfile from mitte: "Have 1, need 2"
> 
> Logfile from mitte:
> Aug  3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired
> Aug  3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
> Aug  3 12:53:17 mitte lock_gulmd_core[2120]: Gonna exec fence_node oben
> Aug  3 12:53:17 mitte lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause.
> Aug  3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on oben.
> 
> cluster.ccs:
> cluster {
>     name = "tom"
>     lock_gulm {
>         servers = ["oben", "mitte", "unten"]
>     }
> }
> 
> fence.ccs:
> fence_devices {
>   manual_oben {
>     agent = "fence_manual"
>   }     
>   manual_mitte ...
> 
> 
> nodes.ccs:
> nodes {
>   oben {
>     ip_interfaces {
>       eth0 = "192.168.100.241"
>     }
>     fence { 
>       manual {
>         manual_oben {
>           ipaddr = "192.168.100.241"
>         }
>       }
>     }
>   }
>   mitte ...
> 
> regards
> Bernd Schumacher
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com 
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 



