[Linux-cluster] some questions about setting up GFS

Michael Conrad Tadpol Tilstra mtilstra at redhat.com
Wed Jan 12 14:49:05 UTC 2005


On Wed, Jan 12, 2005 at 05:10:40PM +0300, Sergey wrote:
> Hello!
> 
> > It looks like you are not using pool.
> 
> Thanks, I was guided by your examples, and the RAID can now be mounted.
> 
> Now I have some questions about Cluster Configuration System Files.
> 
> I have 2 nodes - hp1 and hp2. Each node has Integrated Lights-Out
> with ROM Version: 1.55 - 04/16/2004.
> 
> Since I have only 2 nodes, one of them has to be the master, but if
> the first of them (the master) is cleanly shut down, the slave
> experiences serious problems which can only be solved by resetting it.
> Is that normal? How do I make it work correctly?
> 
> I tried setting servers = ["hp1","hp2","hp3"] (hp3 is actually absent);
> then if the master is shut down, the second node becomes the master. So, if

The nodes in the servers config line for gulm form a mini-cluster of
sorts.  A quorum (a majority, i.e. more than half) of those server
nodes must be present for things to continue.

You must have two of the three servers up and running so that the
mini-cluster has quorum, which will then allow the other nodes to
connect.
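
For example, a three-server cluster.ccs (using the hostnames from your
own config; the third machine would have to actually exist and run
lock_gulmd) would look roughly like this:

cluster {
        name = "cluster"
        lock_gulm {
                servers = ["hp1","hp2","hp3"]
        }
}

With three servers, any two of them make quorum (2 of 3 is more than
half), so losing a single server node does not stop the cluster.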

> the nodes are alternately cleanly shut down and booted up, the master
> switches from one to the other and everything seems OK, but if one of
> the nodes is shut down uncleanly (e.g. the power cord is pulled out of
> the socket), this is written to the system log:
> 
> Jan 12 14:44:33 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530273952756 mb:1)
> Jan 12 14:44:48 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530288972780 mb:2)
> Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: hp2 missed a heartbeat (time:1105530303992751 mb:3)
> Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Client (hp2) expired
> Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
> Jan 12 14:45:03 hp1 lock_gulmd_core[6614]: Gonna exec fence_node hp2
> Jan 12 14:45:03 hp1 lock_gulmd_core[6500]: Forked [6614] fence_node hp2 with a 0 pause.
> Jan 12 14:45:03 hp1 fence_node[6614]: Performing fence method, riloe, on hp2.
> Jan 12 14:45:04 hp1 fence_node[6614]: The agent (fence_rib) reports:
> Jan 12 14:45:04 hp1 fence_node[6614]: WARNING!  fence_rib is deprecated.  use fence_ilo instead parse error: unknown
> option "ipaddr=10.10.0.112"
> 
> If I start the lock_gulm service again on the second node, then this
> is written to the system log on the first node:
> 
> Jan 12 14:50:14 hp1 lock_gulmd_core[7148]: Gonna exec fence_node hp2
> Jan 12 14:50:14 hp1 fence_node[7148]: Performing fence method, riloe, on hp2.
> Jan 12 14:50:14 hp1 fence_node[7148]: The agent (fence_rib) reports:
> Jan 12 14:50:14 hp1 fence_node[7148]: WARNING!  fence_rib is deprecated.  use fence_ilo instead parse error: unknown
> option "ipaddr=10.10.0.112"
> Jan 12 14:50:14 hp1 fence_node[7148]:
> Jan 12 14:50:14 hp1 fence_node[7148]: All fencing methods FAILED!
> Jan 12 14:50:14 hp1 fence_node[7148]: Fence of "hp2" was unsuccessful.
> Jan 12 14:50:14 hp1 lock_gulmd_core[6500]: Fence failed. [7148] Exit code:1 Running it again.
> Jan 12 14:50:14 hp1 lock_gulmd_core[6500]: Forked [7157] fence_node hp2 with a 5 pause.
> Jan 12 14:50:15 hp1 lock_gulmd_core[6500]:  (10.10.0.201:hp2) Cannot login if you are expired.

The node hp2 has to be successfully fenced before it is allowed to
re-join the cluster.  If your fencing is misconfigured or not working, a
fenced node will never get to rejoin.

You really should test that fencing works by running
fence_node <node name> for each node in your cluster before running
lock_gulmd.  This makes sure that fencing is set up and working
correctly.
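
For example, with lock_gulmd stopped on both machines, run something
like this as root (a successful fence should power-cycle the named node
through its iLO; the exact output depends on the agent):

  # from hp1, fence the other node
  fence_node hp2

  # and from hp2
  fence_node hp1

If either command fails (for instance with the "unknown option" parse
error from your log), the fence configuration has to be fixed first;
gulm will just keep re-running the same failing agent.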

Do that, and once you've verified that fencing is correct (without
lock_gulmd running), try things again with lock_gulmd.
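
In your case the log already points at the problem: fence_rib is
deprecated and the agent rejects the "ipaddr" option, so the fence can
never succeed.  A sketch of a fence.ccs using fence_ilo instead (the
"hostname" option name is my assumption; check the fence_ilo man page
for your version to see which option it expects for the iLO address):

fence_devices {
        ILO-HP1 {
                agent = "fence_ilo"
                hostname = "10.10.0.111"
                login = "xx"
                passwd = "xx"
        }
        ILO-HP2 {
                agent = "fence_ilo"
                hostname = "10.10.0.112"
                login = "xx"
                passwd = "xx"
        }
}

After changing the agent, test again with fence_node before starting
lock_gulmd.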

> And I can't unmount the GFS file system and can't reboot the systems
> because GFS is mounted; I can only reset both nodes.
> 
> I think I have mistakes in my configuration; maybe it is because of an
> incorrect agent = "fence_rib", or something else.
> 
> Please help :-)

> 
> 
> Cluster Configuration:
> 
> cluster.ccs:
> cluster {
>          name = "cluster"
>          lock_gulm {
>          servers = ["hp1"]    (or servers = ["hp1","hp2","hp3"])
>          }
> }
> 
> fence.ccs:
> fence_devices {
>                 ILO-HP1 {
>                         agent = "fence_rib"
>                         ipaddr = "10.10.0.111"
>                         login = "xx"
>                         passwd = "xx"
>                         }
>                 ILO-HP2 {
>                         agent = "fence_rib"
>                         ipaddr = "10.10.0.112"
>                         login = "xx"
>                         passwd = "xx"
>                         }
>             }
> 
> nodes.ccs:
> nodes {
>       hp1 {
>           ip_interfaces { eth0 = "10.10.0.200" }
>           fence { riloe { ILO-HP1 { localport = 17988 } } }
>           }
>       hp2 {
>           ip_interfaces { eth0 = "10.10.0.201" }
>           fence { riloe { ILO-HP2 { localport = 17988 } } }
>           }
> # if 3 nodes in cluster.ccs
> #      hp3 {
> #          ip_interfaces { eth0 = "10.10.0.201" }
> #          fence { riloe { ILO-HP2 { localport = 17988 } } }
> #          }
> }

-- 
Michael Conrad Tadpol Tilstra
Hi, I'm an evil mutated signature virus, put me in your .sig or I will
bite your kneecaps!