[Linux-cluster] Configuring rgmanager

Ion Alberdi <ialberdi@histor.fr>
Mon Feb 28 13:05:21 UTC 2005


>>
>> Failover will not occur until after CMAN (or gulm) says the node is dead
>> and has been fenced.  When using the kernel Service Manager (provided by
>> CMAN), recovery is in the following order:
>>
>> (1) Fencing
>> (2) Locking
>> (3) GFS
>> (4) User services (e.g. rgmanager)
>>
>> How long did you wait? :)
>>
>>  
>>
Results from my tests with two nodes (buba and gump) and the latest
CVS (updated today):
I tried to set up a basic script as a failover service on the two nodes.
Initialization:
- ccsd, cman_tool join, fence_tool join on both nodes
Then I started rgmanager on both nodes:
the script coucou (echo `uname -n` >> bla.txt) was started on one of
the two nodes.
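For reference, here is roughly what was run on each node to bring the
cluster up (a sketch, not my exact files):

ccsd
cman_tool join
fence_tool join
service rgmanager start   # or however clurgmgrd gets started

and coucou itself is just a trivial init-style script that rgmanager
calls with start/stop/status, something like:

#!/bin/sh
# coucou - dummy test service; records which node started it
case "$1" in
  start)
    echo `uname -n` >> bla.txt
    ;;
  stop|status)
    # nothing to do for this dummy service
    ;;
esac
exit 0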
With clusvcadm I relocated this service to gump, and then I rebooted gump.
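(The relocation was done with something like the following; the member
name is the one defined in cluster.conf:)

clusvcadm -r coucou -m gump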
Here is the syslog on buba:
Feb 28 13:20:17 buba kernel: CMAN: removing node gump from the cluster : 
Missed too many heartbeats
Feb 28 13:20:17 buba fenced[7573]: gump not a cluster member after 0 sec 
post_fail_delay
Feb 28 13:20:17 buba fenced[7573]: fencing node "gump"
Feb 28 13:20:20 buba fence_manual: Node 200.0.0.102 needs to be reset 
before recovery can procede.  Waiting for 200.0.0.102 to rejoin the 
cluster or for manual acknowledgement that it has been reset (i.e. 
fence_ack_manual -n 200.0.0.102)
Feb 28 13:20:29 buba fenced[7573]: fence "gump" success
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Taking over resource 
group coucou from down member (null)
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Resource group coucou started

Then gump came back and rejoined the cluster; syslog on buba:
Feb 28 13:23:47 buba kernel: CMAN: node gump rejoining

I moved the service back to gump (again with clusvcadm):
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Stopping resource group 
coucou
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is 
stopped
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is 
now running on member 2

Then I rebooted gump again (:)) and that is where the problems started.
Gump was removed from the cluster:
Feb 28 13:25:57 buba kernel: CMAN: removing node gump from the cluster : 
Missed too many heartbeats
Feb 28 13:25:57 buba fenced[7573]: gump not a cluster member after 0 sec 
post_fail_delay
Feb 28 13:25:57 buba fenced[7573]: fencing node "gump"
Feb 28 13:26:03 buba fence_manual: Node 200.0.0.102 needs to be reset 
before recovery can procede.  Waiting for 200.0.0.102 to rejoin the 
cluster or for manual acknowledgement that it has been reset (i.e. 
fence_ack_manual -n 200.0.0.102)
Feb 28 13:26:14 buba fenced[7573]: fence "gump" success

And this time rgmanager did nothing. When I looked at
/proc/cluster/services I had:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 recover 2 -
[1]

Whereas during the first reboot of gump I had:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 run       -
[1]

Then I tried to bring gump back into the cluster, and here is what I had
on gump:
[root@gump ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

User:            "usrm::manager"                     0   3 join      S-1,80,2
[]

At that point nothing worked. I tried (hopelessly) to restart rgmanager
on both nodes, but that didn't help either; I ended up with states where,
on gump:

[root@gump ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "Magma"                             3   6 run       -
[2]

User:            "usrm::manager"                     4   5 run       -
[2]

and on buba:
[root@buba ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

(buba does not seem to have any clurgmgrd running, even though I started
rgmanager...)
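A quick process check (something like the following) should show the
daemon if it were running:

ps ax | grep clurgmgrd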

I don't know whether this is a bug in rgmanager or whether I'm doing
something wrong, but I don't understand why everything worked during the
first reboot and nothing worked afterwards...