[Linux-cluster] Configuring rgmanager
Ion Alberdi
ialberdi at histor.fr
Mon Feb 28 13:05:21 UTC 2005
>>
>> Failover will not occur until after CMAN (or gulm) says the node is dead
>> and has been fenced. When using the kernel Service Manager (provided by
>> CMAN), recovery is in the following order:
>>
>> (1) Fencing
>> (2) Locking
>> (3) GFS
>> (4) User services (e.g. rgmanager)
>>
>> How long did you wait? :)
>>
>>
>>
Results from my tests with two nodes (buba and gump), using the latest
CVS (updated today):
I tried to put a basic script in failover across the two nodes.
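For context, a failover test like this is usually driven by a cluster.conf along these lines. This is only a sketch of how such a setup is commonly written: the node names and the coucou script come from this report, but the script path, cluster name, and the exact <rm> layout are assumptions, not the actual config used here:

```xml
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <clusternodes>
    <clusternode name="buba" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="buba"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="gump" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="gump"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- manual fencing, matching the fence_manual messages in the logs -->
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
  <rm>
    <resources>
      <!-- assumed path for the coucou script -->
      <script name="coucou" file="/etc/init.d/coucou"/>
    </resources>
    <service name="coucou">
      <script ref="coucou"/>
    </service>
  </rm>
</cluster>
```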
Initialization:
-ccsd, cman_tool join, fence_tool join on both nodes
Then I started rgmanager on both nodes:
the script coucou (echo `uname -n` >> bla.txt) was launched on one of
the two nodes.
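As an aside, rgmanager drives a script resource through init-script style start/stop/status calls, so a one-liner like the above normally gets wrapped. A minimal sketch (the /tmp/bla.txt marker path and the function form are my assumptions, not the actual script from this test):

```shell
# coucou: wraps `echo \`uname -n\` >> bla.txt` in the start/stop/status
# interface rgmanager expects from a script resource.
# (/tmp/bla.txt is an assumed path for this sketch.)
coucou() {
    case "$1" in
        start)  uname -n >> /tmp/bla.txt ;;   # record which node ran it
        stop)   rm -f /tmp/bla.txt ;;         # clean up on stop
        status) [ -f /tmp/bla.txt ] ;;        # exit 0 only while "running"
        *)      echo "usage: coucou {start|stop|status}" >&2; return 1 ;;
    esac
}

coucou start    # what rgmanager does when the resource group starts
coucou status   # exits 0 while the marker file exists
```

rgmanager decides whether a resource group is healthy from the exit status of these calls, which is why the status branch matters even for a trivial script.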
With clusvcadm I made this script run on gump, then rebooted gump.
Here is the syslog on buba:
Feb 28 13:20:17 buba kernel: CMAN: removing node gump from the cluster :
Missed too many heartbeats
Feb 28 13:20:17 buba fenced[7573]: gump not a cluster member after 0 sec
post_fail_delay
Feb 28 13:20:17 buba fenced[7573]: fencing node "gump"
Feb 28 13:20:20 buba fence_manual: Node 200.0.0.102 needs to be reset
before recovery can procede. Waiting for 200.0.0.102 to rejoin the
cluster or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n 200.0.0.102)
Feb 28 13:20:29 buba fenced[7573]: fence "gump" success
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Taking over resource
group coucou from down member (null)
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Resource group coucou started
Then gump came back and rejoined the cluster; syslog of buba:
Feb 28 13:23:47 buba kernel: CMAN: node gump rejoining
I moved the script back to gump (again with clusvcadm):
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Stopping resource group
coucou
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is
stopped
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is
now running on member 2
Then I rebooted gump again (:)) and that's where the problems started.
Gump was removed from the cluster:
Feb 28 13:25:57 buba kernel: CMAN: removing node gump from the cluster :
Missed too many heartbeats
Feb 28 13:25:57 buba fenced[7573]: gump not a cluster member after 0 sec
post_fail_delay
Feb 28 13:25:57 buba fenced[7573]: fencing node "gump"
Feb 28 13:26:03 buba fence_manual: Node 200.0.0.102 needs to be reset
before recovery can procede. Waiting for 200.0.0.102 to rejoin the
cluster or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n 200.0.0.102)
Feb 28 13:26:14 buba fenced[7573]: fence "gump" success
And this time rgmanager did nothing.
When I looked at /proc/cluster/services I had:
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1]
DLM Lock Space: "Magma" 3 4 run -
[1]
User: "usrm::manager" 2 3 recover 2 -
[1]
whereas during the first reboot of gump I had:
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1]
DLM Lock Space: "Magma" 3 4 run -
[1]
User: "usrm::manager" 2 3 run -
[1]
Then I tried to bring gump back, and here is what I had on gump:
[root at gump ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2]
User: "usrm::manager" 0 3 join
S-1,80,2
[]
So at that point nothing worked. I hopelessly tried to restart rgmanager
on both nodes, but nothing helped; I ended up in a state where, on gump:
[root at gump ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2]
DLM Lock Space: "Magma" 3 6 run -
[2]
User: "usrm::manager" 4 5 run -
[2]
and on buba:
[root at buba ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2]
(buba seems not to have any clurgmgrd running, even though I started
rgmanager...)
I don't know if it's a bug in rgmanager or if I'm doing something
wrong, but I don't understand why everything worked during the first
reboot and nothing worked afterwards...