[Linux-cluster] Rgmanager fails to restart
Janne Peltonen
janne.peltonen at helsinki.fi
Sun Jul 1 11:45:21 UTC 2007
The story continues...
On Sun, Jul 01, 2007 at 02:30:40PM +0300, Janne Peltonen wrote:
> > Sometimes, when I have cleanly shut down rgmanager on one node, and the
> > services have nicely migrated to other nodes, trying to start rgmanager
> > fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> > device". clurgmgrd concludes that locks are not working and exits.
> > (See strace output attached.)
> Interesting. After the one node with failing rgmanagers was shot in the
> head (there were no log lines about fencing, only two about deferring
> fencing to an earlier node), the fenced node was left in 'off' state, and,
> well, the other nodes had their services left running (but rgmanagers
> apparently stuck - no more status checks and no response to the clustat
> command).
Now, the cluster node whose fencing resulted in a stuck system came up
and joined the cluster.
[jmmpelto@pcn1 ~]$ sudo cman_tool services
type   level  name       id        state
fence  0      default    00000000  JOIN_STOP_WAIT
[1 2 3 4 100]
dlm    1      clvmd      00000000  JOIN_STOP_WAIT
[1 2 3 4 100]
[jmmpelto@pcn1 ~]$ sudo cman_tool status
Version: 6.0.1
Config Version: 40
Cluster Name: mappi-primary
Cluster Id: 11929
Cluster Member: Yes
Cluster Generation: 184
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: pcn1-hb
Node ID: 1
Multicast addresses: 239.192.46.199
Node addresses: 10.3.0.11
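Spotting groups that are stuck mid-transition, as fence and clvmd are in the `cman_tool services` output above, can be scripted. The following is only a minimal sketch: it assumes the five-column row format exactly as pasted in this message and treats `none` as the settled state (as the later output in this message suggests). The names `Group`, `parse_services` and `stuck_groups` are made up for illustration and are not part of any cluster tool.

```python
import re
from typing import NamedTuple


class Group(NamedTuple):
    gtype: str      # e.g. "fence" or "dlm"
    level: int
    name: str
    gid: str
    state: str
    members: tuple  # node IDs from the bracketed line that follows the row


def parse_services(text):
    """Parse `cman_tool services` output in the layout shown in this post."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type "):
            continue  # skip blank lines and the column header
        if line.startswith("["):
            # A member list belongs to the most recently parsed group row.
            if groups:
                ids = tuple(int(n) for n in re.findall(r"\d+", line))
                groups[-1] = groups[-1]._replace(members=ids)
            continue
        fields = line.split()
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append(Group(gtype, int(level), name, gid, state, ()))
    return groups


def stuck_groups(groups):
    """Return groups whose state is anything other than `none` (assumption)."""
    return [g for g in groups if g.state != "none"]


# Sample copied from the output above.
SAMPLE = """\
type level name id state
fence 0 default 00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
dlm 1 clvmd 00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
"""

for g in stuck_groups(parse_services(SAMPLE)):
    print(f"{g.name}: {g.state} members={list(g.members)}")
```

In practice the text would come from running the command on a node rather than from a pasted sample; the parsing step stays the same as long as the layout does.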
I killed the completely stuck pcn2-hb from there:
[jmmpelto@pcn1 ~]$ sudo cman_tool kill -n pcn2-hb
Log:
Jul 1 14:36:36 pcn2.mappi.helsinki.fi dlm_controld[4577]: cluster is down, exiting
Jul 1 14:36:36 pcn2.mappi.helsinki.fi gfs_controld[4583]: cluster is down, exiting
Jul 1 14:36:36 pcn2.mappi.helsinki.fi fenced[4571]: cluster is down, exiting
Jul 1 14:36:59 pcn2.mappi.helsinki.fi ccsd[4508]: Unable to connect to cluster infrastructure after 30 seconds.
Thereafter, node pcn3-hb fenced it, this time with log entries:
Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn2-hb not a cluster member after 0 sec post_fail_delay
Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn1-hb not a cluster member after 0 sec post_fail_delay
Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: fencing node "pcn2-hb"
Jul 1 14:38:08 pcn3.mappi.helsinki.fi fenced[4371]: fence "pcn2-hb" success
Jul 1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Attempt to close an unopened CCS descriptor (3012450).
Jul 1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Error while processing disconnect: Invalid request descriptor
But nobody tried to fence pcn1-hb, even though fenced also reported it as
not a cluster member (see the second log line). Apparently, though, pcn3-hb
did try to say something to pcn1-hb:
Jul 1 14:38:13 pcn1.mappi.helsinki.fi fenced[4461]: fencing deferred to prior member
Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/id" error -1 2
Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2
This time the services are in no specific state, but rgmanager still does nothing constructive:
[jmmpelto@pcn3 ~]$ sudo cman_tool services
Password:
type   level  name       id        state
fence  0      default    00010001  none
[1 3 4 100]
dlm    1      clvmd      00010002  none
[1 3 4 100]
dlm    1      rgmanager  00020002  none
[1 3 4]
[jmmpelto@pcn3 ~]$ sudo clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate
Member Name   ID   Status
------ ----   ---- ------
pcnm-hb       100  Online
pcn1-hb       1    Online
pcn2-hb       2    Offline
pcn3-hb       3    Online, Local
pcn4-hb       4    Online
On node pcn1-hb:
[jmmpelto@pcn1 ~]$ sudo cman_tool services
type   level  name       id        state
fence  0      default    00010001  none
[1 3 4 100]
dlm    1      clvmd      00010002  none
[1 3 4 100]
dlm    1      rgmanager  00020002  none
[1 3 4]
[jmmpelto@pcn1 ~]$ sudo clustat
Member Status: Quorate
Member Name   ID   Status
------ ----   ---- ------
pcnm-hb       100  Online
pcn1-hb       1    Online, Local
pcn2-hb       2    Offline
pcn3-hb       3    Online
pcn4-hb       4    Online
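One way to cross-check the two outputs above is to compare each group's member list against the nodes that clustat reports as Online. This is only a rough sketch with the values typed in by hand from the output above; `missing_members` is a made-up helper name. Whether node 100 (pcnm-hb) is supposed to be in the rgmanager group at all is not stated in this message, so a difference here is something to look at, not necessarily a fault.

```python
def missing_members(groups, online):
    """Map each group name to the online node IDs absent from that group.

    Groups that contain every online node are left out of the result.
    """
    result = {}
    for name, members in groups.items():
        absent = sorted(set(online) - set(members))
        if absent:
            result[name] = absent
    return result


# Member lists from `cman_tool services` (values copied from the output above).
GROUPS = {
    "default": [1, 3, 4, 100],
    "clvmd": [1, 3, 4, 100],
    "rgmanager": [1, 3, 4],
}

# Node IDs that clustat shows as Online (pcn2-hb, ID 2, is Offline).
ONLINE = [100, 1, 3, 4]

print(missing_members(GROUPS, ONLINE))
```

With the values above, the only group that lacks an online node is rgmanager, which is missing node 100.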
Er again.
--Janne