[Linux-cluster] Rgmanager fails to restart

Sun Jul 1 11:45:21 UTC 2007

The story continues...

On Sun, Jul 01, 2007 at 02:30:40PM +0300, Janne Peltonen wrote:
> > Sometimes, when I have cleanly shut down rgmanager on one node, and the
> > services have nicely migrated to other nodes, trying to start rgmanager
> > fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> > device". clurgmgrd concludes that locks are not working and exits.
> > (See strace output attached.)
> Interesting. After the one node with failing rgmanagers was shot in the
> head (there were no log lines about fencing, only two about deferring
> fencing to an earlier node), the fenced node was left in 'off' state, and,
> well, the other nodes had their services left running (but rgmanagers
> apparently stuck - no more status checks and no response to the clustat
> command).

Now, the cluster node whose fencing resulted in a stuck system came up
and joined the cluster.

[jmmpelto at pcn1 ~]$ sudo cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
dlm              1     clvmd    00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
[jmmpelto at pcn1 ~]$ sudo cman_tool status
Version: 6.0.1
Config Version: 40
Cluster Name: mappi-primary
Cluster Id: 11929
Cluster Member: Yes
Cluster Generation: 184
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: pcn1-hb
Node ID: 1
Multicast addresses: 239.192.46.199
Node addresses: 10.3.0.11
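
For anyone who wants to spot this condition automatically, here is a minimal Python sketch that parses `cman_tool services` output like the above and lists every group that is not in state "none". It assumes that "none" means no transition is in progress; the function names parse_services and stuck_groups are made up for illustration, and the sample text is copied from the output above.

```python
# Sample copied from the `cman_tool services` output on pcn1 above.
SAMPLE = """\
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
dlm              1     clvmd    00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
"""


def parse_services(text):
    """Return one dict per group, including the member node IDs."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the header row
        if line.startswith("["):
            # A bracketed line lists the members of the preceding group.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        fields = line.split()
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append({"type": gtype, "level": int(level), "name": name,
                           "id": gid, "state": state, "members": []})
    return groups


def stuck_groups(groups):
    """Groups whose state is anything other than 'none' (assumed to mean idle)."""
    return [g for g in groups if g["state"] != "none"]


if __name__ == "__main__":
    for g in stuck_groups(parse_services(SAMPLE)):
        print(f'{g["type"]}:{g["name"]} state={g["state"]} members={g["members"]}')
```

In practice the text would come from running `sudo cman_tool services` rather than from a hard-coded sample.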

I killed the completely stuck pcn2-hb from there:

[jmmpelto at pcn1 ~]$ sudo cman_tool kill -n pcn2-hb

Log:

Jul  1 14:36:36 pcn2.mappi.helsinki.fi dlm_controld[4577]: cluster is down, exiting
Jul  1 14:36:36 pcn2.mappi.helsinki.fi gfs_controld[4583]: cluster is down, exiting
Jul  1 14:36:36 pcn2.mappi.helsinki.fi fenced[4571]: cluster is down, exiting
Jul  1 14:36:59 pcn2.mappi.helsinki.fi ccsd[4508]: Unable to connect to cluster infrastructure after 30 seconds.

Thereafter, node pcn3-hb fenced it, this time with log entries:

Jul  1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn2-hb not a cluster member after 0 sec post_fail_delay
Jul  1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn1-hb not a cluster member after 0 sec post_fail_delay
Jul  1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: fencing node "pcn2-hb"
Jul  1 14:38:08 pcn3.mappi.helsinki.fi fenced[4371]: fence "pcn2-hb" success
Jul  1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Attempt to close an unopened CCS descriptor (3012450).
Jul  1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Error while processing disconnect: Invalid request descriptor

However, nobody tried to fence pcn1-hb (see the second log line), although
pcn3-hb apparently tried to say something to pcn1-hb:

Jul  1 14:38:13 pcn1.mappi.helsinki.fi fenced[4461]: fencing deferred to prior member
Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/id" error -1 2
Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2
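
A note on reading the dlm_controld lines above: the trailing "-1 2" looks like the return value of open() followed by errno, but that is my assumption about the log format, not something I have confirmed. Under that assumption, a small Python sketch (decode_open_error is a hypothetical helper name) turns the number into a symbolic name:

```python
import errno
import os
import re

# One of the dlm_controld lines from the log above.
LOG_LINE = ('Jul  1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: '
            'open "/sys/kernel/dlm/rgmanager/id" error -1 2')

# Assumed layout: open "<path>" error <return value> <errno>
PATTERN = re.compile(r'open "(?P<path>[^"]+)" error (?P<ret>-?\d+) (?P<err>\d+)')


def decode_open_error(line):
    """Return (path, return value, errno name, errno message) or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    code = int(m.group("err"))
    name = errno.errorcode.get(code, "unknown")
    return m.group("path"), int(m.group("ret")), name, os.strerror(code)


if __name__ == "__main__":
    print(decode_open_error(LOG_LINE))
```

If the assumption holds, errno 2 is ENOENT, i.e. the rgmanager lockspace directory under /sys/kernel/dlm did not exist on pcn1 at that moment, which would fit the "No such device" seen earlier on /dev/misc/dlm_rgmanager.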

This time the services are in no specific state, but rgmanager still does
nothing constructive:

[jmmpelto at pcn3 ~]$ sudo cman_tool services
Password:
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]
[jmmpelto at pcn3 ~]$ sudo clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  pcnm-hb                             100 Online
  pcn1-hb                               1 Online
  pcn2-hb                               2 Offline
  pcn3-hb                               3 Online, Local
  pcn4-hb                               4 Online

On node pcn1-hb:

[jmmpelto at pcn1 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]
[jmmpelto at pcn1 ~]$
[jmmpelto at pcn1 ~]$
[jmmpelto at pcn1 ~]$ sudo clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  pcnm-hb                             100 Online
  pcn1-hb                               1 Online, Local
  pcn2-hb                               2 Offline
  pcn3-hb                               3 Online
  pcn4-hb                               4 Online
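
One thing that stands out in the two outputs above is that the rgmanager group has members [1 3 4] while fence and clvmd have [1 3 4 100]. A rough Python sketch for flagging such differences follows; the names group_members and absent_nodes are invented for this example, and the ONLINE set is simply the node IDs that clustat reports as Online above. Whether node 100 (pcnm-hb) is meant to run rgmanager at all cannot be told from these outputs, so a difference here is a hint, not proof of a fault.

```python
import re

# `cman_tool services` output from pcn1, copied from above.
SERVICES = """\
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]
"""

# Node IDs that clustat lists as Online above.
ONLINE = {100, 1, 3, 4}

# A group row (type, level, name, id, state) followed by its member list.
GROUP_RE = re.compile(
    r'^(\S+)[ \t]+(\d+)[ \t]+(\S+)[ \t]+(\S+)[ \t]+(\S+)[ \t]*\n\[([\d ]+)\]',
    re.MULTILINE)


def group_members(text):
    """Map 'type:name' to the set of member node IDs."""
    return {f"{m.group(1)}:{m.group(3)}": {int(n) for n in m.group(6).split()}
            for m in GROUP_RE.finditer(text)}


def absent_nodes(groups, online):
    """For each group, the online nodes that are not members of it."""
    return {name: sorted(online - members)
            for name, members in groups.items() if online - members}


if __name__ == "__main__":
    print(absent_nodes(group_members(SERVICES), ONLINE))
```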

Er again.

--Janne