[Linux-cluster] Rgmanager fails to restart

Janne Peltonen janne.peltonen at helsinki.fi
Sun Jul 1 11:30:40 UTC 2007


On Sun, Jul 01, 2007 at 02:17:48PM +0300, Janne Peltonen wrote:
> Hi!
> 
> Sometimes, when I have cleanly shut down rgmanager on one node, and the
> services have nicely migrated to other nodes, trying to start rgmanager
> fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> device". clurgmgrd concludes that locks are not working and exits.
> (See strace output attached.)

Interesting. After the one node with the failing rgmanager was shot in the
head (there were no log lines about fencing, only two about deferring
fencing to an earlier node), the fenced node was left in the 'off' state, and,
well, the other nodes had their services left running (but their rgmanagers
are apparently stuck: no more status checks and no response to the clustat
command). The node that (apparently, since there is no log entry) did
the fencing:

[jmmpelto@pcn2 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]

Other nodes with rgmanager running:

[jmmpelto@pcn3 ~]$ sudo cman_tool services
type             level name       id       state       
fence            0     default    00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]

The fifth node without rgmanager:

[jmmpelto@pcnm ~]$ sudo cman_tool services
type             level name     id       state       
fence            0     default  00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd    00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]

Er. What might be up?
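
For comparing these listings between nodes, something like the following
might help. It is only a rough sketch, assuming the output always has the
five-column layout seen above with the member list on the following line;
parse_services and SAMPLE are made-up names for illustration, not part of
cman. The sample text is the pcn3 listing.

```python
import re

# Sample input: the pcn3 listing from this message, used only for illustration.
SAMPLE = """\
type             level name       id       state
fence            0     default    00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]
"""


def parse_services(text):
    """Parse `cman_tool services`-style text into a list of dicts.

    Assumes five whitespace-separated columns per group line and a
    bracketed list of node IDs on the line after it (an assumption
    based on the output pasted above, not on any documented format).
    """
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the column header
        member_line = re.fullmatch(r"\[([\d\s]*)\]", line)
        if member_line:
            # A member list belongs to the most recently seen group.
            if groups:
                groups[-1]["members"] = [int(n) for n in member_line.group(1).split()]
            continue
        fields = line.split()
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append({
                "type": gtype,
                "level": int(level),
                "name": name,
                "id": gid,
                "state": state,
                "members": [],
            })
    return groups


if __name__ == "__main__":
    for g in parse_services(SAMPLE):
        print(g["type"], g["name"], g["state"], g["members"])
```

Running the same function over the output from each node and comparing the
"state" and "members" fields per group name makes the differences (such as
node 1 missing from the fence group on pcn3 and pcnm) easier to spot.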


--Janne



