[Linux-cluster] Rgmanager fails to restart
Janne Peltonen
janne.peltonen at helsinki.fi
Sun Jul 1 11:30:40 UTC 2007
On Sun, Jul 01, 2007 at 02:17:48PM +0300, Janne Peltonen wrote:
> Hi!
>
> Sometimes, when I have cleanly shut down rgmanager on one node, and the
> services have nicely migrated to other nodes, trying to start rgmanager
> fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> device". clurgmgrd concludes that locks are not working and exits.
> (See strace output attached.)
Interesting. After the one node with the failing rgmanager was shot in the
head (there were no log lines about fencing, only two about deferring
fencing to an earlier node), the fenced node was left in the 'off' state, and,
well, the other nodes had their services left running (but their rgmanagers
are apparently stuck: no more status checks and no response to the clustat
command). The node that (apparently, since there is no log entry) did
the fencing:
[jmmpelto@pcn2 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]
Other nodes with rgmanager running:
[jmmpelto@pcn3 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm              1     rgmanager  00020002 FAIL_ALL_STOPPED
[1 2 3 4]
The fifth node without rgmanager:
[jmmpelto@pcnm ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 FAIL_START_WAIT
[2 3 4 100]
dlm              1     clvmd      00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
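
For comparing listings like the ones above across nodes, a rough Python
sketch follows. It is not part of any cluster tooling: the function names
(parse_services, stuck_groups, membership_diff) are made up here, it assumes
every group line has exactly the five whitespace-separated fields shown
above, and it treats any state beginning with "FAIL" as "not settled",
which is only a guess at what those states mean. The sample text is the
pcn3 output pasted as-is.

```python
def parse_services(text):
    """Parse pasted `cman_tool services` output into a list of group dicts.

    Assumes the five-column layout seen in this thread:
    type, level, name, id, state, followed by a "[...]" member line.
    """
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the column header
        if line.startswith("["):
            # A member list belongs to the most recently seen group line.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        fields = line.split()
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append({"type": gtype, "level": int(level), "name": name,
                           "id": gid, "state": state, "members": []})
    return groups


def stuck_groups(groups):
    """Groups whose state starts with FAIL (assumption: these are unsettled)."""
    return [g for g in groups if g["state"].startswith("FAIL")]


def membership_diff(groups, a, b):
    """Node ids that are members of group `b` but not of group `a`."""
    by_name = {g["name"]: set(g["members"]) for g in groups}
    return sorted(by_name[b] - by_name[a])


PCN3 = """\
type level name id state
fence 0 default 00010001 FAIL_START_WAIT
[2 3 4 100]
dlm 1 clvmd 00010002 FAIL_ALL_STOPPED
[1 2 3 4 100]
dlm 1 rgmanager 00020002 FAIL_ALL_STOPPED
[1 2 3 4]
"""

if __name__ == "__main__":
    groups = parse_services(PCN3)
    for g in stuck_groups(groups):
        print(g["name"], g["state"], g["members"])
    # Nodes still listed in clvmd but no longer in the fence domain.
    print("in clvmd, not in fence domain:",
          membership_diff(groups, "default", "clvmd"))
```

On the pcn3 listing this reports node 1 as present in the clvmd group but
absent from the fence domain, which is the same difference visible by eye
between the pcn2 listing and the other two.
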
Er. What might be up?
--Janne