[Linux-cluster] node fenced by dlm_controld on a clean shutdown

Mon Nov 19 16:11:45 UTC 2012

David Teigland wrote:
> On Mon, Nov 19, 2012 at 10:39:20AM +0100, Jacek Konieczny wrote:
>> On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote:
>>> It goes like this:
>>> - resources using the shared storage are properly stopped by Pacemaker.
>>> - DRBD is cleanly demoted and unconfigured by Pacemaker
>>> - Pacemaker cleanly exits
>>> - CLVMD is stopped.
>>> - dlm_controld is stopped
>>> - corosync is being stopped
>>>
>>> and at this point the node is fenced (rebooted) by dlm_controld on
>>> the other node. I would expect it to continue with a clean shutdown.
>>>
>>> Any idea how to debug/fix it?
>>> Is this '541 cpg_dispatch error 9' the problem?
>>
>> I found a workaround: I have added a 10-second pause between
>> dlm_controld and corosync shutdown. The node shuts down cleanly now (is
>> not fenced). '541 cpg_dispatch error 9' is still there in the logs,
>> though.
> 
> corosync-cfgtool -H is supposed to shut down corosync cleanly using the
> cfg_shutdown_callback.  It looks like it may not be doing that.
> 
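The workaround quoted above (stop everything in order, but wait before stopping corosync) can be sketched roughly as follows. This is only an illustration: the step names are labels taken from the shutdown sequence described in the thread, `stop_in_order` is a hypothetical helper, and how each component is actually stopped (init script, service manager, etc.) is not stated in the thread, so it is left to the caller-supplied `run` function.

```python
import time

# Order as described in the thread; the names are just labels (assumption),
# not real service or unit names.
SHUTDOWN_ORDER = ("pacemaker", "clvmd", "dlm_controld", "corosync")


def stop_in_order(steps, run, pause_before=None, pause_seconds=10.0,
                  sleep=time.sleep):
    """Call run(name) for each step in order.

    If a step's name equals pause_before, sleep for pause_seconds
    before stopping it (the 10-second pause from the workaround).
    """
    for name in steps:
        if name == pause_before:
            sleep(pause_seconds)
        run(name)
```

For example, `stop_in_order(SHUTDOWN_ORDER, run=my_stop, pause_before="corosync")` would stop dlm_controld, wait 10 seconds, and then stop corosync, where `my_stop` is whatever the local init system requires.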

I don't think the problem is corosync failing to shut down cleanly. As
can be seen in the logs:
...
Nov 19 09:49:43 dev1n2 corosync[1130]:  [SERV  ] Service engine unloaded: corosync profile loading service
Nov 19 09:49:43 dev1n2 corosync[1130]:  [WD    ] magically closing the watchdog.
Nov 19 09:49:43 dev1n2 corosync[1130]:  [SERV  ] Service engine unloaded: corosync watchdog service
Nov 19 09:49:43 dev1n2 corosync[1130]:  [MAIN  ] Corosync Cluster Engine exiting normally
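
For anyone wanting to check the same thing on their own nodes, a small sketch like the one below looks for the "exiting normally" message from corosync in a chunk of syslog text. The function name is hypothetical, and the pattern is based only on the line format shown in the excerpt above; other corosync versions or syslog setups may format the line differently.

```python
import re

# Pattern modelled on the log line in the excerpt above (assumption:
# other versions or syslog formats may differ).
_EXIT_RE = re.compile(
    r"corosync\[\d+\]: \[MAIN\s*\] Corosync Cluster Engine exiting normally"
)


def corosync_exited_normally(log_text):
    """Return True if the log text contains corosync's clean-exit message.

    Whitespace (including line wraps added by a mail client) is
    collapsed first, so a wrapped log line still matches.
    """
    normalized = " ".join(log_text.split())
    return _EXIT_RE.search(normalized) is not None
```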