[Linux-cluster] forcefully taking over a service from another node, kdump

My LinuxHAList mylinuxhalist at gmail.com
Wed Apr 14 18:34:59 UTC 2010


Setup: 2 Nodes: node1, node2. IPMI fencing mechanism.

I'm trying to minimize downtime and still get a kdump. The fail-over
process works fine without kdump'ing, but to get a usable dump I have to
set post_fail_delay high enough that the panicking node isn't fenced
before the dump completes.

To make sure kdump finishes, I set post_fail_delay to 1200 seconds (the
nodes have a lot of memory, so writing the dump takes a long time) and
have the post-kdump script sleep for another 1200 seconds.

That way, if node1 panics, it kdumps itself and then goes to sleep for a
while. node2 then fences node1 (reboots it via IPMI) and takes over the
service, most likely while node1 is still sleeping in the post-kdump
script.

The drawbacks are that the service is down for 1,200 seconds (while the
dump is being written) and that I have to assume the dump will actually
finish within those 1,200 seconds.

=== Working on a new solution ===

I'm working on a solution to this using a kdump_pre script. When node1
panics, before it starts dumping, it contacts node2 so that node2 can
attempt to take over the service.
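Concretely, the kdump_pre hook looks something like this (the script
name, port, and use of nc are just what I'm experimenting with, and it
assumes the kdump initramfs has working networking):

    # /etc/kdump.conf: run a script before the dump starts
    kdump_pre /usr/local/sbin/kdump_pre_notify.sh

    #!/bin/sh
    # /usr/local/sbin/kdump_pre_notify.sh: tell node2 we are about to
    # dump so it can start taking over the service, then carry on.
    echo "node1 kdump starting, please take over" | nc -w 5 node2 7777
    exit 0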

On node2, I see <service> still listed as running on node1 and issue:
    clusvcadm -r <service>

Because of node1's state (it is kdumping), the command just hangs, so
this does not cut down the service downtime.
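(For completeness, the node2 side is just a trivial listener along these
lines; the port and the loop are placeholders, and the clusvcadm -r call
is where it gets stuck:)

    #!/bin/sh
    # Listener on node2: wait for node1's kdump_pre notification, then
    # try to pull the service over. The relocation below is what hangs.
    while true; do
        nc -l 7777 > /dev/null
        clusvcadm -r <service> -m node2
    done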

What can I do on node2 to forcefully take over the service from node1
after node2 has been contacted by node1 at the kdump_pre stage?


Thanks