[Linux-cluster] Fencing through iLO and functioning of kdump

Fri Aug 27 17:29:27 UTC 2010

You have a couple options here:

1.  Switch to fence_scsi (which uses SCSI reservations, as you described) or another I/O fencing method that does not reboot the system.  This will let your core dump complete without power fencing interrupting it.

2.  Put in a post fail delay long enough for the dump to complete before fencing kicks in.  This is suboptimal, as your cluster services/resources will be hung for the duration of the post fail delay.  I usually only do this when I know I have a node that is crashing and no I/O fencing capabilities.

3.  If you don't have access to an I/O fence agent and a post fail delay won't work for some reason, you can try:

Best practice I can think of right now would be the following:
1. disable the power fence device on the host you're seeing panics on (I have changed the IP for it in cluster.conf in the past)
2. when that node fails, the other nodes will attempt to fence the host
   and it will fail since the fence device was disabled
   (NOTE: between steps 2 and 3, cluster operation is suspended)
3. administrator can now do things like:
   - disconnect the FC and network cables from the affected host ensuring
     that it is 'manually I/O fenced'
   - run fence_ack_manual on the other host to override the failed
     fencing operation to continue cluster operation on the other nodes
4. Now the failed host is free to continue kdumping for as long as need be
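
For option 2, here is a rough sketch of how you might size the post fail delay. It only does the arithmetic from your numbers (512 GB at dump level 0 over 1 GbE) and prints a fence_daemon line for cluster.conf. The efficiency and margin values are my own assumptions, not measurements, and I am treating 1 GB as 10^9 bytes, so check the result against a real test dump.

```python
import math


def dump_seconds(memory_gb, link_gbit_per_s, efficiency=1.0):
    """Estimate seconds to push an unfiltered, uncompressed dump over the network.

    memory_gb       -- RAM to dump, in GB (assumed 10^9 bytes each)
    link_gbit_per_s -- nominal link speed in Gbit/s
    efficiency      -- assumed fraction of the nominal speed actually achieved
    """
    gigabits = memory_gb * 8
    return gigabits / (link_gbit_per_s * efficiency)


def post_fail_delay(memory_gb, link_gbit_per_s, efficiency=0.9, margin=1.25):
    """Round the estimate up to whole seconds, padded by an assumed safety margin."""
    return math.ceil(dump_seconds(memory_gb, link_gbit_per_s, efficiency) * margin)


def fence_daemon_line(delay_seconds):
    """Format the fence_daemon element carrying the post_fail_delay attribute."""
    return '<fence_daemon post_fail_delay="%d"/>' % delay_seconds


if __name__ == "__main__":
    ideal = dump_seconds(512, 1.0)
    print("ideal transfer time: %.0f s (%.1f min)" % (ideal, ideal / 60))
    print(fence_daemon_line(post_fail_delay(512, 1.0)))
```

Even the ideal case comes out at over an hour, which matches what you are seeing, and the whole cluster waits that long before fencing. That is why I only use this as a last resort.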

Hope this helps.

-b

----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:

> Hi,
> 
> How can I reconcile the need to have Kdump configured and operational
> on cluster nodes with the need for fencing of a node most commonly and
> conveniently implemented through iLO on HP servers?
> 
> Customers require Kdump configured and operational to be able to have
> kernel crashes analysed by Red Hat support. The taking of crash dump
> starts immediately after the crash, but it may take very considerable
> time on a machine with 512 GB of memory (more than an hour) if done in
> dumplevel 0 and over 1 GBE network. However, if I use iLO fencing then
> the crashed node will be powered off through iLO which will
> irrecoverably kill the kernel dump in progress and erase the memory
> content containing the crashed kernel image.
> 
> Ideally, I would love to have the functionality that is present in
> several UNIX clusters, when a crashed node completes its kernel crash
> dump in peace. In UNIX clusters the crashed node can be configured to
> reboot automatically after kernel crash and rejoin the cluster. It
> typically does the kernel dump as a part of the boot.
> 
> The UNIX clusters typically use SCSI reservation to protect integrity
> of storage. This enables them to keep the failed node isolated whilst
> it is still able to do the kernel crash dump before rejoining the
> cluster. I believe this option is not available in Linux Cluster.
> 
> So, how can I have a functioning Linux cluster with the ability to take a
> kernel crash dump of crashed nodes, without blocking access to the
> shared GFS2 filesystem for the hour or so that a crash dump may take
> on a very large system?
> 
> Thanks and regards,
> 
> Chris Jankowski
> 