[Linux-cluster] Fencing through iLO and functioning of kdump
Ben Turner
bturner at redhat.com
Wed Sep 1 14:48:23 UTC 2010
Here is a kbase on fence scsi:
https://access.redhat.com/kb/docs/DOC-17809
It should answer any questions you have.
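To actually enable SCSI reservations, fence_scsi is configured in cluster.conf as a fence device like any other agent. A minimal sketch — the node and device names here are placeholders, and attributes vary by release, so verify against the kbase above:

```xml
<!-- Sketch only: names are illustrative, not from this thread. -->
<clusternodes>
  <clusternode name="node1" nodeid="1">
    <fence>
      <method name="1">
        <device name="scsi" node="node1"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_scsi" name="scsi"/>
</fencedevices>
```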
Usually I run fence_scsi_test first to be sure my devices are capable; note:
"To assist with finding and detecting devices which are (or are not) suitable for use with fence_scsi, a tool has been provided. The fence_scsi_test script will find devices visible to the node and report whether or not they are compatible with SCSI persistent reservations."
-Ben
----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:
> Ben,
>
> Thank you for pointing me at fence_scsi.
> It looks like fence_scsi will fit the bill elegantly. And it should be
> much more reliable than iLO fencing if the cluster uses a properly
> configured, dual-fabric FC SAN for shared storage.
>
> I read the fence_scsi manual page and have one more question.
>
> What do I need to do for my cluster to start using SCSI reservations?
> Is this done by default?
>
> Thanks and regards,
>
> Chris Jankowski
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ben Turner
> Sent: Saturday, 28 August 2010 03:29
> To: linux clustering
> Subject: Re: [Linux-cluster] Fencing through iLO and functioning of
> kdump
>
> You have a couple options here:
>
> 1. Switch to fence_scsi (which uses SCSI reservations as you described)
> or another I/O fencing method that does not reboot the system. This
> will allow your core dump to complete without power fencing
> interrupting it.
>
> 2. Put in a post-fail delay long enough for fencing to complete.
> This is suboptimal, as your cluster services/resources will be hung
> for the duration of the post-fail delay. I usually only do this when
> I know I have a node that is crashing and no I/O fencing
> capabilities.
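The post-fail delay above is set on the fence daemon in cluster.conf. A sketch — the value is illustrative, not a recommendation; it should exceed your worst-case dump time:

```xml
<!-- Sketch: 7200 s is an illustrative value, not from this thread. -->
<fence_daemon post_fail_delay="7200" post_join_delay="3"/>
```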
>
> 3. If you don't have access to an I/O fence agent and a post-fail
> delay won't work for some reason, you can try the following.
>
> The best practice I can think of right now would be the following:
> 1. Disable the power fence device on the host you're seeing panics on
> (in the past I have done this by changing its IP in cluster.conf).
> 2. When that node fails, the other nodes will attempt to fence the
> host, and the attempt will fail since the fence device was disabled.
> (NOTE: between steps 2 and 3, cluster operation is suspended.)
> 3. The administrator can now do things like:
>    - disconnect the FC and network cables from the affected host,
>      ensuring that it is 'manually I/O fenced'
>    - run fence_ack_manual on another host to override the failed
>      fencing operation and resume cluster operation on the other nodes
> 4. The failed host is now free to continue kdumping for as long as
> need be.
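The manual override in step 3 looks roughly like this. A sketch only — the syntax has varied between releases, so check fence_ack_manual(8) on your systems:

```shell
# Sketch: run on a surviving node AFTER manually isolating the failed
# one. "failed-node" is a placeholder for the cluster node name; older
# releases take -n <nodename>, newer ones take the name directly.
fence_ack_manual -n failed-node
```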
>
> Hope this helps.
>
> -b
>
>
> ----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:
>
> > Hi,
> >
> > How can I reconcile the need to have Kdump configured and
> operational
> > on cluster nodes with the need for fencing of a node most commonly
> and
> > conveniently implemented through iLO on HP servers?
> >
> > Customers require Kdump configured and operational to be able to
> > have kernel crashes analysed by Red Hat support. The crash dump
> > starts immediately after the crash, but it may take very
> > considerable time on a machine with 512 GB of memory (more than an
> > hour) if done at dump level 0 and over a 1 GbE network. However, if
> > I use iLO fencing then the crashed node will be powered off through
> > iLO, which will irrecoverably kill the kernel dump in progress and
> > erase the memory content containing the crashed kernel image.
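As an aside on dump time: much of that hour can often be avoided by raising the dump level in /etc/kdump.conf, so that makedumpfile filters out free and unneeded pages instead of copying all 512 GB. A sketch, using standard makedumpfile options — verify the flags against your release:

```
# /etc/kdump.conf sketch: -c compresses; -d 31 excludes zero, cache,
# user and free pages. -d 0 (as in the scenario above) copies everything.
core_collector makedumpfile -c -d 31
```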
> >
> > Ideally, I would love to have the functionality that is present in
> > several UNIX clusters, where a crashed node completes its kernel
> > crash dump in peace. In UNIX clusters the crashed node can be
> > configured to reboot automatically after a kernel crash and rejoin
> > the cluster. It typically does the kernel dump as part of the boot.
> >
> > The UNIX clusters typically use SCSI reservations to protect the
> > integrity of storage. This enables them to keep the failed node
> > isolated whilst it is still able to do the kernel crash dump before
> > rejoining the cluster. I believe this option is not available in
> > Linux Cluster.
> >
> > So, how can I have a functioning Linux cluster with the ability to
> > take a kernel crash dump of crashed nodes, without blocking access
> > to the shared GFS2 filesystem for the hour or so that a crash dump
> > may take on a very large system?
> >
> > Thanks and regards,
> >
> > Chris Jankowski
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>