[Linux-cluster] Fencing through iLO and functioning of kdump
Ryan O'Hara
rohara at redhat.com
Wed Sep 1 17:11:45 UTC 2010
On Wed, Sep 01, 2010 at 10:48:23AM -0400, Ben Turner wrote:
> Here is a kbase on fence_scsi; it should answer any questions you have:
>
> https://access.redhat.com/kb/docs/DOC-17809
>
> Usually I try the fence_scsi_test to be sure my devices are capable, note:
>
> "To assist with finding and detecting devices which are (or are not) suitable for use with fence_scsi, a tool has been provided. The fence_scsi_test script will find devices visible to the node and report whether or not they are compatible with SCSI persistent reservations."
I just have to comment that fence_scsi_test is rather limited. I'm
currently working on making it more robust, such that it more
accurately tests device(s) for SCSI-PR support.
Basically there are two issues:
1. The current script does not verify that registrations exist on a
device -- it relies on the error code returned from sg_persist. This
usually works, but we have seen some arrays that will report false
positives.
2. The script *only* puts a registration on the device(s) and then
removes the registration from each device. This doesn't tell the whole
story, since the array must also support the preempt-and-abort
operation.
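
To illustrate the first issue, here is a minimal sketch (not the actual fence_scsi_test code) of reading the keys back after registering instead of trusting the exit status alone. The function names are hypothetical, and the parser assumes that `sg_persist --in --read-keys` prints each registered key on its own line as a hex value. It does not cover the second issue, preempt-and-abort.

```python
import re
import subprocess

# Assumption: in `sg_persist --in --read-keys` output, each registered key
# is printed on its own line as a hex value such as "0x1234abcd".
KEY_LINE = re.compile(r"^\s*(0x[0-9a-fA-F]+)\s*$")


def parse_registered_keys(output):
    """Return the set of reservation keys (as ints) listed in sg_persist output."""
    keys = set()
    for line in output.splitlines():
        match = KEY_LINE.match(line)
        if match:
            keys.add(int(match.group(1), 16))
    return keys


def read_keys(device):
    """Run sg_persist to list the keys currently registered on `device`."""
    result = subprocess.run(
        ["sg_persist", "--in", "--read-keys", device],
        capture_output=True, text=True, check=True,
    )
    return parse_registered_keys(result.stdout)


def registration_verified(device, key):
    """Register `key`, then confirm by read-back that the device holds it."""
    # sg_persist interprets the --param-sark value as hexadecimal.
    subprocess.run(
        ["sg_persist", "--out", "--register", f"--param-sark={key:x}", device],
        capture_output=True, text=True, check=True,
    )
    # A zero exit status alone is not trusted; the key must be listed.
    return key in read_keys(device)
```

An array that accepts the register command but never stores the key would pass an exit-code check and fail this read-back check.
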
A new fence_scsi_test script should be available in the very near
future. Here is the relevant BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=603838
Ryan
> ----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:
>
> > Ben,
> >
> > Thank you for pointing me at fence_scsi.
> > It looks like fence_scsi will fit the bill elegantly. And it should be
> > much more reliable than iLO fencing if the cluster uses properly
> > configured, dual fabric FC SAN for shared storage.
> >
> > I read the fence_scsi manual page and have one more question.
> >
> > What do I need to do for my cluster to start using SCSI reservations?
> > Is this done by default?
> >
> > Thanks and regards,
> >
> > Chris Jankowski
> >
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ben Turner
> > Sent: Saturday, 28 August 2010 03:29
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Fencing through iLO and functioning of
> > kdump
> >
> > You have a couple options here:
> >
> > 1. Switch to fence_scsi (uses SCSI reservations, as you described) or
> > another I/O fencing method that does not reboot the system. This will
> > enable your core dump to complete without power fencing interrupting
> > it.
> >
> > 2. Put in a post fail delay long enough for fencing to complete.
> > This is suboptimal as your cluster services/resources will be hung
> > for the duration of the post fail delay. I usually only do this when
> > I know I have a node that is crashing and no I/O fencing
> > capabilities.
> >
> > 3. If you don't have access to an I/O fence agent and a post fail
> > delay won't work for some reason you can try:
> >
> > Best practice I can think of right now would be the following:
> > 1. Disable the power fence device on the host you're seeing panics on.
> > I have changed the IP for it in cluster.conf in the past.
> > 2. When that node fails, the other nodes will attempt to fence the host
> > and it will fail since the fence device was disabled.
> > (NOTE: between steps 2 and 3, cluster operation is suspended)
> > 3. The administrator can now do things like:
> > - disconnect the FC and network cables from the affected host,
> > ensuring that it is 'manually I/O fenced'
> > - run fence_ack_manual on the other host to override the failed
> > fencing operation to continue cluster operation on the other
> > nodes
> > 4. Now the failed host is free to continue kdumping for as long
> > as need be.
> >
> > Hope this helps.
> >
> > -b
> >
> >
> > ----- "Chris Jankowski" <Chris.Jankowski at hp.com> wrote:
> >
> > > Hi,
> > >
> > > How can I reconcile the need to have Kdump configured and operational
> > > on cluster nodes with the need for fencing of a node, most commonly
> > > and conveniently implemented through iLO on HP servers?
> > >
> > > Customers require Kdump configured and operational to be able to have
> > > kernel crashes analysed by Red Hat support. The taking of a crash dump
> > > starts immediately after the crash, but it may take a very considerable
> > > time on a machine with 512 GB of memory (more than an hour) if done at
> > > dump level 0 and over a 1 GbE network. However, if I use iLO fencing
> > > then the crashed node will be powered off through iLO, which will
> > > irrecoverably kill the kernel dump in progress and erase the memory
> > > content containing the crashed kernel image.
> > >
> > > Ideally, I would love to have the functionality that is present in
> > > several UNIX clusters, where a crashed node completes its kernel crash
> > > dump in peace. In UNIX clusters the crashed node can be configured to
> > > reboot automatically after a kernel crash and rejoin the cluster. It
> > > typically does the kernel dump as a part of the boot.
> > >
> > > The UNIX clusters typically use SCSI reservation to protect the
> > > integrity of storage. This enables them to keep the failed node
> > > isolated whilst it is still able to do the kernel crash dump before
> > > rejoining the cluster. I believe this option is not available in
> > > Linux Cluster.
> > >
> > > So, how can I have a functioning Linux cluster with the ability to take
> > > a kernel crash dump of crashed nodes, without blocking access to the
> > > shared GFS2 filesystem for the hour or so that a crash dump may take
> > > on a very large system?
> > >
> > > Thanks and regards,
> > >
> > > Chris Jankowski
> > >
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster