[Linux-cluster] fencing issue - with attach logs&conf

Fri Mar 5 20:01:26 UTC 2010

Hi Corey,

I was talking about a watchdog, not a kernel panic (sysrq...). On
common (x86) hardware, most server vendors include embedded hardware
chips that could be used for this.

Indeed, SCSI-3 reservation/registration could be combined with all of
this to be sure about the nodes' sanity.

I think the admin should be given the choice of whether or not to adopt
the paranoid approach of not failing over the services.
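
To make that choice concrete, here is a rough sketch in Python of the
decision Lon describes in the quoted mail below, with the paranoid
behaviour as a switch. Every name in it (Policy, may_fail_over, the
arguments) is hypothetical and only illustrates the idea; it is not how
rgmanager or any fence agent is actually implemented.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical admin setting, not a real cluster.conf option."""
    PARANOID = "paranoid"      # never fail over unless fencing is confirmed
    PERMISSIVE = "permissive"  # admin accepts the risk of data corruption


def may_fail_over(ping_ok: bool, fence_confirmed: bool, policy: Policy) -> bool:
    """Decide whether services of a silent node may be taken over."""
    if fence_confirmed:
        # The node is known to be powered off, so it cannot touch storage.
        return True
    if ping_ok:
        # The node still answers, so it may still be writing to storage.
        return False
    # No ping and no confirmed fencing: power loss and network loss look
    # the same from here, so only the admin's policy can settle it.
    return policy is Policy.PERMISSIVE


if __name__ == "__main__":
    # Pulled power cables: no ping, fence device unreachable.
    print(may_fail_over(False, False, Policy.PARANOID))
    print(may_fail_over(False, False, Policy.PERMISSIVE))
```

With PARANOID the ambiguous case never fails over, which matches the
current behaviour; PERMISSIVE trades that safety for availability.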

2010/3/4 Corey Kovacs <corey.kovacs at gmail.com>:
> Brem,
>
> It's been my understanding that the kernel panic technique you are
> describing is essentially undesirable because the kernel is in an
> unknown state. Basically anything can happen. The OS doesn't have to do a
> sync for an HBA to flush, etc. Since Red Hat isn't in the business of building
> their own hardware like HP (DEC), Sun, or IBM, they take the next best route to
> ensure that nothing from that problematic machine can affect the storage, and
> the only way to guarantee that is to remove power from the whole machine.
>
> VMS and Tru64 use the panic method, but the other nodes will issue a
> reservation on the SCSI bus against that node to protect the storage. They
> can do that because they know exactly how their hardware and implementation
> of reservations work.
>
> Corey
>
> On Thu, Mar 4, 2010 at 5:32 AM, שלום קלמר <sklemer at gmail.com> wrote:
>>
>> Thanks to all !!!!
>>
>> Shalom.klemer at hp.com
>>
>> On Thu, Mar 4, 2010 at 12:00 AM, Lon Hohberger <lhh at redhat.com> wrote:
>>>
>>> On Wed, 2010-03-03 at 13:10 +0200, שלום קלמר wrote:
>>> > Hi.
>>> >
>>> > I have 2 power supplies. But if someone pulls the power cables by
>>> > mistake, does that mean that the services will not fail over?
>>>
>>> The problem is:
>>>
>>> no power = no ping + no DRAC access
>>> no network = no ping, no DRAC access
>>>
>>> If there's no power, then it is safe to fail over.
>>>
>>> If there is no network (and power is OK), then it is not safe to fail
>>> over.  Failover in this case is very likely to produce data corruption!
>>>
>>> Because we can not tell which case happened, we do not fail over.
>>>
>>> -- Lon
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster