[Linux-cluster] GFS lock held for 12 hours

Keith Lewis keithl at mukluk.its.monash.edu.au
Fri May 19 01:26:51 UTC 2006


We have a GFS cluster with 12 data nodes and 3 lock-servers.

Red Hat AS3 U7 GFS-6.0.2.30-0

The data nodes all access a SAN disk.

The SAN fabric is divided into two independent halves - called Red and Blue.
Half the data nodes on each.  The data nodes access only one disk - reachable
via either SAN.

There are other clients, other clusters and other disks sharing the SAN.

Recently a faulty HBA, in a machine that is not part of our cluster, was
plugged into the Red SAN.

At this point the Red SAN failed.  There were two main, fairly immediate
results:

One of the Red SAN nodes became very busy.  Presumably it was holding a fairly
big GFS lock at the time, but it continued to hold the lock and to send
heartbeats.  The node gave every appearance of being hung.

The rest of the Red SAN nodes, over a period of a few minutes, presumably all
did some IO to the disk and got into a busy-wait state, which was so tight
that they stopped sending heartbeats and were fenced.  (APC PDUs)

On reboot these nodes could see the SAN as normal except they could not see
their SAN disk.  Nor could they see another disk added to the SAN as part of
the debugging attempted later.  

Many attempts were made to make the disk reappear, mostly by rebooting, or by
shutting down GFS and rmmod-ing and modprobe-ing the qla2300 driver.
Everything looked quite normal, except that the Red SAN would not let any of
our nodes see our disk.

On the Blue SAN all the machines became very busy, presumably because of the
one Red SAN machine holding the lock.  These nodes were also thought to be
hung, but none of them were rebooted, as it was discovered that they were
still exporting an important Web tree that was not on a GFS disk.  (They
sprang back to life when the one lock-holding Red SAN machine was rebooted,
which was well after the Red SAN problem was fixed.)

This state of affairs lasted 12 hours.  

Fixing it was made difficult because, to anyone looking at the problem, it
appeared that the entire SAN and the entire cluster were down.  Very little
that we saw at the time indicated that only the Red SAN had failed.
(Hindsight is wonderful.)

This was particularly unfortunate.  The justification for installing GFS was
resilience in the face of hardware failure (especially no single point of
failure).

So finally here are my questions:  

Is it really reasonable for a machine to hang onto a lock for 12 hours?

Would it be possible for a GFS machine to detect that it cannot do IO to its
GFS disk any more and release any locks it holds - perhaps by fencing itself?

(I'm thinking of adding a cronjob that forks a subprocess that does an IO to
the GFS disk.  The parent could shut down the node, leading to a fence, if
the child takes more than a minute.)
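For what it's worth, here is a minimal sketch of that watchdog idea, assuming
a POSIX shell.  The probe file, timeout, and shutdown command are all
placeholders for whatever a site would actually use; this has not been tested
against GFS 6.0:

```shell
#!/bin/sh
# Hypothetical GFS IO watchdog (a sketch, not a tested implementation).
# gfs_probe FILE TIMEOUT ON_HANG: fork a child that reads FILE; if the
# read has not finished after TIMEOUT seconds, run ON_HANG (e.g. a
# shutdown that leads to a fence) and return 1.
gfs_probe() {
    file=$1; timeout=$2; on_hang=$3

    # Child: a single small read against the GFS disk.
    dd if="$file" of=/dev/null bs=512 count=1 2>/dev/null &
    pid=$!

    # Killer: terminate the probe if it is still running after TIMEOUT.
    ( sleep "$timeout" && kill "$pid" 2>/dev/null ) &
    killer=$!

    if wait "$pid"; then
        # Probe completed: disk IO is working, cancel the killer.
        kill "$killer" 2>/dev/null
        return 0
    fi
    # Probe hung (killed after TIMEOUT) or failed outright: take the
    # node down so fencing can recover its locks.
    $on_hang
    return 1
}

# Example cron wrapper (path and command are assumptions):
#   gfs_probe /gfs/.watchdog-probe 60 "/sbin/shutdown -h now"
```

Run once a minute from cron; if the probe read hangs, the node takes itself
down and the rest of the cluster can fence it and release its locks.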

Have I made any mistakes in my guesses and presumptions?

Keith Lewis



