[Linux-cluster] Hard lockups when writing a lot to GFS

Rick Stevens rstevens at vitalstream.com
Thu Dec 9 22:31:29 UTC 2004


I have a two-node setup on a dual-port SCSI SAN.  Note this is just
for test purposes.  Part of the SAN is a GFS filesystem shared between
the two nodes.

When we fetch content to the GFS filesystem via an rsync pull (well, 
several rsync pulls) on node 1, it runs for a while, then node 1 hard
locks (nothing on the console, network dies, console dies, it's frozen
solid).  Of course, node 2 notices it and marks node 1 down 
(/proc/cluster/nodes shows an "X" for node 1 under "Sts").  So the
cluster behaviour is OK.  If I run "fence_ack_manual -n node1" on
node 2, it carries on happily.  I can then reboot node 1 and
everything returns to normal.
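
For reference, the test and the recovery look roughly like this (the
remote host, module names and mount point are placeholders, not our
real setup):

  # On node 1: several parallel rsync pulls onto the GFS mount
  rsync -a rsync://content-host/vol1/ /mnt/gfs/vol1/ &
  rsync -a rsync://content-host/vol2/ /mnt/gfs/vol2/ &

  # After node 1 freezes, on node 2: confirm the cluster saw it go down
  cat /proc/cluster/nodes    # node 1 shows an "X" under "Sts"

  # Acknowledge the manual fence so node 2 can carry on
  fence_ack_manual -n node1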

The problem is, why is node 1 dying like this?  It is important that
this get sorted out, as we have a LOT of data to synchronize (rsync is
just the test case--we'll probably use a different scheme in
deployment), and I suspect the heavy write activity on that node is
what's causing the crash.

Oh, both nodes have the GFS filesystem mounted with "-o rw,noatime".
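
That is, something like this on each node (the pool device and mount
point here are placeholders):

  mount -t gfs -o rw,noatime /dev/pool/gfs01 /mnt/gfs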

Any ideas would be GREATLY appreciated!
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens at vitalstream.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-      Do you know how to save five drowning lawyers?  No?  GOOD!    -
----------------------------------------------------------------------



