[Linux-cluster] blocking dlm_lock

Paul Dugas paul at dugas.cc
Thu Jul 29 14:14:37 UTC 2010


Morning all,

I have a cluster of three RHEL5-x86_64 machines (all up to date)
sharing a GFS filesystem on a Coraid AoE unit.  Last night, I shut the
whole thing down to replace batteries in a couple of UPS units and brought
things back up without issue.  About an hour later, access to the
shared filesystem stalled on all three machines.  It was late and I
figured I had missed something, so I brought it back down and up again and
it was fine.  About 4am this morning, it did it again.  By the time I
got to the site, people were already screaming so I simply restarted
it again.  I've had some time (and coffee) now to look through the
logs and am finding little of value.  I see two anomalies but I don't
know what they mean.

The first thing I found is a number of lines like so:

  openais[3953]: [TOTEM] Retransmit List: 3eb233

The second thing is a set of messages like this:

  kernel: INFO: task nfsd:3523 blocked for more than 120 seconds.

These are followed by stack dumps where dlm_lock is on top.
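
To see how often each anomaly occurs, a rough sketch like the one below
could tally them from the syslog.  The patterns are built only from the
two lines quoted above; the function name summarize() is made up for
illustration, and the exact message layout on other systems is an
assumption.

```python
import re
from collections import Counter

# Patterns derived from the two log lines quoted in this message.
RETRANSMIT = re.compile(r"openais\[\d+\]: \[TOTEM\] Retransmit List:")
HUNG_TASK = re.compile(
    r"kernel: INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def summarize(lines):
    """Return (retransmit_count, Counter of hung-task reports by task name).

    `lines` is any iterable of log lines, e.g. an open file object.
    (Hypothetical helper, not part of any cluster tooling.)
    """
    retransmits = 0
    hung = Counter()
    for line in lines:
        if RETRANSMIT.search(line):
            retransmits += 1
            continue
        match = HUNG_TASK.search(line)
        if match:
            hung[match.group(1)] += 1  # group 1 is the task name, e.g. "nfsd"
    return retransmits, hung
```

It can be fed an open log file, for example open("/var/log/messages"),
assuming that is where syslog writes on the node in question.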

Some searching suggests this may be an issue with my switch.  Is that
reasonable?  Is there a way to get further diagnostics?  This cluster
has been in service for a couple of years, so I'm leaning toward
something being broken rather than misconfigured.  Any help would be
appreciated.

Paul
--
Paul Dugas • 522 Black Canyon Park, Canton GA 30114 USA •
Paul at Dugas.cc • +1.404.932.1355



