[Linux-cluster] blocking dlm_lock
Paul Dugas
paul at dugas.cc
Thu Jul 29 14:14:37 UTC 2010
Morning all,
I have a cluster of three RHEL5-x86_64 machines (all up to date)
sharing a GFS filesystem on a Coraid AoE unit. Last night, I shut the
whole thing down to replace batteries in a couple of UPS units and brought
things back up without issue. About an hour later, access to the
shared filesystem stalled from all three machines. It was late and I
figured I had missed something, so I brought it back down and up again and
it was fine. About 4am this morning, it did it again. By the time I
got to the site, people were already screaming so I simply restarted
it again. I've had some time (and coffee) now to look through the
logs and am finding little of value. I see two anomalies but I don't
know what they mean.
The first thing I found is a number of lines like so:
openais[3953]: [TOTEM] Retransmit List: 3eb233
The second thing is a set of messages like this:
kernel: INFO: task nfsd:3523 blocked for more than 120 seconds.
These are followed by stack dumps where dlm_lock is on top.
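To get a rough idea of how often each of these shows up, the two message types can be tallied from a syslog file. This is only a minimal sketch: the patterns are built from the two lines quoted above, the function name summarize is made up for illustration, and the real log lines will also carry a timestamp and hostname prefix, which search() tolerates.

```python
import re
from collections import Counter

# Patterns derived from the two log lines quoted in this message.
RETRANSMIT = re.compile(r"openais\[\d+\]: \[TOTEM\] Retransmit List: (.+)")
HUNG_TASK = re.compile(
    r"kernel: INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)

def summarize(lines):
    """Return (number of retransmit-list lines, Counter of hung tasks by name)."""
    retransmits = 0
    hung = Counter()
    for line in lines:
        if RETRANSMIT.search(line):
            retransmits += 1
            continue
        match = HUNG_TASK.search(line)
        if match:
            hung[match.group(1)] += 1
    return retransmits, hung

# Example using the two lines shown above.
sample = [
    "openais[3953]: [TOTEM] Retransmit List: 3eb233",
    "kernel: INFO: task nfsd:3523 blocked for more than 120 seconds.",
]
retransmits, hung = summarize(sample)
print(retransmits, dict(hung))  # prints: 1 {'nfsd': 1}
```

Passing an open file object instead of the sample list works the same way, since summarize only iterates over lines.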
Some searching suggests this may be an issue with my switch. Is that
reasonable? Is there a way to get further diagnostics? This cluster
has been in service for a couple years so I'm leaning toward something
being broken instead of configured wrong. Any help would be
appreciated.
Paul
--
Paul Dugas • 522 Black Canyon Park, Canton GA 30114 USA •
Paul at Dugas.cc • +1.404.932.1355