[Linux-cluster] Cluster failure, dlm overload

Thu Apr 5 22:19:15 UTC 2012

Hi,

First of all, thanks for your time.

A five-node cluster sharing several GFS filesystems is suffering complete
blocks of filesystem activity (all I/O hangs), roughly once a week. These
blocks started several weeks ago, after more than three years in service.
The cluster only recovers after all cluster nodes are restarted ;-)

When a block occurs, the dlm send and receive processes show a high level
of CPU consumption, and network traffic is also about ten times the normal
level.

A Wireshark capture of the network traffic on the DLM port shows thousands
of messages per second. In particular, every "request message" is answered
with a "request reply" where errno=EBADR. Lookup messages seem OK.
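
To put numbers on an observation like this, the decoded capture can be
summarized with a small script. What follows is only a minimal sketch in
Python, assuming the messages have already been exported from the capture
as (timestamp, kind, result) tuples. The tuple layout, the kind labels
("request", "request_reply", "lookup") and the function name are
hypothetical and do not correspond to the output of any particular tool.

```python
import errno
from collections import Counter

# EBADR is 53 on common Linux architectures; fall back to that value
# if this platform's errno module does not define it (assumption).
EBADR = getattr(errno, "EBADR", 53)


def summarize_dlm_messages(rows):
    """Summarize decoded DLM messages.

    rows: iterable of (timestamp_seconds, kind, result) tuples.
    This layout is hypothetical; adapt it to the real export format.

    Returns (peak_messages_per_second, fraction_of_replies_with_EBADR).
    """
    per_second = Counter()
    replies = 0
    ebadr_replies = 0
    for timestamp, kind, result in rows:
        # Bucket every message by whole second to get a message rate.
        per_second[int(timestamp)] += 1
        if kind == "request_reply":
            replies += 1
            # Accept either sign convention for the error code.
            if abs(result) == EBADR:
                ebadr_replies += 1
    peak = max(per_second.values(), default=0)
    ratio = ebadr_replies / replies if replies else 0.0
    return peak, ratio


if __name__ == "__main__":
    sample = [
        (0.10, "request", 0),
        (0.20, "request_reply", -EBADR),
        (0.90, "lookup", 0),
        (1.50, "request_reply", 0),
    ]
    peak, ratio = summarize_dlm_messages(sample)
    print(f"peak msgs/s: {peak}, EBADR reply ratio: {ratio:.2f}")
```

Comparing the peak rate and the EBADR ratio between a normal period and a
blocked period would show whether the error replies account for the extra
traffic.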

The cluster runs a somewhat outdated software version, the one shipped with
the Red Hat 2.6.18 kernel, and it cannot be upgraded easily.

Any suggestion is welcome.

Kind regards,
ALU