[Linux-cluster] High DLM CPU usage - low GFS/iSCSI performance
martijn.storck at gmail.com
Thu Feb 24 09:34:01 UTC 2011
We currently have the following RHCS cluster in operation:
- 3 nodes, Xeon CPU, 12 GB RAM, etc.
- 100 Mbit network between the cluster nodes
- Dell MD3200i iSCSI SAN, with 4 Gbit links (dm-multipath) to each server
(through two switches), 5 15k RPM spindles
- 1 GFS1 file system on the above mentioned SAN
2 of the nodes share a single GFS file system, which is used for hosting
virtual machine containers (for web serving, mail and light database work).
We've noticed that performance is suboptimal so we've started to
investigate. The load is not high (we previously ran the same containers on
a single, much cheaper server using local 7200rpm disks and ext3 fs without
issues), but there is a lot of small block I/O.
When I run iptraf (only monitoring the iSCSI traffic) and top side by side
on a single server I often see dlm_send using 100% CPU. During this time I/O
to our gfs filesystem seems to be blocked and container performance goes
down the drain.
My question is: what causes dlm_send to use 100% CPU, and is this what causes
the low GFS performance? Based on what the servers are doing I'm not
expecting any deadlocks (they're mostly accessing separate parts of the
filesystem), so I'm suspecting some other kind of limitation here. Could it
be the 100Mbit network?
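In case it helps anyone reproduce this, here is a minimal sketch of how I could check whether the cluster link saturates while dlm_send is busy, by sampling /proc/net/dev (the interface name eth0 is an assumption; substitute the NIC carrying cluster traffic):

```shell
#!/bin/sh
# Rough throughput check: sample the RX byte counter from /proc/net/dev
# twice and report Mbit/s, to see whether the 100 Mbit cluster link is
# saturated while dlm_send spikes in top.

rx_bytes() {
    # Field 1 of /proc/net/dev is "iface:", field 2 is RX bytes.
    awk -v ifc="$1:" '$1 == ifc { print $2 }' /proc/net/dev
}

mbits() {
    # bytes delta ($1) over seconds ($2) -> Mbit/s (integer arithmetic)
    echo $(( ($1 * 8) / ($2 * 1000000) ))
}

IFACE="${IFACE:-eth0}"   # assumed interface name; override via env
b1=$(rx_bytes "$IFACE")
if [ -n "$b1" ]; then
    sleep 2
    b2=$(rx_bytes "$IFACE")
    echo "$IFACE rx: $(mbits $((b2 - b1)) 2) Mbit/s"
fi
```

Running this during one of the dlm_send bursts should show quickly whether the link is anywhere near the ~100 Mbit ceiling.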
I've looked into the waiters queue using debugfs and it varies between
0 and 60 entries, which doesn't seem too bad to me. The locks table has some
30,000 locks. All DLM and GFS settings are at their defaults. Any hints on where to
look are appreciated!
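For reference, this is roughly how I sample the waiters queue, in case my method is off (the lockspace name "mygfs" is a placeholder; use whatever appears under /sys/kernel/debug/dlm/ on your nodes):

```shell
#!/bin/sh
# Periodically count entries in a DLM waiters dump, to correlate queue
# depth with the dlm_send CPU bursts visible in top.
# Usage: ./waiters.sh [file]   (default: the debugfs waiters file)

count_waiters() {
    # Each line in the *_waiters file is one pending lock request.
    wc -l < "$1"
}

# "mygfs" is an assumed lockspace name; debugfs must be mounted, e.g.:
#   mount -t debugfs none /sys/kernel/debug
FILE="${1:-/sys/kernel/debug/dlm/mygfs_waiters}"
if [ -r "$FILE" ]; then
    for i in 1 2 3 4 5; do
        printf '%s waiters=%s\n' "$(date +%T)" "$(count_waiters "$FILE")"
        sleep 1
    done
fi
```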