[Linux-cluster] High DLM CPU usage - low GFS/iSCSI performance

Martijn Storck martijn.storck at gmail.com
Fri Feb 25 07:21:51 UTC 2011


Thanks for your message.

Somehow the issue has not returned since yesterday, when I applied some
tuning to our GFS filesystem, specifically:

glock_purge 50
demote_secs 100
scand_secs 5
statfs_fast 1
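
For reference, I set these at runtime with gfs_tool on the mounted
filesystem, roughly like this (/mnt/gfs is a placeholder for our actual
mount point):

  # /mnt/gfs stands in for our real GFS mount point
  gfs_tool settune /mnt/gfs glock_purge 50
  gfs_tool settune /mnt/gfs demote_secs 100
  gfs_tool settune /mnt/gfs scand_secs 5
  gfs_tool settune /mnt/gfs statfs_fast 1

As far as I know these settings don't survive a remount, so I'll have to
put them in an init script at some point.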

I'm not sure how this could be related to our DLM issues, but it seems to
have some sort of effect. We're able to push the IOPS limit of our SAN
without problems now, and system load is down to 1.5 from 6.

The cluster traffic is hard to judge since the cluster communicates over our
'Internet' interface (total traffic peaked at about 24Mbit/s). It might be
better to route it over our internal network (once I put a decent switch in
there), but how would I do that? I assume:

1. I should use hostnames that resolve to internal IPs in cluster.conf
2. I should change the bindnetaddr in openais.conf (it's currently set to
the default 192.168.2.0, which is not a subnet we use)
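
Concretely, something like this is what I have in mind (the node names
and the 10.0.0.0 subnet are placeholders for our internal network). In
cluster.conf:

  <clusternodes>
    <!-- placeholder names that resolve to internal IPs -->
    <clusternode name="node1-internal" nodeid="1" votes="1"/>
    <clusternode name="node2-internal" nodeid="2" votes="1"/>
    <clusternode name="node3-internal" nodeid="3" votes="1"/>
  </clusternodes>

and in openais.conf:

  totem {
    interface {
      ringnumber: 0
      # placeholder: the network address of our internal subnet
      bindnetaddr: 10.0.0.0
    }
  }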

Would that do the trick or am I missing something?

Cheers,
Martijn

On Thu, Feb 24, 2011 at 11:02 AM, Steven Whitehouse <swhiteho at redhat.com> wrote:

> Hi,
>
> On Thu, 2011-02-24 at 10:34 +0100, Martijn Storck wrote:
> > Hello everyone,
> >
> >
> > We currently have the following RHCS cluster in operation:
> >
> >
> > - 3 nodes, Xeon CPU, 12 GB of RAM, etc.
> > - 100 Mbit network between the cluster nodes
> > - Dell MD3200i iSCSI SAN, with 4 Gbit links (dm-multipath) to each
> > server (through two switches), 5 15k RPM spindles
> > - 1 GFS1 file system on the above-mentioned SAN
> >
> >
> > 2 of the nodes share a single GFS file system, which is used for
> > hosting virtual machine containers (for web serving, mail and light
> > database work). We've noticed that performance is suboptimal, so we've
> > started to investigate. The load is not high (we previously ran the
> > same containers on a single, much cheaper server using local 7200 RPM
> > disks and an ext3 filesystem without issues), but there is a lot of
> > small-block I/O.
> >
> >
> > When I run iptraf (monitoring only the iSCSI traffic) and top side by
> > side on a single server, I often see dlm_send using 100% CPU. During
> > this time, I/O to our GFS filesystem seems to be blocked and container
> > performance goes down the drain.
> >
> Can you take a 'netstat -t' while the CPU usage is at 100%? That will
> tell us whether there is queued data at that point in time.
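>
> For example, something along these lines, filtered on 21064 (the
> default DLM TCP port, assuming it hasn't been changed):
>
>   netstat -t | grep 21064
>
> A persistently non-zero Send-Q on that connection while dlm_send is
> busy would indicate data queueing up towards the other nodes.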
>
> >
> > My question is: what causes dlm_send to use 100% CPU, and is this what
> > causes the low GFS performance? Based on what the servers are doing
> > I'm not expecting any deadlocks (they're mostly accessing separate
> > parts of the filesystem), so I'm suspecting some other kind of
> > limitation here. Could it be the 100 Mbit network?
> >
> Well, that depends on how much traffic there is... have you measured the
> traffic when the problem is occurring?
>
> >
> > I've looked into the waiters queue using debugfs and it varies
> > between 0 and 60 entries, which doesn't seem too bad to me. The locks
> > table has some 30,000 locks. All DLM and GFS settings are defaults.
> > Any hints on where to look are appreciated!
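> >
> > In case it matters, I'm reading these numbers from the dlm debugfs
> > files, roughly like this ("ourfs" is a placeholder for our actual
> > lockspace name):
> >
> >   mount -t debugfs none /sys/kernel/debug
> >   wc -l /sys/kernel/debug/dlm/ourfs_waiters   # entries in the waiters queue
> >   wc -l /sys/kernel/debug/dlm/ourfs_locks     # entries in the lock table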
> >
> It does sound like a performance issue, and it shouldn't be too hard to
> get to the bottom of what is going on.
>
> Steve.
>
> >
> > Regards,
> >
> >
> > Martijn Storck

