[Linux-cluster] Re: write's pausing - which tools to debug?
Axel Thimm
Axel.Thimm at ATrpms.net
Wed Oct 19 10:48:16 UTC 2005
Hi,
On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:
> We've been having some problems with writes to our GFS file
> system: they pause for long periods (from 5 to 10 seconds, up to
> 30 seconds, and occasionally 5 minutes). After the pause, it's
> as if nothing happened; whatever the process is just keeps going, happy
> as can be.
> Except for these pauses, our GFS is quite zippy, both reads and writes.
> But these pauses are holding us back from going full production.
> I need to know what tools I should use to figure out what is causing
> these pauses.
>
> Here is the setup.
> All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel
> 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34
>
> I have no ability to do fencing yet, so I chose to use the gulm locking
> mechanism. I have it set up so that there are 3 lock servers, for
> failover. I have tested the failover, and it works quite well.
If this is a testing environment, use manual fencing: when a node
needs to be fenced, you get a log message asking you to fence it by
hand and then acknowledge that you have done so.
> I have 5 machines in the cluster. 1 isn't connected to the SAN, or
> using GFS. It is just a failover gulm lock server in case the other two
> lock servers go down.
>
> So I have 4 machines connected to our SAN and using GFS. 3 are
> read-only, 1 is read-write. If it is important, the 3 read-only are
> x86_64, the 1 read-write and the 1 not connected are i386.
>
> The read/write machine is our master lock server. Then one of the
> read-only is a fallback lock server, as is the machine not using GFS.
>
> Anyway, we're getting these pauses when writing, and I'm having a hard
> time tracking down where the problem is. I *think* that we can still
> read from the other machines. But since this comes and goes, I haven't
> been able to verify that.
What SAN hardware is attached to the nodes?
> Anyway, which tools do you think would be best in diagnosing this?
I'd suggest checking and monitoring the network. Also place the cluster
communication on a separate network from the SAN/LAN network. The
cluster heartbeat goes over UDP, and a congested network may delay
these packets or drop them completely. At least that's the CMAN
picture, lock_gulm may be different.
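
To line the pauses up with whatever the network monitoring shows, it
may help to record exactly when a write stalls and for how long. Below
is a minimal sketch of such a probe, not a GFS tool: it appends a small
block to a file, calls fsync, and notes any write that takes longer
than a threshold. The names timed_write and probe, the 4 KiB payload,
and the 2 second threshold are all assumptions made for illustration.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Append payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data out so a stall shows up here
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, count, interval=1.0, threshold=2.0, sleep=time.sleep):
    """Do `count` timed writes, `interval` seconds apart.

    Returns a list of (wall-clock timestamp, elapsed seconds) for every
    write that took at least `threshold` seconds.
    """
    stalls = []
    for _ in range(count):
        elapsed = timed_write(path)
        if elapsed >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            stalls.append((stamp, elapsed))
            print(f"{stamp} write stalled for {elapsed:.1f}s")
        sleep(interval)
    return stalls
```

For example, probe("/mnt/gfs/.write_probe", count=3600) would run for
about an hour on the read-write node (the path is a hypothetical mount
point). The timestamps it prints can then be compared with the lock
server logs and the network statistics from the same period.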
Also, don't mix RHEL4 U1 nodes with U2 or FC<N> nodes, in case you were
planning to upgrade to SL4.2 one node at a time.
There have been many changes/bug fixes to the cluster bits in RHELU2,
and there are also some new spiffy features like multipath. Perhaps
it's worth rebasing your testing environment?
--
Axel.Thimm at ATrpms.net