[Linux-cluster] Re: write's pausing - which tools to debug?
Troy Dawson
dawson at fnal.gov
Fri Oct 21 13:18:28 UTC 2005
Axel Thimm wrote:
> Hi,
>
> On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:
>
>>We've been having some problems with writes to our GFS file
>>system: they pause for long periods (anywhere from 5-10 seconds, to
>>30 seconds, and occasionally 5 minutes). After the pause, it's as if
>>nothing happened; whatever the process is just keeps going, happy
>>as can be.
>>Except for these pauses, our GFS is quite zippy, both reads and writes.
>> But these pauses are holding us back from going full production.
>>I need to know what tools I should use to figure out what is causing
>>these pauses.
>>
>>Here is the setup.
>>All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel
>>2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34
>>
>>I have no ability to do fencing yet, so I chose to use the gulm locking
>>mechanism. I have it set up so that there are 3 lock servers, for
>>failover. I have tested the failover, and it works quite well.
>
>
> If this is a testing environment, use manual fencing. E.g. if a node
> needs to get fenced, you get a log message telling you to fence it by
> hand and then acknowledge that you have done so.
>
>
>>I have 5 machines in the cluster. 1 isn't connected to the SAN, or
>>using GFS. It is just a failover gulm lock server in case the other
>>two lock servers go down.
>>
>>So I have 4 machines connected to our SAN and using GFS. 3 are
>>read-only, 1 is read-write. If it is important, the 3 read-only are
>>x86_64, the 1 read-write and the 1 not connected are i386.
>>
>>The read/write machine is our master lock server. Then one of the
>>read-only machines is a fallback lock server, as is the machine not
>>using GFS.
>>
>>Anyway, we're getting these pauses when writing, and I'm having a hard
>>time tracking down where the problem is. I *think* that we can still
>>read from the other machines. But since this comes and goes, I haven't
>>been able to verify that.
>
>
> What SAN hardware is attached to the nodes?
>
>
From the switch on down, I don't know. It's a centrally managed SAN
that I have been allowed to plug into and been given disk space on. I
do have Qlogic cards in the machines.
>>Anyway, which tools do you think would be best in diagnosing this?
>
>
> I'd suggest checking/monitoring the network. Also, place the cluster
> communication on a separate network from the SAN/LAN network. The
> cluster heartbeat goes over UDP, and a congested network may delay
> these packets or drop them completely. At least that's the CMAN
> picture; lock_gulm may be different.
>
That sounds like a good idea. All of our machines have two ethernet
ports, and I'm not using the second one on any of them. That would
actually fix some other problems as well.
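[Editor's note: the monitoring suggestion above can be sketched with stock tools. This is a hedged sketch, not from the original thread; PEER is a placeholder hostname, and since lock_gulm may not use UDP the way CMAN does, the second step finds gulm's actual listening port rather than assuming one.]

```shell
# PEER is a placeholder: set it to another cluster member's hostname.
PEER=${PEER:-localhost}

# 1. Kernel UDP counters: climbing 'packet receive errors' during a
#    pause means the network is dropping datagrams under load.
netstat -su 2>/dev/null | grep -i udp

# 2. lock_gulm may not use UDP at all (CMAN does); find the port
#    lock_gulmd really listens on before trying to sniff it with tcpdump:
netstat -tlnp 2>/dev/null | grep lock_gulmd

# 3. Quick loss check against a peer; the awk pulls just the loss
#    figure out of ping's summary line.
ping -c 5 -i 0.2 "$PEER" 2>/dev/null | awk -F', ' '/packet loss/ {print $3}'
```

Running step 3 in a loop on both networks while a pause is in progress would show whether the stalls line up with loss on the shared link.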
> Also, don't mix RHEL U1 and U2 or FC<N> nodes in one cluster, just in
> case you'd like to upgrade to SL4.2 one node at a time.
>
Yup, read that, but thanks for the reminder.
> There have been many changes/bug fixes to the cluster bits in RHELU2,
> and there are also some new spiffy features like multipath. Perhaps
> it's worth rebasing your testing environment?
>
Don't I wish it were a testing environment. But at least the machines
don't HAVE to be 24x7, and I've only got one of them in production
right now, so it's only the one going down.
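[Editor's note: for catching the pauses themselves, a low-tech probe is to time small synchronous writes in a loop and log the outliers. This sketch is not from the thread; TARGET is a placeholder to point at the GFS mount, and the bounded loop stands in for a long-running `while true`.]

```shell
# TARGET is a placeholder: point it at a file on the GFS mount
# (e.g. /mnt/gfs/.pausetest); /tmp is used here only as a stand-in.
TARGET=${TARGET:-/tmp/gfs_pausetest}

# Each pass writes and fsyncs one 4 KiB block.  During a stall the
# elapsed time jumps from under a second to many seconds, leaving a
# timestamped record to line up against network and lock-server logs.
for i in 1 2 3; do        # in practice: while true; do ... done
    start=$(date +%s)
    dd if=/dev/zero of="$TARGET" bs=4k count=1 conv=fsync 2>/dev/null
    end=$(date +%s)
    echo "$(date '+%F %T') write took $((end - start))s"
    sleep 1
done
rm -f "$TARGET"
```

When a pause is caught in the act, attaching `strace -T -p <pid>` to the writing process shows which system call is actually blocking.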
Troy
--
__________________________________________________
Troy Dawson dawson at fnal.gov (630)840-6468
Fermilab ComputingDivision/CSS CSI Group
__________________________________________________