[Linux-cluster] Meaning of Cluster Cycle and timeout problems - GFS 100% cpu utilization

Peter p.elmers at gmx.de
Mon Apr 21 08:53:04 UTC 2008


Hi,

Thanks for the fast response!

It looks like GFS causes 100% CPU utilization, and therefore the qdiskd
process gets no processor time.

Is this a known problem and has anyone seen such behavior before?

We are using RHEL 4.5 with the following packages:

ccs-1.0.11-1.x86_64.rpm
cman-1.0.17-0.x86_64.rpm
cman-kernel-2.6.9-53.5.x86_64.rpm
dlm-1.0.7-1.x86_64.rpm
dlm-kernel-2.6.9-52.2.x86_64.rpm
fence-1.32.50-2.x86_64.rpm
GFS-6.1.15-1.x86_64.rpm
GFS-kernel-2.6.9-75.9.x86_64.rpm
gulm-1.0.10-0.x86_64.rpm
iddev-2.0.0-4.x86_64.rpm
lvm2-cluster-2.02.27-2.el4.x86_64.rpm
magma-1.0.8-1.x86_64.rpm
magma-plugins-1.0.12-0.x86_64.rpm
perl-Net-Telnet-3.03-3.noarch.rpm
rgmanager-1.9.72-1.x86_64.rpm
system-config-cluster-1.0.51-2.0.noarch.rpm

The kernel is 2.6.9-55.

Thanks for reading and answering,


Peter

On 17.04.2008 at 20:54, Lon Hohberger wrote:

> On Thu, 2008-04-17 at 09:08 +0200, Peter wrote:
>> Hi!
>>
>> In our Cluster we have the following entry in the "messages" logfile:
>>
>> "qdiskd[4314]: <warning> qdisk cycle took more than 3 seconds to
>> complete (3.890000)"
>
> It means it took more than 3 seconds for one qdiskd cycle to complete.
> This is a whole lot:
>
>   8192 bytes in 16 block reads
>   some internal calculations
>   512  bytes in 1 block write
>
> (that's it...)
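The per-cycle I/O volume Lon describes can be mimicked with dd against a scratch file; the file path and block layout below are assumptions for the demo, and nothing here touches a real quorum device:

```shell
# Sketch of one qdiskd cycle's I/O volume, run against a scratch file
# instead of the real quorum device (/dev/sdh). Path is hypothetical.
QDEV=/tmp/qdisk-demo
dd if=/dev/zero of="$QDEV" bs=512 count=32 2>/dev/null   # create scratch "device" (32 blocks)
dd if="$QDEV" of=/dev/null bs=512 count=16 2>/dev/null   # 16 x 512 B block reads = 8192 bytes
dd if=/dev/zero of="$QDEV" bs=512 count=1 conv=notrunc 2>/dev/null  # 1 x 512 B block write
stat -c %s "$QDEV"   # size unchanged: 32 blocks of 512 bytes
```

On healthy storage this handful of small reads and one write completes in milliseconds, which is why multi-second cycle times point at starvation rather than workload.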
>
>
>> These messages are very frequent. I cannot find anything except the
>> source code via Google, and I am sorry to say that I am not familiar
>> enough with C to get the point.
>>
>>
>> We also have sometimes a quorum timeout:
>>
>> "kernel: CMAN: Quorum device /dev/sdh timed out"
>>
>>
>> Are these two messages independent, and what is the meaning of the
>> first message?
>
>
> No, they're 100% related.  It sounds like qdiskd is getting starved
> for I/O to /dev/sdh, or possibly it's getting CPU-starved for some reason.
> Being that it's more or less a real-time program which helps keep the
> cluster running, that's bad!  In your case, it's getting hung up for
> longer than the cluster failover time, so CMAN thinks qdiskd has died.
> Not good.
>
>
> (1) Turn *off* status_file if you have it enabled!  It's for debugging,
> and under certain load patterns, it can really slow down qdiskd.
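As a hedged illustration of point (1), a cluster.conf quorumd stanza without the status_file attribute might look like the fragment below; the interval, tko, votes, and device values are placeholders, not the poster's actual configuration:

```xml
<!-- Hypothetical cluster.conf fragment: no status_file attribute.
     If your <quorumd> carries status_file="/tmp/qdisk_status" or
     similar, removing that attribute disables the debug output. -->
<quorumd interval="3" tko="5" votes="1" device="/dev/sdh"/>
```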
>
>
> (2) If you think it's I/O, what you should try is (assuming you're using
> cluster2/rhel5/centos5/etc. here):
>
>  echo deadline > /sys/block/sdh/queue/scheduler
>
> If you had a default of 10 seconds (1 interval 10 tko), you should also
> do:
>
>  echo 2500 > /sys/block/sdh/queue/iosched/write_expire
>
> ... you've got at least 3 for interval, so I'm not sure this would apply
> to you.
>
> [NOTE: On rhel4/centos4/stable, I think you have to set the I/O
> scheduler globally in the kernel command line at system boot.]
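For the RHEL4-era approach Lon notes, the global I/O scheduler is selected with the `elevator=` kernel parameter at boot. A sketch of a GRUB stanza follows; the kernel version matches the poster's 2.6.9-55 kernel, but the root device and partition are placeholders:

```
# Hypothetical /boot/grub/grub.conf stanza for RHEL4:
# set the I/O scheduler globally via the kernel command line.
title Red Hat Enterprise Linux (2.6.9-55.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-55.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline
        initrd /initrd-2.6.9-55.EL.img
```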
>
>
> (3) If you think qdiskd is getting CPU starved, you can adjust the
> 'scheduler' and 'priority' values in cluster.conf to something
> different.  I think the man page might be wrong; I think the highest
> 'priority' value for the 'rr' scheduler is 99, not 100.  See the
> qdisk(5) man page for more information on those.
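Point (3) would translate to something like the fragment below; the scheduler and priority values follow Lon's note (SCHED_RR capped at 99), while the other attributes are placeholder assumptions:

```xml
<!-- Hypothetical fragment: run qdiskd under the round-robin real-time
     scheduler at the highest priority Lon suggests is valid (99).
     See qdisk(5) for the documented attribute values. -->
<quorumd interval="3" tko="5" votes="1" device="/dev/sdh"
         scheduler="rr" priority="99"/>
```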
>
> -- Lon
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
