[Linux-cluster] GFS2 and D state HTTPD processes

Fri Sep 25 08:46:34 UTC 2009

Hi,

On Thu, 2009-09-24 at 11:30 +0100, Gavin Conway wrote:
> Hi All,
>
> We have 6 nodes running GFS2 under CentOS 5.3, all connecting via
> Cisco 2960G switches to an MD3000i with 8 x 146GB SAS 15K drives.
> These nodes run a PHP website, pulling their PHP and image files from
> a GFS2 volume exported over iSCSI from the MD3000i.
> 
>  
> 
> The problem we have is that since inception we’ve seen issues whereby
> the HTTPD processes go into state ‘D’ (‘zombied’), and the only way
> we have to recover from that is to restart all the nodes in the
> cluster.
> 
>  
> 
> I’ve tuned the demote_secs down from 300 to 20 seconds on the
> assumption that file locking is causing an issue.
That is unlikely to make any meaningful change and in fact it could well
hurt performance, depending on the workload.

>  
>
> Similarly, we’re running with the following GFS values:
> 
>  
> 
>         <gfs_controld plock_ownership="1" plock_rate_limit="0"/>
> 
Try turning off plock_ownership (plock_ownership="0") and see if that
fixes the problem.

>  
> 
> Can anyone give me some pointers on what we should be investigating
> for why this is failing? I’ve had our networks team crawl over the
> networking and that all seems fine. The MTU is set correctly on the
> MD3000i and on the individual nodes. I’ve also used the ping_pong tool
> and on a single file on the GFS cluster we can get around 90K locks per
> second. If I run ping_pong against the same file from two nodes that
> then drops to around 70 locks per second. I don’t think that’s the
> issue though.
> 
>  
> 
> If anyone can provide some insight to either what to change, what to
> debug or how to investigate this further it’d be greatly appreciated.
> 
>  
There are two things to look at. One is back traces from processes (echo
't' > /proc/sysrq-trigger, which writes them to the kernel log) and the
other is the glock dump from /sys/kernel/debug/fs/gfs2/glocks. The first
tells us what is hanging and the second (hopefully) why. Look for glocks
with 'W' in the flags field (f:) for their holders (H:) and it should be
possible to correlate them with the processes which are stuck.
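
As a rough illustration of that correlation step, the Python sketch below
scans a glock dump for holder (H:) lines whose flags field contains 'W' and
reports the pid and process name, which can then be matched against the
back traces. It assumes holder lines look like
"H: s:<state> f:<flags> e:<error> p:<pid> [<comm>] ..."; that layout, the
function name find_waiting_holders and the regular expression are
assumptions made for illustration, so check them against the dump your
kernel actually produces.

```python
import re

# Assumed holder line layout (verify against your own dump):
#   H: s:SH f:aW e:0 p:4470 [httpd] gfs2_permission+0x60/0xb0 [gfs2]
HOLDER_RE = re.compile(
    r"^\s*H:\s+s:(\S+)\s+f:(\S*)\s+e:(-?\d+)\s+p:(\d+)\s+\[([^\]]*)\]"
)


def find_waiting_holders(dump_text):
    """Return one dict per holder whose flags field contains 'W'.

    Each dict records the glock (G:) line the holder belongs to, the
    requested state, the raw flags, the pid and the process name.
    """
    waiting = []
    current_glock = None
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("G:"):
            # Remember which glock the following H: lines belong to.
            current_glock = stripped
        elif stripped.startswith("H:"):
            match = HOLDER_RE.match(line)
            if match and "W" in match.group(2):
                waiting.append({
                    "glock": current_glock,
                    "state": match.group(1),
                    "flags": match.group(2),
                    "pid": int(match.group(4)),
                    "comm": match.group(5),
                })
    return waiting
```

Feeding it the contents of the glocks file (read as root, with debugfs
mounted) gives a list of pids to look up in the back traces. Holder lines
that do not match the assumed layout are skipped rather than guessed at.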

Do you get any messages in the syslog?

Steve.

> 
>  
> 
> Thanks
> Gavin