[Linux-cluster] File system slow & crash

Thu Apr 22 09:56:25 UTC 2010

Hi,

On Thu, 2010-04-22 at 02:29 +0700, Somsak Sriprayoonsakul wrote:
> Just noticed that one node is using kernel version
> 2.6.18-164.15.1.el5. Not sure if the difference has any effect.
> 
> On Thu, Apr 22, 2010 at 2:27 AM, Somsak Sriprayoonsakul
> <somsaks at gmail.com> wrote:
>         Hello,
>         
>         We are using GFS2 on a 3-node cluster, kernel
>         2.6.18-164.6.1.el5, RHEL/CentOS5, x86_64 with 8-12GB memory in
>         each node. The underlying storage is an HP 2312fc smart array
>         equipped with 12 SAS 15K rpm disks, configured as RAID10 using
>         10 HDDs + 2 spares. The array has about 4GB cache. Communication
>         is 4Gbps FC, through an HP StorageWorks 8/8 Base e-port SAN
>         Switch.
>         
>         Our application is apache version 1.3.41, mostly serving
>         static HTML files + a few PHP scripts. Note that we had to
>         downgrade to 1.3.41 due to an application requirement. Apache
>         is configured with 500 MaxClients. Each HTML file is placed in
>         a different directory. The PHP script modifies HTML files and
>         does some locking prior to HTML modification. We use
>         round-robin DNS to load balance between the web servers.
>         
Is the PHP script creating new html files (and therefore also new
directories) or just modifying existing ones?

Ideally you'd set up the system so that accesses to a particular html
file all go to the same node under normal circumstances and only fail
over to a different node if that particular node fails. That way you
will ensure locality of access under normal conditions and thus get the
maximum benefit from the cluster filesystem.
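
As a rough illustration of that kind of per-file affinity, here is a
minimal Python sketch. It uses rendezvous (highest-random-weight)
hashing, which is my choice for the example and not something the
original setup uses. The node names and the pick_node() helper are
hypothetical, and how the set of live nodes is obtained (health checks,
load balancer state) is assumed to be handled elsewhere.

```python
import hashlib

# Hypothetical node names; substitute the real web servers.
NODES = ["web1", "web2", "web3"]


def _score(node, path):
    """Deterministic pseudo-random weight for a (node, path) pair."""
    digest = hashlib.sha256(f"{node}:{path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def preferred_nodes(path, nodes=NODES):
    """Return all nodes ordered from most to least preferred for this path."""
    return sorted(nodes, key=lambda n: _score(n, path), reverse=True)


def pick_node(path, alive, nodes=NODES):
    """Pick the most preferred node that is currently alive.

    The same path always maps to the same node while that node is up,
    and moves to its next preference only when that node is down.
    """
    for node in preferred_nodes(path, nodes):
        if node in alive:
            return node
    raise RuntimeError("no live node available")
```

With all three nodes up, pick_node("/site/a/index.html", set(NODES))
returns the same node on every call. If that node is removed from the
live set, only the paths that preferred it are reassigned; paths owned
by the other nodes stay where they were. A front-end proxy or load
balancer would need to apply this mapping, since DNS round robin cannot.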

From your description I suspect that it's the I/O pattern across nodes
which is causing the main problem you describe. I suspect that the DNS
round robin is making the situation worse, since it will be effectively
randomly assigning requests to nodes.

Having said that, killing processes using GFS2 or trying to umount it
should not cause an oops. The kill may be ignored for processes in
'D' (uninterruptible sleep) and likewise the umount may fail with
-EBUSY, but any oops is a bug. Please report it via Red Hat's bugzilla.
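
If it helps when gathering information for the report, processes stuck
in 'D' state can be listed by reading /proc/<pid>/stat on Linux. The
following is only a sketch; the function names are my own, and it
assumes the usual "pid (comm) state ..." layout of that file.

```python
import os


def parse_stat(line):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so search for the last ')' in the line.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc="/proc"):
    """Return (pid, comm) for every process in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were reading, or the line
            # did not have the expected layout; skip it.
            continue
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling d_state_processes() on an affected node returns the pids and
command names that will not respond to kill until their I/O completes.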

Using the num_glockd= command line parameter is not recommended with
GFS2 (in fact it doesn't exist/is ignored in more recent versions) and
setting data=writeback may or may not actually improve performance (it
depends upon the individual workload) but it does increase the
possibility of seeing corrupt data if there is a crash. I would
generally caution against using data=writeback except in very special
cases.

Steve.