[Linux-cluster] GFS2 directory hangs on one node CentOS 5.3

Mon Sep 28 10:21:27 UTC 2009

Hi,

On Mon, 2009-09-28 at 12:13 +0200, Libor Tomsik wrote:
> Hi,
> >Hi,
> >
> >On Sat, 2009-09-26 at 18:29 +0200, Libor Tomsik wrote:
> >> Hi all,
> >>
> >> I'm having a strange issue with a two-node cluster based on Xen
> >> virtual hosts with a shared disk on CLVM. The servers are running Apache
> >> and one is considered a hot backup. On that node, awstats statistics are
> >> generated from the Apache custom logs stored on the shared device. Web
> >> data, logs, configs and awstats results are in different directories
> >> within the same GFS2 volume.
> >>
> >> Everything works fine, but sometimes (in the production environment, damn)
> >> the directory with the logs gets frozen on the spare node running awstats.
> >> All commands like ls, cd and mc on that directory end up in state D. On the
> >> other node everything works fine. Other directories seem unaffected too.
> >>
> >> I cannot umount the fs, nor remount it ro and back to rw, since there
> >> are "running" processes in D state.
> >>
> >> Can someone give me some advice on how to prevent this problem, and
> >> how to recover from it? It is production with an SLA :(  Next
> >> time, I'll try to make a lockdump on both nodes.
> >>
> >> Kernel is 2.6.18-128.1.10.el5xen, gfs2-utils-0.1.53-1.el5_3.2,
> >> kmod-gfs2-xen-1.92-1.1.el5_2.2
> >>
> >> Regards
> >>
> >> Libor
> >>
> >That sounds to me like there is a lot of activity from both nodes
> >relating to the same directory. Can you split the logs of the two nodes
> >into two different directories? That will probably solve the problem.
> >
> Actually there is just one Apache writing, on one server, though in many
> threads. Maybe this is the problem? I have about 40 sites hosted
> there, so 2x40 separate log files.
> The second node just periodically reads this directory.
> 
That can still cause a problem. The second node will require a shared
lock on the directory, so if there is any file creation going on, it
will be dramatically slowed down by that. Is it possible to stop the
second node's I/O to check that?
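To check whether stopping the second node's I/O lets the hang clear, it helps to list which processes are stuck in uninterruptible sleep (state D) and the kernel function each one is sleeping in, roughly what `ps -eo pid,stat,wchan,cmd` shows. The sketch below is a minimal example under stated assumptions: it needs a Python 3 interpreter (CentOS 5.3's system Python is 2.4, so this would have to run from a separately installed Python 3), and the helper names `parse_stat`, `read_text` and `d_state_processes` are hypothetical, not part of any GFS2 tool. It only reads `/proc/<pid>/stat`, `cmdline` and `wchan`.

```python
import os


def parse_stat(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the state is read after the *last* ')'.
    """
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    return stat_line[start + 1:end], stat_line[end + 1:].split()[0]


def read_text(path, default="?"):
    """Read a small /proc file; return `default` if it vanished or is unreadable."""
    try:
        with open(path, "rb") as f:
            return f.read().decode(errors="replace")
    except OSError:
        return default


def d_state_processes(proc_root="/proc"):
    """Yield (pid, command, wchan) for every process in state D."""
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        stat = read_text(os.path.join(proc_root, entry, "stat"), default="")
        if not stat:
            continue  # process exited while we were scanning
        comm, state = parse_stat(stat)
        if state != "D":
            continue
        argv = read_text(os.path.join(proc_root, entry, "cmdline"), default="")
        # Kernel threads have an empty cmdline; fall back to "[comm]".
        command = " ".join(a for a in argv.split("\0") if a) or "[%s]" % comm
        # wchan names the kernel function the task sleeps in, which hints
        # at whether it is waiting on a cluster lock or on something else.
        wchan = read_text(os.path.join(proc_root, entry, "wchan"))
        yield int(entry), command, wchan


if __name__ == "__main__":
    for pid, command, wchan in d_state_processes():
        print("%6d  %-24s  %s" % (pid, wchan, command))
```

Running it on the awstats node before and after stopping that node's jobs shows whether the D-state `ls`/`cd` processes go away once the reader stops competing for the directory lock.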

There shouldn't really be a big issue with lots of threads, provided they
are all on the same node, as is the case here.
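The point about the shared lock, and why many threads on one node are harmless, can be illustrated with a toy model. This is a sketch under loudly stated assumptions and is not GFS2's glock implementation: each node caches a single lock on the directory in mode UN, SH or EX; creating a file needs EX, listing the directory needs SH; a request that conflicts with another node's cached mode forces that node to drop to a compatible mode first, which is counted as one cross-node demotion. The names `Directory`, `acquire` and `simulate` are hypothetical.

```python
# Toy model only: this is NOT GFS2's glock code. Each node caches one lock
# on the directory in mode UN (unlocked), SH (shared) or EX (exclusive).
UN, SH, EX = "UN", "SH", "EX"


def compatible(a, b):
    """Modes held by two different nodes may coexist unless one is EX."""
    return UN in (a, b) or EX not in (a, b)


class Directory:
    def __init__(self, nodes):
        self.held = {node: UN for node in nodes}
        self.remote_demotions = 0  # times another node had to give up its lock

    def acquire(self, node, mode):
        # A node that already caches a strong enough mode serves the request
        # locally, so many threads on ONE node cause no cross-node traffic.
        if self.held[node] in (mode, EX):
            return
        for other, other_mode in self.held.items():
            if other != node and not compatible(mode, other_mode):
                # The conflicting node must drop to a compatible mode first.
                self.held[other] = UN if mode == EX else SH
                self.remote_demotions += 1
        self.held[node] = mode


def simulate(creates, read_every):
    """Count cross-node demotions when the 'web' node creates files
    (needs EX) and the 'stats' node lists the directory (needs SH)
    once every `read_every` creations (0 = never)."""
    d = Directory(["web", "stats"])
    for i in range(creates):
        d.acquire("web", EX)
        if read_every and i % read_every == 0:
            d.acquire("stats", SH)
    return d.remote_demotions


if __name__ == "__main__":
    print(simulate(100, 0), simulate(100, 10), simulate(100, 1))  # -> 0 20 199
```

With no reader there is no cross-node traffic at all; one listing every ten creations costs two lock transfers per listing, and a reader that polls as often as files are created turns nearly every creation into a round trip. Real GFS2 differs in many details, so the numbers only illustrate the shape of the problem.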

Steve.

> >This kind of problem is tricky to debug since the glock dumps will tell
> >you what state the glocks are currently in, and not what has been
> >happening in the past.
> >
> >In the upstream code we've now got GFS2 tracepoints which will help in
> >tracking down issues like this, but those are not in RHEL yet.
> >
> >Steve.
> >
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> Regards
> 
> Libor.