[Linux-cluster] Httpd Process io blocked

Tue Mar 7 14:12:36 UTC 2006

2006/3/7, Marc Grimme <grimme at atix.de>:
> Sebastien,
> On Tuesday 07 March 2006 12:35, Sébastien DIDIER wrote:
> > 2006/3/7, Marc Grimme <grimme at atix.de>:
> > > Hi,
> > > to debug you could use strace. E.g. executing strace -p 14970 will
> > > probably show you that the process is waiting for a lock, as the ps
> > > output already does. My first guess would be that you use apache with
> > > php and sessions.
> >
> > Thanks. But strace doesn't output anything and becomes immune to Ctrl-C. It
> > needs a SIGKILL to exit, and the traced process stays in T state. It
> > seems that it can't retrieve the last system call while the process
> > is in D state.
> Hmm, sounds like I've heard that already. If you trace the root httpd with -f
> and -t and look out for long time gaps you'll probably find processes
> waiting for locks. The D state is a good indicator (ps ax | grep " D " and
> look at the pids). Do the pids of the D processes change from time to time or
> do they stay the same pids?

Marc,

All the blocked processes have the same pid since the beginning of
this issue. (22 hours by now)
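
For reference, the `ps ax | grep " D "` check can also be scripted so it can be run repeatedly to see whether the pids change. The following is only a minimal sketch, assuming a Linux /proc layout; the function names (`parse_stat`, `blocked_processes`) are made up for illustration, and /proc/<pid>/wchan may be empty or unreadable depending on the kernel and permissions.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state

def read_text(path):
    """Read a small /proc file, returning '' if it vanished or is unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def blocked_processes(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every process in uninterruptible sleep (D)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_text = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat_text:
            continue  # process exited between listdir and read
        pid, comm, state = parse_stat(stat_text)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan")) or "-"
            yield pid, comm, wchan

if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(f"{pid:>6} {comm:<16} {wchan}")
```

Running it a few minutes apart and comparing the pid column shows whether the same processes stay blocked.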

> >
> > > If so, the phplib uses flocks for locking the session-ids. Normally it
> > > happens that one process locks a session. If another process comes along
> > > to get an flock on that session it has to wait until the earlier flock is
> > > released. It very often happens that the other process gets that flock when
> > > the client and session are not available any more. Then the flock is held
> > > until the apache process times out.
> >
> > I don't think it is session related because I store session files
> > outside the GFS mount point (/tmp) and I run a load balancer based
> > upon the source address (to always send requests to the same server and
> > thus keep sessions)
> Yes, I agree. Sessions get lost if the node fails, right?

Yes. That may be a problem for some apps... But it is easier (and more
efficient) than storing session data in SQL.

> >
> > But we are using mysql query caching (with some libraries like AdoDb)
> > inside the GFS mount point. Do you think it could be the cache files
> > which are deadlocked?
> It depends on how those files are locked and how and when the locks are set
> and released. If a lock is set at apache-child forktime and released at
> process terminate time, then yes that could happen. If only accesses to data
> of those files are protected with flocks then it should perform quite well.
>
> Is that query caching part of perl-adodb or is it implemented by yourselves?

It appears that we are using a very common PHP AdoDB abstraction class
without any change in the code.
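
To make the distinction above concrete (a lock held for the whole life of the process versus one held only around the file access), here is a minimal sketch of the second pattern. It is not the AdoDB code, which is PHP; `read_cache_locked` and its parameters are hypothetical, and it assumes the filesystem honours non-blocking flock requests, which I have not verified on GFS. It also would not help a process that is already stuck inside the kernel in D state.

```python
import fcntl
import time

def read_cache_locked(path, timeout=5.0, poll=0.1):
    """Read a cache file under a shared flock, giving up after `timeout` seconds.

    Returns the file contents, or None if the lock could not be obtained
    in time (the caller would then skip the cache and query the database).
    """
    deadline = time.monotonic() + timeout
    with open(path, "rb") as f:
        while True:
            try:
                # LOCK_NB makes flock fail immediately instead of blocking.
                fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    return None
                time.sleep(poll)
        try:
            return f.read()
        finally:
            # Release as soon as the read is done, not at process exit.
            fcntl.flock(f, fcntl.LOCK_UN)
```

The point of the design is that the lock lifetime is bounded by the read itself, so a slow or vanished client cannot keep other apache children waiting on the same cache file.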

When I run "lsof -p" on each blocked process on the two nodes, each
one has exactly the same file open:
apache  23327 www-data   10r   REG  253,0      2128  5053927 /home/sites/website/web/queryCache/ca/adodb_cad1702c2e5d18a71d765e95bf55ea3b.cache (deleted)
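
The same check can be done without lsof by reading the descriptor links under /proc. This is only a sketch, assuming Linux, where the kernel appends " (deleted)" to the link target of an unlinked file; `deleted_open_files` is a name made up here, and reading another user's /proc/<pid>/fd normally needs root.

```python
import os

DELETED_SUFFIX = " (deleted)"

def deleted_open_files(pid, proc_root="/proc"):
    """Return a sorted list of (fd, path) for unlinked files a process holds open."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed while we were scanning
        # Note: a file whose real name ends in " (deleted)" would also match.
        if target.endswith(DELETED_SUFFIX):
            found.append((int(fd), target[: -len(DELETED_SUFFIX)]))
    return sorted(found)

if __name__ == "__main__":
    import sys
    for fd, path in deleted_open_files(int(sys.argv[1])):
        print(fd, path)
```

Called with one of the blocked pids, it should list the same cache file that lsof reports.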

>
> Have a look and play with strace and watch out for long delays and the
> syscalls involved. I would expect you to end up with
> flock timeouts.
>
> Hope that helps,
> regards Marc.
> >
> > > We have made a patch for a better locking with php which you can find on
> > > http://www.open-sharedroot.org in the downloads section.
> > > Hope that helps
> > > Regards Marc.
> > >
> > > On Tuesday 07 March 2006 11:50, Sébastien DIDIER wrote:
> > > > Hi,
> > > >
> > > > I'm running a two-node GFS cluster which hosts web sites. The GFS
> > > > partition is on an iSCSI device and, for now, I'm using manual fencing.
> > > >
> > > > Today, I got 5 httpd processes on both nodes which got stuck in an IO
> > > > blocking state. I suspected GFS filesystem corruption but I haven't
> > > > got any output from the kernel. I ran a fsck two days ago after a
> > > > power failure.
> > > >
> > > > Here's the wait state of the processes (same on the other node).
> > > >
> > > > # ps -o pid,tt,user,fname,wchan -C apache
> > > >   PID TT       USER     COMMAND  WCHAN
> > > >  4426 ?        root     apache   -
> > > > 14970 ?        www-data apache   glock_wait_internal
> > > > 15103 ?        www-data apache   glock_wait_internal
> > > > 16780 ?        www-data apache   glock_wait_internal
> > > > 16959 ?        www-data apache   glock_wait_internal
> > > > 14936 ?        www-data apache   finish_stop
> > > > 12859 ?        www-data apache   -
> > > > 13005 ?        www-data apache   -
> > > > 13311 ?        www-data apache   semtimedop
> > > > 13390 ?        www-data apache   semtimedop
> > > >
> > > > How can I debug this problem further? And how can I bring my
> > > > httpd processes back without a reboot?
> > > >
> > > > Many thanks for your help.
> > > >
> > > > Regards,
> > > > Sébastien DIDIER
> > > >
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > >
> > > --
> > > Gruss / Regards,
> > >
> > > Marc Grimme
> > > Phone: +49-89 121 409-54
> > > http://www.atix.de/               http://www.open-sharedroot.org/
> > >
> > > **
> > > ATIX - Ges. fuer Informationstechnologie und Consulting mbH
> > > Einsteinstr. 10 - 85716 Unterschleissheim - Germany