
Re: ext3 with quota under heavy load.

On Jun 26, 2003  12:19 -0700, Dale wrote:
> --- Andreas Dilger <adilger clusterfs com> wrote:
> > This almost certainly is a lock deadlock of some sort.  I've had
> > pretty good luck in debugging such problems just by running "sysrq-T"
> > on the console and/or using "crash" to examine the running kernel.  This
> > needs a fair amount of knowledge of the various locks in ext3.  The most
> > common problems are lock-ordering bugs: one process starts a journal
> > transaction and then blocks on a lock (e.g. directory or inode semaphore,
> > or superblock lock), while some other process holds that lock and tries
> > to start a new transaction when the journal is full.
> > 
> > The journal being full is a crucial issue, because if it isn't full you
> > can start a new transaction without problems, but when it is full you
> > need to flush the journal and wait for all existing users to free up
> > their handles, which will never happen if the first process has a
> > transaction handle and is blocked waiting for a lock the second process
> > is holding.
> If you could provide a little more instruction it would be appreciated.
> I'm guessing magic sysrq is required and sysrq-T means ALT+PrintScreen+T?

Correct.  You can also use the "crash" tool (based on GDB) to get this
information, but I'm not sure whether it requires kernel patches in order
to work properly.

> What kind of information does this provide and what should I do with it?

This dumps a stack trace for every process currently on the system to the
console.  You need to do this while you are experiencing the lockup,
obviously.  Unless you have in-kernel symbol decoding, you will also need to
run the output through ksymoops in order to get anything meaningful from it.
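
If the keyboard is not reachable during the hang, a rough sketch of another
way to request the same dump follows.  It assumes your kernel exposes
/proc/sysrq-trigger (not every kernel does, so the code checks for it) and
that you are root.  The function name trigger_task_dump is made up for this
example.

```python
import os


def trigger_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Request the task dump that sysrq-T produces by writing 't'.

    Assumption: the kernel provides trigger_path.  Returns True if the
    request was written, False if the file is missing or not writable
    (for example when not running as root).
    """
    if not os.path.exists(trigger_path):
        return False
    try:
        with open(trigger_path, "w") as trigger:
            trigger.write("t")
    except OSError:
        return False
    return True
```

The dump still goes to the console and kernel log, so the ksymoops step
applies to it in the same way.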

Interesting processes would include kswapd, kupdated, kjournald, and any of
the other hanging processes, although there will likely be a lot of "secondary
casualties" from the original deadlock.

You should be able to see which processes are deadlocked by running "ps auxww"
and looking for those stuck in disk wait "D" in the STAT column.  At that
point, probably one process will be in __down_failed(), and a bunch of others
will be in start_this_handle() or similar, and kjournald will be waiting on
the journal to be cleared.
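
To save scanning the "ps auxww" output by eye, something like the following
sketch could pick out the processes in disk wait.  It assumes the usual
"ps auxww" header line containing PID and STAT columns with the command
last, and a Python new enough to have subprocess.run with capture_output.
The function names are invented for illustration.

```python
import subprocess


def find_disk_wait(ps_output):
    """Return (pid, stat, command) for each process whose STAT starts with 'D'."""
    lines = ps_output.splitlines()
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = len(header) - 1  # assumed: the command is the last column
    stuck = []
    for line in lines[1:]:
        # Split only up to the command column so arguments stay together.
        fields = line.split(None, cmd_col)
        if len(fields) <= stat_col:
            continue
        if fields[stat_col].startswith("D"):
            command = fields[cmd_col].rstrip() if len(fields) > cmd_col else ""
            stuck.append((int(fields[pid_col]), fields[stat_col], command))
    return stuck


def current_disk_wait():
    """Run 'ps auxww' on this machine and return its disk-wait processes."""
    result = subprocess.run(
        ["ps", "auxww"], capture_output=True, text=True, check=True
    )
    return find_disk_wait(result.stdout)
```

Calling current_disk_wait() during the lockup gives the list of PIDs whose
stack traces are worth looking up in the sysrq-T output first.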

Cheers, Andreas
Andreas Dilger
