extremely slow "ls" on a cleared fatty ext3 directory on FC4/5

Sun Aug 13 17:46:30 UTC 2006

On Sun, Aug 13, 2006 at 12:01:17AM -0700, Robinson Tiemuqinke wrote:
> 
>  A stupid flat directory /tmp holding 5 millon files,
> the directory locates on a ext3 file system with
> dir_index feature turned on. The running Linux are FC4
> and FC5.
>  
>  The files are just directly under /tmp, not in any
> subdirectories -- they are results of mis-operations
> of users.

Wow!  How many users do you have on your system?  And over what period
of time did this build up?

From a system administration point of view, a really good idea is to
have a job which just deletes all files in /tmp that stick around for
longer than 24 hours or so, and unconditionally on reboot.  Then when
the users scream, you can give them access to a /scratch partition
which has slightly more lax rules, such as deletion after 1 or 2
weeks, and with a README which says, "not backed up --- data can be
deleted at any time, and if you complain, we will laugh at you".  :-)
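
A minimal sketch of such a cleanup job in Python, to be run from cron
or similar.  The function name is made up for illustration, and using
the modification time as the measure of "sticking around" is an
assumption; a real job might prefer access or change time, and would
need to handle subdirectories too.

```python
import os
import time


def purge_old_files(root, max_age_seconds=24 * 3600, now=None):
    """Delete regular files directly under `root` older than the cutoff.

    Returns the sorted names that were removed.  Only the top level is
    scanned, and symlinks are never followed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds

    # First pass: collect the victims without modifying the directory.
    victims = []
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                    victims.append(entry.name)
            except FileNotFoundError:
                continue  # vanished while we were scanning

    # Second pass: unlink them.
    removed = []
    for name in victims:
        try:
            os.unlink(os.path.join(root, name))
            removed.append(name)
        except FileNotFoundError:
            continue  # somebody else deleted it first
    return sorted(removed)
```

For the /scratch case the same function could be called with a
max_age_seconds of one or two weeks.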

From a technical point of view, what's happening is that dir_index
speeds up directory lookups by using a hash tree.  Unfortunately,
POSIX imposes requirements about how readdir() is supposed to work if
files are added or deleted while the readdir() is in process.
(Basically a file which is created or deleted during the readdir must
appear once or not at all, and all other files must be returned
exactly once.)  This isn't too bad, except that this requirement must
also be maintained even across a telldir() which saves a linear offset
into the directory, and seekdir() which seeks back to that location on
disk.  This interface is horribly broken, as it fundamentally assumes
a linear linked list implementation such as was used three decades ago
in Unix.  And, it gives filesystem implementors nightmares when they
are required to provide this interface even when they are trying to
use more advanced data structures that no longer have a linear
directory layout --- say, like a B-tree.

Different filesystems solve this in different ways; some use multiple
B-trees, with one B-tree only so that readdir() can have the proper
semantics.  This has the downside that file creations and deletions
now have to update two separate trees.

The choice which ext3 used was a simpler one, which is that we simply
return files in hash sort order.  This provides the correct semantics,
but unfortunately it means that workloads which do a readdir()
followed by a stat() of each file end up accessing the inode table in
an effectively random order.  This can also happen if the inode table
is fragmented, but hash ordering makes the worst case happen every
single time.
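
One rough way to see this on a given directory is to count how often
the inode number goes backwards between consecutive entries as they
come back from readdir().  This is only an illustrative sketch (the
function name is made up), not a precise measure of seek cost:

```python
import os


def count_inode_order_breaks(path):
    """Return (breaks, total) for the entries of `path`.

    `breaks` is the number of adjacent entries, in the order the
    directory returns them, where the inode number decreases.
    """
    with os.scandir(path) as entries:
        # entry.inode() comes from the directory entry itself on Linux,
        # so no stat() call is made here.
        inodes = [entry.inode() for entry in entries]
    breaks = sum(1 for a, b in zip(inodes, inodes[1:]) if b < a)
    return breaks, len(inodes)
```

A directory returned in inode order would show no breaks at all; a
large hashed directory would be expected to show breaks on roughly
half of the adjacent pairs.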

There are solutions; the simplest is to have programs read the
entire directory into memory, and then sort the list by inode number
before actually stat'ing the files.  This can be done in userspace much
more easily than in the kernel, since userspace memory is swappable,
and kernel memory is not.  I have written an LD_PRELOAD library which
allows a program to do the right thing without needing to modify the
program:

http://www.redhat.com/archives/ext3-users/2004-September/msg00025.html
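
For a program that can be modified, the read-then-sort idea looks
roughly like the following Python sketch.  This is an illustration
only, not the LD_PRELOAD library linked above, and the function name
is made up:

```python
import os


def stat_in_inode_order(path):
    """Return [(name, stat_result)] for `path`, stat'ing in inode order."""
    # Read the whole directory first.  The directory entry already
    # carries the inode number, so no stat() is needed at this stage.
    with os.scandir(path) as entries:
        listing = [(entry.inode(), entry.name) for entry in entries]

    # Sort by inode number so that the stat() calls walk the inode
    # table roughly sequentially instead of in hash order.
    listing.sort()

    results = []
    for _inode, name in listing:
        try:
            results.append((name, os.lstat(os.path.join(path, name))))
        except FileNotFoundError:
            continue  # deleted between the scan and the stat
    return results
```

The cost is holding the whole listing in memory, which for a
5-million-entry directory is not small, but it is ordinary swappable
userspace memory.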

Unfortunately, for programs that use telldir() and seekdir(), and hold
on to the telldir() pointer for a long time, and still expect POSIX
semantics, this will not necessarily work correctly, so it's not
something I would recommend as a systemwide LD_PRELOAD.  But it is
useful for accelerating programs that haven't yet been modified, such
as ls and find.  Other programs, such as mutt's maildir handling, have
already been so modified, which is a much better solution.  (In fact,
sorting by inode provides speedup benefits on all filesystems, just
much more on ext3 filesystems with dir_index enabled.)

The fact that ext3 doesn't shrink directories is a long-standing Unix
implementation restriction.  It's not impossible for us to add support
for truncating directories as files get deleted, but it's just never
bubbled up to the top of the todo list; in practice, workloads that
create gigantic directories that then shrink down to nothing are
relatively rare.   
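
A quick way to spot a directory in this state is to compare the size
stat() reports for the directory itself with the number of entries it
currently holds.  A sketch, with a made-up function name; what counts
as "too big" is left to the reader, since it depends on the block size
and the filename lengths:

```python
import os


def directory_footprint(path):
    """Return (size_in_bytes, entry_count) for the directory `path`.

    size_in_bytes is st_size of the directory itself, i.e. the space
    taken by its directory blocks, not by the files inside it.
    """
    size = os.stat(path).st_size
    with os.scandir(path) as entries:
        count = sum(1 for _ in entries)
    return size, count
```

On ext3, a directory that once held millions of files will still
report a size of many megabytes here even when the entry count is back
down to a handful.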

> If there are any ways to fix this kind of problem
> without rebooting machine? I'm afraid of the commands
> "rsync -avHn /tmp/ /new_tmp/; rm -rf /tmp/ && mv
> /new_tmp/ /tmp" because other applications are
> accessing /tmp/ as well.

Not without rebooting, but it will probably require scheduled
downtime where you kick all of the users off, and then recreate the
tmp directory --- either using rsync, or just doing a plain old "rm
-rf /tmp; mkdir /tmp".  If users are expecting that files stick around
in /tmp, that's a huge cultural problem, and it will come back to haunt
you in multiple ways....
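
As a sketch of the recreate step in Python: this is a variant of the
plain "rm -rf; mkdir" above that first renames the old directory aside
and only then deletes it, so the path is missing for a shorter time.
The function name and the ".old" suffix are made up.  It assumes the
directory is not itself a mount point (rename would fail), that no
sibling with the ".old" name already exists, and that nothing
important still has files open in the old directory.  It restores the
permission bits (1777 for /tmp) but not the owner.

```python
import os
import shutil
import stat


def recreate_directory(path):
    """Replace `path` with a fresh, empty directory with the same mode."""
    path = path.rstrip("/")
    old_mode = stat.S_IMODE(os.stat(path).st_mode)

    doomed = path + ".old"      # hypothetical name for the old directory
    os.rename(path, doomed)     # move the bloated directory aside

    os.mkdir(path)
    os.chmod(path, old_mode)    # mkdir's mode is subject to the umask

    shutil.rmtree(doomed)       # the slow part, now out of the way
    return old_mode
```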

						- Ted