listing a huge amount of files on one file system

Cameron Simpson cs at zip.com.au
Fri Jun 9 23:17:02 UTC 2006


On 06Jun2006 10:05, Esquivel, Vicente <Esquivelv at uhd.edu> wrote:
| Can anyone tell me if they have experienced long waits while trying to
| list a directory with a huge amount of files in it?  

Sure. It is a common issue with very large directories.

| One of our servers that is running on RHEL 4, has a directory that
| contains over 2 million files in it and growing.  The files are all
| small files in size but there are a lot of them due to the application
| that runs on this server.    I have tried to do an "ls" and "ls -l"
| command inside of that directory but it just seems to run for a long
| time with no output, I am assuming that if I leave it running long
| enough it will eventually list them all.  I was just wondering if anyone
| has seen this before or has a better way of getting a listing of all
| the files inside a directory like that.

There are various causes for delay (aside from sheer size).

First, as mentioned, ls sorts its output, which requires it to read all
the entries before printing anything. Using the -f option skips the sort.
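
For example (with GNU ls, as shipped on RHEL), something like this avoids
both the sort and the per-file stat() discussed below, so it should come
back much sooner:

	/bin/ls -f | head	# first few names, in directory order
	/bin/ls -f | wc -l	# count entries (-f also shows dot files)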

Second, on RedHat systems, the default install includes an alias for
"ls" that tries to colour files by type. It's very very annoying.
It is also expensive. You will understand that "ls -l" will be expensive
because ls must lstat() every file in order to get the information for
a "long (-l)" listing. You would expect that plain "ls" does not need to
do that, and it should not - it only needs the filenames, which come
directly from the directory.

However, by aliasing "ls" to the colourising mode, "ls" is again forced
to lstat() every file (and worse, stat() every symlink!) in order to
determine file types and so to determine colours.

Try saying:

	/bin/ls -f
    or	/bin/ls
    or	unalias ls; ls -f
    or	unalias ls; ls

and see if the behaviour improves.
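
You can also check what your shell will actually run (assuming bash, the
RHEL default) before deciding which form to use:

	type ls		# reports whether "ls" is an alias and what it expands to
	alias ls	# prints the alias definition, if there is one
	command ls -f	# runs the real ls, ignoring any alias or shell function

Note that "unalias ls" only affects the current shell; the alias is
typically set up by a script under /etc/profile.d, so it will be back in
the next login shell.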

Third, very large and flat (no subdirectories) directories are quite
expensive on many filesystems because doing a stat() or lstat() to
look up file details involves reading the directory contents to map the
filename to the file's inode number. Often, that is a linear read of the
directory (some filesystems use a more sophisticated internal structure
than a simple linear list, but that is still uncommon).  As a consequence,
stat()ing every file requires 2,000,000 lookups in the directory, and each
such lookup will on average scan about half the contents (it can stop when
it finds the filename, which may be anywhere in the list). So the cost of
"ls -l" is roughly proportional to the _square_ of the number of directory
entries: 2,000,000 lookups times roughly 1,000,000 entries scanned per
lookup is on the order of 2,000,000,000,000 entry comparisons.

It is usually a performance improvement to break large flat directories
into subdirectories. You still need to stat() everything in the long run
(2,000,000 items), but the linear cost per directory can be reduced because
each individual directory is smaller.
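
If the application's layout can be changed, the usual trick is to bucket
files into subdirectories named after a prefix of the filename. A rough,
untested bash sketch of how the existing files might be moved into
two-character buckets (the path is a placeholder; names no longer than the
prefix, and dot files, are glossed over):

	cd /path/to/bigdir || exit 1
	for f in *; do			# expanding * over 2,000,000 names is itself slow
		[ -d "$f" ] && continue	# skip any bucket directories already made
		d=${f:0:2}		# bucket = first two characters of the name
		mkdir -p "$d"
		mv -- "$f" "$d/"
	done

The application then needs to derive the same prefix when it opens a file,
so each lookup only has to search a much smaller directory.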

Finally, the sheer size of the directory may be exceeding some stupid
hardwired limit in the Veritas backup utility, although I'd expect the
Veritas people to know about such a limit if it exists.

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Dangerous stuff, science.  Lots of us not fit for it.
        - H.C. Bailey, _The Long Dinner_



