Very slow directory traversal

Ross Boylan ross at biostat.ucsf.edu
Sat Oct 6 07:10:48 UTC 2007


My last full backup of my Cyrus mail spool had 1,393,569 files and
consumed about 4G after compression. It took over 13 hours.  Some
investigation led to the following test:
 time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/
That took 15 minutes the first time it ran, and 32 seconds when run
immediately thereafter.  There were 355,746 files. This is typical of
what I've been seeing: initial run is slow; later runs are much faster.
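
I assume the difference between runs is just the page/dentry/inode
cache.  If I want to reproduce the cold-cache case without waiting for
the cache to age out, something like this should work (drop_caches
appeared in 2.6.16, so it ought to be present on my 2.6.18 kernel):

 sync
 echo 3 > /proc/sys/vm/drop_caches   # drop page cache plus dentries and inodes
 time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/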

df shows
/dev/evms/CyrusSpool  19285771  17650480    606376  97% /var/spool/cyrus

mount shows
/dev/evms/CyrusSpool on /var/spool/cyrus type ext3 (rw,noatime)

The spool was active when I did the tests just described, but inactive
during backup.  It's on top of LVM as managed by EVMS, on a Linux
2.6.18 kernel with a Pentium 4 processor.  It might be significant that
Linux treats this as an SMP machine with 2 processors, since the single
processor has hyperthreading.  I'm using a stock Debian kernel, -686
variant.

# time dd if=/dev/evms/CyrusSpool bs=4096 skip=16k count=256k of=/dev/null
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 26.4824 seconds, 40.5 MB/s

At that rate the roughly 17G of used space could be read sequentially
in about 7 minutes, so the 13-hour backup is presumably limited by
seeks and metadata lookups, not raw throughput.

The spool was mostly populated all at once from another system, and the
file names are mostly numbers.  Perhaps that creates some hashing
trouble?

Can anyone explain this, or, even better, give me a hint how I could
improve this situation?

I found some earlier posts on similar issues, although they mostly
concerned apparently empty directories that took a long time.  Theodore
Tso had a comment that seemed to indicate that the directory hashing
conflicts with Unix readdir() requirements.  I think the implication
was that you could end up with linearized, or partly linearized,
searches under some scenarios.  Since this is a mail spool, I think it
gets lots of sync()'s.
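
If the problem really is that readdir() returns names in hash order,
scattering the inode and data reads across the disk, then one
workaround I can imagine is feeding tar a file list sorted by inode
number instead of letting it walk the directory itself.  Just a sketch
(it assumes GNU find and tar, and no whitespace in the names, which
holds here since the spool file names are numbers):

 find /var/spool/cyrus/mail/r/user/ross/debian/user/ -printf '%i %p\n' \
   | sort -n | cut -d' ' -f2- \
   | tar cf /dev/null --no-recursion -T -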

I conducted pretty extensive tests before picking ext3 for this file
system; it was fastest for my tests of writing messages into the spool.
I think I tested the "nearly full disk" scenario, but I probably didn't
test the scale of files I have now.  Obviously my problem now is
reading, not writing.

# dumpe2fs -h /dev/evms/CyrusSpool
dumpe2fs 1.40.2 (12-Jul-2007)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          44507cfa-39ce-46f1-9e3e-87091225395d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery sparse_super
Filesystem flags:         signed directory hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              10289152  # roughly 10x the number of files.
Block count:              20578300
Reserved block count:     1028915
Free blocks:              1651151
Free inodes:              8860352
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      236
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         4096
Inode blocks per group:   512
Filesystem created:       Mon Jan  1 11:32:49 2007
Last mount time:          Thu Oct  4 09:42:00 2007
Last write time:          Thu Oct  4 09:42:00 2007
Mount count:              2
Maximum mount count:      25
Last checked:             Fri Sep 28 09:26:39 2007
Check interval:           15552000 (6 months)
Next check after:         Wed Mar 26 09:26:39 2008
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      9f50511e-2078-4476-96f4-c6f3415fda4f
Journal backup:           inode blocks
Journal size:             32M

I believe I created it this way; in particular, I'm pretty sure I've had
dir_index from the start.
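
One thing I haven't tried is re-optimizing the directories themselves.
If I read the man pages right, something like this, run with the spool
unmounted, would rebuild the directory indexes (the tune2fs line should
be a no-op here, since dir_index is already on):

 umount /var/spool/cyrus
 tune2fs -O dir_index /dev/evms/CyrusSpool
 e2fsck -fD /dev/evms/CyrusSpool     # -D re-optimizes/reindexes directories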



