Ext2 metadata and disk seeks

Wed Apr 20 03:56:28 UTC 2005

Hi,

I'd appreciate help in solving the following problem:

Given a large list of file names, find the fastest way to open(2) and
fstat(2) all of the files.

This is done for caching purposes; note that the files are not being
read -- I'd like to solve the stat-ing problem first.  The files are on
a 32 GB ext3 file system (with LVM), on a slow laptop IDE disk (4200
RPM), so disk seeking is obviously the biggest obstacle.  hdparm -t says
that the upper limit is ~26 MB/s.  The file list contains 14,307 files,
amounting to 285 MB.

Here are some results for the different approaches I tried so far:

1. Sort the file list by file name:

   time[1]:      16.3 s
   data read[2]: 21,200 KB

2. Sort the file list by inode number:

   time:         16.1 s
   data read:    21,200 KB

3. What I thought should offer the best performance:

   Using ext2_inode_scan from libext2fs, read the inode tables of the
   corresponding block groups in a batch [3].  While scanning the block
   group, store the block numbers of all directory blocks.  Once the
   scan is complete, sort the directory block numbers and read all
   the blocks using ext2fs_read_dir_block().

   time:         22.7 s
   data read:    119,900 KB

The reasoning behind the last approach is that the bottleneck here are
path lookups -- the disk head must move back and forth between the
inode table and directory blocks.  By reading the inodes and directory
blocks in a batch, we should be able to cache all the needed
information (and not hit the disk at all while iterating through the
list of files).

What's puzzling me though is the poor performance of this approach
(not even near the 26 MB/s mark).  Reading the inodes is indeed fast,
but reading the directory blocks is not.  Using /proc/sys/vm/block_dump
logging, I verified that the blocks are read sequentially, with some
gaps in between.  A typical sequence of reading the directory blocks
will look like this:

    READ block 21249176 on dm-0
    READ block 21249184 on dm-0
    READ block 21249192 on dm-0
    READ block 21249200 on dm-0
    READ block 21249256 on dm-0
    READ block 21255824 on dm-0
    READ block 21258200 on dm-0
    READ block 21258208 on dm-0

Timings are showing that these gaps (where block_{n+1} > block_{n} + 8)
are where most of the time is lost.  Is there any way I could read 
the directory blocks in larger batches?  Any other ideas on how to
improve these times?

Thanks,
-- 
Ziga

[1] Note that the file system was unmounted between running the tests
    to make sure the caches were cold.

[2] This is the amount of data read while stat-ing the files (retrieved
    from /proc/diskstats, rsect column).

[3] The algorithm is as follows: for each file, get the corresponding
    block group.  If that group wasn't scanned yet, iterate through all
    of the inodes in the group's table.  This way we read more inodes
    than necessary, but we're reading them in large contiguous blocks.