Many small files, best practise.

Peter Grandi pg_ext3 at
Mon Sep 21 15:37:25 UTC 2009

[ ... whether datasets like 1G records for a total of 7TB should be
stored as one-record-per-file in a filesystem or as a database ... ]

>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter absurdity.
>> A filesystem that can store reasonably 1 billion small files in
>> 7TB is an unsolved research issue...

> I'd disagree.  We have Lustre filesystems with 500M files on
> the ext4(ish) metadata server, and these are only 4TB. Note
> there is NO DATA in the metadata files, so it isn't quite like
> a normal filesystem.

That is possible, but to me seems quite unreasonable. How long
does that take to RSYNC, for example? To just backup? What about
doing a 'find'? These are mad things.

This is the special case of an MDS as you mention, but it is still
fairly dangerous.

Just like many other similar choices (e.g. 19+1 RAID5 arrays), it
works (not so awesomely) as long as it works, and when it breaks it
is very bad.

I like the Lustre idea, and to me it is currently the best of a not
very enthusing lot, but the MDT is by far the weakest bit, and the
``lots of tiny files'' idea is one of the big deals.

In particular the size of MDTs is a significant scalability issue
with Lustre, which was designed in older gentler times for
purposes to which metadata scalability might not have been so
essential. Like most good ideas it has been scaled up beyond
expectations (UNIX-style), and perhaps it is reaching the end
of its useful range.

Fortunately sensible Lustre people keep frequent and wholesome
MDS backups, and restoring a backup, even a 500M 800B file
backup, is hopefully much faster than an 'fsck' if there is
damage.

> It also depends on what you mean by "small files". We've
> previously discussed storing small file data in an extended
> attribute, and if you are tuning for this and the file size is
> small enough (3kB or less) the file data could be stored
> inside the inode (i.e. zero seek data IO).

If I were to use a filesystem as a makeshift database I would
indeed use one of those filesystems that store small files or file
tails in the metadata, as I wrote:

  >> And for cases where a filesystem still makes sense I would
  >> rather use, instead of the inane manylevel directory
  >> structure above, a file system design with proper tree
  >> indexes and perhaps even one with the ability to store
  >> small files into inodes.

You might consider storing Lustre MDTs on Reiser3 instead of
'ldiskfs' :-).

But this is backwards; the database guys have spent the past
several decades working on the ``lots of small records reliably''
problem (and with "bushy" indices), and the main work by the file
system guys has been solving the ``massive massively parallel
files'' one. To the point that people like Reiser who did work
(with database-like techniques) on the small files problem for
filesystems have been at best ignored.

[ ... ]

> I think you aren't backing your comments with any facts.

You may think that -- but that's only because you think wrong,
as you haven't read my comments or you want to misrepresent them.

I made at the very start a clear example of a case with 1M small
files engendering a difference between more than 15 hours vs. 6
minutes for just creation.
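
For readers who want to try this kind of comparison themselves, here
is a minimal sketch. It is not the script used for the figures in this
message: it uses Python with SQLite as a stand-in single-file database,
and the function names, the two-level directory fan-out and the 50 to
100 byte payload range are all assumptions made for illustration. On a
small run that fits in the page cache the gap will be far smaller than
on a cold disk with 1M records.

```python
import os
import random
import sqlite3
import tempfile
import time


def make_records(count, min_len=50, max_len=100, seed=0):
    """Yield (key, payload) pairs; payload sizes are an assumed range."""
    rng = random.Random(seed)
    for i in range(count):
        yield f"{i:09d}", b"x" * rng.randint(min_len, max_len)


def store_as_files(root, records):
    """One record per file, fanned out over two directory levels."""
    for key, payload in records:
        subdir = os.path.join(root, key[:3], key[3:6])
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, key), "wb") as f:
            f.write(payload)


def store_in_database(path, records):
    """All records in one indexed table inside a single file."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE rec (k TEXT PRIMARY KEY, v BLOB)")
    with con:  # commit the whole load as one transaction
        con.executemany("INSERT INTO rec VALUES (?, ?)", records)
    con.close()


def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def compare(count):
    """Load the same records both ways and report times and counts."""
    with tempfile.TemporaryDirectory() as tmp:
        tree = os.path.join(tmp, "tree")
        db = os.path.join(tmp, "records.db")
        os.mkdir(tree)
        files_s = timed(store_as_files, tree, make_records(count))
        db_s = timed(store_in_database, db, make_records(count))
        n_files = sum(len(names) for _, _, names in os.walk(tree))
        con = sqlite3.connect(db)
        n_rows = con.execute("SELECT COUNT(*) FROM rec").fetchone()[0]
        con.close()
    return {"files_s": files_s, "db_s": db_s,
            "files": n_files, "rows": n_rows}


if __name__ == "__main__":
    print(compare(10_000))
```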

For amusement I just reran it in a nicer form on a somewhat faster
system:

  base$  rm /fs/jugen/tmp/manysmall.db 
  base$  time perl /fs/jugen/tmp/manysmall.db 1000000 50 100
    1 percent done, 990000 to go
    2 percent done, 980000 to go
    3 percent done, 970000 to go
   98 percent done, 20000 to go
   99 percent done, 10000 to go
  100 percent done, 0 to go

  real	0m48.209s
  user	0m6.240s
  sys	0m0.348s
  base$  ls -ld /fs/jugen/tmp/manysmall.db 
  -rw------- 1 pcg pcg 98197504 Sep 21 16:19 /fs/jugen/tmp/manysmall.db

That's 1M records in about 100MB in less than a minute, that is
20K records/s or around 2MB/s, fairly typical for random access to a
fairly standard 1TB consumer drive in its latter half.
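
As a quick check, the rates can be recomputed from the transcript
alone. This sketch uses only the file size reported by 'ls' and the
'real' time reported by 'time'; the variable names are illustrative.

```python
# Numbers copied from the transcript above.
records = 1_000_000
file_bytes = 98_197_504        # size reported by 'ls -ld'
elapsed_s = 48.209             # 'real' time reported by 'time'

records_per_s = records / elapsed_s
mb_per_s = file_bytes / elapsed_s / 1e6
bytes_per_record = file_bytes / records

print(f"{records_per_s:,.0f} records/s")
print(f"{mb_per_s:.2f} MB/s")
print(f"{bytes_per_record:.0f} bytes of file per record")
```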

  base$  sudo sysctl vm.drop_caches=1
  vm.drop_caches = 1
  base$  time perl /fs/jugen/tmp/manysmall.db 1000000 10000
    1 percent done, 9900 to go
    2 percent done, 9800 to go
    3 percent done, 9700 to go
   98 percent done, 200 to go
   99 percent done, 100 to go
  100 percent done, 0 to go
  average length: 69.3816

  real	2m4.265s
  user	0m0.150s
  sys	0m0.126s

Seeking of course is not awesome, and we get 10K records in 2m, or
around 80 records/s. Ah well. I need an SSD :-).
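
The same kind of back-of-the-envelope check works for the lookup run.
A rough sketch, using only the transcript figures: the resulting time
per lookup is in the range usually quoted for one seek plus rotational
latency on a consumer hard drive, which suggests (but does not prove)
about one disk seek per record.

```python
# Numbers copied from the second transcript above.
lookups = 10_000
elapsed_s = 2 * 60 + 4.265     # 'real 2m4.265s'

lookups_per_s = lookups / elapsed_s
ms_per_lookup = 1000 * elapsed_s / lookups

print(f"{lookups_per_s:.1f} lookups/s")
print(f"{ms_per_lookup:.1f} ms per lookup")
```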

And as to the 'fsck', I confess that I had a list of cases in
mind but was waiting for the usual worn out dodgy technique of
quoting undamaged filesystem times:

> The e2fsck time on our MDS filesystems with 500M IN USE inodes
> is on the order of 4 hours (disk-based RAID-1+0 array). If
> this was on a RAID-1+0 SSD it could be noticably faster. Ric
> also commented previously about single-digit hours for e2fsck
> on a test 1B file ext4 filesystem.

That is a classic "benchmark" -- undamaged filesystem 'fsck'
tests, like the other favourite, freshly loaded filesystem
benchmarks, are just dodgy marketing tools.

And even so! 1 hour per TB, or 1h per 100M files. To me keeping
what may be a production filesystem with 500M files unavailable
for 4 hours because one occasionally has to run 'fsck' (even if
in fact there is no damage) with an upside risk of weeks or
months sounds not such a good idea. But who knows.

There have been reports, which are sadly familiar to those who
work as sysadms, of single digit TB filesystems taking weeks to
months to repair, if damaged. The difference of course is
between scanning the metadata and crawling it.

Which is of course perfectly obvious, as RAIDs allow for
parallelizing of read/write but not easily for scanning and less so
for crawls. Scaling 'fsck' is not easy; it is an unsolved research
problem, even if things like Lustre help somewhat (minus the MDTs
of course).
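
The gap between scanning and crawling can be put into rough numbers.
This is only a sketch under a loudly stated assumption: that a crawl
of damaged metadata costs one random read per inode, at the 80 reads/s
measured earlier in this message on a single consumer drive. A real
RAID-1+0 array would sustain more random reads, and a real repair
would not touch every inode exactly once, so treat the result as an
order-of-magnitude illustration only.

```python
# Figures quoted in this thread.
inodes = 500_000_000
scan_hours = 4.0               # e2fsck on the undamaged MDS filesystem
random_reads_per_s = 80.0      # single-drive lookup rate measured above

# ASSUMPTION: crawling costs one random read per inode, with no help
# from caching, read-ahead or parallelism.
scan_inodes_per_s = inodes / (scan_hours * 3600)
crawl_seconds = inodes / random_reads_per_s
crawl_days = crawl_seconds / 86400

print(f"scan:  {scan_inodes_per_s:,.0f} inodes/s, {scan_hours:.0f} hours")
print(f"crawl: {random_reads_per_s:.0f} inodes/s, {crawl_days:.0f} days")
```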

Now that I feel a bit preachy, I'll mention some wider concepts (mostly
from the database guys) that should fit well in this discussion:

* A "database" is defined as something including a dataset whose
  working set does not fit in memory (it thrashes -- every access
  involves at least one IO). There are several types of databases,
  structured/unstructured, factual/textual/...; a filesystem is a
  kind of database, as that definition applies. But to me and
  several decades of practice and theory it is a database of
  record _containers_ (as suggested by the very word "file"), not
  of records. It is exceptionally hard to do a DBMS that handles
  equally well records and record containers.

* A "very large database" is a database that cannot be practically
  backed up (or checked) offline, as backup (or check) takes too
  long wrt requirements. Many filesystems are moving into the
  "very large database" category (can your customers accept that it
  might take 4 hours or 4 weeks to check, and 4 days to restore,
  their filesystem?). Storing small records (or small containers
  even) in a filesystem makes it much more likely that it becomes a
  "very large database", and while the technology for "very large
  database" DBMSes is mature, that for "very large database" file
  system designs is not there or at least not as mature, even if
  the fun guys at Sun have been trying lately with ZFS.

* These are not novel or little known concepts and experiences. 'ar'
  files have been around for a long time, for some good reason.
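
The record-container idea behind 'ar' files can be sketched in a few
lines. This example uses Python's standard 'tarfile' module as a
stand-in for 'ar', and the helper names are hypothetical. Note that
tar pads every member to 512-byte blocks, so it illustrates the
"many records, one file" shape rather than a space-efficient format.

```python
import io
import os
import tarfile
import tempfile


def pack_records(archive_path, records):
    """Write every (name, payload) pair as a member of one archive."""
    with tarfile.open(archive_path, "w") as tar:
        for name, payload in records:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_record(archive_path, name):
    """Return the payload of a single member, looked up by name."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(name).read()


def demo(count=1000):
    """Pack 'count' tiny records, then read one of them back."""
    records = [(f"rec{i:06d}", f"payload {i}".encode())
               for i in range(count)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.tar")
        pack_records(path, records)
        return read_record(path, "rec000042"), os.path.getsize(path)


if __name__ == "__main__":
    payload, size = demo()
    print(payload, size)
```

The filesystem then sees one file to back up, check and copy instead
of a thousand, which is the point being made in the list above.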

More information about the Ext3-users mailing list