Many small files, best practise.

Mon Sep 14 21:08:58 UTC 2009

[ ... ]

>> Also note that writing 10^9 files at a 10^3/s rate
>> takes 10^6 seconds; roughly 10 days to populate the
>> filesystem (or at least that to restore it from backups).
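
The arithmetic in the quoted text can be checked in a few lines. The file count and rate are the quoted figures; the rest is unit conversion, and the result comes out slightly above the "roughly 10 days" estimate.

```python
# Back-of-envelope check of the population time quoted above.
files = 10**9   # total number of files to create
rate = 10**3    # sustained file creations per second

seconds = files // rate
days = seconds / 86400  # 86400 seconds in a day

print(f"{seconds} seconds, about {days:.1f} days")
```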

> One thing that you can do when doing bulk loads of files (say,
> during a restore or migration), is to use a two phase
> write. First, write each of a batch of files (say 1000 files
> at a time), then go back and reopen/fsync/close them.
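
For reference, a minimal sketch of the two-phase write described in the quoted text, assuming a POSIX-like system. The function name `write_batch_two_phase` and its arguments are hypothetical, chosen only for illustration.

```python
import os

def write_batch_two_phase(directory, files):
    """Write a batch of (name, data) pairs, then fsync each file.

    Hypothetical helper: 'files' is an iterable of (str, bytes) pairs.
    """
    paths = []
    # Phase 1: write every file in the batch without forcing it to disk.
    for name, data in files:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    # Phase 2: go back, reopen each file, fsync it, and close it.
    for path in paths:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return paths
```

This sketch does not fsync the containing directory, which some filesystems need before the new names themselves are durable.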

Why not just restore a database?

>>> One layout for directories that works well with this kind of
>>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
>>> MIN might be 0, 5, 10, ..., 55 for example).

>> As to the problem above and this kind of solution, I reckon that
>> it is utterly absurd (and I could have used much stronger words).

> When you deal with systems that store millions of files,

Millions of files may work; but 1 billion is an utter absurdity.
A filesystem that can reasonably store 1 billion small files in
7TB is an unsolved research issue...

The obvious thing to do is to use a database, and there is no
way around this point.

If one genuinely needs to store a lot of files, why not split
them into many independent filesystems? A single large one is
only needed to allow for hard linking or for having a single large
space pool, and in applications where the directory structure
above makes any kind of sense neither is usually required.
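
One way to spread files over independent filesystems is to pick the volume from a hash of the file's key. This is only a sketch under assumptions: the mount points in `VOLUMES` and the function `volume_for` are hypothetical, and SHA-1 is used merely as a stable, well-distributed hash.

```python
import hashlib

# Hypothetical mount points, one per independent filesystem.
VOLUMES = [f"/srv/vol{i:02d}" for i in range(16)]

def volume_for(key: str) -> str:
    """Map a file key to one of the volumes, deterministically."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(VOLUMES)
    return VOLUMES[index]

print(volume_for("2009/09/14/21/05/record-0001"))
```

Each volume can then be checked, backed up and restored on its own, at the price of giving up hard links and a shared free-space pool across volumes.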

> you pretty much always are going to use some kind of made up
> directory layout.

File systems are usually used for storing somewhat unstructured
information, not records that can be looked up with a simple
"YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for
something like a simple DBMS.
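
To illustrate how small records keyed this way fit a simple DBMS, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is only an example choice, not something proposed in this thread, and the table name, column names and helper functions are hypothetical.

```python
import sqlite3

# In-memory database for illustration; a real store would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " bucket TEXT NOT NULL,"  # e.g. '2009/09/14/21/05'
    " name   TEXT NOT NULL,"
    " data   BLOB NOT NULL,"
    " PRIMARY KEY (bucket, name))"
)

def put(bucket, name, data):
    """Insert or overwrite one small record, committing on success."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
            (bucket, name, data),
        )

def get(bucket, name):
    """Return the record's bytes, or None if it does not exist."""
    row = conn.execute(
        "SELECT data FROM records WHERE bucket = ? AND name = ?",
        (bucket, name),
    ).fetchone()
    return None if row is None else row[0]

def list_bucket(bucket):
    """List record names in one time bucket, like listing a directory."""
    rows = conn.execute(
        "SELECT name FROM records WHERE bucket = ? ORDER BY name", (bucket,)
    )
    return [r[0] for r in rows]

put("2009/09/14/21/05", "msg-0001", b"payload")
```

The composite primary key plays the role of the directory path plus file name, and the lookup goes through the database's index rather than a chain of directory traversals.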

There is even a tendency to move filesystems into databases, as
they scale a lot better.

And for cases where a filesystem still makes sense I would
rather use, instead of the inane many-level directory structure
above, a file system design with proper tree indexes and perhaps
even one with the ability to store small files into inodes.

[ ... ]

> You can always try to write 1 million files in a single
> subdirectory,

Again, I'd rather avoid anything like that.

> but if you are writing your own application, using this kind
> of scheme is pretty trivial.

And an utter absurdity, for 1 billion files in 200k directories.
Both on its own merits and compared to the OBVIOUS alternative.

>> If anything, consider the obvious (obvious except to those
>> who want to use a filesystem as a small record database),
>> which is 'fsck' time, in particular given the structure of
>> 'ext3' (or 'ext4') metadata.

> fsck time has improved quite a lot recently with ext4 (and
> with xfs).

How many months do you think a 7TB filesystem with 1 billion
files would take to 'fsck', even with those nice improvements?

[ ... ]