Many small files, best practise.

Ric Wheeler rwheeler at redhat.com
Wed Sep 16 18:56:54 UTC 2009


On 09/14/2009 05:08 PM, Peter Grandi wrote:
> [ ... ]
>
>>> Also note that in order to write 10^9 files at 10^3/s rate
>>> takes 10^6 seconds; roughly 10 days to populate the
>>> filesystem (or at least that to restore it from backups).
>
>> One thing that you can do when doing bulk loads of files (say,
>> during a restore or migration), is to use a two phase
>> write. First, write each of a batch of files (say 1000 files
>> at a time), then go back and reopen/fsync/close them.
>
> Why not just restore a database?

If you started with a database, that would be reasonable. If you started with a 
file system, I guess I don't understand what you are suggesting.
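To be concrete, the two-phase write mentioned above is roughly this (a quick
Python sketch, not production code; the batch size of 1000 is just the example
figure from above):

import os

def bulk_write(paths_and_data, batch_size=1000):
    # Phase 1: write a batch of files without fsyncing each one inline.
    # Phase 2: go back and reopen/fsync/close the whole batch.
    batch = []
    for path, data in paths_and_data:
        with open(path, "wb") as f:
            f.write(data)
        batch.append(path)
        if len(batch) >= batch_size:
            sync_batch(batch)
            batch = []
    if batch:
        sync_batch(batch)

def sync_batch(paths):
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)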

>
>>>> One layout for directories that works well with this kind of
>>>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
>>>> MIN might be 0, 5, 10, ..., 55 for example).
>
>>> As to the problem above and this kind of solution, I reckon that
>>> it is utterly absurd (and I could have used much stronger words).
>
>> When you deal with systems that store millions of files,
>
> Millions of files may work, but 1 billion is an utter absurdity.
> A filesystem that can reasonably store 1 billion small files in
> 7TB is an unsolved research issue...

Strangely enough, I have been testing ext4 and stopped filling it at a bit over 
1 billion 20KB files on Monday (with 60TB of storage).

Running fsck on it took only 2.4 hours.


>
> The obvious thing to do is to use a database, and there is no
> way around this point.

Everything has a use case. I am certainly not an anti-DB person, but your 
assertion alone is not convincing.

>
> If one genuinely needs to store a lot of files, why not split
> them into many independent filesystems? A single large one is
> only needed to allow for hard linking or for having a single large
> space pool, and in applications where the directory structure
> above makes any kind of sense, neither is usually required.

Splitting a big file system into small ones means that you (the application or 
sys admin) must load balance where to put new files instead of having the system 
do it for you.
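
Just to make that point concrete, the placement logic the application has to
carry ends up looking something like this (a hypothetical sketch; the hashing
and the free-space threshold are made up for illustration, not from any real
system):

import hashlib, os

def pick_filesystem(mounts, object_name, min_free_bytes=1 << 30):
    # Hash the object name to a starting mount point, then walk the list
    # until one with enough free space is found.
    start = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % len(mounts)
    for i in range(len(mounts)):
        mount = mounts[(start + i) % len(mounts)]
        st = os.statvfs(mount)
        if st.f_bavail * st.f_frsize >= min_free_bytes:
            return mount
    raise RuntimeError("no filesystem has enough free space")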



>> you pretty much always are going to use some kind of made up
>> directory layout.


The use case for big file systems with lots of small files (at least the one
that I know of) is object-based file systems, where files usually have odd,
machine-generated names (think GUIDs with time stamps and digital
signatures).

These are pretty trivial to map into the time based directory scheme I mentioned 
before.
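
For illustration, mapping a GUID-style object name into that layout is just a
date calculation (a rough Python sketch, names and details just for
illustration; the 5 minute bucket matches the 0, 5, 10, ..., 55 example):

import os, time

def object_path(root, object_name, ts=None):
    # Place the object under YEAR/MONTH/DAY/HOUR/MIN, with MIN rounded
    # down to a 5 minute bucket (0, 5, 10, ..., 55).
    t = time.gmtime(ts if ts is not None else time.time())
    return os.path.join(root,
                        "%04d" % t.tm_year, "%02d" % t.tm_mon,
                        "%02d" % t.tm_mday, "%02d" % t.tm_hour,
                        "%02d" % ((t.tm_min // 5) * 5),
                        object_name)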

>
> File systems are usually used for storing somewhat unstructured
> information, not records that can be looked up with a simple
> "YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for
> something like a simple DBMS.
>
> There is even a tendency to move filesystems into databases, as
> they scale a lot better.
>
> And for cases where a filesystem still makes sense I would
> rather use, instead of the inane manylevel directory structure
> above, a file system design with proper tree indexes and perhaps
> even one with the ability to store small files into inodes.
>
> [ ... ]

Have you tried to build a production DB with 1 billion records, or done
experiments comparing fs vs. DB schemes?

>
>> You can always try to write 1 million files in a single
>> subdirectory,
>
> Again, I'd rather avoid anything like that.
>
>> but if you are writing your own application, using this kind
>> of scheme is pretty trivial.
>
> And an utter absurdity, for 1 billion files in 200k directories.
> Both on its own merits and compared to the OBVIOUS alternative.
>
>>> If anything, consider the obvious (obvious except to those
>>> who want to use a filesystem as a small record database),
>>> which is 'fsck' time, in particular given the structure of
>>> 'ext3' (or 'ext4') metadata.
>
>> fsck time has improved quite a lot recently with ext4 (and
>> with xfs).
>
> How many months do you think a 7TB filesystem with 1 billion
> files would take to 'fsck' even with those improvements? Even
> with the nice improvements?
>

20KB files written to ext4 run at around 3,000 files/sec. It took us about 4 
days to fill it to 1 billion files and 2.4 hours to fsck.
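
(Back of the envelope: 10^9 files / 3,000 files/sec is about 3.3 x 10^5
seconds, or a bit under 4 days, so the two numbers line up.)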

Not to be mean, but I have worked in this exact area and have benchmarked both
large DB instances and large file systems. Good use cases exist for both, but
the facts do not back up your "DB is the only solution" proposal :-)

ric



