Many small files, best practise.

Mon Sep 14 11:34:39 UTC 2009

On 09/14/2009 05:40 AM, Peter Grandi wrote:
>    
>>> RHEL 5.3
>>> ~1000.000.000 files (1-30k)
>>> ~7TB in total
>>> //
>>>        
>    
>>> I'm looking for a best practice when implementing this using
>>> EXT3 (or some other FS if it shouldn't do the job.).
>>>        
> "best practice" would be a rather radical solution.
>
>    
>>> On average the reads dominate (99%), writes are only used for
>>> updating and isn't a part of the service provided.  The data
>>> is divided into 200k directories with each some 5k files.
>>> This ratio (dir/files) can be altered to optimize FS
>>> performance.
>>>        
>    
>> If you are writing to a local S-ATA disk, ext3/4 can write a
>> few thousand files/sec without doing any fsync() operations.
>> With fsync(), you will drop down quite a lot.
>>      
> Unfortunately using 'fsync' is a good idea for production
> systems.
>
> Also note that in order to write 10^9 files at 10^3/s rate takes
> 10^6 seconds; roughly 10 days to populate the filesystem (or at
> least that to restore it from backups).
>
>    

One thing you can do for bulk loads of files (say, during a restore or
migration) is to use a two-phase write. First, write a batch of files
(say 1000 at a time) without calling fsync(), then go back and
reopen/fsync/close each of them.

This gives you performance close to not using fsync() at all while
still giving you good data integrity. It is usually a good fit for this
class of operation, since you can always restart the bulk load after a
crash or error.
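
A rough sketch of the two-phase write in Python, in case it helps. The
function names and the (name, data) input format are made up for
illustration, and it assumes Linux, where fsync() on a descriptor opened
read-only flushes the file:

```python
import itertools
import os


def write_batch_two_phase(directory, files):
    """Write a batch of (name, data) pairs, then fsync them in a second pass."""
    paths = []
    # Phase 1: write every file in the batch without forcing it to disk.
    for name, data in files:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    # Phase 2: go back and reopen/fsync/close each file.
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return paths


def bulk_load(directory, files, batch_size=1000):
    """Feed an iterable of (name, data) pairs through in batches; return the count."""
    it = iter(files)
    count = 0
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return count
        write_batch_two_phase(directory, batch)
        count += len(batch)
```

If the load is interrupted, only the batch in flight is in doubt, so you
can restart from the last batch you know completed.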

To give this a try, you can use "fs_mark" to write, say, 100k files,
either with an fsync after each file (-S 1, its default) or with one of
the batch fsync modes (-S 3, for example).

>> One layout for directories that works well with this kind of
>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
>> MIN might be 0, 5, 10, ..., 55 for example).
>>      
> As to the problem above and this kind of solution, I reckon that
> it is utterly absurd (and I could have used much stronger words).
>    

When you deal with systems that store millions of files, you are pretty
much always going to use some kind of made-up directory layout. The
above scheme works well because it correlates with normal usage
patterns and queries (and tends to have those subdirectories laid out
contiguously).

You can always try to write 1 million files in a single subdirectory, 
but if you are writing your own application, using this kind of scheme 
is pretty trivial.
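
To show how little code the time-based layout needs, here is a minimal
sketch. The function name, the root path and the zero-padded directory
names are my own assumptions; only the YEAR/MONTH/DAY/HOUR/MIN shape with
5-minute buckets comes from the scheme above:

```python
import os
from datetime import datetime


def time_bucket_dir(root, ts, bucket_minutes=5):
    """Map a timestamp to root/YEAR/MONTH/DAY/HOUR/MIN, with MIN rounded down."""
    # Round the minute down to the start of its bucket (0, 5, 10, ..., 55).
    minute = ts.minute - ts.minute % bucket_minutes
    return os.path.join(
        root,
        f"{ts.year:04d}",
        f"{ts.month:02d}",
        f"{ts.day:02d}",
        f"{ts.hour:02d}",
        f"{minute:02d}",
    )


# A file stamped 2009-09-14 11:34:39 lands in the 11:30 bucket.
example = time_bucket_dir("/data", datetime(2009, 9, 14, 11, 34, 39))
```

The application then calls os.makedirs(path, exist_ok=True) before
writing, so directories are only created for buckets that actually
receive files.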

>    BTW, the sort of people who consider seriously such utter
>    absurdities try to do a thorough job, and I don't want to
>    know how the underlying storage system is structured :-).
>
> If anything, consider the obvious (obvious except to those who
> want to use a filesystem as a small record database), which is
> 'fsck' time, in particular given the structure of 'ext3' (or
> 'ext4') metadata.
>    

fsck time has improved quite a lot recently with ext4 (and with xfs).

> So: just don't use a filesystem as a database, spare us the
> horror; use a database, even a simple one, which is not utterly
> absurd.
>
> Compare these two:
>
>    http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html
>    

In this case, doing the bulk load I described above (reading in sorted
order and writing out in the same order) would significantly reduce the
restore time.

>    http://lists.gllug.org.uk/pipermail/gllug/2005-October/055488.html
>
> Anyhow I do see a lot of inane questions and "solutions" like
> the above in various lists (usually the XFS one, which attracts
> a lot of utter absurdities).
>
>    
>> When reading files in ext3 (and ext4) or doing other bulk
>> operations like a large deletion, it is important to sort the
>> files by inode (do the readdir, get say all of the 5k files in
>> your subdir and then sort by inode before doing your bulk
>> operation).
>>      
> Good idea, but it is best to avoid the cases where this matters.
>
>
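
For what it is worth, the sort-by-inode step quoted above is only a few
lines. This is a sketch, with a function name of my own choosing, that
uses Python's os.scandir() to do the readdir and pick up the inode
numbers in the same pass:

```python
import os


def files_sorted_by_inode(directory):
    """Return the paths of regular files in `directory`, ordered by inode number."""
    with os.scandir(directory) as it:
        # DirEntry.inode() normally comes from the readdir data, so no extra stat.
        entries = [
            (entry.inode(), entry.path)
            for entry in it
            if entry.is_file(follow_symlinks=False)
        ]
    entries.sort()
    return [path for _, path in entries]
```

You would then do the read, copy or unlink pass over the returned list
instead of over the raw readdir order.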