
Re: Many small files, best practise.

On 09/14/2009 05:40 AM, Peter Grandi wrote:
RHEL 5.3
~1,000,000,000 files (1-30k)
~7TB in total
I'm looking for a best practice when implementing this using
EXT3 (or some other FS if it shouldn't do the job.).
"best practice" would be a rather radical solution.

On average the reads dominate (99%); writes are only used for
updating and are not a part of the service provided.  The data
is divided into 200k directories, each holding some 5k files.
This ratio (dir/files) can be altered to optimize FS
If you are writing to a local S-ATA disk, ext3/4 can write a
few thousand files/sec without doing any fsync() operations.
With fsync(), you will drop down quite a lot.
Unfortunately using 'fsync' is a good idea for production

Also note that writing 10^9 files at a rate of 10^3/s takes
10^6 seconds; about 11.6 days to populate the filesystem (or at
least that to restore it from backups).
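
The back-of-envelope estimate above can be checked directly. This is only a sketch: the function name is made up for illustration, and the 10^3 files/s rate is the assumed figure from the text, not a measurement.

```python
SECONDS_PER_DAY = 86_400


def bulk_load_days(num_files: int, files_per_second: float) -> float:
    """Days needed to write num_files at a sustained files_per_second rate."""
    seconds = num_files / files_per_second
    return seconds / SECONDS_PER_DAY


# 10^9 files at an assumed 10^3 files/s is 10^6 seconds.
print(f"{bulk_load_days(10**9, 10**3):.1f} days")
```

Changing the rate argument shows how strongly the total depends on per-file cost, which is why the fsync() strategy matters so much for a restore.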

One thing that you can do when doing bulk loads of files (say, during a restore or migration) is to use a two-phase write. First, write each of a batch of files (say 1000 files at a time), then go back and reopen/fsync/close them.

This will give you performance levels closer to not using fsync() and still give you good data integrity. Note that this usually is a good fit for this class of operations since you can always restart the bulk load if you have a crash/error/etc.
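
A minimal sketch of the two-phase write described above, assuming Linux and Python's standard os module. The function name and the (name, data) batch format are made up for illustration. The final fsync of the directory is an extra step that the description above does not mention; it is included here on the assumption that the new directory entries should be made durable as well.

```python
import os


def write_batch_two_phase(directory, files):
    """Write a batch of (name, data) pairs, then fsync them in a second pass."""
    paths = []

    # Phase 1: write every file in the batch without calling fsync().
    for name, data in files:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    # Phase 2: go back, reopen each file, fsync it, and close it.
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Assumption, not from the post: also fsync the directory itself.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    return paths
```

A bulk loader would call this once per batch (say 1000 files) and record the last completed batch, so that a crashed load can be restarted from that point.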

To give this a try, you can use "fs_mark" to write say 100k files with an fsync after each file (-S 1, the default) or use one of the batch fsync modes (-S 3 for example).

One layout for directories that works well with this kind of
thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
MIN might be 0, 5, 10, ..., 55 for example).
As to the problem above and this kind of solution, I reckon that
it is utterly absurd (and I could have used much stronger words).

When you deal with systems that store millions of files, you are pretty much always going to use some kind of made-up directory layout. The above scheme works pretty well in that it correlates well to normal usage patterns and queries (and tends to have those subdirectories laid out contiguously).

You can always try to write 1 million files in a single subdirectory, but if you are writing your own application, using this kind of scheme is pretty trivial.
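
As a rough illustration of how little code the time-based layout takes, here is a sketch that maps a timestamp to a YEAR/MONTH/DAY/HOUR/MIN directory with 5-minute buckets. The function name, the root path and the zero-padding of each component are assumptions made for this example.

```python
from datetime import datetime
from pathlib import PurePosixPath


def time_bucket_dir(root: str, ts: datetime, bucket_minutes: int = 5) -> PurePosixPath:
    """Return the directory for ts, rounding minutes down to the bucket start."""
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return (
        PurePosixPath(root)
        / f"{ts.year:04d}"
        / f"{ts.month:02d}"
        / f"{ts.day:02d}"
        / f"{ts.hour:02d}"
        / f"{minute:02d}"
    )


# A file stamped 05:43 lands in the 05/40 bucket.
print(time_bucket_dir("/data", datetime(2009, 9, 14, 5, 43)))
```

Zero-padding keeps a plain lexical sort of the directory names in chronological order, which is handy for range queries over a time window.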

   BTW, the sort of people who consider seriously such utter
   absurdities try to do a thorough job, and I don't want to
   know how the underlying storage system is structured :-).

If anything, consider the obvious (obvious except to those who
want to use a filesystem as a small record database), which is
'fsck' time, in particular given the structure of 'ext3' (or
'ext4') metadata.

fsck time has improved quite a lot recently with ext4 (and with xfs).

So: just don't use a filesystem as a database, spare us the
horror; use a database, even a simple one, which is not utterly

Compare these two:


In this case, doing the bulk load I described above (reading in sorted order, writing out in the same), would significantly reduce the time of the restore.


Anyhow I do see a lot of inane questions and "solutions" like
the above in various lists (usually the XFS one, which attracts
a lot of utter absurdities).

When reading files in ext3 (and ext4) or doing other bulk
operations like a large deletion, it is important to sort the
files by inode (do the readdir, get say all of the 5k files in
your subdir and then sort by inode before doing your bulk
Good idea, but it is best to avoid the cases where this matters.
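
The sort-by-inode approach mentioned above might look like the following sketch, which uses os.scandir from Python's standard library. The function name is made up for illustration, and skipping anything that is not a regular file is an assumption of this example.

```python
import os


def paths_in_inode_order(directory):
    """List the regular files in directory, sorted by inode number."""
    with os.scandir(directory) as entries:
        pairs = [
            (entry.inode(), entry.path)
            for entry in entries
            if entry.is_file(follow_symlinks=False)
        ]
    pairs.sort()  # tuples sort by inode number first
    return [path for _, path in pairs]
```

A bulk read or deletion would then iterate over the returned list (for example calling os.unlink on each path) instead of using the order that readdir happens to return.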
