Many small files, best practise.

Mon Sep 21 13:54:44 UTC 2009

[ ... whether 1 billion 7KB (average) records are best stored in
a database or 1 per file in a file system ... ]

>>> One thing that you can do when doing bulk loads of files
>>> (say, during a restore or migration), is to use a two phase
>>> write. First, write each of a batch of files (say 1000 files
>>> at a time), then go back and reopen/fsync/close them.
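
For reference, the two phase write quoted above could look
roughly like the sketch below. It is only an illustration: the
function name, the name-to-bytes mapping and the batch size
parameter are assumptions of mine, not anything from the
original poster.

```python
import os

def two_phase_write(directory, records, batch_size=1000):
    """Write records (name -> bytes) in batches, then fsync each batch."""
    items = list(records.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Phase 1: write every file of the batch, without forcing
        # anything to stable storage yet.
        for name, data in batch:
            with open(os.path.join(directory, name), "wb") as f:
                f.write(data)
        # Phase 2: go back and reopen/fsync/close each file.
        for name, _ in batch:
            fd = os.open(os.path.join(directory, name), os.O_WRONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
    return len(items)
```

The idea is that by the time the second pass runs most of the
batch has probably been written back already, so the per-file
'fsync' calls should cost less than syncing after each write.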

>> Why not just restore a database?

> If you started with a database, that would be reasonable. If
> you started with a file system, I guess I don't understand
> what you are suggesting.

Well, the topic of this discussion is whether one *should* start
with a database for the "lots of small records" case. 

It is not a new topic by any means -- there have been many
debates in the past as to how silly it is to have immense
file-per-message news/mail spool archives with lots of little
files. The outcome has always been to store them in databases of
one sort or another.

>>>>> One layout for directories that works well with this kind
>>>>> of thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN
>>>>> where MIN might be 0, 5, 10, ..., 55 for example).

>>> As to the problem above and this kind of solution, I reckon
>>> that it is utterly absurd (and I could have used much
>>> stronger words).

>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter
>> absurdity.  A filesystem that can store reasonably 1 billion
>> small files in 7TB is an unsolved research issue ... [
>> ... and fsck ... ]

> Strangely enough, I have been testing ext4 and stopped filling
> it at a bit over 1 billion 20KB files on Monday (with 60TB of
> storage).

Is that a *reasonable* use of a filesystem? Have you compared to
storing 1 billion 20KB records in a simple database?

As an aside, 20KB is no longer that much in the "small files"
range. For example, one reason why storing records as "small
files" is a stupid idea is the enormous internal fragmentation
caused by 4KiB allocation granularity, which swells space used
too. Even for
the original problem, which was about:

  > ~1000.000.000 files (1-30k)
  > ~7TB in total

that is presumably lots of files under 4KiB if the average file
size is 7KB in a range between 1-30KB.
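
The cost of that allocation granularity is easy to estimate. A
minimal sketch, assuming every file occupies a whole number of
4KiB blocks (so no tail packing or inline data) and using a
made-up list of sizes within the 1-30KB range:

```python
def allocated_bytes(size, block=4096):
    """Space taken by a file of `size` bytes allocated in whole blocks."""
    return -(-size // block) * block  # ceiling division

def slack_fraction(sizes, block=4096):
    """Fraction of the allocated space that holds no data."""
    data = sum(sizes)
    used = sum(allocated_bytes(s, block) for s in sizes)
    return (used - data) / used

# Hypothetical sample of record sizes in bytes, not measured data.
sizes = [1000, 3000, 7000, 20000, 30000]
print("slack: %.1f%%" % (100 * slack_fraction(sizes)))
```

The smaller the files are relative to the block size, the larger
the wasted fraction becomes.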

Also looking at my humble home system, at the root filesystem
and a media (RPMs, TARs, ZIPs, JPGs, ISOs, ...) archival
filesystem (both JFS):

  base# df / /fs/basho
  Filesystem           1M-blocks      Used Available Use% Mounted on
  /dev/sdb1                11902      9712      2191  82% /
  /dev/sda8               238426    228853      9573  96% /fs/basho
  base# df -i / /fs/basho
  Filesystem            Inodes   IUsed   IFree IUse% Mounted on
  /dev/sdb1            4873024  359964 4513060    8% /
  /dev/sda8            19738976  126493 19612483    1% /fs/basho

I see that files under 4K are the vast majority on one and a
large majority on the other:

  base# find / -xdev -type f -size -4000 | wc -l
  305064
  base# find /fs/basho -xdev -type f -size -4000 | wc -l
  107255

Anyhow, while some people do make (because they do "work")
filesystems with millions and even billions of inodes and/or
60TB capacities (on 60+1 RAID5s sometimes), the question is
whether that makes sense or is an absurdity, on its own merits
and when compared to a database.

That something stupid can be done is not an argument for doing it.

The arguments I referred to in my original comments show just
how expensive it is to misuse a directory hierarchy in a filesystem
as if it were an index in a database, by comparing them:

 "I have a little script, the job of which is to create a lot of
  very small files (~1 million files, typically ~50-100 bytes each)."
 "It's a bit of a one-off (or twice, maybe) script, and
  currently due to finish in about 15 hours,"

 "creates a Berkeley DB database of K records of random length
  varying between I and J bytes,"
 "So, we got 130MiB of disc space used in a single file, >2500
  records sustained per second inserted over 6 minutes and a half,"

Perhaps 50-100 bytes is a bit extreme, but still compare "due to
finish in about 15 hours" with "6 minutes and a half".

Now, in that case a large part of the speedup is that the
records were small enough that 1m of them as a database would
fit into memory (that BTW was part of the point why using a
filesystem for that was utterly absurd).

I'd rather not do a test with 1G 6-7KB records on my (fairly
standard, small, 2GHz CPU, 2GiB RAM) home PC, but 1M 6-7KB
records is of course feasible, and on a single modern disk with
1 TB (and a slightly prettified updated script using BTREE) I
get (1M records with a 12 byte key, record length random between
2000 and 10000 bytes):

  base# rm manyt.db
  base# time perl manymake.pl manyt.db 1000000 2000 10000
    1 percent done, 990000 to go
    2 percent done, 980000 to go
    3 percent done, 970000 to go
  ....
   98 percent done, 20000 to go
   99 percent done, 10000 to go
  100 percent done, 0 to go

  real	81m6.812s
  user	0m29.957s
  sys	0m30.124s
  base# ls -ld manyt.db 
  -rw------- 1 root root 8108961792 Sep 19 20:36 manyt.db

The creation script flushes every 1% too, but from the pathetic
peak 3-4MB/s write rate it is pretty obvious that on my system
things don't get cached a lot (by design...).

As to reading, 10000 records at random among those 1M:

  base# time perl manyseek.pl manyt.db 1000000 10000
    1 percent done, 9900 to go
    2 percent done, 9800 to go
    3 percent done, 9700 to go
  ....
   98 percent done, 200 to go
   99 percent done, 100 to go
  100 percent done, 0 to go
  average length: 5984.4108

  real	7m22.016s
  user	0m0.210s
  sys	0m0.442s

That is on the slower half of a 1T drive in a half empty JFS
filesystem. That's 200/s 6KB average records inserted, and about
22/s looked up, which is about as good as the drive can do, all
in a single 8GB file. Sure, that is a lot slower than with 50-100
byte records, as the database can no longer fit into memory, but
it is still way off "due to finish in about 15 hours". Sure, the
system I used for the new
test is a bit faster than the one used for the "in about 15
hours" test, but we are still talking one arm, which is largely
the bottleneck.

But wait -- I am JOKING, because it is ridiculous to load a 1M
record dataset into an indexed database one record at a time.

Sure it is *possible*, but any sensible database has a bulk
loader that builds the index after loading the data. So in any
reasonable scenario the difference when *restoring* a backed-up
filesystem will be rather bigger than for the scenario above.
Sure, some file systems have 'dump' like tools that help, but
they don't recreate a nice index, they just restore it. Ah well.
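
To illustrate the "load first, index afterwards" idea, here is a
minimal sketch. It uses SQLite via Python's standard 'sqlite3'
module rather than the Berkeley DB used in my tests, and the
table name, index name and record generator are all made up for
the example.

```python
import random
import sqlite3

def bulk_load(path, count, min_len, max_len, seed=0):
    """Load `count` random-length records, then build the key index."""
    rng = random.Random(seed)
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE records (key TEXT, body BLOB)")
    # 12 byte keys, bodies of zero bytes with a random length.
    rows = (("%012d" % i, bytes(rng.randint(min_len, max_len)))
            for i in range(count))
    with db:  # a single transaction for the whole load
        db.executemany("INSERT INTO records VALUES (?, ?)", rows)
    # The index is built in one pass once all the data is in place.
    db.execute("CREATE UNIQUE INDEX records_key ON records (key)")
    db.commit()
    return db

db = bulk_load(":memory:", 1000, 2000, 10000)
row = db.execute("SELECT length(body) FROM records WHERE key = ?",
                 ("000000000042",)).fetchone()
print(row[0])
```

Whether and by how much this beats record-at-a-time insertion
depends on the database and the hardware; I have not timed this
particular sketch.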

Now let's see a much bigger scale test:

> [ ... ] testing ext4 and stopped filling it at a bit over 1
> billion 20KB files on Monday (with 60TB of storage). Running
> fsck on it took only 2.4 hours. [ ... ]

> [ ... ] 20KB files written to ext4 run at around 3,000
> files/sec. It took us about 4 days to fill it to 1 billion
> files [ ... ]

That sounds like you did use 'fsync' per file or something
similar, as you had written:

>>>> If you are writing to a local S-ATA disk, ext3/4 can write a
>>>> few thousand files/sec without doing any fsync() operations.
>>>> With fsync(), you will drop down quite a lot.

and here you report around 3000/s over a 60TB array.

Then 20KBx3000/s is 60MB/s -- a rather unimpressive score for a
60TB filesystem (presumably spread over 60 drives or more), even
with 'fsync'. And the record creation rate itself looks like
about 50 records/s per drive. That is rather disappointing. Yes,
they are larger files, but that should not cause that much
slowdown.
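
The arithmetic behind those figures, as a small sketch. The
count of 60 drives is my assumption (one drive per TB); the
other numbers are the ones reported.

```python
def fill_stats(files, file_kb, files_per_s, drives):
    """Aggregate and per-drive rates for a bulk file creation run."""
    return {
        "mb_per_s": file_kb * files_per_s / 1000,
        "files_per_s_per_drive": files_per_s / drives,
        "days_to_fill": files / files_per_s / 86400,
    }

# 1 billion 20KB files at 3,000 files/s, assuming 60 drives.
stats = fill_stats(1_000_000_000, 20, 3000, 60)
for name, value in stats.items():
    print("%-22s %.2f" % (name, value))
```

The fill time comes out a little under 4 days, which agrees with
the "about 4 days" reported.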

Also, the storage layout is not declared (except that you are
storing 20TB of data in 60TB of drives, which is a bit of a cheat),
and it would be also quite interesting to see the output of that
'fsck' run:

> and 2.4 hours to fsck.

But that is an unreasonable test, even if it is the type of test
popular with some file system designers, precisely because...

Testing file system performance just after loading is a naive or
cheating exercise, especially with 'ext4' (and 'ext3'), as after
loading all those inodes and files are going to be nearly
optimally laid out (e.g. list of inode numbers in a directory
pretty much sequential), and with 'ext4' each file will consist
of a single extent (hopefully), so less metadata.

But a filesystem that simulates a simple small object database
will as a rule not be so lucky; it will grow and be modified.

Even worse, 'fsck' on a filesystem *without damage* is just an
exercise in enumerating inodes and other metadata. What is
interesting is what happens when there is damage and 'fsck' has
to start cross-correlating metadata.

So here are some more realistic 'fsck' estimates from other
filesystems and other times, which should be very familiar to
those considering utterly absurd designs:

  http://ukai.org/b/log/debian/snapshot

   "long fsck on disks for old snapshot.debian.net is completed
    today. It takes 75 days!"

   "It still fsck for a month....

    root      6235 36.1 59.7 1080080 307808 pts/2 D+  Jun21 15911:50 fsck.ext3 /dev/md5"

That was I think before some improvements to 'ext3' checking.

  http://groups.google.com/group/linux.debian.ports.x86-64/msg/fd2b4d46a4c294b5

   "Keep in mind if you go with XFS, you're going to need 10-15
    gig of memory or swap space to fsck 6tb.. it needs about 9
    gig to xfs_check, and 3 gig to xfs_repair a 4tb array on one
    of my systems.. oh, and a couple days to do either. :)"

   "> Generally, IMHO no. A fsck will cost a lot of time with
    > all filesystems.

    Some worse than others though.. looks like this 4tb is going
    to take 3 weeks.. it took about 3-4 hours on ext3.. If i had
    a couple gig of ram to put in the server that'd probably
    help though, as it's constantly swapping out a few meg a
    second."

  http://lists.us.dell.com/pipermail/linux-poweredge/2007-November/033821.html

   "> I'll definitely be considering that, as I already had to
    > wait hours for fsck to run on some 2 to 3TB ext3
    > filesystems after crashes. I know it can be disabled, but
    > I do feel better forcing a complete check after a system
    > crash, especially if the filesystem had been mounted for
    > very long, like a year or so, and heavily used.

    The decision process for using ext3 on large volumes is simple:

    Can you accept downtimes measured in hours (or days) due to
    fsck? No - don't use ext3."

  http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/

   "Yesterday I had fun time repairing 1.5Tb ext3 partition,
    containing many millions of files. Of course it should have
    never happened - this was decent PowerEdge 2850 box with RAID
    volume, ECC memory and reliable CentOS 4.4 distribution but
    still it did. We had "journal failed" message in kernel log
    and filesystem needed to be checked and repaired even though
    it is journaling file system which should not need checks in
    normal use, even in case of power failures. Checking and
    repairing took many hours especially as automatic check on
    boot failed and had to be manually restarted."

Another factor is just how "complicated" the filesystem is, and
for example 'fsck' times with large numbers of hard links can be
very bad (and there are quite a few use cases like 'rdiff-backup').

Also, what about the few numbers you mention above? The 2.4
hours for 1 billion files mean around 115K inodes examined per
second. Now 60TB probably means like 60 1TB drives to store 20TB
of data, a pretty large degree of parallelism. Ts'o reports:

  http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/

which shows that on a single (laptop) drive an 800K inode/90GB
'ext4' filesystem could be checked in 63s or around 12K inodes/s
per drive, rather than the less than 2K per drive of your test.
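
The comparison can be checked with a few lines. Again the 60
drives are an assumption on my part, since the storage layout
was not declared.

```python
def inodes_per_second(inodes, seconds):
    """Average rate at which 'fsck' gets through the inodes."""
    return inodes / seconds

# 1 billion files checked in 2.4 hours, assumed spread over 60 drives.
array_rate = inodes_per_second(1_000_000_000, 2.4 * 3600)
array_per_drive = array_rate / 60

# 800K inodes checked in 63s on a single laptop drive.
laptop_rate = inodes_per_second(800_000, 63)

print("array:     %.0f inodes/s, %.0f per drive" % (array_rate, array_per_drive))
print("one drive: %.0f inodes/s" % laptop_rate)
print("per-drive ratio: %.1f" % (laptop_rate / array_per_drive))
```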

There seems to be a scalability problem -- but of course: one of
the "unsolved research issue"s is that while read/write/etc. can
be parallelized (for large files) by using wide RAIDs, it is not
so easy to parallelize 'fsck' (except by using multiple mostly
independent filesystems).

[ ... ]

> The use case for big file systems with lots of small files (at
> least the one that I know of) is for object based file systems
> where files usually have odd, non-humanly generated file names
> (think guids with time stamps and digital signatures).

> These are pretty trivial to map into the time based directory
> scheme I mentioned before.

And it is utterly absurd to do so (see below).

> [ ... ] benchmarked both large DB instances and large file
> systems.  Good use cases exist for both, but the facts do not
> back up your DB is the only solution proposal :-)

Sure, large filesystems (to a point, which for me is the single
digit TB range) with large files have their place, even if
people seem to prefer metafilesystems like Lustre even for those,
for good reasons.

But the discussion is whether it makes sense, for a case like 1G
records averaging about 7KB, to use a filesystem with 200K
directories of 5K files each (or something similar), one file
per record, or a database with a nice overall index and a single
or a few files for all records.

Your facts above show that it is *possible* to create a similar
(1G x 20K records) filesystem, and that it seems to make a rather
poor use of a very large storage system.

The facts that I referred to in my original comment show that
there is a VERY LARGE performance difference between using a
filesystem as a (very) small-record database for just 1M
records, and a PRETTY LARGE difference even for 6KB records, and
that while doing something stupid on the database side.

In the end the facts just confirm the overall discussion that
I referred to in my original comment:

  http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html

   "* The size of the tree will be around 1M filesystem blocks on
      most filesystems, whose block size usually defaults to 4KiB,
      for a total of around 4GiB, or can be set as low as 512B, for
      a total of around 0.5GiB.

    * With 1,000,000 files and a fanout of 50, we need 20,000
      directories above them, 400 above those and 8 above those.
      So 3 directory opens/reads every time a file has to be
      accessed, in addition to opening and reading the file.

    * Each file access will involve therefore four inode accesses
      and four filesystem block accesses, probably rather widely
      scattered. Depending on the size of the filesystem block and
      whether the inode is contiguous to the body of the file this
      can involve anything between 32KiB and 2KiB of logical IO per
      file access.

    * It is likely that of the logical IOs those relating to the two
      top levels (those comprising 8 and 400 directories) of the
      subtree will be avoided by caching between 200KiB and 1.6MiB,
      but the other two levels, the 20,000 bottom directories and
      the 1,000,000 leaf files, won't likely be cached."
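
The figures in that quote can be reproduced with a short sketch.
It assumes a perfectly even fanout and one filesystem block per
directory, which is a simplification of mine for the caching
estimate.

```python
def directory_levels(files, fanout):
    """Directory counts per level, bottom level first, for a fixed fanout."""
    levels = []
    n = files
    while n > fanout:
        n = -(-n // fanout)  # ceiling division
        levels.append(n)
    return levels

levels = directory_levels(1_000_000, 50)  # [20000, 400, 8]
accesses = len(levels) + 1                # 3 directories plus the file itself
top_two = levels[-1] + levels[-2]         # the 8 and 400 directory levels

print(levels, accesses)
print("cache for top two levels: %.0fKiB to %.0fKiB"
      % (top_two * 512 / 1024, top_two * 4096 / 1024))
```

With 512B and 4KiB blocks the top two levels come to 204KiB and
1632KiB, in line with the "between 200KiB and 1.6MiB" above.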

These are pretty elementary considerations, and boil down to the
issue of whether for a given dataset of "small" records the best
index structure is a tree of directories or a nicely balanced
index tree, and whether the "small" records should be at most
one per (4KiB usually) block or can share blocks, and there is
little doubt that the latter wins pretty big.

Your proposed directory based index "YEAR/MONTH/DAY/HOUR/MIN"
seems to me particularly inane, as it has a *fixed fanout*, of
12 at the "MONTH" level, around 30 at the "DAY" level, 24 at the
hour level, and 60 at the "MIN" level with no balancing. Fine if
the record creation rate is constant.

Perhaps not -- it involves 500K "MIN" directories per year.
If we create 1G files per year we get around 2K files per "MIN"
directory, each of which is then likely to be a few 4KiB blocks
long. Fabulous :-).
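
A small sketch of that layout and of the counts involved. The
function names and the 'minute_step' parameter are made up for
the example; a step of 1 gives the 60 "MIN" directories per hour
assumed above, a step of 5 gives the 0, 5, ..., 55 variant.

```python
from datetime import datetime

def bucket_path(ts, minute_step=1):
    """Map a timestamp to its YEAR/MONTH/DAY/HOUR/MIN directory."""
    minute = ts.minute - ts.minute % minute_step
    return "%04d/%02d/%02d/%02d/%02d" % (
        ts.year, ts.month, ts.day, ts.hour, minute)

def min_directories_per_year(minute_step=1, days=365):
    """Number of bottom-level "MIN" directories created in a year."""
    return days * 24 * (60 // minute_step)

print(bucket_path(datetime(2009, 9, 21, 13, 54, 44), minute_step=5))
for step in (1, 5):
    dirs = min_directories_per_year(step)
    print(step, dirs, round(1_000_000_000 / dirs))
```

Note that the per-directory counts are averages that only hold
for a constant creation rate; any burst lands in a single
bottom-level directory, since nothing rebalances the tree.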

Sure, it is a *doable* structure, but it is not *reasonable*,
especially if one knows the better alternative.

Overall the data and arguments above suggest that:

* Large filesystems (2 digits TB and more) usually should be
  avoided.

* Filesystems with large numbers (more than a few millions) of
  files, even large files, should be avoided.

* Large filesystems with a large number of small (around 4KiB)
  inodes (not just files) are utterly absurd, on their own
  merits, and even more so when compared with a database.

* Two big issues are that while parallel storage scales up data
  performance, it does not do that well with metadata, and in
  particular metadata crawls such as 'fsck' are hard to
  parallelize (they are hard even when they in effect resolve
  just in mostly-linear scans).

* If one *has* to have any of the above, separate filesystems,
  and/or filesystems based on a database-like design (e.g. based
  on indices throughout like HFS+ or Reiser3 or to some degree
  JFS and even XFS) may be the lesser evils, even if they have
  some limitations. But that is still fairly crazy. 'ar' files
  for one thing were invented decades ago precisely because
  lots of small files and filesystems are a bad combination.

These are conclusions well supported by experiment, data and
simple reasoning as in the above. I should not have to explain
these pretty obvious points in detail -- that databases are much
better for large collections of small records is not exactly a
recent discovery.

Sure, a lot of people "know better" and adopt what I call the
"syntactically valid" approach, where if a combination is
possible it is then fine. Good luck!