Poor Performance When Number of Files > 1M
John Kalucki
ext3 at kalucki.com
Wed Jun 11 22:25:17 UTC 2008
Ric Wheeler wrote:
> Eric Sandeen wrote:
>> John Kalucki wrote:
>>
>>
>>> Performance seems to always map directly to the number of files in
>>> the ext3 filesystem.
>>>
>>> After some initial run-fast time, perhaps once dirty pages begin to
>>> be written aggressively, for every 5,000 files added, my files
>>> created per second tends to drop by about one. So, depending on the
>>> variables, say with 6 RAID10 spindles, I might start at ~700
>>> files/sec, quickly drop, then more slowly drop to ~300 files/sec at
>>> perhaps 1 million files, then see 299 files/sec for the next 5,000
>>> creations, 298 files/sec, etc. etc.
>>>
>>> As you'd expect, there isn't much CPU utilization, other than
>>> iowait, and some kjournald activity.
>>>
>>> Is this a known limitation of ext3? Is expecting to write to
>>> O(10^6)-O(10^7) files in something approaching constant time
>>> expecting too much from a filesystem? What, exactly, am I stressing
>>> to cause this unbounded performance degradation?
>>>
>>
>> I think this is a linear search through the block groups for the new
>> inode allocation, which always starts at the parent directory's block
>> group; and starts over from there each time. See find_group_other().
>>
>> So if the parent's group is full and so are the next 1000 block groups,
>> it will search 1000 groups and find space in the 1001st. On the next
>> inode allocation it will re-search(!) those 1000 groups, and again find
>> space in the 1001st. And so on. Until the 1001st is full, and then
>> it'll search 1001 groups and find space in the 1002nd... etc (If I'm
>> remembering/reading correctly, but this does jibe with what you see).
>>
>> I've toyed with keeping track (in the parent's inode) where the last
>> successful child allocation happened, and start the search there. I'm a
>> bit leery of how this might age, though... plus I'm not sure if it
>> should be on-disk or just in memory.... But this behavior clearly needs
>> some help. I should probably just get it sent out for comment.
>>
>> -Eric
>>
>>
> I run a very similar test, but normally run with a synchronous write
> work load (i.e., fsync before close). In my testing, you will see a
> slow but gradual decline in the files/sec. For example, on a 1TB S-ATA
> drive, the latest test run started off at a rate of 22 files/sec (each
> file is 40k) and is currently chugging along at a bit over 17
> files/sec when it has hit 2.8 million files in one directory. I am
> using the ext3 run to get a baseline for a similar run of xfs and btrfs.
>
> One other random tuning thought - you can help by writing into
> separate directories, but you will need to make sure that you don't
> produce a random write pattern when you select your target
> subdirectory. I think that the use case mentioned using a hashed
> directory structure which is fine, but you want to hash in a way that
> writes into a shared subdirectory for some period of time (say get a
> rotation of every X files or Y seconds). Easiest way to do this is to
> use a GUID with a time stamp and hash on the time stamp bits.
>
> Note that there is a multi-threaded performance bug in ext3 (Josef
> Bacik had looked at fixing this) which throttles writes/sec down to
> around 230 when you do synchronous transactions so you might be
> hitting that as well.
>
> ric
Unfortunately, I don't have the opportunity to limit the directories. My
application is taking random-ish data and organizing it into logical
groups for subsequent quick reading. But I did take your suggestion into
account and it contains what seems to be the important nugget -- too
many active directories makes a bad situation worse.
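
For reference, the time-based rotation Ric describes could be sketched roughly as below. This is only an illustration: the function name, the root path, and the rotation_seconds and num_buckets parameters are all made up here, and nothing like this exists in my application yet.

```python
def bucket_dir(root, timestamp, rotation_seconds=60, num_buckets=1024):
    """Pick a target subdirectory from the time stamp alone.

    Every file created within the same rotation window lands in the same
    subdirectory, so writes stay concentrated in one directory for a while
    instead of scattering randomly across all of them.
    All names and defaults are hypothetical.
    """
    window = int(timestamp) // rotation_seconds   # index of the current time window
    bucket = window % num_buckets                 # wrap around to bound the directory count
    return f"{root}/{bucket:04x}"


if __name__ == "__main__":
    import time
    print(bucket_dir("/data", time.time()))
```

The trade-off is that files are grouped by arrival time rather than by logical group, which is exactly what my workload cannot accept.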
But still, my test reaches a steady state of active directories pretty
quickly -- or so I'd like to think. The performance does indeed continue
to creep downwards.
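
To convince myself that Eric's explanation matches this creep, here is a toy model of the search he describes. It is not the kernel code and does not reproduce find_group_other(); the group counts, function name, and the remember_last switch (his idea of restarting from the last successful group) are assumptions for illustration only.

```python
def count_probes(num_groups, inodes_per_group, num_files,
                 parent_group=0, remember_last=False):
    """Return how many block groups each allocation had to examine.

    Toy model: every allocation scans groups linearly, starting at the
    parent directory's group. With remember_last=True the scan starts
    at the group where the previous allocation succeeded instead.
    """
    free = [inodes_per_group] * num_groups   # free inodes left in each group
    start = parent_group
    probes_per_alloc = []
    for _ in range(num_files):
        for offset in range(num_groups):
            group = (start + offset) % num_groups
            if free[group] > 0:
                free[group] -= 1
                probes_per_alloc.append(offset + 1)
                if remember_last:
                    start = group            # hypothetical hint kept with the parent
                break
        else:
            raise RuntimeError("no free inodes in any block group")
    return probes_per_alloc


if __name__ == "__main__":
    plain = count_probes(1000, 100, 50000)
    hinted = count_probes(1000, 100, 50000, remember_last=True)
    print("total probes without hint:", sum(plain))
    print("total probes with hint:   ", sum(hinted))
```

In the plain model the per-file cost grows by one probe each time a group fills up, which has the same shape as the steady one-file/sec-per-5,000-files decline I am seeing.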
I'm doing everything single-threaded. Introducing a second thread seems
to be an immediate disaster, even though I'm striped across 3 disks.
Unfortunate. Perhaps moving the journal to another filesystem would
allow better multi-threaded throughput, but I'm not sure that this is
important to me.
xfs, zfs, btrfs, and reiser could be attractive for my use-case.
Thanks for your response,
John