Poor Performance When Number of Files > 1M

Wed Jun 11 22:25:17 UTC 2008

Ric Wheeler wrote:
> Eric Sandeen wrote:
>> John Kalucki wrote:
>>
>>  
>>> Performance seems to always map directly to the number of files in 
>>> the ext3 filesystem.
>>>
>>> After some initial run-fast time, perhaps once dirty pages begin to 
>>> be written aggressively, for every 5,000 files added, my files 
>>> created per second tends to drop by about one. So, depending on the 
>>> variables, say with 6 RAID10 spindles, I might start at ~700 
>>> files/sec, quickly drop, then more slowly drop to ~300 files/sec at 
>>> perhaps 1 million files, then see 299 files/sec for the next 5,000 
>>> creations, 298 files/sec, etc. etc.
>>>
>>> As you'd expect, there isn't much CPU utilization, other than 
>>> iowait, and some kjournald activity.
>>>
>>> Is this a known limitation of ext3? Is expecting to write to 
>>> O(10^6)-O(10^7) files in something approaching constant time 
>>> expecting too much from a filesystem? What, exactly, am I stressing 
>>> to cause this unbounded performance degradation?
>>>     
>>
>> I think this is a linear search through the block groups for the new
>> inode allocation, which always starts at the parent directory's block
>> group; and starts over from there each time.  See find_group_other().
>>
>> So if the parent's group is full and so are the next 1000 block groups,
>> it will search 1000 groups and find space in the 1001st.  On the next
>> inode allocation it will re-search(!) those 1000 groups, and again find
>> space in the 1001st.  And so on.  Until the 1001st is full, and then
>> it'll search 1001 groups and find space in the 1002nd... etc. (If I'm
>> remembering/reading correctly, but this does jibe with what you see.)
>>
>> I've toyed with keeping track (in the parent's inode) of where the last
>> successful child allocation happened, and starting the search there.  I'm a
>> bit leery of how this might age, though... plus I'm not sure if it
>> should be on-disk or just in memory.... But this behavior clearly needs
>> some help.  I should probably just get it sent out for comment.
>>
>> -Eric
>>
>>   
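
The search Eric describes can be illustrated with a toy model. This is
only a sketch of the behavior as described in the message above, not the
kernel's find_group_other() code; the function name, parameters, and the
remember_last switch (standing in for Eric's "start where the last
successful child allocation happened" idea) are all hypothetical.

```python
def groups_examined(num_groups, inodes_per_group, num_files,
                    parent_group=0, remember_last=False):
    """Count block groups examined while allocating num_files inodes.

    Toy model only: every allocation scans linearly from a starting
    group until it finds one with a free inode.
    """
    free = [inodes_per_group] * num_groups
    start = parent_group
    examined = 0
    for _ in range(num_files):
        for step in range(num_groups):
            group = (start + step) % num_groups
            examined += 1
            if free[group] > 0:
                free[group] -= 1
                if remember_last:
                    # Hypothetical fix: resume the next search here
                    # instead of at the parent directory's group.
                    start = group
                break
        else:
            raise RuntimeError("no free inodes in any group")
    return examined


if __name__ == "__main__":
    # 4 groups of 2 inodes each, 6 files created in one directory.
    print(groups_examined(4, 2, 6))                      # 12
    print(groups_examined(4, 2, 6, remember_last=True))  # 8
```

In this model the cost of each allocation grows with the number of full
groups after the parent's group, which would be consistent with a create
rate that keeps drifting down as files are added. Remembering the last
successful group keeps the per-allocation cost roughly constant in the
model, though it says nothing about the aging concern Eric raises.
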
> I run a very similar test, but normally run with a synchronous write 
> work load (i.e., fsync before close). In my testing, you will see a 
> slow but gradual decline in the files/sec. For example, on a 1TB S-ATA 
> drive, the latest test run started off at a rate of 22 files/sec (each 
> file is 40k) and is currently chugging along at a bit over 17 
> files/sec when it has hit 2.8 million files in one directory. I am 
> using the ext3 run to get a baseline for a similar run of xfs and btrfs.
>
> One other random tuning thought - you can help by writing into 
> separate directories, but you will need to make sure that you don't 
> produce a random write pattern when you select your target 
> subdirectory. I think that the use case mentioned using a hashed 
> directory structure which is fine, but you want to hash in a way that 
> writes into a shared subdirectory for some period of time (say get a 
> rotation of every X files or Y seconds).  Easiest way to do this is to 
> use a GUID with a time stamp and hash on the time stamp bits.
>
> Note that there is a multi-threaded performance bug in ext3 (Josef 
> Bacik had looked at fixing this) which throttles writes/sec down to 
> around 230 when you do synchronous transactions so you might be 
> hitting that as well.
>
> ric
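
Ric's rotation idea (hash on the time stamp bits of a GUID so that writes
stay in one subdirectory for a while) might look roughly like the sketch
below. It assumes version-1 UUIDs, whose time field counts 100 ns
intervals; the function names, the 60-second window, and the 256
two-hex-digit subdirectories are illustrative assumptions, not anything
specified in the thread.

```python
import os
import uuid

TICKS_PER_SECOND = 10_000_000  # version-1 UUID timestamps use 100 ns ticks


def subdir_for_ticks(root, ticks, rotate_seconds=60, num_dirs=256):
    """Map a 100 ns timestamp to a subdirectory that changes only once
    per rotate_seconds, so consecutive creates share a directory."""
    window = ticks // (rotate_seconds * TICKS_PER_SECOND)
    return os.path.join(root, format(window % num_dirs, "02x"))


def subdir_for_guid(root, guid, rotate_seconds=60, num_dirs=256):
    """Same mapping, taking the time stamp bits from a version-1 UUID."""
    return subdir_for_ticks(root, guid.time, rotate_seconds, num_dirs)


if __name__ == "__main__":
    guid = uuid.uuid1()
    print(guid, "->", subdir_for_guid("/data", guid))
```

Files created within the same window land in the same subdirectory, and
the target only moves when the window rolls over, which avoids the random
write pattern Ric warns about.
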

Unfortunately, I don't have the opportunity to limit the directories. My 
application is taking random-ish data and organizing it into logical 
groups for subsequent quick reading. But I did take your suggestion into 
account and it contains what seems to be the important nugget -- too 
many active directories make a bad situation worse.

But still, my test reaches a steady state of active directories pretty 
quickly -- or so I'd like to think. The performance does indeed continue 
to creep downwards.

I'm doing everything single-threaded. Introducing a second thread seems 
to be an immediate disaster, even though I'm striped across 3 disks. 
Unfortunate. Perhaps moving the journal to another filesystem would 
allow better multi-threaded throughput, but I'm not sure that this is 
important to me.

xfs, zfs, btrfs, and reiser could be attractive for my use-case.

Thanks for your response,
John