Extended Attribute Write Performance

Thu Jan 12 19:52:03 UTC 2006

On Jan 12, 2006  12:07 -0500, Charles P. Wright wrote:
> I'm writing an application that makes pretty extensive use of extended
> attributes to store file attributes on Ext2.  I used a profiling tool
> developed by my colleague Nikolai Joukov at SUNY Stony Brook to dig a
> bit deeper into the performance of my application.

Presumably you are using ext3 and not ext2, given that you are posting
to this list?

> In the course of my benchmark, there are 54247 setxattr operations
> during a 54-second run.  They use about 10.56 seconds of the time, which
> seemed to be a rather outsized performance toll to me (~40k writes took
> only 10% as long).
> 
> After looking at the profile, 27 of those writes end up taking 7.74
> seconds.  That works out to roughly 286 ms per call, which seems a bit
> high.
> 
> The workload is not memory constrained (the working set is 50MB + 5000
> files).  Each file has one extended attribute block that contains two
> attributes totaling 32 bytes.  The attributes are unique (random
> actually), so there isn't any sharing.
> 
> Can someone provide me with some intuition as to why there are so many
> writes that reach the disk, and why they take so long?  I would expect
> that the operations shouldn't take much longer than a seek (on the order
> of 10ms, not 200+).

I suspect the reason is that the journal is getting full and jbd is
doing a full journal checkpoint because it has run out of space for
new transactions.  This is because each external EA block consumes
a full block (4kB) regardless of how small the EA is, and this can
eat up the journal quickly.  54247 * 4kB = 211MB, much larger than
the default 32MB (or maybe 128MB with newer e2fsprogs) journal size.
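
A rough check of that arithmetic, as a Python sketch.  It assumes a
4kB block size and that every setxattr call logs one full EA block; it
ignores inode, bitmap and descriptor blocks, so it is only a lower
bound on the real journal traffic.  The function names are made up for
illustration.

```python
# Back-of-the-envelope estimate of journal traffic from external EA blocks.
# Assumption: one 4 KiB EA block is journaled per setxattr call; other
# metadata (inodes, bitmaps, descriptor blocks) is ignored.

BLOCK_SIZE = 4096  # bytes; assumed filesystem block size


def journal_traffic_mib(num_setxattr, block_size=BLOCK_SIZE):
    """Return MiB of EA-block data logged for num_setxattr calls."""
    return num_setxattr * block_size / (1024 * 1024)


def journal_fills(num_setxattr, journal_mib, block_size=BLOCK_SIZE):
    """Return how many times that traffic would fill a journal of journal_mib."""
    return journal_traffic_mib(num_setxattr, block_size) / journal_mib


if __name__ == "__main__":
    calls = 54247  # setxattr count from the benchmark
    print(f"EA block traffic: {journal_traffic_mib(calls):.1f} MiB")
    for size in (32, 128):  # journal sizes mentioned above
        print(f"{size:>3} MiB journal: filled {journal_fills(calls, size):.1f} times")
```

Under these assumptions the traffic fills a 32MB journal several times
over during the run, which would fit with a small number of very slow
calls.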

Solutions to your specific problem are to use large inodes and the
fast EA space ("mke2fs -j -I 256 ..." makes 256-byte inodes, 128 bytes
left for EAs) and/or to increase the journal size ("mke2fs -J size=400",
though even 400MB won't be enough for this test case).
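
To see whether either change helps, one option is to time each setxattr
call and count the slow ones before and after reformatting.  A minimal
sketch, assuming Linux and Python's os.setxattr; the attribute name
"user.test", the 16-byte value and the 100 ms threshold are arbitrary
choices for illustration, not values from the benchmark.

```python
import os
import time


def time_setxattr(paths, name="user.test", value_len=16):
    """Set one random EA on each path; return per-call latencies in seconds."""
    latencies = []
    for path in paths:
        value = os.urandom(value_len)  # random values, so no EA block sharing
        start = time.perf_counter()
        os.setxattr(path, name, value)  # Linux only
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, slow_threshold=0.1):
    """Return (total, mean, slow_count, slow_total), all times in seconds."""
    total = sum(latencies)
    mean = total / len(latencies) if latencies else 0.0
    slow = [t for t in latencies if t >= slow_threshold]
    return total, mean, len(slow), sum(slow)
```

Calling summarize(time_setxattr(files)) on a list of existing files on
the test filesystem gives the total time, the mean per call, and how
many calls crossed the threshold along with the time they took.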

We implemented the large inodes + fast EAs (included in 2.6.12+ kernels)
to avoid the need to do any seeking when reading/writing EAs, in addition
to the benefit of not writing so much data (mostly unused) to disk.
This showed a huge performance increase for Lustre metadata servers
(which use EAs on every file) and also with Samba4 testing.

We've run into similar problems recently with test loads that are
generating a lot of dirty metadata.  The real solution is to fix the
jbd layer not to be so aggressive about flushing out the whole journal
when it runs out of space, as this introduces gigantic latencies.
It should instead only clear out a smaller amount of space, enough to
allow the new transaction to start, and then finish the checkpoint
in the background.  Not sure when we'll be able to work on that.
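
The difference between the two policies can be illustrated with a toy
model.  This is not jbd's actual algorithm: it treats the journal as a
simple block counter, measures a stall as the number of blocks flushed
in the foreground, and does not model background checkpointing at all.
The function and parameter names are hypothetical.

```python
def worst_stall(num_txns, txn_blocks, journal_blocks, policy, chunk_blocks=None):
    """Return the largest number of blocks flushed in one foreground stall.

    policy "full" flushes the whole journal when a transaction does not fit;
    any other value flushes only what is needed, or chunk_blocks if larger.
    Assumes txn_blocks <= journal_blocks.
    """
    used = 0
    worst = 0
    for _ in range(num_txns):
        if used + txn_blocks > journal_blocks:
            if policy == "full":
                flushed = used  # checkpoint everything currently in the journal
            else:
                needed = used + txn_blocks - journal_blocks
                flushed = min(used, max(needed, chunk_blocks or 0))
            used -= flushed
            worst = max(worst, flushed)
        used += txn_blocks
    return worst


if __name__ == "__main__":
    # 32MB journal of 4kB blocks, one block logged per setxattr (assumed).
    print(worst_stall(54247, 1, 8192, "full"))
    print(worst_stall(54247, 1, 8192, "partial", chunk_blocks=256))
```

In this model the full policy stalls on the entire journal each time it
fills, while the partial policy never stalls on more than one chunk.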

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.