Ext3: Why data=journal is better than data=ordered when data needs to be read from and written to disk at the same time

Sat Apr 2 04:01:00 UTC 2011

On Mon, Mar 28, 2011 at 12:43 PM, Peter Grandi
<pg_ext3 at ext3.for.sabi.co.uk> wrote:
> [ ... ]
>
>>> When executing an fsync(), in data=ordered mode you have to
>>> write the data blocks into the journal and wait for the
>>> data blocks to be written.  This generally will
>>> require extra seeks.  In data=journaled mode, the data blocks
>>> can be written directly into the journal without needing
>>> to seek.
>
>>> Of course eventually the data and metadata blocks will need
>>> to be written to their permanent locations before the journal
>>> space can be reused.  But for short bursty write patterns,
>>> the fsync() latency will be much smaller in data=journal
>>> mode.
>
>>  [ ... ]
>
>> In this case, if we conduct the experiment in data=journal
>> mode and data=ordered mode respectively,
>
> That experiment is not necessarily demonstrative, it depends on
> RAM caching, elevator, ...
>
>> since write latency is much smaller in data=journal mode,
>
> Write latency is actually much longer: because it requires *two*
> writes instead of one. It is *fsync* latency as mentioned above
> that is smaller, because it depends only on the first write to
> what is in effect a small log based filesystem. This distinction
> matters a great deal, because it is the reason why "short bursty
> write patterns" is the qualification above. For long write
> patterns things are very different as the journal eventually
> fills up. For any given size it will also fill up a lot faster
> for 'data=journal'.
>
> Ahhh while writing that I have just realized that large journals
> can be a bad idea especially for metadata operations. Will have
> to think more about that.
>
Well, the experiment I described was actually taken from the following article:

http://www.ibm.com/developerworks/library/l-fs8.html?S_TACT=105AGX52&S_CMP=cn-a-l

The author claims that Andrew Morton tested this and showed that
"data=journal mode allowed the 16-meg-file to be read from 9 to over
13 times faster than other ext3 modes, ReiserFS, and even ext2 (which
has no journaling overhead)". Although I cannot find Andrew Morton's
original post on LKML, the article has been widely copied to many
other websites.

Furthermore, the kernel's own documentation,
Documentation/filesystems/ext3.txt, says:

* journal mode
data=journal mode provides full data and metadata journaling.  All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state.  This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.

Although Ted and you both explained that fsync() latency is shorter
in data=journal mode, my original question, as the title indicates, is
why data=journal outperforms the other modes when reads and writes
happen simultaneously. Or is this statement in the kernel doc not
accurate? If so, we should submit a patch to fix the document
so that other people won't be misled, and it would be better to
show some more demonstrative examples in which data=journal
really outperforms the other modes.
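
For anyone who wants to try the simultaneous read/write case, here is a
rough sketch of such a test in Python. It is only my guess at the shape
of the experiment, not the script Andrew Morton used, and the function
name, the file paths and the 64 MB cap are all made up for illustration.
The file being read must not already be in the page cache (for example,
drop caches or remount first), otherwise the timing says nothing about
the disk.

```python
import os
import threading
import time


def timed_read_under_write_load(read_path, write_path,
                                chunk=1 << 20, cap=64 << 20):
    """Read read_path fully while a background thread writes write_path.

    Returns (bytes_read, elapsed_seconds). All names and sizes here are
    illustrative assumptions, not taken from the article.
    """
    stop = threading.Event()

    def writer():
        buf = b"\0" * chunk
        with open(write_path, "wb") as out:
            while not stop.is_set():
                # Wrap around so the file never grows beyond `cap` bytes.
                if out.tell() + chunk > cap:
                    out.seek(0)
                out.write(buf)
                out.flush()
                os.fsync(out.fileno())  # force the data out to disk

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        start = time.perf_counter()
        total = 0
        with open(read_path, "rb") as src:
            while True:
                block = src.read(chunk)
                if not block:
                    break
                total += len(block)
        elapsed = time.perf_counter() - start
    finally:
        stop.set()
        thread.join()
    return total, elapsed
```

Both paths should sit on the same ext3 filesystem, and the run should be
repeated with the filesystem mounted with data=ordered and then with
data=journal. As Peter said, the result will still depend on RAM
caching and the elevator.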

In addition, I am not entirely clear on why you said that write()
latency is longer while fsync() latency is shorter. Let me try to
restate what you said; please point out where I am incorrect:
1. Normally we call the write() syscall first and then call fsync() to
flush the data.
2. write() returns as soon as the data has been written into the page
cache, while fsync() returns only once the data has been written to
stable storage.
3. Although write() latency in data=journal mode is much longer
because it requires two writes instead of one, write() only writes
to the page cache, so its actual cost is not high compared to the
fsync() syscall, which has to write to disk and may require disk
seeks. So we can mainly focus on the fsync() system call.
4. Since the journal is stable storage, in data=journal mode
fsync() returns as soon as the metadata and the real data have been
written into the journal, and this is sequential access.
In data=ordered mode, however, fsync() returns only once the data
itself has been written to disk. Since this is random access, it
needs many disk seeks, which are expensive, so in this case fsync()
latency is much longer than in data=journal mode. That is why we
claim that data=journal wins for this bursty write case.

Are these correct?
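
One way to check points 2 to 4 would be to time write() and fsync()
separately. Below is a minimal sketch in Python; the function name, the
probe file name and the 64 KB payload are my own choices for
illustration. It makes no claim about what the numbers will be, it only
separates the two latencies so they can be compared across mount modes.

```python
import os
import sys
import tempfile
import time


def time_write_and_fsync(path, payload, rounds=5):
    """Return a list of (write_seconds, fsync_seconds), one pair per round."""
    samples = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(rounds):
            t0 = time.perf_counter()
            view = memoryview(payload)
            while view:
                # os.write may write fewer bytes than asked; loop until done.
                written = os.write(fd, view)
                view = view[written:]
            t1 = time.perf_counter()
            os.fsync(fd)  # returns once the kernel reports the data as stable
            t2 = time.perf_counter()
            samples.append((t1 - t0, t2 - t1))
    finally:
        os.close(fd)
    return samples


if __name__ == "__main__":
    # Assumption: the directory given on the command line (default: the
    # current directory) lives on the filesystem under test.
    directory = sys.argv[1] if len(sys.argv) > 1 else os.getcwd()
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "fsync_probe.dat")
        for w, f in time_write_and_fsync(target, b"x" * (64 * 1024)):
            print(f"write: {w * 1e6:9.1f} us   fsync: {f * 1e6:9.1f} us")
```

Running it once on a data=ordered mount and once on a data=journal mount
of the same disk should show whether the difference is in the write()
column or the fsync() column.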

Regards
Jidong