Frequent metadata corruption with ext3 + hard power-off

Fri May 18 19:03:46 UTC 2007

On May 18, 2007  09:48 -0400, Mats Ahlgren wrote:
> Namely, I'm confused: I would guess caching simply delays the time data gets 
> to disk, and perhaps exacerbates data being written in not-the-order it was 
> given? But, how could this cause a problem on a journaled filesystem? if one 
> is (theoretically) only appending to the journal, checksumming/hashing to 
> detect consistent journal entries on failure (since the last checkpoint), and 
> only replaying consistent journal entries (which are idempotent)... then, 
> assuming all those things above work, how could caching cause massive 
> corruption of the directory tree? (Is the above an accurate model for ext3?)

One issue is that we do not YET have journal checksumming to detect the
case where the commit block has been written to the disk but some of the
other blocks in that transaction are still sitting in the disk's cache.
That is where the big risk comes in for writeback cache in the device.
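
To make the window concrete, here is a rough Python sketch (a toy model,
not jbd code) that enumerates what can be on the platter at power-off if
the drive cache is free to flush writes in any order. The block names and
the three-write transaction are made up for illustration.

```python
from itertools import permutations

# Hypothetical labels: two journaled metadata blocks plus the commit block.
WRITES = ("block_A", "block_B", "commit")

def unsafe_states(writes):
    """Return every set of on-disk blocks where the commit block is present
    but at least one block of the same transaction is missing."""
    bad = set()
    for order in permutations(writes):       # cache may flush in any order
        for cut in range(len(order) + 1):    # power fails after `cut` writes
            on_disk = frozenset(order[:cut])
            if "commit" in on_disk and len(on_disk) < len(writes):
                bad.add(on_disk)
    return bad

for state in sorted(unsafe_states(WRITES), key=sorted):
    print(sorted(state))
```

Every state it prints looks like a committed transaction to recovery, so
replay would copy whatever stale data sits in the missing journal blocks
over live metadata.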

Ideally, the jbd layer would be notified when the transaction blocks have
been flushed from the device cache, and only then write the commit block,
but the current Linux mechanism to do this (write barriers) sucks
performance-wise (it sent throughput from 180MB/s to 7MB/s when enabled
on our test systems).  It was better to just turn off write cache entirely
than to use barriers.

We have a patch for journal checksumming that is _right_ on the verge of
being ready, and it fixes the "commit block before transaction blocks"
problem.  In fact, in earlier testing it improved performance in some cases:
the commit block can always be sent to disk at the same time as the
transaction blocks, because the checksum will tell us at recovery time if
any of those blocks were not written to disk.
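
As a rough illustration of the idea (a Python sketch, not the actual patch;
the dict layout, function names, and the choice of CRC32 are assumptions
made for the example), the commit record carries a checksum over the
transaction's blocks and replay skips any transaction that fails it:

```python
import zlib

def commit_checksum(blocks):
    """CRC32 chained over all blocks of one transaction, in order."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc

def write_transaction(blocks):
    """Toy journal image: the transaction blocks plus a commit record
    holding their checksum (hypothetical layout)."""
    return {"blocks": list(blocks), "commit": commit_checksum(blocks)}

def replayable(txn):
    """Replay only if the blocks found on disk match the commit checksum."""
    return commit_checksum(txn["blocks"]) == txn["commit"]

# Intact transaction: every block reached the platter.
good = write_transaction([b"inode table (new)", b"directory block (new)"])

# Torn transaction: the commit record landed, but one block never left the
# drive cache, so the journal still holds older contents in that slot.
torn = write_transaction([b"inode table (new)", b"directory block (new)"])
torn["blocks"][1] = b"directory block (old)"

print("intact transaction replayable:", replayable(good))
print("torn transaction replayable:  ", replayable(torn))
```

With that check in place the ordering between the commit block and the
rest of the transaction stops mattering for correctness, which is why they
can all be submitted together.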

Girish, could you post your latest tested patch here for review?

> Also, does anyone think data-journaling mode being 'ordered' instead 
> of 'journaled' had anything to do with it?

Seems unlikely.

> On Sunday 18 March 2007 09:33:59 Theodore Tso wrote:
> > It sounds like you have a disk which is doing very aggressive write
> > caching.  If you are using a new enough kernel (2.6.9 or greater
> > should have this), adding "barrier=1" to your mount options should
> > help.  We should probably make this the default at this point...
> > 
> > 						- Ted

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.