journal on an ssd

Thu Sep 11 13:07:15 UTC 2008

On Thu, Sep 11, 2008 at 07:43:18AM +0200, Tobias Oetiker wrote:
> 
> What I am hoping for, is that someone tells me, that in the case of
> 'data=journal' the loss would only be the material that is still in
> the journal (eg 30 seconds worth of data) and the rest of the fs
> would have a fair chance of being recovered with fsck.
> 

The paper you quoted essentially indicated that ext3's JBD layer wasn't
checking for error cases sufficiently.  It has improved since then,
but when I did a quick audit of the code paths, I was still able to
find a few places where we aren't checking the error returns when
calling sync_dirty_buffer(), for example.  In
general, though, if there is a failure to write to the SSD, it should
get detected fairly quickly, at which point the journal will get
aborted, which will suspend writes to the filesystem.  It may not
happen as quickly as we might like, and if you get really unlucky and
a singleton write fails and it's one where the error return doesn't
get checked, you could end up writing garbage to the filesystem on a
journal replay.  
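
To make the checked versus unchecked distinction concrete, here is a
minimal toy model in Python.  It is not kernel code: ToyDevice,
ToyJournal, and both write methods are hypothetical names invented for
this illustration, and the C-style negative errno return values are an
assumption chosen to mimic the kernel convention.

```python
import errno


class ToyDevice:
    """Hypothetical block device that fails writes to chosen blocks."""

    def __init__(self, bad_blocks=()):
        self.bad_blocks = set(bad_blocks)
        self.blocks = {}

    def write(self, blocknr, data):
        """Return 0 on success or -EIO on failure, C style."""
        if blocknr in self.bad_blocks:
            return -errno.EIO
        self.blocks[blocknr] = data
        return 0


class ToyJournal:
    """Toy journal that aborts as soon as it notices a write error."""

    def __init__(self, device):
        self.device = device
        self.aborted = False

    def write_checked(self, blocknr, data):
        # Once aborted, refuse further writes (filesystem is read-only).
        if self.aborted:
            return -errno.EROFS
        err = self.device.write(blocknr, data)
        if err:
            # The error return was inspected, so we can abort right away.
            self.aborted = True
        return err

    def write_unchecked(self, blocknr, data):
        # Bug pattern: the return value is dropped, so the failure is
        # silent and the journal keeps running as if nothing happened.
        self.device.write(blocknr, data)
        return 0
```

With a device whose block 7 is bad, write_unchecked(7, ...) reports
success and leaves the journal running, while write_checked(7, ...)
returns -EIO and every later write is refused.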

In that worst case scenario, you might end up losing a full inode
table block's worth of inodes, but in general, the loss should be the
last few minutes worth of data.  Fsck has a better than normal chance
of recovering from a busted journal.  That being said, it would be
wise to monitor the health of the SSD via S.M.A.R.T., since I would
suspect that failures of the SSD should be easily predicted by the
firmware.
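
As one possible way to automate that monitoring, the sketch below runs
"smartctl -H" from smartmontools and decodes its exit status, which the
smartctl(8) man page documents as a bitmask.  The function names and
the wording of the messages are my own, and whether a given SSD's
firmware reports anything useful is an assumption you would need to
verify for your device.

```python
import subprocess

# Exit-status bits of smartctl, paraphrased from the smartctl(8) man page.
SMARTCTL_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold in the past",
    6: "device error log contains errors",
    7: "device self-test log contains errors",
}


def decode_smartctl_status(status):
    """Return the list of messages for the bits set in the exit status."""
    return [msg for bit, msg in sorted(SMARTCTL_BITS.items())
            if status & (1 << bit)]


def check_health(device):
    """Run `smartctl -H` on device (usually needs root) and decode it."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return decode_smartctl_status(result.returncode)
```

An empty list from check_health() means smartctl reported no problems;
anything else is worth an alert from cron or your monitoring system.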

On Thu, Sep 11, 2008 at 09:13:21AM +0100, Chris Haynes wrote:
> 
> Is it perhaps the case that, to maximize the integrity of the main
> data, one would *want* the journal to have a different failure
> pattern?
> 
> That, if there were any doubt about journal integrity, it would be
> better (for the integrity of the main file system) to discard the
> journal entirely?
> 
> This would suggest the use of a robust hash / cryptographic digest
> of the journal contents, stored with it and checked each time the
> journal is about to be used. These are quite quick to compute
> nowadays.

Indeed, this is what ext4 does; there is a checksum (you don't need a
cryptographic digest since, contrary to most sysadmins' fears, hard
drives are *not* malicious sentient beings :-) in each commit record
to detect these problems, and if a problem is found, we abort running
the journal right then and there.
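
The replay behaviour can be sketched with a toy model.  This is not
the ext4/jbd2 on-disk format: the dictionary layout, the function
names, and the choice of zlib.crc32 as the checksum are all assumptions
made for illustration.

```python
import zlib


def commit_checksum(blocks):
    """CRC32 over all block contents of a commit, in block-number order."""
    crc = 0
    for blocknr in sorted(blocks):
        crc = zlib.crc32(blocks[blocknr], crc)
    return crc


def make_commit(blocks):
    """Build a toy commit: the blocks plus one checksum covering them all."""
    return {"blocks": dict(blocks), "crc": commit_checksum(blocks)}


def replay(journal, fs):
    """Apply commits in order, stopping at the first checksum mismatch.

    Returns the number of commits that were replayed.
    """
    replayed = 0
    for commit in journal:
        if commit_checksum(commit["blocks"]) != commit["crc"]:
            break  # abort running the journal right here
        fs.update(commit["blocks"])
        replayed += 1
    return replayed
```

If the second of three commits is corrupted, replay() applies only the
first one; the third commit is never replayed even though it is
intact.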

It is possible that this change means you will lose more data, not
less.  If there is a singleton failure writing a single block early
in the journal, aborting the journal means that we don't replay any of
the later journal commits.  It could very well be that the corrupted
data block was rewritten successfully to the journal in a later
commit, in which case continuing the journal recovery would have been
the right thing to do.  On the other hand, if the corrupted block was
a journal descriptor, aborting the journal replay is the best thing
you could do.  So in theory you might end up losing not just the last
30 seconds, but more like the last couple of minutes worth of data.

(Even data which was fsync'ed, since fsync only guarantees that the
data was written to some stable storage; fsync makes no guarantees
about what might happen if your stable storage, including the journal,
fails to store data correctly.)

We've talked about changing the journalling code to write a separate
checksum for each block, which would allow us to more intelligently
recover from a failed checksum in the journal block.  It wouldn't be a
trivial thing to add, so we haven't added that to date.  And this is a
relatively unlikely case, which involves an (undetected) single write
failure, followed by a crash at just the wrong time, before the
journal has a chance to wrap.
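
For illustration only, here is a toy sketch of what per-block
checksums could buy us.  Nothing like this exists in the journalling
code described here; the commit layout, the function names, and the
use of zlib.crc32 are hypothetical, and the sketch ignores descriptor
blocks entirely (where aborting would still be the right call).

```python
import zlib


def make_commit(blocks):
    """Toy commit that stores a separate CRC32 next to each block."""
    return {nr: (data, zlib.crc32(data)) for nr, data in blocks.items()}


def replay_per_block(journal, fs):
    """Replay every block whose own checksum verifies.

    Returns a list of (commit index, block number) pairs that were
    skipped because their checksum did not match.
    """
    skipped = []
    for idx, commit in enumerate(journal):
        for nr, (data, crc) in sorted(commit.items()):
            if zlib.crc32(data) == crc:
                fs[nr] = data
            else:
                skipped.append((idx, nr))
    return skipped
```

If block 1 is corrupted in the first commit but rewritten correctly in
a later one, the replay skips only the bad copy and the filesystem
still ends up with the newest contents.  A skipped block that is never
rewritten later would still need a policy decision by the caller.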

Also, ext4 is even better than ext3 in terms of checking error returns
(although to be honest when I did a quick audit just now I still did
find a few places where we should add some error checks; I'll work on
getting fixes submitted for both ext3 and ext4).

							- Ted