[dm-devel] Some thoughts about providing data block checksumming for ext4

Andreas Dilger adilger at dilger.ca
Tue Nov 4 21:20:44 UTC 2014

On Nov 3, 2014, at 4:33 PM, Theodore Ts'o <tytso at mit.edu> wrote:

> I've been thinking a lot about the best way to provide data block
> checksumming for ext4 in an efficient way, and as I promised on
> today's ext4 concall, I want to detail them in the hopes that it will
> spark some more interest in actually implementing this feature,
> perhaps in a more general way than just for ext4.
> I've included in this writeup a strawman design to implement data
> block checksumming as a device mapper module.
> Comments appreciated!
> 						- Ted
> The checksum consistency problem
> =================================
> Copy-on-write file systems such as btrfs and zfs have a big
> advantage when it comes to providing data block checksums because they
> never overwrite an existing data block.  In contrast, update-in-place
> file systems such as ext4 and xfs, if they want to provide data block
> checksums, must be able to update the checksum and the data block
> atomically, or else if the system fails at an inconvenient point in
> time, the previously existing block in a file would have an
> inconsistent checksum and contents.   
> In the case of ext4, we can solve this by writing the data blocks through the
> journal, alongside the metadata block containing the checksum.
> However, this results in the performance cost of doing double writes
> as in data=journal mode.  We can do slightly better by skipping this
> if the block in question is a newly allocated block, since there is no
> guarantee that data will be safe until an fsync() call, and in the
> case of a newly allocated block, there are no previous contents which
> are at risk.
> But there is a way we can do even better!  If we can manage to
> compress the block even by a tiny amount, so that a 4k block can be
> stored in 4092 bytes (which means we need to be able to compress the
> block by 0.1%), we can store the checksum inline with the data, which
> can then be atomically updated assuming a modern drive with a 4k
> sector size (even a 512e disk will work fine, assuming the partition
> is properly 4k aligned).  If the block is not sufficiently
> compressible, then we will need to store the checksum out-of-line, but
> in practice, this should be relatively rare.  (The most common case of
> incompressible file formats are things like media files and
> already-compressed packages, and these files are generally not updated
> in a random-write workload.)

My main concern here would be the potential performance impact.  This
wouldn't ever reduce the amount of data actually written to any block
(presumably the end of the block would be zero-filled to avoid leaking
data), so the compress + checksum would mean every data block must have
every byte processed by the filesystem.

It's bad enough even having to do something once for each block (hence
mballoc and bios to allocate and submit many blocks at once), so if this
has to compress and checksum (or vice versa) every block it could get
expensive.  Ideally the compress+checksum operations would be combined,
so that only a single pass would be needed for all of the data.
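
A rough illustration of what "combined" could look like, in Python for
brevity rather than kernel C: the block is walked once in small chunks,
and each chunk is fed to both the checksum and the compressor while it
is still hot in cache.  zlib's deflate and CRC-32 are only stand-ins
here (the proposal calls for a CRC-24 and does not name a compressor),
and CHUNK and compress_and_checksum() are made-up names.

```python
import zlib

CHUNK = 512  # hypothetical chunk size, small enough to stay cache-resident


def compress_and_checksum(block: bytes):
    """Walk the block once, updating the checksum and the compressor together."""
    compressor = zlib.compressobj(1)      # level 1: favour speed over ratio
    crc = 0
    pieces = []
    for off in range(0, len(block), CHUNK):
        chunk = block[off:off + CHUNK]
        crc = zlib.crc32(chunk, crc)      # stand-in for the proposed CRC-24
        pieces.append(compressor.compress(chunk))
    pieces.append(compressor.flush())
    return b"".join(pieces), crc
```

A real implementation would want the two operations fused at the
instruction level; this only shows the data being traversed once.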

> In order to distinguish between a compressed+checksum block and a
> non-compressed+out-of-line checksum block, we can use a CRC-24
> checksum.  In the compressed+checksum case, we store a zero in the
> first byte of the block, followed by a 3 byte checksum, followed by
> the compressed contents of the block.  In the case where the block cannot
> be compressed, we save the high nibble of the block's first byte plus the
> 3 byte CRC-24 checksum in the out-of-line metadata block, and then we set
> the high nibble of the block to be 0xF so that there is no possibility
> that a block with an original initial byte of zero will be confused
> with a compressed+checksum block.  (Why the high nibble and not
> just the first byte of the block?  We have other planned uses for
> those 4 bits; more later in this paper.)
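
The two on-disk encodings described above can be sketched as follows.
This is only a minimal model in Python, under several assumptions that
are not in the proposal: zlib stands in for the compressor, the CRC-24
is the OpenPGP one from RFC 4880, the checksum covers the uncompressed
contents, and the out-of-line entry is laid out as one byte carrying
the saved high nibble followed by the 3 byte CRC.  encode_block() and
decode_block() are made-up names.

```python
import os
import zlib

BLOCK_SIZE = 4096
HEADER_SIZE = 4  # one zero marker byte + 3-byte CRC-24


def crc24(data: bytes) -> int:
    """CRC-24 as defined for OpenPGP (RFC 4880); the polynomial choice is an assumption."""
    crc = 0xB704CE
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= 0x1864CFB
    return crc & 0xFFFFFF


def encode_block(data: bytes):
    """Return (on_disk_block, out_of_line_entry); the entry is None when inline."""
    assert len(data) == BLOCK_SIZE
    csum = crc24(data).to_bytes(3, "big")
    packed = zlib.compress(data, 1)
    if len(packed) <= BLOCK_SIZE - HEADER_SIZE:
        # Compressed case: zero byte, checksum, payload, zero-filled tail.
        return (b"\x00" + csum + packed).ljust(BLOCK_SIZE, b"\x00"), None
    # Incompressible case: stash the high nibble out of line, force it to 0xF.
    entry = bytes([data[0] & 0xF0]) + csum
    on_disk = bytes([0xF0 | (data[0] & 0x0F)]) + data[1:]
    return on_disk, entry


def decode_block(on_disk: bytes, entry=None) -> bytes:
    """Recover the original block and verify its checksum."""
    if on_disk[0] == 0x00:
        stored = on_disk[1:HEADER_SIZE]
        # decompressobj stops at the end of the stream and ignores the padding.
        data = zlib.decompressobj().decompress(on_disk[HEADER_SIZE:])
    else:
        stored = entry[1:4]
        data = bytes([(entry[0] & 0xF0) | (on_disk[0] & 0x0F)]) + on_disk[1:]
    if crc24(data).to_bytes(3, "big") != stored:
        raise ValueError("checksum mismatch")
    return data


text_block = (b"hello ext4 " * 400)[:BLOCK_SIZE]   # compresses easily
noise_block = os.urandom(BLOCK_SIZE)               # behaves like media data
for original in (text_block, noise_block):
    on_disk, entry = encode_block(original)
    assert decode_block(on_disk, entry) == original
```
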
> Storing the data block checksums in ext4
> ========================================
> There are two ways that have been discussed for storing data block
> checksums in ext4.  The first approach is to dedicate a checksum
> block every 1024 blocks, which would be sufficient to store a 4 byte
> checksum per block (assuming a 4k block).  This approach has the advantage of
> being very simple.  However, it becomes very difficult to upgrade an
> existing file system to one that supports data block checksums without
> doing the equivalent of a backup/restore operation.
> The second approach is to store the checksums in a per-inode structure
> which is indexed by logical block number.  This approach makes it much
> simpler to upgrade an existing file system.  In addition, if not all
> files need to be data integrity protected, it is less efficient.  The

s/less efficient/more efficient/ to checksum only some of the files?

> case where this might become important is in the case where we are
> using a cryptographic Message Authentication Code (MAC) instead of a
> checksum.  This is because a MAC is significantly larger than a 4 byte
> checksum, and not all of the files in the file system might be
> encrypted and thus need cryptographic data integrity protection in
> order to protect against certain chosen plaintext attacks.  In that
> case, using a per-inode structure only for those file
> blocks which require protection might make a lot of sense.  (And if we
> pursue cryptographic data integrity guarantees for the ext4 encryption
> project, we will probably need to go down this route).  The massive
> disadvantage of this scheme is that it is significantly more
> complicated to implement.

If e.g. SHA-256 is needed, then compress-by-32-bytes with the
inline checksum might be a lot harder than compress-by-4-bytes, but
not necessarily impossible for 4KB blocks unless they are already
compressed files.  
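
To put rough numbers on that: the block has to shrink by at least the
size of whatever is stored inline, so a 4 byte checksum needs a saving
of about 0.1% while a 32 byte SHA-256 digest needs about 0.8%.  A small
Python sketch, with zlib as a stand-in compressor and required_saving()
and fits_inline() as made-up helper names; it ignores any marker byte
the on-disk format would add.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096
SHA256_SIZE = hashlib.sha256().digest_size   # 32 bytes


def required_saving(tag_size: int) -> float:
    """Fraction of the block that compression must save to fit the tag inline."""
    return tag_size / BLOCK_SIZE


def fits_inline(block: bytes, tag_size: int) -> bool:
    """True if the compressed block plus a tag of tag_size bytes fits in one block."""
    return len(zlib.compress(block, 1)) + tag_size <= BLOCK_SIZE


for name, size in (("4 byte checksum", 4), ("SHA-256", SHA256_SIZE)):
    print(f"{name}: need to save {required_saving(size):.2%} of the block")
```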

> However, if we are going to simply intersperse the metadata blocks
> alongside the data blocks, there is no real need to do this work in
> the file system.  Instead, we can actually do this work in a device
> mapper plugin instead.  This has the advantage that it moves the
> complexity outside of the file system, and allows any update-in-place
> file system (including xfs, jfs, etc.) to gain the benefits of data block
> checksumming.  So in the next section of this paper I will outline a
> strawman design of such a dm plugin.

I think it is easier to determine at the filesystem level if the data
blocks are overwriting existing blocks or not, without the overhead
of having to send per-unlink/truncate trim commands down to a DM device.
Having this implemented in ext4 allows a lot more flexibility in how
and when to store the checksum (e.g. per-file checksum flags that are
inherited, store the checksum for small incompressible files in the inode
or in extent blocks, etc).

> Doing data block checksumming as a device-mapper plugin
> =======================================================
> Since we need to give this a name, for now I'm going to call this
> proposed plugin "dm-protected".  (If anyone would like to suggest a
> better name, I'm all ears.)

"dm-checksum" would be better, since "protected" falsely implies that
the data is somehow protected against loss or corruption, when it only
really allows detecting the corruption and not fixing it.

> The Non-Critical Write flag
> ---------------------------
> First, let us define an optional extension to the Linux block layer
> which allows a certain optimization when writing
> non-compressible files such as audio/video media files, which are
> typically written in a streaming fashion and which are generally not
> updated in place after they are initially written.  As this
> optimization is purely optional, this feature might not be implemented
> initially, and a file system does not have to take advantage of this
> extension if it is implemented.
> If present, this extension allows the file system to pass a hint to
> the block device that a particular data block write is the first time
> that a newly allocated block is being written.  As such, it is not
> critically important that the checksum be atomically updated when the
> data block is written, in the case where the data block can not be
> compressed such that the checksum can fit inline with the compressed
> data.
> XXX I'm not sure "non-critical" is the best name for this flag.  It
> may be renamed if we can think of a more descriptive name.

Something like "write-once" or "idempotent" or similar, since that
makes it clear how this is used.  I think anyone who is checksumming
their data would consider that it is "critical".

> Layout of the dm-protected device
> ---------------------------------
> The layout of the dm-protected device is a 4k checksum block
> followed by 1024 data blocks.  Hence, given a logical 4k block number
> (LBN) L, the checksum block associated with that LBN is located at
> physical block number (PBN):
> 	PBN_checksum = (L / 1024) * 1025
> where '/' is a C-style integer division operation.
>
> The PBN where the data for LBN L is stored can be calculated as follows:
> 	PBN_L = L + (L / 1024) + 1
> The checksum block is used when we need to store an out-of-line
> checksum for a particular block in its "checksum group", where we
> treat the contents of checksum block as a 4 byte integer array, and
> where the entry for a particular LBN can be found by indexing into (L
> % 1024).
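
The address arithmetic for this layout can be checked with a few lines
of Python.  The data-block formula is the one given above; the checksum
block location is derived here from the stated layout (one checksum
block followed by 1024 data blocks, so each group occupies 1025
physical blocks), and the function names are made up for illustration.

```python
ENTRIES_PER_GROUP = 1024               # data blocks covered by one checksum block
GROUP_SIZE = ENTRIES_PER_GROUP + 1     # plus the checksum block itself
ENTRY_SIZE = 4                         # bytes per checksum entry


def checksum_pbn(lbn: int) -> int:
    """Physical block holding the checksum block for this LBN's group."""
    return (lbn // ENTRIES_PER_GROUP) * GROUP_SIZE


def data_pbn(lbn: int) -> int:
    """Physical block holding the data for this LBN: L + (L / 1024) + 1."""
    return lbn + lbn // ENTRIES_PER_GROUP + 1


def entry_offset(lbn: int) -> int:
    """Byte offset of this LBN's 4 byte entry inside its checksum block."""
    return (lbn % ENTRIES_PER_GROUP) * ENTRY_SIZE


# LBN 0 lives at PBN 1 (PBN 0 is the first checksum block); LBN 1024
# starts the second group, whose checksum block is PBN 1025.
for lbn in (0, 1023, 1024):
    print(lbn, checksum_pbn(lbn), data_pbn(lbn), entry_offset(lbn))
```
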
> For redundancy purposes we calculate the metadata checksum of the
> checksum block assuming that low nibble of the first byte in each
> entry is entry, and we use the low nibbles of first byte in each entry

s/each entry is entry/each entry is zero/ ?

> to store the first LBN for which this block is used plus the
> metadata checksum of the checksum block.  We encode the first LBN for
> the checksum block so we can identify the checksum block when it is
> copied into the Active Area (described below).
> Writing to the dm-protected device
> -----------------------------------
> As described earlier, when we write to the dm-protected device, the
> plugin will attempt to compress the contents of the data block.  If it
> is successful at reducing the required storage size by 4 bytes, then
> it will write the block in place.
> If the data block is not compressible, and this is a non-critical
> write, then we update the checksum in the checksum block for that
> particular LBN range, and we write out the data block immediately, and
> then after a 5 second delay (in case there are subsequent
> non-compressible, non-critical writes, as there will probably be when
> a large media file is written), we write out the modified checksum
> block.

The good news is that (IMHO) these two uses are largely exclusive.
Files that are incompressible (e.g. media) are typically write-once,
while databases and other apps that overwrite files in place do not
typically compress the data blocks.

> If the data block is not compressible, and the write is not marked as
> non-critical, then we need to worry about making sure the data block(s)
> and the checksum block are written out transactionally.  To do this, we
> use a FUA write to copy the current contents of the checksum block to a
> free block in the Active Area (AA), a 64 block area which is used to
> store copies of checksum blocks whose data blocks are actively
> being modified.  We then calculate the checksum for the modified data
> blocks in the checksum group, and update the checksum block in memory,
> but we do not allow any of the data blocks to be written out until one
> of the following has happened and we need to trigger a commit of the
> checksum group:
>   *) a 5 second timer has expired
>   *) we have run out of free slots in the Active Area
>   *) we are under significant memory pressure and we need to release some of
>         the pinned buffers for the data blocks in the checksum group
>   *) the file system has requested a FLUSH CACHE operation
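
As a summary of the three write cases and the four commit triggers
quoted above, here is a small decision sketch in Python.  zlib stands
in for the unnamed compressor, and WritePath, choose_write_path() and
commit_needed() are made-up names, not part of any proposed interface.

```python
import os
import zlib
from enum import Enum

BLOCK_SIZE = 4096
INLINE_LIMIT = BLOCK_SIZE - 4   # room left once the 4 byte header is stored inline
COMMIT_TIMEOUT = 5.0            # seconds, per the quoted design


class WritePath(Enum):
    INLINE = 1              # compressed; checksum travels inside the block
    DEFERRED_CHECKSUM = 2   # data written now, checksum block after a delay
    TRANSACTIONAL = 3       # old checksum block saved to the Active Area first


def choose_write_path(block: bytes, non_critical: bool) -> WritePath:
    if len(zlib.compress(block, 1)) <= INLINE_LIMIT:
        return WritePath.INLINE
    if non_critical:
        return WritePath.DEFERRED_CHECKSUM
    return WritePath.TRANSACTIONAL


def commit_needed(group_age, free_aa_slots, memory_pressure, flush_requested) -> bool:
    """True once any of the four listed triggers has fired."""
    return (group_age >= COMMIT_TIMEOUT or free_aa_slots == 0
            or memory_pressure or flush_requested)


text_like = b"a" * BLOCK_SIZE          # trivially compressible
media_like = os.urandom(BLOCK_SIZE)    # incompressible, like media data
print(choose_write_path(text_like, non_critical=False))
print(choose_write_path(media_like, non_critical=True))
print(choose_write_path(media_like, non_critical=False))
```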

Why introduce a new mechanism when this could be done using data=journal
writes for incompressible data?  This is essentially just implementing
jbd2 journaling with a bunch of small journals (AAs), and we could save
a lot of code complexity by re-using the existing jbd2 code to do it.

Using data=journal, if there is a crash after the commit to the journal,
the data blocks and checksums will be checkpointed to the filesystem
again if needed, or be discarded without modifying the original data
blocks if the transaction didn't commit.

The journal would only need to get involved if data blocks couldn't be
compressed, and if overwriting existing data (presumably a rare case,
but this couldn't be optimized in a dm-layer device unless it was sparse
and was tracking block usage/trim).

Using the existing jbd2 code in this case could also take advantage of
optimizations like putting the journal on a separate disk, or on flash
for fast write/commit and the checkpoint can be done in the background
asynchronously.  We could potentially allow multiple data journals per
device (multiple AAs) if there was a good reason to do so, since any
dependencies between blocks can be avoided, unlike with namespace ops.

> A commit of the checksum group consists of the following:
> 1) An update of the checksum block using a FUA write
> 2) Writing all of the pinned data blocks in the checksum group to disk
> 3) Sending a FLUSH CACHE request to the underlying storage
> 4) Allowing the slot in the Active Area to be used for some other checksum block
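
The ordering of those four steps, and the bookkeeping of Active Area
slots, can be modelled in memory as below.  No real I/O is done: the
sketch only appends to a log so the ordering can be inspected, and the
ActiveArea class and commit_group() are made-up names for illustration.

```python
AA_SLOTS = 64


class ActiveArea:
    """In-memory model of the 64 slot Active Area."""

    def __init__(self, slots=AA_SLOTS):
        self.slots = [None] * slots   # each used slot holds (group, old checksum block)

    def free_slots(self) -> int:
        return sum(slot is None for slot in self.slots)

    def pin(self, group, old_checksums) -> bool:
        """Save the pre-update checksum block; False means a commit must free a slot."""
        if any(slot and slot[0] == group for slot in self.slots):
            return True               # this group already has its copy saved
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = (group, bytes(old_checksums))
                return True
        return False

    def release(self, group):
        self.slots = [None if slot and slot[0] == group else slot
                      for slot in self.slots]


def commit_group(aa, group, pinned_lbns, io_log):
    io_log.append(("write_fua", "checksum_block", group))   # step 1
    for lbn in sorted(pinned_lbns):                          # step 2
        io_log.append(("write", "data_block", lbn))
    io_log.append(("flush_cache",))                          # step 3
    aa.release(group)                                        # step 4
```
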
> Recovery after a power fail
> ---------------------------
> If the dm-protected device was not cleanly shut down, then we need to
> examine all of the checksum blocks in the Active Area.  For each
> checksum block in the AA, the checksums for all of their data blocks
> should machine either the checksum found in the AA, or the checksum

s/machine/match/ ?

> found in the checksum block in the checksum group.  Once we have determined which
> checksum corresponds to the data block after the unclean shutdown, we
> can update the checksum block and clear the copy found in the AA.
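
The recovery rule quoted above amounts to a per-entry choice between
two candidate checksums.  A minimal Python sketch follows; the function
name is made up, and what to do when a block matches neither candidate
is not specified in the proposal, so keeping the on-disk entry and
reporting the index is purely an assumption.

```python
def recover_group(aa_copy, group_copy, actual):
    """Resolve each entry of a checksum block after an unclean shutdown.

    aa_copy    -- checksums saved in the Active Area (before the update)
    group_copy -- checksums in the on-disk checksum block (after the update)
    actual     -- checksums recomputed from the data blocks found on disk
    Returns (resolved_checksums, indexes_matching_neither_copy).
    """
    resolved, corrupt = [], []
    for i, (old, new, seen) in enumerate(zip(aa_copy, group_copy, actual)):
        if seen == new or seen == old:
            resolved.append(seen)     # the data block reached disk in one of the two states
        else:
            resolved.append(new)      # assumption: keep the on-disk entry and flag it
            corrupt.append(i)
    return resolved, corrupt
```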

This is essentially journal checksums, which also already exist.

> On a clean shutdown of the dm-protected device, we can clear the
> Active Area, and so the recovery procedure will not be needed the next
> time the dm-protected device is initialized.

This is normal journal checkpoint and cleanup.

> Integration with other DM modules
> =================================
> If the dm-protected device is layered on a dm-raid 1 setup, then if
> there is a checksum failure the dm-protected device should attempt to
> fetch the alternate copy of the block.
> Of course, the dm-protected module could be layered on top of a
> dm-crypt, dm-thin module, or LVM setup.
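
The RAID 1 fallback mentioned above could look roughly like this in
Python.  How a stacked target would actually ask dm-raid for a specific
mirror leg is not covered by the proposal, so the list of copies passed
in is an assumption, read_verified() is a made-up name, and CRC-32
stands in for the proposed CRC-24.

```python
import zlib


def read_verified(copies, expected_crc):
    """Return (mirror_index, block) for the first copy whose checksum matches."""
    for index, block in enumerate(copies):
        if zlib.crc32(block) == expected_crc:
            return index, block
    raise OSError("all mirror copies failed checksum verification")
```
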
> Conclusion
> ==========
> In this paper, we have examined some of the problems of providing data
> block checksumming in ext4, and have proposed a solution which
> implements this functionality as a device-mapper plugin.  For many
> file types, it is expected that using a very fast compression
> algorithm (we only need to compress the block by less than 0.1%) will
> allow us to provide data block checksumming with almost no I/O
> overhead and only a very modest amount of CPU overhead.
> For those file types which contain a large number of incompressible
> blocks, if they do not need to be updated-in-place, we can also
> minimize the overhead by avoiding the need to do a transactional
> update of the data block and the checksum block.
> In those cases where we do need to do a transactional update of the
> checksum block relative to the data blocks, we have outlined a very
> simple logging scheme which is both efficient and relatively easy to
> implement.

Cheers, Andreas
