[Cluster-devel] Recording extents in GFS2

Sat Feb 20 09:48:42 UTC 2021

Hi all,

once we change the journal format, in addition to recording block numbers
as extents, there are some additional issues we should address at the same
time:

I. The current transaction format of our journals is as follows:

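   - One METADATA log descriptor block for each [503 / 247 / 119 / 55]
   metadata blocks (for block sizes of 4096 / 2048 / 1024 / 512 bytes; the
   same order applies to the bracketed numbers below), followed by those
   metadata blocks. For each metadata block, the log descriptor records the
   64-bit block number.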
   - One JDATA log descriptor block for each [251 / 123 / 59 / 27] journaled
   data blocks, followed by those data blocks. For each data block, the log
   descriptor records the 64-bit block number and another 64-bit field
   indicating whether the block needed escaping.
   - One REVOKE log descriptor block for the initial [503 / 247 / 119 / 55]
   revokes, followed by a metadata header (not to be confused with the log
   header) for each additional [509 / 253 / 125 / 61] revokes. Each revoke is
   recorded as a 64-bit block number in its REVOKE log descriptor or metadata
   header.
   - One log header with various necessary and useful metadata that acts as
   a COMMIT record. If the log header is incorrect or missing, the preceding
   log descriptors are ignored.
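
The bracketed counts above follow from simple arithmetic on the block size.
As a rough sketch (the header sizes of 72 bytes for a log descriptor and 24
bytes for a plain metadata header are assumptions chosen because they
reproduce the numbers quoted above; revoke_blocks() is a hypothetical helper,
not an existing function):

```python
# Assumed on-disk sizes in bytes; they are not stated in the text above.
LOG_DESCRIPTOR_SIZE = 72   # assumed size of a log descriptor header
META_HEADER_SIZE = 24      # assumed size of a plain metadata header
BLOCK_SIZES = (4096, 2048, 1024, 512)


def entries_per_block(block_size, header_size, entry_size):
    """Number of fixed-size entries that fit after the header."""
    return (block_size - header_size) // entry_size


def revoke_blocks(num_revokes, block_size):
    """Blocks needed to record num_revokes revokes (log header excluded)."""
    if num_revokes == 0:
        return 0
    first = entries_per_block(block_size, LOG_DESCRIPTOR_SIZE, 8)
    more = entries_per_block(block_size, META_HEADER_SIZE, 8)
    if num_revokes <= first:
        return 1
    remaining = num_revokes - first
    return 1 + -(-remaining // more)  # ceiling division


# METADATA: one 64-bit block number per entry.
metadata = [entries_per_block(bs, LOG_DESCRIPTOR_SIZE, 8) for bs in BLOCK_SIZES]
# JDATA: 64-bit block number plus a 64-bit "escaped" field per entry.
jdata = [entries_per_block(bs, LOG_DESCRIPTOR_SIZE, 16) for bs in BLOCK_SIZES]
# REVOKE continuation blocks only carry a metadata header.
revoke_cont = [entries_per_block(bs, META_HEADER_SIZE, 8) for bs in BLOCK_SIZES]

print(metadata)     # [503, 247, 119, 55]
print(jdata)        # [251, 123, 59, 27]
print(revoke_cont)  # [509, 253, 125, 61]
```

With 4096-byte blocks, revoke_blocks(504, 4096) returns 2: a single revoke
beyond the first 503 already costs an extra journal block.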

We should change that so that a single log descriptor contains a number of
records. There should be records for METADATA and JDATA blocks that follow,
as well as for REVOKES and for COMMIT. If a transaction contains metadata
and/or jdata blocks, those will obviously need a precursor and a commit
block like today, but we shouldn't need separate blocks for metadata and
journaled data in many cases. Small transactions that only consist of
revokes and of a commit should frequently fit into a single block entirely,
though.
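
To make the idea of a record-based descriptor more concrete, here is a purely
illustrative sketch. The record layout (an 8-byte record header followed by
16-byte start/length extents), the record type values, and the 72-byte block
header are all assumptions made up for this example, not a proposed on-disk
format:

```python
import struct

BLOCK_HEADER_SIZE = 72               # assumption: room for a descriptor header
REC_HEADER = struct.Struct(">HHI")   # hypothetical: type, flags, extent count
EXTENT = struct.Struct(">QQ")        # hypothetical: start block, length
REC_METADATA, REC_JDATA, REC_REVOKE, REC_COMMIT = range(1, 5)


def encode_record(rec_type, extents=()):
    """Encode one record: a small header followed by its extents."""
    body = b"".join(EXTENT.pack(start, length) for start, length in extents)
    return REC_HEADER.pack(rec_type, 0, len(extents)) + body


def encode_revoke_only_transaction(revoke_extents, block_size=4096):
    """Return one block holding REVOKE + COMMIT, or None if it does not fit."""
    payload = encode_record(REC_REVOKE, revoke_extents)
    payload += encode_record(REC_COMMIT)
    if BLOCK_HEADER_SIZE + len(payload) > block_size:
        return None  # would need a precursor block plus a commit block
    block = bytes(BLOCK_HEADER_SIZE) + payload  # header left zeroed here
    return block.ljust(block_size, b"\0")


# Three revoked extents covering 1 + 4 + 16 blocks fit easily in one block.
block = encode_revoke_only_transaction([(1000, 1), (2000, 4), (5000, 16)])
print(len(block))  # 4096
```

Under these assumed sizes, up to 250 revoke extents plus the commit record
fit into a single 4096-byte block; anything larger would fall back to the
multi-block case.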

II. Right now, we're writing log headers ("commits") with REQ_PREFLUSH to make
sure all the log descriptors of a transaction make it to disk before the
log header. Depending on the device, this is often costly. If we can fit an
entire transaction into a single block, REQ_PREFLUSH won't be needed
anymore.

III. We could also checksum entire transactions to avoid REQ_PREFLUSH. At
replay time, all the blocks that make up a transaction will either be there
and the checksum will match, or the transaction will be invalid. This
should no longer be prohibitively expensive now that CPUs support CRC32C
in hardware, but depending on the hardware, it may make sense to turn this
off.
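
A minimal sketch of that replay-time check, assuming the checksum is stored
in the commit record and covers every other block of the transaction (where
exactly it would be stored is not decided above). The bitwise CRC32C below is
slow and only stands in for a hardware-accelerated implementation; the helper
names are hypothetical:

```python
def crc32c(data, crc=0):
    """Bitwise CRC32C (Castagnoli polynomial, reflected form)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def transaction_checksum(blocks):
    """Checksum all blocks of a transaction, chained in journal order."""
    crc = 0
    for block in blocks:
        crc = crc32c(block, crc)
    return crc


def transaction_is_valid(blocks, stored_checksum):
    """Replay-time check: accept the transaction only if everything matches."""
    return transaction_checksum(blocks) == stored_checksum


# Write side: compute the checksum that would go into the commit record.
blocks = [b"descriptor".ljust(512, b"\0"), b"metadata".ljust(512, b"\0")]
stored = transaction_checksum(blocks)

# Replay side: the second block never reached the disk, so stale content
# (here differing in one byte) is found in its place.
stale = [blocks[0], b"X" + blocks[1][1:]]
print(transaction_is_valid(blocks, stored))  # True
print(transaction_is_valid(stale, stored))   # False
```

Because a stale or missing block makes the whole transaction fail the check,
the commit no longer has to be ordered behind the other blocks with a flush.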

IV. We need recording of unwritten blocks / extents (allocations /
fallocate) as this will significantly speed up moving glocks from one node
to another:

At the moment, data=ordered is implemented by keeping a list of all inodes
that did an ordered write. When it comes time to flush the log, the data of
all those ordered inodes is flushed first. When all we want is to flush a
single glock in order to move it to a different node, we currently flush
all the ordered inodes as well as the journal.

If we only flushed the ordered data of the glock being moved plus the
entire journal, the ordering guarantees for the other ordered inodes in the
journal would be violated. In that scenario, unwritten blocks could (and
would) show up in files after crashes.

If we instead record unwritten blocks in the journal, we'll know which
blocks need to be zeroed out at recovery time. Once an unwritten block is
written, we record a REVOKE entry for that block.
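
As a sketch of how recovery could use this, assume the journal is scanned in
order and yields (kind, start, length) tuples; the "UNWRITTEN" record kind and
the tuple layout are hypothetical, and the other uses of revokes are ignored
here:

```python
def blocks_to_zero(journal_records):
    """Return the blocks still unwritten at the end of the journal.

    An UNWRITTEN record marks an allocated extent whose data has not
    reached the disk yet; a later REVOKE for a block means it was written.
    """
    unwritten = set()
    for kind, start, length in journal_records:
        blocks = range(start, start + length)
        if kind == "UNWRITTEN":
            unwritten.update(blocks)
        elif kind == "REVOKE":
            unwritten.difference_update(blocks)
    return sorted(unwritten)


records = [
    ("UNWRITTEN", 100, 8),  # blocks 100..107 allocated
    ("REVOKE", 100, 3),     # blocks 100..102 written out
    ("UNWRITTEN", 200, 2),  # blocks 200..201 allocated
    ("REVOKE", 201, 1),     # block 201 written out
]
print(blocks_to_zero(records))  # [103, 104, 105, 106, 107, 200]
```

Only the blocks left in the set would have to be zeroed out at recovery time,
so stale on-disk content never shows up in a file after a crash.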

This comes at the cost of tracking those blocks of course, but with that in
place, moving a glock from one node to another will only require flushing
the underlying inode (assuming it's an inode glock) and the journal. And
most likely, we won't have to bother with implementing "simple"
transactions as described in
https://bugzilla.redhat.com/show_bug.cgi?id=1631499.

Thanks,
Andreas