[Cluster-devel] Recording extents in GFS2

Andreas Gruenbacher agruenba at redhat.com
Mon Feb 22 13:03:34 UTC 2021


On Mon, Feb 22, 2021 at 12:41 PM Andreas Gruenbacher <agruenba at redhat.com>
wrote:

> On Mon, Feb 22, 2021 at 11:21 AM Steven Whitehouse <swhiteho at redhat.com>
> wrote:
>
>> Hi,
>> On 20/02/2021 09:48, Andreas Gruenbacher wrote:
>>
>> Hi all,
>>
>> Once we change the journal format, in addition to recording block numbers
>> as extents, there are some additional issues we should address at the same
>> time:
>>
>> I. The current transaction format of our journals is as follows:
>>
>>    - One METADATA log descriptor block for each [503 / 247 / 119 / 55]
>>    metadata blocks (for 4k / 2k / 1k / 512-byte filesystem blocks),
>>    followed by those metadata blocks. For each metadata block, the log
>>    descriptor records the 64-bit block number.
>>    - One JDATA log descriptor block for each [251 / 123 / 59 / 27]
>>    journaled data blocks, followed by those data blocks. For each data
>>    block, the log descriptor records the 64-bit block number and a second
>>    64-bit field indicating whether the block needed escaping.
>>    - One REVOKE log descriptor block for the initial [503 / 247 / 119 /
>>    55] revokes, followed by a metadata header (not to be confused with the log
>>    header) for each additional [509 / 253 / 125 / 61] revokes. Each revoke is
>>    recorded as a 64-bit block number in its REVOKE log descriptor or metadata
>>    header.
>>    - One log header with various necessary and useful metadata that acts
>>    as a COMMIT record. If the log header is incorrect or missing, the
>>    preceding log descriptors are ignored.
>>
>>    ^^^^^^^^^
>> succeeding? (I hope!)
>>
>
> No, we call lops_before_commit (which writes the various log descriptors,
> metadata, and journaled data blocks) before writing the log header in
> log_write_header -> gfs2_write_log_header. In that sense, we could call it
> a trailer.
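> 
> Incidentally, the per-block counts above follow directly from the on-disk
> struct sizes in gfs2_ondisk.h: struct gfs2_log_descriptor is 72 bytes,
> struct gfs2_meta_header is 24 bytes, and each entry is one __be64 (two
> for JDATA). A quick userspace sketch to reproduce the numbers:
> 
>     #include <stdio.h>
> 
>     /* Struct sizes from gfs2_ondisk.h. */
>     #define MH_SIZE 24   /* struct gfs2_meta_header */
>     #define LD_SIZE 72   /* struct gfs2_log_descriptor */
> 
>     int main(void)
>     {
>         int bsize;
> 
>         for (bsize = 4096; bsize >= 512; bsize /= 2)
>             printf("bsize %4d: METADATA/REVOKE %3d, JDATA %3d, "
>                    "revoke continuation %3d\n", bsize,
>                    (bsize - LD_SIZE) / 8,    /* 64-bit block number */
>                    (bsize - LD_SIZE) / 16,   /* block number + escape */
>                    (bsize - MH_SIZE) / 8);   /* after a metadata header */
>         return 0;
>     }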
>
>> We should change that so that a single log descriptor contains a number of
>> records. There should be records for the METADATA and JDATA blocks that
>> follow, as well as for REVOKE and COMMIT. If a transaction contains
>> metadata and/or jdata blocks, those will obviously need a precursor and a
>> commit block like today, but we shouldn't need separate blocks for metadata
>> and journaled data in many cases. Small transactions that only consist of
>> revokes and a commit should frequently fit into a single block entirely,
>> though.
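>> 
>> The on-disk layout could look roughly like this (all of these names are
>> made up for illustration; nothing like them exists in gfs2_ondisk.h yet):
>> 
>>     /* One record inside a combined log descriptor block. */
>>     struct gfs2_log_record {
>>             __be32 lr_type;          /* METADATA, JDATA, REVOKE, COMMIT */
>>             __be32 lr_count;         /* number of entries that follow */
>>             __be64 lr_entries[];     /* block numbers, later extents */
>>     };
>> 
>>     /* Descriptor block holding a sequence of such records. */
>>     struct gfs2_log_descriptor_v2 {
>>             struct gfs2_meta_header ld2_header;
>>             __be32 ld2_records;      /* records packed into this block */
>>             __be32 ld2_reserved;
>>             /* variable-length records follow */
>>     };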
>>
>> Yes, it makes sense to try and condense what we are writing. Why would we
>> not need to have separate blocks for journaled data though? That one seems
>> difficult to avoid, and since it is used so infrequently, perhaps not such
>> an important issue.
>>
> Journaled data would of course still need to be written. We could have a
> single log descriptor with METADATA and JDATA records, followed by the
> metadata and journaled data blocks, followed by a log descriptor with a
> COMMIT record.
>
>> II. Right now, we're writing log headers ("commits") with REQ_PREFLUSH to
>> make sure all the log descriptors of a transaction make it to disk before
>> the log header. Depending on the device, this is often costly. If we can
>> fit an entire transaction into a single block, REQ_PREFLUSH won't be needed
>> anymore.
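>> 
>> For reference, this is roughly what log_write_header() does today
>> (simplified excerpt from fs/gfs2/log.c):
>> 
>>     int op_flags = REQ_PREFLUSH | REQ_FUA | REQ_META | REQ_SYNC;
>> 
>>     if (test_bit(SDF_NOBARRIERS, &sdp->sd_flags)) {
>>             gfs2_ordered_wait(sdp);
>>             log_flush_wait(sdp);
>>             op_flags = REQ_SYNC | REQ_META | REQ_PRIO;
>>     }
>> 
>>     /* The preflush pushes out everything written before the commit
>>      * block; FUA makes the commit block itself durable. */
>>     gfs2_write_log_header(sdp, sdp->sd_jdesc, sdp->sd_log_sequence++,
>>                           tail, sdp->sd_log_flush_head, flags, op_flags);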
>>
>> I'm not sure I agree. The purpose of the preflush is to ensure that the
>> data and the preceding log blocks are really on disk before we write the
>> commit record. That will still be required while we use ordered writes,
>> even if we can use (as you suggest below) a checksum to cover the whole
>> transaction, and thus check for a complete log record after the fact. We
>> would also still have to issue the flush in the case of an fsync-derived
>> log flush.
>>
>>
>>
>> III. We could also checksum entire transactions to avoid REQ_PREFLUSH. At
>> replay time, all the blocks that make up a transaction will either be there
>> and the checksum will match, or the transaction will be invalid. With CPU
>> support for CRC32C, this should no longer be prohibitively expensive
>> nowadays, but depending on the hardware, it may make sense to turn it off.
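>> 
>> A sketch of the idea (the helper is hypothetical; the log header's lh_crc
>> field is already computed with crc32c today):
>> 
>>     #include <linux/crc32c.h>
>>     #include <linux/mm.h>
>> 
>>     /* Accumulate a crc32c over all blocks of a transaction as they
>>      * are staged, and store the result in the commit record.  At
>>      * replay time, recompute over the same blocks; a mismatch means
>>      * the transaction didn't completely make it to disk. */
>>     static u32 gfs2_tr_checksum(struct page **pages, unsigned int n,
>>                                 unsigned int bsize)
>>     {
>>             u32 crc = ~0;
>>             unsigned int i;
>> 
>>             for (i = 0; i < n; i++)
>>                     crc = crc32c(crc, page_address(pages[i]), bsize);
>>             return crc ^ ~0;
>>     }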
>>
>> IV. We need to record unwritten blocks / extents (allocations / fallocate),
>> as this will significantly speed up moving glocks from one node to
>> another:
>>
>> That would definitely be a step forward.
>>
>>
>>
>> At the moment, data=ordered is implemented by keeping a list of all
>> inodes that did an ordered write. When it comes time to flush the log, the
>> data of all those ordered inodes is flushed first. When all we want is to
>> flush a single glock in order to move it to a different node, we currently
>> flush all the ordered inodes as well as the journal.
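>> 
>> (This is the gfs2_ordered_write() path in fs/gfs2/log.c; roughly, and
>> ignoring the locking and list-manipulation details:)
>> 
>>     /* Write back the data of every inode on the ordered list
>>      * before the log itself is committed. */
>>     static void gfs2_ordered_write(struct gfs2_sbd *sdp)
>>     {
>>             struct gfs2_inode *ip;
>> 
>>             list_for_each_entry(ip, &sdp->sd_log_ordered, i_ordered)
>>                     filemap_fdatawrite(ip->i_inode.i_mapping);
>>     }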
>>
>> If we only flushed the ordered data of the glock being moved plus the
>> entire journal, the ordering guarantees for the other ordered inodes in the
>> journal would be violated. In that scenario, unwritten blocks could (and
>> would) show up in files after crashes.
>>
>> If we instead record unwritten blocks in the journal, we'll know which
>> blocks need to be zeroed out at recovery time. Once an unwritten block is
>> written, we record a REVOKE entry for that block.
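>> 
>> At replay time, that might look roughly like this (all names made up for
>> illustration; error handling and locking omitted):
>> 
>>     #include <linux/buffer_head.h>
>>     #include <linux/list.h>
>> 
>>     struct unwritten_rec {
>>             struct list_head ur_list;
>>             u64 ur_block;
>>     };
>> 
>>     /* Zero out every block recorded as unwritten in the journal,
>>      * unless a later REVOKE entry shows it was written in time. */
>>     static void replay_unwritten(struct super_block *sb,
>>                                  struct list_head *unwritten,
>>                                  bool (*revoked)(u64 block))
>>     {
>>             struct unwritten_rec *ur;
>>             struct buffer_head *bh;
>> 
>>             list_for_each_entry(ur, unwritten, ur_list) {
>>                     if (revoked(ur->ur_block))
>>                             continue;
>>                     bh = sb_getblk(sb, ur->ur_block);
>>                     memset(bh->b_data, 0, sb->s_blocksize);
>>                     set_buffer_uptodate(bh);
>>                     mark_buffer_dirty(bh);
>>                     brelse(bh);
>>             }
>>     }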
>>
>> This comes at the cost of tracking those blocks of course, but with that
>> in place, moving a glock from one node to another will only require
>> flushing the underlying inode (assuming it's an inode glock) and the
>> journal. And most likely, we won't have to bother with implementing "simple"
>> transactions as described in
>> https://bugzilla.redhat.com/show_bug.cgi?id=1631499.
>>
>> Thanks,
>> Andreas
>>
>> That would be another way of looking at the problem, yes. It does add a
>> lot to the complexity though, and it doesn't scale very well on systems
>> with large amounts of memory (and therefore potentially lots of unwritten
>> extents to record & keep track of). If there are lots of small
>> transactions, each one might grow significantly from the extra records
>> needed to track the blocks which have not been written yet.
>>
>> If we keep track of individual allocations/deallocations, as per Abhi's
>> suggestion, then we know which areas may potentially contain unwritten
>> data. That may allow us to avoid having to do the data writeback ahead of
>> the journal flush in the first place - moving somewhat more towards the
>> XFS way of doing things.
>>
> Well, allocations and unwritten data are essentially the same thing; I may
> not have said that very clearly. So avoiding unnecessary ordered data
> write-out is *exactly* what I'm proposing here. When moving a glock from
> one node to another, we certainly do want to write out the ordered
> data of that specific inode, however. The problem is that tracking
> allocations is worthless if we don't record one of the following things in
> the journal: either (a) which of the unwritten blocks have been written
> already, or (b) the fact that all unwritten blocks of an inode have been
> written now. When moving a glock from one node to another, (b) may be
> relatively easy to ascertain, but in a running system, we may never reach
> that state.
>

To expand on this a little, fsync is a point at which (b) is achieved,
because we don't allow multiple local processes concurrent "EX" access to
a file today. This isn't really a desired property of the
filesystem though; other filesystems allow a lot more concurrency. So
before too long, we might end up in a situation where an fsync only
guarantees that all previous writes will be synced to disk. The resource
group glock sharing is a move in that direction.

Andreas