<div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Feb 2, 2021 at 6:35 PM Steven Whitehouse &lt;<a href="mailto:swhiteho@redhat.com">swhiteho@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi,<br>
</p>
<div>On 24/01/2021 06:44, Abhijith Das
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div style="font-family:monospace,monospace">Hi all,</div>
<div style="font-family:monospace,monospace"><br>
</div>
<div style="font-family:monospace,monospace">I've been looking at
rgrp.c:gfs2_alloc_blocks(), which is called from various
places to allocate single/multiple blocks for inodes. I've
come up with some data structures to record these
allocations as extents.</div>
<div style="font-family:monospace,monospace"><br>
</div>
<div style="font-family:monospace,monospace">I'm proposing we add
a new metadata type for journal blocks that will hold these
extent records.</div>
</div>
<div style="font-family:monospace,monospace"><br>
</div>
<font face="monospace">GFS2_METATYPE_EX 15 /* <span class="gmail_default" style="font-family:monospace,monospace">New metadata type
for a block that will hold extents</span> */<br>
<br>
<span class="gmail_default" style="font-family:monospace,monospace">This structure below
will be at the start of the block, followed by a number of
alloc_ext structures.</span></font>
<div><font face="monospace, monospace"><br>
</font><font face="monospace">struct gfs2_extents {</font><span class="gmail_default" style="font-family:monospace,monospace"> /* This structure
is 32 bytes long */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span><font face="monospace">struct gfs2_meta_header ex_header;</font><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span><font face="monospace">__be32 ex_count;</font><span class="gmail_default" style="font-family:monospace,monospace"> /* count of number
of alloc_ext structs that follow this header. */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span><font face="monospace">__be32 __pad;</font><br>
<font face="monospace">};</font><br>
<div><font face="monospace, monospace"><span class="gmail_default" style="font-family:monospace,monospace">/* flags for the
alloc_ext struct */</span><br>
</font><font face="monospace">#<span class="gmail_default">define</span> AE_FL_<span class="gmail_default" style="font-family:monospace,monospace">XXX</span></font><br>
<font face="monospace"><span class="gmail_default" style="font-family:monospace,monospace"><br>
</span></font>
<div><font face="monospace"><span class="gmail_default" style="font-family:monospace,monospace"></span>struct
alloc_ext {<span class="gmail_default" style="font-family:monospace,monospace"> /* This
structure is 48 bytes long */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>struct
gfs2_inum ae_num;<span class="gmail_default" style="font-family:monospace,monospace"> /* The inode
this allocation/deallocation belongs to */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>__be32
ae_flags;<span class="gmail_default" style="font-family:monospace,monospace"> /* specifies
if we're allocating/deallocating, data/metadata, etc.
*/</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>__be64
ae_start;<span class="gmail_default" style="font-family:monospace,monospace"> /* starting
physical block number of the extent */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>__be64
ae_len;<span class="gmail_default" style="font-family:monospace,monospace"> /* length
of the extent */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>__be32
ae_uid;<span class="gmail_default" style="font-family:monospace,monospace"> /* user
this belongs to, for quota accounting */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>__be32
ae_gid;<span class="gmail_default" style="font-family:monospace,monospace"> /* group
this belongs to, for quota accounting */</span><br>
<span class="gmail_default" style="font-family:monospace,monospace"> </span>__be32
__pad;<br>
};</font></div>
</div>
<div><font face="monospace"><br>
</font></div>
</div>
</div>
</blockquote>
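<div>A minimal user-space sketch of the proposed on-disk layout, useful for checking the sizes quoted in the comments above. The *_sketch names and the be32_t/be64_t typedefs are stand-ins invented for this example, not kernel definitions. One assumption differs from the structure as posted: ae_flags is moved after ae_len so that every 64-bit member is naturally aligned. In the posted order, an ABI that aligns 64-bit members to 8 bytes would pad the record beyond 48 bytes unless the structure is declared packed.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's big-endian on-disk types (assumption). */
typedef uint32_t be32_t;
typedef uint64_t be64_t;

/* Same field sizes as the 24-byte gfs2_meta_header. */
struct meta_header_sketch {
	be32_t mh_magic;
	be32_t mh_type;
	be64_t pad0;
	be32_t mh_format;
	be32_t mh_jid;
};

/* Same field sizes as the 16-byte gfs2_inum. */
struct inum_sketch {
	be64_t no_formal_ino;
	be64_t no_addr;
};

/* Block header: meta header, record count, padding. */
struct extents_sketch {
	struct meta_header_sketch ex_header;
	be32_t ex_count;
	be32_t pad;
};

/* Extent record, reordered so no implicit padding is needed. */
struct alloc_ext_sketch {
	struct inum_sketch ae_num;
	be64_t ae_start;
	be64_t ae_len;
	be32_t ae_flags;
	be32_t ae_uid;
	be32_t ae_gid;
	be32_t pad;
};

/* Fail the build if the layout drifts from the sizes in the proposal. */
static_assert(sizeof(struct extents_sketch) == 32, "header must be 32 bytes");
static_assert(sizeof(struct alloc_ext_sketch) == 48, "record must be 48 bytes");
```

<div>Compile-time size checks like these are one way to catch accidental padding before a format reaches disk.</div>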
<p><font face="monospace">The gfs2_inum structure is a bit over the top for
this, I think. A single 64-bit inode number should be enough?
Also, it is quite likely we may have multiple extents for the
same inode... so should we split this into two so we can have
something like this? It is more complicated, but should save
space in the average case.<br>
</font></p>
<p><font face="monospace">struct alloc_hdr {</font></p>
<p><font face="monospace"> __be64 inum;</font></p>
<p><font face="monospace"> __be32 uid; /* This is duplicated from
the inode... various options here depending on whether we think
this is something we should do. Should we also consider logging
chown using this structure? We will have to carefully check
chown sequence with respect to allocations/deallocations for quota
purposes */<br>
</font></p>
<p><font face="monospace"> __be32 gid;</font></p>
<p><font face="monospace"> __u8 num_extents; /* Never likely to
have huge numbers of extents per header, due to block size! */</font></p>
<p><font face="monospace"> /* padding... or is there something
else we could/should add here? */<br>
</font></p>
<p><font face="monospace">};</font></p>
<p><font face="monospace">followed by num_extents copies of:</font></p>
<p><font face="monospace">struct alloc_extent {</font></p>
<p><font face="monospace"> __be64 phys_start;</font></p>
<p><font face="monospace"> __be64 logical_start; /* Do we need a
logical &amp; physical start? Maybe we don't care about the
logical start? */<br>
</font></p>
<p><font face="monospace"> __be32 length; /* Max extent length is
limited by rgrp length... only need 32 bits */<br>
</font></p>
<p><font face="monospace"> __be32 flags; /* Can we support
unwritten, zero extents with this? Need to indicate
alloc/free/zero, data/metadata */</font></p>
<p><font face="monospace">};</font></p></div></blockquote><div>We're trying to keep allocations relatively close together and within the same resource group, so to store extent lists more compactly, we could store the first extent's start address absolutely, and the start of each successive extent within range as a signed 32-bit number relative to that.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
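<div>The relative encoding suggested above (first extent start stored absolutely, later starts as signed 32-bit offsets from it) could be sketched roughly as follows. The helper names rel_encode and rel_decode are hypothetical, and the sketch ignores on-disk endianness; an extent whose start falls outside the 32-bit range would need a new absolute base.</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: express `start` as a signed 32-bit offset from
 * `base`. Returns false when the distance does not fit, in which case
 * the caller would have to store a fresh absolute address instead.
 */
static bool rel_encode(uint64_t base, uint64_t start, int32_t *rel)
{
	if (start >= base) {
		uint64_t d = start - base;

		if (d > (uint64_t)INT32_MAX)
			return false;
		*rel = (int32_t)d;
	} else {
		uint64_t d = base - start;

		if (d > (uint64_t)INT32_MAX + 1)
			return false;
		*rel = (int32_t)(-(int64_t)d);
	}
	return true;
}

/* Recover the absolute block number from the base and the offset. */
static uint64_t rel_decode(uint64_t base, int32_t rel)
{
	/* Sign-extend first; unsigned addition then wraps as intended. */
	return base + (uint64_t)(int64_t)rel;
}
```

<div>For example, with a base of 1000, a start of 1509 encodes as 509 and a start of 900 encodes as -100, so each later extent start costs 4 bytes instead of 8.</div>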
<p><font face="monospace">Just wondering if there is also some
shorthand we might be able to use in case we have multiple
extents all separated by either one metadata block, or a very
small number of metadata blocks (which will be the case for
streaming writes). Again it increases the complexity, but will
likely reduce the amount we have to write into the new journal
blocks quite a lot. Not much point having a 32 bit length, but
never filling it with a value above 509 (4k block size)...<br></font></p></div></blockquote><div>The current allocator fills at most one indirect block before allocating the next indirect block(s), which is why we end up with the pattern described above. Once we switch to extent-based inodes, we won't be allocating indirect blocks anymore, so we won't end up with those chopped-up extents either. There will be the occasional node split in the inode extent tree, but that will be a much less frequent occurrence, and it won't happen when extending an existing extent. Delayed allocation would further improve the on-disk allocation patterns. On the other hand, we'll end up with more overhead when files are highly fragmented.<br></div><div><br></div><div>As long as we're only storing extents in the journal, I don't think those 509-block chunks are a problem; we'll still end up with more compact metadata for mostly contiguous files. We'll do much worse for test cases that write every other block, for example.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p><font face="monospace">
</font></p>
<p><font face="monospace">
</font></p>
<blockquote type="cite">
<div dir="ltr">
<div>
<div><font face="monospace"><span class="gmail_default" style="font-family:monospace,monospace">With 4k block
sizes, we can fit 84 extents (10 for 512b, 20 for 1k, </span><span class="gmail_default" style="font-family:monospace,monospace">42 for 2k block
sizes) in one block. As we process more allocs/deallocs,
we keep creating more such alloc_ext records and tack
them onto the end of this block if there's space, or else
create a new block. For smaller extents, this might not
be efficient, so we might just want to revert to the old
method of recording the bitmap blocks instead.</span></font></div>
</div>
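<div>The per-block capacities quoted above follow from the 32-byte header and 48-byte record sizes. A small sketch of that arithmetic, where the constant and function names are made up for this example:</div>

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the comments in the proposed structures. */
enum {
	EXTENTS_HDR_SIZE = 32,	/* struct gfs2_extents */
	ALLOC_EXT_SIZE = 48	/* struct alloc_ext */
};

/* Number of whole alloc_ext records that fit after the header. */
static size_t alloc_ext_per_block(size_t block_size)
{
	if (block_size < EXTENTS_HDR_SIZE)
		return 0;
	return (block_size - EXTENTS_HDR_SIZE) / ALLOC_EXT_SIZE;
}
```

<div>With these sizes, (4096 - 32) / 48 rounds down to 84, and the 512-byte, 1k and 2k cases give 10, 20 and 42.</div>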
<div><font face="monospace"><span class="gmail_default" style="font-family:monospace,monospace">During journal
replay, we decode these new blocks and flip the
corresponding bitmaps for each of the blocks represented
in the extents. For the ones where we just recorded the
bitmap blocks the old-fashioned way, we also replay them
the old-fashioned way. This way we're also backward
compatible with an older version of gfs2 that only records
the bitmaps.</span></font></div>
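<div>A rough user-space sketch of the replay step described above: walk one extent and set the state of each block it covers in a resource group bitmap. It assumes the usual gfs2 encoding of two bits of state per block (four blocks per byte, 0 = free, 1 = used); the SK_ constants and function names are invented for this example, and bounds checking against the bitmap length is left out.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_BIT_SIZE	2	/* bits of state per block */
#define SK_NBBY		4	/* block states packed into one byte */
#define SK_BLKST_FREE	0u
#define SK_BLKST_USED	1u

/* Overwrite the two-bit state of block `blk` (relative to the bitmap). */
static void sk_set_state(uint8_t *bitmap, uint64_t blk, unsigned state)
{
	unsigned shift = (unsigned)(blk % SK_NBBY) * SK_BIT_SIZE;
	uint8_t *byte = bitmap + blk / SK_NBBY;

	*byte = (uint8_t)((*byte & ~(3u << shift)) | (state << shift));
}

/* Read back the two-bit state of block `blk`. */
static unsigned sk_get_state(const uint8_t *bitmap, uint64_t blk)
{
	unsigned shift = (unsigned)(blk % SK_NBBY) * SK_BIT_SIZE;

	return (bitmap[blk / SK_NBBY] >> shift) & 3u;
}

/*
 * Replay one extent record: blocks [start, start + len) are absolute,
 * `first_blk` is the first block covered by this bitmap.
 */
static void sk_replay_extent(uint8_t *bitmap, uint64_t first_blk,
			     uint64_t start, uint64_t len, unsigned state)
{
	for (uint64_t i = 0; i < len; i++)
		sk_set_state(bitmap, start - first_blk + i, state);
}
```

<div>Replaying an allocation would pass SK_BLKST_USED and replaying a deallocation SK_BLKST_FREE; applying the same record twice leaves the bitmap unchanged, which is the property journal replay relies on.</div>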
<div><font face="monospace"><span class="gmail_default" style="font-family:monospace,monospace">Since we record
the uid/gid with each extent, we can do the quota
accounting without relying on the quota change file. We
might need to keep the quota change file around for
backward compatibility and for the cases where we might
want to record allocs/deallocs the old-fashioned way.</span></font></div>
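<div>To illustrate the quota idea above, here is a small sketch that derives a user's net block change directly from a list of extent records. The ext_rec structure and its is_alloc field are simplifications invented for this example; in the proposal the alloc/dealloc distinction would live in ae_flags.</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified in-memory view of one logged extent (hypothetical). */
struct ext_rec {
	uint32_t uid;
	uint32_t gid;
	uint64_t len;		/* extent length in blocks */
	bool is_alloc;		/* true = allocation, false = deallocation */
};

/* Net number of blocks gained (or lost, if negative) by `uid`. */
static int64_t net_blocks_for_uid(const struct ext_rec *recs, size_t n,
				  uint32_t uid)
{
	int64_t net = 0;

	for (size_t i = 0; i < n; i++) {
		if (recs[i].uid != uid)
			continue;
		if (recs[i].is_alloc)
			net += (int64_t)recs[i].len;
		else
			net -= (int64_t)recs[i].len;
	}
	return net;
}
```

<div>The same loop keyed on gid would give the group totals.</div>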
<div><font face="monospace"><span class="gmail_default" style="font-family:monospace,monospace"><br>
</span></font></div>
<div>
<div style="font-family:monospace,monospace">I'm going to play
around with this and come up with some patches to see if
this works and what kind of performance improvements we get.
These data structures will most likely need reworking and
renaming, but this is the general direction I'm thinking
along.</div>
<div style="font-family:monospace,monospace"><br>
</div>
<div style="font-family:monospace,monospace">Please let me know
what you think.</div>
</div>
<div style="font-family:monospace,monospace"><br>
</div>
<div style="font-family:monospace,monospace">Cheers!</div>
<div style="font-family:monospace,monospace">--Abhi</div>
</div>
</blockquote>
<p>That all sounds good. I'm sure it will take a little while to
figure out how to get this right,</p>
<p>Steve.</p>
</div></blockquote><div>Thanks,</div><div>Andreas<br></div></div></div>