[dm-devel] [PATCH 00/17] btrfs: split bio at btrfs_map_bio() time

Qu Wenruo wqu at suse.com
Wed Dec 1 08:57:35 UTC 2021



On 2021/12/1 13:17, Qu Wenruo wrote:
> [BACKGROUND]
> 
> Currently btrfs never uses bio_split() to split its bios against RAID
> stripe boundaries.
>
> Instead, inside btrfs we check the stripe boundary every time we
> allocate a new bio, and ensure the new bio never crosses stripe
> boundaries.
> 
> [PROBLEMS]
> 
> Although this behavior works fine, it goes against the common practice
> used in stacked drivers, and it makes the effort to convert to iomap
> harder.
>
> There is also a hidden cost: every time we allocate a new bio, we
> allocate BIO_MAX_VECS bvec slots, but since we know the boundaries, for
> RAID0/RAID10 we can fit at most 16 pages (with the fixed 64K stripe
> size and 4K page size), wasting most of the 256 slots we allocated.
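>
> A minimal sketch of the waste (assuming the fixed 64K stripe size and
> 4K pages; BTRFS_STRIPE_LEN and BIO_MAX_VECS are existing kernel
> constants):
>
>   /* Slots we can actually use before hitting the stripe boundary */
>   unsigned int useful = BTRFS_STRIPE_LEN >> PAGE_SHIFT; /* 64K/4K = 16 */
>
>   /* Slots we allocate anyway */
>   struct bio *bio = bio_alloc(GFP_NOFS, BIO_MAX_VECS); /* 256 slots */
>
>   /* 256 - 16 = 240 bvec slots that can never be filled */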
> 
> [CHALLENGES]
> 
> To improve the situation, this patchset moves the bio split to
> btrfs_map_bio() time, so the upper layers no longer need to bother
> splitting bios against RAID stripe or even chunk boundaries.
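>
> A rough sketch of the intended flow (illustrative only; the loop shape
> and @length_to_boundary are simplifications, and btrfs_bio_split() is
> the helper introduced in patch 07, whose exact signature may differ):
>
>   /* Inside btrfs_map_bio(): keep splitting until the bio fits */
>   while (bio->bi_iter.bi_size > length_to_boundary) {
>           struct bio *split;
>
>           split = btrfs_bio_split(fs_info, bio, length_to_boundary);
>           /* map and submit @split against its stripe */
>   }
>   /* map and submit the now in-boundary @bio */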
> 
> But there are several challenges:
> 
> - Conflicts in various endio functions
>    We want to keep the existing granularity instead of chained endio,
>    thus we must make the involved endio functions handle split bios.
>
>    Although most endio functions already do their work independently
>    of the bio size, they do not yet fully handle split bios.
>
>    This patchset converts them to use the saved bi_iter and iterate
>    only the split range instead of the whole bio.
>    This change involves three types of IO:
> 
>    * Buffered IO
>      Including both data and metadata
>    * Direct IO
>    * Compressed IO
> 
>    Their endio functions need different levels of updates to handle
>    split bios.
> 
>    Furthermore, there is another endio, end_workqueue_bio(), which
>    can't handle split bios at all, thus we change the timing so that
>    btrfs_bio_wq_end_io() is only called after the bio has been split.
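>
>    A minimal sketch of the converted endio pattern (the function name
>    here is hypothetical; btrfs_bio::iter is the member saved by this
>    series):
>
>      static void btrfs_split_aware_endio(struct bio *bio)
>      {
>              struct btrfs_bio *bbio = btrfs_bio(bio);
>              struct bio_vec bvec;
>              struct bvec_iter iter;
>
>              /* Walk only the range this (possibly split) bio covers */
>              __bio_for_each_segment(bvec, bio, iter, bbio->iter) {
>                      /* per-bvec endio work: csum check, unlock, ... */
>              }
>      }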
> 
> - Checksum verification
>    Currently we rely on btrfs_bio::csum to contain the checksum for the
>    whole bio.
>    If a bio gets split, csum no longer points to the correct
>    location for the split bio.
>
>    This is solved by introducing btrfs_bio::offset_to_original, and
>    using that new member to calculate where we should read csum from.
> 
>    The parent bio still has btrfs_bio::csum for the whole bio, thus
>    the csum array can still be freed correctly.
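>
>    A minimal sketch of the csum lookup (illustrative; only
>    offset_to_original is new in this series, while sectorsize_bits and
>    csum_size are existing btrfs_fs_info fields):
>
>      /* Sectors between the original bio's start and this split bio */
>      u32 sector_off = bbio->offset_to_original >>
>                       fs_info->sectorsize_bits;
>      u8 *csum = bbio->csum + sector_off * fs_info->csum_size;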
> 
> - Independent endio for each split bio
>    Unlike stacked drivers, for RAID10 btrfs needs to make its best
>    effort to read every sector, to handle cases like the following
>    (X means bad, either unreadable or failing checksum verification;
>    V means good):
>
>    Dev 1 (missing) | D1 (X) |
>    Dev 2 (OK)      | D1 (V) |
>    Dev 3 (OK)      | D2 (V) |
>    Dev 4 (OK)      | D2 (X) |
> 
>    In the above RAID10 case, dev1 is missing, and although dev4 is
>    online, its D2 sector is corrupted (by bit rot or whatever).
>
>    If we used bio_chain(), the read bio would still be split for D1
>    and D2, but both splits would complete through one chained parent;
>    since D1 is missing, the whole D1 and D2 read would be marked as an
>    error, and we would retry the read from dev2 and dev4.
>
>    But D2 on dev4 has a csum mismatch, so we can only rebuild D1 and
>    D2 correctly from dev2:D1 and dev3:D2.
>
>    This patchset resolves this by saving bi_iter into btrfs_bio::iter,
>    and using that at endio time to iterate only the split part of a
>    bio. Beyond that, the existing read/write page endio functions can
>    handle split bios properly without problems.
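>
>    A minimal sketch of the save step (simplified from the helper this
>    series adds):
>
>      /* Record the iter right before submission, since bi_iter is
>       * advanced by the lower layers while the bio is processed.
>       */
>      static void btrfs_bio_save_iter(struct btrfs_bio *bbio)
>      {
>              bbio->iter = bbio->bio.bi_iter;
>      }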
> 
> - Bad RAID56 naming/functionality
>    Quite a few RAID56 call sites rely on specific behavior of
>    __btrfs_map_block(), like returning @map_length as stripe_len
>    rather than the real mapped length.
> 
>    This is handled by some small cleanups specific for RAID56.
> 
> [NEED FEEDBACK]
> In this refactor, btrfs ends up with a lot of call sites like:
> 
>   btrfs_bio_save_iter(bbio);	/* save bi_iter into btrfs_bio::iter */
>   __bio_for_each_segment(bvec, bio, iter, bbio->iter) {
> 	/* do endio work for each bvec */
>   }
> 
> And manually implementing endio functions which do part of the work of
> __bio_chain_endio(), plus extra btrfs-specific work.
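>
> A minimal sketch of such an endio (the function name is hypothetical;
> the parent/error bookkeeping mirrors what __bio_chain_endio() does):
>
>   static void btrfs_split_bio_endio(struct bio *bio)
>   {
>           struct bio *parent = bio->bi_private;
>
>           /* btrfs-specific work for this split range goes here */
>
>           /* Open-coded __bio_chain_endio() equivalent */
>           if (bio->bi_status && !parent->bi_status)
>                   parent->bi_status = bio->bi_status;
>           bio_put(bio);
>           bio_endio(parent);
>   }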
> 
> I'm wondering whether the block layer would be fine with providing
> some *enhanced* bio chaining facilities?
> 
> [CHANGELOG]
> RFC->v1:
> - Better patch split
>    Patches 01~06 are refactors/cleanups/preparations.
>    Patches 07~13 do the conversion and can handle both the old and new
>    bio split timings.
>    Patches 14~16 convert the bio split call sites one by one to the
>    new facility.
>    The final patch is just a small cleanup.
> 
> - Various bug fixes
>    During the full fstests run, various stupid bugs were exposed and
>    fixed.

The latest version can be fetched from this branch:

https://github.com/adam900710/linux/tree/refactor_chunk_map

It already has the fix for the btrfs/187 crash, which was caused by a
missing modification to scrub_stripe_index_and_offset().

Thanks,
Qu
> 
> Qu Wenruo (17):
>    btrfs: update a stale comment on btrfs_submit_bio_hook()
>    btrfs: save bio::bi_iter into btrfs_bio::iter before submitting
>    btrfs: use correct bio size for error message in btrfs_end_dio_bio()
>    btrfs: refactor btrfs_map_bio()
>    btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio()
>    btrfs: replace btrfs_dio_private::refs with
>      btrfs_dio_private::pending_bytes
>    btrfs: introduce btrfs_bio_split() helper
>    btrfs: make data buffered read path to handle split bio properly
>    btrfs: make data buffered write endio function to be split bio
>      compatible
>    btrfs: make metadata write endio functions to be split bio compatible
>    btrfs: make dec_and_test_compressed_bio() to be split bio compatible
>    btrfs: return proper mapped length for RAID56 profiles in
>      __btrfs_map_block()
>    btrfs: allow btrfs_map_bio() to split bio according to chunk stripe
>      boundaries
>    btrfs: remove buffered IO stripe boundary calculation
>    btrfs: remove stripe boundary calculation for compressed IO
>    btrfs: remove the stripe boundary calculation for direct IO
>    btrfs: unexport btrfs_get_io_geometry()
> 
>   fs/btrfs/btrfs_inode.h |  10 +-
>   fs/btrfs/compression.c |  70 +++-----------
>   fs/btrfs/disk-io.c     |   9 +-
>   fs/btrfs/extent_io.c   | 189 +++++++++++++++++++++++++------------
>   fs/btrfs/extent_io.h   |   2 +
>   fs/btrfs/inode.c       | 210 ++++++++++++++++-------------------------
>   fs/btrfs/raid56.c      |  14 ++-
>   fs/btrfs/raid56.h      |   2 +-
>   fs/btrfs/scrub.c       |   4 +-
>   fs/btrfs/volumes.c     | 157 ++++++++++++++++++++++--------
>   fs/btrfs/volumes.h     |  75 +++++++++++++--
>   11 files changed, 435 insertions(+), 307 deletions(-)
> 
