[dm-devel] dm-thin: Several Questions on dm-thin performance.
Eric Wheeler
dm-devel at lists.ewheeler.net
Thu Dec 5 23:14:00 UTC 2019
Thanks Joe, great writeup. Maybe this should go in thin-provisioning.txt.
More below:
On Wed, 4 Dec 2019, Joe Thornber wrote:
> (These notes are for my own benefit as much as anything, I haven't
> worked on this for a couple of years and will forget it all completely
> if I don't write it down somewhere).
>
> Let's start by writing some pseudocode for what the remap function for
> thin provisioning actually does.
>
> ----------------------------------------------------------
> -- Metadata
>
> newtype ThinId = Int
> data Bio = Bio
> newtype VBlock = Integer -- virtual block
> newtype DBlock = Integer -- data block
Can you define virtual block vs data block? Is this just the thin volume
offset vs the tdata offset?
>
> data LookupResult =
> Unprovisioned |
> Provisioned { lrDBlock :: DBlock,
> lrShared :: Bool
> }
What is "lr", as in lrDBlock?
> metadataLookup :: ThinId -> VBlock -> Task LookupResult
> metadataLookup = undefined
>
> metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
> metadataInsert = undefined
>
> metadataRemove :: ThinId -> VBlock -> Task ()
> metadataRemove = undefined
>
> -- Blocks all other metadata operations while running
> metadataCommit :: Task ()
> metadataCommit = undefined
>
> ----------------------------------------------------------
> -- Tasks
>
> -- Many steps of servicing a bio can block. eg, taking a lock,
> -- reading metadata, updating metadata, zeroing data, copying
> -- data ...
> -- So we completely sidestep the issue in this pseudocode by
> -- running everything in a magic lightweight thread.
> spawn :: Task () -> IO ()
> spawn = undefined
>
> ----------------------------------------------------------
> -- Locking
>
> -- These 'with' primitives acquire a lock (can block of course), perform
> -- an action, and then automatically release the lock.
>
> -- Shared lock can be upgraded, so we need to pass the lock into
> -- the action body.
> withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
> withSharedLock thinId vblock actionFn = undefined
>
> withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
> withExclusiveLock thinId vblock action = undefined
>
> -- This promotes a shared lock to exclusive.
> withUpgradedLock :: Lock -> Task () -> Task ()
> withUpgradedLock lock action = undefined
>
> -- Data locks are always exclusive
> withDataLock :: DBlock -> Task () -> Task ()
> withDataLock dblock action = undefined
>
> ----------------------------------------------------------
>
> -- Top level remap function. Kicks off a green thread for each bio.
> -- How we handle a bio depends on whether it's a read, write, discard
> -- or flush bio; whether the block is already provisioned; and, if so,
> -- whether it is shared between snapshots.
> remap :: ThinId -> Bio -> IO ()
> remap thinId bio = spawn $ remapFn thinId bio vblock
> where
> vblock = virtBlock bio
> remapFn = case classifyBio bio of
> ReadBio -> remapRead
> WriteBio -> remapWrite
> DiscardBio -> remapDiscard
> FlushBio -> remapFlush
>
> ----------------------------------------------------------
>
> remapRead :: ThinId -> Bio -> VBlock -> Task ()
> remapRead thinId bio vblock = do
> withSharedLock thinId vblock $ \_ -> do
> lr <- metadataLookup thinId vblock
> case lr of
> -- Read, Unprovisioned, Shared/!Shared
> Unprovisioned -> do
> fillWithZeroes bio
> complete bio Success
>
> -- Read, Provisioned, Shared/!Shared
> (Provisioned dblock _) ->
> remapAndWait bio dblock
>
> ----------------------------------------------------------
>
> remapWrite :: ThinId -> Bio -> VBlock -> Task ()
> remapWrite thinId bio vblock = do
> withSharedLock thinId vblock $ \lock -> do
> lr <- metadataLookup thinId vblock
> case lr of
> -- Write, Unprovisioned
> Unprovisioned ->
> withUpgradedLock lock $
> provision thinId bio vblock
>
> -- Write, Provisioned, !Shared
> (Provisioned dblock False) ->
> remapAndWait bio dblock
>
> -- Write, Provisioned, Shared
> (Provisioned dblock True) ->
> withUpgradedLock lock $
> breakSharing thinId bio vblock dblock
>
> breakSharing :: ThinId -> Bio -> VBlock -> DBlock -> Task ()
> breakSharing thinId bio vblock dblockOld = do
> ab <- allocateBlock
> case ab of
> NoDataSpace ->
> complete bio Failure
>
> (Allocated dblockNew) ->
> withDataLock dblockOld $ -- we grab data locks to avoid races with discard
> withDataLock dblockNew $ do
> copy dblockOld dblockNew
> metadataInsert thinId vblock dblockNew
> remapAndWait bio dblockNew
>
> provision :: ThinId -> Bio -> VBlock -> Task ()
> provision thinId bio vblock = do
> ab <- allocateBlock
> case ab of
> NoDataSpace ->
> complete bio Failure
>
> (Allocated dblock) ->
> withDataLock dblock $ do
> metadataInsert thinId vblock dblock
> remapAndWait bio dblock
Does the allocator block? If so, it would be neat to pre-allocate some
number of blocks during metadata idle times. There could be a hidden thin
volume (ie, devid #16777215) that blocks are allocated into and then
stolen from for use elsewhere. The blocks could be pre-zeroed, too!
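Roughly, in the same pseudocode style (these names and the AllocResult
type spelling are made up, nothing in dm-thin):

```haskell
data AllocResult = NoDataSpace | Allocated DBlock

-- Sketch only: a reserve of pre-allocated (and pre-zeroed) blocks,
-- refilled by a background task while the metadata is idle.
reserveBlocks :: Int -> Task ()
reserveBlocks n = undefined   -- allocate and zero n blocks into the reserve

-- The IO path takes from the reserve first, falling back to the
-- ordinary (blocking) allocator only when the reserve is empty.
takeReserved :: Task AllocResult
takeReserved = undefined
```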
>
> ----------------------------------------------------------
>
> remapDiscard :: ThinId -> Bio -> VBlock -> Task ()
> remapDiscard thinId bio vblock = do
> withExclusiveLock thinId vblock $ do
> lr <- metadataLookup thinId vblock
> case lr of
> -- Discard, Unprovisioned
> Unprovisioned ->
> complete bio Success
>
> -- Discard, Provisioned, !Shared
> (Provisioned dblock False) ->
> withDataLock dblock $ do
> remapAndWait bio dblock -- passdown
> metadataRemove thinId vblock
>
> -- Discard, Provisioned, Shared
> (Provisioned dblock True) ->
> withDataLock dblock $ do
> metadataRemove thinId vblock
> complete bio Success
>
> ----------------------------------------------------------
>
> remapFlush :: ThinId -> Bio -> VBlock -> Task ()
> remapFlush _ bio _ = do
> metadataCommit
> complete bio Success
>
> ----------------------------------------------------------
>
> remapAndWait :: Bio -> DBlock -> Task ()
> remapAndWait bio dblock = do
> -- nb. 'remap' here redirects the bio to the data device at dblock;
> -- it is not the top level remap function above.
> remap bio dblock
> issue bio
> wait bio
>
> The above is a simplification (eg, discards can cover more than a single
> block, the pool has multiple modes like OUT_OF_DATA_SPACE). But it gives
> a good idea of what the dm target needs to do, and in a succinct manner.
>
> Now dm-thin.c is anything but succinct, for a couple of reasons:
>
> - Because of where we are in the IO stack we cannot allocate memory.
> This means we either use memory preallocated via mempools, or allocate
> a fixed size block before a bio is processed.
>
> - We don't have a magic green threads library that hides the numerous
> blocking operations that we need. Instead we have low level facilities
> like workqueues etc. This tends to have the effect of breaking up the logic
> and scattering it across lots of little completion functions.
>
>
> How we handle blocking, locking, and quiescing IO are all intertwined.
> Which is why switching over to the new bio_prison interface is going to
> involve an awful lot of churn.
>
> In the upstream code
> ====================
>
> - Locking
>
> The locks provided by bio_prison (v1) are all exclusive locks. As such
> we take pains to hold them for as short a period as possible. This means
> holding them for the duration of an IO is completely out of the question.
> Nonetheless, as pointed out in the original post for this thread, this
> can cause bad lock contention, especially if the data block size is large.
>
> - Quiescing
>
> Because we do not hold locks for the lifetime of the bios, we need
> another way of tracking IO and quiescing regions. This is what the
> deferred_set component does. Effectively it divides time up into
> bins, and keeps a reference count of how many IOs are still in flight
> for each bin. To quiesce we grab a lock, and then wait for all bins
> before this lock was acquired to drain. The advantages of this approach
> are that it uses very little memory (I think we're currently running
> with 64 bins) and consumes very little cpu. But we're never specific about
> which region we're waiting to quiesce, instead always waiting for all
> IO older than a certain point to drain. So we are certainly introducing
> more latency here.
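For reference, the bin scheme might look like this in your pseudocode
style (the ds* names are made up, not the real deferred_set API):

```haskell
type RefCount = Int
type BinIndex = Int

-- Time is divided into bins, each holding a count of IOs still in
-- flight that entered during that bin's window.
data DeferredSet = DeferredSet { dsBins    :: [RefCount]  -- eg, 64 bins
                               , dsCurrent :: BinIndex }

dsEnter :: DeferredSet -> Task BinIndex       -- inc the current bin's count
dsEnter = undefined

dsExit :: DeferredSet -> BinIndex -> Task ()  -- dec that bin's count
dsExit = undefined

-- Rotate to a fresh bin, then wait for every older bin to drain.
-- Note this waits for *all* IO older than the rotation point, not
-- just IO to the region being quiesced -- hence the extra latency.
dsQuiesce :: DeferredSet -> Task ()
dsQuiesce = undefined
```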
>
> - Blocking
>
> A single thread services any operations that could block.
> When there is work for this thread to perform a work queue item
> is queued (do_worker()). This then examines linked lists of work
> (prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
> processes each list as a batch. Batching like this is a mixed blessing;
> it allows us to sort incoming bios so we can process bios to the same
> region at the same time, but it is also very bad for max latency, as we
> have no idea which piece of work was there the longest.
>
> Next iteration of the code
> ==========================
>
> - Locking
>
> bio_prison (v2) provides shared locks and custom lock levels. So,
> at the expense of memory, we can hold shared locks for long periods
> that cover the lifetime of the bio. Acquiring a lock now blocks.
>
> - Quiescing
>
> Because we hold the locks for long periods we can now do away with the
> deferred set completely. If you want to quiesce a region, just grab
> the exclusive lock associated with it, when it's finally granted you
> know it's also quiesced.
>
good to know.
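So in the pseudocode at the top of the mail, quiescing a region just
collapses to taking the exclusive lock (a sketch, not the bio_prison v2
API):

```haskell
-- With shared locks held for the whole lifetime of a bio, the
-- exclusive lock is only granted once all shared holders -- ie. all
-- in-flight IO to the region -- have completed. No deferred set needed.
quiesce :: ThinId -> VBlock -> Task () -> Task ()
quiesce thinId vblock action = withExclusiveLock thinId vblock action
```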
> - Blocking
>
> I want to move away from the idea of a single worker function that
> has different categories of work stored for it in different lists.
> Instead, we store a specific work struct on the workqueue for each bio.
> Partly this is to reduce latency by increasing 'fairness'. But also
> the fact that acquiring a lock now blocks means there are a lot more
> block operations to handle, and we'd just end up with a lot of these
> lists of work. It would also allow us to have multiple kernel threads
> servicing the workqueue.
>
> If you look at dm-cache-target.c you'll see this has already been
> done for that target. We have continuation structs that represent
> the work to be performed after the current blocking op has completed.
> dm-cache uses this for migrations, which have a much simpler state model
> than dm-thin. Even so there are a lot of these little continuation
> functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
> mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
> mg_success, mg_complete).
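To make the naming problem concrete: here is roughly what the
shared-write path of remapWrite from the pseudocode above turns into
once each blocking point becomes its own continuation (all names
invented for illustration, bodies elided):

```haskell
-- Each function is "where we are" after one blocking op completes:
wrLockShared  :: ThinId -> Bio -> VBlock -> Task ()  -- request shared lock
wrLookupDone  :: Lock -> LookupResult -> Task ()     -- dispatch on lookup
wrUpgradeDone :: Task ()   -- lock now exclusive; try to allocate
wrCopyDone    :: Task ()   -- copy finished; insert the new mapping
wrInsertDone  :: Task ()   -- finally remap and reissue the bio
```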
>
>
> Where are we now?
> =================
>
> I did a lot of work on this a couple of years ago. First I just dove
> in, trying to code things up by hand. But it quickly devolved into a
> maze of badly named continuation functions, all alike. It's very hard
> to give these functions meaningful names; go through the pseudocode at
> the top of this email and for each place where we could block, try to
> describe where we are. The biggest problem is that as we introduce more
> of these continuations the big picture logic recedes and it becomes
> very hard to reason about the code.
Event-driven continuation functions seem to pop up frequently in the Linux
kernel. It would be neat if there were a framework for writing these
procedurally. Macros might help, but it could still be pretty ugly. It
almost needs GCC support.
> I then experimented with automatically generating all the code from a
> simpler specification (I used a lispy version of the pseudocode above).
> This showed promise and I got it generating kernel code that would
> compile. I was debugging this when I got dragged onto other work,
> and this has stagnated since.
Do you think this is the best way to proceed? Someone with a Lisp
background would need to help. It might generate good first-pass code, but
it would be difficult to maintain as kernel changes patch the
auto-generated code.
-Eric
>
>
> So that's where we are.
>
>
>
>
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
>