[dm-devel] dm-thin: Several Questions on dm-thin performance.
Eric Wheeler
dm-devel at lists.ewheeler.net
Thu Dec 5 23:14:00 UTC 2019
Thanks Joe, great writeup. Maybe this should go in thin-provisioning.txt.
More below:
On Wed, 4 Dec 2019, Joe Thornber wrote:
> (These notes are for my own benefit as much as anything, I haven't
> worked on this for a couple of years and will forget it all completely
> if I don't write it down somewhere).
>
> Let's start by writing some pseudocode for what the remap function for
> thin provisioning actually does.
>
> ----------------------------------------------------------
> -- Metadata
>
> newtype ThinId = Int
> data Bio = Bio
> newtype VBlock = Integer -- virtual block
> newtype DBlock = Integer -- data block
Can you define virtual block vs data block? Is this just the thin volume
offset vs the tdata offset?
>
> data LookupResult =
> Unprovisioned |
> Provisioned { lrDBlock :: DBlock,
> lrShared :: Bool
> }
What is "lr", as in lrDBlock?
> metadataLookup :: ThinId -> VBlock -> Task LookupResult
> metadataLookup = undefined
>
> metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
> metadataInsert = undefined
>
> metadataRemove :: ThinId -> VBlock -> Task ()
> metadataRemove = undefined
>
> -- Blocks all other metadata operations while running
> metadataCommit :: Task ()
> metadataCommit = undefined
>
> ----------------------------------------------------------
> -- Tasks
>
> -- Many steps of servicing a bio can block. eg, taking a lock,
> -- reading metadata, updating metadata, zeroing data, copying
> -- data ...
> -- So we completely sidestep the issue in this pseudocode by
> -- running everything in a magic lightweight thread.
> spawn :: Task () -> IO ()
> spawn = undefined
>
> ----------------------------------------------------------
> -- Locking
>
> -- These 'with' primitives acquire a lock (can block of course), perform
> -- an action, and then automatically release the lock.
>
> -- Shared lock can be upgraded, so we need to pass the lock into
> -- the action body.
> withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
> withSharedLock thinId vblock actionFn = undefined
>
> withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
> withExclusiveLock thinId vblock action = undefined
>
> -- This promotes a shared lock to exclusive.
> withUpgradedLock :: Lock -> Task () -> Task ()
> withUpgradedLock lock action = undefined
>
> -- Data locks are always exclusive
> withDataLock :: DBlock -> Task () -> Task ()
> withDataLock dblock action = undefined
>
> ----------------------------------------------------------
>
> -- Top level remap function. Kicks off a green thread for each bio.
> -- How we handle a bio depends on whether it's a read, write, discard
> -- or flush bio; whether the block is already provisioned; and, if so,
> -- whether it is shared between snapshots.
> remap :: ThinId -> Bio -> IO ()
> remap thinId bio = spawn $ remapFn thinId bio vblock
> where
> vblock = virtBlock bio
> remapFn = case classifyBio bio of
> ReadBio -> remapRead
> WriteBio -> remapWrite
> DiscardBio -> remapDiscard
> FlushBio -> remapFlush
>
> ----------------------------------------------------------
>
> remapRead :: ThinId -> Bio -> VBlock -> Task ()
> remapRead thinId bio vblock = do
> withSharedLock thinId vblock $ \_ -> do
> lr <- metadataLookup thinId vblock
> case lr of
> -- Read, Unprovisioned, Shared/!Shared
> Unprovisioned -> do
> fillWithZeroes bio
> complete bio Success
>
> -- Read, Provisioned, Shared/!Shared
> (Provisioned dblock _) ->
> remapAndWait bio dblock
>
> ----------------------------------------------------------
>
> remapWrite :: ThinId -> Bio -> VBlock -> Task ()
> remapWrite thinId bio vblock = do
> withSharedLock thinId vblock $ \lock -> do
> lr <- metadataLookup thinId vblock
> case lr of
> -- Write, Unprovisioned
> Unprovisioned ->
> withUpgradedLock lock $
> provision thinId bio vblock
>
> -- Write, Provisioned, !Shared
> (Provisioned dblock False) ->
> remapAndWait bio dblock
>
> -- Write, Provisioned, Shared
> (Provisioned dblock True) ->
> withUpgradedLock lock $
> breakSharing thinId bio vblock dblock
>
> breakSharing :: ThinId -> Bio -> VBlock -> DBlock -> Task ()
> breakSharing thinId bio vblock dblockOld = do
> ab <- allocateBlock
> case ab of
> NoDataSpace ->
> complete bio Failure
>
> (Allocated dblockNew) ->
> withDataLock dblockOld $ -- we grab data locks to avoid races with discard
> withDataLock dblockNew $ do
> copy dblockOld dblockNew
> metadataInsert thinId vblock dblockNew
> remapAndWait bio dblockNew
>
> provision :: ThinId -> Bio -> VBlock -> Task ()
> provision thinId bio vblock = do
> ab <- allocateBlock
> case ab of
> NoDataSpace ->
> complete bio Failure
>
> (Allocated dblock) ->
> withDataLock dblock $ do
> metadataInsert thinId vblock dblock
> remapAndWait bio dblock
Does the allocator block? If so, it would be neat to pre-allocate some
number of blocks during metadata idle times. There could be a hidden thin
volume (ie, devid #16777215) that blocks are allocated into and then
stolen from for use elsewhere. The blocks could be pre-zeroed, too!
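Roughly, in the same pseudocode style (these names and the AllocResult
type spelling are made up, nothing in dm-thin):

```haskell
data AllocResult = NoDataSpace | Allocated DBlock

-- Sketch only: a reserve of pre-allocated (and pre-zeroed) blocks,
-- refilled by a background task while the metadata is idle.
reserveBlocks :: Int -> Task ()
reserveBlocks n = undefined   -- allocate and zero n blocks into the reserve

-- The IO path takes from the reserve first, falling back to the
-- ordinary (blocking) allocator only when the reserve is empty.
takeReserved :: Task AllocResult
takeReserved = undefined
```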
>
> ----------------------------------------------------------
>
> remapDiscard :: ThinId -> Bio -> VBlock -> Task ()
> remapDiscard thinId bio vblock = do
> withExclusiveLock thinId vblock $ do
> lr <- metadataLookup thinId vblock
> case lr of
> -- Discard, Unprovisioned
> Unprovisioned ->
> complete bio Success
>
> -- Discard, Provisioned, !Shared
> (Provisioned dblock False) ->
> withDataLock dblock $ do
> remapAndWait bio dblock -- passdown
> metadataRemove thinId vblock
>
> -- Discard, Provisioned, Shared
> (Provisioned dblock True) ->
> withDataLock dblock $ do
> metadataRemove thinId vblock
> complete bio Success
>
> ----------------------------------------------------------
>
> remapFlush :: ThinId -> Bio -> VBlock -> Task ()
> remapFlush _ bio _ = do
> metadataCommit
> complete bio Success
>
> ----------------------------------------------------------
>
> remapAndWait :: Bio -> DBlock -> Task ()
> remapAndWait bio dblock = do
> -- nb. 'remap' here redirects the bio to the data device at dblock;
> -- it is not the top level remap function above.
> remap bio dblock
> issue bio
> wait bio
>
> The above is a simplification (eg, discards can cover more than a single
> block, the pool has multiple modes like OUT_OF_DATA_SPACE). But it gives
> a good idea of what the dm target needs to do, and in a succinct manner.
>
> Now dm-thin.c is anything but succinct, for a couple of reasons:
>
> - Because of where we are in the IO stack we cannot allocate memory.
> This means we either use memory preallocated via mempools, or allocate
> a fixed size block before a bio is processed.
>
> - We don't have a magic green threads library that hides the numerous
> blocking operations that we need. Instead we have low level facilities
> like workqueues etc. This tends to have the effect of breaking up the logic
> and scattering it across lots of little completion functions.
>
>
> How we handle blocking, locking, and quiescing IO are all intertwined.
> Which is why switching over to the new bio_prison interface is going to
> involve an awful lot of churn.
>
> In the upstream code
> ====================
>
> - Locking
>
> The locks provided by bio_prison (v1) are all exclusive locks. As such
> we take pains to hold them for as short a period as possible. This means
> holding them for the duration of an IO is completely out of the question.
> Nonetheless, as pointed out in the original post for this thread, this
> can cause bad lock contention, especially if the data block size is large.
>
> - Quiescing
>
> Because we do not hold locks for the lifetime of the bios, we need
> another way of tracking IO and quiescing regions. This is what the
> deferred_set component does. Effectively it divides time up into
> bins, and keeps a reference count of how many IOs are still in flight
> for each bin. To quiesce we grab a lock, and then wait for all bins
> before this lock was acquired to drain. The advantages of this approach
> are that it uses very little memory (I think we're currently running
> with 64 bins) and consumes very little cpu. But we're never specific about
> which region we're waiting to quiesce, instead always waiting for all
> IO older than a certain point to drain. So we are certainly introducing
> more latency here.
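For reference, the bin scheme might look like this in your pseudocode
style (the ds* names are made up, not the real deferred_set API):

```haskell
type RefCount = Int
type BinIndex = Int

-- Time is divided into bins, each holding a count of IOs still in
-- flight that entered during that bin's window.
data DeferredSet = DeferredSet { dsBins    :: [RefCount]  -- eg, 64 bins
                               , dsCurrent :: BinIndex }

dsEnter :: DeferredSet -> Task BinIndex       -- inc the current bin's count
dsEnter = undefined

dsExit :: DeferredSet -> BinIndex -> Task ()  -- dec that bin's count
dsExit = undefined

-- Rotate to a fresh bin, then wait for every older bin to drain.
-- Note this waits for *all* IO older than the rotation point, not
-- just IO to the region being quiesced -- hence the extra latency.
dsQuiesce :: DeferredSet -> Task ()
dsQuiesce = undefined
```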
>
> - Blocking
>
> A single thread services any operations that could block.
> When there is work for this thread to perform a work queue item
> is queued (do_worker()). This then examines linked lists of work
> (prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
> processes each list as a batch. Batching like this is a mixed blessing;
> it allows us to sort incoming bios so we can process bios to the same
> region at the same time, but it is also very bad for max latency, as we
> have no idea which piece of work was there the longest.
>
> Next iteration of the code
> ==========================
>
> - Locking
>
> bio_prison (v2) provides shared locks and custom lock levels. So,
> at the expense of memory, we can hold shared locks for long periods
> that cover the lifetime of the bio. Acquiring a lock now blocks.
>
> - Quiescing
>
> Because we hold the locks for long periods we can now do away with the
> deferred set completely. If you want to quiesce a region, just grab
> the exclusive lock associated with it, when it's finally granted you
> know it's also quiesced.
>
good to know.
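So in the pseudocode at the top of the mail, quiescing a region just
collapses to taking the exclusive lock (a sketch, not the bio_prison v2
API):

```haskell
-- With shared locks held for the whole lifetime of a bio, the
-- exclusive lock is only granted once all shared holders -- ie. all
-- in-flight IO to the region -- have completed. No deferred set needed.
quiesce :: ThinId -> VBlock -> Task () -> Task ()
quiesce thinId vblock action = withExclusiveLock thinId vblock action
```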
> - Blocking
>
> I want to move away from the idea of a single worker function that
> has different categories of work stored for it in different lists.
> Instead, we store a specific work struct on the workqueue for each bio.
> Partly this is to reduce latency by increasing 'fairness'. But also
> the fact that acquiring a lock now blocks means there are a lot more
> block operations to handle, and we'd just end up with a lot of these
> lists of work. It would also allow us to have multiple kernel threads
> servicing the workqueue.
>
> If you look at dm-cache-target.c you'll see this has already been
> done for that target. We have continuation structs that represent
> the work to be performed after the current blocking op has completed.
> dm-cache uses this for migrations, which have a much simpler state model
> than dm-thin. Even so there are a lot of these little continuation
> functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
> mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
> mg_success, mg_complete).
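To make the naming problem concrete: here is roughly what the
shared-write path of remapWrite from the pseudocode above turns into
once each blocking point becomes its own continuation (all names
invented for illustration, bodies elided):

```haskell
-- Each function is "where we are" after one blocking op completes:
wrLockShared  :: ThinId -> Bio -> VBlock -> Task ()  -- request shared lock
wrLookupDone  :: Lock -> LookupResult -> Task ()     -- dispatch on lookup
wrUpgradeDone :: Task ()   -- lock now exclusive; try to allocate
wrCopyDone    :: Task ()   -- copy finished; insert the new mapping
wrInsertDone  :: Task ()   -- finally remap and reissue the bio
```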
>
>
> Where are we now?
> =================
>
> I did a lot of work on this a couple of years ago. First I just dove
> in, trying to code things up by hand. But it quickly devolved into a
> maze of badly named continuation functions, all alike. It's very hard
> to give these functions meaningful names; go through the pseudocode at
> the top of this email and for each place where we could block, try to
> describe where we are. The biggest problem is that as we introduce more
> of these continuations the big picture logic recedes and it becomes
> very hard to reason about the code.
Event-driven continuation functions seem to pop up frequently in the Linux
kernel. It would be neat if there were a framework for writing these
procedurally. Macros might help, but it could still be pretty ugly. It
almost needs GCC support.
> I then experimented with automatically generating all the code from a
> simpler specification (I used a lispy version of the pseudocode above).
> This showed promise and I got it generating kernel code that would
> compile. I was debugging this when I got dragged onto other work,
> and this has stagnated since.
Do you think this is the best way to proceed? Someone with a Lisp
background would need to help. It might generate good first-pass code, but
it would be difficult to maintain as kernel changes patch the
auto-generated code.
-Eric
>
>
> So that's where we are.
>
>
>
>
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
>