[dm-devel] dm-thin: Several Questions on dm-thin performance.

Wed Dec 4 12:11:06 UTC 2019

(These notes are for my own benefit as much as anything, I haven't
worked on this for a couple of years and will forget it all completely
if I don't write it down somewhere).

Let's start by writing some pseudocode for what the remap function for
thin provisioning actually does.

	----------------------------------------------------------
	-- Metadata

	newtype ThinId = Int
	data Bio = Bio
	newtype VBlock = Integer	-- virtual block
	newtype DBlock = Integer	-- data block

	data LookupResult =
	    Unprovisioned |
	    Provisioned { lrDBlock :: DataBlock,
	                  lrShared :: Bool
	                }

	metadataLookup :: ThinId -> VBlock -> Task LookupResult
	metadataLookup = undefined

	metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
	metadataInsert = undefined

	metadataRemove :: ThinId -> VBlock -> Task ()
	metadataRemove = undefined

	-- Blocks all other metadata operations while running
	metadataCommit :: Task ()
	metadataCommit = undefined

	----------------------------------------------------------
	-- Tasks

	-- Many steps of servicing a bio can block.  eg, taking a lock,
	-- reading metadata, updating metadata, zeroing data, copying
	-- data ...
	-- So we completely side step the issue in this pseudocode by
	-- running everything in a magic light weight thread.
	spawn :: Task () -> IO ()
	spawn = undefined

	----------------------------------------------------------
	-- Locking

	-- These 'with' primitives acquire a lock (can block of course), perform
	-- an action, and then automatically release the lock.

	-- Shared lock can be upgraded, so we need to pass the lock into
	-- the action body.
	withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
	withSharedLock thinId vblock actionFn = undefined

	withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
	withExclusiveLock thinId vblock action = undefined

	-- This promotes a shared lock to exclusive.
	withUpgradedLock :: Lock -> Task () -> Task ()
	withUpgradedLock lock action = undefined

	-- Data locks are always exclusive
	withDataLock :: DBlock -> Task () -> Task ()
	withDataLock dblock action = undefined

	----------------------------------------------------------

	-- Top level remap function.  Kicks off a green thread for each bio.
	-- How we handle a bio depends on whether it's a read, write, discard
	-- or flush bio.  Whether the block is already provisioned, and if so
	-- whether it is shared between snapshots.
	remap :: ThinId -> Bio -> IO ()
	remap thinId bio = spawn $ remapFn thinId bio vblock
	    where
	        vblock = virtBlock bio
	        remapFn = case classifyBio bio of
	            ReadBio -> remapRead
	            WriteBio -> remapWrite
	            DiscardBio -> remapDiscard
	            FlushBio -> remapFlush

	----------------------------------------------------------

	remapRead :: ThinId -> Bio -> VBlock -> Task ()
	remapRead thinId bio vblock = do
	    withSharedLock thinId vblock $ \_ -> do
	        lr <- metadataLookup thinId vblock
	        case lr of
	            -- Read, Unprovisioned, Shared/!Shared
	            Unprovisioned -> do
	                fillWithZeroes bio
	                complete bio Success

	            -- Read, Provisioned, Shared/!Shared
	            (Provisioned dblock _) ->
	                remapAndWait bio dblock

	----------------------------------------------------------

	remapWrite :: ThinId -> Bio -> VBlock -> Task ()
	remapWrite thinId bio vblock = do
	    withSharedLock thinId vblock $ \lock -> do
	        lr <- metadataLookup thinId vblock
	        case lr of
	            -- Write, Unprovisioned
	            Unprovisioned ->
	                withUpgradedLock lock $
	                    provision thinId bio vblock

	            -- Write, Provisioned, !Shared
	            (Provisioned dblock False) ->
	                remapAndWait bio dblock

	            -- Write, Provisioned, Shared
	            (Provisioned dblock True) ->
	                withUpgradedLock lock $
	                    breakSharing thinId bio vblock dblock

	breakSharing :: ThinId -> Bio -> VBlock -> DataBlock -> Task ()
	breakSharing thinId bio vblock dblockOld = do
	    ab <- allocateBlock
	   case ab of
	       NoDataSpace ->
	           complete bio Failure

	       (Allocated dblockNew) ->
	           withDataLock dblockOld $		-- we grab data locks to avoid races with discard
	               withDataLock dblockNew $ do
	                   copy dblockOld dblockNew
	                   metadataInsert thinId vblock dblockNew
	           remapAndWait thinId bio dblockNew

	provision :: ThinId -> Bio -> VBlock -> Task ()
	provision thinId bio vblock = do
	    case allocateBlock of
	        NoDataSpace ->
	            complete bio Failure

	        (Allocated dblock) ->
	            withDataLock dblock $ do
	                metadataInsert thinId vblock dblock
	                remapAndWait thinId bio dblock

	----------------------------------------------------------

	discard :: ThinId -> Bio -> VBlock -> Task ()
	discard thinId bio vblock = do
	    withExclusiveLock thinId vblock $ do
	        lr <- metadataLookup thinId vblock
	        case lr of
	            -- Discard, Unprovisioned
	            Unprovisioned ->
	                complete bio Success

	            -- Discard, Provisioned, !Shared
	            (Provisioned dblock False) ->
	                withDataLock dblock $ do
	                    remapAndWait bio dblock  		-- passdown
	                    metadataRemove thinId dblock

	            -- Discard, Provisioned, Shared
	           (Provisioned dblock True) ->
	               withDataLock dblock $ do
	                   metadataRemove thinId dblock
	                   complete bio Success

	----------------------------------------------------------

	flush :: Task ()
	flush = metadataCommit

	----------------------------------------------------------

	remapAndWait :: Bio -> DataBlock -> Task ()
	remapAndWait bio dblock = do
	    remap bio dblock
	    issue bio
	    wait bio

The above is a simplification (eg, discards can cover more than a single
block, the pool has multiple modes like OUT_OF_DATA_SPACE).  But it gives
a good idea of what the dm target needs to do, and in a succinct manner.

Now dm-thin.c is anything but succinct, for a couple of reasons:

- Because of where we are in the IO stack we cannot allocate memory.
  This means we either use memory preallocated via mempools, or allocate
  a fixed size block before a bio is processed.

- We don't have a magic green threads library that hides the numerous
  blocking operations that we need.  Instead we have low level facilities
  like workqueues etc.  This tends to have the effect of breaking up the logic
  and scattering it across lots of little completion functions.

How we handle blocking, locking, and quiescing IO are all intertwined.
Which is why switching over to the new bio_prison interface is going to
involve an awful lot of churn.

In the upstream code
====================

- Locking

  The locks provided by bio_prison (v1), are all exclusive locks.  As such
  we take pains to hold them for as short a period as possible.  This means
  holding them for the duration of an IO is completely out of the question.
  Nonetheless, as pointed out in the original post for this thread, this
  can cause bad lock contention, especially if the data block size is large.

- Quiescing

  Because we do not hold locks for the lifetime of the bios, we need
  another way of tracking IO and quiescing regions.  This is what the
  deferred_set component does.  Effectively it divides time up into
  bins, and keeps a reference count of how many IOs are still in flight
  for each bin.  To quiesce we grab a lock, and then wait for all bins
  before this lock was acquired to drain.  Advantages of this approach
  is it uses very little memory (I think we're currently running with
  64 bins), and consumes v. little cpu.  But we're never specific about
  which region we're waiting to quiesce, instead always waiting for all
  IO older than a certain point to drain.  So we are certainly introducing
  more latency here.

- Blocking

  A single thread services any operations that could block.
  When there is work for this thread to perform a work queue item
  is queued (do_worker()).  This then examines linked lists of work
  (prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
  processes each list as a batch.  Batching like this is a mixed blessing;
  it allows us to sort incoming bios so we can process bios to the same
  region at the same time, but it is also v. bad for max latency, as we
  have no idea which piece of work was there the longest.

Next iteration of the code
========================== 

- Locking

  bio_prison (v2) provides shared locks, and custom lock levels.  So,
  at the expense of memory, we can hold shared locks for long periods
  that cover the lifetime of the bio.  Acquiring a lock now blocks.

- Quiescing

  Because we hold the locks for long periods we can now do away with the
  deferred set completely.  If you want to quiesce a region, just grab
  the exclusive lock associated with it, when it's finally granted you
  know it's also quiesced.

- Blocking

  I want to move away from the idea of a single worker function that
  has different categories of work stored for it in different lists.
  Instead, storing specific work structs on the work queue for each bio.
  Partly this is to reduce latency by increasing 'fairness'.  But also
  the fact that acquiring a lock now blocks means there are a lot more
  block operations to handle, and we'd just end up with a lot of these
  lists of work.  It would also allow us to have multiple kernel threads
  servicing the workqueue.

  If you look at dm-cache-target.c you'll see this has already been
  done for that target.  We have continuation structs that represent
  the work to be performed after the current blocking op has completed.
  dm-cache uses this for migrations, which have a much simpler state model
  than dm-thin.  Even so there are a lot of these little continuation
  functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
  mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
  mg_success, mg_complete).

Where are we now?
=================

I did a lot of work on this a couple of years ago.  First I just dove
in, trying to code things up by hand.  But it quickly devolved into a
maze of badly named continuation functions, all alike.  It's very hard
to give these functions meaningful names; go through the pseudocode at
the top of this email and for each place where we could block, try to
describe where we are.  The biggest problem is as we introduce more of
these continuations big picture logic receeds and it becomes v. hard to
reason about the code.

I then experimented with automatically generating all the code from a
simpler specification (I used a lispy version of the pseudocode above).
This showed promise and I got it generating kernel code that would
compile.  I was debugging this when I got dragged onto other work,
and this has stagnated since.

So that's where we are.