[dm-devel] dm-thin: Several Questions on dm-thin performance.
Joe Thornber
thornber at redhat.com
Wed Dec 4 12:11:06 UTC 2019
(These notes are for my own benefit as much as anything, I haven't
worked on this for a couple of years and will forget it all completely
if I don't write it down somewhere).
Let's start by writing some pseudocode for what the remap function for
thin provisioning actually does.
----------------------------------------------------------
-- Metadata
newtype ThinId = Int
data Bio = Bio
newtype VBlock = Integer -- virtual block
newtype DBlock = Integer -- data block
data LookupResult =
    Unprovisioned |
    Provisioned { lrDBlock :: DBlock,
                  lrShared :: Bool
                }
metadataLookup :: ThinId -> VBlock -> Task LookupResult
metadataLookup = undefined
metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
metadataInsert = undefined
metadataRemove :: ThinId -> VBlock -> Task ()
metadataRemove = undefined
-- Blocks all other metadata operations while running
metadataCommit :: Task ()
metadataCommit = undefined
-- Data block allocation; the result is pattern matched in
-- provision/breakSharing below
data AllocResult = NoDataSpace | Allocated DBlock
allocateBlock :: Task AllocResult
allocateBlock = undefined
----------------------------------------------------------
-- Tasks
-- Many steps of servicing a bio can block. eg, taking a lock,
-- reading metadata, updating metadata, zeroing data, copying
-- data ...
-- So we completely sidestep the issue in this pseudocode by
-- running everything in a magic lightweight thread.
spawn :: Task () -> IO ()
spawn = undefined
----------------------------------------------------------
-- Locking
-- These 'with' primitives acquire a lock (can block of course), perform
-- an action, and then automatically release the lock.
-- A shared lock can be upgraded, so we need to pass the lock into
-- the action body.
withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
withSharedLock thinId vblock actionFn = undefined
withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
withExclusiveLock thinId vblock action = undefined
-- This promotes a shared lock to exclusive.
withUpgradedLock :: Lock -> Task () -> Task ()
withUpgradedLock lock action = undefined
-- Data locks are always exclusive
withDataLock :: DBlock -> Task () -> Task ()
withDataLock dblock action = undefined
----------------------------------------------------------
-- Top level remap function. Kicks off a green thread for each bio.
-- How we handle a bio depends on whether it's a read, write, discard
-- or flush bio; whether the block is already provisioned; and, if so,
-- whether it is shared between snapshots.
remap :: ThinId -> Bio -> IO ()
remap thinId bio = spawn $ remapFn thinId bio vblock
  where
    vblock = virtBlock bio
    remapFn = case classifyBio bio of
        ReadBio    -> remapRead
        WriteBio   -> remapWrite
        DiscardBio -> remapDiscard
        FlushBio   -> remapFlush
----------------------------------------------------------
remapRead :: ThinId -> Bio -> VBlock -> Task ()
remapRead thinId bio vblock = do
    withSharedLock thinId vblock $ \_ -> do
        lr <- metadataLookup thinId vblock
        case lr of
            -- Read, Unprovisioned, Shared/!Shared
            Unprovisioned -> do
                fillWithZeroes bio
                complete bio Success

            -- Read, Provisioned, Shared/!Shared
            (Provisioned dblock _) ->
                remapAndWait bio dblock
----------------------------------------------------------
remapWrite :: ThinId -> Bio -> VBlock -> Task ()
remapWrite thinId bio vblock = do
    withSharedLock thinId vblock $ \lock -> do
        lr <- metadataLookup thinId vblock
        case lr of
            -- Write, Unprovisioned
            Unprovisioned ->
                withUpgradedLock lock $
                    provision thinId bio vblock

            -- Write, Provisioned, !Shared
            (Provisioned dblock False) ->
                remapAndWait bio dblock

            -- Write, Provisioned, Shared
            (Provisioned dblock True) ->
                withUpgradedLock lock $
                    breakSharing thinId bio vblock dblock
breakSharing :: ThinId -> Bio -> VBlock -> DBlock -> Task ()
breakSharing thinId bio vblock dblockOld = do
    ab <- allocateBlock
    case ab of
        NoDataSpace ->
            complete bio Failure

        (Allocated dblockNew) ->
            -- we grab data locks to avoid races with discard
            withDataLock dblockOld $
                withDataLock dblockNew $ do
                    copy dblockOld dblockNew
                    metadataInsert thinId vblock dblockNew
                    remapAndWait bio dblockNew
provision :: ThinId -> Bio -> VBlock -> Task ()
provision thinId bio vblock = do
    ab <- allocateBlock
    case ab of
        NoDataSpace ->
            complete bio Failure

        (Allocated dblock) ->
            withDataLock dblock $ do
                metadataInsert thinId vblock dblock
                remapAndWait bio dblock
----------------------------------------------------------
remapDiscard :: ThinId -> Bio -> VBlock -> Task ()
remapDiscard thinId bio vblock = do
    withExclusiveLock thinId vblock $ do
        lr <- metadataLookup thinId vblock
        case lr of
            -- Discard, Unprovisioned
            Unprovisioned ->
                complete bio Success

            -- Discard, Provisioned, !Shared
            (Provisioned dblock False) ->
                withDataLock dblock $ do
                    remapAndWait bio dblock -- passdown
                    metadataRemove thinId vblock

            -- Discard, Provisioned, Shared
            (Provisioned dblock True) ->
                withDataLock dblock $ do
                    metadataRemove thinId vblock
                    complete bio Success
----------------------------------------------------------
remapFlush :: ThinId -> Bio -> VBlock -> Task ()
remapFlush _ bio _ = do
    metadataCommit
    complete bio Success
----------------------------------------------------------
remapAndWait :: Bio -> DBlock -> Task ()
remapAndWait bio dblock = do
    redirectBio bio dblock -- point the bio at the mapped data block
    issue bio
    wait bio
The above is a simplification (eg, discards can cover more than a single
block, and the pool has multiple modes like OUT_OF_DATA_SPACE). But it
gives a good idea of what the dm target needs to do, and does so succinctly.
Now dm-thin.c is anything but succinct, for a couple of reasons:
- Because of where we are in the IO stack we cannot allocate memory.
This means we either use memory preallocated via mempools, or allocate
a fixed size block before a bio is processed.
- We don't have a magic green-threads library that hides the numerous
blocking operations we need. Instead we have low-level facilities
like workqueues etc. This tends to break the logic up and scatter it
across lots of little completion functions.
How we handle blocking, locking, and quiescing IO are all intertwined.
Which is why switching over to the new bio_prison interface is going to
involve an awful lot of churn.
In the upstream code
====================
- Locking
The locks provided by bio_prison (v1) are all exclusive. As such,
we take pains to hold them for as short a period as possible, which means
holding them for the duration of an IO is completely out of the question.
Nonetheless, as pointed out in the original post in this thread, this
can cause bad lock contention, especially if the data block size is large.
- Quiescing
Because we do not hold locks for the lifetime of the bios, we need
another way of tracking IO and quiescing regions. This is what the
deferred_set component does. Effectively it divides time up into
bins, and keeps a reference count of how many IOs are still in flight
for each bin. To quiesce we grab a lock, and then wait for all bins
before this lock was acquired to drain. The advantages of this approach
are that it uses very little memory (I think we're currently running with
64 bins) and consumes very little cpu. But we're never specific about
which region we're waiting to quiesce; instead we always wait for all
IO older than a certain point to drain. So we are certainly introducing
more latency here.
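In the same pseudocode style, the deferred_set interface looks roughly
like this (the names are illustrative, not the kernel's):

data DeferredSet = DeferredSet
newtype Bin = Bin Int

-- Entering bumps the refcount of the current bin; the Bin is stashed
-- in the bio's context and dropped when the bio completes.
dsEnter :: DeferredSet -> Task Bin
dsEnter = undefined

dsExit :: DeferredSet -> Bin -> Task ()
dsExit = undefined

-- Quiesce: note the current bin, then wait for every older bin's
-- refcount to reach zero.  Note we never say *which* region we are
-- waiting for, just "all IO older than now".
dsQuiesce :: DeferredSet -> Task ()
dsQuiesce = undefined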
- Blocking
A single thread services any operation that could block.
When there is work for this thread to perform, a workqueue item
is queued (do_worker()). This then examines linked lists of work
(prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
processes each list as a batch. Batching like this is a mixed blessing:
it allows us to sort incoming bios so we can process bios to the same
region at the same time, but it is also very bad for max latency, as we
have no idea which piece of work has been there the longest.
Next iteration of the code
==========================
- Locking
bio_prison (v2) provides shared locks, and custom lock levels. So,
at the expense of memory, we can hold shared locks for long periods
that cover the lifetime of the bio. Acquiring a lock now blocks.
- Quiescing
Because we hold the locks for long periods we can now do away with the
deferred_set completely. If you want to quiesce a region, just grab
the exclusive lock associated with it; once it is finally granted, you
know the region is also quiesced.
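In the pseudocode at the top, quiescing then collapses to a one-liner
(a sketch, using the locking primitives defined earlier):

-- With shared locks held for a bio's whole lifetime, quiescing a
-- region is just taking its exclusive lock: once granted, no shared
-- holders (ie, no in-flight bios) remain.
quiesce :: ThinId -> VBlock -> Task ()
quiesce thinId vblock = withExclusiveLock thinId vblock (return ())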
- Blocking
I want to move away from the idea of a single worker function that
has different categories of work stored for it in different lists.
Instead, we store a specific work struct on the workqueue for each bio.
Partly this is to reduce latency by increasing 'fairness'. But also,
the fact that acquiring a lock now blocks means there are many more
blocking operations to handle, and we'd just end up with a lot of these
lists of work. It would also allow us to have multiple kernel threads
servicing the workqueue.
If you look at dm-cache-target.c you'll see this has already been
done for that target. We have continuation structs that represent
the work to be performed after the current blocking op has completed.
dm-cache uses this for migrations, which have a much simpler state model
than dm-thin. Even so there are a lot of these little continuation
functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
mg_success, mg_complete).
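For a flavour of the problem, here is the Write/Provisioned/Shared path
from the pseudocode above broken at each blocking point into continuation
stubs (the names are invented for illustration; they are not real dm-thin
or dm-cache functions):

-- Each function runs when the previous blocking op completes.
wsStart          = undefined -- request the shared lock
wsLockGranted    = undefined -- metadataLookup; dispatch on the result
wsUpgradeGranted = undefined -- allocate a block, take both data locks
wsCopyComplete   = undefined -- metadataInsert the new mapping
wsInsertDone     = undefined -- remap, issue and wait for the bio
wsBioComplete    = undefined -- drop the locks, complete the original bio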
Where are we now?
=================
I did a lot of work on this a couple of years ago. First I just dove
in, trying to code things up by hand. But it quickly devolved into a
maze of badly named continuation functions, all alike. It's very hard
to give these functions meaningful names; go through the pseudocode at
the top of this email and for each place where we could block, try to
describe where we are. The biggest problem is that as we introduce more
of these continuations the big picture logic recedes, and it becomes
very hard to reason about the code.
I then experimented with automatically generating all the code from a
simpler specification (I used a lispy version of the pseudocode above).
This showed promise and I got it generating kernel code that would
compile. I was debugging this when I got dragged onto other work,
and this has stagnated since.
So that's where we are.