[dm-devel] To add, or not to add, a bio REQ_ROTATIONAL flag

Mon Aug 1 02:58:50 UTC 2016

[+cc from "Enable use of Solid State Hybrid Drives"
	https://lkml.org/lkml/2014/10/29/698 ]

On Thu, 28 Jul 2016, Martin K. Petersen wrote:
> >>>>> "Eric" == Eric Wheeler <bcache at lists.ewheeler.net> writes:
> Eric> [...]  This may imply that
> Eric> we need a new way to flag cache bypass from userspace [...]
> Eric> So what are our options?  What might be the best way to do this?
[...] 
> Eric> Are FADV_NOREUSE/FADV_DONTNEED reasonable candidates?
> 
> FADV_DONTNEED was intended for this. There have been patches posted in
> the past that tied the loop between the fadvise flags and the bio. I
> would like to see those revived.

That sounds like a good start.  This thread from 2014 looks about right:
	https://lkml.org/lkml/2014/10/29/698
	https://lwn.net/Articles/619058/

I read through the thread and have summarized the relevant parts here 
with additional commentary below the summary:

/* Summary 

In 2014 they were seeking to do basically the same thing we want with 
stacked block caching drivers today: hint to the IO layer so the (ATA 3.2) 
driver can decide whether a block should hit the cache or the spinning 
disk.  This was done by adding bitflags to ioprio for IOPRIO_ADV_ advice.

There are two arguments throughout the thread: one that the cache hint 
should be per-process (ionice) and the other, that hints should be per 
inode via fadvise (and maybe madvise).  Dan Williams noted with respect to 
fadvise for their implementation that "It's straightforward to add, but I 
think "80%" of the benefit can be had by just having a per-thread cache 
priority."

Kapil Karkra extended the page flags so the ioprio advice bits can be 
copied into bio->bi_rw, to which Jens said "is a bit...icky. I see why 
it's done, though, it requires the least amount of plumbing."

Martin K. Petersen provides a matrix of desires for actual use cases here:
	https://lkml.org/lkml/2014/10/29/1014 
and asks "Are there actually people asking for sub-file granularity? I 
didn't get any requests for that in the survey I did this summer. [...] In 
any case I thought it was interesting that pretty much every use case that 
people came up with could be adequately described by a handful of I/O 
classes."

Further, Jens notes that "I think we've needed a proper API for passing in 
appropriate hints on a per-io basis for a LONG time. [...] We've tried 
(and failed) in the past to define a set of hints that make sense. It'd be 
a shame to add something that's specific to a given transport/technology. 
That said, this set of hints do seem pretty basic and would not 
necessarily be a bad place to start. But they are still very specific to 
this use case."
*/

So, perhaps it is time to plan the hint API and figure out how to plumb 
it.  These are some design considerations based on the thread:

a. People want per-process cache hinting (ionice, or some other tool).
b. Per inode+range hinting would be useful to some (fadvise, ioctl, etc)
c. Don't use page flags to convey cache hints---or find a clean way to do so.
d. Per IO hints would be useful to both stacking and hardware drivers.
e. Cache layers will implement their own device assignment choice based 
on the caching hint; for example, an IO flagged to miss the cache might 
hit if already in cache due to unrelated IO and such a determination would 
be made per-cache-implementation.

I can see this go two ways:

1. A dedicated implementation for cache hinting.
2. An API for generalized hinting, upon which cache hinting may be 
implemented.

To consider #2, what hinting is wanted from processes and inodes down to 
bio's?  Does it justify an entire API for generalized hinting, or do we 
just need a cache hinting implementation?  If we do want #2, then what are 
all of the features wanted by the community so it can be designed as such?

If #1 is sufficient, then what is the preferred mechanism and 
implementation for cache hinting?

In either direction, how can those hints pass down to bio's in an 
appropriate way (ie, not page flags)?

In the interest of a cache hinting implementation independent of 
transport/technology, I have been playing with an idea to use two per-IO 
"TTL" counters, both of which tend toward zero; I've not yet started an 
implementation:

cacheskip: 
	Decrement by one at each cache layer until zero.  While nonzero, the 
	layer is skipped (slow medium).
	Ignore cachedepth until cacheskip==0.

cachedepth:
	Initialize to positive, negative, or zero value.  Once zero, no 
	special treatment is given to the IO.  When less than zero, prefer the 
	slower medium.  When greater than zero, prefer the faster medium.  
	Inc/decrement toward zero each time the IO passes through a 
	caching layer.
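
A minimal sketch of how one caching layer might consume the two counters 
as described above.  Nothing here exists in the kernel: struct cache_hint, 
enum cache_action, and cache_hint_consume() are hypothetical names used 
only for illustration, and where the counters would live (bio, ioprio, or 
elsewhere) is left open.

```c
/* Hypothetical per-IO hint; neither the struct nor its fields exist today. */
struct cache_hint {
	unsigned int cacheskip;	/* cache layers still to be skipped */
	int cachedepth;		/* <0 prefer slow, >0 prefer fast, 0 no hint */
};

/* What the hint asks of the layer the IO is currently passing through. */
enum cache_action {
	CACHE_SKIP,		/* cacheskip was nonzero: bypass this layer */
	CACHE_PREFER_SLOW,	/* cachedepth was negative */
	CACHE_PREFER_FAST,	/* cachedepth was positive */
	CACHE_NO_HINT,		/* both counters are zero: layer decides */
};

/*
 * Called once per caching layer.  cacheskip is consumed first and
 * cachedepth is ignored until cacheskip reaches zero; after that,
 * cachedepth moves one step toward zero per layer.
 */
enum cache_action cache_hint_consume(struct cache_hint *h)
{
	if (h->cacheskip > 0) {
		h->cacheskip--;
		return CACHE_SKIP;
	}
	if (h->cachedepth < 0) {
		h->cachedepth++;
		return CACHE_PREFER_SLOW;
	}
	if (h->cachedepth > 0) {
		h->cachedepth--;
		return CACHE_PREFER_FAST;
	}
	return CACHE_NO_HINT;
}
```

With cacheskip=2 and cachedepth=1, successive calls would return skip, 
skip, prefer-fast, and then no-hint for every layer after that.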

Independent of how we might apply these counters to a pid/inode, the cache 
layers might look something like this:

cachedepth	description
  0		direct IO
+-1		pagecache
+-2		some arbitrary
+-3		caching
+-4		driver
+-n		...

Layers beyond the pagecache are assigned arbitrarily by the driver 
stacking order implemented by the end user. For example, if passing 
through dm-cache, then dm-cache would use its own preference logic to 
decide whether it should cache or not if cachedepth is zero.  If nonzero, 
it would cache/bypass appropriately and then inc/decrement cachedepth 
toward zero after making its decision.  Understandably, extenuating 
circumstances may require a layer to ignore the hint---such as a 
bypass-hinted IO that gets cached because it is already hot.
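
As a rough illustration of the paragraph above, a layer's decision might 
combine the hint with its own state roughly as follows.  This is only a 
sketch under assumptions: the enum, the function name, and the two boolean 
inputs are hypothetical, and a real layer such as dm-cache or bcache would 
apply its own policy in place of the simple rule shown here.

```c
#include <stdbool.h>

/* Hypothetical view of the hint as seen by a single layer. */
enum cache_hint_kind {
	HINT_NONE,		/* cachedepth == 0: no special treatment */
	HINT_PREFER_SLOW,	/* cachedepth < 0: bypass requested */
	HINT_PREFER_FAST,	/* cachedepth > 0: caching requested */
};

/*
 * Returns true if the layer should use its fast device for this IO.
 * already_cached:     the block is already in this layer's cache.
 * policy_wants_cache: what the layer's own logic would choose.
 */
bool layer_use_cache(enum cache_hint_kind hint, bool already_cached,
		     bool policy_wants_cache)
{
	/* Extenuating circumstance: a hot block is served from cache
	 * even if the IO was hinted to bypass. */
	if (already_cached)
		return true;

	switch (hint) {
	case HINT_PREFER_FAST:
		return true;
	case HINT_PREFER_SLOW:
		return false;
	default:
		/* No hint: fall back to the layer's own preference. */
		return policy_wants_cache;
	}
}
```

The hint stays advisory in this sketch: the already-cached check runs 
before the hint is even looked at.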

Consider the following scenarios for this contrived cache stack:

1. pagecache
2. dm-cache
3. bcache
4. HBA supporting cache hints (ATA 3.2, perhaps)

cacheskip	cachedepth	description
-------------------------------------------
	0		0	use pagecache; lower layers do what they want
	1		0	skip pagecache (direct IO); lower layers do what they want
	0		-1	same as previous
	2		1	skip pagecache, dm-cache; prefer bcache-ssd
	0		-3	skip pagecache; dm-cache bypass; bcache bypass
	1		2	skip pagecache; prefer dm-cache-ssd, prefer bcache-ssd
	3		1	hint to prefer HBA cache only
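
The rows above can be checked mechanically by walking a hint through the 
four-layer stack.  This is a sketch only: all names (struct cache_hint, 
cache_hint_consume(), walk_stack(), print_walk()) are hypothetical, and it 
treats "prefer slow" at the pagecache as equivalent to skipping it, which 
is how I read the table.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-IO hint carrying the two counters. */
struct cache_hint {
	unsigned int cacheskip;
	int cachedepth;
};

enum cache_action {
	CACHE_SKIP, CACHE_PREFER_SLOW, CACHE_PREFER_FAST, CACHE_NO_HINT,
};

/* One layer's worth of processing: cacheskip first, then cachedepth. */
enum cache_action cache_hint_consume(struct cache_hint *h)
{
	if (h->cacheskip > 0) {
		h->cacheskip--;
		return CACHE_SKIP;
	}
	if (h->cachedepth < 0) {
		h->cachedepth++;
		return CACHE_PREFER_SLOW;
	}
	if (h->cachedepth > 0) {
		h->cachedepth--;
		return CACHE_PREFER_FAST;
	}
	return CACHE_NO_HINT;
}

/* The contrived stack from the list above, top to bottom. */
#define NLAYERS 4
const char *const layer_names[NLAYERS] = {
	"pagecache", "dm-cache", "bcache", "HBA",
};

/* Record what each layer is asked to do; h is a private copy. */
void walk_stack(struct cache_hint h, enum cache_action out[NLAYERS])
{
	for (size_t i = 0; i < NLAYERS; i++)
		out[i] = cache_hint_consume(&h);
}

/* Print one line per layer for a given starting hint. */
void print_walk(struct cache_hint h)
{
	static const char *const action_names[] = {
		"skip", "prefer slow", "prefer fast", "no hint",
	};
	enum cache_action a[NLAYERS];

	walk_stack(h, a);
	for (size_t i = 0; i < NLAYERS; i++)
		printf("%-9s %s\n", layer_names[i], action_names[a[i]]);
}
```

For example, print_walk((struct cache_hint){ 3, 1 }) reports skip for the 
pagecache, dm-cache, and bcache, and prefer fast for the HBA, matching the 
last row of the table.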

This would empower the user to decide where caching should begin, and for 
how many layers caching should hint for slow(-) or fast(+) backing devices 
before letting the IO stack make its own hintless choice.  Hopefully this 
lets each layer make the choice that best fits its implementation.

Note that this would not support multi-device tiering as written.  If some 
layer supports multiple IO performance tiers (more than 2) at the same 
layer, then this hinting algorithm is insufficient unless a 
cache-layer-specific datastructure could be passed with the IO hinting 
request.  Also, an eviction hint is not supported by this model.

Please comment with your thoughts.  I look forward to feedback and 
implementation ideas for what would be the best way to plumb cache hinting 
for whatever implementation is chosen.

--
Eric Wheeler