[dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics

Chris Mason chris.mason at oracle.com
Wed Jan 25 20:06:13 UTC 2012


On Wed, Jan 25, 2012 at 12:37:48PM -0600, James Bottomley wrote:
> On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote:
> > > So there are two separate problems mentioned here.  The first is to
> > > ensure that readahead (RA) pages are treated as more disposable than
> > > accessed pages under memory pressure and then to derive a statistic for
> > > futile RA (those pages that were read in but never accessed).
> > > 
> > > The first sounds really like its an LRU thing rather than adding yet
> > > another page flag.  We need a position in the LRU list for never
> > > accessed ... that way they're first to be evicted as memory pressure
> > > rises.
> > > 
> > > The second is you can derive this futile readahead statistic from the
> > > LRU position of unaccessed pages ... you could keep this globally.
> > > 
> > > Now the problem: if you evict all unaccessed RA pages first, you
> > > end up with the situation of, say, playing a movie under moderate
> > > memory pressure: we do RA, then evict the RA pages before they're
> > > used, then have to re-read them to display to the user, resulting
> > > in an undesirable uptick in read I/O.
> > > 
> > > Based on the above, it sounds like a better heuristic would be to
> > > evict accessed clean pages at the top of the LRU list before
> > > unaccessed clean pages, because the expectation is that the
> > > unaccessed clean pages will be accessed (that's, after all, why we
> > > did the readahead).  As RA pages age
> > 
> > Well, the movie example is one case where evicting unaccessed pages
> > may not be the right thing to do. But what about a workload that
> > performs a random one-shot search? The search is done and the RA'd
> > blocks are of no use anymore. So it seems a solution for one would
> > hurt the other.
> 
> Well, not really: RA is always wrong for random reads.  The whole
> premise of RA is an assumption of sequential access patterns.
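
Whichever way the eviction priority falls, the mechanics James
describes (a distinct LRU position for never-accessed RA pages, plus a
counter bumped whenever one is reclaimed untouched) could look roughly
like the toy userspace model below. This is a sketch, not the actual
mm code; every structure and name in it is hypothetical:

#include <stdbool.h>
#include <stdio.h>

struct page {
	bool from_ra;		/* filled in by readahead */
	bool accessed;		/* touched by userland since the read */
	struct page *next;
};

static struct page *ra_list;	/* never-accessed RA pages */
static struct page *lru_list;	/* everything else */
static unsigned long futile_ra;	/* pages read in but never used */

static void page_add(struct page *p)
{
	struct page **head =
		(p->from_ra && !p->accessed) ? &ra_list : &lru_list;

	p->next = *head;
	*head = p;
}

static struct page *evict_one(void)
{
	/* drain the never-accessed RA pages before touching the rest */
	struct page **head = ra_list ? &ra_list : &lru_list;
	struct page *victim = *head;

	if (!victim)
		return NULL;
	*head = victim->next;
	if (victim->from_ra && !victim->accessed)
		futile_ra++;	/* the futile-RA statistic */
	return victim;
}

int main(void)
{
	struct page pages[4] = {
		{ .from_ra = true },
		{ .from_ra = true, .accessed = true },
		{ .from_ra = false },
		{ .from_ra = true },
	};
	int i;

	for (i = 0; i < 4; i++)
		page_add(&pages[i]);
	while (evict_one())
		;
	printf("futile RA pages: %lu\n", futile_ra);	/* prints 2 */
	return 0;
}

(A real implementation would keep true LRU ordering within each list
and could flip which list gets scanned first, per James's later
suggestion; the point is only where the futile-RA counter gets bumped.)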

Just to jump back, Jeff's benchmark that started this (on xfs and ext4):

	- buffered 1MB reads get down to the scheduler in 128KB chunks

The really hard part about readahead is that you don't know what
userland wants.  In Jeff's test, he's telling the kernel he wants 1MB
I/Os and our RA engine is doing 128KB I/Os.
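
For anyone who wants to see this locally, here's a minimal sketch of
the buffered side of Jeff's test (TEST_FILE is a placeholder path):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char *buf = malloc(1 << 20);		/* 1MB application buffer */
	int fd = open("TEST_FILE", O_RDONLY);	/* placeholder path */
	ssize_t n;

	if (!buf || fd < 0)
		return 1;
	/* 1MB syscalls, but the RA engine decides what the scheduler sees */
	while ((n = read(fd, buf, 1 << 20)) > 0)
		;
	close(fd);
	free(buf);
	return 0;
}

With the default /sys/block/<dev>/queue/read_ahead_kb of 128, blktrace
should show the requests going down in 128KB chunks, matching Jeff's
numbers.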

We can talk about scaling up how big the RA windows get on their own,
but if userland asks for 1MB, we don't have to worry about futile RA; we
just have to make sure we don't OOM the box trying to honor 1MB reads
from 5000 different procs.
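
(For completeness: the one way userland can already get its 1MB reads
to the scheduler intact is O_DIRECT, which bypasses the RA engine
entirely at the cost of doing its own alignment and caching. A sketch,
again with a placeholder path:

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("TEST_FILE", O_RDONLY | O_DIRECT);
	ssize_t n;

	if (fd < 0 || posix_memalign(&buf, 4096, 1 << 20))
		return 1;
	/* each read is submitted as a 1MB I/O, subject only to device
	 * limits like max_sectors_kb, with no RA engine in the path */
	while ((n = read(fd, buf, 1 << 20)) > 0)
		;
	close(fd);
	free(buf);
	return 0;
}

On the buffered side, posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL)
is the existing hint for a bigger RA window, and Linux doubles it in
response, but that still won't turn Jeff's 1MB reads into 1MB I/Os as
things stand.)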

-chris



