[dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics

Fri Feb 3 12:55:43 UTC 2012

On Wed, Jan 25, 2012 at 04:40:23PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote:
> > > If the reason for not setting a larger readahead value is just that it
> > > might increase memory pressure and thus decrease performance, is it
> > > possible to use a suitable metric from the VM in order to set the value
> > > automatically according to circumstances?
> > > 
> > 
> > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead?
> > 
> > > Steve.
> > 
> > Chetan Loke
> 
> I'd been wondering about something similar to that. The basic scheme
> would be:
> 
>  - Set a page flag when readahead is performed
>  - Clear the flag when the page is read (or on page fault for mmap)
> (i.e. when it is first used after readahead)
> 
> Then when the VM scans for pages to eject from cache, check the flag and
> keep an exponential average (probably on a per-cpu basis) of the rate at
> which such flagged pages are ejected. That number can then be used to
> reduce the max readahead value.
> 
> The questions are whether this would provide a fast enough reduction in
> readahead size to avoid problems? and whether the extra complication is
> worth it compared with using an overall metric for memory pressure?
> 
> There may well be better solutions though,
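
For reference, the global scheme quoted above could be sketched roughly
as below. This is only a toy model under my own assumptions: the class
and method names are made up, the "rate" is taken to be the fraction of
evicted pages that still carry the readahead flag, the limit is scaled
linearly, and the per-cpu part is left out.

```python
class GlobalReadaheadFeedback:
    """Toy model of a global feedback loop (hypothetical names)."""

    def __init__(self, max_ra_pages=256, min_ra_pages=4, alpha=0.25):
        self.max_ra_pages = max_ra_pages
        self.min_ra_pages = min_ra_pages
        self.alpha = alpha          # weight of the newest sample
        self.wasted_avg = 0.0       # exponential average, in [0, 1]

    def on_reclaim_scan(self, evicted, evicted_unused_ra):
        """Feed one reclaim pass: pages evicted, and how many of them
        were still flagged as readahead (never read)."""
        if evicted == 0:
            return
        sample = evicted_unused_ra / evicted
        self.wasted_avg += self.alpha * (sample - self.wasted_avg)

    def ra_limit(self):
        """Assumed policy: shrink the max readahead linearly with waste."""
        scaled = int(self.max_ra_pages * (1.0 - self.wasted_avg))
        return max(self.min_ra_pages, scaled)
```

Note that there is a single limit for all streams here, which is what
the rest of this mail argues against.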

The caveat is that on a consistently thrashing machine, the readahead
size is better determined per read stream.

Repeated readahead thrashing typically happens on a file server with a
large number of concurrent clients. For example, if there are 1000
read streams each doing 1MB readahead, and each stream has 2 readahead
windows, there could be up to 2GB of readahead pages, which are sure to
be thrashed on a server with only 1GB of memory.
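
The arithmetic behind that example, as a quick sketch (the variable
names are only for illustration, the numbers are the ones above):

```python
MB = 1 << 20
GB = 1 << 30

streams = 1000            # concurrent read streams
ra_size = 1 * MB          # readahead size of one window
windows_per_stream = 2    # two readahead windows per stream

# Worst case: every window of every stream is in flight at once.
readahead_footprint = streams * windows_per_stream * ra_size
memory = 1 * GB

print(f"readahead footprint: {readahead_footprint / GB:.2f} GB")
print(f"footprint / memory:  {readahead_footprint / memory:.2f}x")
```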

Typically the 1000 clients will have different read speeds. A few of
them will be doing 1MB/s, while most others may be doing 100KB/s. In
this case, we should only decrease the readahead size for the 100KB/s
clients. The 1MB/s clients actually won't see readahead thrashing at
all, and we'll want them to do large 1MB I/O to achieve good disk
utilization.
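
A crude model shows why only the slow streams thrash. The assumptions
here are mine and only for illustration: "a few" is taken to be 10 fast
clients, all of the 1GB is page cache, and a page survives for about
memory / aggregate read rate seconds before it is evicted.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

memory = 1 * GB
ra_size = 1 * MB
windows = 2

# Assumed client mix: read speed of each stream in bytes per second.
clients = [1 * MB] * 10 + [100 * KB] * 990

# Crude model: the cache turns over at the aggregate read rate.
aggregate_rate = sum(clients)
page_lifetime = memory / aggregate_rate     # seconds

def thrashes(rate):
    """True if the stream needs longer than page_lifetime to consume
    its readahead windows, i.e. pages are evicted before being read."""
    return windows * ra_size / rate > page_lifetime

def safe_ra_size(rate):
    """Largest window size (capped at ra_size) consumed in time."""
    return min(ra_size, int(rate * page_lifetime / windows))

for rate in (1 * MB, 100 * KB):
    print(rate // KB, "KB/s:", "thrashes" if thrashes(rate) else "ok",
          "-> safe size", safe_ra_size(rate) // KB, "KB")
```

Under these assumptions the 1MB/s streams keep their full 1MB window,
while the 100KB/s streams need a smaller one.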

So we need something better than the "global feedback" scheme, and we
do have such a solution ;)  As said in my other email, the number of
history pages remaining in the page cache is a good estimate of that
particular read stream's thrashing-safe readahead size.
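
To illustrate the idea (a toy model, not the kernel code): treat the
page cache of one file as a set of cached page indexes and count how
many consecutive pages right before the current read offset are still
cached. The function names and the min/max limits are made up for this
sketch.

```python
def history_pages(cached, offset, max_pages):
    """Count consecutive cached pages immediately before `offset`,
    looking back at most `max_pages` pages."""
    count = 0
    while count < max_pages and (offset - 1 - count) in cached:
        count += 1
    return count

def thrashing_safe_ra_size(cached, offset, max_pages=256, min_pages=4):
    """Use the surviving history as the readahead size, clamped to
    [min_pages, max_pages] (both limits are assumptions)."""
    return max(min_pages, history_pages(cached, offset, max_pages))

# Fast stream: everything it has read so far is still cached.
fast = set(range(0, 1000))
# Slow stream: only the last 40 pages survived reclaim.
slow = set(range(960, 1000))

print(thrashing_safe_ra_size(fast, 1000))
print(thrashing_safe_ra_size(slow, 1000))
```

The slow stream gets a small window because its older pages were
already reclaimed, and the fast stream keeps the full size, with no
global state involved.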

Thanks,
Fengguang