[dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics

Tue Jan 24 20:39:36 UTC 2012

On Tue 24-01-12 15:13:40, Jeff Moyer wrote:
> Jan Kara <jack at suse.cz> writes:
> 
> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote:
> >> Chris Mason <chris.mason at oracle.com> writes:
> >> 
> >> >> All three filesystems use the generic mpages code for reads, so they
> >> >> all get the same (bad) I/O patterns.  Looks like we need to fix this up
> >> >> ASAP.
> >> >
> >> > Can you easily run btrfs through the same rig?  We don't use mpages and
> >> > I'm curious.
> >> 
> >> The readahead code was to blame, here.  I wonder if we can change the
> >> logic there to not break larger I/Os down into smaller sized ones.
> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os,
> >> when 128KB is the read_ahead_kb value.  Is there any heuristic you could
> >> apply to not break larger I/Os up like this?  Does that make sense?
> >   Well, not breaking up I/Os would be fairly simple as ondemand_readahead()
> > already knows how much we want to read. We just trim the submitted I/O to
> > read_ahead_kb artificially. And that is done so that you don't thrash the page
> > cache (possibly evicting pages you have not yet copied to userspace) when
> > there are several processes doing large reads.
> 
> Do you really think applications issue large reads and then don't use
> the data?  I mean, I've seen some bad programming, so I can believe that
> would be the case.  Still, I'd like to think it doesn't happen.  ;-)
  No, I meant a cache thrashing problem. Suppose that we always read ahead
as much as the user asks for and there are, say, 100 processes each wanting
to read 4 MB.  Then you need to find 400 MB in the page cache so that all
reads can fit.  And if you don't have them, reads for process 50 may evict
pages we already preread for process 1 before process 1 got CPU time to
copy the data to its userspace buffer. So that readahead is wasted.
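
  The arithmetic above can be sketched as follows. This is only a
back-of-the-envelope model, not kernel code: the function names and the
256 MB of available page cache are assumptions made for illustration.

```python
def readahead_demand_mb(num_processes, read_size_mb):
    """Page cache needed if every reader's full request is read ahead at once."""
    return num_processes * read_size_mb


def readers_that_fit(cache_mb, read_size_mb):
    """How many readers' full readahead windows fit before eviction starts."""
    return cache_mb // read_size_mb


# 100 processes, each wanting to read 4 MB (the numbers from the example above).
demand = readahead_demand_mb(100, 4)

# Assumed for illustration: only 256 MB of page cache is available.
fit = readers_that_fit(256, 4)

print(f"demand: {demand} MB, readers that fit in cache: {fit}")
```

  With these assumed numbers the demand exceeds the cache, so pages preread
for the earliest readers are candidates for eviction before they are copied
out.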

> > Maybe 128 KB is too small a default these days, but OTOH no one prevents you
> > from raising it (e.g. SLES uses 1 MB as a default).
> 
> For some reason, I thought it had been bumped to 512KB by default.  Must
> be that overactive imagination I have...  Anyway, if all of the distros
> start bumping the default, don't you think it's time to consider bumping
> it upstream, too?  I thought there was a lot of work put into not being
> too aggressive on readahead, so the downside of having a larger
> read_ahead_kb setting was fairly small.
  Yeah, I believe 512KB should be pretty safe these days except in the
embedded world. OTOH the average desktop user doesn't really care, so it's
mostly servers with beefy storage that care... (note that I wrote that we
raised read_ahead_kb for SLES but not for openSUSE or SLED (the enterprise
desktop distro)).
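
  For reference, the per-device value is exposed in sysfs as
/sys/block/<dev>/queue/read_ahead_kb. A minimal sketch for inspecting it,
assuming a Linux system; the helper name and the sysfs_root parameter are
hypothetical and exist only to keep the example self-contained:

```python
from pathlib import Path


def read_ahead_kb(device, sysfs_root="/sys/block"):
    """Return the readahead limit in KB for a block device such as 'sda'.

    sysfs_root is an illustrative parameter so the helper can be pointed
    at a different directory tree; on a real system leave it at the default.
    """
    path = Path(sysfs_root) / device / "queue" / "read_ahead_kb"
    return int(path.read_text().strip())
```

  For example, read_ahead_kb("sda") reports the current limit for a device
named sda (the device name is just an example). Raising the value means
writing a new number to the same file as root.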

								Honza
-- 
Jan Kara <jack at suse.cz>
SUSE Labs, CR