[dm-devel] Proposal for annotating _unstable_ pages

Darrick J. Wong darrick.wong at oracle.com
Fri May 22 18:17:59 UTC 2015


On Thu, May 21, 2015 at 09:21:12PM +0200, Jan Kara wrote:
> On Thu 21-05-15 11:09:55, Kent Overstreet wrote:
> > On Thu, May 21, 2015 at 06:54:53PM +0200, Jan Kara wrote:
> > > On Wed 20-05-15 18:04:40, Kent Overstreet wrote:
> > > > > Yeah.  I never figured out a sane way to migrate pages and keep everything
> > > > > else happy.  Daniel Phillips is having a go at page forking for tux3; let's
> > > > > see if the questions about that get resolved.
> > > > 
> > > > That would be great, we need something.
> > > > 
> > > > I'd also be really curious what btrfs is doing today - is it just bouncing
> > > > everything internally, or did they come up with something more clever?
> > > 
> > > Btrfs is just waiting for IO to complete.
> > > 
> > > > > > Also, there's probably always going to be situations where we're reading or
> > > > > > writing to pages user space can stomp on (dio) - IMO we need to add a bio flag
> > > > > > to annotate this - "if you need this to be stable you have to bounce it".
> > > > > > Otherwise either filesystems/block drivers are going to be stuck bouncing
> > > > > > everything, or it'll just (continue to be) buggy.
> > > > > 
> > > > > Well, for now there's BIO_SNAP_STABLE that forces the block layer to bounce it,
> > > > > but right now ext3 is the last user of it, and afaict btrfs is the only other
> > > > > FS that takes care of stable pages on its own.
> > > > 
> > > > I have no idea what BIO_SNAP_STABLE was supposed to be for, but I don't see how
> > > > it's useful for anything sane.
> > > 
> > > It's for the case where the lower layer requests stable pages but the
> > > upper layer isn't able to provide them (as is the case with ext3). The
> > > block layer then bounces the data for the caller.
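
(For reference, a minimal sketch of how that works with the 3.x-era
interfaces, heavily simplified from the real jbd and mm/bounce.c code: the
submitter flags the bio, and blk_queue_bounce() snapshots the data into
bounce pages before any driver sees it.)

    /* Upper layer (e.g. jbd on behalf of ext3): it cannot keep the page
     * stable itself, so it asks the block layer to snapshot the data at
     * submission time. */
    bio->bi_flags |= 1 << BIO_SNAP_STABLE;
    submit_bio(WRITE, bio);

    /* Block layer (mm/bounce.c, simplified): bounce when the flag is
     * set, not only for the usual highmem/DMA-limit reasons. */
    void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
    {
            int must_bounce = bio_flagged(*bio_orig, BIO_SNAP_STABLE);

            /* ... if must_bounce, copy each data page into a bounce
             * page, redirect bi_end_io, and submit the bounced bio ... */
    }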
> > > 
> > > > But that's the complete opposite of the problem stable pages are supposed to
> > > > solve: stable pages are for when the _lower_ layer (be it filesystem, bcache,
> > > > md, lvm) needs the memory being either read into or written from (both,
> > > > it's not just writes) to not be diddled over while the IO is in flight.
> > > > 
> > > > Now, a point that I think has been missed is that stable pages are _not_ a
> > > > complete solution, at least for consumers in the block layer.
> > > > 
> > > > The situation today is that if I'm in the block layer, and I get handed a read
> > > > or write bio, I _don't know_ if it's from something that's going to diddle over
> > > > those pages or not. So if I require stable pages - be it for data checksumming
> > > > or for other things - I've just got to bounce the bio myself.
> > > > 
> > > > And then the really annoying thing is that if you've got stacked things that all
> > > > need stable pages (maybe btrfs on top of bcache on top of md) - they _all_ have
> > > > to assume the pages aren't going to be stable, so if they need them they _all_
> > > > have to bounce - even though once the first layer has bounced the bio, it's
> > > > stable for everything underneath it.
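
(To make the redundancy concrete: every layer in that stack that wants to
checksum data ends up doing something like the sketch below.  bounce_clone()
and my_checksum_bio() are made-up names for illustration, not real bcache or
md code.)

    static void my_make_request(struct request_queue *q, struct bio *bio)
    {
            /*
             * Nothing on the bio says whether the submitter - or a layer
             * above that already bounced - will leave these pages alone
             * until completion, so assume the worst and copy them before
             * checksumming.  Stack three such layers and the same data
             * gets copied three times.
             */
            struct bio *stable = bounce_clone(bio);     /* hypothetical */

            my_checksum_bio(stable);                    /* hypothetical */
            generic_make_request(stable);
    }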
> > > 
> > > The current design is that if you need stable pages for your device, you set
> > > the bdi capability BDI_CAP_STABLE_WRITES; the fs then takes care of not
> > > scribbling over your pages while they are under writeback, or uses
> > > BIO_SNAP_STABLE if it cannot.
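
(Concretely, a simplified sketch with the current in-tree interfaces: the
device advertises the capability on its backing_dev_info, and the fs waits
for writeback before letting anyone redirty the page.)

    /* Device side: declare that pages under write IO must not change. */
    q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;

    /* Filesystem side, e.g. in ->page_mkwrite before allowing the page
     * to be redirtied; this is a no-op unless the bdi set the bit. */
    lock_page(page);
    wait_for_stable_page(page);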
> > 
> > But if I need stable pages, I still have to bounce, because that _does not_
> > guarantee stable pages: it only gives me stable pages for some of the IOs, and
> > in the lower layers you can't tell which is which.
> > 
> > Do you see the problem? What good is BDI_CAP_STABLE_WRITES if it's not a
> > guarantee and I can't tell if I need to bounce or not?
>   So fix the upper layers to make it a guarantee? You mentioned direct IO
> needs fixing. Anything else?

Back when I was writing the stable pages patches, I observed that some of the
filesystems didn't hold the pages containing their own metadata stable during
writeback on a stable-writes device.  The journalling filesystems were fine
because they had various means to take care of that.
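
(One such means, sketched below: metadata buffers stay locked for the
duration of the write, so anyone who wants to modify the block in the
meantime ends up waiting on the buffer lock until the IO completes.)

    /* Writeback side: the buffer stays locked until the end_io handler
     * runs at IO completion and unlocks it. */
    lock_buffer(bh);
    get_bh(bh);                     /* end_buffer_write_sync drops this */
    bh->b_end_io = end_buffer_write_sync;
    submit_bh(WRITE, bh);

    /* Modifier side: take the buffer lock before scribbling on the
     * block; this waits out any write in flight. */
    lock_buffer(bh);
    /* ... modify the metadata ... */
    mark_buffer_dirty(bh);
    unlock_buffer(bh);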

ISTR ext2 and vfat were the biggest culprits, but both maintainers rejected
the patches to fix that behavior.  This might no longer be the case; the
patches were posted so long ago that I can't find them in Google.

--D

> 
> 								Honza
> -- 
> Jan Kara <jack at suse.cz>
> SUSE Labs, CR