[dm-devel] [PATCH 0/7] Fix multipath/multipathd flush issue

Martin Wilck Martin.Wilck at suse.com
Thu Jul 2 12:24:32 UTC 2020


On Wed, 2020-07-01 at 22:14 -0500, Benjamin Marzinski wrote:
> On Wed, Jul 01, 2020 at 10:54:34PM +0200, Martin Wilck wrote:
> > On Thu, 2020-06-18 at 18:06 -0500, Benjamin Marzinski wrote:
> > > I uploaded the test program, aio_test:
> > > 
> > > https://github.com/bmarzins/test_programs.git
> > > 
> > > You just need to run it on a queueing multipath device with no
> > > active
> > > paths and an open count of 0. It will hang with the device open.
> > > Restore
> > > a path, and it will exit, and the open count will go to 0.
> > 
> > Tried it now, it behaves as you say. I admit I can't imagine how
> > the
> > suspend/resume would improve anything here. Indeed, as you say, it
> > opens 
> > up a race window. Another process might open the device while
> > it's suspended. Worse perhaps, once the device is resumed, an
> > uevent will be 
> > generated, and stacked devices might (in principle at least) be
> > recreated
> > before we get down to flush the map.
> > 
> > MAYBE the suspend/resume was necessary in the past because some
> > earlier 
> > kernels wouldn't immediately fail all outstanding commands when 
> > queue_if_no_path was deactivated? (just speculating, I don't know
> > if this
> > is the case).
> 
> If you disable queue_if_no_path and then do a suspend with flushing,
> you
> are guaranteed that the suspend won't return until all the IO has
> completed or failed.

I just realized that if we suspend, we don't even need to disable
queue_if_no_path, because the kernel does that automatically when a
"suspend with flush" is issued, and has been doing so basically
forever.
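
Roughly like this, via libdevmapper (just a sketch, not actual
multipath-tools code; suspend_with_flush() and the error handling are
made up for illustration):

#include <libdevmapper.h>

/* A plain DM_DEVICE_SUSPEND is a flushing suspend; dm_task_no_flush()
 * would have to be called to skip the flush.  On a flushing suspend
 * the kernel also clears queue_if_no_path, as noted above. */
static int suspend_with_flush(const char *mapname)
{
    struct dm_task *dmt;
    int r = 0;

    dmt = dm_task_create(DM_DEVICE_SUSPEND);
    if (!dmt)
        return 0;
    if (dm_task_set_name(dmt, mapname))
        r = dm_task_run(dmt);
    dm_task_destroy(dmt);
    return r;
}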

>   This would allow anything that was waiting on
> queued IO to have the IO fail back, which could allow it to close the
> device in time for multipath to be able to remove it (obviously this
> is
> racy).  However, this assumes that you do your open checks after the
> suspend, which multipath no longer does.
>  I realize that multipath can't
> suspend until after it tries to remove the partition devices,
> otherwise
> those can get stuck. But there probably is some order that gets this
> all
> right-ish.

Our code is currently racy. IMO we should

 0 unset queue_if_no_path
 1 remove partition mappings
 2 open(O_EXCL|O_RDONLY) the device
 3 If this fails, in multipath, try again after a suspend/resume cycle.
   In multipathd, I think we should just fail for now; perhaps later
   we could handle the explicit "remove map" command like multipath.
 4 If it fails again, bail out (in multipath, retry if asked to do so)
 5 run a "deferred remove" ioctl
 6 close the fd; the dm device will now be removed "atomically".

This cuts down the race window to the minimum possible - after (2), no
mount / kpartx / lvm operations are possible any more.
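
To illustrate steps 2, 5 and 6, a rough sketch using libdevmapper's
deferred-remove support (remove_map_exclusive(), the /dev/mapper path
handling and the error handling are mine, for illustration only):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libdevmapper.h>

static int remove_map_exclusive(const char *mapname)
{
    char path[256];
    struct dm_task *dmt;
    int fd, r = 0;

    snprintf(path, sizeof(path), "/dev/mapper/%s", mapname);

    /* Step 2: exclusive open.  While we hold this fd, mount /
     * kpartx / lvm can't claim the device. */
    fd = open(path, O_RDONLY | O_EXCL);
    if (fd < 0)
        return 0;   /* steps 3/4: caller may suspend/resume and retry */

    /* Step 5: queue a deferred remove; it takes effect as soon as
     * the last opener closes the device. */
    dmt = dm_task_create(DM_DEVICE_REMOVE);
    if (dmt) {
        if (dm_task_set_name(dmt, mapname) &&
            dm_task_deferred_remove(dmt))
            r = dm_task_run(dmt);
        dm_task_destroy(dmt);
    }

    /* Step 6: closing our fd lets the deferred remove fire. */
    close(fd);
    return r;
}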

When we remove the partition mappings, we could use the same technique
to avoid races on that layer. Unfortunately, a pending "deferred
remove" operation doesn't cause new open() or mount() calls to fail; if
it did, we could use a simpler approach.

> So, for a while now, the suspending has been doing nothing for
> us.  We
> could either try to reorder things so that we actually try to flush
> the
> queued IOs back first (with or without suspending), or we could just
> remove suspending and say that things are working fine the way they
> currently are.

What else can we do except suspend/resume to avoid racing with pending
close(), umount() or similar operations? Well, I guess if we open the
device anyway as I proposed, we could do an fsync() on it. That might
actually be better because it avoids the uevent being sent on resume.
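
Something like this, on the fd from step 2 above (flush_outstanding_io()
is just an illustrative name, not existing code):

#include <unistd.h>

/* The idea: with queue_if_no_path unset, fsync() doesn't return until
 * the outstanding I/O on the device has been written out or failed
 * back, without suspending the map and thus without a resume uevent.
 * Works on an O_RDONLY block device fd. */
static int flush_outstanding_io(int fd)
{
    return fsync(fd) == 0;
}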

We definitely can't suspend the map before we remove the partitions. We
could try a suspend/resume on the partition devices themselves (or
fsync()) if the opencount is > 0.

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg, GF: Felix Imendörffer
