[dm-devel] dm: add memory barrier before waitqueue_active
Mike Snitzer
snitzer at redhat.com
Tue Feb 5 19:58:52 UTC 2019
On Tue, Feb 05 2019 at 2:29pm -0500,
Mikulas Patocka <mpatocka at redhat.com> wrote:
>
>
> On Tue, 5 Feb 2019, Mike Snitzer wrote:
>
> > On Tue, Feb 05 2019 at 12:56pm -0500,
> > Mikulas Patocka <mpatocka at redhat.com> wrote:
> >
> > >
> > >
> > > On Tue, 5 Feb 2019, Mike Snitzer wrote:
> > >
> > > > On Tue, Feb 05 2019 at 5:09am -0500,
> > > > Mikulas Patocka <mpatocka at redhat.com> wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > Please submit this patch to Linus before 5.0 is released.
> > > > >
> > > > > Mikulas
> > > > >
> > > > >
> > > > >
> > > > > waitqueue_active without preceding barrier is unsafe, see the comment
> > > > > before waitqueue_active definition in include/linux/wait.h.
> > > > >
> > > > > This patch changes it to wq_has_sleeper.
> > > > >
> > > > > Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>
> > > > > Fixes: 6f75723190d8 ("dm: remove the pending IO accounting")
> > > > >
> > > > > ---
> > > > > drivers/md/dm.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > Index: linux-2.6/drivers/md/dm.c
> > > > > ===================================================================
> > > > > --- linux-2.6.orig/drivers/md/dm.c 2019-02-04 20:18:03.000000000 +0100
> > > > > +++ linux-2.6/drivers/md/dm.c 2019-02-04 20:18:03.000000000 +0100
> > > > > @@ -699,7 +699,7 @@ static void end_io_acct(struct dm_io *io
> > > > > true, duration, &io->stats_aux);
> > > > >
> > > > > /* nudge anyone waiting on suspend queue */
> > > > > - if (unlikely(waitqueue_active(&md->wait)))
> > > > > + if (unlikely(wq_has_sleeper(&md->wait)))
> > > > > wake_up(&md->wait);
> > > > > }
> > > > >
> > > >
> > > > This could be applicable to dm-rq.c:rq_completed() too...
> > >
> > > I don't know - it depends on whether the I/O counters are already
> > > protected by some other lock or serializing instruction. If not, then
> > > this is broken too.
> >
> > blk-mq uses its tags to know, so pretty sure we're OK.
>
> I'm not sure about the blk-mq code ... Jens could answer the question of
> whether it uses some interlocked synchronization primitives or not.
>
> > > > but I'm not following where you think we benefit from adding the
> > > > smp_mb() to end_io_acct() please be explicit about your concern.
> > >
> > > end_io_acct does:
> > > decrease the percpu counter
> > > test if the waitqueue is active
> > > if active, wake up
> > >
> > > the CPU can reorder it to:
> > > test if the waitqueue is active
> > > decrease the percpu counter
> > > if active, wake up
> >
> > For bio-based, are you certain about that given the locking that is done
> > in generic_end_io_acct()?
> > -- part_stat_lock() coupled with part_stat_local_dec()
>
> #define part_stat_lock() ({ rcu_read_lock(); get_cpu(); })
> #define part_stat_local_dec(gendiskp, field) \
> local_dec(&(part_stat_get(gendiskp, field)))
>
> There is no locking. The rcu lock isn't a synchronization barrier.
Right it isn't.
> > > now, we can have two CPUs racing in this way:
> > >
> > > CPU1: test if the waitqueue is active - returns false
> > > CPU2: it sees that the sum of the counters is not zero
> > > CPU2: it adds itself to the waitqueue
> > > CPU1: decrease the percpu counter - now the sum is zero
> > > CPU1: not calling wake_up, because it already tested the waitqueue and
> > > there was no process waiting on it
> > > CPU2: keeps waiting on the waitqueue - deadlock
> >
> > Yes, that is the conclusion if the reordering is possible. I'm just not
> > convinced that in practice we aren't getting other barriers to make the
> > code safe as-is.
>
> If you argue that there are locks, show them. The current code just
> disables preemption - on preemptive kernels it increases the preemption
> count (which is a non-locked operation) - on non-preemptive kernels it
> does nothing - and then it returns the current CPU.
>
> > BUT, even if we currently are, that doesn't mean we
> > should leave this DM code exposed to block core implementation altering
> > the order of IO accounting vs tests of waitqueue state.
> >
> > That said, this code has always had this race. Before we had a double
>
> No. In kernel 4.20 and before, it used "if (!pending)
> wake_up(&md->wait);", i.e. the problematic function "waitqueue_active"
> was not used at all.
I was mistaken.
> In 5.0 we have to use waitqueue_active because "pending" cannot be easily
> calculated. And calling wake_up with each bio would destroy performance.
Yes.
> > check of md_in_flight(); that was removed (and left to be tested on
> > wakeup) as a mini-optimization. It doesn't change the fact that we
> > _always_ could've had the "test if the waitqueue is active" reordered
> > ahead of the "decrease the percpu counter".
> >
> > > > Focusing on bio-based DM, your concern is end_io_acct()'s wake_up() will
> > > > race with its, or some other cpus', preceding generic_end_io_acct()
> > > > percpu accounting?
> > > > - and so dm_wait_for_completion()'s !md_in_flight() condition will
> > > > incorrectly determine there is outstanding IO due to end_io_acct()'s
> > > > missing smp_mb()?
> > > > - SO dm_wait_for_completion() would go back to top its loop and may
> > > > never get woken up again via wake_up(&md->wait)?
> > > >
> > > > The thing is in both callers that use this pattern:
> > > >
> > > > if (unlikely(waitqueue_active(&md->wait)))
> > > > wake_up(&md->wait);
> > > >
> > > > the condition (namely IO accounting) will have already been updated via
> > > > generic_end_io_acct() (in terms of part_dec_in_flight() percpu updates).
> > > > So to me, using smp_mb() here is fairly pointless when you consider the
> > > > condition that the waiter (dm_wait_for_completion) will be using is
> > > > _not_ the byproduct of a single store.
> > > >
> > > > Again, for bio-based DM, block core is performing atomic percpu updates
> > > > across N cpus. And the dm_wait_for_completion() waiter is doing percpu
> > > > totalling via md_in_flight_bios().
> > >
> > > Without locks or barriers, the CPU can reorder reads and writes
> > > arbitrarily. You can argue about the ordering of memory accesses, but
> > > other CPUs may see a completely different order.
> >
> > This doesn't tell me much relative to the question at hand.
> >
> > I think you're missing that: it'd be really nice to have precise
> > understanding that the smp_mb() really is necessary. Because otherwise,
> > we're just slowing IO completion down with a needless memory barrier.
> >
> > Mike
>
> I already showed this:
>
> > > CPU1: test if the waitqueue is active - returns false
> > > CPU2: it sees that the sum of the counters is not zero
> > > CPU2: it adds itself to the waitqueue
> > > CPU1: decrease the percpu counter - now the sum is zero
> > > CPU1: not calling wake_up, because it already tested the waitqueue and
> > > there was no process waiting on it
> > > CPU2: keeps waiting on the waitqueue - deadlock
>
>
> The suggestion to use smp_mb is in the comment in the file
> include/linux/wait.h, just before the waitqueue_active definition - this
> is the quotation of the part of the comment:
>
> * Because without the explicit smp_mb() it's possible for the
> * waitqueue_active() load to get hoisted over the @cond store such that we'll
> * observe an empty wait list while the waiter might not observe @cond.
>
>
> What other argument do you want?
I read this before. In all this exchange, I was trying to understand
whether your concern was born out of caution and your expertise, as
opposed to fixing a bug you or others have actually experienced recently
(e.g. dm_wait_for_completion hanging indefinitely).
It needs fixing regardless (given the apparent lack of a memory barrier
before waitqueue_active). I was just trying to get more details so I
could craft a slightly better patch header.
Thanks,
Mike