[dm-devel] dm: add memory barrier before waitqueue_active

Mike Snitzer snitzer at redhat.com
Tue Feb 5 19:58:52 UTC 2019


On Tue, Feb 05 2019 at  2:29pm -0500,
Mikulas Patocka <mpatocka at redhat.com> wrote:

> 
> 
> On Tue, 5 Feb 2019, Mike Snitzer wrote:
> 
> > On Tue, Feb 05 2019 at 12:56pm -0500,
> > Mikulas Patocka <mpatocka at redhat.com> wrote:
> > 
> > > 
> > > 
> > > On Tue, 5 Feb 2019, Mike Snitzer wrote:
> > > 
> > > > On Tue, Feb 05 2019 at  5:09am -0500,
> > > > Mikulas Patocka <mpatocka at redhat.com> wrote:
> > > > 
> > > > > Hi
> > > > > 
> > > > > Please submit this patch to Linus before 5.0 is released.
> > > > > 
> > > > > Mikulas
> > > > > 
> > > > > 
> > > > > 
> > > > > waitqueue_active() without a preceding memory barrier is unsafe; see
> > > > > the comment before the waitqueue_active() definition in
> > > > > include/linux/wait.h.
> > > > > 
> > > > > This patch changes it to wq_has_sleeper(), which performs the needed
> > > > > smp_mb() internally.
> > > > > 
> > > > > Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>
> > > > > Fixes: 6f75723190d8 ("dm: remove the pending IO accounting")
> > > > > 
> > > > > ---
> > > > >  drivers/md/dm.c |    2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > Index: linux-2.6/drivers/md/dm.c
> > > > > ===================================================================
> > > > > --- linux-2.6.orig/drivers/md/dm.c	2019-02-04 20:18:03.000000000 +0100
> > > > > +++ linux-2.6/drivers/md/dm.c	2019-02-04 20:18:03.000000000 +0100
> > > > > @@ -699,7 +699,7 @@ static void end_io_acct(struct dm_io *io
> > > > >  				    true, duration, &io->stats_aux);
> > > > >  
> > > > >  	/* nudge anyone waiting on suspend queue */
> > > > > -	if (unlikely(waitqueue_active(&md->wait)))
> > > > > +	if (unlikely(wq_has_sleeper(&md->wait)))
> > > > >  		wake_up(&md->wait);
> > > > >  }
> > > > >  
> > > > 
> > > > This could be applicable to dm-rq.c:rq_completed() too...
> > > 
> > > I don't know - it depends on whether the I/O counters are already
> > > protected by some other lock or serializing instruction. If not, then
> > > this is broken too.
> > 
> > blk-mq uses its tags to track in-flight requests, so I'm pretty sure
> > we're OK.
> 
> I'm not sure about the blk-mq code ... Jens could answer whether it uses 
> some interlocked synchronization primitives or not.
> 
> > > > but I'm not following where you think we benefit from adding the
> > > > smp_mb() to end_io_acct(); please be explicit about your concern.
> > > 
> > > end_io_acct does:
> > > 	decrease the percpu counter
> > > 	test if the waitqueue is active
> > > 	if active, wake up
> > > 
> > > the CPU can reorder it to:
> > > 	test if the waitqueue is active
> > > 	decrease the percpu counter
> > > 	if active, wake up
> > 
> > For bio-based, are you certain about that given the locking that is done
> > in generic_end_io_acct()?
> > -- part_stat_lock() coupled with part_stat_local_dec()
> 
> #define part_stat_lock()  ({ rcu_read_lock(); get_cpu(); })
> #define part_stat_local_dec(gendiskp, field)                            \
>         local_dec(&(part_stat_get(gendiskp, field)))
> 
> There is no locking. The RCU read lock isn't a synchronization barrier.

Right it isn't.
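
To spell out why that is no help here, a minimal sketch of what the
completion path boils down to on a non-preempt kernel (hypothetical names;
"in_flight" stands in for the real percpu accounting that
generic_end_io_acct() does):

        /* sketch only, not the actual dm.c code */
        static void end_io_acct_sketch(struct mapped_device *md,
                                       local_t *in_flight)
        {
                preempt_disable();      /* all part_stat_lock() amounts to:
                                         * a compiler barrier, no smp_mb() */
                local_dec(in_flight);   /* plain percpu counter update */
                preempt_enable();

                /*
                 * Nothing above orders the decrement against the load
                 * below, so the CPU may hoist the waitqueue check ahead
                 * of the counter update.
                 */
                if (unlikely(waitqueue_active(&md->wait)))
                        wake_up(&md->wait);
        }

Disabling preemption keeps us on one CPU, but it provides no ordering
guarantee as observed from other CPUs.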
 
> > > now, we can have two CPUs racing in this way:
> > > 
> > > CPU1:	test if the waitqueue is active - returns false
> > > CPU2:	it sees that the sum of the counters is not zero
> > > CPU2:	it adds itself to the waitqueue
> > > CPU1:	decrease the percpu counter - now the sum is zero
> > > CPU1:	not calling wake_up, because it already tested the waitqueue and 
> > > 	there was no process waiting on it
> > > CPU2:	keeps waiting on the waitqueue - deadlock
> > 
> > Yes, that is the conclusion if the reordering is possible.  I'm just not
> > convinced that in practice we aren't getting other barriers to make the
> > code safe as-is.
> 
> If you argue that there are locks, show them. The current code just 
> disables preemption: on preemptive kernels it increments the preemption 
> count (a non-locked operation); on non-preemptive kernels it does 
> nothing. Then it returns the current CPU.
> 
> > BUT, even if we currently are, that doesn't mean we should leave this
> > DM code exposed to the block core implementation altering the order of
> > IO accounting vs. tests of waitqueue state.
> > 
> > That said, this code has always had this race.  Before we had a double
> 
> No. In kernel 4.20 and before, it used "if (!pending) 
> wake_up(&md->wait);", i.e. the problematic function waitqueue_active() 
> was not used at all.

I was mistaken.
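
For the archive, the tail of end_io_acct() in 4.20 and earlier was roughly
(quoting from memory, so treat it as approximate):

        pending = atomic_dec_return(&md->pending[rw]);

        /* nudge anyone waiting on suspend queue */
        if (!pending)
                wake_up(&md->wait);

atomic_dec_return() implies a full memory barrier, which is why the old
pattern was not exposed to this reordering.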

> In 5.0 we have to use waitqueue_active() because "pending" cannot be 
> easily calculated, and calling wake_up() on each bio would destroy 
> performance.

Yes.
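
For anyone following along, the waiter side pairs with this in the usual
way; a sketch of the dm_wait_for_completion()-style loop (not the exact
upstream code):

        DEFINE_WAIT(wait);

        while (true) {
                /*
                 * prepare_to_wait() adds us to md->wait and then sets
                 * the task state, which implies a barrier before the
                 * condition is re-checked.
                 */
                prepare_to_wait(&md->wait, &wait, TASK_UNINTERRUPTIBLE);

                if (!md_in_flight(md))
                        break;

                io_schedule();
        }
        finish_wait(&md->wait, &wait);

For this to be deadlock-free, the waker must order its counter update
before loading the wait list, which is exactly the smp_mb() that
wq_has_sleeper() supplies.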

> > check of md_in_flight(); that was removed (and left to be tested on
> > wakeup) as a mini-optimization.  It doesn't change the fact that we
> > _always_ could've had the "test if the waitqueue is active" reordered
> > ahead of the "decrease the percpu counter".
> > 
> > > > Focusing on bio-based DM, your concern is that end_io_acct()'s
> > > > wake_up() will race with its own, or some other CPU's, preceding
> > > > generic_end_io_acct() percpu accounting?
> > > > - and so dm_wait_for_completion()'s !md_in_flight() condition will
> > > >   incorrectly determine there is outstanding IO due to end_io_acct()'s
> > > >   missing smp_mb()?
> > > > - SO dm_wait_for_completion() would go back to the top of its loop
> > > >   and may never get woken up again via wake_up(&md->wait)?
> > > > 
> > > > The thing is, in both callers that use this pattern:
> > > >     
> > > >     if (unlikely(waitqueue_active(&md->wait)))
> > > > 		wake_up(&md->wait);
> > > > 
> > > > the condition (namely IO accounting) will have already been updated via
> > > > generic_end_io_acct() (in terms of part_dec_in_flight() percpu updates).
> > > > So to me, using smp_mb() here is fairly pointless when you consider
> > > > that the condition the waiter (dm_wait_for_completion) will be using
> > > > is _not_ the byproduct of a single store.
> > > > 
> > > > Again, for bio-based DM, block core is performing atomic percpu updates
> > > > across N CPUs.  And the dm_wait_for_completion() waiter is doing percpu
> > > > totalling via md_in_flight_bios().
> > > 
> > > Without locks or barriers, the CPU can reorder reads and writes 
> > > arbitrarily. You can argue about the ordering of memory accesses, but 
> > > other CPUs may see a completely different order.
> > 
> > This doesn't tell me much relative to the question at hand.
> > 
> > I think you're missing that it'd be really nice to have a precise
> > understanding of whether the smp_mb() really is necessary.  Because
> > otherwise, we're just slowing down IO completion with a needless memory
> > barrier.
> > 
> > Mike
> 
> I already showed this:
> 
> > > CPU1:       test if the waitqueue is active - returns false
> > > CPU2:       it sees that the sum of the counters is not zero
> > > CPU2:       it adds itself to the waitqueue
> > > CPU1:       decrease the percpu counter - now the sum is zero
> > > CPU1:       not calling wake_up, because it already tested the waitqueue and
> > >             there was no process waiting on it
> > > CPU2:       keeps waiting on the waitqueue - deadlock
> 
> 
> The suggestion to use smp_mb() is in the comment in include/linux/wait.h, 
> just before the waitqueue_active() definition; this is a quotation of the 
> relevant part of the comment:
> 
>  * Because without the explicit smp_mb() it's possible for the
>  * waitqueue_active() load to get hoisted over the @cond store such that we'll
>  * observe an empty wait list while the waiter might not observe @cond.
> 
> 
> What other argument do you want?

I read this before.  Throughout this exchange, I was trying to understand
whether your concern was born of caution and your expertise, as opposed to
fixing a bug you or others have actually experienced recently
(e.g. dm_wait_for_completion() hanging indefinitely).

It needs fixing regardless (given the apparent lack of a memory barrier
before waitqueue_active()).  I was just trying to get more details to be
able to craft a better patch header.
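
For the record, wq_has_sleeper() simply bakes the barrier into the check;
its definition in include/linux/wait.h is roughly (paraphrased, the
upstream comment is longer):

        static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
        {
                /*
                 * Order the caller's preceding writes (here: the percpu
                 * in-flight decrement) against the load of the wait list.
                 */
                smp_mb();
                return waitqueue_active(wq_head);
        }

So the patch keeps the cheap "don't wake if nobody is waiting" fast path
while closing the race.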

Thanks,
Mike



