[dm-devel] dm-raid: unsynced raid snapshot creation/deletion causes panic (work queue teardown race)

Mon May 11 20:30:03 UTC 2015

On Wed, Apr 29 2015 at  8:33am -0400,
heinzm at redhat.com <heinzm at redhat.com> wrote:

> From: Heinz Mauelshagen <heinzm at redhat.com>
> 
> This patch avoids oopses caused by a callback racing
> with the destrution of a mapping.
> 
> 
> Signed-off-by: Heinz Mauelshagen <heinzm at redhat.com>
> Tested-by: Heinz Mauelshagen <heinzm at redhat.com>
> 
> 
> ---
>  drivers/md/dm-raid.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 88e4c7f..fc4bc83 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -1614,6 +1614,17 @@ static void raid_presuspend(struct dm_target *ti)
>  {
>  	struct raid_set *rs = ti->private;
>  
> +	/*
> +	 * Address a teardown race when calling
> +	 * raid_(pre|post)suspend followed by raid_dtr:
> +	 *
> +	 * MD's call chain md_stop_writes()->md_reap_sync_thread()
> +	 * causes work to be queued on the md_misc_wq queue
> +	 * not flushing it, hence the callback can occur after
> +	 * a potential destruction of the raid set causing an oops.
> +	 */
> +	rs->md.event_work.func = NULL;
> +
>  	md_stop_writes(&rs->md);
>  }
>  
> @@ -1684,6 +1695,13 @@ static void raid_resume(struct dm_target *ti)
>  {
>  	struct raid_set *rs = ti->private;
>  
> +	/*
> +	 * See "Address a teardown race" in raid_presuspend()
> +	 *
> +	 * Reenable the worker function.
> +	 */
> +	rs->md.event_work.func = do_table_event;
> +
>  	set_bit(MD_CHANGE_DEVS, &rs->md.flags);
>  	if (!rs->bitmap_loaded) {
>  		bitmap_load(&rs->md);

The subject of this patch is strange (snapshot?)

Anyway, what flushes the md_misc_wq before you set it to NULL?
Shouldn't that happen as part of presuspend?

Also, is it really safe to set this to NULL out from underneath MD?
Should a helper get exposed from MD with some extra locking?

Mike