[dm-devel] [PATCH 4/4] dm era: Remove unreachable resize operation in pre-resume function

Thu Feb 11 14:19:51 UTC 2021

On 2/10/21 8:48 PM, Mike Snitzer wrote:
> On Wed, Feb 10 2021 at  1:12P -0500,
> Mike Snitzer <snitzer at redhat.com> wrote:
> 
>> On Fri, Jan 22 2021 at 10:25am -0500,
>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>
>>> The device metadata are resized in era_ctr(), so the metadata resize
>>> operation in era_preresume() never runs.
>>>
>>> Also, note, that if the operation did ever run it would deadlock, since
>>> the worker has not been started at this point.
> 
> It wouldn't have deadlocked, it'd have queued the work (see wake_worker)
> 

Hi Mike,

The resize is performed as an RPC and in_worker1() ends up calling
perform_rpc(). perform_rpc() calls wake_worker() and then waits for the
RPC to complete: wait_for_completion(&rpc->complete). So, start_worker()
is not called until after the RPC has been completed.

But, you are right, it won't deadlock. I was confused by wake_worker:

   static void wake_worker(struct era *era)
   {
           if (!atomic_read(&era->suspended))
                   queue_work(era->wq, &era->worker);
   }

When we suspend the device we set era->suspended to 1, so I mistakenly
thought that the resize operation during preresume would deadlock,
because wake_worker wouldn't queue the work.

But, the resize is only triggered when loading a new table, which
creates a new target by calling era_ctr. There era->suspended is
indirectly initialized to 0, because of kzalloc.

So, wake_worker will indeed queue the work.

>>>
>>> Fixes: eec40579d84873 ("dm: add era target")
>>> Cc: stable at vger.kernel.org # v3.15+
>>> Signed-off-by: Nikos Tsironis <ntsironis at arrikto.com>
>>> ---
>>>   drivers/md/dm-era-target.c | 9 ---------
>>>   1 file changed, 9 deletions(-)
>>>
>>> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
>>> index 104fb110cd4e..c40e132e50cd 100644
>>> --- a/drivers/md/dm-era-target.c
>>> +++ b/drivers/md/dm-era-target.c
>>> @@ -1567,15 +1567,6 @@ static int era_preresume(struct dm_target *ti)
>>>   {
>>>   	int r;
>>>   	struct era *era = ti->private;
>>> -	dm_block_t new_size = calc_nr_blocks(era);
>>> -
>>> -	if (era->nr_blocks != new_size) {
>>> -		r = in_worker1(era, metadata_resize, &new_size);
>>> -		if (r)
>>> -			return r;
>>> -
>>> -		era->nr_blocks = new_size;
>>> -	}
>>>   
>>>   	start_worker(era);
>>>   
>>> -- 
>>> 2.11.0
>>>
>>
>> Resize shouldn't actually happen in the ctr.  The ctr loads a temporary
>> (inactive) table that will only become active upon resume.  That is why
>> resize should always be done in terms of resume.
>>

I kept the resize in the ctr to maintain the original behavior of
dm-era.

But, I had missed what you are describing here, which indeed makes sense
and it's the correct thing to do.

Thanks a lot for explaining it.

>> I'll look closer but ctr shouldn't do the actual resize, and the
>> start_worker() should be moved above the resize code you've removed
>> above.
> 
> Does this work for you?  If so I'll get it staged (like I've just
> staged all your other dm-era fixes for 5.12).
> 

The patch you attach won't work as is. We can't perform the resize in
the worker, because the worker might run other metadata operations,
e.g., it could start digestion, before resizing the metadata. These
operations will end up using the old size.

This can lead to errors:

1. Create a 1GiB dm-era device

    # dmsetup create eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Write to a block

    # dd if=/dev/zero of=/dev/mapper/eradev oflag=direct bs=4M count=1 seek=200

3. Suspend the device

    # dmsetup suspend eradev

4. Load a new table reducing the size of the device, so it doesn't
    include the block written at step (2)

    # dmsetup load eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

5. Resume the device

    # dmsetup resume eradev

In dmesg we see the following:

    device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
    device-mapper: era: process_old_eras: digest step failed, stopping digestion

The reason is that the worker started the digestion of the archived
writeset using the old, larger size.

As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.

Instead, we have to resize the metadata directly in era_preresume, and
not use the worker to do it.

I prepared a new patch doing that, which I will send with a new mail.

Nikos.

>   drivers/md/dm-era-target.c | 13 ++-----------
>   1 file changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
> index d0e75fd31c1e..ec198e9cdafb 100644
> --- a/drivers/md/dm-era-target.c
> +++ b/drivers/md/dm-era-target.c
> @@ -1501,15 +1501,6 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
>   	}
>   	era->md = md;
>   
> -	era->nr_blocks = calc_nr_blocks(era);
> -
> -	r = metadata_resize(era->md, &era->nr_blocks);
> -	if (r) {
> -		ti->error = "couldn't resize metadata";
> -		era_destroy(era);
> -		return -ENOMEM;
> -	}
> -
>   	era->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
>   	if (!era->wq) {
>   		ti->error = "could not create workqueue for metadata object";
> @@ -1583,6 +1574,8 @@ static int era_preresume(struct dm_target *ti)
>   	struct era *era = ti->private;
>   	dm_block_t new_size = calc_nr_blocks(era);
>   
> +	start_worker(era);
> +
>   	if (era->nr_blocks != new_size) {
>   		r = in_worker1(era, metadata_resize, &new_size);
>   		if (r)
> @@ -1591,8 +1584,6 @@ static int era_preresume(struct dm_target *ti)
>   		era->nr_blocks = new_size;
>   	}
>   
> -	start_worker(era);
> -
>   	r = in_worker0(era, metadata_era_rollover);
>   	if (r) {
>   		DMERR("%s: metadata_era_rollover failed", __func__);
>