[dm-devel] Deadlock when swapping a table with a dm-era target

Nikos Tsironis ntsironis at arrikto.com
Wed Dec 8 20:10:32 UTC 2021


On 12/3/21 6:00 PM, Zdenek Kabelac wrote:
> Dne 03. 12. 21 v 15:42 Nikos Tsironis napsal(a):
>> On 12/2/21 5:41 PM, Zdenek Kabelac wrote:
>>> Dne 01. 12. 21 v 18:07 Nikos Tsironis napsal(a):
>>>> Hello,
>>>>
>>>> Under certain conditions, swapping a table that includes a dm-era
>>>> target with a new table causes a deadlock.
>>>>
>>>> This happens when a status (STATUSTYPE_INFO) or message IOCTL is blocked
>>>> in the suspended dm-era target.
>>>>
>>>> dm-era executes all metadata operations in a worker thread, which stops
>>>> processing requests when the target is suspended, and resumes again when
>>>> the target is resumed.
>>>>
>>>> So, running 'dmsetup status' or 'dmsetup message' for a suspended dm-era
>>>> device blocks, until the device is resumed.
>>>>
>> Hi Zdenek,
>>
>> Thanks for the feedback. There doesn't seem to be any documentation
>> mentioning that loading the new table should happen before suspend, so
>> thanks a lot for explaining it.
>>
>> Unfortunately, this isn't what causes the deadlock. The following
>> sequence, which loads the table before suspend, also results in a
>> deadlock:
>>
>> 1. Create device with dm-era target
>>
>>    # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>>
>> 2. Load new table to device, e.g., to resize the device
>>
>>    # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"
>>
>> 3. Suspend the device
>>
>>    # dmsetup suspend eradev
>>
>> 4. Retrieve the status of the device. This blocks for the reasons I
>>    explained in my previous email.
>>
>>    # dmsetup status eradev
> 
> 
> Hi
> 
> Querying 'status' while the device is suspended is the next issue you need to fix in your workflow.
> 

Hi,

These steps are not my exact workflow. It's just a series of steps to
easily reproduce the bug.

I am not the one retrieving the status of the suspended device. LVM is.
LVM, when running commands like 'lvs' and 'vgs', retrieves the status of
the devices on the system using the DM_TABLE_STATUS ioctl.

LVM indeed uses the DM_NOFLUSH_FLAG, but this makes no difference for
dm-era, since dm-era doesn't check for this flag.

So, for example, a user or a monitoring daemon running an LVM command,
like 'vgs', at the "wrong" time triggers the bug:

1. Create device with dm-era target

    # dmsetup create eradev --table "0 1048576 era /dev/datavg/erameta /dev/datavg/eradata 8192"

2. Load new table to device, e.g., to resize the device

    # dmsetup load eradev --table "0 2097152 era /dev/datavg/erameta /dev/datavg/eradata 8192"

3. Suspend the device

    # dmsetup suspend eradev

4. Someone, e.g., a user or a monitoring daemon, runs an LVM command at
    this point, e.g. 'vgs'.

5. 'vgs' tries to retrieve the status of the dm-era device using the
    DM_TABLE_STATUS ioctl, and blocks.

6. Resume the device: This deadlocks.

    # dmsetup resume eradev

So, I can't change something in my workflow to prevent the bug. It's a
race that happens when someone runs an LVM command at the "wrong" time.

I am aware that using an appropriate LVM 'global_filter' can prevent
this, but:

1. This is just a workaround, not a proper solution.
2. This is not always an option. Imagine, for example, someone running
    an LVM command in a container. Or we may not be allowed to change the
    LVM configuration of the host at all.

> Normally 'status' operation may need to flush queued IO operations to get accurate data.
> 
> So you should query states before you start to mess with tables.
> 
> If you want to get 'status' without flushing - use:   'dmsetup status --noflush'.
> 

I am aware of that, and of the '--noflush' flag.

But note that:

1. As I have already explained in my previous emails, the cause of the
    deadlock is not I/O related.
2. dm-era doesn't check for this flag, so using it makes no difference.
3. Other targets that check for this flag, e.g., dm-thin and dm-cache,
    also check _explicitly_ whether the device is suspended before
    committing their metadata to get accurate statistics. They don't rely
    solely on the user passing the '--noflush' flag.

That said, fixing 'era_status()' to check for the 'noflush' flag, and
whether the device is suspended, could be a possible fix, which I have
already proposed in my first email.

However, as I have already explained, it's not simply a matter of not
committing metadata when the 'noflush' flag is used or the device is
suspended.

dm-era queues the status operation (as well as all operations that touch
the metadata) for execution by a worker thread, to avoid using locks for
accessing the metadata.

When the target is suspended this thread doesn't execute operations, so
the 'table_status()' call blocks, holding the SRCU read lock of the
device (md->io_barrier), until the target is resumed.

But, 'table_status()' _never_ unblocks if you resume the device with a
new table preloaded. Instead, the resume operation ('dm_swap_table()')
deadlocks waiting for 'table_status()' to drop the SRCU read lock.

This never happens, and the only way to recover is to reboot.

> 
>> 5. Resume the device. This deadlocks for the reasons I explained in my
>>    previous email.
>>
>>    # dmsetup resume eradev
>>
>> 6. The dmesg logs are the same as the ones I included in my previous
>>    email.
>>
>> I have explained the reasons for the deadlock in my previous email, but
>> I would be more than happy to discuss them more.
>>
> 
> There is no bug - if your only problem is 'stuck'  status while you have devices in suspended state.
> 

As I explained previously, my problem is not 'stuck' status while the
device is suspended.

The issue is that if the suspended dm-era device has a new table
preloaded, the 'stuck' status results in 'stuck' resume.

And the only way to recover is by rebooting.

> You should NOT be doing basically anything while the device is suspended!!
> 

The documentation of the writecache target
(https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/writecache.html)
states that the following is the proper sequence for removing the cache
device:

1. send the "flush_on_suspend" message
2. load an inactive table with a linear target that maps to the
    underlying device
3. suspend the device
4. ask for status and verify that there are no errors
5. resume the device, so that it will use the linear target
6. the cache device is now inactive and it can be deleted

The above sequence, except for step 1, which is not applicable to
dm-era, is exactly the sequence of steps that triggers the bug for
dm-era: run against a dm-era device, it deadlocks.

So, although I understand your point about not doing anything with a
suspended device, it seems that this sequence of steps is not wrong, and
it is actually recommended by the writecache documentation.

Still, as I mentioned, I am not explicitly running the 'status'
operation on the suspended dm-era device. It's a race with LVM, which
runs it implicitly when running commands such as 'vgs' or 'lvs'.

> i.e. imagine you suspend a 'swap' device and, while you are in the suspended state, the kernel decides to swap memory pages - so you get instantly frozen here.
> 
> For this reason lvm2, while doing a 'suspend/resume' sequence, preallocates all memory before this operation - and does a very minimal set of operations between suspend and resume, to minimize latencies and so on.
> 
> Clearly if you suspend just some 'supportive' disk of yours - you are likely not in danger of blocking your swap - but the 'status --noflush' logic still applies.
> 

I get what you are describing about a 'swap' device, and I agree
completely.

But, this is not what happens in the case of dm-era.

Regards,
Nikos.



