[dm-devel] [PATCH 0/2] dm era: Fix bugs that lead to lost writes after crash

Nikos Tsironis ntsironis at arrikto.com
Fri Jan 22 15:19:29 UTC 2021


In case of a system crash, dm-era might lose the information about
blocks written during the current era, although the corresponding writes
were passed down to the origin device and completed successfully.

There are two major, distinct bugs that can lead to lost writes:
  1. dm-era doesn't recover the committed writeset after a system crash
  2. dm-era decides whether or not to defer a write based on
     uncommitted information

Failure to recover committed writeset
=====================================

Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about which blocks were written during the affected era.

There are three issues that cause the committed writeset to get lost:

1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
   committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
   the current era might not have been archived, due to a system crash.
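
The effect of issue 3 can be illustrated with a toy shell model. This is
only a sketch under stated assumptions, not the dm-era code: the
function name, its mode argument and the string-based writeset are all
hypothetical. It contrasts a resume path that drops the committed
writeset with one that archives it before starting a new era.

```shell
#!/bin/sh
# Toy model; every name below is hypothetical.
#   $1 = drop:    start a new era and ignore the committed writeset
#   $1 = archive: archive the committed writeset, then start a new era
#   $2 = committed writeset found on disk (space-separated block numbers)
resume_after_crash() {
    archive=""
    if [ "$1" = "archive" ]; then
        archive="$2"
    fi
    current=""    # a new era always starts with an empty writeset
    # Blocks written since era 1: archived eras plus the current era.
    echo "written:${archive:+ $archive}${current:+ $current}"
}

resume_after_crash drop 0
resume_after_crash archive 0
```

In "drop" mode block 0 disappears from the report, which matches the
empty <blocks> list seen in the reproduction steps.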

Steps to reproduce
------------------

1. Create two LVs, one for data and one for metadata

   # lvcreate -n eradata -L1G datavg
   # lvcreate -n erameta -L64M datavg

2. Fill the whole data device with zeroes

   # dd if=/dev/zero of=/dev/datavg/eradata oflag=direct bs=1M

3. Create the dm-era device. We set the tracking granularity to 4MiB.

   # dmsetup create eradev --table "0 `blockdev --getsz \
     /dev/datavg/eradata` era /dev/datavg/erameta /dev/datavg/eradata 8192"

4. Write random data to the first block of the device

   # dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4M count=1

5. Flush the device

   # sync /dev/mapper/eradev

6. Forcefully reboot the machine

   # echo b > /proc/sysrq-trigger

7. When the machine comes back up, recreate the dm-era device and ask
   for the list of blocks written since era 1, i.e., for all blocks ever
   written to the device.

   # dmsetup message eradev 0 take_metadata_snap
   # era_invalidate --metadata-snapshot --written-since 1 /dev/datavg/erameta
   <blocks>
   </blocks>

The list of written blocks reported by dm-era is empty, even though we
wrote the first 4MiB block of the device successfully. Using, e.g.,
`hexdump /dev/datavg/eradata`, one can verify that indeed the first 4MiB
block of the device was written.
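
Step 3 passes 8192 as the era block size. Device-mapper tables count in
512-byte sectors, so the arithmetic behind the 4MiB granularity can be
checked with a small sketch. The helper names are hypothetical and
exist only for illustration.

```shell
#!/bin/sh
# Device-mapper tables express sizes in 512-byte sectors.
SECTOR_BYTES=512

# Convert an era block size given in sectors ($1) to bytes.
era_block_bytes() {
    echo $(( $1 * SECTOR_BYTES ))
}

# Number of era blocks needed to cover a device of $1 bytes when the
# era block size is $2 sectors (rounded up).
era_block_count() {
    block_bytes=$(era_block_bytes "$2")
    echo $(( ($1 + block_bytes - 1) / block_bytes ))
}

echo "era block size: $(era_block_bytes 8192) bytes"
echo "era blocks on a 1GiB device: $(era_block_count $((1024 * 1024 * 1024)) 8192)"
```

The dd of step 4 (bs=4M count=1) therefore covers exactly one era
block, the first one.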

Missed writes
=============

In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.

Suppose the following sequence of events:

1. We write to a block that has not yet been written in the current era
2. era_map() checks the in-core bitmap for the current era and sees that
   the block is not marked as written.
3. The write is deferred for submission after the metadata have been
   updated and committed.
4. The worker thread processes the deferred write
   (process_deferred_bios()) and marks the block as written in the
   in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
   as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
   device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
   (5) finishes.

When the system recovers and we query the dm-era target for the list of
written blocks, it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.
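
The sequence above can be replayed with a toy model of a single era
block. This is a sketch, not the kernel code: all variable and function
names are hypothetical, and the commit is modelled simply as "never
finishes before the crash". The "after" mode shows the ordering named
in the second patch, where the in-core bit is set only after the
metadata commit.

```shell
#!/bin/sh
# Toy model of a single era block. All names are hypothetical.
#   in_core:  bit in the in-memory bitmap for the current era
#   on_disk:  bit in the committed metadata
#   origin:   writes that reached the origin device
#   deferred: writes waiting for a metadata commit

# Map path: pass the write down if the in-core bit is set, else defer.
map_write() {
    if [ "$in_core" -eq 1 ]; then
        origin=$((origin + 1))
    else
        deferred=$((deferred + 1))
    fi
}

# Replay steps 1-9 and print what survives the crash.
#   $1 = before: the worker sets the in-core bit before the commit ends
#   $1 = after:  the worker sets it only once the commit has finished
simulate_crash() {
    in_core=0; on_disk=0; origin=0; deferred=0
    map_write                   # steps 1-3: first write is deferred
    if [ "$1" = "before" ]; then
        in_core=1               # step 4: marked before the commit
    fi
    # step 5: the commit has started but not finished, on_disk stays 0
    map_write                   # steps 6-8: more writes to the block
    # step 9: crash, only the origin data and on_disk survive
    echo "origin_writes=$origin committed_bit=$on_disk"
}

simulate_crash before
simulate_crash after
```

In "before" mode a write reaches the origin while the committed bit is
still 0, which is the lost write. In "after" mode the later writes stay
deferred, so nothing reaches the origin that the metadata does not know
about.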

Steps to reproduce
------------------

1. Create two LVs, one for data and one for metadata

   # lvcreate -n eradata -L1G datavg
   # lvcreate -n erameta -L64M datavg

2. Fill the whole data device with zeroes

   # dd if=/dev/zero of=/dev/datavg/eradata oflag=direct bs=1M

3. Create a dm-delay device, initially with no delay, that overlays the
   metadata device. This allows us to delay the metadata commit so we
   can reproduce the bug more easily.

   # dmsetup create delaymeta --table "0 `blockdev --getsz \
     /dev/datavg/erameta` delay /dev/datavg/erameta 0 0 /dev/datavg/erameta 0 0"

4. Create the dm-era device, using the data LV for data and the dm-delay
   device for its metadata. We set the tracking granularity to 4MiB.

   # dmsetup create eradev --table "0 `blockdev --getsz \
     /dev/datavg/eradata` era /dev/mapper/delaymeta /dev/datavg/eradata 8192"

5. Change the dm-delay device table and set the write delay to 10 seconds

   # dmsetup suspend delaymeta; dmsetup load delaymeta --table "0 \
     `blockdev --getsz /dev/datavg/erameta` delay /dev/datavg/erameta 0 0 \
     /dev/datavg/erameta 0 10000"; dmsetup resume delaymeta

6. Run the following script:

   #!/bin/bash

   # a. Write to the first 4KiB block of the device, which maps to era block #0
   dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4K count=1 &

   # b. Write to the second 4KiB block of the device, which also maps to block #0
   dd if=/dev/urandom of=/dev/mapper/eradev oflag=direct bs=4K seek=1 count=1

   # c. Sync the device
   sync /dev/mapper/eradev

   # d. Forcefully reboot
   echo b > /proc/sysrq-trigger

   The command of step (6a) blocks as expected, waiting for the metadata
   commit. Meanwhile dm-era has marked block #0 as written in the in-core
   bitmap.

   We would expect the command of step (6b) to also block waiting for
   the metadata commit triggered by (6a), as they touch the same block.

   But it doesn't.

7. After the system comes back up, examine the data device, e.g., using
   `hexdump /dev/datavg/eradata`. We can see that indeed the write from
   (6a) never completed, but the write from (6b) hit the disk.

8. Recreate the device stack and ask for the list of blocks written
   since era 1, i.e., for all blocks ever written to the device.

   # dmsetup message eradev 0 take_metadata_snap
   # era_invalidate --metadata-snapshot --written-since 1 /dev/mapper/delaymeta
   <blocks>
   </blocks>

The list of written blocks reported by dm-era is empty, even though
block #0 was written and flushed to the device.
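
Steps (6a) and (6b) rely on both 4KiB writes landing in era block #0.
A quick way to check that mapping is sketched below, assuming the
8192-sector era block size used above. The helper name is hypothetical.

```shell
#!/bin/sh
# Era block index hit by a dd write.
#   $1 = dd bs in bytes
#   $2 = dd seek, counted in bs-sized units
#   $3 = era block size in 512-byte sectors
era_block_of() {
    echo $(( ($1 * $2) / ($3 * 512) ))
}

echo "6a (bs=4K, seek=0): era block $(era_block_of 4096 0 8192)"
echo "6b (bs=4K, seek=1): era block $(era_block_of 4096 1 8192)"
```

Any seek value below 1024 stays within era block #0 for this
configuration, so the second write could target any of those offsets.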

Nikos Tsironis (2):
  dm era: Recover committed writeset after crash
  dm era: Update in-core bitset after committing the metadata

 drivers/md/dm-era-target.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

-- 
2.11.0



