[dm-devel] fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss

Michael Laß bevan at bi-co.net
Tue May 21 16:46:20 UTC 2019


> On 20.05.2019 at 13:38, Michael Laß <bevan at bi-co.net> wrote:
> 
>> 
>> On 19.05.2019 at 21:55, Michael Laß <bevan at bi-co.net> wrote:
>> 
>> CC'ing dm-devel, as this seems to be a dm-related issue. Short summary for new readers:
>> 
>> On Linux 5.1 (tested up to 5.1.3), fstrim may discard too many blocks, leading to data loss. I have the following storage stack:
>> 
>> btrfs
>> dm-crypt (LUKS)
>> LVM logical volume
>> LVM single physical volume
>> MBR partition
>> Samsung 830 SSD
>> 
>> The mapping between logical volumes and physical segments is fairly fragmented. See below for the output of “pvdisplay -m”. When I issue fstrim on the mounted btrfs volume, I get the following kernel messages:
>> 
>> attempt to access beyond end of device
>> sda1: rw=16387, want=252755893, limit=250067632
>> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
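>> 
>> Rough numbers, just to give a sense of scale (assuming "want" is the
>> end sector of the offending request): the access reaches about 1.3 GiB
>> past the end of sda1:
>> 
>>   echo $(( 252755893 - 250067632 ))                    # 2688261 sectors
>>   echo $(( (252755893 - 250067632) * 512 / 1048576 ))  # ~1312 MiB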
>> 
>> At the same time, other logical volumes on the same physical volume are destroyed. Also the btrfs volume itself may be damaged (this seems to depend on the actual usage).
>> 
>> I can easily reproduce this issue locally and I’m currently bisecting. So far I could narrow down the range of commits to:
>> Good: 92fff53b7191cae566be9ca6752069426c7f8241
>> Bad: 225557446856448039a9e495da37b72c20071ef2
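>> 
>> For reference, the bisect run boils down to the following; between the
>> steps I rebuild and boot the kernel and run fstrim on the btrfs volume
>> (build and boot steps omitted, they are specific to my setup):
>> 
>>   git bisect start
>>   git bisect bad  225557446856448039a9e495da37b72c20071ef2
>>   git bisect good 92fff53b7191cae566be9ca6752069426c7f8241
>>   # after each build/boot/fstrim cycle:
>>   git bisect good   # or "git bisect bad" if fstrim corrupts the LVs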
> 
> I finished bisecting. Here’s the responsible commit:
> 
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer at redhat.com>
> Date:   Fri Jan 18 14:19:26 2019 -0500
> 
>    dm: eliminate 'split_discard_bios' flag from DM target interface
> 
>    There is no need to have DM core split discards on behalf of a DM target
>    now that blk_queue_split() handles splitting discards based on the
>    queue_limits.  A DM target just needs to set max_discard_sectors,
>    discard_granularity, etc, in queue_limits.
> 
>    Signed-off-by: Mike Snitzer <snitzer at redhat.com>

Reverting that commit on top of Linux 5.1.3 solves the issue for me. Would such a revert be an option until the root cause has been identified? I’d rather not let more people run into this issue.
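
In case anyone wants to do the same in the meantime, this is roughly
what I did on top of the stable tree (build and install steps are of
course distro-specific):

  git checkout v5.1.3
  git revert 61697a6abd24acba941359c6268a94f4afe4a53d
  # rebuild and install the kernel as usual, e.g.:
  make olddefconfig && make -j"$(nproc)"
  sudo make modules_install install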

Cheers,
Michael


> Maybe the assumption made here (“A DM target just needs to set max_discard_sectors, discard_granularity, etc, in queue_limits.”) isn’t valid in my case? Does anyone have an idea?
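> 
> One thing I can check is which discard limits each layer of the stack
> actually advertises; something like this (lsblk -D shows the discard
> granularity and maximum per device, and the sysfs files expose the raw
> queue_limits values):
> 
>   lsblk -D
>   grep . /sys/block/sda/queue/discard_{granularity,max_bytes}
>   grep . /sys/block/dm-*/queue/discard_{granularity,max_bytes}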
> 
> 
>> 
>> In this range of commits, there are only dm-related changes.
>> 
>> So far, I have not reproduced the issue with other file systems or a simplified stack. I first want to continue bisecting but this may take another day.
>> 
>> 
>>> On 18.05.2019 at 12:26, Qu Wenruo <quwenruo.btrfs at gmx.com> wrote:
>>> On 2019/5/18 at 5:18 PM, Michael Laß wrote:
>>>> 
>>>>> On 18.05.2019 at 06:09, Chris Murphy <lists at colorremedies.com> wrote:
>>>>> 
>>>>> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan at bi-co.net> wrote:
>>>>>> 
>>>>>> 
>>>>>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>>>>>> 
>>>>>> fstrim: /: FITRIM ioctl failed: Input/output error
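>>>>>> 
>>>>>> For completeness, I ran something like the following (-v just makes
>>>>>> fstrim report how much it trimmed; dmesg shows whether the kernel
>>>>>> complained):
>>>>>> 
>>>>>>   fstrim -v /
>>>>>>   dmesg | tail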
>>>>> 
>>>>> Huh. Any kernel messages at the same time? I would expect any fstrim
>>>>> user space error message to be accompanied by a kernel message. Any
>>>>> i/o error suggests some kind of storage stack failure - which could be
>>>>> hardware or software; you can't know without seeing the kernel messages.
>>>> 
>>>> I missed that. The kernel messages are:
>>>> 
>>>> attempt to access beyond end of device
>>>> sda1: rw=16387, want=252755893, limit=250067632
>>>> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>>>> 
>>>> Here is some more information on the partitions and LVM physical segments:
>>>> 
>>>> fdisk -l /dev/sda:
>>>> 
>>>> Device     Boot Start       End   Sectors   Size Id Type
>>>> /dev/sda1  *     2048 250069679 250067632 119.2G 8e Linux LVM
>>>> 
>>>> pvdisplay -m:
>>>> 
>>>> --- Physical volume ---
>>>> PV Name               /dev/sda1
>>>> VG Name               vg_system
>>>> PV Size               119.24 GiB / not usable <22.34 MiB
>>>> Allocatable           yes (but full)
>>>> PE Size               32.00 MiB
>>>> Total PE              3815
>>>> Free PE               0
>>>> Allocated PE          3815
>>>> PV UUID               mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
>>>> 
>>>> --- Physical Segments ---
>>>> Physical extent 0 to 1248:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	2231 to 3479
>>>> Physical extent 1249 to 1728:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	640 to 1119
>>>> Physical extent 1729 to 1760:
>>>>  Logical volume	/dev/vg_system/grml-images
>>>>  Logical extents	0 to 31
>>>> Physical extent 1761 to 2016:
>>>>  Logical volume	/dev/vg_system/swap
>>>>  Logical extents	0 to 255
>>>> Physical extent 2017 to 2047:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	3480 to 3510
>>>> Physical extent 2048 to 2687:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	0 to 639
>>>> Physical extent 2688 to 3007:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	1911 to 2230
>>>> Physical extent 3008 to 3320:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	1120 to 1432
>>>> Physical extent 3321 to 3336:
>>>>  Logical volume	/dev/vg_system/boot
>>>>  Logical extents	0 to 15
>>>> Physical extent 3337 to 3814:
>>>>  Logical volume	/dev/vg_system/btrfs
>>>>  Logical extents	1433 to 1910
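>>>> 
>>>> As a quick sanity check of the numbers above (my own arithmetic, not
>>>> from any tool): the 3815 PEs of 32 MiB cover 3815 * 65536 = 250019840
>>>> sectors, which fits inside the 250067632-sector partition, whereas the
>>>> rejected access at sector 252755893 lies beyond the partition entirely:
>>>> 
>>>>   echo $(( 3815 * 65536 ))   # 250019840 sectors covered by PEs
>>>>   # partition (fdisk):        250067632 sectors
>>>>   # rejected access (dmesg):  want=252755893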
>>>> 
>>>> 
>>>> Would btrfs even be able to accidentally trim parts of other LVs, or does this clearly hint towards an LVM/dm issue?
>>> 
>>> I can't say for sure, but (at least for the latest kernel) btrfs has a
>>> lot of extra mount-time self checks, including chunk stripe checks
>>> against the underlying device, so the possibility shouldn't be that
>>> high for btrfs.
>> 
>> Indeed, bisecting the issue led me to a range of commits that contains only dm-related and no btrfs-related changes. So I assume this is a bug in dm.
>> 
>>>> Is there an easy way to trace the trim requests through the different layers, so one can see where things go wrong?
>>> 
>>> Sure, you could use dm-log-writes.
>>> It records all reads and writes (including trims) for later replay.
>>> 
>>> So in your case, you can build the storage stack like this:
>>> 
>>> Btrfs
>>> <dm-log-writes>
>>> LUKS/dmcrypt
>>> LVM
>>> MBR partition
>>> Samsung SSD
>>> 
>>> Then replay the log (using src/log-writes/replay-log from fstests) with
>>> verbose output; that way you can verify every trim operation against
>>> the dm-crypt device size.
>>> 
>>> If all trims are fine, move dm-log-writes one layer lower, until you
>>> find which layer is causing the problem.
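>>> 
>>> Very roughly, the setup and replay look like this (the device paths
>>> are placeholders for your dm-crypt device, a spare device that holds
>>> the log, and a throw-away device to replay onto; check
>>> "replay-log --help" for the exact options):
>>> 
>>>   CRYPT=/dev/mapper/luks      # device btrfs normally sits on
>>>   LOGDEV=/dev/sdX             # spare device receiving the log
>>>   SCRATCH=/dev/sdY            # throw-away device to replay onto
>>>   dmsetup create log --table \
>>>     "0 $(blockdev --getsz $CRYPT) log-writes $CRYPT $LOGDEV"
>>> 
>>>   # mount btrfs from /dev/mapper/log, run fstrim, unmount, then:
>>>   dmsetup remove log
>>> 
>>>   # replay with verbose output and compare the DISCARD entries
>>>   # against the size reported by "blockdev --getsz $CRYPT":
>>>   ./src/log-writes/replay-log -v --log $LOGDEV --replay $SCRATCH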
>> 
>> That sounds like a plan! However, I first want to continue bisecting as I am afraid to lose my reproducer by changing parts of my storage stack.
>> 
>> Cheers,
>> Michael
>> 
>>> 
>>> Thanks,
>>> Qu
>>>> 
>>>> Cheers,
>>>> Michael
>>>> 
>>>> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.