[dm-devel] [PATCH] dm-raid: add RAID discard support

Wed Oct 1 16:00:45 UTC 2014

On Wed, Oct 1, 2014 at 6:56 AM, NeilBrown <neilb at suse.de> wrote:
> On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm at redhat.com>
> wrote:
>
>>
>> Martin,
>>
>> thanks for the good explanation of the state of the discard union.
>> Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
>>
>> I was planning to have a followup patch for dm-raid supporting a dm-raid
>> table
>> line argument to prohibit discard passdown.
>>
>> In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data
>> support
>> related to RAID4/5/6, we need that in upstream together with the initial
>> patch.
>>
>> That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6
>> table
>> lines to avoid possible data corruption but can be avoided on RAID1/10
>> table lines,
>> because the latter are not suffering from any  discard_zeroes_data flaw.
>>
>>
>> Neil,
>>
>> are you going to disable discards in RAID4/5/6 shortly
>> or rather go with your bitmap solution?
>
> Can I just close my eyes and hope it goes away?
>
> The idea of a bitmap of uninitialised areas is not a short-term solution.
> But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> would  mean that people with good sensible hardware wouldn't be able to use
> it properly.
>
> I would really rather that discard_zeroes_data were only set on devices where
> it was actually true.  Then it wouldn't be my problem any more.
>
> Maybe I could do a loud warning
>   "Not enabling DISCARD on RAID5 because we cannot trust committees.
>    Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
>    sectors as zeros"
>
> and add an appropriate module parameter......
>
> While we are on the topic, maybe I should write down my thoughts about the
> bitmap thing in case someone wants to contribute.
>
>   There are 3 states that a 'region' can be in:
>     1- known to be in-sync
>     2- possibly not in sync, but it should  be
>     3- probably not in sync, contains no valuable data.
>
>  A read from '3' should return zeroes.
>  A write to '3' should change the region to be '2'.  It could either
>     write zeros before allowing the write to start, or it could just start
>     a normal resync.
>
>  Here is a question:  if a region has been discarded, are we guaranteed that
>  reads are at least stable.  i.e. if I read twice will I definitely get the
>  same value?

Not sure with other specs, but an NVMe-compliant SSD that supports
discard (Dataset Management command with Deallocate attribute, in NVMe
parlance) is, per spec, required to be deterministic when deallocated
range is subsequently read. That's what the spec (1.1) says:

The value read from a deallocated LBA shall be deterministic;
specifically, the value returned by subsequent reads of that LBA shall
be the same until a write occurs to that LBA. The values read from a
deallocated LBA and its metadata (excluding protection information)
shall be all zeros, all ones, or the last data written to the
associated LBA and its metadata. The values read from an unwritten or
deallocated LBA’s protection information field shall be all ones
(indicating the protection information shall not be checked).

Regards,
Andrey

>
>  It would be simplest to store all the states in the one bitmap. i.e. have a
>  new version of the current bitmap (which distinguishes between '1' and '2')
>  which allows the 3rd state.  Probably just 2 bits per region, though we could
>  squeeze state for 20 regions into 32 bits if we really wanted :-)
>
>  Then we set the bitmap to all '3's when it is created and set regions to
>  '3' when discarding.
>
>  One problem with this approach is that it forces the same region size.
>  Regions for the current bitmap can conveniently be 10s of Megabytes.  For
>  state '3', we ideally want one region per stripe.
>  For 4TB drives with a 512K chunk size (and smaller is not uncommon) that is
>  8 million regions which is a megabyte of bits.
>
>  I guess that isn't all that much.  One problem with small regions for the
>  1/2 distinction is that we end up updating the bitmap more often, but
>  that could be avoided by setting multiple bits at one.
>  i.e. keep the internal counters with a larger granularity, and when a
>  counter (of outstanding writes) becomes non-zero, set quite a few bits in
>  the bitmap.
>
>  So that is probably what I would do:
>   - new version for bitmap which has 2 bits per region and encodes 3 states
>   - bitmap granularity matches chunk size (by default)
>   - decouple region size of bitmap from region size for internal 'dirty'
>     accounting
>   - write to a 'state 3' region sets it to 'state 2' and kicks off resync
>   - 'discard' sets state to '3'.
>
> NeilBrown
>
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel