[dm-devel] Proper way to test RAID456?

Wed Jan 12 16:56:12 UTC 2022

On Sun, 9 Jan 2022 20:13:36 +0800
Qu Wenruo <quwenruo.btrfs at gmx.com> wrote:

> On 2022/1/9 18:04, David Woodhouse wrote:
> > On Sun, 2022-01-09 at 07:55 +0800, Qu Wenruo wrote:  
> >> On 2022/1/9 04:29, Lukas Straub wrote:  
> >>> But there is an even simpler solution for btrfs: It could just not touch
> >>> stripes that already contain data.  
> >>
> >> That would waste a lot of space, if the fs is fragmented.
> >>
> >> Or we have to write into data stripes when free space is low.
> >>
> >> That's why I'm trying to implement a PPL-like journal for btrfs RAID56.  
> >
> > PPL writes the P/Q of the unmodified chunks from the stripe, doesn't
> > it?  
> 
> Did I miss something or the PPL isn't what I thought?
> 
> I thought PPL either:
> 
> a) Just write a metadata entry into the journal to indicate a full
>     stripe (along with its location) is going to be written.
> 
> b) Write a metadata entry into the journal about a non-full stripe
>     write, then write the new data and new P/Q into the journal
> 
> And this is before we start any data/P/Q write.
> 
> And after related data/P/Q write is finished, remove corresponding
> metadata and data entry from the journal.
> 
> Or does PPL have an even better solution?

Yes, PPL is a bit better than a journal as you described it (md
supports both): a journal would need to be replicated to multiple
devices (raid1) in the array, while the PPL is only written to the
drive containing the parity for the particular stripe. And since the
parity is distributed across all drives, the PPL overhead is also
distributed across all drives. However, PPL only works for raid5, as
you'll see.

PPL works like this:

Before any data/parity write, either:

 a) Just write a metadata entry into the PPL on the parity drive to
    indicate a full stripe (along with its location) is going to be
    written.

 b) Write a metadata entry into the PPL on the parity drive about a
    non-full stripe write, including which data chunks are going to be
    modified, then write the XOR of the chunks not modified by this
    write into the PPL.
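
The two cases above can be sketched roughly as follows. This is only a
toy model, not md's on-disk format: the function names, the dict layout
and the whole-chunk granularity are assumptions made for illustration.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of a non-empty list of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def ppl_entry(stripe, modified):
    """Build a toy PPL entry before writing to one raid5 stripe.

    stripe:   current data chunks of the stripe (parity excluded)
    modified: set of indices of the data chunks the write will overwrite
    """
    if len(modified) == len(stripe):
        # case a) full-stripe write: only metadata is logged
        return {"modified": sorted(modified), "partial_parity": None}
    # case b) log the XOR of the chunks this write does NOT touch
    untouched = [c for i, c in enumerate(stripe) if i not in modified]
    return {"modified": sorted(modified),
            "partial_parity": xor_chunks(untouched)}
```

For a three-chunk stripe where only chunk 0 is rewritten, the logged
partial parity is simply chunk 1 XOR chunk 2.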

To recover an inconsistent array with a lost drive:

In case a), the stripe consists only of newly written data, so it will
be affected by the write-hole (this is the trade-off that PPL makes), so
just do standard parity recovery.

In case b), XOR what we wrote to the PPL (the XOR of chunks not
modified) with the modified data chunks to get our new (consistent)
parity. Then do standard parity recovery. This only works if we lost an
unmodified data chunk.
If we lost a modified data chunk, this is not possible, so just do
standard parity recovery from the beginning. Again, the newly written
data is affected by the write-hole but existing data is not.
If we lost the parity drive (containing the PPL), there is no need to
recover since all the data chunks are present.
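
A rough sketch of these recovery rules, again as a toy model with
hypothetical names and whole-chunk granularity (the real code in
drivers/md/raid5-ppl.c works per block and looks quite different):

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of a non-empty list of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def recover_lost_chunk(on_disk, lost, modified, partial_parity, parity):
    """Rebuild one lost data chunk of a raid5 stripe after a crash.

    on_disk:        dict index -> bytes of the surviving data chunks
    lost:           index of the lost data chunk, or None if the drive
                    holding the parity (and the PPL) was lost
    modified:       set of chunk indices recorded in the PPL entry
    partial_parity: XOR of the unmodified chunks from the PPL entry,
                    or None for a full-stripe write (case a)
    parity:         parity chunk as found on disk (possibly stale)
    """
    if lost is None:
        return None  # all data chunks are present, nothing to rebuild
    if partial_parity is not None and lost not in modified:
        # case b), unmodified chunk lost: recompute a consistent parity
        # from the PPL and whatever the modified chunks contain now
        parity = xor_chunks([partial_parity] + [on_disk[i] for i in modified])
    # standard parity recovery (with the on-disk parity in all other cases)
    return xor_chunks([parity] + list(on_disk.values()))
```

With a stale parity on disk, rebuilding a lost unmodified chunk this way
still returns its old contents, whereas plain parity recovery would
return garbage for it.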

Of course, this was a simplified explanation, see drivers/md/raid5-ppl.c
for details (it has good comments with examples). This also covers the
case where a data chunk is only partially modified and the unmodified
part of the chunk also needs to be protected (by working on a per-block
basis instead of per-chunk).

PPL is not possible for raid6 AFAIK, because there it could happen
that you lose both a modified data chunk and an unmodified data chunk.

Regards,
Lukas Straub

> >
> > An alternative in a true file system which can do its own block
> > allocation is to just calculate the P/Q of the final stripe after it's
> > been modified, and write those (and the updated data) out to newly-
> > allocated blocks instead of overwriting the original.  
> 
> This is what Johannes is considering, but for a different purpose.
> Johannes' idea is to support zoned devices, as the physical location of
> a zoned append write will only be known after it's written.
> 
> So his idea is to maintain another mapping tree for zoned write, so that
> full stripe update will also happen in that tree.
> 
> But that idea is still in the future, on the other hand I still prefer
> some tried-and-true method, as I'm 100% sure there will be new
> difficulties waiting for us with the new mapping tree method.
> 
> Thanks,
> Qu
> 
> >
> > Then the final step is to free the original data blocks and P/Q.
> >
> > This means that your RAID stripes no longer have a fixed topology; you
> > need metadata to be able to *find* the component data and P/Q chunks...
> > it ends up being non-trivial, but it has attractive properties if we
> > can work it out.  
