I have added vdo-devel to the conversation: https://www.redhat.com/archives/vdo-devel/2019-April/msg00017.html

Here is some more info to describe the specific issue:

A dm-thin volume is configured with a chunk/block size, between 64KiB and 1GiB, that determines the minimum allocation size it can track. If an application performs a write to a dm-thin block device and that IO operation completely overlaps a thin block, dm-thin will skip zeroing the block after allocation, before performing the write. This is a big performance optimization, as it effectively halves the IO for large sequential writes. When a block device has a snapshot, the data is referenced by both the original block and the snapshot. If a write is issued, dm-thin will normally allocate a new chunk, copy the old data to that new chunk, then perform the write. If the new write completely overlaps a chunk, it will skip the copy.

So, for example, a dm-thin block device is created in a thin pool with a 512k block size, and an application performs a 4k sequential write at the beginning of the volume, which requires provisioning a new block. dm-thin will do the following:

1) allocate a 512k block
2) write zeros to the block
3) perform the 4k write

This does 516k of writes for a 4k write (ouch). If the write had been at least 512k, dm-thin would skip the zeroing and just do the write.

Similarly, assume there is a dm-thin block device with a snapshot and the data is shared between the two. Again the application performs a 4k write:

1) allocate a new 512k block
2) copy 512k from the old block to the new one
3) perform the 4k write

This does 512k in reads and 516k in writes (big ouch). If the write had been at least 512k, dm-thin would skip all of that overhead.

Now fast forward to VDO. Normally the maximum IO size is determined by the max_sectors_kb setting in /sys/block/DEVICE/queue. This value is inherited by stacked DM devices and can be raised by the user up to the hardware limit, max_hw_sectors_kb, which also appears to be inherited by stacked DM devices. VDO sets this value to 4k, which in turn forces all layers stacked above it to a 4k maximum. If you take my previous examples but place VDO beneath the dm-thin volume, all IO, sequential or otherwise, will be split down to 4k, which completely eliminates the performance optimizations that dm-thin provides.

1) Is this known behavior?
2) Is there a possible workaround?

Below are two quick sketches: one modeling the write amplification above, and one for checking how the queue limits propagate down a device stack.
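To make the arithmetic concrete, here is a rough back-of-the-envelope model of the behavior described above (my own sketch, not dm-thin's actual code). It assumes the write starts on a chunk boundary, that every chunk it touches is either unprovisioned or shared with a snapshot, and that the zero/copy step is only skipped when a single bio can cover a whole chunk:

#!/usr/bin/env python3
# Rough model of dm-thin provisioning overhead as described above.
# Not dm-thin code; just the arithmetic from the examples.

def model_io(write_size, chunk_size=512 * 1024, max_io=4 * 1024, snapshot=False):
    """Return (bytes_read, bytes_written) issued to the data device for one
    sequential write of write_size bytes starting at a chunk boundary."""
    reads = writes = 0
    offset = 0
    while offset < write_size:
        in_chunk = min(chunk_size, write_size - offset)  # part of the write in this chunk
        # Zeroing/COW is only skipped when one bio covers the whole chunk,
        # which cannot happen if bios are split to max_io < chunk_size.
        covers_chunk = in_chunk == chunk_size and max_io >= chunk_size
        if not covers_chunk:
            if snapshot:
                reads += chunk_size   # read the shared data
                writes += chunk_size  # copy it into the new chunk
            else:
                writes += chunk_size  # zero the freshly allocated chunk
        writes += in_chunk            # the application data itself
        offset += in_chunk
    return reads, writes

if __name__ == "__main__":
    for max_io in (512 * 1024, 4 * 1024):
        r, w = model_io(512 * 1024, max_io=max_io)
        print(f"512k write, max_io={max_io // 1024}k: {r // 1024}k read, {w // 1024}k written")

For a 512k sequential write the model gives 512k of writes when a bio is allowed to be 512k, and 1024k of writes (plus a 512k read in the snapshot case) when bios are split to 4k, since the zeroing or copy is never skipped.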
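And here is a quick sketch, assuming only the standard block-layer sysfs attributes, for checking how max_sectors_kb and max_hw_sectors_kb propagate through a stacked device. The device name is just whatever lsblk or dmsetup ls reports on your system:

#!/usr/bin/env python3
# Print queue limits for a block device and everything it is stacked on,
# by following /sys/block/<dev>/slaves recursively.
import sys
from pathlib import Path

def queue_limits(dev, indent=0):
    queue = Path("/sys/block", dev, "queue")
    max_kb = (queue / "max_sectors_kb").read_text().strip()
    max_hw_kb = (queue / "max_hw_sectors_kb").read_text().strip()
    print(f"{' ' * indent}{dev}: max_sectors_kb={max_kb} max_hw_sectors_kb={max_hw_kb}")
    for slave in sorted(p.name for p in Path("/sys/block", dev, "slaves").glob("*")):
        queue_limits(slave, indent + 2)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} <device name, e.g. dm-3>")
    queue_limits(sys.argv[1])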
On Tue, Apr 23, 2019 at 6:11 AM Zdenek Kabelac <zkabelac@redhat.com> wrote:
> On 19. 04. 19 at 16:40, Ryan Norwood wrote:
> > We have been using dm-thin layered above VDO and have noticed that our
> > performance is not optimal for large sequential writes, as max_sectors_kb
> > and max_hw_sectors_kb for all thin devices are set to 4k due to the VDO
> > layer beneath.
> >
> > This effectively eliminates the performance optimizations for sequential
> > writes that skip both the zeroing and the COW overhead when a write fully
> > overlaps a thin chunk, as all bios are split into 4k, which will always be
> > less than the 64k thin chunk minimum.
> >
> > Is this known behavior? Is there any way around this issue?
>
> Hi
>
> If you require the highest performance, I'd suggest avoiding VDO.
> VDO trades performance for better space utilization.
> It works on 4KiB blocks, so by design it is going to be slow.
>
> I'd also probably not mix two provisioning technologies together - there
> is a nontrivial number of problematic states when the whole device stack
> runs out of real physical space.
>
> Regards
>
> Zdenek