[dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

Tue Mar 8 20:48:17 UTC 2022

On 3/1/22 23:32, Chaitanya Kulkarni wrote:
> Nikos,
> 
>>> [8] https://kernel.dk/io_uring.pdf
>>
>> I would like to participate in the discussion too.
>>
>> The dm-clone target would also benefit from copy offload, as it heavily
>> employs dm-kcopyd. I have been exploring redesigning kcopyd in order to
>> achieve increased IOPS in dm-clone and dm-snapshot for small copies over
>> NVMe devices, but copy offload sounds even more promising, especially
>> for larger copies happening in the background (as is the case with
>> dm-clone's background hydration).
>>
>> Thanks,
>> Nikos
> 
> If you can document your findings here it will be great for me to
> add it to the agenda.
> 

My work focuses mainly on improving the IOPs and latency of the
dm-snapshot target, in order to bring the performance of short-lived
snapshots as close as possible to bare-metal performance.

My initial performance evaluation of dm-snapshot had revealed a big
performance drop, while the snapshot is active; a drop which is not
justified by COW alone.

Using fio with blktrace I had noticed that the per-CPU I/O distribution
was uneven. Although many threads were doing I/O, only a couple of the
CPUs ended up submitting I/O requests to the underlying device.

The same issue also affects dm-clone, when doing I/O with sizes smaller
than the target's region size, where kcopyd is used for COW.

The bottleneck here is kcopyd serializing all I/O. Users of kcopyd, such
as dm-snapshot and dm-clone, cannot take advantage of the increased I/O
parallelism that comes with using blk-mq in modern multi-core systems,
because I/Os are issued only by a single CPU at a time, the one on which
kcopyd’s thread happens to be running.

So, I experimented redesigning kcopyd to prevent I/O serialization by
respecting thread locality for I/Os and their completions. This made the
distribution of I/O processing uniform across CPUs.

My measurements had shown that scaling kcopyd, in combination with
scaling dm-snapshot itself [1] [2], can lead to an eventual performance
improvement of ~300% increase in sustained throughput and ~80% decrease
in I/O latency for transient snapshots, over the null_blk device.

The work for scaling dm-snapshot has been merged [1], but,
unfortunately, I haven't been able to send upstream my work on kcopyd
yet, because I have been really busy with other things the last couple
of years.

I haven't looked into the details of copy offload yet, but it would be
really interesting to see how it affects the performance of random and
sequential workloads, and to check how, and if, scaling kcopyd affects
the performance, in combination with copy offload.

Nikos

[1] https://lore.kernel.org/dm-devel/20190317122258.21760-1-ntsironis@arrikto.com/
[2] https://lore.kernel.org/dm-devel/425d7efe-ab3f-67be-264e-9c3b6db229bc@arrikto.com/