[dm-devel] [RFC PATCH 1/1] dm crypt: change maximum sector size to PAGE_SIZE

Itai Handler itai.handler at gmail.com
Sun Mar 13 14:53:55 UTC 2022


Hi Milan,

On 28/11/2021 17:33, Itai Handler wrote:
> On 27/11/2021 13:19, Milan Broz wrote:
>> On 11/25/21 17:28, Itai Handler wrote:
>>> On 14/11/2021 13:56, Itai Handler wrote:
>>>> On 11/11/2021 15:07, Milan Broz wrote:
>>>>> On 10/11/2021 18:43, Itai Handler wrote:
>>>>>> Maximum sector size of dm-crypt is currently limited to 4096 bytes.
>>>>>>
>>>>>> On systems where PAGE_SIZE is larger than 4096 bytes, using larger
>>>>>> sectors can be beneficial for performance reasons.
>>>>> The limit to 4096 was set because this is the smallest possible
>>>>> page size that all platform supports.
>>>>>
>>>>> If you allow a higher size here, the device cannot be activated on a
>>>>> platform
>>>>> with the smaller page size. (Encrypted sector size becomes
>>>>> atomic sector size for all upper layers - as you mention below, not
>>>>> all fs support bigger sectors.)
>>>>>
>>>>> For LUKS, this is not acceptable - the format is portable by definition.
>>>>>
>>>>> For specific dm-crypt device, I am not sure. I would better kept
>>>>> the 4096 page size limit here.
>>>> I considered only plain dm-crypt since I am unfamiliar with LUKS.
>>>> Does LUKS assume that dm-crypt sector size is limited to 4K?
>>>> If so, maybe I'll be able to also patch LUKS regarding this issue.
>>>>
>>>>> It also depends on crypto API driver here (performance is usually
>>>>> optimized to 4k).
>>>>> What cipher and encryption mode did you use for test?
>>>> The cipher I used for the test is not publicly available but I can say
>>>> that it's performance
>>>> is not optimized to 4k blocks.
>>>> I believe that this results from the high overhead of setting up DMA
>>>> transfers. (my
>>>> cipher uses DMA to transfer data between memory and programmable logic).
>>>> There are many additional ciphers that use DMA in the tree, but I
>>>> cannot run any
>>>> benchmark with them at the moment.
>>>>
>>>> I have performed some additional benchmarks using the ARM
>>>> Cryptographic Extensions
>>>> CPU ciphers and saw that increasing block size beyond 4K does increase
>>>> performance,
>>>> albeit the performance improvement isn't as large as I've seen when
>>>> using my cipher.
>>>>
>>>> Following are "tcrypt mode=600 sec=5 num_mb=512" results for
>>>> xts-aes-ce decryption
>>>> (ARM CPU Cryptographic Extensions cipher):
>>>>    testing speed of multibuffer xts(aes) (xts-aes-ce) decryption
>>>>    ...
>>>>    trcypt: test 5 (256 bit key, 4096 byte blocks): 801792 operations in
>>>> 5 seconds (3284140032 bytes)
>>>>    ...
>>>>    trcypt: test 9 (256 bit key, 65536 byte blocks): 63488 operations in
>>>> 5 seconds (4160749568 bytes)
>>>>
>>>> That translates to:
>>>>    657 MB/s for 4K byte blocks.
>>>>    832 MB/s for 64K blocks.
>>>>
>>>> That means that there is about 27 percents improvement when
>>>> transitioning to 64K sectors,
>>>> for the cipher alone (only tcrypt benchmark).
>>>>
>>>> This benchmark had been performed on an ARM Cortex A53 CPU.
>>>> (Note that in all of my benchmarks PAGE_SIZE=64K).
>>>>
>>>>> How the number looks for random access? Linear test is usually
>>>>> misleading.
>>>>> I expect there will be big performance problem if you write small
>>>>> data chunks,
>>>>> writes and encryption will be amplified to full big sectors here...)
>>>> I understand your concern.
>>>> However my patch does not force anyone to use large sectors - it only
>>>> opens up this
>>>> possibility for those interested in that option.
>>>> This is similarly to the option to format an ext4 filesystem with 64K
>>>> sectors.
>>>> By the way: when you do that, you get a warning saying that the
>>>> filesystem
>>>> will not be usable on most systems.
>>>>
>>>> Sometime users need to store mostly large files on a filesystem, for
>>>> example for
>>>> backup or for video files.
>>>> I think that in these cases random access time is not so important.
>>>> Some users may also be able to reserve a dedicated partition for
>>>> storing such
>>>> large files.
>>>>
>>>>> (Technical detail: such pat MUST increase dm-crypt minor version.)
>>>> Thanks for pointing that out. I believe that I can prepare a v2 patch
>>>> that will
>>>> address that issue.
>>>>> Milan
>>>>>
>>>>>> This patch changes maximum sector size from 4096 bytes to PAGE_SIZE,
>>>>>> and in addition it changes the type of sector_size in
>>>>>> struct crypt_config from 'unsigned short int' to 'unsigned int', in
>>>>>> order to be able to represent larger values.
>>>>>>
>>>>>> On a prototype system which has PAGE_SIZE of 65536 bytes, I saw about
>>>>>> x2 performance improvement in sequential read throughput benchmark
>>>>>> while using only about half of the CPU usage, after simply increasing
>>>>>> sector size from 4096 to 65536 bytes.
>>>>>> I used ext4 filesystem for that benchmark, which supports 64KiB
>>>>>> sectors.
>>>>>>
>>>>>> Note: A small change should be made in cryptsetup in order to add
>>>>>> support for sectors larger than 4096 bytes.
>>>>>>
>>>>>> Signed-off-by: Itai Handler <itai.handler at gmail.com>
>>>>>> ---
>>>>>>    drivers/md/dm-crypt.c | 6 +++---
>>>>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
>>>>>> index 916b7da16de2..78c239443bd5 100644
>>>>>> --- a/drivers/md/dm-crypt.c
>>>>>> +++ b/drivers/md/dm-crypt.c
>>>>>> @@ -168,7 +168,7 @@ struct crypt_config {
>>>>>>           } iv_gen_private;
>>>>>>           u64 iv_offset;
>>>>>>           unsigned int iv_size;
>>>>>> -       unsigned short int sector_size;
>>>>>> +       unsigned int sector_size;
>>>>>>           unsigned char sector_shift;
>>>>>>
>>>>>>           union {
>>>>>> @@ -3115,9 +3115,9 @@ static int crypt_ctr_optional(struct dm_target
>>>>>> *ti, unsigned int argc, char **ar
>>>>>>                           cc->cipher_auth = kstrdup(sval, GFP_KERNEL);
>>>>>>                           if (!cc->cipher_auth)
>>>>>>                                   return -ENOMEM;
>>>>>> -               } else if (sscanf(opt_string, "sector_size:%hu%c",
>>>>>> &cc->sector_size, &dummy) == 1) {
>>>>>> +               } else if (sscanf(opt_string, "sector_size:%u%c",
>>>>>> &cc->sector_size, &dummy) == 1) {
>>>>>>                           if (cc->sector_size < (1 << SECTOR_SHIFT) ||
>>>>>> -                           cc->sector_size > 4096 ||
>>>>>> +                           cc->sector_size > PAGE_SIZE ||
>>>>>>                               (cc->sector_size & (cc->sector_size -
>>>>>> 1))) {
>>>>>>                                   ti->error = "Invalid feature value for
>>>>>> sector_size";
>>>>>>                                   return -EINVAL;
>>>>>>
>>>> I appreciate your valuable comments.
>>>>
>>>> Itai
>>>>
>>> Milan, can you comment on the above?
>> Hi,
>>
>> well, if you want my opinion, let's summarize it:
>>
>> - you tested it on a driver that is proprietary and is not in
>> the mainline kernel
> You are correct here. I also suspect that some drivers in the mainline may
> benefit as well, but I didn't have the opportunity to verify that yet.
>> - provided numbers are with tcrypt - I think this is
>> an internal crypto API test; it says nothing about performance
>> with your dm-crypt patch above it, right?
> You are correct. Please see additional benchmark results below.
>> - no numbers for random access/small file access here (note,
>> if you need to reencrypt 64k sector vs. 4k of mostly unused data,
>> it WILL have a huge performance impact).
>>
>> IMO, optimizing for big linear access is better on the fs layer
>> (and even years ago, I noticed some vendor's patches for setting
>> large fs sectors in video recording equipment).
> Interesting. Can you point me to some of those patches?
>> The same applies to encryption - I think FDE is not the best layer
>> here if you need to work with large sectors (encryption of not-used
>> sectors is just huge waste of resources - fs encrypts only used space,
>> I hope... :).
>>
>> I said that LUKS must remain multiplatform. Adding larger than
>> 4k sectors is not an option; I will undoubtedly reject that patch
>> in cryptsetup.
>> So this patch can be used only for plain dm-crypt mapping,
>> where you need to handle your key management (that could be ok, though).
> In the future I plan to submit a cryptsetup patch that will allow
> larger sectors only for the plain dm-crypt mapping. Such a change will
> not affectthe portability of LUKS.
>> Sorry, but without any performance numbers that prove this really
>> helps and does not create huge performance problems...
> Following are some benchmark results using the ARM Cryptographic
> Extensions CPU cipher "xts(aes)", on an ARM Cortex A53 CPU,
> and a 2 TB Samsung EVO 970 NVME SSD:
>
> Random read, dm crypt sector size 4K: 966 MB/s
> Random read, dm crypt sector size 64K: 1054 MB/s (9% improvement)
>
> Random write, sector size 4K: 841 MB/s
> Random write, sector size 64K: 983 MB/s (16% improvement)
>
> Following are cryptsetup and fio parameters used for this test:
> cryptsetup open --type plain --sector-size $SECTOR_SIZE \
>     -c 'capi:xts(aes)-plain64'
>     /dev/nvme0n1p1 dmcrypt0
>
> (SECTOR_SIZE is set to 4096 or 65536).
>
> fio --ioengine pvsync --filename=/dev/mapper/dmcrypt0 \
>     --readwrite=$READWRITE --bs=65536 --blockalign=65536 \
>     --iomem_align=65536 --time_based --group_reporting \
>     --runtime=50 --direct=1 --num_jobs=32 --name=plain
>
> (READWRITE is set to randread or randwrite)
>
> Note that I suspect that with ciphers that use DMA we will be able to see
> much larger improvement (similarly to what I have reported in my initial
> post), however I don't have suitable hardware for testing that at the
> moment.
> It would be nice if someone who has such hardware could test that or
> point me to publicly available hardware for buying for such an experiment.
>
> Note that this benchmark randomly reads/writes 64K files to the SSD.
> For smaller than 64KB files, the performance is indeed expected to drop
> significantly, however:
> 1. Today users can format (unencrypted) ext4 filesystems to 64K sectors.
> If they store small files on such filesystems, they probably already
> suffer from performance problems.
> 2. My patch allows to use all sector sizes between 4K and 64K
> (i.e. 4K, 8K, 16K, 32K and 64K). When we use 8K sectors, the performance
> drop for small files isn't expected to be so high and still I can show in
> that case huge performance improvement, at least when using my proprietary
> cipher (I may be able to show improvement with other hardware when I
> get my hands on such hardware).
> 3. Users might be able to partition their disk to several partitions, with
> the 64K bytes sectors partitions reserved for larger files.
>
>> NACK from me for this patch (but it is up to Mike to decide here).
>>
>> Milan
> Thanks.
> Please note that I'll be off-work from now until next Tuesday.
>
> Itai
>
I finally managed to get my hands on an Open-Q 865 SOM Development board
and have just performed a few benchmarks using the qce (qcrypt) crypto
accelerator, which is available also in the mainline kernel.

Following is a summary of these benchmarks:

+------------------+-----------+-----------+-----------+-------------+
| sector_size/test | seq_read  | rand_read | seq_write | rand_write  |
+------------------+-----------+-----------+-----------+-------------+
| 4 KiB  sectors   |  13.3     |  22.8     |  27.2     |  24         |
| 64 KiB sectors   |  581      |  582      |  586      |  588        |
+------------------+-----------+-----------+-----------+-------------+
                               (All numbers in MB/s).

These results show significant improvements:
  x43 sequential read.
  x25 random read.
  x21 sequential write.
  x24 random write.

So it appears to me that there is real benefit for enabling 64K sectors
in dm-crypt at least for the particular use case I showed (qce driver
on arm with 64K PAGE_SIZE, plain dm-crypt mode).

Notes:

1. These benchmarks have been performed using a ramdisk.
2. These benchmarks have been performed on an MSM kernel, which is based on
   linux kernel 4.19.81.
   Kernel sources are available at:
     https://source.codeaurora.org/quic/la/kernel/msm-4.19
   commit id: c815d7543db92377d85d214af0dbf1d11324726e

In all benchmarks, CONFIG_ARM64_64K_PAGES=y, the ramdisk size is 1 GB
(/dev/ram0).

For benchmarking with 4 KiB sectors, I created the dm-crypt device as
follows:

cryptsetup open --type plain --sector-size 4096 \
  -c 'capi:qcom-xts(aes)-plain64' /dev/ram0 dmcrypt0

For 64 KiB sectors, simply replace 4096 with 65536.

Following are the benchmark commands:
fio --filename=/dev/mapper/dmcrypt0 --readwrite=read --bs=1048576 \
  --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \
  --ioengine=libaio --name=plain

fio --filename=/dev/mapper/dmcrypt0 --readwrite=randread --bs=1048576 \
  --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \
  --ioengine=libaio --name=plain

fio --filename=/dev/mapper/dmcrypt0 --readwrite=write --bs=1048576 \
  --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \
  --ioengine=libaio --name=plain

fio --filename=/dev/mapper/dmcrypt0 --readwrite=randwrite --bs=1048576 \
  --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \
  --ioengine=libaio --name=plain

- Itai



More information about the dm-devel mailing list