[dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
Mikulas Patocka
mpatocka at redhat.com
Tue Aug 10 18:21:13 UTC 2021
Reviewed-by: Mikulas Patocka <mpatocka at redhat.com>
On Sun, 8 Aug 2021, Arne Welzel wrote:
> On many-core systems using dm-crypt, heavy spinlock contention in
> percpu_counter_compare() can be observed when the dmcrypt page allocation
> limit for a given device is reached or close to being reached. This is due
> to percpu_counter_compare() taking a spinlock to compute an exact
> result on potentially many CPUs at the same time.
>
> Switch to non-exact comparison of allocated and allowed pages by using
> the value returned by percpu_counter_read_positive().
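
To make the difference between the two read paths concrete, here is a
minimal single-threaded userspace sketch of the batching idea. It is a model
only, not the kernel implementation: the names model_counter, model_add,
model_read_positive and model_sum, as well as the CPU count of 4, are
hypothetical and chosen for illustration. The batch of 32 follows the bound
mentioned in the patch description.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4   /* hypothetical CPU count for this model */
#define MODEL_BATCH   32  /* batch bound quoted in the patch description */

/* Simplified, single-threaded stand-in for a per-CPU counter. */
struct model_counter {
	int64_t count;                 /* shared value (lock-protected in the kernel) */
	int32_t local[MODEL_NR_CPUS];  /* per-CPU deltas not yet folded into count */
};

/* Add amount on behalf of cpu; fold into count once |delta| reaches the batch. */
static void model_add(struct model_counter *c, int cpu, int32_t amount)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	int32_t delta = c->local[cpu] + amount;

	if (delta >= MODEL_BATCH || delta <= -MODEL_BATCH) {
		c->count += delta;     /* the real counter takes its spinlock here */
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = delta; /* stays CPU-local, invisible to cheap reads */
	}
}

/* Cheap read: shared value only, clamped at zero. No per-CPU walk. */
static int64_t model_read_positive(const struct model_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Exact read: adds every CPU's pending delta (done under the lock in the kernel). */
static int64_t model_sum(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

In this model, after 31 single-page increments on each of the 4 CPUs the
cheap read still reports 0 while the exact sum is 124, which is the
(batch - 1) * CPUs deviation the patch description refers to. One more
increment on any CPU folds that CPU's delta into the shared value.
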
>
> This may over- or underestimate the actual number of allocated pages by at
> most (batch-1) * num_online_cpus() (assuming my understanding of the
> percpu_counter logic is correct).
>
> Currently, batch is bounded by 32. The system on which this issue was
> first observed has 256 CPUs and 512G of RAM. With a 4k page size, this
> change may over/under estimate by 31MB. With ~10G (2%) allowed for dmcrypt
> allocations, this seems an acceptable error. Certainly preferred over
> running into the spinlock contention.
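
The 31MB figure can be checked with a few lines of arithmetic. The sketch
below only restates the bound given in the patch description; the helper
names max_error_pages and max_error_bytes are hypothetical and do not exist
in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case deviation, in pages, of the approximate read from the exact sum. */
static uint64_t max_error_pages(uint64_t batch, uint64_t online_cpus)
{
	assert(batch >= 1);
	return (batch - 1) * online_cpus;
}

/* The same deviation expressed in bytes for a given page size. */
static uint64_t max_error_bytes(uint64_t batch, uint64_t online_cpus,
				uint64_t page_size)
{
	return max_error_pages(batch, online_cpus) * page_size;
}
```

With batch = 32, 256 CPUs and 4096-byte pages this gives 7936 pages, or
32505856 bytes, which is exactly 31 MiB. Against the ~10G allowance
mentioned above, that is roughly 0.3% of the limit.
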
>
> This behavior was separately/artificially reproduced on an EC2 c5.24xlarge
> instance with 96 CPUs and 192GB RAM as follows, but can be
> provoked on systems with fewer available CPUs.
>
> * Disable swap
> * Tune vm settings to promote regular writeback
> $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
> $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
> $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
>
> * Create 8 dmcrypt devices based on files on a tmpfs
> * Create and mount an ext4 filesystem on each crypt device
> * Run stress-ng --hdd 8 within one of the above filesystems
>
> Total %system usage shown via sysstat goes to ~35%, write throughput on the
> underlying loop device is ~2GB/s. perf profiling an individual kworker
> kcryptd thread shows the following in the profile, indicating it hits
> heavy spinlock contention in percpu_counter_compare():
>
> 99.98% 0.00% kworker/u193:46 [kernel.kallsyms] [k] ret_from_fork
> |
> ---ret_from_fork
> kthread
> worker_thread
> |
> --99.92%--process_one_work
> |
> |--80.52%--kcryptd_crypt
> | |
> | |--62.58%--mempool_alloc
> | | |
> | | --62.24%--crypt_page_alloc
> | | |
> | | --61.51%--__percpu_counter_compare
> | | |
> | | --61.34%--__percpu_counter_sum
> | | |
> | | |--58.68%--_raw_spin_lock_irqsave
> | | | |
> | | | --58.30%--native_queued_spin_lock_slowpath
> | | |
> | | --0.69%--cpumask_next
> | | |
> | | --0.51%--_find_next_bit
> | |
> | |--10.61%--crypt_convert
> | | |
> | | |--6.05%--xts_crypt
> ...
>
> After applying this change, %system usage is lowered to ~7% and
> write throughput on the loop device increases to 2.7GB/s.
> The profile shows mempool_alloc() at ~8% rather than ~62%, and the
> percpu_counter spinlock is no longer hit.
>
> |--8.15%--mempool_alloc
> | |
> | |--3.93%--crypt_page_alloc
> | | |
> | | --3.75%--__alloc_pages
> | | |
> | | --3.62%--get_page_from_freelist
> | | |
> | | --3.22%--rmqueue_bulk
> | | |
> | | --2.59%--_raw_spin_lock
> | |
> | | --2.57%--native_queued_spin_lock_slowpath
> | |
> | --3.05%--_raw_spin_lock_irqsave
> | |
> | --2.49%--native_queued_spin_lock_slowpath
>
> Suggested-by: DJ Gregor <dj at corelight.com>
> Signed-off-by: Arne Welzel <arne.welzel at corelight.com>
> ---
> drivers/md/dm-crypt.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 50f4cbd600d5..2ae481610f12 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2661,7 +2661,12 @@ static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
> struct crypt_config *cc = pool_data;
> struct page *page;
>
> - if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
> + /*
> + * Note, percpu_counter_read_positive() may over (and under) estimate
> + * the current usage by at most (batch - 1) * num_online_cpus() pages,
> + * but avoids potential spinlock contention of an exact result.
> + */
> + if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) > dm_crypt_pages_per_client) &&
> likely(gfp_mask & __GFP_NORETRY))
> return NULL;
>
> --
> 2.20.1
>