[dm-devel] [RFC PATCH 00/20] dm-crypt: parallel processing

Tue Aug 21 09:08:15 UTC 2012

Hello,

I promised to Mikulas that I will post remaining of his dm-crypt parallel
processing patchset (plus some related changes) with some comments.

The problem:

The current implementation (using per cpu struct) always use encryption
on the same CPU which submitted IO.

With very fast storage (usually SSD or MD RAID0) one CPU core can
be limiting and the throughput of encrypted disk is worse in
comparison with underlying storage.

Idea here is to distribute encryption to other CPU cores/CPUs.

(Side effect of patches is nice clean up dmcrypt code. :)

Mikulas Patocka (20):
  dm-crypt: remove per-cpu structure
  dm-crypt: use unbound workqueue for request processing
  dm-crypt: remove completion restart
  dm-crypt: use encryption threads
  dm-crypt: Unify spinlock
  dm-crypt: Introduce an option that sets the number of threads.
  dm-crypt: don't use write queue
  dm-crypt: simplify io queue
  dm-crypt: unify io_queue and crypt_queue
  dm-crypt: don't allocate pages for a partial request.
  dm-crypt: avoid deadlock in mempools
  dm-crypt: simplify cc_pending
  dm-crypt merge convert_context and dm_crypt_io
  dm-crypt: move error handling to crypt_convert.
  dm-crypt: remove io_pending field
  dm-crypt: small changes
  dm-crypt: move temporary values to stack
  dm-crypt: offload writes to thread
  dm-crypt: retain write ordering
  dm-crypt: sort writes

 drivers/md/dm-crypt.c |  838 +++++++++++++++++++++++++++----------------------
 1 file changed, 464 insertions(+), 374 deletions(-)

My notes:

I extensively tested this (on top of 3.6.0-rc2) and while I like
simplification and the main logic (if we have enough power why
not use other CPUs) I see several problems here.

1) The implementation is not much useful on modern CPUs with hw accelerated
AES (with AES-NI even one core can saturate very fast storage).
(Some optimized crypto behaves similar, like twofish optimized modules.)

2) The patchset targets linear access pattern mainly and one
process generating IOs. (With more processes/CPUs generating IOs
you get parallel processing even with current code.)

I can see significant improvement (~30%) for linear read
(if not using AES-NI) and if underlying storage is zero target
(ideal situation removing scheduler from the stack).

For random access pattern (and more IO producers) I cannot find
reliable improvement except ideal zero target case (where improvement
is always >30%). For more producers it doesn't help at all.
I tested RAID0 over SATA disks and very fast SSD on quad core cpu,
dmcrypt running with 1, 3 or 4 threads (and with cfq and noop scheduler)
using fio threaded test with 1 or 4 threads.

Notes to implementation:

1) Last two patches (19/20) provides sorting of IO requests, this
logically should be done in IO scheduler.

I don't think this should be in dmcrypt, if scheduler doesn't work
properly, it should be fixed or tuned for crypt access pattern.

2) Could be kernel workqueue used/fixed here instead? Basically all it needs
is to prefer submitting CPU, if it is busy just move work to another CPU.

3) It doesn't honour CPU hotplugging. Number of CPU is determined
in crypt constructor. If hotplugging is used for CPU power saving,
it will not work properly.

4) With debugging options on, everything is extremely slower
(Perhaps bug in some debug option, I just noticed this on Fedora
rawhide with all debug on.)

I posted it mainly because I think this should move forward,
whatever direction...

Milan