[dm-devel] [RFC PATCH 0/4] dm mpath: vastly improve blk-mq IO performance

Thu Mar 31 20:04:22 UTC 2016

I developed these changes some weeks ago but have since focused on
regression and performance testing on larger NUMA systems.

For regression testing I've been using mptest:
https://github.com/snitm/mptest

For performance testing I've been using a null_blk device (with
various configuration permutations, e.g. pinning memory to a
particular NUMA node, and varied number of submit_queues).

By eliminating multipath's heavy use of the m->lock spinlock in the
fast IO paths, serious performance improvements are realized.

Overview of performance test setup:
===================================

NULL_BLK_HW_QUEUES=12
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=12
DM_MQ_QUEUE_DEPTH=2048

FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

NID=0

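# FIO is assumed to hold the path to the fio binary (set outside this excerpt).
run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename "${DEVICE}")
    PERF_RECORD=$2   # optional; not used in this excerpt
    # Random 4k direct reads, with CPUs and memory bound to NUMA node ${NID}.
    RUN_CMD="${FIO} --numa_cpu_nodes=${NID} --numa_mem_policy=bind:${NID} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"
    # Left unquoted on purpose so the command string is split into arguments.
    ${RUN_CMD}
}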

modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}
run_fio /dev/nullb0

echo ${NID} > /sys/module/dm_mod/parameters/dm_numa_node
echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/dm_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/dm_mq_nr_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

echo "0 8388608 multipath 0 0 1 1 queue-length 0 1 1 /dev/nullb0 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

echo "0 8388608 multipath 0 0 1 1 round-robin 0 1 1 /dev/nullb0 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq
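
The 8388608 in each dm table line above is the device length in
512-byte sectors, which matches the gb=4 passed to null_blk. A minimal
sketch of that arithmetic follows. The helper name gb_to_sectors is
made up for illustration, and it assumes gb= is counted in GiB:

```shell
# Hypothetical helper: convert a null_blk gb= value to 512-byte sectors.
# Assumes "gb" means GiB (1024^3 bytes), which reproduces the 8388608
# used in the dm tables above.
gb_to_sectors() {
    echo $(( $1 * 1024 * 1024 * 1024 / 512 ))
}

gb_to_sectors 4

# On a live system the same number can be read from the device itself:
#   blockdev --getsz /dev/nullb0
```

If gb= is changed, the table length has to change with it, otherwise
dmsetup create will fail or the target will only cover part of the device.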

Test results on 4 NUMA node 192-way x86_64 system with 524G of memory:
======================================================================

The big picture is that the move to lockless really helps.

round-robin's repeat_count and percpu current_path code (which went
upstream during the 4.6 merge window) seems to _really_ help (even if
repeat_count is 1, as is the case in all these results).

Below, the 4 results in each named file (e.g. "result.lockless_pinned") are, in order:
raw null_blk
service-time
queue-length
round-robin

The files with the trailing "_12" mean:
NULL_BLK_HW_QUEUES=12
DM_MQ_HW_QUEUES=12
FIO_NUMJOBS=12

And the files without "_12" mean:
NULL_BLK_HW_QUEUES=32
DM_MQ_HW_QUEUES=32
FIO_NUMJOBS=32

lockless: (this patchset applied)
*********
result.lockless_pinned:  read : io=236580MB, bw=23656MB/s, iops=6055.9K, runt= 10001msec
result.lockless_pinned:  read : io=108536MB, bw=10853MB/s, iops=2778.3K, runt= 10001msec
result.lockless_pinned:  read : io=106649MB, bw=10664MB/s, iops=2729.1K, runt= 10001msec
result.lockless_pinned:  read : io=162906MB, bw=16289MB/s, iops=4169.1K, runt= 10001msec

result.lockless_pinned_12:  read : io=165233MB, bw=16522MB/s, iops=4229.6K, runt= 10001msec
result.lockless_pinned_12:  read : io=96686MB, bw=9667.7MB/s, iops=2474.1K, runt= 10001msec
result.lockless_pinned_12:  read : io=97197MB, bw=9718.8MB/s, iops=2488.3K, runt= 10001msec
result.lockless_pinned_12:  read : io=104509MB, bw=10450MB/s, iops=2675.2K, runt= 10001msec

result.lockless_unpinned:  read : io=101525MB, bw=10151MB/s, iops=2598.8K, runt= 10001msec
result.lockless_unpinned:  read : io=61313MB, bw=6130.8MB/s, iops=1569.5K, runt= 10001msec
result.lockless_unpinned:  read : io=64892MB, bw=6488.6MB/s, iops=1661.8K, runt= 10001msec
result.lockless_unpinned:  read : io=78557MB, bw=7854.1MB/s, iops=2010.9K, runt= 10001msec

result.lockless_unpinned_12:  read : io=83455MB, bw=8344.7MB/s, iops=2136.3K, runt= 10001msec
result.lockless_unpinned_12:  read : io=50638MB, bw=5063.4MB/s, iops=1296.3K, runt= 10001msec
result.lockless_unpinned_12:  read : io=56103MB, bw=5609.8MB/s, iops=1436.1K, runt= 10001msec
result.lockless_unpinned_12:  read : io=56421MB, bw=5641.6MB/s, iops=1444.3K, runt= 10001msec

spinlock:
*********
result.spinlock_pinned:  read : io=236048MB, bw=23602MB/s, iops=6042.3K, runt= 10001msec
result.spinlock_pinned:  read : io=64657MB, bw=6465.4MB/s, iops=1655.5K, runt= 10001msec
result.spinlock_pinned:  read : io=67519MB, bw=6751.2MB/s, iops=1728.4K, runt= 10001msec
result.spinlock_pinned:  read : io=81409MB, bw=8140.4MB/s, iops=2083.9K, runt= 10001msec

result.spinlock_pinned_12:  read : io=159782MB, bw=15977MB/s, iops=4090.3K, runt= 10001msec
result.spinlock_pinned_12:  read : io=64368MB, bw=6436.2MB/s, iops=1647.7K, runt= 10001msec
result.spinlock_pinned_12:  read : io=67337MB, bw=6733.5MB/s, iops=1723.7K, runt= 10001msec
result.spinlock_pinned_12:  read : io=75453MB, bw=7544.6MB/s, iops=1931.5K, runt= 10001msec

result.spinlock_unpinned:  read : io=103267MB, bw=10326MB/s, iops=2643.4K, runt= 10001msec
result.spinlock_unpinned:  read : io=34751MB, bw=3474.8MB/s, iops=889526, runt= 10001msec
result.spinlock_unpinned:  read : io=34475MB, bw=3447.2MB/s, iops=882477, runt= 10001msec
result.spinlock_unpinned:  read : io=43793MB, bw=4378.1MB/s, iops=1121.0K, runt= 10001msec

result.spinlock_unpinned_12:  read : io=83573MB, bw=8356.5MB/s, iops=2139.3K, runt= 10001msec
result.spinlock_unpinned_12:  read : io=32715MB, bw=3271.2MB/s, iops=837414, runt= 10001msec
result.spinlock_unpinned_12:  read : io=34249MB, bw=3424.6MB/s, iops=876675, runt= 10001msec
result.spinlock_unpinned_12:  read : io=41486MB, bw=4148.3MB/s, iops=1061.1K, runt= 10001msec
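
Note that fio prints some iops values with a "K" suffix (e.g. 2643.4K)
and others as plain counts (e.g. 889526), so they need normalizing
before they can be compared. A rough sketch of one way to do that
follows. The function name iops_k is made up here, and it only assumes
the "iops=" field layout visible in the lines above:

```shell
# Hypothetical helper: read fio summary lines on stdin and print each
# iops value in thousands, whether or not fio used the "K" suffix.
iops_k() {
    awk '{
        if (match($0, /iops=[0-9.]+K?/)) {
            # Skip the 5 characters of "iops=" to get the value itself.
            v = substr($0, RSTART + 5, RLENGTH - 5)
            if (v ~ /K$/) { sub(/K$/, "", v); printf "%.1f\n", v }
            else          { printf "%.1f\n", v / 1000 }
        }
    }'
}

echo 'result.spinlock_unpinned:  read : io=34751MB, bw=3474.8MB/s, iops=889526, runt= 10001msec' | iops_k
echo 'result.spinlock_unpinned:  read : io=103267MB, bw=10326MB/s, iops=2643.4K, runt= 10001msec' | iops_k
```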

Summary:
========
Pinning this test to a particular NUMA node helps.  As does using more
queues/threads -- which is a nice advance because previously DM mpath
really hit a wall.

What makes these favorable results possible is switching over to
bitops, atomic counters and lockless_dereference.

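Comparing result.lockless_pinned vs result.spinlock_pinned you can see
that, for the three multipath path selectors, the spinlock results are
roughly 37 to 50% lower in IOPS and bandwidth.  Put the other way, this
patchset delivers an improvement of roughly 58 to 100% over the
spinlock code (raw null_blk is essentially unchanged).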
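
As a rough cross-check, the relative gain can be recomputed from the
multipath rows of result.lockless_pinned and result.spinlock_pinned.
The sketch below uses the spinlock number as the baseline; taking the
lockless number as the baseline gives a smaller percentage for the
same data. The function name gain is made up for illustration, and the
inputs are the iops values (in K) copied from the results above:

```shell
# Hypothetical helper: percentage gain of "new" over "old" IOPS,
# with "old" (the spinlock result) as the baseline.
gain() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", (new / old - 1) * 100 }'
}

gain 2778.3 1655.5   # service-time: lockless_pinned vs spinlock_pinned
gain 2729.1 1728.4   # queue-length
gain 4169.1 2083.9   # round-robin
```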

Jeff Moyer has been helping review these changes (and has graciously
labored over _really_ understanding all the concurrency at play in DM
mpath) -- his review isn't yet complete but I wanted to get this
patchset out now to raise awareness about how I think DM multipath
will be changing (for inclusion during the Linux 4.7 merge window).

Mike Snitzer (4):
  dm mpath: switch to using bitops for state flags
  dm mpath: use atomic_t for counting members of 'struct multipath'
  dm mpath: move trigger_event member to the end of 'struct multipath'
  dm mpath: eliminate use of spinlock in IO fast-paths

 drivers/md/dm-mpath.c | 351 ++++++++++++++++++++++++++++----------------------
 1 file changed, 195 insertions(+), 156 deletions(-)

-- 
2.6.4 (Apple Git-63)