[vfio-users] Best pinning strategy for latency / performance trade-off

Jan Wiele jan at wiele.org
Thu Feb 2 18:34:34 UTC 2017


Hi Thomas,

awesome work! I've changed my gaming setup (2x Xeon E5-2670, 8 real cores
per CPU) to the following:

VM1 and VM2:
Each gets 4 real cores on CPU0; the emulator threads are pinned to the
respective hyper-threading siblings.

VM3:
6 real cores on CPU1; the emulator threads are pinned to the respective
hyper-threading siblings.

Host:
2 real cores on CPU1, plus 2 hyper-threaded cores.


I've chosen this layout ("Low latency setup") since it best fits my
hardware. Alternatively I could have pinned the emulator threads to the
host ("Balanced setup, emulator with host"), but that would result in some
cross-node traffic, which I wanted to avoid. Additionally, some benchmarks
show that hyper-threading does not improve gaming performance by much [1].

With my new setup I ran DPC Latency Checker [2] and saw timings around
1000us on all three VMs. However, LatencyMon [3] showed much lower values
(<100us) most of the time. Can the two be compared?

LatencyMon also showed me that the USB2 driver has a long ISR. Switching
the controller to USB3 in libvirt fixed that.
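
For reference, the switch is a one-line controller change in the domain
XML; the exact model name depends on the QEMU/libvirt version (nec-xhci is
shown here as one common USB3 option):

<controller type='usb' model='nec-xhci'/>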


Cheers,
Jan

[1] 
https://www.techpowerup.com/forums/threads/gaming-benchmarks-core-i7-6700k-hyperthreading-test.219417/
[2] http://www.thesycon.de/eng/latency_check.shtml
[3] http://www.resplendence.com/latencymon

On 01.02.2017 at 16:46, Thomas Lindroth wrote:
> A while ago there was a conversation on the #vfio-users irc channel about how to
> use cpuset/pinning to get the best latency and performance. I said I would run
> some tests and eventually did. Writing up the results took a lot of time and
> there are some more tests I want to run to verify them, but I don't have time
> to do that now. I'll just post what I've concluded instead. First some theory.
>
> Latency in a virtual environment has many different causes.
> * There is latency in the hardware/bios, like system management interrupts.
> * The host operating system introduces some latency. This is often because the
>   host won't schedule the VM when it wants to run.
> * The emulator adds some latency because of things like nested page tables and
>   handling of virtual hardware.
> * The guest OS introduces its own latency when the workload wants to run but the
>   guest scheduler won't schedule it.
>
> Points 1 and 4 are latencies you get even on bare metal, but points 2 and 3 are
> extra latency caused by the virtualisation. This post is mostly about reducing
> the latency of point 2.
>
> I assume you are already familiar with how this is usually done. By using cpuset
> you can reserve some cores for exclusive use by the VM and put all system
> processes on a separate housekeeping core. This allows the VM to run whenever it
> wants, which is good for latency, but the downside is that the VM can't use the
> housekeeping core, so performance is reduced.
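>
> One way to set this up is with the cset tool from the cpuset package. A rough
> sketch, assuming (as in the setups below) that pcore 0 (lcore 0,4) stays as the
> housekeeping core:
>
> # move userspace processes and movable kthreads off lcores 1-3,5-7
> cset shield --cpu=1-3,5-7 --kthread=on
> # show which tasks ended up in which set
> cset shield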
>
> By running pstree -p when the VM is running you get some output like this:
> ...
> ─qemu-system-x86(4995)─┬─{CPU 0/KVM}(5004)
>                        ├─{CPU 1/KVM}(5005)
>                        ├─{CPU 2/KVM}(5006)
>                        ├─{CPU 3/KVM}(5007)
>                        ├─{CPU 4/KVM}(5008)
>                        ├─{CPU 5/KVM}(5009)
>                        ├─{qemu-system-x86}(4996)
>                        ├─{qemu-system-x86}(5012)
>                        ├─{qemu-system-x86}(5013)
>                        ├─{worker}(5765)
>                        └─{worker}(5766)
>
> Qemu spawns a bunch of threads for different things. The "CPU #/KVM" threads run
> the actual guest code and there is one for each virtual cpu. I call them
> "VM threads" from here on. The qemu-system-x86 threads are used to emulate
> virtual hardware and are called the emulator in libvirt terminology. I call them
> "emulator threads". The worker threads are probably what libvirt calls iothreads
> but I treat them the same as the emulator threads and refer to both as
> "emulator threads".
>
> My cpu is an i7-4790K with 4 hyper-threaded cores for a total of 8 logical cores.
> A lot of people here probably have something similar. Take a look in
> /proc/cpuinfo to see how it's laid out. I number my cores like cpuinfo, where I
> have physical cores 0-3 and logical cores 0-7. pcore 0 corresponds to lcores 0,4,
> pcore 1 to lcores 1,5, and so on.
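>
> The sibling layout can be checked directly, for example:
>
> lscpu --extended=CPU,CORE
> cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # e.g. "0,4"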
>
> The goal is to partition the system processes, VM threads and emulator threads
> on these 8 lcores to get good latency and acceptable performance but to do that
> I need a way to measure latency. Mainline kernel 4.9 got a new latency tracer
> called hwlat. It's designed to measure hardware latencies like SMI but if you
> run it in a VM you get all latencies below the guest (points 1-3 above). Hwlat
> bypasses the normal cpu scheduler so it won't measure any latency from the guest
> scheduler (point 4). It basically makes it possible to focus on just the VM
> related latencies.
> https://lwn.net/Articles/703129/
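>
> Enabling it in the guest is roughly this (a sketch; on older kernels tracefs
> lives under /sys/kernel/debug/tracing instead):
>
> cd /sys/kernel/tracing
> echo hwlat > current_tracer     # spins with interrupts disabled and records gaps
> echo 1 > tracing_on
> cat trace                       # per-sample output above the threshold
> cat tracing_max_latency         # worst case seen so far, in us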
>
> We should perhaps also discuss how much latency is too much. That's up for
> debate but the windows DPC latency checker lists 500us as green, 1000us as
> yellow and 2000us as red. If a game runs at 60fps it has a deadline of 16.7ms to
> render a frame. I'll just decide that 1ms (1000us) is the upper limit for what I
> can tolerate.
>
> One of the consequences of how hwlat works is that it also fails to notice a lot
> of the point 3 type of latencies. Most of the latency in point 3 is caused by
> vm-exits. That's when the guest does something the hardware virtualisation can't
> handle and has to rely on kvm or qemu to emulate the behaviour. This is a lot
> slower than real hardware but it mostly only happens when the guest tries to
> access hardware resources, so I'll call it IO-latency. The hwlat tracer only
> sits and spins in kernel space and never touches any hardware by itself. Since
> hwlat doesn't trigger vm-exits it also can't measure the latencies they cause, so
> it would be good to have something else that could. The way I rigged things up is
> to set the virtual disk controller to ahci, which I know has to be emulated by
> qemu. I then added a ram block device from /dev/ram* to the VM as a virtual
> disk. I can then run the fio disk benchmark in the VM on that disk to trigger
> vm-exits and get a report on the latency from fio. It's not a good solution but
> it's the best I could come up with.
> http://freecode.com/projects/fio
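>
> Roughly how the rig looks. The fio parameters below are a guess at a simple
> latency-oriented job, and the device names are assumptions, not the exact
> setup used here:
>
> # host: create a 1 GiB ram block device, then attach /dev/ram0 to the VM as a
> # disk on the emulated ahci controller
> modprobe brd rd_nr=1 rd_size=1048576
>
> # guest (shown for a linux guest; on windows the filename syntax differs):
> fio --name=lat --filename=/dev/sdb --direct=1 --rw=randread --bs=4k \
>     --iodepth=1 --runtime=600 --time_based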
>
> === Low latency setup ===
>
> Let's finally get down to business. The first setup I tried is configured for
> minimum latency at the expense of performance.
>
> The virtual cpu in this setup got 3 cores and no HT. The VM threads are pinned
> to lcore 1,2,3. The emulator threads are pinned to lcore 5,6,7. That leaves
> pcore 0 which is dedicated to the host using cpuset.
>
> Here is the layout in libvirt xml
> <vcpupin vcpu='0' cpuset='1'/>
> <vcpupin vcpu='1' cpuset='2'/>
> <vcpupin vcpu='2' cpuset='3'/>
> <emulatorpin cpuset='5-7'/>
> <topology sockets='1' cores='3' threads='1'/>
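>
> The same pinning can also be applied to a running domain with virsh, e.g.
> (the domain name "win10" is just a placeholder):
>
> virsh vcpupin win10 0 1
> virsh vcpupin win10 1 2
> virsh vcpupin win10 2 3
> virsh emulatorpin win10 5-7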
>
> And here are the results of hwlat (all hwlat tests ran for 30 min each). I used a
> synthetic load to test how the latencies changed under load. I use the program
> stress as synthetic load on both guest and host
> (stress --vm 1 --io 1 --cpu 8 --hdd 1).
>
>                          mean     stdev    max(us)
> host idle, VM idle:   17.2778   15.6788     70
> host load, VM idle:   21.4856   20.1409     72
> host idle, VM load:   19.7144   18.9321    103
> host load, VM load:   21.8189   21.2839    139
>
> As you can see the load on the host makes little difference for the latency.
> The cpuset isolation works well. The slight increase of the mean might be
> because of reduced memory bandwidth. Putting the VM under load will increase the
> latency a bit. This might seem odd since the idea of using hwlat was to bypass
> the guest scheduler, thereby making the latency independent of what is running in
> the guest. What is probably happening is that the "--hdd" part of stress
> accesses the disk and this makes the emulator threads run. They are pinned to the
> HT siblings of the VM threads and thereby slightly impact their latency.
> Overall the latency is very good in this setup.
>
> fio (us) min=40, max=1306, avg=52.81, stdev=12.60 iops=18454
> Here is the result of the io latency test with fio. Since the emulator threads
> run mostly isolated on their own siblings this result must be considered
> good.
>
> === Low latency setup, with realtime ===
>
> In an older post to the mailing list I said "The NO_HZ_FULL scheduler mode only
> works if a single process wants to run on a core. When the VM thread runs as
> realtime priority it can starve the kernel threads for long period of time and
> the scheduler will turn off NO_HZ_FULL when that happens since several processes
> wants to run. To get the full advantage of NO_HZ_FULL don't use realtime
> priority."
>
> Let's see how much impact this really has. The idea behind realtime pri is to
> always give your preferred workload priority over unimportant workloads. But to
> make any difference there has to be an unimportant workload to preempt. Cpuset
> is a great way to move unimportant processes to a housekeeping cpu but
> unfortunately the kernel has some pesky kthreads that refuse to migrate. By
> using realtime pri on the VM threads I should be able to out-preempt the kernel
> threads and get lower latency. In this test I used the same setup as above but
> used schedtool to set round-robin pri 1 on all VM related threads.
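>
> Something along these lines (assuming the qemu binary is qemu-system-x86_64;
> chrt could be used instead of schedtool):
>
> for tid in /proc/$(pidof qemu-system-x86_64)/task/*; do
>     schedtool -R -p 1 $(basename $tid)    # SCHED_RR, priority 1
> done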
>
>                          mean     stdev    max(us)
> host idle, VM idle:   17.6511   15.3028     61
> host load, VM idle:   20.2400   19.6558     57
> host idle, VM load:   18.9244   18.8119    108
> host load, VM load:   20.4228   21.0749    122
>
> The result is mostly the same. Those few remaining kthreads that I can't disable
> or migrate apparently don't make much difference to the latency.
>
> === Balanced setup, emulator with VM threads ===
>
> 3 cores isn't a lot these days and some games like Mad Max and Rise of the Tomb
> Raider max out the cpu in the low latency setup, resulting in big frame drops.
> The setup below, with a virtual 2 core HT cpu, would probably give ok latency,
> but the addition of hyper-threading usually only gives 25-50% extra performance
> for real world workloads, so this setup would generally be slower than the low
> latency setup. I didn't bother to test it.
> <vcpupin vcpu='0' cpuset='2'/>
> <vcpupin vcpu='1' cpuset='6'/>
> <vcpupin vcpu='2' cpuset='3'/>
> <vcpupin vcpu='3' cpuset='7'/>
> <emulatorpin cpuset='1,5'/>
> <topology sockets='1' cores='2' threads='2'/>
>
> To get better performance I need at least a virtual 3 core HT cpu, but if the
> host uses pcore 0 and the VM threads use pcore 1-3, where will the emulator
> threads run? I could overallocate the system by having the emulator threads
> compete with the VM threads, or I could overallocate it by having the emulator
> threads compete with the host processes. Let's try running the emulator with
> the VM threads first.
>
> <vcpupin vcpu='0' cpuset='1'/>
> <vcpupin vcpu='1' cpuset='5'/>
> <vcpupin vcpu='2' cpuset='2'/>
> <vcpupin vcpu='3' cpuset='6'/>
> <vcpupin vcpu='4' cpuset='3'/>
> <vcpupin vcpu='5' cpuset='7'/>
> <emulatorpin cpuset='1-3,5-7'/>
> <topology sockets='1' cores='3' threads='2'/>
>
> The odd ordering for vcpupin is done because Intel cpus lay out HT siblings as
> lcore[01234567] = pcore[01230123] but qemu lays out the virtual cpu as
> lcore[012345] = pcore[001122]. To get a 1:1 mapping I have to order them like
> that.
>
>                          mean     stdev    max(us)
> host idle, VM idle:   17.4906   15.1180     89
> host load, VM idle:   22.7317   19.5327     95
> host idle, VM load:   82.3694  329.6875   9458
> host load, VM load:  141.2461 1170.5207  20757
>
> The result is really bad. It works ok as long as the VM is idle, but as soon as
> it's under load I get bad latencies. The reason is likely that the stressor
> accesses the disk, which activates the emulator, and in this setup the emulator
> can preempt the VM threads. We can check whether this is the case by running
> stress without "--hdd".
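> (That is, stress --vm 1 --io 1 --cpu 8 instead of the full command above.)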
>
>                                        mean     stdev    max(us)
> host load, VM load(but no --hdd):   57.4728  138.8211   1345
>
> The latency is reduced quite a bit but it's still high. It's likely still the
> emulator threads preempting the VM threads. Accessing the disk is just one of
> many things the VM can do to activate the emulator.
>
> fio (us) min=41, max=7348, avg=62.17, stdev=14.99 iops=15715
> io latency is also a lot worse compared to the low latency setup. The reason is
> the VM threads can preempt the emulator threads while they are emulating the
> disk drive.
>
> === Balanced setup, emulator with host ===
>
> Pairing up the emulator threads and VM threads was a bad idea, so let's try
> running the emulator on the core reserved for the host. Since the VM threads run
> by themselves in this setup we would expect good hwlat latency, but the
> emulator threads can be preempted by host processes so io latency might suffer.
> Let's start by looking at the io latency.
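>
> In libvirt terms this presumably just means pointing emulatorpin at pcore 0
> while the vcpupin lines stay as in the previous setup:
>
> <emulatorpin cpuset='0,4'/>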
>
> fio (us) min=40, max=46852, avg=61.55, stdev=250.90 iops=15893
>
> Yup, massive io latency. Here is a situation where realtime pri could help.
> If the emulator threads get realtime pri they can out-preempt the host
> processes. Let's try that.
>
> fio (us) min=38, max=2640, avg=53.72, stdev=13.61  iops=18140
>
> That's better but it's not as good as the low latency setup, where the emulator
> threads got their own lcores. To reduce the latency even more we could try to
> split pcore 0 in two and run the host processes on lcore 0 and the emulator
> threads on lcore 4. But this doesn't leave much cpu for the emulator (or the host).
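>
> That split would look something like this, with the host's own cpuset
> restricted to lcore 0 elsewhere:
>
> <emulatorpin cpuset='4'/>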
>
> fio (us) min=44, max=1192, avg=56.07, stdev=8.52 iops=17377
>
> The max io latency now decreased to the same level as the low latency setup.
> Unfortunately the number of iops also decreased a bit (down 5.8% compared to the
> low latency setup). I'm guessing this is because the emulator threads don't get
> as much cpu power in this setup.
>
>                          mean     stdev    max(us)
> host idle, VM idle:   18.3933   15.5901    106
> host load, VM idle:   20.2006   18.8932     77
> host idle, VM load:   23.1694   22.4301    110
> host load, VM load:   23.2572   23.7288    120
>
> Hwlat latency is comparable to the low latency setup, so this setup gives a good
> latency / performance trade-off.
>
> === Max performance setup ===
>
> If 3 cores with HT isn't enough I suggest you give up, but for comparison let's
> see what happens if we mirror the host cpu in the VM. Now there is no room at all
> for the emulator or the host processes, so I let them schedule freely.
> <vcpupin vcpu='0' cpuset='0'/>
> <vcpupin vcpu='1' cpuset='4'/>
> <vcpupin vcpu='2' cpuset='1'/>
> <vcpupin vcpu='3' cpuset='5'/>
> <vcpupin vcpu='4' cpuset='2'/>
> <vcpupin vcpu='5' cpuset='6'/>
> <vcpupin vcpu='6' cpuset='3'/>
> <vcpupin vcpu='7' cpuset='7'/>
> <emulatorpin cpuset='0-7'/>
> <topology sockets='1' cores='4' threads='2'/>
>
>                            mean      stdev    max(us)
> host idle, VM idle:    185.4200   839.7908   6311
> host load, VM idle:   3835.9333  7836.5902  97234
> host idle, VM load:   1891.4300  3873.9165  31015
> host load, VM load:   8459.2550  6437.6621  51665
>
> fio (us) min=48, max=112484, avg=90.41, stdev=355.10 iops=10845
>
> I only ran these tests for 10 min each. That's all that was needed. As you can
> see it's terrible. I'm afraid that many people probably run a setup similar to
> this. I ran like this myself for a while until I switched to libvirt and started
> looking into pinning. Realtime pri would probably help a lot here but realtime
> in this configuration is potentially dangerous. Workloads on the guest could
> starve the host and depending on how the guest gets its input a reset using the
> hardware reset button could be needed to get the system back.
>
> === Testing with games ===
>
> I want low latency for gaming so it would make sense to test the setups with
> games. This turns out to be kind of tricky. Games are complicated and
> interpreting the results can be hard. As an example,
> https://i.imgur.com/NIrXnkt.png is a percentile plot of the frametimes in the
> built-in benchmark of Rise of the Tomb Raider, taken with fraps. The performance
> and balanced setups look about the same at lower percentiles but the low latency
> setup is a lot lower. This means that the low latency setup, which is the weakest
> in terms of cpu power, got a higher frame rate for some parts of the benchmark.
> This doesn't make sense at first. It only starts to make sense if I pay attention
> to the benchmark while it's running. Rise of the Tomb Raider loads in a lot of
> geometry dynamically and the low latency setup can't keep up. It has bad pop-in
> of textures and objects, so the scene the gpu renders is less complicated than in
> the other setups. A less complicated scene results in a higher frame rate. An
> odd, counter-intuitive result.
>
> Overall the performance and balanced setups have the same percentile curve at
> lower percentiles in every game I tested. This tells me that the balanced setup
> has enough cpu power for all games I've tried. They only differ at higher
> percentiles due to latency induced framedrops. The performance setup always has
> the worst max frametime in every game, so there is no reason to use it over the
> balanced setup. The performance setup also has crackling sound in several games
> over hdmi audio, even with MSI enabled. Which setup gets the lowest max frametime
> depends on the workload. If the game maxes out the cpu of the low latency setup
> its max frametime will be worse than the balanced setup's; if not, the low
> latency setup has the best latency.
>
> === Conclusion ===
>
> The balanced setup (emulator with host) doesn't have the best latency in every
> workload, but I haven't found any workload where it performs poorly in terms of
> max latency, io latency or available cpu power. Even in those workloads where
> another setup performed better, the balanced setup was always close. If you are
> too lazy to switch setups depending on the workload, use the balanced setup as
> the default configuration. If your cpu isn't a 4 core with HT, finding the best
> setup for your cpu is left as an exercise for the reader.
>
> === Future work ===
>
> https://vfio.blogspot.se/2016/10/how-to-improve-performance-in-windows-7.html
> This was a nice trick for forcing win7 to use the TSC. Just one problem: it
> turns out it doesn't work if hyper-threading is enabled. Any time I use a
> virtual cpu with threads='2', win7 reverts to using the acpi_pm timer. I've
> spent a lot of time trying to work around the problem but failed. I don't even
> know why hyper-threading would make a difference for the TSC. Microsoft's
> documentation is amazingly unhelpful. But even when the guest is hammering the
> acpi_pm timer the balanced setup gives better performance than the low latency
> setup, though I'm afraid the reduced resolution and extra indeterminism of the
> acpi_pm timer might result in other problems. This is only a problem in win7
> because modern versions of windows should use hypervclock. I've read somewhere
> that it might be possible to modify OVMF to work around the bug in win7 that
> prevents hyperv from working. With that modification it might be possible to
> use hypervclock in win7. Perhaps I'll look into that in the future. In the
> meantime I'll stick with the balanced setup despite the use of acpi_pm.
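>
> For reference, on guests where the hyperv enlightenments do work, the clock
> source is enabled with something like this in the domain XML (a sketch, not
> something that helps win7 as it stands):
>
> <clock offset='localtime'>
>   <timer name='hypervclock' present='yes'/>
> </clock>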
>
> _______________________________________________
> vfio-users mailing list
> vfio-users at redhat.com
> https://www.redhat.com/mailman/listinfo/vfio-users
>



