[libvirt-users] NUMA issues on virtualized hosts

Michal Privoznik mprivozn at redhat.com
Tue Sep 18 07:50:44 UTC 2018


On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
> Hello,
> 
> so the current domain configuration:
> <cpu mode='host-passthrough'>
>   <topology sockets='8' cores='4' threads='1'/>
>   <numa>
>     <cell cpus='0-3' memory='62000000' />
>     <cell cpus='4-7' memory='62000000' />
>     <cell cpus='8-11' memory='62000000' />
>     <cell cpus='12-15' memory='62000000' />
>     <cell cpus='16-19' memory='62000000' />
>     <cell cpus='20-23' memory='62000000' />
>     <cell cpus='24-27' memory='62000000' />
>     <cell cpus='28-31' memory='62000000' />
>   </numa>
> </cpu>
> <cputune>
>   <vcpupin vcpu='0' cpuset='0' />
>   <vcpupin vcpu='1' cpuset='1' />
>   <vcpupin vcpu='2' cpuset='2' />
>   <vcpupin vcpu='3' cpuset='3' />
>   <vcpupin vcpu='4' cpuset='4' />
>   <vcpupin vcpu='5' cpuset='5' />
>   <vcpupin vcpu='6' cpuset='6' />
>   <vcpupin vcpu='7' cpuset='7' />
>   <vcpupin vcpu='8' cpuset='8' />
>   <vcpupin vcpu='9' cpuset='9' />
>   <vcpupin vcpu='10' cpuset='10' />
>   <vcpupin vcpu='11' cpuset='11' />
>   <vcpupin vcpu='12' cpuset='12' />
>   <vcpupin vcpu='13' cpuset='13' />
>   <vcpupin vcpu='14' cpuset='14' />
>   <vcpupin vcpu='15' cpuset='15' />
>   <vcpupin vcpu='16' cpuset='16' />
>   <vcpupin vcpu='17' cpuset='17' />
>   <vcpupin vcpu='18' cpuset='18' />
>   <vcpupin vcpu='19' cpuset='19' />
>   <vcpupin vcpu='20' cpuset='20' />
>   <vcpupin vcpu='21' cpuset='21' />
>   <vcpupin vcpu='22' cpuset='22' />
>   <vcpupin vcpu='23' cpuset='23' />
>   <vcpupin vcpu='24' cpuset='24' />
>   <vcpupin vcpu='25' cpuset='25' />
>   <vcpupin vcpu='26' cpuset='26' />
>   <vcpupin vcpu='27' cpuset='27' />
>   <vcpupin vcpu='28' cpuset='28' />
>   <vcpupin vcpu='29' cpuset='29' />
>   <vcpupin vcpu='30' cpuset='30' />
>   <vcpupin vcpu='31' cpuset='31' />
> </cputune>
> <numatune>
> <memnode cellid="0" mode="strict" nodeset="0"/>
> <memnode cellid="1" mode="strict" nodeset="1"/>
> <memnode cellid="2" mode="strict" nodeset="2"/>
> <memnode cellid="3" mode="strict" nodeset="3"/>
> <memnode cellid="4" mode="strict" nodeset="4"/>
> <memnode cellid="5" mode="strict" nodeset="5"/>
> <memnode cellid="6" mode="strict" nodeset="6"/>
> <memnode cellid="7" mode="strict" nodeset="7"/>
> </numatune>
> 
> hopefully, I got it right. 


Yes, looking good.

> 
> The good news is that the SPEC benchmark looks promising. The first test,
> bwaves, finished in 1003 seconds compared to 1700 seconds in the previous,
> wrong configuration. So far so good.

Very well, this means that the config above is correct.
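If you want to double-check at runtime that the pinning and memory binding
actually took effect, something like this should show it (substitute your
domain's name, which I'm only guessing at here):

# virsh vcpupin <domain>
# virsh numatune <domain>

and inside the guest:

# numactl --hardware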

> 
> The bad news is that iozone is still the same. There might be some
> misunderstanding.
> 
> I have two cases:
> 
> 1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap,
> and to swap a lot. It usually eats the whole swap partition and kswapd runs at
> 100% CPU. swappiness, dirty_ratio and company do not improve things at all.
> However, I believe this is just the wrong option for scratch disks where one
> can expect huge I/O load. Moreover, the hypervisor is a poor machine with only
> a little memory left (OK, in my case about 10GB available), so it does not make
> sense to use that memory for additional cache/disk buffers.

One thing that just occurred to me - is the qcow2 file fully allocated?

# qemu-img info /var/lib/libvirt/images/fedora.qcow2
..
virtual size: 20G (21474836480 bytes)
disk size: 7.0G
..

This is NOT a fully allocated qcow2.
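If yours turns out not to be fully allocated either, one way to get a fully
allocated copy is (the paths and size here are just examples, adjust them to
your scratch image):

# qemu-img convert -O qcow2 -o preallocation=full \
    /var/lib/libvirt/images/scratch.qcow2 /var/lib/libvirt/images/scratch-full.qcow2

or create new images preallocated from the start:

# qemu-img create -f qcow2 -o preallocation=full \
    /var/lib/libvirt/images/scratch.qcow2 500G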

> 
> 2) cache=none. In this case, performance is better (only a few percent behind
> bare metal). However, as soon as the size of the stored data is about the size
> of the virtual machine's memory, writes stop and iozone eats a whole CPU; it
> looks like it is searching for more free pages and that gets harder and harder.
> But I am not sure, I am not skilled in this area.

Hmm. Could it be that the SSD doesn't have enough free blocks and thus
writes are throttled? Can you fstrim it and see if that helps?
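For example (assuming the scratch filesystem is mounted at /scratch, adjust
the mount point to yours):

# fstrim -v /scratch

or simply trim every mounted filesystem that supports it:

# fstrim -av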

> 
> here, you can clearly see that it starts writing, does the writes, then takes
> a pause, writes again, and so on, but the pauses get longer and longer..
> https://pastebin.com/2gfPFgb9
> The output runs until the very end of iozone (I cancelled it with ctrl-c).
> 
> It seems that this is not happening on a 2-NUMA-node host with rotational
> disks only. It is partly happening on a 2-NUMA-node host with 2 NVMe SSDs.
> "Partly" means that there are also pauses in writes, but it finishes, though
> at reduced speed. On a 1-NUMA-node host, with the same test, I can see steady
> writes from the very beginning to the very end at roughly the same speed.
> 
> Maybe it could be related to the fact that the NVMe is a PCI device that is
> linked to one NUMA node only?

Could be. I don't know qemu internals well enough to say whether it is
capable of doing zero-copy disk writes.
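One thing you can check, though, is which NUMA node the NVMe controller is
attached to (assuming it shows up as nvme0; a value of -1 means it is not
tied to any particular node):

# cat /sys/class/nvme/nvme0/device/numa_node

lspci -vv for the NVMe's PCI address should print a "NUMA node:" line as well.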

> 
> 
> As for iothreads, I have only 1 disk (the vde) that is exposed to high I/O
> load, so I believe more I/O threads are not applicable here. If I understand
> correctly, I cannot assign more than one iothread to a single device.. And it
> does not seem to be iothread-related, as the same scenario in a 1-NUMA
> configuration works OK (I mean that memory penalties can be huge as it does
> not reflect the real NUMA topology, but disk speed is OK anyway.)

Ah, since it's only one disk, iothreads will not help much here.
Still worth giving it a shot ;-) Remember, iothreads handle all I/O,
not just disk I/O.
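If you do want to try it, the gist is roughly this in the domain XML (the
iothread id, the cpuset and the disk bits are just a sketch, adjust them to
your vde disk):

<domain>
  ...
  <iothreads>1</iothreads>
  <cputune>
    ...
    <iothreadpin iothread='1' cpuset='0-3'/>
  </cputune>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='native' iothread='1'/>
      ...
    </disk>
    ...
  </devices>
</domain>

Pinning the iothread (<iothreadpin>) to host CPUs on the same NUMA node as
the NVMe might be worth trying, too.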

Anyway, this is the point where I have to say "I don't know". Sorry. Try
contacting qemu guys:

qemu-discuss at nongnu.org
qemu-devel at nongnu.org

Michal



