[libvirt-users] NUMA issues on virtualized hosts

Lukas Hejtmanek xhejtman at ics.muni.cz
Fri Sep 14 12:06:26 UTC 2018


Hello,

I have a cluster of nodes with AMD EPYC 7351 CPUs, two CPUs per node. The nodes
use the performance (8-NUMA) configuration:

This is from the hypervisor:
[root@hde10 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               1800.000
CPU max MHz:           2400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4800.05
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3,32-35
NUMA node1 CPU(s):     4-7,36-39
NUMA node2 CPU(s):     8-11,40-43
NUMA node3 CPU(s):     12-15,44-47
NUMA node4 CPU(s):     16-19,48-51
NUMA node5 CPU(s):     20-23,52-55
NUMA node6 CPU(s):     24-27,56-59
NUMA node7 CPU(s):     28-31,60-63

I'm running one big virtual machine on this hypervisor, using almost all of the
memory and all physical CPUs.

This is what I'm seeing inside:

root@zenon10:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             8
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
BogoMIPS:              4800.00
Virtualization:        AMD-V
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
NUMA node4 CPU(s):     16-19
NUMA node5 CPU(s):     20-23
NUMA node6 CPU(s):     24-27
NUMA node7 CPU(s):     28-31

This is the virtual node configuration (I tried different numatune settings,
but the result was still the same):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
        <name>one-55782</name>
        <vcpu><![CDATA[32]]></vcpu>
        <cputune>
                <shares>32768</shares>
        </cputune>
        <memory>507904000</memory>
        <os>
                <type arch='x86_64'>hvm</type>
        </os>
        <devices>
                <emulator><![CDATA[/usr/bin/kvm]]></emulator>
                <disk type='file' device='disk'>
                        <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
                        <target dev='vda'/>
                        <driver name='qemu' type='qcow2' cache='unsafe'/>
                </disk>
                <disk type='file' device='disk'>
                        <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
                        <target dev='vdc'/>
                        <driver name='qemu' type='raw' cache='unsafe'/>
                </disk>
                <disk type='file' device='disk'>
                        <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
                        <target dev='vdd'/>
                        <driver name='qemu' type='raw' cache='unsafe'/>
                </disk>
                <disk type='file' device='disk'>
                        <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
                        <target dev='vde'/>
                        <driver name='qemu' type='raw' cache='unsafe'/>
                </disk>
                <disk type='file' device='cdrom'>
                        <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
                        <target dev='vdb'/>
                        <readonly/>
                        <driver name='qemu' type='raw'/>
                </disk>
                <interface type='bridge'>
                        <source bridge='br0'/>
                        <mac address='02:00:93:fb:3b:78'/>
                        <target dev='one-55782-0'/>
                        <model type='virtio'/>
                        <filterref filter='no-arp-mac-spoofing'>
                                <parameter name='IP' value='147.251.59.120'/>
                        </filterref>
                </interface>
        </devices>
        <features>
                <pae/>
                <acpi/>
        </features>
        <!-- RAW data follows: -->
<cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
<cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='2' /><vcpupin vcpu='2' cpuset='4' /><vcpupin vcpu='3' cpuset='6' /><vcpupin vcpu='4' cpuset='8' /><vcpupin vcpu='5' cpuset='10' /><vcpupin vcpu='6' cpuset='12' /><vcpupin vcpu='7' cpuset='14' /><vcpupin vcpu='8' cpuset='16' /><vcpupin vcpu='9' cpuset='18' /><vcpupin vcpu='10' cpuset='20' /><vcpupin vcpu='11' cpuset='22' /><vcpupin vcpu='12' cpuset='24' /><vcpupin vcpu='13' cpuset='26' /><vcpupin vcpu='14' cpuset='28' /><vcpupin vcpu='15' cpuset='30' /><vcpupin vcpu='16' cpuset='1' /><vcpupin vcpu='17' cpuset='3' /><vcpupin vcpu='18' cpuset='5' /><vcpupin vcpu='19' cpuset='7' /><vcpupin vcpu='20' cpuset='9' /><vcpupin vcpu='21' cpuset='11' /><vcpupin vcpu='22' cpuset='13' /><vcpupin vcpu='23' cpuset='15' /><vcpupin vcpu='24' cpuset='17' /><vcpupin vcpu='25' cpuset='19' /><vcpupin vcpu='26' cpuset='21' /><vcpupin vcpu='27' cpuset='23' /><vcpupin vcpu='28' cpuset='25' /><vcpupin vcpu='29' cpuset='27' /><vcpupin vcpu='30' cpuset='29' /><vcpupin vcpu='31' cpuset='31' /></cputune>
<numatune><memory mode='preferred' nodeset='0'/></numatune>)
<devices><serial type='pty'><target port='0'/></serial><console type='pty'><target type='serial' port='0'/></console><channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel></devices>
<devices><hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source></hostdev></devices>

        <devices><controller type='pci' index='1' model='pci-bridge'/><controller type='pci' index='2' model='pci-bridge'/><controller type='pci' index='3' model='pci-bridge'/><controller type='pci' index='4' model='pci-bridge'/><controller type='pci' index='5' model='pci-bridge'/></devices>
        <metadata>
                <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]>          </system_datastore>
        </metadata>
</domain>
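
As a cross-check of the pinning above, here is a minimal Python sketch that maps each guest NUMA cell to the host NUMA nodes its vCPUs are pinned to. The host layout is transcribed from the hypervisor lscpu output and the pinning from the vcpupin entries; all function and variable names are made up for illustration.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as '0-3,32-35' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Host NUMA nodes, transcribed from the hypervisor lscpu output.
HOST_NODES = {
    0: "0-3,32-35", 1: "4-7,36-39", 2: "8-11,40-43", 3: "12-15,44-47",
    4: "16-19,48-51", 5: "20-23,52-55", 6: "24-27,56-59", 7: "28-31,60-63",
}

# vcpupin entries from the domain XML: vCPUs 0-15 are pinned to the even
# host CPUs 0..30, vCPUs 16-31 to the odd host CPUs 1..31.
VCPU_PIN = {v: 2 * v for v in range(16)}
VCPU_PIN.update({v: 2 * (v - 16) + 1 for v in range(16, 32)})

# Guest NUMA cells from the <numa> element: cell c holds vCPUs 4c..4c+3.
GUEST_CELLS = {c: range(4 * c, 4 * c + 4) for c in range(8)}

# Reverse lookup: host CPU -> host NUMA node.
CPU_TO_NODE = {
    cpu: node
    for node, spec in HOST_NODES.items()
    for cpu in parse_cpu_list(spec)
}


def cell_to_host_nodes():
    """Return {guest cell: sorted list of host nodes its vCPUs are pinned to}."""
    return {
        cell: sorted({CPU_TO_NODE[VCPU_PIN[v]] for v in vcpus})
        for cell, vcpus in GUEST_CELLS.items()
    }


if __name__ == "__main__":
    for cell, nodes in cell_to_host_nodes().items():
        print(f"guest cell {cell} -> host nodes {nodes}")
```

With these values every guest cell ends up with vCPUs on two host nodes (cell 0 on host nodes 0 and 1, cell 1 on 2 and 3, and so on), and cells 4-7 repeat the pattern of cells 0-3. Whether this contributes to the behaviour described here is not established by the sketch.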

If I run, e.g., SPEC CPU 2017 in the virtual machine, I see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND         
 1350 root      20   0  843136 830068   2524 R  78.1  0.2 513:16.16 bwaves_r_base.m 
 2456 root      20   0  804608 791264   2524 R  76.6  0.2 491:39.92 bwaves_r_base.m 
 4631 root      20   0  843136 829892   2344 R  75.8  0.2 450:16.04 bwaves_r_base.m 
 6441 root      20   0  802580 790212   2532 R  75.0  0.2 120:37.54 bwaves_r_base.m 
 7991 root      20   0  784676 772092   2576 R  75.0  0.2 387:15.39 bwaves_r_base.m 
 8142 root      20   0  843136 830044   2496 R  75.0  0.2 384:39.02 bwaves_r_base.m 
 8234 root      20   0  843136 830064   2524 R  75.0  0.2  99:04.48 bwaves_r_base.m 
 8578 root      20   0  749240 736604   2468 R  73.4  0.2 375:45.66 bwaves_r_base.m 
 9974 root      20   0  784676 771984   2468 R  73.4  0.2 348:01.36 bwaves_r_base.m 
10396 root      20   0  802580 790264   2576 R  73.4  0.2 340:08.40 bwaves_r_base.m 
12932 root      20   0  843136 830024   2480 R  73.4  0.2 288:39.76 bwaves_r_base.m 
13113 root      20   0  784676 771864   2348 R  71.9  0.2 284:47.34 bwaves_r_base.m 
13518 root      20   0  784676 762816   2540 R  71.9  0.2 276:31.58 bwaves_r_base.m 
14443 root      20   0  784676 771984   2468 R  71.9  0.2 260:01.82 bwaves_r_base.m 
12791 root      20   0  784676 772060   2544 R  70.3  0.2 291:43.96 bwaves_r_base.m 
10544 root      20   0  843136 830068   2520 R  68.8  0.2 336:47.43 bwaves_r_base.m 
15464 root      20   0  784676 762880   2608 R  60.9  0.2 239:19.14 bwaves_r_base.m 
15487 root      20   0  784676 772048   2532 R  60.2  0.2 238:37.07 bwaves_r_base.m 
16824 root      20   0  784676 772120   2604 R  55.5  0.2 212:10.92 bwaves_r_base.m 
17255 root      20   0  843136 830012   2468 R  54.7  0.2 203:22.89 bwaves_r_base.m 
17962 root      20   0  784676 772004   2488 R  54.7  0.2 188:26.07 bwaves_r_base.m 
17505 root      20   0  843136 830068   2520 R  53.1  0.2 198:04.25 bwaves_r_base.m 
27767 root      20   0  784676 771860   2344 R  52.3  0.2 592:25.95 bwaves_r_base.m 
24458 root      20   0  843136 829888   2344 R  50.8  0.2 658:23.70 bwaves_r_base.m 
30746 root      20   0  747376 735160   2604 R  43.0  0.2 556:47.67 bwaves_r_base.m 

The CPU TIME should be roughly the same for all processes, but there are
obviously huge differences.

This is what I see on the hypervisor:
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                     
 18201 oneadmin  20   0  474.0g 473.3g   1732 S  2459 94.0  33332:54 kvm                                                                         
   369 root      20   0       0      0      0 R 100.0  0.0 768:12.85 kswapd1                                                                     
   368 root      20   0       0      0      0 R  94.1  0.0 869:05.61 kswapd0  

I.e., each kswapd thread is eating a whole CPU. Swap is turned off.

[root@hde10 ~]# free
              total        used        free      shared  buff/cache   available
Mem:      528151432   503432580     1214048       34740    23504804    21907800
Swap:             0           0           0
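
To put "almost all of the memory" into numbers, here is a small arithmetic sketch in Python. It assumes that the `<memory>` and `<cell memory=...>` values in the domain XML are in KiB (libvirt's default unit when none is given) and that `free` reports KiB; the variable names are illustrative only.

```python
KIB_PER_GIB = 1024 * 1024

host_total_kib = 528151432    # "Mem: total" from free on the hypervisor
guest_memory_kib = 507904000  # <memory> in the domain XML (assumed KiB)
cell_memory_kib = 62000000    # each <cell memory='...'> (assumed KiB)
num_cells = 8

# Share of host RAM handed to the guest, and what remains for the host
# itself (kernel, page cache, qemu overhead).
guest_fraction = guest_memory_kib / host_total_kib
host_headroom_gib = (host_total_kib - guest_memory_kib) / KIB_PER_GIB

# Memory covered by the guest NUMA cells versus the <memory> value.
cells_total_kib = cell_memory_kib * num_cells
outside_cells_kib = guest_memory_kib - cells_total_kib

print(f"guest share of host RAM: {guest_fraction:.1%}")
print(f"left for the host:       {host_headroom_gib:.1f} GiB")
print(f"sum of NUMA cells:       {cells_total_kib} KiB")
print(f"<memory> minus cells:    {outside_cells_kib} KiB")
```

Under these assumptions the guest is given a bit more than 96 % of the host's RAM, which leaves under 20 GiB for everything else on the hypervisor, and the eight cells add up to 496000000 KiB, slightly less than the `<memory>` value.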

The hypervisor is:
[root@hde10 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

qemu-kvm-1.5.3-156.el7_5.5.x86_64

The virtual machine runs Debian 9.


Moreover, I'm using this type of disk for the virtual machines:
<disk type='file' device='disk'>
        <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
        <target dev='vde'/>
        <driver name='qemu' type='raw' cache='unsafe'/>
</disk>

If I keep cache='unsafe' and run an iozone test on really big files (e.g.,
8x 100GB), I see huge cache pressure on the hypervisor: all 8 kswapd threads
run at 100 % CPU and slow things down. The disk under the datastore is an
Intel 4500 NVMe SSD.

If I set cache='none', the kswapd threads are idle and disk writes are pretty
fast. However, with the 8-NUMA configuration, writes slow down to less than
10MB/s as soon as the amount of written data is roughly the same as the memory
size in the virtual node. From then on iozone shows 100 % CPU usage, and it
seems to be traversing page lists. If I do the same with a 1-NUMA
configuration, everything is OK except for a performance penalty of about 25 %.
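
One way to check whether free memory and page cache are spread unevenly across the guest's NUMA nodes during such a run is to read the per-node counters that Linux exposes under /sys/devices/system/node/node*/meminfo. The following Python sketch is only a suggestion for gathering that data, not something that was run here; the function names are made up, and fields whose names contain parentheses (such as Active(anon)) are deliberately skipped.

```python
import glob
import re

# Matches lines such as "Node 0 MemFree:        12345 kB".
LINE_RE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)(?:\s+kB)?\s*$")


def parse_node_meminfo(text):
    """Parse per-node meminfo text into {node: {field: value}}."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            node = int(match.group(1))
            stats.setdefault(node, {})[match.group(2)] = int(match.group(3))
    return stats


def read_all_nodes(pattern="/sys/devices/system/node/node*/meminfo"):
    """Read and merge the meminfo files of all NUMA nodes on this machine."""
    stats = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for node, fields in parse_node_meminfo(handle.read()).items():
                stats.setdefault(node, {}).update(fields)
    return stats


if __name__ == "__main__":
    # Print free memory and page cache (in kB) for every node.
    for node, fields in sorted(read_all_nodes().items()):
        print(f"node {node}: MemFree={fields.get('MemFree')} kB, "
              f"FilePages={fields.get('FilePages')} kB")
```

Running this periodically inside the guest while iozone writes would show whether one node's MemFree drops to near zero while other nodes still have memory available.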

-- 
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja 
  is not an official job title



