[libvirt-users] NUMA issues on virtualized hosts
Lukas Hejtmanek
xhejtman at ics.muni.cz
Fri Sep 14 13:36:59 UTC 2018
Hello,
OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue
with iozone remains the same.
The spec2017 benchmark now runs; however, it runs slower than in the 1-NUMA case.
The corrected XML looks as follows:
<cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
<cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='1' /><vcpupin vcpu='2' cpuset='2' /><vcpupin vcpu='3' cpuset='3' /><vcpupin vcpu='4' cpuset='4' /><vcpupin vcpu='5' cpuset='5' /><vcpupin vcpu='6' cpuset='6' /><vcpupin vcpu='7' cpuset='7' /><vcpupin vcpu='8' cpuset='8' /><vcpupin vcpu='9' cpuset='9' /><vcpupin vcpu='10' cpuset='10' /><vcpupin vcpu='11' cpuset='11' /><vcpupin vcpu='12' cpuset='12' /><vcpupin vcpu='13' cpuset='13' /><vcpupin vcpu='14' cpuset='14' /><vcpupin vcpu='15' cpuset='15' /><vcpupin vcpu='16' cpuset='16' /><vcpupin vcpu='17' cpuset='17' /><vcpupin vcpu='18' cpuset='18' /><vcpupin vcpu='19' cpuset='19' /><vcpupin vcpu='20' cpuset='20' /><vcpupin vcpu='21' cpuset='21' /><vcpupin vcpu='22' cpuset='22' /><vcpupin vcpu='23' cpuset='23' /><vcpupin vcpu='24' cpuset='24' /><vcpupin vcpu='25' cpuset='25' /><vcpupin vcpu='26' cpuset='26' /><vcpupin vcpu='27' cpuset='27' /><vcpupin vcpu='28' cpuset='28' /><vcpupin vcpu='29' cpuset='29' /><vcpupin vcpu='30' cpuset='30' /><vcpupin vcpu='31' cpuset='31' /></cputune>
<numatune><memory mode='strict' nodeset='0-7'/></numatune>
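Note that mode='strict' with nodeset='0-7' only restricts the guest's memory to host nodes 0-7 as a whole; it does not by itself tie each guest cell to the matching host node. libvirt documents per-cell <memnode> elements for that purpose. A minimal, untested sketch follows. It assumes that the libvirt and QEMU versions on the hypervisor support <memnode> (not verified for the qemu-kvm 1.5.3 package quoted below) and that guest cell N should be backed by host node N, as the 1:1 pinning suggests:

```xml
<numatune>
  <!-- default policy for any guest cell without its own memnode entry -->
  <memory mode='strict' nodeset='0-7'/>
  <!-- assumption: guest cell N is backed by host NUMA node N -->
  <memnode cellid='0' mode='strict' nodeset='0'/>
  <memnode cellid='1' mode='strict' nodeset='1'/>
  <memnode cellid='2' mode='strict' nodeset='2'/>
  <memnode cellid='3' mode='strict' nodeset='3'/>
  <memnode cellid='4' mode='strict' nodeset='4'/>
  <memnode cellid='5' mode='strict' nodeset='5'/>
  <memnode cellid='6' mode='strict' nodeset='6'/>
  <memnode cellid='7' mode='strict' nodeset='7'/>
</numatune>
```

Whether this changes the timings here is not known; it is only an illustration of the per-cell syntax, not the configuration that produced the numbers in this message.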
In this case, the first part took more than 1700 seconds. The 1-NUMA config
finishes it in 1646 seconds.
For comparison, the hypervisor itself finishes it in 1470 seconds with the 1-NUMA
config and in 900 seconds with the 8-NUMA config.
On Fri, Sep 14, 2018 at 02:06:26PM +0200, Lukas Hejtmanek wrote:
> Hello,
>
> I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node. The nodes use the
> performance 8-NUMA configuration:
>
> This is from the hypervisor:
> [root@hde10 ~]# lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 64
> On-line CPU(s) list: 0-63
> Thread(s) per core: 2
> Core(s) per socket: 16
> Socket(s): 2
> NUMA node(s): 8
> Vendor ID: AuthenticAMD
> CPU family: 23
> Model: 1
> Model name: AMD EPYC 7351 16-Core Processor
> Stepping: 2
> CPU MHz: 1800.000
> CPU max MHz: 2400.0000
> CPU min MHz: 1200.0000
> BogoMIPS: 4800.05
> Virtualization: AMD-V
> L1d cache: 32K
> L1i cache: 64K
> L2 cache: 512K
> L3 cache: 8192K
> NUMA node0 CPU(s): 0-3,32-35
> NUMA node1 CPU(s): 4-7,36-39
> NUMA node2 CPU(s): 8-11,40-43
> NUMA node3 CPU(s): 12-15,44-47
> NUMA node4 CPU(s): 16-19,48-51
> NUMA node5 CPU(s): 20-23,52-55
> NUMA node6 CPU(s): 24-27,56-59
> NUMA node7 CPU(s): 28-31,60-63
>
> I'm running one big virtual machine on this hypervisor: almost the whole memory
> plus all physical CPUs.
>
> This is what I'm seeing inside:
>
> root@zenon10:~# lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 32
> On-line CPU(s) list: 0-31
> Thread(s) per core: 1
> Core(s) per socket: 4
> Socket(s): 8
> NUMA node(s): 8
> Vendor ID: AuthenticAMD
> CPU family: 23
> Model: 1
> Model name: AMD EPYC 7351 16-Core Processor
> Stepping: 2
> CPU MHz: 2400.000
> BogoMIPS: 4800.00
> Virtualization: AMD-V
> Hypervisor vendor: KVM
> Virtualization type: full
> L1d cache: 64K
> L1i cache: 64K
> L2 cache: 512K
> NUMA node0 CPU(s): 0-3
> NUMA node1 CPU(s): 4-7
> NUMA node2 CPU(s): 8-11
> NUMA node3 CPU(s): 12-15
> NUMA node4 CPU(s): 16-19
> NUMA node5 CPU(s): 20-23
> NUMA node6 CPU(s): 24-27
> NUMA node7 CPU(s): 28-31
>
> This is the virtual node configuration (I tried different numatune settings, but
> the result was still the same):
>
> <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> <name>one-55782</name>
> <vcpu><![CDATA[32]]></vcpu>
> <cputune>
> <shares>32768</shares>
> </cputune>
> <memory>507904000</memory>
> <os>
> <type arch='x86_64'>hvm</type>
> </os>
> <devices>
> <emulator><![CDATA[/usr/bin/kvm]]></emulator>
> <disk type='file' device='disk'>
> <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
> <target dev='vda'/>
> <driver name='qemu' type='qcow2' cache='unsafe'/>
> </disk>
> <disk type='file' device='disk'>
> <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
> <target dev='vdc'/>
> <driver name='qemu' type='raw' cache='unsafe'/>
> </disk>
> <disk type='file' device='disk'>
> <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
> <target dev='vdd'/>
> <driver name='qemu' type='raw' cache='unsafe'/>
> </disk>
> <disk type='file' device='disk'>
> <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
> <target dev='vde'/>
> <driver name='qemu' type='raw' cache='unsafe'/>
> </disk>
> <disk type='file' device='cdrom'>
> <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
> <target dev='vdb'/>
> <readonly/>
> <driver name='qemu' type='raw'/>
> </disk>
> <interface type='bridge'>
> <source bridge='br0'/>
> <mac address='02:00:93:fb:3b:78'/>
> <target dev='one-55782-0'/>
> <model type='virtio'/>
> <filterref filter='no-arp-mac-spoofing'>
> <parameter name='IP' value='147.251.59.120'/>
> </filterref>
> </interface>
> </devices>
> <features>
> <pae/>
> <acpi/>
> </features>
> <!-- RAW data follows: -->
> <cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
> <cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='2' /><vcpupin vcpu='2' cpuset='4' /><vcpupin vcpu='3' cpuset='6' /><vcpupin vcpu='4' cpuset='8' /><vcpupin vcpu='5' cpuset='10' /><vcpupin vcpu='6' cpuset='12' /><vcpupin vcpu='7' cpuset='14' /><vcpupin vcpu='8' cpuset='16' /><vcpupin vcpu='9' cpuset='18' /><vcpupin vcpu='10' cpuset='20' /><vcpupin vcpu='11' cpuset='22' /><vcpupin vcpu='12' cpuset='24' /><vcpupin vcpu='13' cpuset='26' /><vcpupin vcpu='14' cpuset='28' /><vcpupin vcpu='15' cpuset='30' /><vcpupin vcpu='16' cpuset='1' /><vcpupin vcpu='17' cpuset='3' /><vcpupin vcpu='18' cpuset='5' /><vcpupin vcpu='19' cpuset='7' /><vcpupin vcpu='20' cpuset='9' /><vcpupin vcpu='21' cpuset='11' /><vcpupin vcpu='22' cpuset='13' /><vcpupin vcpu='23' cpuset='15' /><vcpupin vcpu='24' cpuset='17' /><vcpupin vcpu='25' cpuset='19' /><vcpupin vcpu='26' cpuset='21' /><vcpupin vcpu='27' cpuset='23' /><vcpupin vcpu='28' cpuset='25' /><vcpupin vcpu='29' cpuset='27' /><vcpupin vcpu='30' cpuset='29' /><vcpupin vcpu='31' cpuset='31' /></cputune>
> <numatune><memory mode='preferred' nodeset='0'/></numatune>
> <devices><serial type='pty'><target port='0'/></serial><console type='pty'><target type='serial' port='0'/></console><channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel></devices>
> <devices><hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source></hostdev></devices>
>
> <devices><controller type='pci' index='1' model='pci-bridge'/><controller type='pci' index='2' model='pci-bridge'/><controller type='pci' index='3' model='pci-bridge'/><controller type='pci' index='4' model='pci-bridge'/><controller type='pci' index='5' model='pci-bridge'/></devices>
> <metadata>
> <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]> </system_datastore>
> </metadata>
> </domain>
>
> If I run, e.g., spec2017 in the virtual machine, I see:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 1350 root 20 0 843136 830068 2524 R 78.1 0.2 513:16.16 bwaves_r_base.m
> 2456 root 20 0 804608 791264 2524 R 76.6 0.2 491:39.92 bwaves_r_base.m
> 4631 root 20 0 843136 829892 2344 R 75.8 0.2 450:16.04 bwaves_r_base.m
> 6441 root 20 0 802580 790212 2532 R 75.0 0.2 120:37.54 bwaves_r_base.m
> 7991 root 20 0 784676 772092 2576 R 75.0 0.2 387:15.39 bwaves_r_base.m
> 8142 root 20 0 843136 830044 2496 R 75.0 0.2 384:39.02 bwaves_r_base.m
> 8234 root 20 0 843136 830064 2524 R 75.0 0.2 99:04.48 bwaves_r_base.m
> 8578 root 20 0 749240 736604 2468 R 73.4 0.2 375:45.66 bwaves_r_base.m
> 9974 root 20 0 784676 771984 2468 R 73.4 0.2 348:01.36 bwaves_r_base.m
> 10396 root 20 0 802580 790264 2576 R 73.4 0.2 340:08.40 bwaves_r_base.m
> 12932 root 20 0 843136 830024 2480 R 73.4 0.2 288:39.76 bwaves_r_base.m
> 13113 root 20 0 784676 771864 2348 R 71.9 0.2 284:47.34 bwaves_r_base.m
> 13518 root 20 0 784676 762816 2540 R 71.9 0.2 276:31.58 bwaves_r_base.m
> 14443 root 20 0 784676 771984 2468 R 71.9 0.2 260:01.82 bwaves_r_base.m
> 12791 root 20 0 784676 772060 2544 R 70.3 0.2 291:43.96 bwaves_r_base.m
> 10544 root 20 0 843136 830068 2520 R 68.8 0.2 336:47.43 bwaves_r_base.m
> 15464 root 20 0 784676 762880 2608 R 60.9 0.2 239:19.14 bwaves_r_base.m
> 15487 root 20 0 784676 772048 2532 R 60.2 0.2 238:37.07 bwaves_r_base.m
> 16824 root 20 0 784676 772120 2604 R 55.5 0.2 212:10.92 bwaves_r_base.m
> 17255 root 20 0 843136 830012 2468 R 54.7 0.2 203:22.89 bwaves_r_base.m
> 17962 root 20 0 784676 772004 2488 R 54.7 0.2 188:26.07 bwaves_r_base.m
> 17505 root 20 0 843136 830068 2520 R 53.1 0.2 198:04.25 bwaves_r_base.m
> 27767 root 20 0 784676 771860 2344 R 52.3 0.2 592:25.95 bwaves_r_base.m
> 24458 root 20 0 843136 829888 2344 R 50.8 0.2 658:23.70 bwaves_r_base.m
> 30746 root 20 0 747376 735160 2604 R 43.0 0.2 556:47.67 bwaves_r_base.m
>
> The CPU times should be roughly the same, but the differences are obviously huge.
>
> This is what I see on the hypervisor:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18201 oneadmin 20 0 474.0g 473.3g 1732 S 2459 94.0 33332:54 kvm
> 369 root 20 0 0 0 0 R 100.0 0.0 768:12.85 kswapd1
> 368 root 20 0 0 0 0 R 94.1 0.0 869:05.61 kswapd0
>
> I.e., each kswapd thread is eating a whole CPU, although swap is turned off.
>
> [root@hde10 ~]# free
>               total        used        free      shared  buff/cache   available
> Mem:      528151432   503432580     1214048       34740    23504804    21907800
> Swap:             0           0           0
>
> Hypervisor is
> [root@hde10 ~]# cat /etc/redhat-release
> CentOS Linux release 7.5.1804 (Core)
>
> qemu-kvm-1.5.3-156.el7_5.5.x86_64
>
> The virtual machine runs Debian 9.
>
>
> Moreover, I'm using this type of disk for the virtual machines:
> <disk type='file' device='disk'>
> <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
> <target dev='vde'/>
> <driver name='qemu' type='raw' cache='unsafe'/>
> </disk>
>
> If I keep cache='unsafe' and run an iozone test on really big files (e.g.,
> 8x 100GB), I see huge cache pressure on the hypervisor: all 8 kswapd threads are
> running at 100 % CPU and slowing things down. The disk under the datastore is
> an Intel 4500 NVMe SSD.
>
> If I set cache='none', the kswapd threads are idle and disk writes are pretty fast.
> However, with the 8-NUMA configuration, writes slow down to less than 10MB/s as
> soon as the size of the written data is roughly the same as the memory size in the
> virtual node. iozone has 100 % CPU usage thereafter, and it seems to be traversing
> page lists. If I do the same with the 1-NUMA configuration, everything is OK except
> for a performance penalty of about 25 %.
>
> --
> Lukáš Hejtmánek
>
> Linux Administrator only because
> Full Time Multitasking Ninja
> is not an official job title
--
Lukáš Hejtmánek
Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title