[libvirt-users] NUMA issues on virtualized hosts

Lukas Hejtmanek xhejtman at ics.muni.cz
Mon Sep 17 07:02:14 UTC 2018


Hello, 

I did some performance measurements with SPEC CPU 2017 in the fp rate variant
(i.e., utilizing all CPU cores). The results look like this:

8-NUMA Hypervisor specfp2017 - 124
1-NUMA Hypervisor specfp2017 - 103
2-NUMA Hypervisor specfp2017 - 120

8-NUMA Virtual (on 8N Hypervisor) specfp2017 - 92
1-NUMA Virtual (on 1N Hypervisor) specfp2017 - 95.2
2-NUMA Virtual (on 2N Hypervisor) specfp2017 - 98   (memory strict)
2-NUMA Virtual (on 2N Hypervisor) specfp2017 - 98.1 (memory interleave)
2x 1-NUMA Virtual (on 2N Hypervisor) specfp2017 - 117.2 (sum for both)
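To put the scores in perspective, the implied virtualization overhead can be computed directly (a small illustrative sketch; the numbers are the SPECrate scores above, nothing new is measured here):

```python
# Virtualization overhead per topology, from the SPECrate 2017 fp scores
# quoted above (higher score is better).
scores = {
    "8-NUMA": {"host": 124.0, "guest": 92.0},
    "1-NUMA": {"host": 103.0, "guest": 95.2},
    "2-NUMA": {"host": 120.0, "guest": 98.0},
}

for topo, s in scores.items():
    loss = (1 - s["guest"] / s["host"]) * 100
    print(f"{topo}: {loss:.1f}% below bare metal")
```

Notably, the 8-NUMA guest loses the most relative to its hypervisor, even though the 8-NUMA hypervisor is the fastest bare-metal configuration.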


On Fri, Sep 14, 2018 at 03:40:56PM +0200, Lukas Hejtmanek wrote:
> Hello again,
> 
> When iozone writes slowly, this is what slabtop looks like:
> 62476752 62476728   0%    0.10K 1601968       39   6407872K buffer_head
> 1000678 999168   0%    0.56K 142954        7    571816K radix_tree_node
> 132184 125911   0%    0.03K   1066      124      4264K kmalloc-32
> 118496 118224   0%    0.12K   3703       32     14812K kmalloc-node
>  73206  56467   0%    0.19K   3486       21     13944K dentry
>  34816  33247   0%    0.12K   1024       34      4096K kernfs_node_cache
>  34496  29031   0%    0.06K    539       64      2156K kmalloc-64
>  23283  22707   0%    1.05K   7761        3     31044K ext4_inode_cache
>  16940  16052   0%    0.57K   2420        7      9680K inode_cache
>  14464   4124   0%    0.06K    226       64       904K anon_vma_chain
>  11900  11841   0%    0.14K    425       28      1700K ext4_groupinfo_4k
>  11312   9861   0%    0.50K   1414        8      5656K kmalloc-512
>  10692  10066   0%    0.04K    108       99       432K ext4_extent_status
>  10688   4238   0%    0.25K    668       16      2672K kmalloc-256
>   8120   2420   0%    0.07K    145       56       580K anon_vma
>   8040   4563   0%    0.20K    402       20      1608K vm_area_struct
>   7488   3845   0%    0.12K    234       32       936K kmalloc-96
>   7456   7061   0%    1.00K   1864        4      7456K kmalloc-1024
>   7234   7227   0%    4.00K   7234        1     28936K kmalloc-4096
> 
> 
> And this is /proc/$PID/stack of iozone while it is eating CPU but not writing data:
> 
> [<ffffffffba78151b>] find_get_entry+0x1b/0x100
> [<ffffffffba781de0>] pagecache_get_page+0x30/0x2a0
> [<ffffffffc06ec12b>] ext4_da_get_block_prep+0x27b/0x440 [ext4]
> [<ffffffffba840d8b>] __find_get_block_slow+0x3b/0x150
> [<ffffffffba840ebd>] unmap_underlying_metadata+0x1d/0x70
> [<ffffffffc06ec960>] ext4_block_write_begin+0x2e0/0x520 [ext4]
> [<ffffffffc06ebeb0>] ext4_inode_attach_jinode.part.72+0xa0/0xa0 [ext4]
> [<ffffffffc041f9f9>] jbd2__journal_start+0xd9/0x1e0 [jbd2]
> [<ffffffffba80511a>] __check_object_size+0xfa/0x1d8
> [<ffffffffba946b85>] iov_iter_copy_from_user_atomic+0xa5/0x330
> [<ffffffffba780dcb>] generic_perform_write+0xfb/0x1d0
> [<ffffffffba7831ca>] __generic_file_write_iter+0x16a/0x1b0
> [<ffffffffc06e7220>] ext4_file_write_iter+0x90/0x370 [ext4]
> [<ffffffffc06e7190>] ext4_dax_fault+0x140/0x140 [ext4]
> [<ffffffffba6aef01>] update_curr+0xe1/0x160
> [<ffffffffba808890>] new_sync_write+0xe0/0x130
> [<ffffffffba809010>] vfs_write+0xb0/0x190
> [<ffffffffba80a452>] SyS_write+0x52/0xc0
> [<ffffffffba603b7d>] do_syscall_64+0x8d/0xf0
> [<ffffffffbac15c4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> 
> On Fri, Sep 14, 2018 at 03:36:59PM +0200, Lukas Hejtmanek wrote:
> > Hello,
> > 
> > OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The
> > issue with iozone remains the same.
> > 
> > The SPEC run works; however, it still runs slower than in the 1-NUMA case.
> > 
> > The corrected XML looks like follows:
> > <cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
> > <cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='1' /><vcpupin vcpu='2' cpuset='2' /><vcpupin vcpu='3' cpuset='3' /><vcpupin vcpu='4' cpuset='4' /><vcpupin vcpu='5' cpuset='5' /><vcpupin vcpu='6' cpuset='6' /><vcpupin vcpu='7' cpuset='7' /><vcpupin vcpu='8' cpuset='8' /><vcpupin vcpu='9' cpuset='9' /><vcpupin vcpu='10' cpuset='10' /><vcpupin vcpu='11' cpuset='11' /><vcpupin vcpu='12' cpuset='12' /><vcpupin vcpu='13' cpuset='13' /><vcpupin vcpu='14' cpuset='14' /><vcpupin vcpu='15' cpuset='15' /><vcpupin vcpu='16' cpuset='16' /><vcpupin vcpu='17' cpuset='17' /><vcpupin vcpu='18' cpuset='18' /><vcpupin vcpu='19' cpuset='19' /><vcpupin vcpu='20' cpuset='20' /><vcpupin vcpu='21' cpuset='21' /><vcpupin vcpu='22' cpuset='22' /><vcpupin vcpu='23' cpuset='23' /><vcpupin vcpu='24' cpuset='24' /><vcpupin vcpu='25' cpuset='25' /><vcpupin vcpu='26' cpuset='26' /><vcpupin vcpu='27' cpuset='27' /><vcpupin vcpu='28' cpuset='28' /><vcpupin vcpu='29' cpuset='29' /><vcpupin vcpu='30' cpuset='30' /><vcpupin vcpu='31' cpuset='31' /></cputune>
> > <numatune><memory mode='strict' nodeset='0-7'/></numatune>
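Rather than hand-editing the long one-line pinning and NUMA-cell XML above, it can be generated; a minimal Python sketch, with the cell count, vCPUs per cell, and cell memory taken from the config above:

```python
# Generate the 8-cell guest NUMA topology and the corrected 1:1 vcpupin
# list programmatically (values match the libvirt XML above; memory is KiB).
CELLS, CPUS_PER_CELL, CELL_MEM_KIB = 8, 4, 62_000_000

cells = "".join(
    f"<cell cpus='{c * CPUS_PER_CELL}-{(c + 1) * CPUS_PER_CELL - 1}' "
    f"memory='{CELL_MEM_KIB}' />"
    for c in range(CELLS)
)
pins = "".join(
    f"<vcpupin vcpu='{v}' cpuset='{v}' />"
    for v in range(CELLS * CPUS_PER_CELL)
)
print(f"<numa>{cells}</numa>")
print(f"<cputune>{pins}</cputune>")
```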
> > 
> > In this case, the first part took more than 1700 seconds, while the 1-NUMA
> > config finishes it in 1646 seconds.
> > 
> > The hypervisor with the 1-NUMA config finishes in 1470 seconds; the
> > hypervisor with the 8-NUMA config finishes in 900 seconds.
> > 
> > On Fri, Sep 14, 2018 at 02:06:26PM +0200, Lukas Hejtmanek wrote:
> > > Hello,
> > > 
> > > I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, running the
> > > 8-NUMA performance configuration:
> > > 
> > > This is from the hypervisor:
> > > [root@hde10 ~]# lscpu
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                64
> > > On-line CPU(s) list:   0-63
> > > Thread(s) per core:    2
> > > Core(s) per socket:    16
> > > Socket(s):             2
> > > NUMA node(s):          8
> > > Vendor ID:             AuthenticAMD
> > > CPU family:            23
> > > Model:                 1
> > > Model name:            AMD EPYC 7351 16-Core Processor
> > > Stepping:              2
> > > CPU MHz:               1800.000
> > > CPU max MHz:           2400.0000
> > > CPU min MHz:           1200.0000
> > > BogoMIPS:              4800.05
> > > Virtualization:        AMD-V
> > > L1d cache:             32K
> > > L1i cache:             64K
> > > L2 cache:              512K
> > > L3 cache:              8192K
> > > NUMA node0 CPU(s):     0-3,32-35
> > > NUMA node1 CPU(s):     4-7,36-39
> > > NUMA node2 CPU(s):     8-11,40-43
> > > NUMA node3 CPU(s):     12-15,44-47
> > > NUMA node4 CPU(s):     16-19,48-51
> > > NUMA node5 CPU(s):     20-23,52-55
> > > NUMA node6 CPU(s):     24-27,56-59
> > > NUMA node7 CPU(s):     28-31,60-63
> > > 
> > > I'm running one big virtual machine on this hypervisor - almost the whole
> > > memory plus all physical CPUs.
> > > 
> > > This is what I'm seeing inside:
> > > 
> > > root@zenon10:~# lscpu
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                32
> > > On-line CPU(s) list:   0-31
> > > Thread(s) per core:    1
> > > Core(s) per socket:    4
> > > Socket(s):             8
> > > NUMA node(s):          8
> > > Vendor ID:             AuthenticAMD
> > > CPU family:            23
> > > Model:                 1
> > > Model name:            AMD EPYC 7351 16-Core Processor
> > > Stepping:              2
> > > CPU MHz:               2400.000
> > > BogoMIPS:              4800.00
> > > Virtualization:        AMD-V
> > > Hypervisor vendor:     KVM
> > > Virtualization type:   full
> > > L1d cache:             64K
> > > L1i cache:             64K
> > > L2 cache:              512K
> > > NUMA node0 CPU(s):     0-3
> > > NUMA node1 CPU(s):     4-7
> > > NUMA node2 CPU(s):     8-11
> > > NUMA node3 CPU(s):     12-15
> > > NUMA node4 CPU(s):     16-19
> > > NUMA node5 CPU(s):     20-23
> > > NUMA node6 CPU(s):     24-27
> > > NUMA node7 CPU(s):     28-31
> > > 
> > > This is the virtual machine configuration (I tried different numatune
> > > settings, but the result was the same):
> > > 
> > > <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> > >         <name>one-55782</name>
> > >         <vcpu><![CDATA[32]]></vcpu>
> > >         <cputune>
> > >                 <shares>32768</shares>
> > >         </cputune>
> > >         <memory>507904000</memory>
> > >         <os>
> > >                 <type arch='x86_64'>hvm</type>
> > >         </os>
> > >         <devices>
> > >                 <emulator><![CDATA[/usr/bin/kvm]]></emulator>
> > >                 <disk type='file' device='disk'>
> > >                         <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
> > >                         <target dev='vda'/>
> > >                         <driver name='qemu' type='qcow2' cache='unsafe'/>
> > >                 </disk>
> > >                 <disk type='file' device='disk'>
> > >                         <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
> > >                         <target dev='vdc'/>
> > >                         <driver name='qemu' type='raw' cache='unsafe'/>
> > >                 </disk>
> > >                 <disk type='file' device='disk'>
> > >                         <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
> > >                         <target dev='vdd'/>
> > >                         <driver name='qemu' type='raw' cache='unsafe'/>
> > >                 </disk>
> > >                 <disk type='file' device='disk'>
> > >                         <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
> > >                         <target dev='vde'/>
> > >                         <driver name='qemu' type='raw' cache='unsafe'/>
> > >                 </disk>
> > >                 <disk type='file' device='cdrom'>
> > >                         <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
> > >                         <target dev='vdb'/>
> > >                         <readonly/>
> > >                         <driver name='qemu' type='raw'/>
> > >                 </disk>
> > >                 <interface type='bridge'>
> > >                         <source bridge='br0'/>
> > >                         <mac address='02:00:93:fb:3b:78'/>
> > >                         <target dev='one-55782-0'/>
> > >                         <model type='virtio'/>
> > >                         <filterref filter='no-arp-mac-spoofing'>
> > >                                 <parameter name='IP' value='147.251.59.120'/>
> > >                         </filterref>
> > >                 </interface>
> > >         </devices>
> > >         <features>
> > >                 <pae/>
> > >                 <acpi/>
> > >         </features>
> > >         <!-- RAW data follows: -->
> > > <cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
> > > <cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='2' /><vcpupin vcpu='2' cpuset='4' /><vcpupin vcpu='3' cpuset='6' /><vcpupin vcpu='4' cpuset='8' /><vcpupin vcpu='5' cpuset='10' /><vcpupin vcpu='6' cpuset='12' /><vcpupin vcpu='7' cpuset='14' /><vcpupin vcpu='8' cpuset='16' /><vcpupin vcpu='9' cpuset='18' /><vcpupin vcpu='10' cpuset='20' /><vcpupin vcpu='11' cpuset='22' /><vcpupin vcpu='12' cpuset='24' /><vcpupin vcpu='13' cpuset='26' /><vcpupin vcpu='14' cpuset='28' /><vcpupin vcpu='15' cpuset='30' /><vcpupin vcpu='16' cpuset='1' /><vcpupin vcpu='17' cpuset='3' /><vcpupin vcpu='18' cpuset='5' /><vcpupin vcpu='19' cpuset='7' /><vcpupin vcpu='20' cpuset='9' /><vcpupin vcpu='21' cpuset='11' /><vcpupin vcpu='22' cpuset='13' /><vcpupin vcpu='23' cpuset='15' /><vcpupin vcpu='24' cpuset='17' /><vcpupin vcpu='25' cpuset='19' /><vcpupin vcpu='26' cpuset='21' /><vcpupin vcpu='27' cpuset='23' /><vcpupin vcpu='28' cpuset='25' /><vcpupin vcpu='29' cpuset='27' /><vcpupin vcpu='30' cpuset='29' /><vcpupin vcpu='31' cpuset='31' /></cputune>
> > > <numatune><memory mode='preferred' nodeset='0'/></numatune>
> > > <devices><serial type='pty'><target port='0'/></serial><console type='pty'><target type='serial' port='0'/></console><channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel></devices>
> > > <devices><hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source></hostdev></devices>
> > > 
> > >         <devices><controller type='pci' index='1' model='pci-bridge'/><controller type='pci' index='2' model='pci-bridge'/><controller type='pci' index='3' model='pci-bridge'/><controller type='pci' index='4' model='pci-bridge'/><controller type='pci' index='5' model='pci-bridge'/></devices>
> > >         <metadata>
> > >                 <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]>          </system_datastore>
> > >         </metadata>
> > > </domain>
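The first-attempt even/odd pinning in the config above is what scattered each guest NUMA cell across host nodes. A small sketch (with the host CPU-to-node layout copied from the hypervisor's lscpu output quoted earlier) makes the mismatch explicit:

```python
# Host NUMA layout from lscpu: node N owns CPUs 4N..4N+3 and 32+4N..32+4N+3.
host_node = {}
for node in range(8):
    for cpu in list(range(4 * node, 4 * node + 4)) + \
               list(range(32 + 4 * node, 32 + 4 * node + 4)):
        host_node[cpu] = node

# The first-attempt pinning: vCPUs 0-15 -> even host CPUs, 16-31 -> odd.
pin = {v: 2 * v for v in range(16)}
pin.update({v: 2 * (v - 16) + 1 for v in range(16, 32)})

# Guest cell 0 is vCPUs 0-3; under this pinning they land on host CPUs
# 0, 2, 4, 6 -- which belong to two different host NUMA nodes.
cell0_nodes = {host_node[pin[v]] for v in range(4)}
print(cell0_nodes)  # {0, 1}
```

So before the 1:1 correction, every guest cell straddled two host nodes, which by itself explains remote-memory penalties in the 8-NUMA guest.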
> > > 
> > > If I run, e.g., SPEC 2017 on the virtual machine, I can see:
> > > 
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND         
> > >  1350 root      20   0  843136 830068   2524 R  78.1  0.2 513:16.16 bwaves_r_base.m 
> > >  2456 root      20   0  804608 791264   2524 R  76.6  0.2 491:39.92 bwaves_r_base.m 
> > >  4631 root      20   0  843136 829892   2344 R  75.8  0.2 450:16.04 bwaves_r_base.m 
> > >  6441 root      20   0  802580 790212   2532 R  75.0  0.2 120:37.54 bwaves_r_base.m 
> > >  7991 root      20   0  784676 772092   2576 R  75.0  0.2 387:15.39 bwaves_r_base.m 
> > >  8142 root      20   0  843136 830044   2496 R  75.0  0.2 384:39.02 bwaves_r_base.m 
> > >  8234 root      20   0  843136 830064   2524 R  75.0  0.2  99:04.48 bwaves_r_base.m 
> > >  8578 root      20   0  749240 736604   2468 R  73.4  0.2 375:45.66 bwaves_r_base.m 
> > >  9974 root      20   0  784676 771984   2468 R  73.4  0.2 348:01.36 bwaves_r_base.m 
> > > 10396 root      20   0  802580 790264   2576 R  73.4  0.2 340:08.40 bwaves_r_base.m 
> > > 12932 root      20   0  843136 830024   2480 R  73.4  0.2 288:39.76 bwaves_r_base.m 
> > > 13113 root      20   0  784676 771864   2348 R  71.9  0.2 284:47.34 bwaves_r_base.m 
> > > 13518 root      20   0  784676 762816   2540 R  71.9  0.2 276:31.58 bwaves_r_base.m 
> > > 14443 root      20   0  784676 771984   2468 R  71.9  0.2 260:01.82 bwaves_r_base.m 
> > > 12791 root      20   0  784676 772060   2544 R  70.3  0.2 291:43.96 bwaves_r_base.m 
> > > 10544 root      20   0  843136 830068   2520 R  68.8  0.2 336:47.43 bwaves_r_base.m 
> > > 15464 root      20   0  784676 762880   2608 R  60.9  0.2 239:19.14 bwaves_r_base.m 
> > > 15487 root      20   0  784676 772048   2532 R  60.2  0.2 238:37.07 bwaves_r_base.m 
> > > 16824 root      20   0  784676 772120   2604 R  55.5  0.2 212:10.92 bwaves_r_base.m 
> > > 17255 root      20   0  843136 830012   2468 R  54.7  0.2 203:22.89 bwaves_r_base.m 
> > > 17962 root      20   0  784676 772004   2488 R  54.7  0.2 188:26.07 bwaves_r_base.m 
> > > 17505 root      20   0  843136 830068   2520 R  53.1  0.2 198:04.25 bwaves_r_base.m 
> > > 27767 root      20   0  784676 771860   2344 R  52.3  0.2 592:25.95 bwaves_r_base.m 
> > > 24458 root      20   0  843136 829888   2344 R  50.8  0.2 658:23.70 bwaves_r_base.m 
> > > 30746 root      20   0  747376 735160   2604 R  43.0  0.2 556:47.67 bwaves_r_base.m 
> > > 
> > > The TIME+ values should be roughly the same, but huge differences are obvious.
> > > 
> > > This is what I see on the hypervisor:
> > >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                     
> > >  18201 oneadmin  20   0  474.0g 473.3g   1732 S  2459 94.0  33332:54 kvm                                                                         
> > >    369 root      20   0       0      0      0 R 100.0  0.0 768:12.85 kswapd1                                                                     
> > >    368 root      20   0       0      0      0 R  94.1  0.0 869:05.61 kswapd0  
> > > 
> > > i.e., kswapd is eating a whole CPU, even though swap is turned off.
> > > 
> > > [root@hde10 ~]# free
> > >               total        used        free      shared  buff/cache   available
> > > Mem:      528151432   503432580     1214048       34740    23504804    21907800
> > > Swap:             0           0           0
> > > 
> > > Hypervisor is 
> > > [root@hde10 ~]# cat /etc/redhat-release
> > > CentOS Linux release 7.5.1804 (Core)
> > > 
> > > qemu-kvm-1.5.3-156.el7_5.5.x86_64
> > > 
> > > The virtual machine runs Debian 9.
> > > 
> > > 
> > > Moreover, I'm using this type of disk for the virtual machines:
> > > <disk type='file' device='disk'>
> > >                         <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
> > >                         <target dev='vde'/>
> > >                         <driver name='qemu' type='raw' cache='unsafe'/>
> > >                 </disk>
> > > 
> > > If I keep cache='unsafe' and run an iozone test on really big files (e.g.,
> > > 8x 100GB), I can see huge cache pressure on the hypervisor - all 8 kswapd
> > > threads run at 100% and slow things down. The disk under the datastore is
> > > an Intel 4500 NVMe SSD.
> > > 
> > > If I set cache='none', the kswapd threads are idle and disk writes are
> > > pretty fast; however, with the 8-NUMA configuration, writes slow down to
> > > less than 10 MB/s as soon as the amount of written data roughly equals the
> > > memory size of the virtual machine. iozone then sits at 100% CPU,
> > > apparently traversing page lists. If I do the same with the 1-NUMA
> > > configuration, everything is OK except for a performance penalty of about 25%.
> > > 
> > 
> 

-- 
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja 
  is not an official job title



