[vfio-users] [OVMF+VIFO+QEMU] Large VM RAM allocation = Long Boot Times

Thu Nov 22 06:02:47 UTC 2018

Hi Laszlo,

That did it! Boot times are down to ~25 seconds-to-splash with 8
cores, 2 threads and 24G of assigned RAM.
Not as fast as some bare-metal machines, but far more bearable &
manageable than *minutes*!

I compiled a custom kernel (version 4.19.2) with the PREEMPT_NONE=y
(menuconfig: General setup -> Preemption Model ->  No Forced
Preemption (Server)) parameter.
The default on Arch seems to be PREEMPT=y, which equates to
"Preemptible Kernel (Low-Latency Desktop)"

4.19.2 is slightly newer than what I was running previously (4.18.10),
but I'm confident that this setting made the difference.
At first I incorrectly configured the Arch package build and ended up
with an Arch-stock kernel with the default preemption settings, rather
than my custom settings.
The stock build had the long boot times.

So many different configuration settings at every level of the stack
(firmware, kernel, qemu, guest).
Amazing that any of this works at all!

For others reading along, the working config is:

Kernel version: 4.19.2, built with PREEMPT_NONE=y
OVMF version: ovmf-x64-0-20181105.757.g114a75ad86

qemu command line:
/usr/sbin/qemu-system-x86_64 \
        -nodefconfig -no-user-config -nodefaults -nographic \
        -name win10,debug-threads=on -machine
q35,accel=kvm,usb=off,vmport=off,dump-guest-core=off,kernel-irqchip=on
\
        -monitor unix:win10.sock,server,nowait -debugcon
file:debugcon.log -global isa-debugcon.iobase=0x402 \
        \
        -mem-prealloc -mem-path /dev/hugepages -m 24576 \
        \
        -realtime mlock=on \
        -smp cores=8,threads=2,sockets=1 -enable-kvm \
        -cpu host,hv_time,hv_vpindex,hv_reset,hv_runtime,hv_crash,hv_synic,hv_stimer,hv_spinlocks=0x1fff,hv_vendor_id=Zegelin,kvm=off,l3-cache=on,-hypervisor,migratable=no,+invtsc
\
        \
        -rtc base=localtime,driftfix=slew -no-hpet -global
kvm-pit.lost_tick_policy=discard \
        -global ICH9-LPC.disable_s3=1 \
        -global ICH9-LPC.disable_s4=1 \
        \
        -drive file=/vmstorage/firmware/ovmf-x64-0-20181105.757.g114a75ad86/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on
\
        \
        -object iothread,id=io1 \
        -device pcie-root-port,id=pcie0 \
        \
        -netdev bridge,id=netdev0,br=br0 -device
virtio-net-pci,netdev=netdev0,mac=52:54:00:19:34:57 \
        -device
virtio-scsi-pci,id=scsi0,ioeventfd=on,iothread=io1,num_queues=4 \
        \
        -blockdev
node-name=block0,driver=host_device,filename=/dev/zvol/vmstorage/images/adam-win10,aio=native,cache.direct=on
\
        -blockdev node-name=block1,driver=raw,file=block0,cache.direct=on \
        -device scsi-hd,drive=block1,bus=scsi0.0,write-cache=on \
        \
        -device vfio-pci,host=7:00.0,bus=pcie0,addr=0x00,multifunction=on \
        -device vfio-pci,host=7:00.1,bus=pcie0,addr=0x00.0x1 \
        -device vfio-pci,host=05:00.0 \
        -device vfio-pci,host=06:00.0 \

Many thanks,
Adam
On Mon, Nov 19, 2018 at 2:46 PM A Z <adam at zegelin.com> wrote:
>
> Hi Laszlo,
>
> Looks like CONFIG_PREEMPT_VOLUNTARY isn't set.
>
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y
> CONFIG_DEBUG_PREEMPT=y
> # CONFIG_PREEMPTIRQ_EVENTS is not set
> # CONFIG_PREEMPT_TRACER is not set
>
> I'll recompile the kernel tonight if I get time with that flag set and
> see if it makes a difference.
>
> - Adam
> On Mon, Nov 19, 2018 at 10:35 AM Laszlo Ersek <lersek at redhat.com> wrote:
> >
> > On 11/16/18 22:36, A Z wrote:
> > > And just a quick discovery:
> > >
> > > With 24G assigned (-m 24576)...
> > >
> > > Reducing the number of CPU cores to 1, 2 or 3 with
> > > ovmf-x64-0-20181105.757.g114a75ad86 results in a really fast boot to
> > > the TianoCore splash.
> > > As soon as the number of CPU cores jumps to 4, the VM takes a hit in
> > > terms of boot time.
> > >
> > > 1 core, 1 thread: ~16 seconds.
> > > 2 cores, 1 thread: ~16 seconds
> > > 3 cores, 1 thread: ~19 seconds
> > > 4 cores, 1 thread: ~111 seconds
> > > 8 cores, 1 thread: ~150 seconds
> > > 8 cores, 2 threads: ~334 seconds
> > >
> > > (timing is based on grepping debugcon.log for "[Bds]OsIndication:",
> > > which seems to occur right after the splash screen is displayed)
> > >
> > > Significant pauses can be observed at points in the boot process,
> > > based on the debugcon output...
> > >
> > > First pause is here:
> > >
> > >> Loading PEIM at 0x0007FEC1000 EntryPoint=0x0007FEC6985 CpuMpPei.efi
> > >> Register PPI Notify: F894643D-C449-42D1-8EA8-85BDD8C65BDE
> > >> Notify: PPI Guid: F894643D-C449-42D1-8EA8-85BDD8C65BDE, Peim notify entry point: 7FEC58C4
> > >> AP Loop Mode is 1
> > >> WakeupBufferStart = 9F000, WakeupBufferSize = 1000
> > >
> > > Then here:
> > >
> > >> Loading driver at 0x0007F96C000 EntryPoint=0x0007F9743EE CpuDxe.efi
> > >> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7F4AC618
> > >> ProtectUefiImageCommon - 0x7F4AE040
> > >>   - 0x000000007F96C000 - 0x000000000000E3C0
> > >> Paging: added 512 pages to page table pool
> > >> CurrentPagingContext:
> > >>   MachineType   - 0x8664
> > >>   PageTableBase - 0x7FA01000
> > >>   Attributes    - 0xC0000006
> > >> InstallProtocolInterface: 26BACCB1-6F42-11D4-BCE7-0080C73C8881 7F979E20
> > >> MemoryProtectionCpuArchProtocolNotify:
> > > <snip>
> > >> ConvertPages: failed to find range A0000 - FFFFF
> > >> ConvertPages: failed to find range 80000000 - 8FFFFFFF
> > >> ConvertPages: failed to find range 90000000 - FBFFFFFF
> > >> ConvertPages: failed to find range FEC00000 - FEC00FFF
> > >> Failed to update capability: [8] 00000000FED00000 - 00000000FED003FF (C700000000000001 -> C700000000026001)
> > >> ConvertPages: failed to find range FED1C000 - FED1FFFF
> > >> ConvertPages: failed to find range FEE00000 - FEEFFFFF
> > >> ConvertPages: failed to find range FFE00000 - FFFFFFFF
> > >> AP Loop Mode is 1
> > >> WakeupBufferStart = 9F000, WakeupBufferSize = 229
> > >
> > > And the last one:
> > >
> > >> Loading driver at 0x0007F918000 EntryPoint=0x0007F91D876 PciHostBridgeDxe.efi
> > >> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7F01D798
> > >> ProtectUefiImageCommon - 0x7F01D1C0
> > >>   - 0x000000007F918000 - 0x0000000000009F00
> > >> Select Item: 0x0
> > >> FW CFG Signature: 0x554D4551
> > >> Select Item: 0x1
> > >> FW CFG Revision: 0x3
> > >> QemuFwCfg interface (DMA) is supported.
> > >> Select Item: 0x19
> > >> InitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
> > >> RootBridge: PciRoot(0x0)
> > >>   Support/Attr: 70069 / 70069
> > >>     DmaAbove4G: No
> > >> NoExtConfSpace: No
> > >>      AllocAttr: 3 (CombineMemPMem Mem64Decode)
> > >>            Bus: 0 - FF Translation=0
> > >>             Io: 6000 - FFFF Translation=0
> > >>            Mem: 90000000 - FBFFFFFF Translation=0
> > >>     MemAbove4G: 800000000 - FFFFFFFFF Translation=0
> > >>           PMem: FFFFFFFFFFFFFFFF - 0 Translation=0
> > >>    PMemAbove4G: FFFFFFFFFFFFFFFF - 0 Translation=0
> > >
> > >
> > > Regards,
> > > Adam
> > >
> > >
> > > On Thu, Nov 15, 2018 at 10:28 PM A Z <adam at zegelin.com> wrote:
> > >>
> > >> On Thu, Nov 15, 2018 at 2:47 AM Laszlo Ersek <lersek at redhat.com> wrote:
> > >>>
> > >>> If nothing at all had changed in your configuration, then the behavior
> > >>> wouldn't have changed either. You must certainly have upgraded some
> > >>> components since then, between host kernel, QEMU, and OVMF (likely all
> > >>> of them). Please state the version numbers that work and don't work, and
> > >>> try to isolate the issue to a version change that affects a *single*
> > >>> component.
> > >>
> > >> I wasn't clear on this, sorry. I meant that the VM configuration --
> > >> the qemu command-line parameters, *except* for the OVMF firmware image
> > >> and the amount of memory assigned, haven't changed since I got
> > >> everything up-and-running (i.e., CPU caps, disk, network and all the
> > >> PCI-passthrough assignments, etc haven't changed.)
> > >> I added more memory to the host a few months ago, and upped the VM
> > >> allocation to match at the same time. This made the boot times even
> > >> worse.
> > >> I've also occasionally downloaded a newer version of OVMF to see if it
> > >> fixes this boot time issue.
> > >> The host OS kernel, qemu and other package versions have changed over
> > >> this period as I've installed OS updates. So has the build of Windows
> > >> 10.
> > >>
> > >>> I vaguely recall similar reports from the past. I've been under the
> > >>> impression that this kind of issue has been fixed. Thus, I'd guess you
> > >>> are running old(er) versions of the affected component (whichever it
> > >>> is), or else you are running bleeding edge, and the bug has crept back
> > >>> in (new regression).
> > >>
> > >> It's possibly the latter.
> > >>
> > >> Host kernel: 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22
> > >> UTC 2018 x86_64 GNU/Linux
> > >> qemu version: QEMU emulator version 3.0.0
> > >> OMVF build: ovmf-x64-0-20181105.757.g114a75ad86 (downloaded from www.kraxel.org)
> > >>
> > >>
> > >> On Thu, Nov 15, 2018 at 6:14 AM Hristo Iliev <hristo at hiliev.eu> wrote:
> > >>
> > >>> This sounds quite like a problem I used to have in the past with certain versions of OVMF and the Linux kernel. If I remember correctly, some memory ranges would get marked as non-cacheable, which resulted in the terrible slowdown you describe. The resolution back then was to stick to older versions of both the firmware and the kernel.
> > >>>
> > >>> I still keep around an older version of OVMF that I used until recently - edk2.git-ovmf-x64-0-20160714.b1992.gfc3005.noarch.rpm. You could download the RPM here and try if it works for you:
> > >>>
> > >>> https://exa.flopser.de/s/z3cjqFUlZ2gBPCy
> > >>
> > >> I will try this firmware build. I have read similar things about
> > >> versions from 2016 working correctly.
> > >> But TBH I'd also prefer to run something newer -- bug fixes (and
> > >> regressions!), security patches, etc.
> > >>
> > >>> Recent QEMU versions started complaining about firmware incompatibilities, so I tried the latest RPM from Kraxel (https://www.kraxel.org/repos/jenkins/edk2/) and it works just fine. The host system is Arch Linux with the latest Arch kernel and QEMU.
> > >>
> > >> While I'm not running today's build, I am running one downloaded from
> > >> Kraxel.org within the past week.
> > >>
> > >>
> > >> On Thu, Nov 15, 2018 at 7:28 AM Alex Williamson
> > >> <alex.williamson at redhat.com> wrote:
> > >>
> > >>> If you can confirm that your system uses a NUMA memory configuration
> > >>> and that 24G exceeds the node local memory, then this is a known
> > >>> regression in the Linux kernel mm.  AIUI it's being worked upstream,
> > >>> workarounds are to either use hugepages or locked memory for the VM.
> > >>
> > >> I'm currently using hugepages. Not sure how to lock the memory, sorry.
> > >>
> > >> It is a dual CPU system. The topology:
> > >>
> > >> $ sudo numactl -H
> > >> available: 2 nodes (0-1)
> > >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
> > >> node 0 size: 32143 MB
> > >> node 0 free: 236 MB
> > >> node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42
> > >> 43 44 45 46 47
> > >> node 1 size: 32232 MB
> > >> node 1 free: 95 MB
> > >> node distances:
> > >> node   0   1
> > >>   0:  10  21
> > >>   1:  21  10
> > >>
> > >>
> > >>
> > >> For reference, the qemu command line I'm exec'ing from a script is:
> > >>
> > >> /usr/sbin/qemu-system-x86_64 \
> > >>         -name win10,debug-threads=on -monitor
> > >> unix:win10.sock,server,nowait -nodefconfig -debugcon file:debugcon.log
> > >> -global isa-debugcon.iobase=0x402 \
> > >>         -nodefconfig -no-user-config -nodefaults -nographic \
> > >>         -machine q35,accel=kvm \
> > >>         -enable-kvm -cpu
> > >> host,kvm=off,hv_spinlocks=0x1fff,hv_relaxed,hv_time,hv_vendor_id=Nvidia43FIX
> > >> \
> > >>         -rtc base=localtime,clock=host,driftfix=slew \
> > >>         \
> > >>         -drive file=/vmstorage/firmware/ovmf-x64-0-20181105.757.g114a75ad86/OVMF-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on
> > >> \
> > >>         -drive file=/home/adam/win10-vars.fd,if=pflash,format=raw,readonly=off,unit=1
> > >> \
> > >>         \
> > >>         -object iothread,id=io1 \
> > >>         \
> > >>         -mem-path /dev/hugepages  \
> > >>         -m 24576 \
> > >>         -smp cores=8,threads=2,sockets=1 \
> > >>         \
> > >>         -netdev bridge,id=netdev0,br=br0 -device
> > >> virtio-net-pci,netdev=netdev0,mac=52:54:00:19:34:57 \
> > >>         -device
> > >> virtio-scsi-pci,id=scsi0,ioeventfd=on,iothread=io1,num_queues=4 \
> > >>         \
> > >>         -device vfio-pci,host=7:00.0,addr=0x6,multifunction=on \
> > >>         -device vfio-pci,host=7:00.1,addr=0x6.0x1 \
> > >>         -device vfio-pci,host=05:00.0 \
> > >>         -device vfio-pci,host=06:00.0
> >
> > Can you grep your kernel config for "PREEMPT"? I remember that for the
> > Lime Tech guys, CONFIG_PREEMPT_VOLUNTARY made a huge difference, in the
> > same spots. It might help you too.
> >
> > I do notice that your "uname" output already includes "PREEMPT". I don't
> > remember (and don't have time now) to look up what knob does exactly
> > what. Please try massaging these knobs in your host kernel config, and
> > see if they make a difference.
> >
> > When CONFIG_PREEMPT_VOLUNTARY was identified as the one bit making the
> > difference in the Lime Tech case, I requested that they please document
> > their findings on the web somewhere, perhaps in a forum or in a blog
> > post. I'm not sure if that happened after all.
> >
> > For some hand-wavy explanation, I can offer this: the two spots in the
> > OVMF debug log are related to the guest firmware starting up all VCPUs
> > at once, and those VCPUs fighting over some spinlocks (by design). PLE
> > (pause loop exiting) should really help here, but apparently, it doesn't
> > always do that. So, with CONFIG_PREEMPT (or without it, I don't
> > remember), the host kernel threads get scheduled in a way that is
> > exactly "adversarial" to those VCPUs making good progress. E.g. if a
> > VCPU owns a spinlock but the corresponding host kernel thread is
> > de-scheduled for a long time, the other VCPUs / host kernel threads
> > won't make any progress, no matter how hard they work. Yada yada. As I
> > said, hand-wavy, sorry.
> >
> > Thanks
> > Laszlo