[libvirt] [edk2-discuss] [OVMF] resource assignment fails for passthrough PCI GPU

Eduardo Habkost ehabkost at redhat.com
Fri Nov 22 21:48:25 UTC 2019


(+Jiri, +libvir-list)

On Fri, Nov 22, 2019 at 04:58:25PM +0000, Dr. David Alan Gilbert wrote:
> * Laszlo Ersek (lersek at redhat.com) wrote:
> > (+Dave, +Eduardo)
> > 
> > On 11/22/19 00:00, dann frazier wrote:
> > > On Tue, Nov 19, 2019 at 06:06:15AM +0100, Laszlo Ersek wrote:
> > >> On 11/19/19 01:54, dann frazier wrote:
> > >>> On Fri, Nov 15, 2019 at 11:51:18PM +0100, Laszlo Ersek wrote:
> > >>>> On 11/15/19 19:56, dann frazier wrote:
> > >>>>> Hi,
> > >>>>>   I'm trying to passthrough an Nvidia GPU to a q35 KVM guest, but UEFI
> > >>>>> is failing to allocate resources for it. I have no issues if I boot w/
> > >>>>> a legacy BIOS, and it works fine if I tell the linux guest to do the
> > >>>>> allocation itself - but I'm looking for a way to make this work w/
> > >>>>> OVMF by default.
> > >>>>>
> > >>>>> I posted a debug log here:
> > >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563/+attachment/5305740/+files/q35-uefidbg.log
> > >>>>>
> > >>>>> Linux guest lspci output is also available for both seabios/OVMF boots here:
> > >>>>>   https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563
> > >>>>
> > >>>> By default, OVMF exposes a 64-bit MMIO aperture for PCI MMIO BAR
> > >>>> allocation that is 32GB in size. The generic PciBusDxe driver collects,
> > >>>> orders, and assigns / allocates the MMIO BARs, but it can work only out
> > >>>> of the aperture that platform code advertises.
> > >>>>
> > >>>> Your GPU's region 1 is itself 32GB in size. Given that there are further
> > >>>> PCI devices in the system with further 64-bit MMIO BARs, the default
> > >>>> aperture cannot accommodate everything. In such an event, PciBusDxe
> > >>>> avoids assigning the largest BARs (to my knowledge), in order to
> > >>>> conserve as much aperture as possible for other devices -- hence
> > >>>> breaking the fewest possible PCI devices.
> > >>>>
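To make the arithmetic concrete -- a rough sketch with a hypothetical BAR
list, not the actual PciBusDxe algorithm:

  GiB, MiB = 1 << 30, 1 << 20
  aperture = 32 * GiB                          # OVMF's default 64-bit MMIO aperture
  bars = [32 * GiB, 256 * MiB, 16 * MiB]       # GPU region 1 plus smaller 64-bit BARs
  print(sum(bars) > aperture)                  # True: everything together does not fit
  print(sum(sorted(bars)[:-1]) <= aperture)    # True: dropping the largest leaves room for the rest
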
> > >>>> You can control the aperture size from the QEMU command line. You can
> > >>>> also do it from the libvirt domain XML, technically speaking. The knob
> > >>>> is experimental, so no stability or compatibility guarantees are made.
> > >>>> (That's also the reason why it's a bit of a hack in the libvirt domain XML.)
> > >>>>
> > >>>> The QEMU cmdline option is described in the following edk2 commit message:
> > >>>>
> > >>>>   https://github.com/tianocore/edk2/commit/7e5b1b670c38
> > >>>
> > >>> Hi Laszlo,
> > >>>
> > >>>   Thanks for taking the time to describe this in detail! The -fw_cfg
> > >>> option did avoid the problem for me.
> > >>
> > >> Good to hear, thanks.
> > >>
> > >>> I also noticed that the above
> > >>> commit message mentions the existence of a 24GB card as a reasoning
> > >>> behind choosing the 32GB default aperture. From what you say below, I
> > >>> understand that bumping this above 64GB could break hosts w/ <= 37
> > >>> physical address bits.
> > >>
> > >> Right.
> > >>
> > >>> What would be the downside of bumping the
> > >>> default aperture to, say, 48GB?
> > >>
> > >> The placement of the aperture is not trivial (please see the code
> > >> comments in the linked commit). The base address of the aperture is
> > >> chosen so that the largest BAR that can fit in the aperture may be
> > >> naturally aligned. (BARs are whole powers of two.)
> > >>
> > >> The largest BAR that can fit in a 48 GB aperture is 32 GB. Therefore
> > >> such an aperture would be aligned at 32 GB -- the lowest base address
> > >> (dependent on guest RAM size) would be 32 GB. Meaning that the aperture
> > >> would end at 32 + 48 = 80 GB. That still breaches the 36-bit phys
> > >> address width.
> > >>
> > >> 32 GB is the largest aperture size that can work with 36-bit phys
> > >> address width; that's the aperture that ends at 64 GB exactly.
> > > 
> > > Thanks, yeah - now that I read the code comments that is clear (as
> > > clear as it can be w/ my low level of base knowledge). In the commit you
> > > mention Gerd (CC'd) had suggested a heuristic-based approach for
> > > sizing the aperture. When you say "PCPU address width" - is that a
> > > function of the available physical bits?
> > 
> > "PCPU address width" is not a "function" of the available physical bits
> > -- it *is* the available physical bits. "PCPU" simply stands for
> > "physical CPU".
> > 
> > > IOW, would that approach
> > > allow OVMF to automatically grow the aperture to the max ^2 supported
> > > by the host CPU?
> > 
> > Maybe.
> > 
> > The current logic in OVMF works from the guest-physical address space
> > size -- as deduced from multiple factors, such as the 64-bit MMIO
> > aperture size, and others -- towards the guest-CPU (aka VCPU) address
> > width. The VCPU address width is important for a bunch of other purposes
> > in the firmware, so OVMF has to calculate it no matter what.
> > 
> > Again, the current logic is to calculate the highest guest-physical
> > address, and then deduce the VCPU address width from that (and then
> > expose it to the rest of the firmware).
> > 
> > Your suggestion would require passing the PCPU (physical CPU) address
> > width from QEMU/KVM into the guest, and reversing the direction of the
> > calculation. The PCPU address width would determine the VCPU address
> > width directly, and then the 64-bit PCI MMIO aperture would be
> > calculated from that.
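
To make the two directions concrete -- a minimal sketch with made-up
numbers, not OVMF's actual code:

  GiB, TiB = 1 << 30, 1 << 40

  # Current direction: highest guest-phys address -> VCPU address width.
  highest_gpa = 64 * GiB - 1              # e.g. a 32 GiB aperture ending at 64 GiB
  vcpu_width = highest_gpa.bit_length()   # 36

  # Proposed direction: PCPU width -> VCPU width -> aperture size.
  pcpu_width = 46                         # hypothetical large host
  vcpu_width = pcpu_width
  print((1 << vcpu_width) // TiB)         # 64 TiB of guest-phys space to carve the aperture from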
> > 
> > However, there are two caveats.
> > 
> > (1) The larger your guest-phys address space (as exposed through the
> > VCPU address width to the rest of the firmware), the more guest RAM you
> > need for page tables. Because, just before entering the DXE phase, the
> > firmware builds 1:1 mapping page tables for the entire guest-phys
> > address space. This is necessary e.g. so you can access any PCI MMIO BAR.
> > 
> > Now consider that you have a huge beefy virtualization host with say 46
> > phys address bits, and a wimpy guest with say 1.5GB of guest RAM. Do you
> > absolutely want tens of *terabytes* for your 64-bit PCI MMIO aperture?
> > Do you really want to pay for the necessary page tables with that meager
> > guest RAM?
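
Back-of-envelope for that page-table cost, counting only the page-directory
level and assuming 2 MiB leaf pages (1 GiB leaf pages, where available,
shrink this dramatically):

  MiB = 1 << 20
  def pd_table_bytes(phys_bits):
      pd_pages = 1 << (phys_bits - 30)   # each 4 KiB PD page maps 512 * 2 MiB = 1 GiB
      return pd_pages * 4096

  print(pd_table_bytes(36) / MiB)   # 0.25  -- negligible for a 36-bit space
  print(pd_table_bytes(46) / MiB)   # 256.0 -- a big bite out of a 1.5 GB guest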
> > 
> > (Such machines do exist BTW, for example:
> > 
> > http://mid.mail-archive.com/9BD73EA91F8E404F851CF3F519B14AA8036C67B5@DGGEMI521-MBX.china.huawei.com
> > )
> > 
> > In other words, you'd need some kind of knob anyway, because otherwise
> > your aperture could grow too *large*.
> > 
> > 
> > (2) Exposing the PCPU address width to the guest may have nasty
> > consequences at the QEMU/KVM level, regardless of guest firmware. For
> > example, that kind of "guest enlightenment" could interfere with migration.
> > 
> > If you boot a guest let's say with 16GB of RAM, and tell it "hey friend,
> > have 40 bits of phys address width!", then you'll have a difficult time
> > migrating that guest to a host with a CPU that only has 36-bits wide
> > physical addresses -- even if the destination host has plenty of RAM
> > otherwise, such as a full 64GB.
> > 
> > There could be other QEMU/KVM / libvirt issues that I'm unaware of
> > (hence the CC to Dave and Eduardo).
> 
> host physical address width gets messy. There are differences as well
> between upstream qemu behaviour and some downstreams.
> I think the story is that:
> 
>   a) Qemu default: 40 bits on any host
>   b) -cpu blah,host-phys-bits=true   to follow the host.
>   c) RHEL has host-phys-bits=true by default
> 
> As you say, the only real problem with host-phys-bits is migration -
> between say an E3 and an E5 Xeon with different widths. The magic 40 is
> generally wrong as well - I think it came from some ancient AMD CPU, but
> it's also the default on QEMU TCG.
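
For reference, the knobs Dave describes look something like this on the QEMU
command line (the CPU model is only a placeholder):

  -cpu Skylake-Server                        # (a) default: guest sees 40 phys bits
  -cpu Skylake-Server,host-phys-bits=on      # (b) follow the host's physical bits

and there is also an explicit "phys-bits=N" CPU property for hand-picking a
width.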

Yes, and because it affects live migration ability, we have two
constraints:
1) It needs to be exposed in the libvirt domain XML;
2) QEMU and libvirt can't choose a value that works for everybody
   (because neither QEMU nor libvirt knows where the VM might be
   migrated later).

Which is why the BZ below is important:

> 
> I don't think there's a way to set it in libvirt;
> https://bugzilla.redhat.com/show_bug.cgi?id=1578278  is a bz asking for
> that.
> 
> IMHO host-phys-bits is actually pretty safe; and makes most sense in a
> lot of cases.

Yeah, it is mostly safe and makes sense, but messy if you try to
migrate to a host with a different physical address width.

> 
> Dave
> 
> 
> > Thanks,
> > Laszlo
> > 
> > > 
> > >   -dann
> > > 
> > >>>> For example, to set a 64GB aperture, pass:
> > >>>>
> > >>>>   -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
> > >>>>
> > >>>> The libvirt domain XML syntax is a bit tricky (and it might "taint" your
> > >>>> domain, as it goes outside of the QEMU features that libvirt directly
> > >>>> maps to):
> > >>>>
> > >>>>   <domain
> > >>>>    type='kvm'
> > >>>>    xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> > >>>>     <qemu:commandline>
> > >>>>       <qemu:arg value='-fw_cfg'/>
> > >>>>       <qemu:arg value='opt/ovmf/X-PciMmio64Mb,string=65536'/>
> > >>>>     </qemu:commandline>
> > >>>>   </domain>
> > >>>>
> > >>>> Some notes:
> > >>>>
> > >>>> (1) The "xmlns:qemu" namespace definition attribute in the <domain> root
> > >>>> element is important. You have to add it manually when you add
> > >>>> <qemu:commandline>  and <qemu:arg> too. Without the namespace
> > >>>> definition, the latter elements will make no sense, and libvirt will
> > >>>> delete them immediately.
> > >>>>
> > >>>> (2) The above change will grow your guest's physical address space to
> > >>>> more than 64GB. As a consequence, on your *host*, *if* your physical CPU
> > >>>> supports nested paging (called "ept" on Intel and "npt" on AMD), *then*
> > >>>> the CPU will have to support at least 37 physical address bits too, for
> > >>>> the guest to work. Otherwise, the guest will break, hard.
> > >>>>
> > >>>> Here's how to verify (on the host):
> > >>>>
> > >>>> (2a) run "egrep -w 'npt|ept' /proc/cpuinfo" --> if this does not produce
> > >>>> output, then stop reading here; things should work. Your CPU does not
> > >>>> support nested paging, so KVM will use shadow paging, which is slower,
> > >>>> but at least you don't have to care about the CPU's phys address width.
> > >>>>
> > >>>> (2b) otherwise (i.e. when you do have nested paging), run "grep 'bits
> > >>>> physical' /proc/cpuinfo" --> if the physical address width is >=37,
> > >>>> you're good.
> > >>>>
> > >>>> (2c) if you have nested paging but exactly 36 phys address bits, then
> > >>>> you'll have to forcibly disable nested paging (assuming you want to run
> > >>>> a guest with larger than 64GB guest-phys address space, that is). On
> > >>>> Intel, issue:
> > >>>>
> > >>>> rmmod kvm_intel
> > >>>> modprobe kvm_intel ept=N
> > >>>>
> > >>>> On AMD, go with:
> > >>>>
> > >>>> rmmod kvm_amd
> > >>>> modprobe kvm_amd npt=N
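
(If you want either setting to persist across reboots, most distros let you
put the module option in a modprobe.d file, e.g. "options kvm_intel ept=0"
under /etc/modprobe.d/ -- and analogously "options kvm_amd npt=0".)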
> > >>>>
> > >>>> Hope this helps,
> > >>>> Laszlo
> > >>>>
> > >>>
> > >>
> > > 
> > 
> --
> Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK

-- 
Eduardo



