[vfio-users] (good) working GPU for passthrough?

Laszlo Ersek lersek at redhat.com
Tue Jan 22 22:51:28 UTC 2019


On 01/18/19 21:28, Kash Pande wrote:
> On 2019-01-18 12:20 p.m., Bronek Kozicki wrote:
>> This is starting to make sense, going to try passthrough with q35 and
>> UEFI now.
>
>
> Good luck! Looks like you're on the right track.

Sorry, I've had zero time to follow this list in recent weeks (months?).

If it's of any help, let me dump a domain XML on you that works fine on
my end, for assigning my GTX750.

I vaguely recall that even S3 suspend/resume worked with it. (I haven't
booted this guest in a good while now; I set it up because I wanted to
see if Q35 would work.)

I'll make some comments on it, below.

> <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

The qemu namespace prefix is declared for the <qemu:commandline> and
<qemu:arg> elements at the end.

>   <name>ovmf.win10.gpu.q35</name>
>   <uuid>c1eaad17-1b0e-4716-8172-f8e86a5b51f7</uuid>
>   <memory unit='KiB'>4194304</memory>
>   <currentMemory unit='KiB'>4194304</currentMemory>
>   <vcpu placement='static'>3</vcpu>
>   <os>
>     <type arch='x86_64' machine='pc-q35-2.10'>hvm</type>
>     <loader readonly='yes' type='pflash'>/home/vm-images/firmware/OVMF_CODE.4m.fd</loader>
>     <nvram template='/home/vm-images/firmware/OVMF_VARS.4m.fd'>/var/lib/libvirt/qemu/nvram/ovmf.win10.gpu.q35_VARS.fd</nvram>

This is some OVMF binary I built sometime; just use whatever you get
from Gerd's repo, or from your distro.
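
(For reference, with a distro build the loader / nvram lines would look
something like the below; the exact paths are placeholders and vary per
distro and package version:

  <loader readonly='yes' type='pflash'>/usr/share/edk2/ovmf/OVMF_CODE.fd</loader>
  <nvram template='/usr/share/edk2/ovmf/OVMF_VARS.fd'>/var/lib/libvirt/qemu/nvram/ovmf.win10.gpu.q35_VARS.fd</nvram>

The CODE image is shared read-only between guests, while the VARS
template gets copied into a private per-guest file.)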

>     <bootmenu enable='yes' timeout='10000'/>

I like to see the progress bar on the physical monitor. :)

>   </os>
>   <features>
>     <acpi/>
>     <apic/>
>     <pae/>
>     <hyperv>
>       <relaxed state='on'/>
>       <vapic state='on'/>
>       <spinlocks state='on' retries='8191'/>
>     </hyperv>

I probably cargo-culted these here from other Windows guest domain XMLs,
and/or Alex's blog :)

>     <kvm>
>       <hidden state='on'/>
>     </kvm>

Nowadays this is necessary, but no longer sufficient, to convince the
nvidia driver to work. I'm not 100% up to date on the status; what I
usually do is:

- first install the guest using the nvidia drivers from the
  "evga_gtx750.iso" image that I ripped from the physical CD that I got
  with the card,

- once the driver updates itself from the network and breaks with Code
  43, Device Manager offers rolling back the driver to the previous
  version (i.e. the one I installed from the ISO, which works in the
  guest).
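
(One additional knob that's commonly reported to help with newer nvidia
drivers -- not something I've re-verified myself -- is spoofing the
Hyper-V vendor ID next to the <hidden state='on'/> trick, i.e. extending
the <hyperv> element above roughly like this:

  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vendor_id state='on' value='whatever'/>
  </hyperv>

where the value is just an arbitrary short string.)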

>   </features>
>   <cpu mode='host-passthrough' check='none'>
>     <topology sockets='1' cores='3' threads='1'/>
>   </cpu>
>   <clock offset='localtime'>
>     <timer name='rtc' tickpolicy='catchup'/>
>     <timer name='pit' tickpolicy='delay'/>
>     <timer name='hpet' present='no'/>
>     <timer name='hypervclock' present='yes'/>
>   </clock>

More Windows stuff I guess.

>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>restart</on_reboot>
>   <on_crash>destroy</on_crash>
>   <pm>
>     <suspend-to-mem enabled='yes'/>

Yeah I'm pretty sure now that S3 worked.

>     <suspend-to-disk enabled='no'/>

OVMF has no support for S4.

>   </pm>
>   <devices>
>     <emulator>/usr/bin/qemu-system-x86_64</emulator>
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='qcow2' cache='writeback' error_policy='enospace' discard='unmap'/>
>       <source file='/home/vm-images/ovmf.win10.gpu.q35.img'/>
>       <target dev='sda' bus='scsi'/>
>       <boot order='1'/>
>       <address type='drive' controller='0' bus='0' target='0' unit='0'/>
>     </disk>

discard='unmap' is great with a virtio-scsi disk: Windows eventually
discards the blocks of deleted files automatically, and this combo frees
up space on the host disk.
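
(If you're curious whether it's working, one way is to compare the
image's allocation on the host before and after deleting stuff in the
guest and letting Windows run its scheduled "Optimize Drives" pass, e.g.

  $ qemu-img info /home/vm-images/ovmf.win10.gpu.q35.img
  $ du -h /home/vm-images/ovmf.win10.gpu.q35.img

"disk size" in the qemu-img output, and the du figure, should both
shrink.)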

>     <disk type='file' device='cdrom'>
>       <driver name='qemu' type='raw' cache='writeback'/>
>       <source file='/home/vm-images/isos/....iso'/>
>       <target dev='sdb' bus='scsi'/>
>       <readonly/>
>       <shareable/>
>       <boot order='2'/>
>       <address type='drive' controller='0' bus='0' target='0' unit='1'/>
>     </disk>

The Windows installer ISO is exposed as a virtio-scsi CD-ROM; OVMF can
boot that, and it performs better than SATA emulation.

>     <disk type='file' device='cdrom'>
>       <driver name='qemu' type='raw' cache='writeback'/>
>       <source file='/home/vm-images/isos/evga_gtx750.iso'/>
>       <target dev='sdc' bus='scsi'/>
>       <readonly/>
>       <shareable/>
>       <address type='drive' controller='0' bus='0' target='0' unit='2'/>
>     </disk>

So, these three devices are LUNs 0, 1 and 2, of target 0, of the
virtio-scsi controller that is defined below. (bus='0' is ignored.) The
reference comes from "bus='scsi'" in the <target> element, plus
"controller='0'" in the <address> element. The referred-to virtio-scsi
controller will have (type='scsi' index='0') below.
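
(Just to illustrate the addressing scheme: a further disk on the same
virtio-scsi controller would simply take the next unit number -- the
file name below is made up:

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2' discard='unmap'/>
    <source file='/home/vm-images/some-other-disk.img'/>
    <target dev='sde' bus='scsi'/>
    <address type='drive' controller='0' bus='0' target='0' unit='3'/>
  </disk>
)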

>     <disk type='file' device='cdrom'>
>       <driver name='qemu' type='raw' cache='writeback'/>
>       <source file='/usr/share/virtio-win/virtio-win.iso'/>
>       <target dev='sdd' bus='sata'/>
>       <readonly/>
>       <shareable/>
>       <address type='drive' controller='0' bus='0' target='0' unit='0'/>
>     </disk>

The virtio-win driver ISO is exposed as a SATA CD-ROM, because the
Windows installer needs to load the virtio-scsi driver from it.

Note that we again have (controller='0' bus='0' target='0' unit='0'),
but there's no conflict, since bus='sata'. The referenced SATA
controller below has (type='sata' index='0').

>     <controller type='sata' index='0'>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
>     </controller>
>     <controller type='pci' index='0' model='pcie-root'/>
>     <controller type='pci' index='1' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='3' port='0x10'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
>     </controller>
>     <controller type='pci' index='2' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='4' port='0x11'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
>     </controller>
>     <controller type='pci' index='3' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='1' port='0x12'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
>     </controller>
>     <controller type='pci' index='4' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='2' port='0x13'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
>     </controller>
>     <controller type='pci' index='5' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='5' port='0x14'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
>     </controller>
>     <controller type='pci' index='6' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='6' port='0x15'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
>     </controller>
>     <controller type='pci' index='7' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='7' port='0x16'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
>     </controller>
>     <controller type='pci' index='8' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='8' port='0x17'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
>     </controller>

Eight PCI Express Root Ports grouped into one multifunction device, in
slot 2 of the root bridge. <target chassis='X' port='Y'/> doesn't have
much meaning in practice; the values just need to be distinct.

The slot is fully populated (all eight functions are provided).

What's important is the "index" attribute; those numbers will be
referred to as "bus" numbers below.

>     <controller type='scsi' index='0' model='virtio-scsi'>
>       <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
>     </controller>

So bus='0x05' here refers to the Root Port defined with index='5' above.

Each root port exposes a bus, and any endpoint plugged into that bus can
only have slot='0x00'. On the other hand, the endpoint may have multiple
functions. (Not used here.)

>     <controller type='virtio-serial' index='0'>
>       <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
>     </controller>
>     <controller type='usb' index='0' model='qemu-xhci' ports='15'>
>       <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
>     </controller>

USB3 (qemu-xhci) emulation is less CPU-hungry than USB1/USB2, due to the
USB3 spec being more "virtualization friendly" (no polling required,
IIRC). Gerd could explain in detail.

>     <interface type='network'>
>       <mac address='52:54:00:24:7b:48'/>
>       <source network='default'/>
>       <model type='virtio'/>
>       <rom bar='off'/>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
>     </interface>

More of the same, using up further root ports.

<rom bar='off'/> because I wanted to prevent QEMU from loading the
combined BIOS+UEFI iPXE oprom for the virtio-net NIC. This syntax is
obsolete, nowadays you'd write <rom enabled='no'/>. In general, ignore.

>     <serial type='pty'>
>       <target type='isa-serial' port='0'>
>         <model name='isa-serial'/>
>       </target>
>     </serial>
>     <console type='pty'>
>       <target type='serial' port='0'/>
>     </console>
>     <channel type='unix'>
>       <target type='virtio' name='org.qemu.guest_agent.0'/>
>       <address type='virtio-serial' controller='0' bus='0' port='1'/>
>     </channel>
>     <channel type='spicevmc'>
>       <target type='virtio' name='com.redhat.spice.0'/>
>       <address type='virtio-serial' controller='0' bus='0' port='2'/>
>     </channel>

This got auto-added somehow, I think.

>     <input type='tablet' bus='usb'>
>       <address type='usb' bus='0' port='1'/>
>     </input>
>     <input type='keyboard' bus='usb'>
>       <address type='usb' bus='0' port='2'/>
>     </input>

Here, (type='usb' bus='0') refers to the qemu-xhci controller, which was
defined with (type='usb' index='0').

>     <input type='mouse' bus='ps2'/>
>     <input type='keyboard' bus='ps2'/>
>     <graphics type='vnc' port='-1' autoport='yes'>
>       <listen type='address'/>
>     </graphics>
>     <video>
>       <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
>     </video>

I guess I should have used spice with qxl, not VNC. I didn't care much
:)
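
(For completeness, the spice variant would be a small change, something
like:

  <graphics type='spice' autoport='yes'>
    <listen type='address'/>
  </graphics>

combined with the spicevmc channel that's already in there.)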

QXL itself was required because of the native QXL DOD (display only
driver), which supports multiple resolutions, good performance, and even
S3; that's useful until one installs the nvidia driver for the physical
GPU. The basic display driver, in contrast, just inherits the UEFI
framebuffer from OVMF. It works, but it's not fast, and it cannot
support resolution changes or S3.

Note that the emulated video card is on the root bridge (bus='0x00').
This is actually required for the VGA compatibility built into QXL, as
far as I remember.

One emulated video device that has no such baggage is virtio-gpu-pci
(*not* virtio-vga), but for x86 domains, libvirt only offers virtio-vga
as the "virtio" model. virtio-gpu-pci is selected for aarch64 domains
only. Anyway, ignore.

>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <source>
>         <address domain='0x0000' bus='0x28' slot='0x00' function='0x0'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0' multifunction='on'/>
>     </hostdev>
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <source>
>         <address domain='0x0000' bus='0x28' slot='0x00' function='0x1'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x1'/>
>     </hostdev>

So these are the functions of the assigned GTX750:

28:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2)
28:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)

Note that the functions of the assigned physical device are re-grouped
into a multifunction device, mandatorily in slot 0, on the bus exposed
by the Root Port that is defined with index='2' above.

AFAIR, it's not required to re-group the GPU's functions into a
multifunction device in the guest. However, if we didn't do that, we'd
have to use up another root port (because we can't use any nonzero slot
on any root port).

In fact I'm unsure why I bothered assigning 28:00.1 at all; I don't
remember using that audio function.
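
(For illustration, if you kept the two functions separate rather than
re-grouping them, each would have to go into slot 0, function 0 of its
own root port; the audio function would then look roughly like this,
assuming a spare root port with index='9' existed:

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x28' slot='0x00' function='0x1'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
  </hostdev>
)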

>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <source>
>         <address domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
>     </hostdev>

More of the same, for:

00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller

This is the audio device I actually used, I think. Goes into a separate
root port again (index='8').
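
(When picking a host device for assignment, it's worth checking up front
what else shares its IOMMU group, since the whole group has to be handed
over to vfio-pci together; roughly:

  $ lspci -nn -s 00:1b.0
  $ readlink /sys/bus/pci/devices/0000:00:1b.0/iommu_group
  $ ls /sys/kernel/iommu_groups/<group>/devices/

with <group> being the number from the readlink output.)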

>     <hostdev mode='subsystem' type='usb' managed='yes'>
>       <source>
>         <vendor id='0x03f0'/>
>         <product id='0x0324'/>
>       </source>
>       <address type='usb' bus='0' port='3'/>
>     </hostdev>
>     <hostdev mode='subsystem' type='usb' managed='yes'>
>       <source>
>         <vendor id='0x046d'/>
>         <product id='0xc040'/>
>       </source>
>       <address type='usb' bus='0' port='4'/>
>     </hostdev>

Assigned my physical USB mouse and keyboard, on ports 3 and 4 of the
qemu-xhci controller, in addition to the emulated USB tablet and
keyboard that the guest sees on ports 1 and 2 of the qemu-xhci
controller.
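
(The vendor/product ID pairs come straight from lsusb on the host, e.g.

  $ lsusb -d 046d:c040

or just plain lsusb to list everything.)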

>     <memballoon model='virtio'>
>       <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
>     </memballoon>
>     <rng model='virtio'>
>       <rate bytes='1048576'/>
>       <backend model='random'>/dev/urandom</backend>
>       <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
>     </rng>

And at this point, all of our Root Ports have been used up.
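
(If more were needed, further root ports could go into the next free
slot on the root bridge, e.g.:

  <controller type='pci' index='9' model='pcie-root-port'>
    <model name='pcie-root-port'/>
    <target chassis='9' port='0x18'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  </controller>

and then bus='0x09' would become available for another endpoint.)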

>   </devices>
>   <qemu:commandline>
>     <qemu:arg value='-global'/>
>     <qemu:arg value='isa-debugcon.iobase=0x402'/>
>     <qemu:arg value='-debugcon'/>
>     <qemu:arg value='file:/tmp/ovmf.win10.gpu.q35.log'/>

This is so I can capture an OVMF debug log. When you report an OVMF
issue, please always do this, and attach the log to the report.

>     <qemu:arg value='-fw_cfg'/>
>     <qemu:arg value='name=opt/ovmf/PcdResizeXterm,string=y'/>

This is some obscure convenience stuff for when I use the UEFI shell --
when I change the resolution in the shell with the MODE command, escape
sequences are emitted which actually resize my xterm window. Ignore.

>   </qemu:commandline>
> </domain>

Er... this ended up quite incoherent. Sorry :)

Thanks
Laszlo



