[libvirt] How should libvirt apps enable virtio-pci for aarch64?

Laine Stump laine at laine.org
Mon Dec 7 17:00:04 UTC 2015


On 12/07/2015 10:37 AM, Cole Robinson wrote:
> On 12/07/2015 07:27 AM, Daniel P. Berrange wrote:
>> On Sun, Dec 06, 2015 at 09:46:56PM -0500, Cole Robinson wrote:
>>> Hi all,
>>>
>>> I'm trying to figure out how apps should request virtio-pci for libvirt + qemu
>>> + arm/aarch64. Let me provide some background.
>>>
>>> qemu's original virtio support for arm/aarch64 is via virtio-mmio, libvirt XML
>>> <address type='virtio-mmio'/>. Currently this is what libvirt sets as the
>>> default address for all arm/aarch64 virtio devices in the XML. Long term,
>>> though, all arm virt will likely be using virtio-pci: it's faster, enables
>>> hotplug, is more x86-like, etc.
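
For illustration, the aarch64 default described above looks something like this
in the domain XML (the disk source and target names here are just an example):

   <disk type='file' device='disk'>
     <driver name='qemu' type='qcow2'/>
     <source file='/var/lib/libvirt/images/guest.qcow2'/>
     <target dev='vda' bus='virtio'/>
     <address type='virtio-mmio'/>
   </disk>
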
>>>
>>> Support for virtio-pci is newer and not as widespread. qemu has had the
>>> necessary support since 2.4 at least, but the guest side isn't well
>>> distributed yet. For example, Fedora 23 and earlier don't work out of the box
>>> with virtio-pci. Internal RHELSA (RHEL Server for Aarch64) builds have it
>>> recently working AFAIK.
>>>
>>> Libvirt has some support for enabling virtio-pci with aarch64, commits added
>>> by Pavel Fedin in v1.2.19. (See e8d55172544c1fafe31a9e09346bdebca4f0d6f9). The
>>> patches add a PCIe controller automatically to the XML (and qemu commandline)
>>> if qemu-system-aarch64 supports it. However virtio-mmio is still used as the
>>> default virtio address, given the current lack of OS support.
>>>
>>> So we are at the point where libvirt apps want to enable this, but presently
>>> there isn't a good solution; the only option is to fully allocate <address
>>> type='pci' ...> for each virtio device in the XML. This is suboptimal for 2
>>> reasons:
>>>
>>> #1) apps need to duplicate libvirt's non-trivial address type=pci allocation logic
>>>
>>> #2) apps have to add an <address> block for every virtio device, which is less
>>> friendly than the x86 case where this is rarely required. Any XML device
>>> snippets that work for x86 likely won't give the desired result for aarch64,
>>> since they will default to virtio-mmio. Think virsh attach-device/attach-disk
>>> commands.
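
To make #1/#2 concrete: today an app that wants virtio-pci has to write out the
full address itself for every virtio device, something along these lines (the
bus/slot values are only an example, and assume the pci-bridge sits on bus 2):

   <interface type='network'>
     <source network='default'/>
     <model type='virtio'/>
     <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
   </interface>
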
>> Yeah this is very undesirable for a default out of the box config - we should
>> always strive to "do the best thing" when no address is given.
>>
>>> Here are some possible solutions:
>>>
>>> * Drop the current behavior of adding a PCIe controller unconditionally, and
>>> instead require apps to specify it in the XML. Then, if libvirt sees a PCIe
>>> controller in the XML, default the virtio address type to pci. Apps will know
>>> if the OS they are installing supports virtio-pci (eventually via libosinfo),
>>> so this is the way we can implicitly ask libvirt 'allocate us pci addresses'.
>> Yes, clearly we need to record in libosinfo whether an OS can do PCI vs
>> MMIO.
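
With option 1, an app that knows (e.g. eventually from libosinfo) that the guest
OS can handle virtio-pci would only need to include a PCIe controller and leave
the virtio devices unaddressed, roughly like this (whether just pcie-root or the
full three-controller set is needed is part of the question below):

   <controller type='pci' index='0' model='pcie-root'/>
   <disk type='file' device='disk'>
     <driver name='qemu' type='qcow2'/>
     <source file='/var/lib/libvirt/images/guest.qcow2'/>
     <target dev='vda' bus='virtio'/>
   </disk>

and libvirt would then fill in <address type='pci' .../> for each virtio device.
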
>>
>>> Upsides:
>>> - Solves both the stated problems.
>>> - Simplest addition for applications IMO
>>>
>>> Downsides:
>>> - Requires a libvirt behavior change, no longer adding the PCIe controller by
>>> default. But in practice I don't think it will really affect anyone, since
>>> there isn't really any OS support for virtio-pci yet, and no apps support it
>>> either AFAIK.
>>> - The PCIe controller is not strictly about virtio-pci, it's for enabling
>>> plain emulated PCI devices as well. So there is a use case for using the PCIe
>>> controller for a graphics card even while your OS doesn't yet support
>>> virtio-pci. In the big picture though this is a small time window with current
>>> OSes, and users can work around it by manually requesting <address
>>> type='virtio-mmio'/>, so medium/long term this isn't a big deal IMO.
>>> - The PCIe controller XML is:
>>>      <controller type='pci' index='0' model='pcie-root'/>
>>>      <controller type='pci' index='1' model='dmi-to-pci-bridge'/>
>>>      <controller type='pci' index='2' model='pci-bridge'/>
>>> I have no idea if that's always going to be the expected XML, maybe it's not
>>> wise to hardcode that in apps. Laine?

That was only intended for the Q35 machinetype (but somehow all three of 
those controllers got turned on by the patch that added pcie-root to the 
aarch64 virt machinetypes). pcie-root is included in the hardware by qemu 
on Q35, and can't be removed. dmi-to-pci-bridge translates from the PCIe 
ports of pcie-root to standard PCI ports (but non-hotpluggable ones), and 
pci-bridge converts from non-hotpluggable PCI to hotpluggable PCI, which 
is the kind of slot that management applications expect to be available.

Other machinetypes don't need to do this same thing. For that matter, in 
the future this may not even be the most desirable way to go for Q35: in 
the 2 (or is it 3?) years since Q35 support was added, I've learned that 
pretty much every emulated PCI device in qemu can be plugged into a PCIe 
port (on the Q35 machinetype at least) with no complaints from qemu, and 
we now have pcie-root-port and pcie-switch-downstream-port (with a 
matching pcie-switch-upstream-port) that can accept hotplugged devices, 
so a Q35 machine could now be constructed as:

   <controller type='pci' index='0' model='pcie-root'/>
   <controller type='pci' index='1' model='pcie-root-port'/>
   <controller type='pci' index='2' model='pcie-switch-upstream-port'/>
   <controller type='pci' index='3' model='pcie-switch-downstream-port'/>
   <controller type='pci' index='4' model='pcie-switch-downstream-port'/>
   ...

and the address assignment could be modified to allow auto-selection of 
PCIe ports for PCI devices (the downstream ports support hotplugging 
devices, but can't be hotplugged themselves, and they can only be 
plugged into an upstream port, which can only be plugged into a 
root-port (or a downstream-port)).
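
With that auto-selection in place, an emulated PCI device assigned to one of the
downstream ports in the example above would end up with an address roughly like
this (values purely illustrative; a device on a PCIe port sits in slot 0 of that
port's bus):

   <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
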

At any rate, I don't think the current PCIe bus structure should be 
hardcoded anywhere. We can change what libvirt does by default any time; 
existing configs will continue with what we set up in the past (and thus 
won't suffer "your hardware has changed! Reactivate!!" problems), but 
newly created ones will use whatever new model we come up with.


>>>
>>>
>>> * Next idea: Users specify something like <address type='pci'/> and
>>> libvirt fills in the address for us.
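
In other words, under this proposal the app would write only the bare element
under the device:

   <address type='pci'/>

and libvirt would expand it to a fully specified domain/bus/slot/function
address during address assignment.
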
>>>
>>> Upsides:
>>> - We can stick with the current PCIe controller default and avoid some of the
>>> problems mentioned above.
>>> - An auto address feature may be useful in other contexts as well.
>>>
>>> Downsides:
>>> - Seems potentially tricky to implement in libvirt code. There are many places
>>> that check type=pci and key off that, so it seems like it would be easy to miss
>>> updating a check and cause regressions. Maybe we could add a new type like
>>> auto-pci to make it explicit. There's probably some implementation trick to
>>> make this safe, but at first glance it looked a little dicey.
>> I'm not sure it is actually all that hairy - it might be as simple as
>> updating only qemuAssignDevicePCISlots so that instead of:
>>
>>      if (def->controllers[i]->info.type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE)
>>          continue;
>>
>> It handles type=pci with no values set too
>>
> I think my fear was that there are other places in domain_conf that check for
> ADDRESS_TYPE_PCI before we even get to assigning PCI slots. But I'll poke at it.

This is very possible. ADDRESS_TYPE_PCI means that the PCI address is 
"valid". 0000:00:00.0 is a valid PCI address (although it happens to be 
reserved on any x86 architecture). For the bits of code that run early on, 
right after the parse, we may need a separate "valid address" flag beyond 
the type to prevent confusion. I do like the idea of being able to say 
"<address type='pci'/>" to select the bus without specifying an address, 
though.


>
>>> - Doesn't really solve problem #2 mentioned up above... maybe we could change
>>> the address allocation logic to default to virtio-pci if there's already a
>>> virtio-pci device in the XML. But it's more work.
>>> - More work for apps, but nothing horrible.
>>>
>>> * Change the default address type from virtio-mmio to pci, if qemu supports
>>> it. I'm listing this for completeness. In the short term this doesn't make
>>> sense, as there aren't any OS releases that will work with this default. However,
>>> it might be worth considering for the future, maybe keying off a particular
>>> qemu version or machine type. I suspect 2 years from now no one is going to be
>>> using virtio-mmio, so long term it's not an ideal default.
>> Yeah, when QEMU starts doing versioned machine types for AArch64 we could
>> do this, but then this just kind of flips the problem around - apps now
>> need to manually add <address type="virtio-mmio"/> for every device if deploying
>> an OS that can't do PCI.  Admittedly this is slightly easier, since address
>> rules for mmio are simpler than address rules for PCI.
>>
>>> I think the first option is best (keying off the PCIe controller specified by
>>> the user), with a longer term plan to change the default from mmio to pci. But
>>> I'm not really sold on anything either way. So I'm interested if anyone else
>>> has ideas.
>> I guess I'd tend towards option 1 too - only adding PCI controller if we
>> actually want to use PCI with the guest.

I *kind of* agree with this, but not completely.

I think that if we know for sure based on introspection of the virtual 
machine (or based on verified knowledge that we hardcode into libvirt) 
that there is a PCI controller of some type that is implemented in the 
machine and no way to remove it, we should put that information in the 
XML. Any controllers that are optional, and don't exist in the virtual 
machine if they're not added to the commandline, can be auto-added only 
if needed.

So for example, if a Q35 domain was created that had just 2 emulated PCI 
devices, we could auto-add just enough ports to accommodate that. (For 
that matter, adding an emulated PCI device would cause the auto-add of a 
pcie-switch-downstream-port, which might cause the auto-add of a 
pcie-switch-upstream-port, which might cause the auto-add of a 
pcie-root-port; the 2nd emulated PCI device would cause an auto-add of 
another pcie-switch-downstream-port, which would find a spot in the 
existing pcie-switch-upstream-port.) The one problem with doing this 
would be that there would be no free ports for hotplugging. I'm not sure 
what the best way to deal with that is; I suppose any attempt at 
hotplugging a new device would lead to an error, after which you would 
shut down the virtual machine, manually add in a port, then start it up 
again. Or maybe when there are *any* PCI devices, we could always make 
sure there were 'several' extra ports available for hotplugging? Neither 
sounds ideal...
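
To make that concrete, for a Q35 domain with exactly 2 emulated PCI devices the
auto-added controller set might end up looking roughly like this (index numbers
purely illustrative):

   <controller type='pci' index='0' model='pcie-root'/>
   <controller type='pci' index='1' model='pcie-root-port'/>
   <controller type='pci' index='2' model='pcie-switch-upstream-port'/>
   <controller type='pci' index='3' model='pcie-switch-downstream-port'/>
   <controller type='pci' index='4' model='pcie-switch-downstream-port'/>

with the two devices landing on buses 3 and 4 (one per downstream port), and no
free port left over for hotplug.
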




