[libvirt] RFC: managing "pci passthrough" usage of sriov VFs via a new network forward type

Laine Stump laine at laine.org
Wed Aug 24 08:16:33 UTC 2011


On 08/23/2011 06:50 AM, Daniel P. Berrange wrote:
> On Mon, Aug 22, 2011 at 05:17:25AM -0400, Laine Stump wrote:
>> For some reason beyond my comprehension, the designers of SRIOV
>> ethernet cards decided that the virtual functions (VF) of the card
>> (each VF corresponds to an ethernet device, e.g. "eth10") should
>> each be given a new+different+random MAC address each time the
>> hardware is rebooted.
> [...snip...]
>
>> This makes using SRIOV VFs via PCI passthrough very unpalatable. The
>> problem can be solved by setting the MAC address of the ethernet
>> device prior to assigning it to the guest, but of course the
>> <hostdev>  element used to assign PCI devices to guests has no place
>> to specify a MAC address (and I'm not sure it would be appropriate
>> to add something that function-specific to <hostdev>).
> In discussions at the KVM forum, other related problems were
> noted too. Specifically when using an SRIOV VF with VEPA/VNLink
> we need to be able to set the port profile on the VF before
> assigning it to the guest, to lock down what the guest can
> do. We also likely need to specify a VLAN tag on the NIC.
> The VLAN tag is actually something we need to be able to do
> for normal non-PCI passthrough usage of SRIOV networks too.
>
>>                                                          Dave Allan
>> and I have discussed a different possible method of eliminating this
>> problem (using a new forward type for libvirt networks) that I've
>> outlined below. Please let me know what you think - is this
>> reasonable in general? If so, what about the details? If not, any
>> counter-proposals to solve the problem?
> The issue I see is that if an application wants to know what
> PCI devices have been assigned to a guest, they can no longer
> just look at <hostdev> elements.


Actually, I was thinking that the proper <hostdev> *would* be added to 
the live XML as non-persistent. This way all PCI devices currently 
assigned to the guest could still be retrieved by looking at the 
<hostdev> elements, but the specific PCI device used for this particular 
instance wouldn't need to be hardcoded into the config XML. (I think the 
ability to grab a free ethernet device from a pool at runtime, rather 
than having hardcoded devices, is an important feature of this proposed 
method of dealing with PCI passthrough ethernet devices. I suppose a 
management app could be written to handle that allocation and rewrite 
the domain config, but it seems like something that libvirt should be 
able to handle.)
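To make the proposal a bit more concrete, here is a rough sketch of what such a network definition might look like. All of the names here (the forward mode value, the pool syntax) are hypothetical at this stage, not an existing libvirt schema:

```xml
<!-- Hypothetical sketch: a libvirt network whose forward type hands
     out SRIOV VF ethernet devices for PCI passthrough. The mode name
     and pool syntax are illustrative only. -->
<network>
  <name>passthrough-pool</name>
  <forward mode='hostdev'>
    <!-- pool of VF ethernet devices libvirt may allocate from -->
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
  </forward>
</network>
```

At guest startup, libvirt would pick a free device from this pool, set its MAC address, and generate the transient <hostdev> in the live XML.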


>   They also need to look at
> <interface> elements. If we follow this proposed model in other
> areas, we could end up with PCI devices appearing as <disks>,
> <controllers>, and who knows what else. I think this is not
> very desirable for applications, and it is also not good for
> our internal code that manages PCI devices, i.e. the security
> drivers now have to look at many different places to find
> what PCI devices need labelling.


I agree that we don't want to make management applications look for PCI 
devices scattered all over the config. Likewise I think it would be nice 
if applications don't have to go looking all over the place for MAC 
addresses. And now that I've heard port profiles need to be associated 
with these devices too, I'm wondering what will be next... having that 
type of high level information in a <hostdev> doesn't seem very 
appealing to me. I think it would be much cleaner if it could remain in 
<interface> (or in a <portgroup> of a network definition).
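For illustration, the persistent domain config could then carry only the high-level NIC information, with the actual PCI device left unspecified. The network and portgroup names below are made up for the example:

```xml
<!-- Hypothetical sketch: the guest's persistent config keeps the MAC
     address and port profile association in <interface>; the specific
     VF/PCI device is chosen at startup from the network's pool. -->
<interface type='network'>
  <source network='passthrough-pool' portgroup='engineering'/>
  <mac address='52:54:00:11:22:33'/>
</interface>
```

This way the MAC address, port profile, and (if needed) VLAN tag all stay in <interface>/<network> where they already live for non-passthrough configurations.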

I think with non-persistent <hostdev> elements auto-generated based on 
<interface>/<network> definitions, we can get the best of both worlds - 
a complete list of all PCI devices allocated to the guest is still 
available in one place, but we can leverage a lot of code already in the 
network interface management stuff - interface pools, portgroups, etc. 
(unfortunately, we'll never be able to take advantage of bandwidth 
management or nwfilters, but there's really no solution to that short of 
installing an agent in the guest - by the time you get to that point, I 
think it's probably time to acknowledge that PCI passthrough of network 
devices just isn't a great general purpose solution, and use an actual 
QEMU network device instead)


>> One problem this doesn't solve is that when a guest is migrated, the
>> PCI info for the allocated ethernet device on the destination host
>> will almost surely be different. Is there any provision for dealing
>> with this in the device passthrough code? If not, then migration
>> will still not be possible.
> Migration is irrelevant with PCI passthrough, since we reject any
> attempt to migrate a guest with assigned PCI devices. A management
> app must explicitly hot-unplug all PCI devices before doing any
> migration, and plug back in new ones after migration finishes.


Nice. I didn't realize that. The description of how a management app 
handles the situation actually fits quite well with my proposal - the 
non-persistent hostdev would be unplugged, and after migration is 
completed, the normal codepath for initializing network device plumbing 
for the qemu process on the destination host would automatically reserve 
and plug in a new PCI device.


>> Although I realize that many people are predisposed to not like the
>> idea of PCI passthrough of ethernet devices (including me), it seems
>> that it's going to be used, so we may as well provide the management
>> tools to do it in a sane manner.
> Reluctantly I think we need to provide the necessary information
> underneath the <hostdev> element. Fortunately we already have an
> XML schema for port profiles and such things, that we share between
> the <interface> device element and the <network> schema.


I had actually been considering from the beginning that a <hostdev> 
element would end up in the live XML (after being created based on the 
<interface> (and the <network> it references) while the guest is 
starting up). This keeps network device config out of hostdev space, and 
hostdev config out of network device space, and fits in with the idea of 
eliminating host-specific config info from the domain config (since the 
actual PCI device to be used isn't in the domain XML, but is instead 
determined at domain startup).
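The auto-generated transient <hostdev> in the live XML would then look like an ordinary PCI passthrough device; the address below is just an example of whatever VF happened to be free at startup:

```xml
<!-- Hypothetical sketch of the non-persistent <hostdev> libvirt would
     generate in the live XML after allocating a VF from the pool.
     The PCI address is whichever device was free at domain startup. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x07' slot='0x10' function='0x2'/>
  </source>
</hostdev>
```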

If it's acceptable to add non-persistent <hostdev>s to the live XML, the 
main open item I see is that the management apps trying to migrate a 
guest containing them will need to understand that these transient 
<hostdev> devices will have replacements automatically plugged in on the 
destination by the networking code. For that matter, the management app 
shouldn't be unplugging them either (and neither should "virsh 
detach-device", for example), because they will require extra code not 
normally run during a PCI hot-unplug (to disassociate the port profile 
and return the ethernet device to the network's pool). (So maybe the 
hostdev does need some reference back to the higher-level device 
definition (in this case <interface>) after all. Bah.)
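If such a back-reference were needed, one purely speculative shape for it would be an extra child element on the generated <hostdev>, e.g.:

```xml
<!-- Purely speculative: tie the generated hostdev back to the
     <interface> it was created for, so that hot-unplug can also
     disassociate the port profile and return the VF to the pool.
     The <origin> element is invented for this example. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <origin device='interface'/>
  <source>
    <address domain='0x0000' bus='0x07' slot='0x10' function='0x2'/>
  </source>
</hostdev>
```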

(Another potential problem area I see is with the relative sequencing of 
unplugging/disassociating/plugging/associating these devices during a 
migration - for standard network devices I think the unplugging on the 
source host doesn't happen until after the migration is complete, but 
for PCI passthrough devices it must happen before the migration starts. 
But I may again be trying to think up a solution to a problem that is 
irrelevant).



