[libvirt] PCI passthrough/SR-IOV on Cavium cn889x

Ciprian Barbu ciprian.barbu at enea.com
Thu Mar 22 14:40:03 UTC 2018


Hello,

Thank you for getting back to me so soon. I switched to Thunderbird for 
better text clarity. See some comments inline:

On 21.03.2018 19:54, Laine Stump wrote:
> On 03/21/2018 11:46 AM, Ciprian Barbu wrote:
>> Hello,
>>
>> In the context of running Openstack on a cluster of Cavium ThunderX cn8890 aarch64 servers, we are trying to attach virtual functions to a VM.
>>
>> First some introduction. This Cavium SoC has a different approach to Virtual Functions than on x86 NICs, in which VFs are always enabled and there are two types of VFs and *one single* PF, as follows:
>> - primary VFs - these are in fact assigned by the system to the physical ports of the server, e.g em2p1s0f1, em2p1s0f3 etc below.
>> - secondary VFs - the main purpose of these is to provide additional HW queues under SW control (usually DPDK applications) by automatically binding them to the needed physical port.
>> - one single "physical" function, device 0002:01:00.0 below, which to the best of my knowledge acts merely as a stub and cannot be assigned an interface name.
>>
>> Below is the output of "dpdk-devbind.py -s" which provides some useful information.
>>
>> Network devices using DPDK-compatible driver
>> ============================================
>> 0002:01:00.2 'Device a034' drv=vfio-pci unused=nicvf
>>
>> Network devices using kernel driver
>> ===================================
>> 0000:01:10.0 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
>> 0000:01:10.1 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
>> 0002:01:00.0 'THUNDERX Network Interface Controller' if= drv=thunder-nic unused=nicpf,vfio-pci
>> 0002:01:00.1 'Device a034' if=em2p1s0f1 drv=thunder-nicvf unused=nicvf,vfio-pci
>> 0002:01:00.3 'Device a034' if=em2p1s0f3 drv=thunder-nicvf unused=nicvf,vfio-pci
>> 0002:01:00.4 'Device a034' if=em2p1s0f4 drv=thunder-nicvf unused=nicvf,vfio-pci
>> 0002:01:00.5 'Device a034' if=em2p1s0f5 drv=thunder-nicvf unused=nicvf,vfio-pci
>> 0002:01:00.6 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
>> 0002:01:00.7 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
>> 0002:01:01.0 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
>>
>> Now for the problem. I don't have a domain definition because libvirt fails to start a domain, but I might be able to find what nova generates. But what it tries to do is passthrough em2p1s0f3, address 0002:01:00.3:
>> <interface type='hostdev' managed='yes'>
>>    <source>
>>      <address type='pci' domain='0x0002' bus='0x1' slot='0x0' function='0x3'/>
>>    </source>
>> </interface>
> 
> I see that while I was typing my own "really long" message, that Alex
> pointed out in a response that you could use <hostdev> rather than
> <interface type='hostdev'> if you don't need to configure the MAC
> address or vlan tag of the VF from within libvirt. If that's the case,
> you can ignore the rest of my message, but otherwise read on :-)

Due to some Openstack technicalities, it's not possible or reliable to 
do so on this SoC. There are 2 ways to achieve this, if you are 
interested in reading more:
1. "blind" PCI passthrough [1], where it's possible to request any 
number of PCI devices of a certain vendor_id:product_id. You cannot 
specify which PCI bus address, so it's not flexible.
2. using direct-physical bound ports, with no good documentation except 
for [2]. This doesn't work for Cavium ThunderX because the interfaces are 
*always* Virtual Functions.

I will test your suggestion directly through libvirt, though. I usually 
don't start VMs manually, since Openstack does that.
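
For reference, a minimal sketch of what the plain <hostdev> variant could 
look like for the VF at 0002:01:00.3, generated with Python. The helper 
name hostdev_xml is hypothetical, and the element layout follows the 
generic libvirt PCI hostdev format as I understand it; it has not been 
tested on ThunderX.

```python
import xml.etree.ElementTree as ET


def hostdev_xml(pci_addr: str, managed: bool = True) -> str:
    """Build a libvirt <hostdev> element for a PCI address like '0002:01:00.3'.

    Hypothetical helper for illustration; it only formats XML and does not
    talk to libvirt.
    """
    domain, bus, rest = pci_addr.split(":")
    slot, function = rest.split(".")
    hostdev = ET.Element("hostdev", {
        "mode": "subsystem",
        "type": "pci",
        "managed": "yes" if managed else "no",
    })
    source = ET.SubElement(hostdev, "source")
    # libvirt expects each address component as a hex string.
    ET.SubElement(source, "address", {
        "domain": f"0x{int(domain, 16):04x}",
        "bus": f"0x{int(bus, 16):02x}",
        "slot": f"0x{int(slot, 16):02x}",
        "function": f"0x{int(function, 16):x}",
    })
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    print(hostdev_xml("0002:01:00.3"))
```

The printed XML could be saved to a file and tried with "virsh 
attach-device", or pasted into the <devices> section of a test domain. 
With this form libvirt does not touch the MAC address or vlan tag of the VF.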

> 
>>
>> You can find attached a trimmed libvirtd.log where the main error is:
>> 43236: error : virPCIGetVirtualFunctionInfo:2927 : internal error: The PF device for VF /sys/bus/pci/devices/0002:01:00.3 has no network device name
>>
>> I have actually spent a few days trying to do some hacks and learn some more. The main idea is that virPCIGetVirtualFunctionInfo fails to find the physical name for the virtual device at address 0002:01:00.3, which as I explained in the introduction is something that this Cavium SoC does not do.
>>
>> Looking further down the stream, almost all of the helper functions need a linkdev for the physical function, which means that making libvirt work on this system means some heavy refactoring, a solution being to use the sysfs path rather than the interface name.
> 
> The PF netdev name is needed because the netlink messages to get/set the
> VF MAC address and vlan tag are sent to the PF netdev. A message to set
> the MAC and vlan tag for VF 2 of PF "enpblah" would be something like this:
> 
>       RTM_SETLINK/NLM_F_REQUEST-------+
>       | ifindex=-1                    |
>       | family=AF_UNSPEC              |
>       | IFLA_IFNAME------------------+|
>       | | enpblah                    ||
>       | +----------------------------+|
>       | IFLA_VFINFO_LIST-------------+|
>       | | IFLA_VFINFO---------------+||
>       | | | IFLA_VF_MAC------------+|||
>       | | | | vf=2                 ||||
>       | | | | mac=de:ad:be:ef:c0:55||||
>       | | | +----------------------+|||
>       | | | IFLA_VF_VLAN-----------+|||
>       | | | | vf=2                 ||||
>       | | | | vlanid=42            ||||
>       | | | +----------------------+|||
>       | | +------------------------+|||
>       | +---------------------------+||
>       +-------------------------------+
> 
> I *think* (although I can't say for certain since the original code was
> written by someone else, and I've never tried it the other way) that we
> could achieve the same result by filling in ifindex with the index of
> "enpblah" (instead of -1), then leaving out the IFLA_IFNAME attribute,
> but I haven't found any way of specifying the target of a netlink
> message other than with its ifindex or its ifname.
> 
> When you say "use the sysfs path", what exactly do you mean? Is there a
> way to save/set the VF MAC addresses and vlan tags via sysfs? Or
> (better) a way to address the netlink message to the PF if it has no
> netdev name or ifindex? Maybe the drivers are setup so that an
> RTM_SETLINK request sent to a "primary VF" would be able to get/set
> VF_INFO for "Secondary VFs" associated with the same PF? I'm just
> pulling ideas out of thin air here...
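
To make the message layout above concrete, here is a minimal sketch that 
packs the same RTM_SETLINK request into bytes with Python's struct module. 
The function names are hypothetical; the numeric constants and struct 
layouts are taken from the Linux uapi headers (linux/rtnetlink.h, 
linux/if_link.h) as I understand them. It only builds the buffer and does 
not send anything, which would need an rtnetlink socket and root.

```python
import struct

# Values from the Linux uapi headers.
RTM_SETLINK = 19
NLM_F_REQUEST = 0x1
AF_UNSPEC = 0
IFLA_IFNAME = 3
IFLA_VFINFO_LIST = 22
IFLA_VF_INFO = 1
IFLA_VF_MAC = 1
IFLA_VF_VLAN = 2


def nlattr(attr_type: int, payload: bytes) -> bytes:
    """Encode one netlink attribute: 4-byte header, payload, pad to 4 bytes."""
    length = 4 + len(payload)
    pad = (-length) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def build_set_vf_msg(pf_ifname: str, vf: int, mac: str, vlan_id: int,
                     seq: int = 1) -> bytes:
    """Build an RTM_SETLINK request addressed to the PF by interface name."""
    mac_bytes = bytes(int(part, 16) for part in mac.split(":"))
    vf_mac = struct.pack("=I32s", vf, mac_bytes)      # struct ifla_vf_mac
    vf_vlan = struct.pack("=III", vf, vlan_id, 0)     # struct ifla_vf_vlan, qos=0
    vf_info = nlattr(IFLA_VF_INFO,
                     nlattr(IFLA_VF_MAC, vf_mac) + nlattr(IFLA_VF_VLAN, vf_vlan))
    attrs = (nlattr(IFLA_IFNAME, pf_ifname.encode() + b"\0")
             + nlattr(IFLA_VFINFO_LIST, vf_info))
    # struct ifinfomsg: family, pad, type, index (-1 = look up by name),
    # flags, change mask.
    ifinfo = struct.pack("=BxHiII", AF_UNSPEC, 0, -1, 0, 0)
    body = ifinfo + attrs
    # struct nlmsghdr: total length, type, flags, sequence number, port id.
    header = struct.pack("=IHHII", 16 + len(body), RTM_SETLINK,
                         NLM_F_REQUEST, seq, 0)
    return header + body


if __name__ == "__main__":
    msg = build_set_vf_msg("enpblah", 2, "de:ad:be:ef:c0:55", 42)
    print(len(msg), msg.hex())
```

The only way the PF is identified here is the IFLA_IFNAME attribute (or 
the ifindex field in ifinfomsg), which is exactly what is missing on this 
platform.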

What I meant was that for functions like virNetDevGetVirtualFunctionIndex,
or just virNetDevSaveNetConfig, which require the physical linkdev name, 
it should be possible to pass the sysfs path instead. But looking again 
at virHostdevPreparePCIDevices, it looks like there are many places where 
netlink is used. So forget about this idea, it doesn't look that feasible.
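
To show where the lookup comes up empty, here is a rough sketch of the 
sysfs side of it. It assumes the usual Linux SR-IOV layout, where a VF 
directory has a "physfn" symlink to its PF and a PCI device bound to a 
netdev driver lists its interface names under "net/". The function name 
pf_netdev_names is made up for illustration and this is not libvirt's 
actual code.

```python
import os


def pf_netdev_names(vf_addr: str,
                    sysfs_root: str = "/sys/bus/pci/devices") -> list:
    """Return the netdev names registered under the PF of the given VF.

    An empty list means the PF has no network device name, which is the
    situation the libvirt error message describes.
    """
    physfn = os.path.join(sysfs_root, vf_addr, "physfn")
    if not os.path.exists(physfn):
        raise LookupError(f"{vf_addr} has no physfn link, so it is not a VF")
    net_dir = os.path.join(physfn, "net")
    if not os.path.isdir(net_dir):
        return []
    return sorted(os.listdir(net_dir))


if __name__ == "__main__":
    try:
        print(pf_netdev_names("0002:01:00.3"))
    except LookupError as err:
        print(err)
```

The sysfs path of the PF can be found this way, but it does not help 
with the netlink requests, which still need an ifname or ifindex.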

> 
>> This will not work 100% from what I've seen, at least virNetDevGetVfConfig uses netlink to save the admin MAC (part of virNetDevSaveNetConfig), and netlink needs the ifname.
>>
>> So I'm quite stuck on finding a workaround/fix for this platform which would potentially be something upstreamable, so that we, ENEA, don't burden with maintaining an ugly hack. Right now we are using libvirt 3.5.0 but we can upgrade to something newer if need.
>>
>> The question(s) thus, are
>> 1. is this problem known in the libvirt community?
> 
> This is the first time I've heard of an SRIOV network device where the
> PF wasn't bound to a netdev driver and so had no netdev name or ifindex.
> 
> I guess this is describing the card you're talking about?
> 
>    https://dpdk.org/doc/guides/nics/thunderx.html

Yes, this is kind of the only public documentation about ThunderX NICs. 
But do note that the interfaces are integrated on the motherboard. This 
networking SoC has many HW accelerators and assignable HW resources: HW 
queues, VFs, buffer management etc. All these blocks are connected 
to the SoC via PCIe, not through slots but integrated on 
the motherboard. See this for example [3]. There is more documentation 
available on request through support accounts, I think.

> 
> I have to say that it does *not* give me the warm fuzzies that it
> apparently requires setting
> /sys/module/vfio/parameters/enable_unsafe_noiommu_mode=1 in order to
> work (or did I misunderstand that part).
> 

It's needed inside the VM at least, to be able to bind vfio-pci to the 
device, which is needed if you want to run a DPDK application in the 
guest on the passed-through interface. The same might be needed on the 
host, but I'm not sure. And yes, it looks a bit scary. There 
is probably a good explanation for needing this.

> 
>> 2. Is there any plan to make it work?
> 
> If the hardware exists, and if users need to be able to set each VF's
> MAC address and vlan tag via libvirt config, then we (the royal Open
> Source "we" :-) need to make it work somehow.

I was hoping for more awareness about this problem, since ThunderX has 
been available for some time. Our use case with OPNFV/Openstack is just 
one of many possible, where we don't control what libvirt does, at least 
not directly. Probably others will pass the device as a hostdev like you 
and Alex suggested.

Since you mentioned this option, we might be able to hack Openstack Nova 
to treat these particular devices as PFs, although they look like VFs in 
the system, but we might be opening another can of worms this way.

> 
>> 3. Can you give some pointers on an approach to adapt libvirt to this system?
>> 4. Maybe it's worth changing the kernel to assign a sort of dummy interface to the physical function?
> 
> If there is no other way to address a netlink message to the PF telling
> it to set the MAC address and vlan tag of a VF, then that may be needed.
> If it can be saved/set in some other *standard* way, then perhaps
> libvirt can grow support for it.

I guess this will come naturally if some critical mass of users is achieved.

Hacking the kernel to show a dummy interface might not work: there is 
one single PF for all VFs, so only one MAC address.

> 
>> Thanks and sorry for the long email,
> 
> Long emails with actual information are always preferable to an endless
> chain of short mails that reveal the situation in tiny bits and pieces :-)
> 

Great, I hope it will also be productive. I hope to find some nice 
workaround, but I still found it useful to point out this problem and 
see what the general consensus is on what to do.


[1] https://trickycloud.wordpress.com/2016/03/28/openstack-for-nfv-applications-sr-iov-and-pci-passthrough/
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/networking_guide/sr-iov-support-for-virtual-networking
[3] https://www.avantek.co.uk/store/avantek-96-core-cavium-thunderx-arm-server-r270-t61.html

BR,
/Ciprian



