[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [libvirt] SRIOV configuration



Edward and I have had a multi-day private conversation in IRC on the topic of this mail. I was planning to update this thread with an email, but forgot until now :-/


On 9/24/20 10:54 AM, Daniel P. Berrangé wrote:
On Mon, Sep 21, 2020 at 06:04:36PM +0300, Edward Haas wrote:
The PCI addresses appearing in the domxml are not the same as the ones
mapped/detected in the VM itself. I compared the domxml on the host
and the lspci output in the VM while the VM runs.
Can you clarify what you are comparing here ?

The PCI slot / function in the libvirt XML should match, but the "bus"
number in libvirt XML is just a index referencing the <controller>
element in the libvirt XML.  So the "bus" number won't directly match
what's reported in the guest OS. If you want to correlate, you need
to look at the <address> on the <controller> to translate the libvirt
"bus" number.


Right. The bus number that is visible in the guest is 100% controlled by the guest firmware (and possibly the guest OS?), and there is no way for qemu to explicitly set it, and thus no way for libvirt to guarantee that the bus number in the libvirt XML will be what is seen in the guest OS. The bus number in the XML only has meaning within the XML - you can find which controller a device is connected to by looking for the PCI controller whose "index" matches the device's "bus".
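As a sketch (addresses here are made up for illustration), a device whose <address> says bus='0x06' is plugged into the controller with index='6', and the guest-visible bus number of that controller is whatever the firmware assigns to the root port described by the controller's own <address>:

```
    <controller type='pci' index='6' model='pcie-root-port'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    ...
    <address type='pci' domain='0x0000' bus='0x06' slot='0x01' function='0x0'/>
```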



This occurs only when SRIOV is defined, and it also messes up the other
"regular" vnics.
Somehow, everything comes up and runs (with the SRIOV interface as
well) on the first boot (even though the PCI addresses are not in
sync), but additional boots cause the VM to mess up the interfaces
(not all are detected).


Actually we looked at this offline, and the "messing up" that's occurring is not due to any change in PCI address from one boot to the next. The entire problem is caused by the guest OS using traditional "eth0" and "eth1" netdev names, and making the incorrect assumption that those names are stable from one boot to the next. In fact, it is a long-known problem that, due to a race between kernel code initializing devices and user processes giving them names, the ordering of ethN device names can change from one boot to the next *even with completely identical hardware and no configuration changes*. Here is a good description of that problem, and of systemd's solution to it ("predictable network device names"):


https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/


Edward's inquiry was initiated by this bugzilla:


  https://bugzilla.redhat.com/show_bug.cgi?id=1874096


You can see in the "first boot" and "second boot" ifconfig output that one ethernet device has the altname enp2s1 during both runs, and another device has the altname enp3s0 during both runs; those names are given by systemd's "predictable network device name" algorithm (which bases the netdev name on the PCI address of the device). But the race between kernel and userspace causes the "ethN" names to be assigned differently during one boot and the next.
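As an illustration only (a simplified sketch, not the real udev naming policy, which also handles onboard devices, ACPI slot indexes, and so on), the "enpXsY" form of those names can be derived from a device's PCI address:

```shell
#!/bin/bash
# Sketch: derive the "enpXsY[fZ]" name systemd would typically give a
# NIC at a given PCI address. The hex bus/slot become decimal, and a
# nonzero function number gets an "fN" suffix.
pci_to_ifname() {
    local addr=$1                                   # e.g. 0000:3b:0a.4
    local bus=$((16#$(cut -d: -f2 <<<"$addr")))     # hex bus -> decimal
    local slot_func
    slot_func=$(cut -d: -f3 <<<"$addr")             # "0a.4"
    local slot=$((16#${slot_func%.*}))              # hex slot -> decimal
    local func=${slot_func#*.}
    if [ "$func" = "0" ]; then
        echo "enp${bus}s${slot}"
    else
        echo "enp${bus}s${slot}f${func}"
    fi
}

pci_to_ifname 0000:02:01.0   # enp2s1
pci_to_ifname 0000:3b:0a.4   # enp59s10f4 (the VF address from the hostdev XML)
```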


In order to have predictable netdev names, the OS image needs to stop setting net.ifnames=0 on the kernel command line. If they like, they can give their own more descriptive names to the devices (methods are described in the above systemd document), but they need to stop relying on ethN device names.
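For example, one of the methods described there is a systemd .link file; this sketch (the file name, PCI path, and chosen name are all hypothetical) pins a stable, descriptive name to the device at a given PCI path:

```
# /etc/systemd/network/70-sriov-vf.link  (hypothetical example)
[Match]
Path=pci-0000:3b:0a.4

[Link]
Name=sriov-vf0
```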


(Note that this experience did uncover another bug in libvirt, which *might* contribute to the racy code flip-flopping from boot to boot, but still isn't the root cause of the problem. In this case libvirtd is running privileged, but inside a container, and the container doesn't have full access to the devices' PCI config data in sysfs (if you run "lspci -v" inside the container, you'll notice "Capabilities: <access denied>"). One result of this is that libvirt mistakenly determines the VF is a conventional PCI device (not PCIe), so it auto-adds a pcie-to-pci-bridge and plugs the VF into that controller. I'm guessing that makes device initialization take slightly longer or something, changing the result of the race. I'm looking into changing the test for PCIe vs. conventional PCI, but again, that isn't the real problem here.)
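For a rough idea of the kind of check involved (this is an assumption for illustration, not libvirt's actual code), sysfs exposes 4096 bytes of config space for a PCIe device but only 256 for a conventional PCI one, so a size check on the config file distinguishes the two when the capability list itself is readable:

```shell
#!/bin/bash
# Sketch (assumed heuristic, not libvirt's real test): PCIe extended
# config space is 4096 bytes; conventional PCI config space is 256.
is_pcie_config() {
    # $1 is a config-space file, e.g. /sys/bus/pci/devices/0000:3b:0a.4/config
    [ "$(stat -c %s "$1")" -ge 4096 ]
}
```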


This is what the domxml hostdev section looks like:
```
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0000' bus='0x3b' slot='0x0a' function='0x4'/>
       </source>
       <alias name='hostdev0'/>
       <address type='pci' domain='0x0000' bus='0x06' slot='0x01' function='0x0'/>
     </hostdev>
```

Is there something we are missing or we misconfigured?
Tested with 6.0.0-16.fc31

My second question is: can libvirt avoid accessing the PF (as we do
not need the mac and other options)?
I'm not sure, probably a question for Laine.


The entire point of <interface type='hostdev'> is to be able to set the MAC address (and optionally the vlan tag) of a VF when assigning it to a guest, and the only way to set those is via the PF. If you use plain <hostdev>, then libvirt has no idea that the device is a VF, so it doesn't look for or try to access its PF.
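For reference, an <interface type='hostdev'> definition looks something like this (the MAC address, vlan tag, and VF address here are illustrative):

```
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:00:6d:90:02'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x3b' slot='0x0a' function='0x4'/>
      </source>
      <vlan>
        <tag id='42'/>
      </vlan>
    </interface>
```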


So, you're doing the right thing - since your container has no access to the PF, you need to set the MAC address / vlan tag outside the container (via the PF), and then use <hostdev> (which doesn't do anything related to PF devices).
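That host-side setup would be something along these lines, run outside the container via iproute2 (the PF name, VF index, MAC, and vlan tag are all hypothetical):

```
ip link set enp59s0f0 vf 4 mac 52:54:00:6d:90:02 vlan 42
```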



