[PATCH] util: Add phys_port_name support on virPCIGetNetName

Adrian Chiris adrianc at nvidia.com
Thu Dec 17 10:07:03 UTC 2020


>-----Original Message-----
>From: Laine Stump <laine at redhat.com>
>Sent: Thursday, December 17, 2020 5:22 AM
>To: libvir-list at redhat.com
>Cc: Dmytro Linkin <dlinkin at nvidia.com>; Moshe Levi <moshele at nvidia.com>;
>Adrian Chiris <adrianc at nvidia.com>
>Subject: Re: [PATCH] util: Add phys_port_name support on
>virPCIGetNetName
>
>External email: Use caution opening links or attachments
>
>
>On 12/10/20 11:51 AM, Adrian Chiris wrote:
>
>> Hi,
>> Would be great to get a pair of eyes on this Patch, Thanks!
>
>
>I've looked at it several times and every time would just end up shaking my
>head wondering why there isn't one definitive symlink in the VF's sysfs for the
>netdev of the physical port. (Part of my lack of response from the last time is
>that I didn't know how to respond since I didn't understand why such
>roundabout logic should be needed to answer such a basic question, decided I
>should look at it again before responding, and then forgot about it :-()
>
>
>Anyway, this time I'm determined to not get up until I've got it figured out (or
>at least understand exactly what I don't have figured out)...
>
>
>
>>> -----Original Message-----
>>> From: Dmytro Linkin <dlinkin at nvidia.com>
>>> Sent: Tuesday, October 27, 2020 10:58 AM
>>> To: libvir-list at redhat.com
>>> Cc: Laine Stump <laine at redhat.com>; Adrian Chiris
>>> <adrianc at nvidia.com>; Moshe Levi <moshele at nvidia.com>
>>> Subject: Re: [PATCH] util: Add phys_port_name support on
>>> virPCIGetNetName
>>>
>>> On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote:
>>>> On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote:
>>>>> On 8/28/20 6:53 AM, Dmytro Linkin wrote:
>>>>>> Current virPCIGetNetName() logic is to get net device name by
>>>>>> checking it's phys_port_id, if caller provide it, or by it's index
>>>>>> (eg, by it's position at sysfs net directory). This approach
>>>>>> worked fine up until linux kernel version 5.8, where NVIDIA
>>>>>> Mellanox driver implemented linking of VFs' representors to PCI
>>>>>> device in switchdev mode. This mean that device's sysfs net
>>>>>> directory will hold
>>> multiple net devices. Ex.:
>>>>>> $ ls '/sys/bus/pci/devices/0000:82:00.0/net'
>>>>>> ens1f0  eth0  eth1
>>>>>>
>>>>>> Most switch devices support phys_port_name instead of
>>>>>> phys_port_id, so
>>>>>> virPCIGetNetName() will try to get PF name by it's index - 0. The
>>>>>> problem here is that the PF nedev entry may not be the first.
>>>>>>
>>>>>> To fix that, for switch devices, we introduce a new logic to
>>>>>> select the PF uplink netdev according to the content of
>phys_port_name.
>>>>>> Extend
>>>>>> virPCIGetNetName() with physPortNameRegex variable to get proper
>>>>>> device by it's phys_port_name scheme, for ex., "p[0-9]+$" to get
>>>>>> PF, "pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact net device.
>>>>>> So now
>>>>>> virPCIGetNetName() logic work in following sequence:
>>>>>>   - filter by phys_port_id, if it's provided,
>>>>>>   or
>>>>>>   - filter by phys_port_name, if it's regex provided,
>>>>>>   or
>>>>>>   - get net device by it's index (position) in sysfs net directory.
>>>>>> Also, make getting content of iface sysfs files more generic.
>>>>>>
>>>>>> Signed-off-by: Dmytro Linkin <dlinkin at nvidia.com>
>>>>>> Reviewed-by: Adrian Chiris <adrianc at nvidia.com>
>>>>>
>>>>> [...]
>>>>>
>>>>>
>>>>>> +/* Represents format of PF's phys_port_name in switchdev mode:
>>>>>> + * 'p%u' or 'p%us%u'. New line checked since value is readed from
>>>>>> +sysfs
>>> file.
>>>>>> + */
>>>>>> +# define VIR_PF_PHYS_PORT_NAME_REGEX ((char
>>>>>> +*)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)")
>>>>>> +
>>>>>
>>>>> I've come back to look at this patch several times since it was
>>>>> posted (sorry for the extreme delay in responding), but just can't
>>>>> figure out what it's doing with this regex and why the regex is
>>>>> necessary. Not having access to the hardware that it works with is
>>>>> a bit of a problem, but perhaps I could get a better idea if you
>>>>> gave a full example of sysfs contents? My concern with using a
>>>>> regex is that it might work just fine when using one method for net
>>>>> device naming, but break if that was changed. Also, it seems
>>>>> counterintuitive that it would be necessary to look for a device
>>>>> with a name matching a specific pattern; why isn't there simply a
>>>>> single symbolic link somewhere in the sysfs tree for the net device
>>>>> that just directly points at its physical port? That would be so
>>>>> much simpler and more reliable (or at least it would give the
>>>>> *perception* of being more reliable).
>>>>>
>>>> I'll emphasize that regex is being used for phys_port_name, NOT net
>>>> device name (they are not the same). Anyway, I'll give an example.
>>>>
>>>> Lets look how virPCIGetNetName() currently work.
>>>> At first it try to read phys_port_id of every net device in net
>>>> folder, like:
>>>> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/'
>>>> $ cat net/ens1f0/phys_port_id
>>>>
>>>> Imagine we have single pci dual port nic (I hope named it right), eg.
>>>> net folder holds net devices for both ports, so read operation will
>>>> be successfull and function will return name of first or second port.
>>>> For single port or dual pci nics (or for drivers which didn't
>>>> implemented phys_port_id) read fails and function fallback to
>>>> picking up device by it's index, which not really fine (I'll
>>>> describe it
>>>> later) but work 'cause net folder usualy contains only one net device.
>>>>
>>>> But kernel patch brought third case which not work.
>>>>
>>>> Here are ifaces of ConnectX-5 NIC:
>>>> $ ls -l /sys/class/net | cut -d ' ' -f 9-
>>>> ens1f0 ->
>>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0
>>>> ens1f1 ->
>>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1
>>>> ens1f2 ->
>>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2
>>>> ens1f3 ->
>>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3
>>>> ens1f0_0 ->
>>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0
>>>> ens1f0_1 ->
>>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1
>>>>
>>>> ens1f0 & ens1f1 - PFs;
>>>> ens1f2 & ens1f3 - VFs created on 1st PF and..
>>>> ens1f0_0 & ens1f0_1 - corresponding representors.
>>>>
>>>> Here is content of PFs' pci net folders (for comparison):
>>>> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net'
>>>> ens1f0  ens1f0_0  ens1f0_1
>>>> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net'
>>>> ens1f1
>>>>
>>>> When virPCIGetNetName() called for 2nd PF, for ex., which is in
>>>> legacy mode, phys_port_id read fails (Operation not supported),
>>>> function falling back to index and return name of first and only net device.
>>>> Fine here.
>>>> When function called for 1st PF in switchdev mode, sequence is the
>>>> same and first net device name is returned. The problem here is that
>>>> PF could be not the first device, because order is not persistent
>>>> and defined by system.
>>>>
>>>> So approach is to find PF by reading phys_port_name. Ex.:
>>>> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/'
>>>> $ cat ens1f0/phys_port_name
>>>> p0
>>>> $ cat ens1f0_0/phys_port_name
>>>> pf0vf0
>>>> $ cat ens1f0_1/phys_port_name
>>>> pf0vf1
>
>>>>
>>>> And here regex is used. We really don't care about exact value of
>>>> phys_port_name, 'cause there only one PF net device. And regex also
>>>> cover smart nics, which use other phys_port_name scheme.
>
>
>Okay, in the last 4 hours I had typed *a lot* of stuff here, and thought I was
>starting to understand how everything worked. But in the meantime I was
>also exchanging email with Marcelo Leitner about the issue, and he happened
>to mention that ConnectX-5 cards no longer have multiple PF netdevs on a
>single PCI device.

Indeed,  single PF netdev per PCI physical function is there since ConnectX-4
However we still need to support ConnectX-3 NICs which follow the multiple
PF netdev model (the "old model").

>
>
>I *had been* operating under the impression that multiple PFs on a single PCI
>device was still the case, and that the entire reason for needing to look for the
>proper PF netdev was to pick between one of multiple PFs on the single PCI
>device. I had written up a whole huge wall of text explaining how the method
>being used in this patch must be only partially complete, because it wasn't
>matching the phys_port_name of the VF netdev to the phys_port_name of
>the proper PF netdev.
>
>
>But then after the email with Marcelo, his offhand statement slowly crept out
>of my unconscious, prompted along by my realization that your regexp isn't
>matching the devices that look to be different for different VFs, e.g. pf0vf1, it
>is matching the plain "p0". After that, Marcelo provided me with access to a
>machine with a ConnectX-5 card, where I confirmed for myself that there is
>only a single PF on any PCI device, and that the "p0"-named device is the PF.
>*Finally* it makes sense why your patch doesn't try to find a device that is
>matched to the particular VF in any way, but just looks for the device that has
>a fixed pattern for phys_port_name - it's because any PCI device will only
>have a single PF netdev, and we just have to figure out which one that is.
>
>
>Okay, so now that I finally understand, and have deleted the original wall of
>text.
>
>
>A few things about the patch:
>
>
>1) the new utility function that you added looks very useful, but should be
>added in a separate patch.
>
>
>2) the patches don't apply cleanly any more.
>
>
>3) since the regexp is fixed, I don't think it needs to be sent as an argument to
>the function - it can just be referenced directly when needed.
>
>
>4) I think it should be explained in the code that in the case of using
>phys_port_name, there is only a single netdev on the PCI device that can
>possibly be the PF, whereas in the case of using phys_port_id, there could be
>multiple PFs, and so we have to match the phys_port_id of the VF (having
>that bit of info would have saved me lots of agony :-)

Makes sense, definitely we should add some more info as stated in 4. 

>
>
>If you like I can try splitting the existing patch and fixing the merge conflicts,
>and put the results on gitlab or something for you to try out. I feel like I should
>do *something*, after making you wait for so long :-/
>

That would be great, we are on shutdown during next week, so will be able to look at it only the following week.





More information about the libvir-list mailing list