[libvirt] [PATCH v4 1/2] qemu_domain: NVLink2 bridge detection function for PPC64

Alexey Kardashevskiy aik at ozlabs.ru
Mon Apr 8 08:35:47 UTC 2019



On 04/04/2019 00:52, Daniel Henrique Barboza wrote:
> 
> 
> On 4/3/19 4:05 AM, Erik Skultety wrote:
>> On Tue, Apr 02, 2019 at 05:27:55PM -0300, Daniel Henrique Barboza wrote:
>>>
>>> On 4/2/19 5:34 AM, Erik Skultety wrote:
>>>> On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza
>>>> wrote:
>>>>> The NVLink2 support in QEMU implements the detection of NVLink2
>>>>> capable devices by verifying the attributes of the VFIO mem region
>>>>> QEMU allocates for the NVIDIA GPUs. To properly allocate an
>>>>> adequate amount of memLock, Libvirt needs this information before
>>>>> a QEMU instance is even created, thus querying QEMU is not
>>>>> possible and opening a VFIO window is too much.
>>>>>
>>>>> An alternative is presented in this patch. Making the following
>>>>> assumptions:
>>>>>
>>>>> - if we want GPU RAM to be available in the guest, an NVLink2 bridge
>>>>> must be passed through;
>>>>>
>>>>> - an unknown PCI device can be classified as an NVLink2 bridge
>>>>> if its device tree node has 'ibm,gpu', 'ibm,nvlink',
>>>>> 'ibm,nvlink-speed' and 'memory-region'.
>>>> Alexey mentioned that it should be enough to check for the
>>>> properties ^above.
>>>> I'm just wondering, knowing this is IBM's PPC8/9, whether the
>>>> assumptions we have made are going to hold for further revisions of
>>>> PPC, NVLink and GPUs. IOW, we need to be sure "ibm,nvlink" won't be
>>>> renamed in further revisions, e.g. for cards other than the V100 in
>>>> the future, because then compatibility and revision selection come
>>>> into the picture.
>>> Perhaps Alexey or Piotr can comment on this. I can't confirm that the
>>> device node naming will remain as is in the long run.
>>>
>>>
>>>>> This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge
>>>>> that checks the device tree node of a given PCI device and
>>>>> verifies whether it meets the criteria to be an NVLink2 bridge. This
>>>> Just out of curiosity, what about NVLink 1.0? Apart from performance,
>>>> I wasn't able to find anything useful in terms of compatibility. Is
>>>> there something to consider, since we're only relying on NVLink 2.0?
>>> NVLink1, as far as Libvirt goes, works like a regular GPU passthrough.
>>> There are no changes in the memory topology in QEMU that need extra
>>> code to adjust the rlimit.
>>>
>>> I am not entirely sure whether the code above might generate a
>>> false positive, mistakenly detecting an NV2 scenario for an NV1 GPU.
>>> I think the 'memory-region' attribute won't be present in an NV1
>>> bridge. Piotr, can you comment here?
>>>
>> I'm glad you mentioned it. Well, the worst that can happen if it
>> generates a false positive is that we raise the rlimit even though we
>> don't need to, which in general is a concern, since malicious guests
>> with relaxed limits can lock enough memory that the host doesn't have
>> enough left for itself. This cannot be prevented completely; however,
>> we should still follow our "best effort" approach, IOW we should make
>> sure that we only adjust the limit if necessary.
> 
> Just verified that this code will *not* result in NV2 false positives
> for NV1 passthrough.
> 
> As I've said before, NV1 works as a regular VFIO passthrough. The main
> difference is that there are *no* NVLink bridges being passed through
> as well, which is exactly what the code here is detecting.

> For reference, the link below contains instructions on how NV1
> passthrough of a Tesla K40m works in Libvirt:
> 
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1541902/comments/26
> 
> TL;DR, it consists simply of the <hostdev> element with the GPU and, in
> case multiple NV1 GPUs are being passed through, an extra global
> parameter. There is no way to mistake NV1 for NV2 if we're looking for
> NV2 bridges being passed through, like we're doing here.


No. The K40 is not NV1's GPU (although it is the same core); it is
"NVIDIA Corporation Device 15fe". The difference to the V100 (from P9
boxes) is that previously NVLink1 was used for (very, very fast) DMA
_only_ (hence the same limits as usual), but NVLink2 provides direct
access for a CPU to the GPU's RAM, so that memory appears on the host,
can be DMA'ed to/from (by a network adapter, for example) and has to be
mapped via the IOMMU, resulting in bigger IOMMU tables.
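
To put rough numbers on that (purely illustrative, not the actual memLock
computation in this series - the guest size and GPU count below are made
up):

/* Purely illustrative arithmetic: the point is only that with NVLink2
 * the GPU RAM windows become host-mappable memory that VFIO may pin,
 * so the lock limit has to grow beyond guest RAM alone. */
#include <stdio.h>

int
main(void)
{
    unsigned long long guestRamGiB = 64;      /* example guest RAM     */
    unsigned long long gpuRamGiB   = 2 * 16;  /* e.g. two 16 GiB V100s */

    printf("memory that may need pinning: >= %llu GiB (plus VFIO overhead)\n",
           guestRamGiB + gpuRamGiB);
    return 0;
}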

So, NVLink1 also required the bridges to be passed through, but there was
no change to VFIO or QEMU in any way; there are still "ibm,gpu" and
"ibm,npu" properties in the host device tree - the Linux powernv platform
relies on them. There are no "ibm,nvlink" and "memory-region" properties
in those nodes on NVLink1 systems though, and since the device tree comes
from skiboot, which is the firmware our team controls, I can say this API
won't change incompatibly.
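
For reference, a minimal standalone sketch of the kind of check being
discussed - it only tests that the four device tree properties exist
under the device's of_node in sysfs. The path layout
(/sys/bus/pci/devices/<addr>/of_node/) and the function name are
assumptions made for this example, not code taken from the patch:

/* Sketch: a PCI device is treated as an NVLink2 bridge only if all four
 * device tree properties are present under its of_node in sysfs. */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool
isNVLink2Bridge(const char *pciAddr)
{
    static const char *props[] = {
        "ibm,gpu", "ibm,nvlink", "ibm,nvlink-speed", "memory-region",
    };
    size_t i;

    for (i = 0; i < sizeof(props) / sizeof(props[0]); i++) {
        char path[PATH_MAX];

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/of_node/%s",
                 pciAddr, props[i]);
        if (access(path, F_OK) != 0)
            return false;   /* any missing property disqualifies it */
    }
    return true;
}

int
main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pci-address>\n", argv[0]);
        return 1;
    }
    printf("%s: %s\n", argv[1],
           isNVLink2Bridge(argv[1]) ? "looks like an NVLink2 bridge"
                                    : "not an NVLink2 bridge");
    return 0;
}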



-- 
Alexey



