[libvirt] [PATCH v3 3/4] qemu_domain: NVLink2 device tree functions for PPC64

Alexey Kardashevskiy aik at ozlabs.ru
Tue Mar 12 10:20:06 UTC 2019



On 12/03/2019 20:39, Daniel Henrique Barboza wrote:
> 
> 
> On 3/12/19 1:05 AM, Alexey Kardashevskiy wrote:
>>
>> On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
>>>
>>> On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
>>>> On 08/03/2019 04:51, Piotr Jaroszynski wrote:
>>>>> On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
>>>>>> On 3/7/19 12:15 PM, Erik Skultety wrote:
>>>>>>> On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza
>>>>>>> wrote:
>>>>>>>> The NVLink2 support in QEMU implements the detection of NVLink2
>>>>>>>> capable devices by verifying the attributes of the VFIO mem region
>>>>>>>> QEMU allocates for the NVIDIA GPUs. To properly allocate an
>>>>>>>> adequate amount of memLock, Libvirt needs this information before
>>>>>>>> a QEMU instance is even created.
>>>>>>>>
>>>>>>>> An alternative is presented in this patch. Given a PCI device,
>>>>>>>> we'll traverse the device tree at /proc/device-tree to check if
>>>>>>>> the device has an NPU bridge, retrieve the node of the NVLink2 bus,
>>>>>>>> find the memory-node that is related to the bus, and see if it's an
>>>>>>>> NVLink2 bus by inspecting its 'reg' value. This logic is contained
>>>>>>>> inside the 'device_is_nvlink2_capable' function, which uses other
>>>>>>>> new helper functions to navigate and fetch values from the device
>>>>>>>> tree nodes.
>>>>>>>>
>>>>>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>>>>>>>> ---
>>>>>>> This fails with a bunch of compilation errors; please make sure you
>>>>>>> run 'make check' and 'make syntax-check' on each patch of the series.
>>>>>> Ooops, forgot to follow up make syntax-check with 'make' before
>>>>>> submitting ... my bad.
>>>>>>
>>>>>>>>     src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>>     1 file changed, 194 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>>>>>>>> index 77548c224c..97de5793e2 100644
>>>>>>>> --- a/src/qemu/qemu_domain.c
>>>>>>>> +++ b/src/qemu/qemu_domain.c
>>>>>>>> @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>>>>>>>>     }
>>>>>>>>
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * Reads a phandle file and returns the phandle value.
>>>>>>>> + */
>>>>>>>> +static int
>>>>>>>> +read_dt_phandle(const char* file)
>>>>>>> ...we prefer camelCase naming of functions... all functions should
>>>>>>> have a prefix, e.g. vir, <driver>, etc.
>>>>>>>
>>>>>>>> +{
>>>>>>>> +    unsigned int buf[1];
>>>>>>>> +    size_t read;
>>>>>>>> +    FILE *f;
>>>>>>>> +
>>>>>>>> +    f = fopen(file, "r");
>>>>>>>> +    if (!f)
>>>>>>>> +        return -1;
>>>>>>>> +
>>>>>>>> +    read = fread(buf, sizeof(unsigned int), 1, f);
>>>>>>> We already have safe helpers that take care of such operations.
>>>>>>>
>>>>>>>> +
>>>>>>>> +    if (!read) {
>>>>>>>> +        VIR_CLOSE(f);
>>>>>>> You could use VIR_AUTOCLOSE for this
>>>>>>>
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    VIR_CLOSE(f);
>>>>>>>> +    return be32toh(buf[0]);
>>>>>>> Is this part of gnulib? We need to make sure it's portable;
>>>>>>> otherwise we'd need to make the conversion ourselves, just like
>>>>>>> for le64toh.
>>>>>>>
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Reads a memory reg file and returns the first 4 int values.
>>>>>>>> + *
>>>>>>>> + * The caller is responsible for freeing the returned array.
>>>>>>>> + */
>>>>>>>> +static unsigned int *
>>>>>>>> +read_dt_memory_reg(const char *file)
>>>>>>>> +{
>>>>>>>> +    unsigned int *buf;
>>>>>>>> +    size_t read, i;
>>>>>>>> +    FILE *f;
>>>>>>>> +
>>>>>>>> +    f = fopen(file, "r");
>>>>>>>> +    if (!f)
>>>>>>>> +        return NULL;
>>>>>>>> +
>>>>>>>> +    if (VIR_ALLOC_N(buf, 4) < 0)
>>>>>>>> +        return NULL;
>>>>>>>> +
>>>>>>>> +    read = fread(buf, sizeof(unsigned int), 4, f);
>>>>>>>> +
>>>>>>>> +    if (!read && read < 4)
>>>>>>>> +        /* shouldn't happen */
>>>>>>>> +        VIR_FREE(buf);
>>>>>>>> +    else for (i = 0; i < 4; i++)
>>>>>>>> +        buf[i] = be32toh(buf[i]);
>>>>>>>> +
>>>>>>>> +    VIR_CLOSE(f);
>>>>>>>> +    return buf;
>>>>>>> Same comments as above...
>>>>>>>
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * This wrapper function receives arguments to be used in a
>>>>>>>> + * 'find' call to retrieve the file names that match
>>>>>>>> + * the criteria inside the /proc/device-tree dir.
>>>>>>>> + *
>>>>>>>> + * A 'find' call with '-iname phandle' inside /proc/device-tree
>>>>>>>> + * provides more than a thousand matches. Adding '-path' to
>>>>>>>> + * narrow it down further is necessary to keep the file
>>>>>>>> + * listing sane.
>>>>>>>> + *
>>>>>>>> + * The caller is responsible for freeing the buffer returned by
>>>>>>>> + * this function.
>>>>>>>> + */
>>>>>>>> +static char *
>>>>>>>> +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern)
>>>>>>>> +{
>>>>>>>> +    virCommandPtr cmd = NULL;
>>>>>>>> +    char *output = NULL;
>>>>>>>> +
>>>>>>>> +    cmd = virCommandNew("find");
>>>>>>>> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern,
>>>>>>>> +                         "-iname", file_pattern, NULL);
>>>>>>>> +    virCommandSetOutputBuffer(cmd, &output);
>>>>>>>> +
>>>>>>>> +    if (virCommandRun(cmd, NULL) < 0)
>>>>>>>> +        VIR_FREE(output);
>>>>>>>> +
>>>>>>>> +    virCommandFree(cmd);
>>>>>>>> +    return output;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Helper function that receives a listing of file names and
>>>>>>>> + * calls read_dt_phandle() on each one finding for a match
>>>>>>>> + * with the given phandle argument. Returns the file name if a
>>>>>>>> + * match is found, NULL otherwise.
>>>>>>>> + */
>>>>>>>> +static char *
>>>>>>>> +find_dt_file_with_phandle(char *files, int phandle)
>>>>>>>> +{
>>>>>>>> +    char *line, *tmp;
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    line = strtok_r(files, "\n", &tmp);
>>>>>>>> +    do {
>>>>>>>> +       ret = read_dt_phandle(line);
>>>>>>>> +       if (ret == phandle)
>>>>>>>> +           break;
>>>>>>>> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
>>>>>>>> +
>>>>>>>> +    return line;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * This function receives a string that represents a PCI device,
>>>>>>>> + * such as '0004:04:00.0', and tells if the device is NVLink2 capable.
>>>>>>>> + *
>>>>>>>> + * The logic goes as follows:
>>>>>>>> + *
>>>>>>>> + * 1 - get the phandle of an nvlink of the device, reading the 'ibm,npu'
>>>>>>>> + * attribute;
>>>>>>>> + * 2 - find the device tree node of the nvlink bus using the phandle
>>>>>>>> + * found in (1)
>>>>>>>> + * 3 - get the phandle of the memory region of the nvlink bus
>>>>>>>> + * 4 - find the device tree node of the memory region using the
>>>>>>>> + * phandle found in (3)
>>>>>>>> + * 5 - read the 'reg' value of the memory region. If the second
>>>>>>>> + * 64-bit value is 0x20 0x00, the device is attached to an NVLink2 bus.
>>>>>>>> + *
>>>>>>>> + * If any of these steps fails, the function returns false.
>>>>>>>> + */
>>>>>>>> +static bool
>>>>>>>> +device_is_nvlink2_capable(const char *device)
>>>>>>>> +{
>>>>>>>> +    char *file, *files, *tmp;
>>>>>>>> +    unsigned int *reg;
>>>>>>>> +    int phandle;
>>>>>>>> +
>>>>>>>> +    if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu",
>>>>>>>> +                    device)) < 0)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    /* Find phandles of nvlinks:  */
>>>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    /* Find a DT node for the phandle found */
>>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle");
>>>>>>>> +    if (!files)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>>>> +        goto fail;
>>>>>>>> +
>>>>>>>> +    /* Find a phandle of the GPU memory region of the device. The
>>>>>>>> +     * file found above ends with '/phandle' - the memory region
>>>>>>>> +     * of the GPU ends with '/memory-region' */
>>>>>>>> +    tmp = strrchr(file, '/');
>>>>>>>> +    *tmp = '\0';
>>>>>>>> +    file = strcat(file, "/memory-region");
>>>>>>>> +
>>>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>>> +        goto fail;
>>>>>>>> +
>>>>>>>> +    file = NULL;
>>>>>>>> +    VIR_FREE(files);
>>>>>>>> +
>>>>>>>> +    /* Find the memory node for the phandle found above */
>>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle");
>>>>>>>> +    if (!files)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>>>> +        goto fail;
>>>>>>>> +
>>>>>>>> +    /* And see its size in the second 64bit value of 'reg'. First,
>>>>>>>> +     * the end of the file needs to be changed from '/phandle' to
>>>>>>>> +     * '/reg' */
>>>>>>>> +    tmp = strrchr(file, '/');
>>>>>>>> +    *tmp = '\0';
>>>>>>>> +    file = strcat(file, "/reg");
>>>>>>>> +
>>>>>>>> +    reg = read_dt_memory_reg(file);
>>>>>>>> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
>>>>>>>> +        return true;
>>>>>>> This function is very complex just to find out whether a PCI device
>>>>>>> is capable of NVLink or not. It feels wrong to do it this way; I
>>>>>>> believe it would be much easier if NVIDIA exposed a sysfs attribute
>>>>>>> saying whether a PCI device supports NVLink, so that our node-device
>>>>>>> driver could take care of this during libvirtd startup. Then you'd
>>>>>>> only call a single API from the PPC64 helper you're introducing in
>>>>>>> the next patch to find out whether you need the alternative formula
>>>>>>> or not.
>>>>>>>
>>>>>>> Honestly, apart from the coding style issues, I don't like this
>>>>>>> approach, and unless there's a really compelling reason for libvirt
>>>>>>> to do it in a way which involves spawning a 'find' process because
>>>>>>> of a complex pattern and a bunch of data that needs to be filtered
>>>>>>> out, I'd really suggest contacting NVIDIA about this. It's the same
>>>>>>> for mdevs: NVIDIA exposes a bunch of attributes in sysfs which we're
>>>>>>> able to read.
>>>>>> I'll contact NVIDIA and see if there is an easier way (a sysfs
>>>>>> attribute, for example) and, if there isn't, try to provide a
>>>>>> plausible reason to justify this detection code.
>>>>> Sorry for the delay in responding. The problem is that all the V100
>>>>> GPUs support NVLink, but it may or may not be connected up. This is
>>>>> detected at runtime during GPU initialization, which seems like much
>>>>> too heavy an operation to perform as part of passthrough
>>>>> initialization. And that's why the vfio-pci pieces rely on device
>>>>> tree information to figure it out.
>>>>>
>>>>> Alexey, would it be possible for vfio-pci to export the information
>>>>> in a way more friendly to libvirt?
>>>> The only information needed here is whether a specific GPU has RAM or
>>>> not. This can easily be found from the device-tree, imho quite friendly
>>>> already. VFIO gets to know about these new capabilities when the VFIO
>>>> PCI device is opened, and we'd rather avoid going that far in
>>>> libvirt (open a VFIO container, attach a group, get a vfio-pci fd from
>>>> it, enumerate regions - 2 PCI resets on the way, delays, meh).
>>>>
>>>> btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be
>>>> passed too or the entire RAM thing won't work. "find" for the memory
>>>> node can also be dropped really - if NVLink bridge OF node has
>>>> "memory-region", then VFIO will most likely expose RAM and QEMU will
>>>> try
>>>> using it anyway.
>>>
>>> Hmmm, I am not sure I understood it completely... Looking at the
>>> of_node exposed in sysfs for one of the NVLink buses, we have:
>>>
>>>
>>> /sys/bus/pci/devices/0007:00:00.0/of_node$ ls
>>> class-code           ibm,gpu       ibm,nvlink-speed           memory-region  reg
>>> device-id            ibm,loc-code  ibm,pci-config-space-type  name           revision-id
>>> ibm,device-tgt-addr  ibm,nvlink    interrupts                 phandle        vendor-id
>>>
>>>
>>> We can make a safe assumption that the V100 GPU will always be passed
>>> through with at least one NVLink2 bus.
>> Assumption #1: if we want GPU RAM to be available in the guest, an
>> NVLink bridge must be passed through. No bridge - no RAM - no rlimit
>> adjustment.
>>
>>
>>> How many hops do we need to assert
>>> that a given device is an NVLink2 bus from its of_node info?
>>>
>>> For example, can we say something like:
>>>
>>> "The device node of this device has ibm,gpu and ibm,nvlink and
>>> ibm,nvlink-speed
>>> and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not
>>
>> ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
>>
>>
>>> possible to pass through the bus alone, there is a V100 GPU in this
>>> same IOMMU group. So this is an NVLink2 passthrough scenario"
>>>
>>>
>>> Or perhaps:
>>>
>>> "It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has
>>> a reg
>>> value of 0x20 , thus this is a nvlink2 bus and ....."
>>
>> 0x20 is the size of the GPU RAM window, which might change in different
>> revisions of the hw as well.
>>
>>> Both alternatives are way simpler than what I'm doing in this patch.
>>> I am not sure if they are valid, though.
>>
>> Assumption #2: if the nvlink2 bridge's DT node has memory-region, then
>> the GPU will try to use its RAM. The adjustment you'll have to make
>> won't depend on anything, as the existing GPU RAM placement is such that
>> whatever you can possibly throw at QEMU will fit in the 128TiB window,
>> and we cannot make the window smaller anyway (the next size would be the
>> lower power of two - 64TiB - and GPU RAM lives just above that).
> 
> Ok, so, from the of_node of an unknown device that has been
> passed through to a VM, if the DT node has:
> 
> ibm,gpu
> ibm,nvlink
> memory-region
> 
> 
> Is that enough to justify the rlimit adjustment for NVLink2?


This is the idea, yes. To be on the safe side, you might want to limit the
check to the IBM vendor ID (not even sure about the device id) and PCI
bridges only.
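
For illustration only (this is not part of the patch), here is a rough
sketch of what that simplified of_node check could look like. The function
name qemuDeviceIsNVLink2Bridge and the exact property list are assumptions
taken from this thread, and the vendor ID / PCI bridge class filtering
suggested above is left out:

static bool
qemuDeviceIsNVLink2Bridge(const char *device)
{
    /* Properties this thread suggests indicate an NVLink2 bridge with
     * GPU RAM behind it; treat the list as tentative. */
    const char *files[] = { "ibm,gpu", "ibm,nvlink", "memory-region" };
    size_t i;

    for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
        char *path = NULL;
        bool exists;

        /* e.g. /sys/bus/pci/devices/0007:00:00.0/of_node/ibm,gpu */
        if (virAsprintf(&path, "/sys/bus/pci/devices/%s/of_node/%s",
                        device, files[i]) < 0)
            return false;

        exists = virFileExists(path);
        VIR_FREE(path);

        if (!exists)
            return false;
    }

    return true;
}

The PPC64 memLock helper in the next patch could then call something like
this for each VFIO hostdev and only apply the NVLink2 passthrough formula
when at least one device matches.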


> 
> 
>>
>> I could do better, of course, and allow adjusting the limit to cover,
>> let's say, 66TiB instead of 128TiB; it just seems to be an unnecessary
>> complication (we already allocate only the required amount of TCE
>> entries on demand).
>>
>>
>>
>>>>>>> Thinking about it a bit more, since this is NVIDIA-specific, having
>>>>>>> an NVIDIA-only sysfs attribute doesn't help the node-device driver I
>>>>>>> mentioned above. In general, we try to avoid introducing
>>>>>>> vendor-specific code, so nodedev might not be the right place to
>>>>>>> check for the attribute (because it's not universal and would
>>>>>>> require vendor-specific elements), but I think we could have an
>>>>>>> NVLink helper reading this sysfs attribute (or attributes) elsewhere
>>>>>>> in the libvirt codebase.
>>>>>>>
>>>>>>> Erik
>>
>>
> 

-- 
Alexey



