[libvirt] [PATCH v3 3/4] qemu_domain: NVLink2 device tree functions for PPC64

Erik Skultety eskultet at redhat.com
Thu Mar 7 15:15:07 UTC 2019


On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
> The NVLink2 support in QEMU implements the detection of NVLink2
> capable devices by verfying the attributes of the VFIO mem region
> QEMU allocates for the NVIDIA GPUs. To properly allocate an
> adequate amount of memLock, Libvirt needs this information before
> a QEMU instance is even created.
>
> An alternative is presented in this patch. Given a PCI device,
> we'll traverse the device tree at /proc/device-tree to check if
> the device has a NPU bridge, retrieve the node of the NVLink2 bus,
> find the memory-node that is related to the bus and see if it's a
> NVLink2 bus by inspecting its 'reg' value. This logic is contained
> inside the 'device_is_nvlink2_capable' function, which uses other
> new helper functions to navigate and fetch values from the device
> tree nodes.
>
> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
> ---

This fails with a bunch of compilation errors, please make sure you run make
check and make syntax-check on each patch of the series.

>  src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 194 insertions(+)
>
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index 77548c224c..97de5793e2 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>  }
>
>
> +/**
> + * Reads a phandle file and returns the phandle value.
> + */
> +static int
> +read_dt_phandle(const char* file)

..we prefer camelCase naming of functions...all functions should have a prefix,
e.g. vir,<driver>, etc.

> +{
> +    unsigned int buf[1];
> +    size_t read;
> +    FILE *f;
> +
> +    f = fopen(file, "r");
> +    if (!f)
> +        return -1;
> +
> +    read = fread(buf, sizeof(unsigned int), 1, f);

We already have safe helpers that take care of such operations.

> +
> +    if (!read) {
> +        VIR_CLOSE(f);

You could use VIR_AUTOCLOSE for this

> +        return 0;
> +    }
> +
> +    VIR_CLOSE(f);
> +    return be32toh(buf[0]);

Is this part of gnulib? We need to make sure it's portable otherwise we'd need
to make the conversion ourselves, just like for le64toh

> +}
> +
> +
> +/**
> + * Reads a memory reg file and returns the first 4 int values.
> + *
> + * The caller is responsible for freeing the returned array.
> + */
> +static unsigned int *
> +read_dt_memory_reg(const char *file)
> +{
> +    unsigned int *buf;
> +    size_t read, i;
> +    FILE *f;
> +
> +    f = fopen(file, "r");
> +    if (!f)
> +        return NULL;
> +
> +    if (VIR_ALLOC_N(buf, 4) < 0)
> +        return NULL;
> +
> +    read = fread(buf, sizeof(unsigned int), 4, f);
> +
> +    if (!read && read < 4)
> +        /* shouldn't happen */
> +        VIR_FREE(buf);
> +    else for (i = 0; i < 4; i++)
> +        buf[i] = be32toh(buf[i]);
> +
> +    VIR_CLOSE(f);
> +    return buf;

Same comments as above...

> +}
> +
> +
> +/**
> + * This wrapper function receives arguments to be used in a
> + * 'find' call to retrieve the file names that matches
> + * the criteria inside the /proc/device-tree dir.
> + *
> + * A 'find' call with '-iname phandle' inside /proc/device-tree
> + * provides more than a thousand matches. Adding '-path' to
> + * narrow it down further is necessary to keep the file
> + * listing sane.
> + *
> + * The caller is responsible to free the buffer returned by
> + * this function.
> + */
> +static char *
> +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern)
> +{
> +    virCommandPtr cmd = NULL;
> +    char *output = NULL;
> +
> +    cmd = virCommandNew("find");
> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern,
> +                         "-iname", file_pattern, NULL);
> +    virCommandSetOutputBuffer(cmd, &output);
> +
> +    if (virCommandRun(cmd, NULL) < 0)
> +        VIR_FREE(output);
> +
> +    virCommandFree(cmd);
> +    return output;
> +}
> +
> +
> +/**
> + * Helper function that receives a listing of file names and
> + * calls read_dt_phandle() on each one finding for a match
> + * with the given phandle argument. Returns the file name if a
> + * match is found, NULL otherwise.
> + */
> +static char *
> +find_dt_file_with_phandle(char *files, int phandle)
> +{
> +    char *line, *tmp;
> +    int ret;
> +
> +    line = strtok_r(files, "\n", &tmp);
> +    do {
> +       ret = read_dt_phandle(line);
> +       if (ret == phandle)
> +           break;
> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
> +
> +    return line;
> +}
> +
> +
> +/**
> + * This function receives a string that represents a PCI device,
> + * such as '0004:04:00.0', and tells if the device is NVLink2 capable.
> + *
> + * The logic goes as follows:
> + *
> + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu'
> + * attribute;
> + * 2 - find the device tree node of the nvlink bus using the phandle
> + * found in (1)
> + * 3 - get the phandle of the memory region of the nvlink bus
> + * 4 - find the device tree node of the memory region using the
> + * phandle found in (3)
> + * 5 - read the 'reg' value of the memory region. If the value of
> + * the second 64 bit value is 0x02 0x00, the device is attached
> + * to a NVLink2 bus.
> + *
> + * If any of these steps fails, the function returns false.
> + */
> +static bool
> +device_is_nvlink2_capable(const char *device)
> +{
> +    char *file, *files, *tmp;
> +    unsigned int *reg;
> +    int phandle;
> +
> +    if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu",
> +                    device)) < 0)
> +        return false;
> +
> +    /* Find phandles of nvlinks:  */
> +    if ((phandle = read_dt_phandle(file)) == -1)
> +        return false;
> +
> +    /* Find a DT node for the phandle found */
> +    files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle");
> +    if (!files)
> +        return false;
> +
> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
> +        goto fail;
> +
> +    /* Find a phandle of the GPU memory region of the device. The
> +     * file found above ends with '/phandle' - the memory region
> +     * of the GPU ends with '/memory-region */
> +    tmp = strrchr(file, '/');
> +    *tmp = '\0';
> +    file = strcat(file, "/memory-region");
> +
> +    if ((phandle = read_dt_phandle(file)) == -1)
> +        goto fail;
> +
> +    file = NULL;
> +    VIR_FREE(files);
> +
> +    /* Find the memory node for the phandle found above */
> +    files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle");
> +    if (!files)
> +        return false;
> +
> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
> +        goto fail;
> +
> +    /* And see its size in the second 64bit value of 'reg'. First,
> +     * the end of the file needs to be changed from '/phandle' to
> +     * '/reg' */
> +    tmp = strrchr(file, '/');
> +    *tmp = '\0';
> +    file = strcat(file, "/reg");
> +
> +    reg = read_dt_memory_reg(file);
> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
> +        return true;

This function is very complex just to find out whether a PCI device is capable
of NVLink or not. Feels wrong to do it this way, I believe it would be much
easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports
NVLink so that our node-device driver would take care of this during libvirtd
startup and then you'd only call a single API from the PPC64 helper you're
introducing in the next patch to find out whether you need the alternative
formula or not.

Honestly, apart from the coding style issues, I don't like this approach and
unless there's a really compelling reason for libvirt to do it in a way which
involves spawning a 'find' process because of a complex pattern and a bunch of
data necessary to filter out, I'd really suggest contacting NVIDIA about this.
It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which
we're able to read.
Thinking about it a bit more, since this is NVIDIA-specific, having an
NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned
above. In general, we try to avoid introducing vendor-specific code, so nodedev
might not be the right place to check for the attribute (because it's not
universal and would require vendor-specific elements), but I think we could
have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt
codebase.

Erik




More information about the libvir-list mailing list