[libvirt] [PATCH 1/3] XML definitions for guest NUMA

Mon Nov 7 18:35:27 UTC 2011

On 11/06/2011 06:57 AM, Bharata B Rao wrote:
> XML definitions for guest NUMA and parsing routines.
>
> From: Bharata B Rao<bharata at linux.vnet.ibm.com>
>
> This patch adds XML definitions for guest NUMA specification and contains
> routines to parse the same. The guest NUMA specification looks like this:
>
> <cpu>
>          ...
>          <topology sockets='2' cores='4' threads='2'/>
>          <numa>
>                  <cell cpus='0-7' mems='512000'/>
>                  <cell cpus='8-15' mems='512000'/>
>          </numa>
>          ...
> </cpu>
>
> Signed-off-by: Bharata B Rao<bharata at linux.vnet.ibm.com>
> ---

> +<p>
> +      Guest NUMA topology can be specifed using<code>numa</code>  element.
> +<span class="since">Since X.X.X</span>

Let's just put 0.9.8 here.  It's easier at feature freeze time to grep 
and replace a concrete 0.9.8 into a different numbering scheme (0.10.0, 
1.0.0, ?) if we decide on something different than 0.9.8, than it is to 
remember to also search for X.X.X.

> +</p>
> +
> +<pre>
> +  ...
> +<cpu>
> +    ...
> +<numa>
> +<cell cpus='0-3' mems='512000'/>
> +<cell cpus='4-7' mems='512000'/>

I understand 'cpus' (a valid word meaning multiple cpu units), but 
'mems' seems odd; I think it would be better naming this attribute 
'memory' to match our <memory> element at the top level.  Just because 
qemu's command line names the option mems= doesn't mean we should be 
stuck making our XML inconsistent.

> +</numa>
> +    ...
> +</cpu>
> +  ...</pre>
> +
> +<p>
> +      Each<code>cell</code>  element specifies a NUMA cell or a NUMA node.
> +<code>cpus</code>  specifies the CPU or range of CPUs that are part of
> +      the node.<code>mems</code>  specifies the node memory in kilobytes
> +      (i.e. blocks of 1024 bytes). Each cell or node is assigned cellid
> +      or nodeid in the increasing order starting from 0.

I agree with doing things in 1024-byte blocks [1], since <memory> and 
<currentMemory> are also in that unit.

[1] 1024-byte blocks is technically kibibytes, not kilobytes; but you're 
copying from existing text, so at least we're consistent, not to mention 
fitting right in with the wikipedia complaint that KiB has had slow 
adoption by the computer industry: 
https://secure.wikimedia.org/wikipedia/en/wiki/Kibibyte :)

> +</p>
> +
> +<p>
> +      This guest NUMA specification translates to<code>-numa</code>  command
> +      line option for QEMU/KVM. For the above example, the following QEMU
> +      command line option is generated:
> +<code>-numa node,nodeid=0,cpus=0-3,mems=512000 -numa node,nodeid=1,cpus=4-7,mems=512000</code>

This paragraph is not necessary.  We don't need to give one 
hypervisor-specific example of how the XML is translated; it is 
sufficient to simply document the XML semantics in a way that can be 
implemented by any number of hypervisors.

> +
> +<define name="numaCell">
> +<element name="cell">
> +<attribute name="cpus">
> +<ref name="Cellcpus"/>

Typically, ref names start with a lower case letter.

> +</attribute>
> +<attribute name="mems">
> +<ref name="Cellmems"/>
> +</attribute>

Is it possible for these attributes to be optional?  That is, on the 
qemu line, can I specify cpus= but not mems=, or mems= but not cpus=? 
If so, then the attributes need to be optional in the schema, and the 
code behave with sane defaults when one of the two attributes is not 
present.

> @@ -2745,4 +2767,14 @@
>         <param name="pattern">[a-zA-Z0-9_\.:]+</param>
>       </data>
>     </define>
> +<define name="Cellcpus">
> +<data type="string">
> +<param name="pattern">([0-9]+(-[0-9]+)?|\^[0-9]+)(,([0-9]+(-[0-9]+)?|\^[0-9]+))*</param>

This looks like a repeat of <define name="cpuset">; if so, let's reuse 
that define instead of making a new one (and if not, why are we 
introducing yet another syntax?).

> +</data>
> +</define>
> +<define name="Cellmems">
> +<data type="unsignedInt">
> +<param name="pattern">[0-9]+</param>

Likewise, this looks like a repeat of <define name="memoryKB">, so lets 
reuse that.

> @@ -109,6 +114,19 @@ no_memory:
>       return NULL;
>   }
>
> +static int
> +virCPUDefNumaCPUs(virCPUDefPtr def)
> +{
> +    int i, j, ncpus = 0;
> +
> +    for (i = 0; i<  def->ncells; i++) {
> +        for (j = 0; j<  VIR_DOMAIN_CPUMASK_LEN; j++) {
> +            if (def->cells[i].cpumask[j])
> +                ncpus++;
> +        }
> +    }

Can this loop be made any faster by using count_one_bits?

> +    return ncpus;
> +}
>
>   virCPUDefPtr
>   virCPUDefParseXML(const xmlNodePtr node,
> @@ -289,6 +307,50 @@ virCPUDefParseXML(const xmlNodePtr node,
>           def->features[i].policy = policy;
>       }
>
> +    if (virXPathNode("./numa[1]", ctxt)) {
> +        VIR_FREE(nodes);
> +        n = virXPathNodeSet("./numa[1]/cell", ctxt,&nodes);
> +        if (n<  0 || n == 0) {

This looks a bit funny, compared to if (n <= 0).

> +        for (i = 0 ; i<  n ; i++) {
> +            char *cpus;
> +            int cpumasklen = VIR_DOMAIN_CPUMASK_LEN;
> +            unsigned long ul;
> +            int ret;
> +
> +            def->cells[i].cellid = i;
> +            cpus = virXMLPropString(nodes[i], "cpus");
> +
> +            if (VIR_ALLOC_N(def->cells[i].cpumask, cpumasklen)<  0)
> +                goto no_memory;
> +
> +            if (virDomainCpuSetParse((const char **)&cpus,
> +                                 0, def->cells[i].cpumask,
> +                                 cpumasklen)<  0)

Does this behave properly if the cpus=... attribute was missing?

> +                goto error;
> +
> +            ret = virXPathULong("string(./numa[1]/cell/@mems)",
> +                            ctxt,&ul);
> +            if (ret<  0) {
> +                virCPUReportError(VIR_ERR_INTERNAL_ERROR,
> +                    "%s", _("Missing 'mems' attribute in NUMA topology"));
> +                goto error;

Especially since you checked for a missing mems= counterpart?  And back 
to my earlier question of whether both attributes are really mandatory, 
or whether just one in isolation can make sense.

> +++ b/src/conf/cpu_conf.h
> @@ -67,6 +67,14 @@ struct _virCPUFeatureDef {
>       int policy;         /* enum virCPUFeaturePolicy */
>   };
>
> +typedef struct _virCellDef virCellDef;
> +typedef virCellDef *virCellDefPtr;
> +struct _virCellDef {
> +   int cellid;
> +   char *cpumask;	/* CPUs that are part of this node */
> +   unsigned int mem;	/* Node memory */

I'd write this as /* Node memory in kB */, to make it clear what scale 
is in use.

I saw the code for parsing the XML, but what about the code for 
generating the xml during 'virsh dumpxml' (aka virDomainDefFormat)?  The 
patch is incomplete without that.

-- 
Eric Blake   eblake at redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org