[libvirt] [PATCH] numad: Set memory policy according to the advisory nodeset from numad

Eric Blake eblake at redhat.com
Mon May 7 17:06:55 UTC 2012


On 05/07/2012 02:34 AM, Osier Yang wrote:
> Though numad will manage the memory allocation of task dynamically,
> but it wants management application (libvirt) to pre-set the memory
> policy according to the advisory nodeset returned from querying numad,
> (just like pre-bind CPU nodeset for domain process), and thus the
> performance could benifit much more from it.

s/benifit/benefit/

> 
> This patch introduces new XML tag 'placement', value 'auto' indicates
> whether to set the memory policy with the advisory nodeset from numad,
> and its value defaults to 'static'. e.g.
> 
>   <numatune>
>     <memory placement='auto' mode='interleave'/>
>   </numatune>
> 
> Just like what current "numatune" does, the 'auto' numa memory policy
> setting uses libnuma's API too.
> 
> So, to full drive numad, one needs to specify placement='auto' for
> both "<vcpu>" and "<numatune><memory .../></numatune>". It's a bit
> inconvenient, but makes sense from semantics' point of view.
> 
> ---
> An alternative way is to not introduce the new XML tag, and pre-set
> the memory policy implicitly with "<vcpu placement='auto'>4</vcpu>",
> but IMHO it implies too much, and I'd not like go this way unless
> the new XML tag is not accepted.

I think we need a hybrid approach.

Existing users wrote XML that used just <vcpu placement='auto'> but
expecting full numad support. Normally, numad is most useful if it
controls _both_ memory (<numatune>) and vcpu placement together, but it
is feasible to control either one in isolation.

Therefore, I think what we need to do is specify that on input, we have
four situations:

1. /domain/vcpu at placement is missing
   /domain/numatune/memory at placement is missing
   We default to mode='static' in both places, and avoid numad

2. /domain/vcpu at placement is present
   /domain/numatune/memory at placement is missing
   We copy vcpu placement over to numa placement (and if
placement='auto', that means we use numad for everything)

3. /domain/vcpu at placement is missing
   /domain/numatune/memory at placement is present
   We copy numa placement over to vcpu placement (and if
placement='auto', that means we use numad for everything)

4. /domain/vcpu at placement is present
   /domain/numatune/memory at placement is present
   We use exactly what the user specifies (and if only one of the two
features is placement='auto', then the other feature avoids numad)

Mode 3 and 4 would be the new modes possible by the XML added in this
patch.  Mode 1 is the default for XML in use by guests before numad
integration, and Mode 2 is the XML for guests attempting to use numad in
the 0.9.11 release; I'm okay changing the semantics of mode 2 to be
easier to use (because you can use mode 4 if you really wanted the
unusual semantics of numad vcpu placement but not memory placement).

Then on output, we always output mode in both <vcpu> and <numatune>, as
in mode 4 (yeah, that probably means touching a lot of tests, but that's
life when we add new XML).

I'm still a bit torn on whether this must go in before 0.9.12, since
we're past rc1.  But since it looks like this is more of an oversight
(our numad usage in 0.9.11 is not very useful, since it missed memory
placement), this can be argued to be a bug fix rather than a new
feature, even though it is adding XML, so I will probably be okay with a
v2 patch going in before the final 0.9.12 release.

On to the actual code of v1:

> 
> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
> index e1fe0c4..01b3124 100644
> --- a/docs/formatdomain.html.in
> +++ b/docs/formatdomain.html.in
> @@ -580,9 +580,14 @@
>          The optional <code>memory</code> element specifies how to allocate memory
>          for the domain process on a NUMA host. It contains two attributes,

s/two attributes,/several optional attributes./

>          attribute <code>mode</code> is either 'interleave', 'strict',

s/attribute/Attribute/

> -        or 'preferred',
> -        attribute <code>nodeset</code> specifies the NUMA nodes, it leads same
> -        syntax with attribute <code>cpuset</code> of element <code>vcpu</code>.
> +        or 'preferred', attribute <code>nodeset</code> specifies the NUMA nodes,

s/'preferred', attribute/'preferred'.  Attribute/

> +        it leads same syntax with attribute <code>cpuset</code> of element

s/it leads same syntax with/using the same syntax as/

> +        <code>vcpu</code>, the optional attribute <code>placement</code> can be

s/</code>, the/</code>.  The/

> +        used to indicate the memory placement mode for domain process, its value
> +        can be either "static" or "auto", defaults to "static". "auto" indicates

Given my above comments, this would default to the same placement as
used in <vcpu>.

> +        the domain process will only allocate memory from the advisory nodeset
> +        returned from querying numad, and the value of attribute <code>nodeset</code>
> +        will be ignored if it's specified.
>          <span class='since'>Since 0.9.3</span>

Mention that attribute placement is since 0.9.12.

>        </dd>
>      </dl>
> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
> index 8419ccc..9b509f1 100644
> --- a/docs/schemas/domaincommon.rng
> +++ b/docs/schemas/domaincommon.rng
> @@ -562,16 +562,35 @@
>          <element name="numatune">
>            <optional>
>              <element name="memory">
> -              <attribute name="mode">
> -                <choice>
> -                  <value>strict</value>
> -                  <value>preferred</value>
> -                  <value>interleave</value>
> -                </choice>
> -              </attribute>
> -              <attribute name="nodeset">
> -                <ref name="cpuset"/>
> -              </attribute>
> +              <choice>
> +                <group>
> +                  <attribute name="mode">
> +                    <choice>
> +                      <value>strict</value>
> +                      <value>preferred</value>
> +                      <value>interleave</value>
> +                    </choice>
> +                  </attribute>
> +                  <attribute name='placement'>
> +                    <choice>
> +                      <value>static</value>
> +                      <value>auto</value>
> +                    </choice>
> +                  </attribute>
> +                </group>
> +                <group>
> +                  <attribute name="mode">
> +                    <choice>
> +                      <value>strict</value>
> +                      <value>preferred</value>
> +                      <value>interleave</value>
> +                    </choice>
> +                  </attribute>
> +                  <attribute name="nodeset">
> +                    <ref name="cpuset"/>
> +                  </attribute>
> +                </group>
> +              </choice>

It looks like you are requiring mode='...' in both choices, so the real
choice is whether you permit placement or nodeset.  However, this RNG
will forbid placement='static' nodeset='0-3', so you didn't get it quite
right.  I almost think you want:

<attribute name='mode'>
  <choice>
    <value>strict</value>
    <value>preferred</value>
    <value>interleave</value>
  </choice>
</attribute>
<choice>
  <group>
    <optional>
      <attribute name='placement'>
        <value>static</value>
      </attribute>
    </optional>
    <optional>
      <attribute name='nodeset'>
        <ref name='cpuset'/>
      </attribute>
    </optional>
  </group>
  <attribute name='placement'>
    <value>auto</value>
  </attribute>
</choice>

which says that both placement and nodeset can be missing (by my new
semantics, that would mean that placement defaults to <vcpu>, otherwise
to static); or that nodeset can be present (placement would be implied
as static), or that placement can explicitly be set to static.
Meanwhile, it says that placement='auto' cannot be mixed with
nodeset='...' (is that right, or do we output nodeset even with numad
automatic placement to show the user what they actually have at the
current moment?).


> +++ b/libvirt.spec.in
> @@ -454,6 +454,7 @@ BuildRequires: scrub
>  
>  %if %{with_numad}
>  BuildRequires: numad
> +BuildRequires: numactl-devel
>  %endif

This should be split into an independent fix.  Furthermore, it needs to
be gated - in F16, 'numad' provided the development tools that were
split off into 'numactl-devel' for F17.  So this really needs to be a
versioned check:

%if %{with_numad}
%if 0%{?fedora} >= 17
BuildRequires: numactl-devel
%else
BuildRequires: numad
%endif


> @@ -8055,13 +8052,40 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
>                              virDomainReportError(VIR_ERR_XML_ERROR,
>                                                   _("Unsupported NUMA memory "
>                                                     "tuning mode '%s'"),
> -                                                 tmp);
> +                                                   tmp);

Spurious indentation change.  The old spacing was correct.


> +
> +                            /* "nodeset" leads same syntax with "cpuset". */

s/leads same syntax with/uses the same syntax as/

>  struct _virDomainTimerCatchupDef {
> @@ -1504,6 +1511,7 @@ struct _virDomainNumatuneDef {
>      struct {
>          char *nodemask;
>          int mode;
> +        int placement_mode;

Add a comment mentioning which enum values are valid for assignment to
this variable.

-- 
Eric Blake   eblake at redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 620 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20120507/aca385fd/attachment-0001.sig>


More information about the libvir-list mailing list