[libvirt] [PATCH] numad: Set memory policy according to the advisory nodeset from numad

Osier Yang jyang at redhat.com
Tue May 8 07:11:09 UTC 2012


On 2012年05月08日 01:06, Eric Blake wrote:
> On 05/07/2012 02:34 AM, Osier Yang wrote:
>> Though numad will manage the memory allocation of task dynamically,
>> but it wants management application (libvirt) to pre-set the memory
>> policy according to the advisory nodeset returned from querying numad,
>> (just like pre-bind CPU nodeset for domain process), and thus the
>> performance could benifit much more from it.
>
> s/benifit/benefit/
>
>>
>> This patch introduces new XML tag 'placement', value 'auto' indicates
>> whether to set the memory policy with the advisory nodeset from numad,
>> and its value defaults to 'static'. e.g.
>>
>>    <numatune>
>>      <memory placement='auto' mode='interleave'/>
>>    </numatune>
>>
>> Just like what current "numatune" does, the 'auto' numa memory policy
>> setting uses libnuma's API too.
>>
>> So, to full drive numad, one needs to specify placement='auto' for
>> both "<vcpu>" and"<numatune><memory .../></numatune>". It's a bit
>> inconvenient, but makes sense from semantics' point of view.
>>
>> ---
>> An alternative way is to not introduce the new XML tag, and pre-set
>> the memory policy implicitly with "<vcpu placement='auto'>4</vcpu>",
>> but IMHO it implies too much, and I'd not like go this way unless
>> the new XML tag is not accepted.
>
> I think we need a hybrid approach.

Not sure, honestly, :-)

>
> Existing users wrote XML that used just<vcpu placement='auto'>  but
> expecting full numad support. Normally, numad is most useful if it
> controls _both_ memory (<numatune>) and vcpu placement together, but it
> is feasible to control either one in isolation.

Yes.

>
> Therefore, I think what we need to do is specify that on input, we have
> four situations:
>
> 1. /domain/vcpu at placement is missing
>     /domain/numatune/memory at placement is missing
>     We default to mode='static' in both places, and avoid numad
>
> 2. /domain/vcpu at placement is present
>     /domain/numatune/memory at placement is missing
>     We copy vcpu placement over to numa placement (and if
> placement='auto', that means we use numad for everything)
>
> 3. /domain/vcpu at placement is missing
>     /domain/numatune/memory at placement is present
>     We copy numa placement over to vcpu placement (and if
> placement='auto', that means we use numad for everything)
>
> 4. /domain/vcpu at placement is present
>     /domain/numatune/memory at placement is present
>     We use exactly what the user specifies (and if only one of the two
> features is placement='auto', then the other feature avoids numad)
>
> Mode 3 and 4 would be the new modes possible by the XML added in this
> patch.  Mode 1 is the default for XML in use by guests before numad
> integration, and Mode 2 is the XML for guests attempting to use numad in
> the 0.9.11 release; I'm okay changing the semantics of mode 2 to be
> easier to use (because you can use mode 4 if you really wanted the
> unusual semantics of numad vcpu placement but not memory placement).
>
> Then on output, we always output mode in both<vcpu>  and<numatune>, as
> in mode 4 (yeah, that probably means touching a lot of tests, but that's
> life when we add new XML).

It sounds clear, but I'm not sure if it could make things a chaos, as it 
implicts so much in XML parsing, while the XMLs are probably used
by other drivers in future. But anyway I'm going forward this way to
see if it is.

>
> I'm still a bit torn on whether this must go in before 0.9.12, since
> we're past rc1.  But since it looks like this is more of an oversight
> (our numad usage in 0.9.11 is not very useful, since it missed memory
> placement), this can be argued to be a bug fix rather than a new
> feature, even though it is adding XML, so I will probably be okay with a
> v2 patch going in before the final 0.9.12 release.
>
> On to the actual code of v1:
>
>>
>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>> index e1fe0c4..01b3124 100644
>> --- a/docs/formatdomain.html.in
>> +++ b/docs/formatdomain.html.in
>> @@ -580,9 +580,14 @@
>>           The optional<code>memory</code>  element specifies how to allocate memory
>>           for the domain process on a NUMA host. It contains two attributes,
>
> s/two attributes,/several optional attributes./
>
>>           attribute<code>mode</code>  is either 'interleave', 'strict',
>
> s/attribute/Attribute/
>
>> -        or 'preferred',
>> -        attribute<code>nodeset</code>  specifies the NUMA nodes, it leads same
>> -        syntax with attribute<code>cpuset</code>  of element<code>vcpu</code>.
>> +        or 'preferred', attribute<code>nodeset</code>  specifies the NUMA nodes,
>
> s/'preferred', attribute/'preferred'.  Attribute/
>
>> +        it leads same syntax with attribute<code>cpuset</code>  of element
>
> s/it leads same syntax with/using the same syntax as/
>
>> +<code>vcpu</code>, the optional attribute<code>placement</code>  can be
>
> s/</code>, the/</code>.  The/
>
>> +        used to indicate the memory placement mode for domain process, its value
>> +        can be either "static" or "auto", defaults to "static". "auto" indicates
>
> Given my above comments, this would default to the same placement as
> used in<vcpu>.
>
>> +        the domain process will only allocate memory from the advisory nodeset
>> +        returned from querying numad, and the value of attribute<code>nodeset</code>
>> +        will be ignored if it's specified.
>>           <span class='since'>Since 0.9.3</span>
>
> Mention that attribute placement is since 0.9.12.

Ok.

>
>>         </dd>
>>       </dl>
>> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
>> index 8419ccc..9b509f1 100644
>> --- a/docs/schemas/domaincommon.rng
>> +++ b/docs/schemas/domaincommon.rng
>> @@ -562,16 +562,35 @@
>>           <element name="numatune">
>>             <optional>
>>               <element name="memory">
>> -<attribute name="mode">
>> -<choice>
>> -<value>strict</value>
>> -<value>preferred</value>
>> -<value>interleave</value>
>> -</choice>
>> -</attribute>
>> -<attribute name="nodeset">
>> -<ref name="cpuset"/>
>> -</attribute>
>> +<choice>
>> +<group>
>> +<attribute name="mode">
>> +<choice>
>> +<value>strict</value>
>> +<value>preferred</value>
>> +<value>interleave</value>
>> +</choice>
>> +</attribute>
>> +<attribute name='placement'>
>> +<choice>
>> +<value>static</value>
>> +<value>auto</value>
>> +</choice>
>> +</attribute>
>> +</group>
>> +<group>
>> +<attribute name="mode">
>> +<choice>
>> +<value>strict</value>
>> +<value>preferred</value>
>> +<value>interleave</value>
>> +</choice>
>> +</attribute>
>> +<attribute name="nodeset">
>> +<ref name="cpuset"/>
>> +</attribute>
>> +</group>
>> +</choice>
>
> It looks like you are requiring mode='...' in both choices, so the real
> choice is whether you permit placement or nodeset.  However, this RNG
> will forbid placement='static' nodeset='0-3', so you didn't get it quite
> right.  I almost think you want:
>
> <attribute name='mode'>
>    <choice>
>      <value>strict</value>
>      <value>preferred</value>
>      <value>interleave</value>
>    </choice>
> </attribute>
> <choice>
>    <group>
>      <optional>
>        <attribute name='placement'>
>          <value>static</value>
>        </attribute>
>      </optional>
>      <optional>
>        <attribute name='nodeset'>
>          <ref name='cpuset'/>
>        </attribute>
>      </optional>
>    </group>
>    <attribute name='placement'>
>      <value>auto</value>
>    </attribute>
> </choice>
>
> which says that both placement and nodeset can be missing (by my new
> semantics, that would mean that placement defaults to<vcpu>, otherwise
> to static); or that nodeset can be present (placement would be implied
> as static), or that placement can explicitly be set to static.
> Meanwhile, it says that placement='auto' cannot be mixed with
> nodeset='...' (is that right,

right.

or do we output nodeset even with numad
> automatic placement to show the user what they actually have at the
> current moment?).

No, we intended to show the cpuset for "vcpu", but it's removed later,
as it doesn't make much sense for numad will schedule the resources
dynamically.

>
>> +++ b/libvirt.spec.in
>> @@ -454,6 +454,7 @@ BuildRequires: scrub
>>
>>   %if %{with_numad}
>>   BuildRequires: numad
>> +BuildRequires: numactl-devel
>>   %endif
>
> This should be split into an independent fix.  Furthermore, it needs to
> be gated - in F16, 'numad' provided the development tools that were
> split off into 'numactl-devel' for F17.  So this really needs to be a
> versioned check:
>
> %if %{with_numad}
> %if 0%{?fedora}>= 17
> BuildRequires: numactl-devel
> %else
> BuildRequires: numad
> %endif

Ok.

>
>
>> @@ -8055,13 +8052,40 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
>>                               virDomainReportError(VIR_ERR_XML_ERROR,
>>                                                    _("Unsupported NUMA memory "
>>                                                      "tuning mode '%s'"),
>> -                                                 tmp);
>> +                                                   tmp);
>
> Spurious indentation change.  The old spacing was correct.

Ah, yes.

>
>
>> +
>> +                            /* "nodeset" leads same syntax with "cpuset". */
>
> s/leads same syntax with/uses the same syntax as/

Ok.

>
>>   struct _virDomainTimerCatchupDef {
>> @@ -1504,6 +1511,7 @@ struct _virDomainNumatuneDef {
>>       struct {
>>           char *nodemask;
>>           int mode;
>> +        int placement_mode;
>
> Add a comment mentioning which enum values are valid for assignment to
> this variable.

Ok.


Thanks for the reviewing!

Regards,
Osier




More information about the libvir-list mailing list