[libvirt] [PATCH] numad: Set memory policy according to the advisory nodeset from numad
Osier Yang
jyang at redhat.com
Tue May 8 07:11:09 UTC 2012
On 2012年05月08日 01:06, Eric Blake wrote:
> On 05/07/2012 02:34 AM, Osier Yang wrote:
>> Though numad will manage the memory allocation of task dynamically,
>> but it wants management application (libvirt) to pre-set the memory
>> policy according to the advisory nodeset returned from querying numad,
>> (just like pre-bind CPU nodeset for domain process), and thus the
>> performance could benifit much more from it.
>
> s/benifit/benefit/
>
>>
>> This patch introduces new XML tag 'placement', value 'auto' indicates
>> whether to set the memory policy with the advisory nodeset from numad,
>> and its value defaults to 'static'. e.g.
>>
>> <numatune>
>> <memory placement='auto' mode='interleave'/>
>> </numatune>
>>
>> Just like what current "numatune" does, the 'auto' numa memory policy
>> setting uses libnuma's API too.
>>
>> So, to full drive numad, one needs to specify placement='auto' for
>> both "<vcpu>" and"<numatune><memory .../></numatune>". It's a bit
>> inconvenient, but makes sense from semantics' point of view.
>>
>> ---
>> An alternative way is to not introduce the new XML tag, and pre-set
>> the memory policy implicitly with "<vcpu placement='auto'>4</vcpu>",
>> but IMHO it implies too much, and I'd not like go this way unless
>> the new XML tag is not accepted.
>
> I think we need a hybrid approach.
Not sure, honestly, :-)
>
> Existing users wrote XML that used just<vcpu placement='auto'> but
> expecting full numad support. Normally, numad is most useful if it
> controls _both_ memory (<numatune>) and vcpu placement together, but it
> is feasible to control either one in isolation.
Yes.
>
> Therefore, I think what we need to do is specify that on input, we have
> four situations:
>
> 1. /domain/vcpu at placement is missing
> /domain/numatune/memory at placement is missing
> We default to mode='static' in both places, and avoid numad
>
> 2. /domain/vcpu at placement is present
> /domain/numatune/memory at placement is missing
> We copy vcpu placement over to numa placement (and if
> placement='auto', that means we use numad for everything)
>
> 3. /domain/vcpu at placement is missing
> /domain/numatune/memory at placement is present
> We copy numa placement over to vcpu placement (and if
> placement='auto', that means we use numad for everything)
>
> 4. /domain/vcpu at placement is present
> /domain/numatune/memory at placement is present
> We use exactly what the user specifies (and if only one of the two
> features is placement='auto', then the other feature avoids numad)
>
> Mode 3 and 4 would be the new modes possible by the XML added in this
> patch. Mode 1 is the default for XML in use by guests before numad
> integration, and Mode 2 is the XML for guests attempting to use numad in
> the 0.9.11 release; I'm okay changing the semantics of mode 2 to be
> easier to use (because you can use mode 4 if you really wanted the
> unusual semantics of numad vcpu placement but not memory placement).
>
> Then on output, we always output mode in both<vcpu> and<numatune>, as
> in mode 4 (yeah, that probably means touching a lot of tests, but that's
> life when we add new XML).
It sounds clear, but I'm not sure if it could make things a chaos, as it
implicts so much in XML parsing, while the XMLs are probably used
by other drivers in future. But anyway I'm going forward this way to
see if it is.
>
> I'm still a bit torn on whether this must go in before 0.9.12, since
> we're past rc1. But since it looks like this is more of an oversight
> (our numad usage in 0.9.11 is not very useful, since it missed memory
> placement), this can be argued to be a bug fix rather than a new
> feature, even though it is adding XML, so I will probably be okay with a
> v2 patch going in before the final 0.9.12 release.
>
> On to the actual code of v1:
>
>>
>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>> index e1fe0c4..01b3124 100644
>> --- a/docs/formatdomain.html.in
>> +++ b/docs/formatdomain.html.in
>> @@ -580,9 +580,14 @@
>> The optional<code>memory</code> element specifies how to allocate memory
>> for the domain process on a NUMA host. It contains two attributes,
>
> s/two attributes,/several optional attributes./
>
>> attribute<code>mode</code> is either 'interleave', 'strict',
>
> s/attribute/Attribute/
>
>> - or 'preferred',
>> - attribute<code>nodeset</code> specifies the NUMA nodes, it leads same
>> - syntax with attribute<code>cpuset</code> of element<code>vcpu</code>.
>> + or 'preferred', attribute<code>nodeset</code> specifies the NUMA nodes,
>
> s/'preferred', attribute/'preferred'. Attribute/
>
>> + it leads same syntax with attribute<code>cpuset</code> of element
>
> s/it leads same syntax with/using the same syntax as/
>
>> +<code>vcpu</code>, the optional attribute<code>placement</code> can be
>
> s/</code>, the/</code>. The/
>
>> + used to indicate the memory placement mode for domain process, its value
>> + can be either "static" or "auto", defaults to "static". "auto" indicates
>
> Given my above comments, this would default to the same placement as
> used in<vcpu>.
>
>> + the domain process will only allocate memory from the advisory nodeset
>> + returned from querying numad, and the value of attribute<code>nodeset</code>
>> + will be ignored if it's specified.
>> <span class='since'>Since 0.9.3</span>
>
> Mention that attribute placement is since 0.9.12.
Ok.
>
>> </dd>
>> </dl>
>> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
>> index 8419ccc..9b509f1 100644
>> --- a/docs/schemas/domaincommon.rng
>> +++ b/docs/schemas/domaincommon.rng
>> @@ -562,16 +562,35 @@
>> <element name="numatune">
>> <optional>
>> <element name="memory">
>> -<attribute name="mode">
>> -<choice>
>> -<value>strict</value>
>> -<value>preferred</value>
>> -<value>interleave</value>
>> -</choice>
>> -</attribute>
>> -<attribute name="nodeset">
>> -<ref name="cpuset"/>
>> -</attribute>
>> +<choice>
>> +<group>
>> +<attribute name="mode">
>> +<choice>
>> +<value>strict</value>
>> +<value>preferred</value>
>> +<value>interleave</value>
>> +</choice>
>> +</attribute>
>> +<attribute name='placement'>
>> +<choice>
>> +<value>static</value>
>> +<value>auto</value>
>> +</choice>
>> +</attribute>
>> +</group>
>> +<group>
>> +<attribute name="mode">
>> +<choice>
>> +<value>strict</value>
>> +<value>preferred</value>
>> +<value>interleave</value>
>> +</choice>
>> +</attribute>
>> +<attribute name="nodeset">
>> +<ref name="cpuset"/>
>> +</attribute>
>> +</group>
>> +</choice>
>
> It looks like you are requiring mode='...' in both choices, so the real
> choice is whether you permit placement or nodeset. However, this RNG
> will forbid placement='static' nodeset='0-3', so you didn't get it quite
> right. I almost think you want:
>
> <attribute name='mode'>
> <choice>
> <value>strict</value>
> <value>preferred</value>
> <value>interleave</value>
> </choice>
> </attribute>
> <choice>
> <group>
> <optional>
> <attribute name='placement'>
> <value>static</value>
> </attribute>
> </optional>
> <optional>
> <attribute name='nodeset'>
> <ref name='cpuset'/>
> </attribute>
> </optional>
> </group>
> <attribute name='placement'>
> <value>auto</value>
> </attribute>
> </choice>
>
> which says that both placement and nodeset can be missing (by my new
> semantics, that would mean that placement defaults to<vcpu>, otherwise
> to static); or that nodeset can be present (placement would be implied
> as static), or that placement can explicitly be set to static.
> Meanwhile, it says that placement='auto' cannot be mixed with
> nodeset='...' (is that right,
right.
or do we output nodeset even with numad
> automatic placement to show the user what they actually have at the
> current moment?).
No, we intended to show the cpuset for "vcpu", but it's removed later,
as it doesn't make much sense for numad will schedule the resources
dynamically.
>
>> +++ b/libvirt.spec.in
>> @@ -454,6 +454,7 @@ BuildRequires: scrub
>>
>> %if %{with_numad}
>> BuildRequires: numad
>> +BuildRequires: numactl-devel
>> %endif
>
> This should be split into an independent fix. Furthermore, it needs to
> be gated - in F16, 'numad' provided the development tools that were
> split off into 'numactl-devel' for F17. So this really needs to be a
> versioned check:
>
> %if %{with_numad}
> %if 0%{?fedora}>= 17
> BuildRequires: numactl-devel
> %else
> BuildRequires: numad
> %endif
Ok.
>
>
>> @@ -8055,13 +8052,40 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
>> virDomainReportError(VIR_ERR_XML_ERROR,
>> _("Unsupported NUMA memory "
>> "tuning mode '%s'"),
>> - tmp);
>> + tmp);
>
> Spurious indentation change. The old spacing was correct.
Ah, yes.
>
>
>> +
>> + /* "nodeset" leads same syntax with "cpuset". */
>
> s/leads same syntax with/uses the same syntax as/
Ok.
>
>> struct _virDomainTimerCatchupDef {
>> @@ -1504,6 +1511,7 @@ struct _virDomainNumatuneDef {
>> struct {
>> char *nodemask;
>> int mode;
>> + int placement_mode;
>
> Add a comment mentioning which enum values are valid for assignment to
> this variable.
Ok.
Thanks for the reviewing!
Regards,
Osier
More information about the libvir-list
mailing list