[libvirt] [RFC 1/3] xml: introduce more config elements for NVDIMM memory

Zhong, Luyao luyao.zhong at intel.com
Fri Nov 16 15:35:23 UTC 2018



On 11/8/2018 5:09 AM, John Ferlan wrote:
> 
> 
> On 10/16/18 10:21 PM, Luyao Zhong wrote:
>> In order to align with QEMU ,four more parameters about NVDIMM will
>> be introduced into Libvirt xml.
>>
>> 1.alignsize
>> The 'alignsize' option allows users to specify the proper alignment. When
>> mmap(2) the backend files, QEMU uses the host page size by default as
>> the alignment of mapping address. However, some backends may require
>> alignments different from the pagesize. For example, mmap a device DAX
>> on NVDIMM maybe 2M-aligned.
>>
>> 2.pmem
>> The 'pmem' option allows users to specify whether the backend storage of
>> memory-backend-file is a real persistent memory that can be accessed in
>> SNIA NVM Programming Model. If the vNVDIMM backend is in host persistent
>> memory , and 'pmem' is 'on' and QEMU is built with libpmem, qemu will
>> guarantee the persistence of its own writes to the vNVDIMM backend.
>>
>> 3.persistence
>> The 'persistence' option allows users to set platform-supported features
>> about NVDIMM data persistence of a guest. It has two values, 'mem-ctrl'
>> means the platform supports flushing dirty data from the memory controller
>> to the NVDIMMs in the event of power loss, 'cpu' means The platform supports
>> flushing dirty data from the CPU cache to the NVDIMMs in the event of power
>> loss, which contains what 'mem-ctrl' means.
>>
>> 4.unarmed
>> The 'unarmed' option allows users to mark vNVDIMM read-only. Only one type
>> of vNVDIMM backends can guarantee the guest write persistence, which is the
>> device DAX on the real NVDIMM device (e.g., /dev/dax0.0). When using other
>> types of backends, it's suggested to set 'unarmed' option to 'on', so the
>> guest Linux NVDIMM driver will mark such vNVDIMM device as read-only.
>>
> 
> caveat: I didn't research too deeply into each of these options, but
> I'll give you some feedback from the aspect of how you should formulate
> your patches.
> 
> Rather than essentially replicate what was added in formatdomain provide
> the examples here and a summary of what the proposed XML would "look
> like", e.g.
> 
> Since 'alignsize' and 'pmem' seem to be related specifically to memory
> mapped files:
> 
>    <memory model='nvdimm'>
>      <source>
>        <path>/tmp/nvdimm</path>
>        <alignsize unit='KiB'>2048</alignsize>
>        <pmem>on</pmem>
>      </source>
>      ...
>    </memory>
> 
Got it.

> The <alignsize> would seem to similar in description to the existing
> <pagesize> - although yes for a different model type.....
> 
No, the 'alignsize' is totally another conception.It means our data 
structure must aligned with a fixed size.

> The <pmem> seems to have two states "on" or "off", thus having just
> <pmem/> similar to perhaps how nosharepages is handled.
> 
The 'pmem' has nothing to do with the 'nosharepages'. Actually, the 
'nvdimm' model can only support 'shared' access mode now. The 'pmem' 
just tell the QEMU if the backend file is a real persistence memory 
device, so QEMU will know if itself should guarantee the write persistent.

> Since they're both memory mapped related changes, keep them together
> from conf, xml, rng, etc. Then the "next" patch would be to do the qemu
> and capability changes. With the last being any virsh changes to allow
> modification on the command line (if possible).
> 
Got it.

> The "persistence" seems to be less of a <memory> option and more of a
> <machine> option, true/false?  It would thus need to be separated from
> the others. Still similar to the first one, separate patch for conf,
> xml, rng, html... Then patch for qemu...
> 
It really show the platform features, but it's memory-related. So my 
opinion is keep this along with other options in <source>, which cover 
all configurations about backend NVDIMM on host.

> The "unarmed" is a <target> option since it's being added in patch2
> after node and label are processed, so it should be separate from the
> <source> and <machine> option patches.  If it's just going to have 2
> states, then it doesn't need <unarmed>on</unarmed> - instead it could
> just be <unarmed/>.
> 
Got it. My opinion is keep it in <target>, which cover all 
configurations about guest vNVDIMM.

>> For more details, see
>> https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt
> 
> Not a commit message worthy...  Place it under the ---...
> 
>>
>> Signed-off-by: Zhong,Luyao <luyao.zhong at intel.com>
>> ---
>>   docs/formatdomain.html.in     | 98 ++++++++++++++++++++++++++++++++++++-------
>>   docs/schemas/domaincommon.rng | 31 ++++++++++++--
>>   2 files changed, 111 insertions(+), 18 deletions(-)
>>
> 
> As noted above, but perhaps difficult to determine - the html.in, .rng,
> domain_conf, xml2xml tests, etc. should be in one patch.  Then another
> patch for the qemu, capabilities, xml2argv tests, etc. in the next
> patch. Then if you're going to add virsh options a separate patch for
> that (although probably doesn't apply here).
> 
Got it.

> After each patch, running make check syntax-check needs to pass.
> 
Got it.

>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>> index 8189959..9dec984 100644
>> --- a/docs/formatdomain.html.in
>> +++ b/docs/formatdomain.html.in
>> @@ -8258,6 +8258,8 @@ qemu-kvm -net nic,model=? /dev/null
>>     <memory model='nvdimm'>
>>       <source>
>>         <path>/tmp/nvdimm</path>
>> +      <alignsize unit='KiB'>2048</alignsize>
>> +      <pmem>on</pmem>
>>       </source>
>>       <target>
>>         <size unit='KiB'>524288</size>
>> @@ -8265,6 +8267,8 @@ qemu-kvm -net nic,model=? /dev/null
>>         <label>
>>           <size unit='KiB'>128</size>
>>         </label>
>> +      <persistence>cpu</persistence>
>> +      <unarmed>on</unarmed>
>>       </target>
>>     </memory>
>>   </devices>
>> @@ -8339,10 +8343,38 @@ qemu-kvm -net nic,model=? /dev/null
>>           </dl>
>>   
>>           <p>
>> -          For model <code>nvdimm</code> this element is mandatory and has a
>> -          single child element <code>path</code> that represents a path
>> -          in the host that backs the nvdimm module in the guest.
>> +          For model <code>nvdimm</code> this element is mandatory. The
>> +          mandatory child element <code>path</code> represents a path in
>> +          the host that backs the nvdimm module in the guest. If
>> +          <code>nvdimm</code> is provided, then the following optional
>> +          elements can be provided as well:
> 
> You'd need to add a since tags here for the "new" work...e.g.:
> 
> <span class='since'>since 4.10.0</span>
> 
> Whenever something new is added for a specific version, that's what's
> necessary.

Got it.

> 
>>           </p>
>> +
>> +        <dl>
>> +          <dt><code>alignsize</code></dt>
>> +          <dd>
>> +            <p>
>> +              This element can be used to specify a proper alignment.
>> +              When mmap(2) the backend files, QEMU uses the host page
>> +              size by default as the alignment of mapping address. However,
>> +              some backends may require alignments different from the page.
>> +              For example, mmap a device DAX on NVDIMM maybe 2M-aligned.
>> +            </p>
>> +          </dd>
>> +
>> +          <dt><code>pmem</code></dt>
>> +          <dd>
>> +            <p>
>> +              This element can be used to specify whether the backend storage
>> +              of memory-backend-file is a real persistent memory that can be
>> +              accessed in SNIA NVM Programming Model. If the backend  is a real
>                                                                         ^
> there's two spaces between backend and is
> 
Got it. It's my mistake.

>> +              persistence memory and <code>pmem</code> is set to 'on', QEMU will
>> +              guarantee the persistence of its own writes to the vNVDIMM backend,
>> +              but which can work well with libpmem support (QEMU configured with
>> +              --enable-libpmem).
>> +            </p>
>> +          </dd>
>> +        </dl>
>>         </dd>
> 
> The probably should be reworded in syntax more suitable for libvirt's
> usage without including QEMU details.
> 
Got it. I will reword this.

> The pmem one in particular may be hard to describe... The fact that it
> doesn't work without libpmem does raise a couple of warning flags. I
> would just say that if <pmem/> is there, then it's on if it's not, then
> it's off. Internally that then isn't a tristate, it just an on/off.
> 
>>   
>>         <dt><code>target</code></dt>
>> @@ -8361,19 +8393,55 @@ qemu-kvm -net nic,model=? /dev/null
>>             NUMA nodes configured.
>>           </p>
>>           <p>
>> -          For NVDIMM type devices one can optionally use
>> -          <code>label</code> and its subelement <code>size</code>
>> -          to configure the size of namespaces label storage
>> -          within the NVDIMM module. The <code>size</code> element
>> -          has usual meaning described
>> -          <a href="#elementsMemoryAllocation">here</a>.
>> -          For QEMU domains the following restrictions apply:
>> +          Besides, the following optional elements can be provided as well for
>> +          NVDIMM type devices:
>>           </p>
>> -        <ol>
>> -          <li>the minimum label size is 128KiB,</li>
>> -          <li>the remaining size (total-size - label-size) has to be aligned to
>> -            4KiB</li>
>> -        </ol>
>> +
>> +        <dl>
>> +          <dt><code>label</code></dt>
>> +          <dd>
>> +            <p>
>> +              For NVDIMM type devices one can optionally use
>> +              <code>label</code> and its subelement <code>size</code>
>> +              to configure the size of namespaces label storage
>> +              within the NVDIMM module. The <code>size</code> element
>> +              has usual meaning described
>> +              <a href="#elementsMemoryAllocation">here</a>.
>> +              For QEMU domains the following restrictions apply:
>> +            </p>
>> +            <ol>
>> +              <li>the minimum label size is 128KiB,</li>
>> +              <li>the remaining size (total-size - label-size) will be aligned to
>> +                4KiB as default.</li>
>> +            </ol>
>> +          </dd>
>> +
>> +          <dt><code>persistence</code></dt>
>> +          <dd>
>> +            <p>
>> +              The <code>persistence</code> element can be set to "mem-ctl" or "cpu",
> 
> "mem-ctrl"
> 
>> +              which indicate platform-supported features about NVDIMM data persistence.
>> +              'mem-ctrl' means the platform supports flushing dirty data from the memory
>> +              controller to the NVDIMMs in the event of power loss, 'cpu' means The platform
> 
> means the platform
> 
>> +              supports flushing dirty data from the CPU cache to the NVDIMMs in the event
>> +              of power loss, which contains what 'mem-ctrl' means.
>> +              ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
>> +              in NFIT, so the guest ACPI NFIT will be filled in according to the persistence
>> +              option.
>> +            </p>
>> +          </dd>
> 
> That's a hard read - someone has to know about ACPI 6.2 Errata A and
> whatever it is and where to find it?
> 
Sorry, maybe I put too much details here, there is no need to have 
knowledge about ACPI. I will delete the ACPI things.

> Also, I'm not sure this belongs on the NVDIMM line... Allowing it means
> more than one could use this and they could use separate values, each
> then being placed on the command line and then you don't know what goes
> with what.
> 
> That is what happens if we have
> 
>    <memory model='nvdimm'>
>    ...
>      <target>
>      ...
>        <persistence>mem-ctrl</persistence>
>      ...
>    </memory>
>    <memory model='nvdimm'>
>    ...
>      <target>
>      ...
>        <persistence>cpu</persistence>
>      ...
>    </memory>
> 
I have not considered this situation, I need do more tests and find a 
solution.

>> +
>> +          <dt><code>unarmed</code></dt>
>> +          <dd>
>> +            <p>
>> +              The <code>unarmed</code> element can be used to mark vNVDIMM read-only
>> +              through setting unarmed flag in guest NFIT.Currently the only vNVDIMM backend
> 
> What is NFIT?
> 
'NVDIMM Firmware Interface Table' defined in ACPI. I will delete this 
for its hard to understand and not very necessary.

> Add a space before Currently
> 
Got it.

>> +              can guarantee the guest write persistence is device DAX on Linux, so it's
> 
> What is DAX?
> 
'DAX' means 'direct access'.

>> +              suggested to set <code>unarmed</code> to 'on' when using other types of
>> +              backends.
>> +            </p>
>> +          </dd>
>> +        </dl>
> 
> unarmed will be similar to pmem...
> 
> Once these are properly separated it'll be easier to review the text in
> more detail. But again, like I noted earlier try to use more generic
> terms and avoid specifically calling out QEMU. Although that may be
> difficult - you have to consider the audience - they just want to know
> what the feature does in general and how it will help them or make
> things better for them. The details of what we rename it to under the
> covers hides the complexity.
> 
>>         </dd>
>>       </dl>
>>   
>> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
>> index 099a949..cca7869 100644
>> --- a/docs/schemas/domaincommon.rng
>> +++ b/docs/schemas/domaincommon.rng
>> @@ -5353,9 +5353,21 @@
>>             </interleave>
>>           </group>
>>           <group>
>> -          <element name="path">
>> -            <ref name="absFilePath"/>
>> -          </element>
>> +          <interleave>
>> +            <element name="path">
>> +              <ref name="absFilePath"/>
>> +            </element>
>> +            <optional>
>> +              <element name="alignsize">
>> +                <ref name="scaledInteger"/>
>> +              </element>
>> +            </optional>
>> +            <optional>
>> +              <element name="pmem">
>> +                <ref name="virOnOff"/>
> 
> Again, I'd follow the nosharepages example here.
> 
See my explain above, NVDIMM model only supports 'shared' access now and 
the <pmem> has nothing to do with 'nosharepages'.

>> +              </element>
>> +            </optional>
>> +          </interleave>
>>           </group>
>>         </choice>
>>       </element>
>> @@ -5379,6 +5391,19 @@
>>               </element>
>>             </element>
>>           </optional>
>> +        <optional>
>> +          <element name="persistence">
>> +            <choice>
>> +              <value>mem-ctrl</value>
>> +              <value>cpu</value>
>> +            </choice>
>> +          </element>
>> +        </optional>
> 
> While the data is correct, the placement isn't...
> 
>> +        <optional>
>> +          <element name="unarmed">
>> +            <ref name="virOnOff"/>
> 
> Similar here to be like nosharepages.
> 
The <unarmed> has nothing to do with 'nosharepages'.

> John
Thanks for your review and so much detailed comments.
Luyao
> 
>> +          </element>
>> +        </optional>
>>         </interleave>
>>       </element>
>>     </define>
>>




More information about the libvir-list mailing list