[libvirt] [PATCH v2 01/10] Introduce NVDIMM memory model

Thu Sep 8 13:50:40 UTC 2016

On 01.09.2016 00:51, John Ferlan wrote:
> 
> 
> On 08/11/2016 09:26 AM, Michal Privoznik wrote:
>> NVDIMM is new type of memory introduced in qemu. The idea is that
> 
> s/in qemu/into QEMU 2.6/
> 
>> we have a DIMM module that keeps the data persistent across
> 
> s/DIMM/Non Volatile memory/
> 
>> domain reboots.
> 
> Perhaps a word of two about what is the usefulness of such a beast. I
> think (from a former project) one usage is to store parameters for
> firmware such as OVMF
> 
> Add extra line between paragraphs...
> 
>> At the domain XML level, we already have some representation of
>> 'dimm' modules. Long story short, we have <memory/> element that
>> lives under <devices/>. Now, the element even has @model
>> attribute which we can use to introduce new memory type:
>>
>>     <memory model='nvdimm'>
>>       <source>
>>         <path>/tmp/nvdimm</path>
>>       </source>
>>       <target>
>>         <size unit='KiB'>523264</size>
>>         <node>0</node>
>>       </target>
>>     </memory>
>>
>> So far, this is just a XML parser/formatter extension. QEMU
>> driver implementation is in the next commit.
>>
>> For more info on NVDIMM visit the following web page:
>>
>>     http://pmem.io/
>>
> 
> One could also google it ;-)... In any case, the actual details are
> found the Documents link from that page, subject to move over time of
> course.
> 
>> Signed-off-by: Michal Privoznik <mprivozn at redhat.com>
>> ---
>>  docs/formatdomain.html.in                          | 26 ++++--
>>  docs/schemas/domaincommon.rng                      | 32 ++++---
>>  src/conf/domain_conf.c                             | 97 ++++++++++++++++------
>>  src/conf/domain_conf.h                             |  2 +
>>  src/qemu/qemu_command.c                            |  6 ++
>>  src/qemu/qemu_domain.c                             |  1 +
>>  .../qemuxml2argv-memory-hotplug-nvdimm.xml         | 49 +++++++++++
>>  7 files changed, 171 insertions(+), 42 deletions(-)
>>  create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-memory-hotplug-nvdimm.xml
>>
>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>> index bfbb0f2..657df8f 100644
>> --- a/docs/formatdomain.html.in
>> +++ b/docs/formatdomain.html.in
>> @@ -6740,6 +6740,15 @@ qemu-kvm -net nic,model=? /dev/null
>>          <node>1</node>
>>        </target>
>>      </memory>
>> +    <memory model='nvdimm'>
>> +      <source>
>> +        <path>/tmp/nvdimm</path>
>> +      </source>
>> +      <target>
>> +        <size unit='KiB'>524287</size>
>> +        <node>1</node>
>> +      </target>
>> +    </memory>
>>    </devices>
>>    ...
>>  </pre>
>> @@ -6747,17 +6756,19 @@ qemu-kvm -net nic,model=? /dev/null
>>        <dt><code>model</code></dt>
>>        <dd>
>>          <p>
>> -          Currently only the <code>dimm</code> model is supported in order to
>> -          add a virtual DIMM module to the guest.
>> +          Select <code>dimm</code> to add a virtual DIMM module to the guest.
> 
> Ugh... The 'dimm' Since tag is in the "Memory devices" section...  Which
> just make the following awkward.  I think you should grab that 1.2.14
> from above and place it here.
> 
> Rather than Select go with "Use" or "Provide"
> 
>> +          Alternatively, <code>nvdimm</code> model adds a Non-Volatile DIMM
> 
> Use the same format - e.g.
> 
> "Use/Provide <code>nvdimm</code> to add a Non-Volatile DIMM module."
> 
>> +          module. <span class="since">Since 2.2.0</span>
> 
> Well we won't hit 2.2.0....
> 
>>          </p>
>>        </dd>
>>  
>>        <dt><code>source</code></dt>
>>        <dd>
>>          <p>
>> -          The optional source element allows to fine tune the source of the
>> -          memory used for the given memory device. If the element is not
>> -          provided defaults configured via <code>numatune</code> are used.
>> +          For model <code>dimm</code> this element is optional and allows to
>> +          fine tune the source of the memory used for the given memory device.
>> +          If the element is not provided defaults configured via
>> +          <code>numatune</code> are used.
>>          </p>
>>          <p>
>>            <code>pagesize</code> can optionally be used to override the default
>> @@ -6770,6 +6781,11 @@ qemu-kvm -net nic,model=? /dev/null
>>            <code>nodemask</code> can optionally be used to override the default
>>            set of NUMA nodes where the memory would be allocated.
>>          </p>
> 
> Since pagesize and nodemask are for DIMM only, so they probably need to
> be converted to some sort of list or in some way indented to ensure the
> visual cue is there that they are meant for dimm.  Perhaps the end of
> the dimm paragraph needs a "If <code>source</code> is provided, then at
> least one of the following values would be provided:".
> 
> I think the only way to get the right formatting look is :
> 
> <dd>
>   <dt><code>pagesize</code></dt>
>     <p> ...
>     </p>
>   <dt><code>nodemask</code></dt>
>     <p> ...
>     </p>
> </dd>
> 
>> +        <p>
>> +          For model <code>nvdimm</code> this element is mandatory and has a
>> +          single child element <code>path</code> which value represents a path
> 
> s/which value/that/ ?
> 
>> +          in host that back the nvdimm module in the guest.
> 
> s/in host that back/in the host that backs/
> 
> Should there be any mention that this also requires "<maxMemory
> slots='#'...>" to be set?
> 
> So something that isn't quite clear... For 'dimm', these are 'extra'
> memory, while for 'nvdimm 'it seems to be one hunk that's really not
> extra memory - rather it's memory that can be used for a specific
> purpose. What's not clear to me - is the existing "physical" memory in
> the guest (e.g. the numa node page) is used to map to the file or does
> this appear as extra memory that is only accessible for this purpose?

Kernel expose this new memory as a block device. As

> 
> Also, does the numa node have to be properly sized?  Your example shows
> a 214Mb numa cell and a 511MB nvdimm.  Doesn't seem to me that fits
> quite right.

This should be okay, It's a new memory that is not accounted as RAM
(it's a block device that guest sees).

> 
> BTW: That video you point to in your v1 indicates two types of NVDIMM -
> "persistent memory namespace" (accessed using loads/stores, mapped to
> physical memory) and "block mode namespace" (accessed via block
> operations, atomicity at block level, indirect mapping thru a block
> window, no mapping entier memory.

In both cases the device is shown as block device, but the "block mode"
is not supported even in qemu. And I doubt it will ever be.

> 
> So which is KVM? Since libvirt must be the master to more than one
> driver - should there be some sort of subtype="block|persistent" as
> well?  Or we could just indicate KVM only for now...

If we are ever gonna implement the block mode, we can introduce an
attribute for which "persistent memory mode" is the default.

> 
>> +        </p>
>>        </dd>
>>  
>>        <dt><code>target</code></dt>

>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>> index 82876f3..cb5a053 100644
>> --- a/src/conf/domain_conf.c
>> +++ b/src/conf/domain_conf.c

>> @@ -21705,23 +21738,35 @@ virDomainMemorySourceDefFormat(virBufferPtr buf,
>>      char *bitmap = NULL;
>>      int ret = -1;
>>  
>> -    if (!def->pagesize && !def->sourceNodes)
>> +    if (!def->pagesize && !def->sourceNodes && !def->path)
>>          return 0;
>>  
>>      virBufferAddLit(buf, "<source>\n");
>>      virBufferAdjustIndent(buf, 2);
>>  
>> -    if (def->sourceNodes) {
>> -        if (!(bitmap = virBitmapFormat(def->sourceNodes)))
>> -            goto cleanup;
>> +    switch ((virDomainMemoryModel) def->model) {
>> +    case VIR_DOMAIN_MEMORY_MODEL_DIMM:
>> +        if (def->sourceNodes) {
>> +            if (!(bitmap = virBitmapFormat(def->sourceNodes)))
>> +                goto cleanup;
>>  
>> -        virBufferAsprintf(buf, "<nodemask>%s</nodemask>\n", bitmap);
>> +            virBufferAsprintf(buf, "<nodemask>%s</nodemask>\n", bitmap);
>> +        }
>> +
>> +        if (def->pagesize)
>> +            virBufferAsprintf(buf, "<pagesize unit='KiB'>%llu</pagesize>\n",
>> +                              def->pagesize);
>> +        break;
>> +
>> +    case VIR_DOMAIN_MEMORY_MODEL_NVDIMM:
>> +        virBufferAsprintf(buf, "<path>%s</path>\n", def->path);
>> +        break;
>> +
>> +    case VIR_DOMAIN_MEMORY_MODEL_NONE:
>> +    case VIR_DOMAIN_MEMORY_MODEL_LAST:
>> +        break;
>>      }
>>  
>> -    if (def->pagesize)
>> -        virBufferAsprintf(buf, "<pagesize unit='KiB'>%llu</pagesize>\n",
>> -                          def->pagesize);
>> -
>>      virBufferAdjustIndent(buf, -2);
>>      virBufferAddLit(buf, "</source>\n");
>>  
> 
> 
> So do you feel there is a need for a check for different path in
> virDomainMemoryDefCheckABIStability? I know it's not a target thing, but
> the target/guest uses it by mapping directly...

>From my understanding qemu will copy the content of NVDIMM over to the
destination, therefore we are safe with backend changing its path.

> 
> Should there be checks for order?  That is src has dimm, dimm, dimm,
> nvdim while dst has nvdimm, dimm, dimm, dimm?  Would that matter?

No, the order shouldn't matter. See my next answer for explanation.

> 
> After going through patch 3, I'm beginning to think it wouldn't be such
> a good idea to keep nvdimm's and dimm's in the same nmems list.  I think
> they might just be too different.  I also recall some algorithm
> somewhere which made sure addresses and sizes used fit "properly"...
> Searching on DIMM in cscope brings up a few more references and questions.

So, I've just discussed this with Peter who implemented the DIMM (aka
mem hotplug) feature. In this case, libvirt tries to not duplicate code
that already lives in qemu. So how this works is: on domain startup
libvirt just puts the dimms (and after my patches nvdimms as well) onto
the qemu cmd line just in the order as in the XML. Then, when qemu is
started its magic algorithm assigns addresses to all the modules (so
that they don't overlap) and libvirt just queries this information on
the monitor. And it is only in case of migration when we explicitly put
the addresses onto qemu's cmd line (to preserve ABI stability).
On the next run of domain (I mean hard stop & start => new qemu
process), the process repeats itself again. So addresses can be possibly
different, but I'd say this is okay. For two reasons: even in real
hardware if you switch two devices (e.g. DIMM modules, disks, ...)
you'll boot into different environment (e.g. disks gets renamed, what
used to be sda is now sdb). Secondly, this NVDIMM is meant to be used as
a sync mechanism between host & guest.

Therefore I think DIMMs and NVDIMMs are very similar and can be kept in
the same internal list. At least for now. If we ever discover great
difference, we can put them into separate lists - after all it's an
internal list and we can change it whenever we please.

> 
> And well - what about migration?
> 
> 

Michal