[PATCH] qemu: Relax memory pre-allocation rules

Mon Dec 14 14:09:00 UTC 2020

On Mon, 30 Nov 2020 11:14:20 +0000
Daniel P. Berrangé <berrange at redhat.com> wrote:

> On Mon, Nov 30, 2020 at 11:48:28AM +0100, Michal Privoznik wrote:
> > On 11/30/20 11:16 AM, Daniel P. Berrangé wrote:  
> > > On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:  
> > > > Currently, we configure QEMU to prealloc memory almost by
> > > > default. Well, by default for NVDIMMs, hugepages and if user
> > > > asked us to (via memoryBacking <allocation mode="immediate"/>).
> > > > 
> > > > However, there are two cases where this approach is not the best:
> > > > 
> > > > 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
> > > >     this case users should put <pmem/> into the <memory/> device
> > > >     <source/>, like this:
> > > > 
> > > >     <memory model='nvdimm' access='shared'>
> > > >       <source>
> > > >         <path>/dev/pmem0</path>
> > > >         <pmem/>
> > > >       </source>
> > > >     </memory>
> > > > 
> > > >     Instructing QEMU to do prealloc in this case means that each
> > > >     page of the NVDIMM is "touched" (the first byte is read and
> > > >     written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
> > > >     device wear.
> > > > 
> > > > 2) if free-page-reporting is turned on. While the
> > > >     free-page-reporting feature might not have a catchy or obvious
> > > >     name, when enabled it instructs KVM and subsequently QEMU to
> > > >     free pages no longer used by guest resulting in smaller memory
> > > >     footprint. And preallocating whole memory goes against this.
> > > > 
> > > > The BZ comment 11 mentions another, third case 'virtio-mem' but
> > > > that is not implemented in libvirt, yet.
> > > > 
> > > > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
> > > > Signed-off-by: Michal Privoznik <mprivozn at redhat.com>
> > > > ---
> > > >   src/qemu/qemu_command.c                               | 11 +++++++++--
> > > >   .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
> > > >   2 files changed, 10 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> > > > index 479bcc0b0c..3df8b5ac76 100644
> > > > --- a/src/qemu/qemu_command.c
> > > > +++ b/src/qemu/qemu_command.c
> > > > @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
> > > >       if (discard == VIR_TRISTATE_BOOL_ABSENT)
> > > >           discard = def->mem.discard;
> > > > -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
> > > > +    /* The whole point of free_page_reporting is that as soon as guest frees
> > > > +     * any memory it is freed in the host too. Prealloc doesn't make much sense
> > > > +     * then. */
> > > > +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
> > > > +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
> > > >           prealloc = true;  
> > > 
> > > If the user asked for allocation == immediate, we should not be
> > > silently ignoring that request. Isn't the scenario described simply
> > > a wierd user configuration scenario and if they don't want that, then
> > > then they can set     <allocation mode="ondemand"/> instead.  
> > 
> > Okay.
> >   
> > >   
> > > >       if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
> > > > @@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
> > > >           if (mem->nvdimmPath) {
> > > >               memPath = g_strdup(mem->nvdimmPath);
> > > > -            prealloc = true;  
> > > 
> > > 
> > >   
> > > > +            /* If the NVDIMM is a real device then there's nothing to prealloc.
> > > > +             * If anyhing, we would be only wearing off the device. */
> > > > +            if (!mem->nvdimmPmem)
> > > > +                prealloc = true;  
> > > 
> > > I wonder if QEMU itself should take this optimization to skip its
> > > allocation logic ?  

by default QEMU does not prealloc, and if users explicitly ask for prealloc,
they should get it.
So libvirt also shouldn't set prealloc by default when it comes to nvdimm
on a file that's allocated on pmem enabled storage.

> > Also would make sense. This is that kind of bug which lies in between
> > libvirt and qemu. Although, since we are worried in silently ignoring user
> > requests, then wouldn't this be exactly what QEMU would be doing? I mean, if
> > an user/libvirt put both .prealloc=yes and .pmem=yes onto cmd line then
> > these would cancel out, wouldn't they?  
> 
> The difference is that an real NVDIMM is inherantly preallocated. QEMU

that's assuming that used backend file is on NVDIMM
(pmem=on doesn't guarantee it though)

> would not be ignoring the prealloc=yes arg - its implementation would
> merely be a no-op.

As for ignoring user's input, I don't like it (it usually bites down' the road).
if we decide that "pmem=on + prealloc=on" is invalid combo,
I'd rather error out with "fix your CLI" kind of message or we can warn user
that options combination is not optimal.

> Regards,
> Daniel