[libvirt] [Qemu-devel] [PATCH 1/2] numa: deprecate 'mem' parameter of '-numa node' option

Mon Mar 18 16:44:58 UTC 2019

On Mon, 4 Mar 2019 14:52:30 +0100
Igor Mammedov <imammedo at redhat.com> wrote:

> On Fri, 1 Mar 2019 18:01:52 +0000
> "Dr. David Alan Gilbert" <dgilbert at redhat.com> wrote:
> 
> > * Igor Mammedov (imammedo at redhat.com) wrote:  
> > > On Fri, 1 Mar 2019 15:49:47 +0000
> > > Daniel P. Berrangé <berrange at redhat.com> wrote:
> > >     
> > > > On Fri, Mar 01, 2019 at 04:42:15PM +0100, Igor Mammedov wrote:    
> > > > > > The parameter allows configuring a fake NUMA topology, where the guest
> > > > > > VM simulates a NUMA topology but does not actually get any performance
> > > > > > benefit from it. The same or better results can be achieved
> > > > > > using the 'memdev' parameter. In light of that, any VM that uses NUMA
> > > > > > to get its benefits should use 'memdev'. To allow transitioning
> > > > > > initial RAM to a device-based model, deprecate the 'mem' parameter, as
> > > > > > its ad-hoc partitioning of the initial RAM MemoryRegion can't be
> > > > > > translated to a memdev-based backend transparently to users and in a
> > > > > > compatible manner (migration-wise).
> > > > > 
> > > > > > That will also allow us to clean up our NUMA code a bit, leaving only
> > > > > > the 'memdev' implementation in place, plus several boards that use
> > > > > > node_mem to generate the FDT/ACPI description from it.
> > > > 
> > > > Can you confirm that the  'mem' and 'memdev' parameters to -numa
> > > > are 100% live migration compatible in both directions ?  Libvirt
> > > > would need this to be the case in order to use the 'memdev' syntax
> > > > instead.    
> > > > Unfortunately they are not migration compatible in either direction;
> > > > if it were possible to translate them to each other, I'd alias 'mem'
> > > > to 'memdev' without deprecation. The former sends over only one
> > > > MemoryRegion to the target, while the latter sends over several (one per
> > > > memdev).
> > > 
> > > Mixed memory issue[1] first came from libvirt side RHBZ1624223,
> > > back then it was resolved on libvirt side in favor of migration
> > > compatibility vs correctness (i.e. bind policy doesn't work as expected).
> > > > What's worse, it was made the default and affects all new machines,
> > > > as I understood it.
> > > 
> > > > In the case of -mem-path + -mem-prealloc (with 1 NUMA node or NUMA-less)
> > > > it's possible on the QEMU side to convert to memdev in a migration
> > > > compatible way (that's what stopped Michal from the memdev approach).
> > > > But it's hard to do so in the multi-node case, as the number of
> > > > MemoryRegions is different.
> > > 
> > > > The point is to treat 'mem' as a misconfiguration error, as the user
> > > > is using a broken NUMA configuration in the first place
> > > > (i.e. a fake NUMA configuration doesn't actually improve performance).
> > > 
> > > CCed David, maybe he could offer a way to do 1:n migration and other
> > > way around.    
> > 
> > I can't see a trivial way.
> > About the easiest I can think of is if you had a way to create a memdev
> > that was an alias to pc.ram (of a particular size and offset).  
> If I get you right that's what I was planning to do for numa-less machines
> that use -mem-path/prealloc options, where it's possible to replace
> an initial RAM MemoryRegion with a correspondingly named memdev and its
> backing MemoryRegion.

> But I don't see how it could work in case of legacy NUMA 'mem' options
> where initial RAM is 1 MemoryRegion (it's a fake numa after all) and how to
> translate that into several MemoryRegions (one per node/memdev).
Limiting it to x86 for demo purposes.
What would work (if*) is to create a special MemoryRegion container, i.e.
  1. make the memory_region_init() call in memory_region_allocate_system_memory()
     create that special container, which already has the id 'pc.ram' and a
     size that matches the single RAMBlock with the same id in the incoming
     migration stream from OLD QEMU (started with -numa node,mem=x ... options)
  2. register the container from "1" with vmstate_register_ram_global() or
     another API, which under the covers will make the migration code split the
     single incoming RAMBlock into several smaller consecutive RAMBlocks
     represented by memdev backends that are mapped as subregions within the
     container 'pc.ram'
  3. in case of backward migration, the container MemoryRegion 'pc.ram' will
     work the other way around, stitching the memdev subregions back into the
     single 'pc.ram' migration stream.
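
To make the split/stitch idea in steps 2 and 3 a bit more concrete, here is
a toy model in plain C. It is not QEMU code and does not use any QEMU API;
ToyBlock, toy_split() and toy_stitch() are made-up names for illustration,
and it only models the offset/size bookkeeping, not the actual migration
stream handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a RAMBlock-like region: a byte range inside 'pc.ram'. */
typedef struct ToyBlock {
    uint64_t offset; /* start within the container */
    uint64_t size;   /* length in bytes */
} ToyBlock;

/*
 * Forward direction: split one incoming block of 'total' bytes into 'n'
 * consecutive sub-blocks sized by node_size[]. Returns 0 on success, or
 * -1 if the node sizes do not add up to the size of the incoming block.
 */
int toy_split(uint64_t total, const uint64_t *node_size, size_t n,
              ToyBlock *out)
{
    uint64_t offset = 0;

    assert(node_size != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (node_size[i] > total - offset) {
            return -1; /* nodes would overrun the single block */
        }
        out[i].offset = offset;
        out[i].size = node_size[i];
        offset += node_size[i];
    }
    return offset == total ? 0 : -1;
}

/*
 * Backward direction: check that the sub-blocks are consecutive and
 * return the size of the single stitched block, or 0 on a gap/overlap.
 */
uint64_t toy_stitch(const ToyBlock *blocks, size_t n)
{
    uint64_t offset = 0;

    assert(blocks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].offset != offset) {
            return 0;
        }
        offset += blocks[i].size;
    }
    return offset;
}
```

For example, splitting a 4096-byte block with node sizes {1024, 3072} yields
sub-blocks at offsets 0 and 1024, and stitching those two back gives 4096.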

(if*) - but the above describes an ideal use-case where the -numa node,mem
values are properly sized. In practice though, QEMU doesn't have any checks
on the value of numa's 'mem' option, so users were able to 'split' RAM into
arbitrary chunks, which memdev-based backends might not be able
to recreate due to limitations of the backing storage used (alignment/page
size). To make it worse, we don't really know what the source (old QEMU)
actually uses as the backend, as it might fall back to anonymous RAM if
mem-path fails (there is no fallback in the memdev case, as there the user
gets what he/she asked for or a hard error).
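
As a rough illustration of the alignment problem, the following standalone C
sketch checks whether each node size is a multiple of the backend page size.
The names toy_size_fits_backend() and toy_first_bad_node() are hypothetical,
and the only assumption made about the backend is that it can back whole
pages of a power-of-two size; real backends may have further constraints.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if 'size' is a non-zero multiple of 'page_size'.
 * Assumes page_size is a power of two (e.g. 4 KiB or 2 MiB).
 */
bool toy_size_fits_backend(uint64_t size, uint64_t page_size)
{
    return size != 0 && (size & (page_size - 1)) == 0;
}

/*
 * Return the index of the first node whose 'mem' size could not be
 * recreated by a backend with the given page size, or 'n' if all fit.
 */
size_t toy_first_bad_node(const uint64_t *node_size, size_t n,
                          uint64_t page_size)
{
    for (size_t i = 0; i < n; i++) {
        if (!toy_size_fits_backend(node_size[i], page_size)) {
            return i;
        }
    }
    return n;
}
```

With node sizes of 1 GiB and 1 GiB + 4 KiB, a 4 KiB page size fits both
nodes, while a 2 MiB (hugepage) page size rejects the second one.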

There might be other issues on the migration side of things as well,
but I just don't know enough about it to see them.

> > Dave
> >   
> > >     
> > > > > Signed-off-by: Igor Mammedov <imammedo at redhat.com>
> > > > > ---
> > > > >  numa.c               |  2 ++
> > > > >  qemu-deprecated.texi | 14 ++++++++++++++
> > > > >  2 files changed, 16 insertions(+)
> > > > > 
> > > > > diff --git a/numa.c b/numa.c
> > > > > index 3875e1e..2205773 100644
> > > > > --- a/numa.c
> > > > > +++ b/numa.c
> > > > > @@ -121,6 +121,8 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
> > > > >  
> > > > >      if (node->has_mem) {
> > > > >          numa_info[nodenr].node_mem = node->mem;
> > > > > +        warn_report("Parameter -numa node,mem is deprecated,"
> > > > > +                    " use -numa node,memdev instead");
> > > > >      }
> > > > >      if (node->has_memdev) {
> > > > >          Object *o;
> > > > > diff --git a/qemu-deprecated.texi b/qemu-deprecated.texi
> > > > > index 45c5795..73f99d4 100644
> > > > > --- a/qemu-deprecated.texi
> > > > > +++ b/qemu-deprecated.texi
> > > > > @@ -60,6 +60,20 @@ Support for invalid topologies will be removed, the user must ensure
> > > > >  topologies described with -smp include all possible cpus, i.e.
> > > > >    @math{@var{sockets} * @var{cores} * @var{threads} = @var{maxcpus}}.
> > > > >  
> > > > > > +@subsection -numa node,mem=@var{size} (since 4.0)
> > > > > +
> > > > > +The parameter @option{mem} of @option{-numa node} is used to assign a part of
> > > > > +guest RAM to a NUMA node. But when using it, it's impossible to manage specified
> > > > > +size on the host side (like bind it to a host node, setting bind policy, ...),
> > > > > > +so the guest ends up with a fake NUMA configuration with suboptimal performance.
> > > > > +However since 2014 there is an alternative way to assign RAM to a NUMA node
> > > > > +using parameter @option{memdev}, which does the same as @option{mem} and has
> > > > > > +the ability to actually manage node RAM on the host side. Use parameter
> > > > > > +@option{memdev} with @var{memory-backend-ram} backend as a replacement for
> > > > > +parameter @option{mem} to achieve the same fake NUMA effect or a properly
> > > > > +configured @var{memory-backend-file} backend to actually benefit from NUMA
> > > > > +configuration.
> > > > > +
> > > > >  @section QEMU Machine Protocol (QMP) commands
> > > > >  
> > > > >  @subsection block-dirty-bitmap-add "autoload" parameter (since 2.12.0)
> > > > > -- 
> > > > > 2.7.4
> > > > > 
> > > > > --
> > > > > libvir-list mailing list
> > > > > libvir-list at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/libvir-list      
> > > > 
> > > > Regards,
> > > > Daniel    
> > >     
> > --
> > Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK  
> 
>