[libvirt] [PATCH v2 3/3] qemu: add memfd source type

Wed Sep 19 15:39:12 UTC 2018

* Marc-André Lureau (marcandre.lureau at redhat.com) wrote:
> Hi
> 
> On Wed, Sep 19, 2018 at 5:58 PM Michal Privoznik <mprivozn at redhat.com> wrote:
> >
> > On 09/19/2018 12:03 PM, Marc-André Lureau wrote:
> > > Hi
> > >
> > > On Wed, Sep 19, 2018 at 1:41 PM Michal Privoznik <mprivozn at redhat.com> wrote:
> > >>
> > >> On 09/17/2018 03:14 PM, marcandre.lureau at redhat.com wrote:
> > >>> From: Marc-André Lureau <marcandre.lureau at redhat.com>
> > >>>
> > >>> Add a new memoryBacking source type "memfd", supported by QEMU (when
> > >>> the capability is available).
> > >>>
> > >>> A memfd is a specialized anonymous memory kind. As such, an anonymous
> > >>> source type could be automatically using a memfd. However, there are
> > >>> some complications when migrating from different memory backends in
> > >>> qemu (mainly due to the internal object naming at this point, but
> > >>> there could be more). For now, it is simpler and safer to simply
> > >>> introduce a new source type "memfd". Eventually, the "anonymous" type
> > >>> could learn to use memfd transparently in a separate change.
> > >>>
> > >>> The main benefits are that it doesn't need to create filesystem files,
> > >>> and it also enforces sealing, providing a bit more safety.
> > >>>
> > >>> Signed-off-by: Marc-André Lureau <marcandre.lureau at redhat.com>
> > >>> ---
> > >>>  docs/formatdomain.html.in                     |  9 +--
> > >>>  docs/schemas/domaincommon.rng                 |  1 +
> > >>>  src/conf/domain_conf.c                        |  3 +-
> > >>>  src/conf/domain_conf.h                        |  1 +
> > >>>  src/qemu/qemu_command.c                       | 69 +++++++++++++------
> > >>>  src/qemu/qemu_domain.c                        | 12 +++-
> > >>>  .../memfd-memory-numa.x86_64-latest.args      | 34 +++++++++
> > >>>  tests/qemuxml2argvdata/memfd-memory-numa.xml  | 36 ++++++++++
> > >>>  tests/qemuxml2argvtest.c                      |  2 +
> > >>>  9 files changed, 140 insertions(+), 27 deletions(-)
> > >>>  create mode 100644 tests/qemuxml2argvdata/memfd-memory-numa.x86_64-latest.args
> > >>>  create mode 100644 tests/qemuxml2argvdata/memfd-memory-numa.xml
> > >>>
> > >>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
> > >>> index 1f12ab5b42..eeee1f6d40 100644
> > >>> --- a/docs/formatdomain.html.in
> > >>> +++ b/docs/formatdomain.html.in
> > >>> @@ -1099,7 +1099,7 @@
> > >>>      </hugepages>
> > >>>      <nosharepages/>
> > >>>      <locked/>
> > >>> -    <source type="file|anonymous"/>
> > >>> +    <source type="file|anonymous|memfd"/>
> > >>
> > >> I'm sorry but I do not think this is the way we should go. This
> > >> effectively avoids libvirt making the decision and exposes the backend
> > >> used directly. This puts unnecessary burden on mgmt applications because
> > >> they have to make yet another decision (track another domain attribute).
> > >>
> > >> IIUC, memfd is like memory-backend-file and -ram combined. It can do
> > >> hugepages or just plain malloc(). Therefore it should be our first
> > >> choice for freshly started domains. And only if qemu doesn't support it
> > >> we should fall back to either -file or -ram backends.
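
The fallback order proposed above can be sketched as a small selection
function. This is only an illustration: `pick_backend` and its two boolean
parameters are hypothetical names, not libvirt code; the returned strings are
the QEMU object types named in this thread.

```python
def pick_backend(has_memfd: bool, want_hugepages: bool) -> str:
    """Pick a QEMU memory backend for a freshly started domain.

    Hypothetical helper, not libvirt code: prefer memfd when QEMU
    advertises it, otherwise fall back to -file (hugepages) or -ram.
    """
    if has_memfd:
        return "memory-backend-memfd"
    if want_hugepages:
        return "memory-backend-file"
    return "memory-backend-ram"


# Example: an older QEMU without memfd, guest configured with hugepages.
choice = pick_backend(has_memfd=False, want_hugepages=True)
```

The sketch deliberately ignores migration: as discussed further down, the
backend chosen at startup would still have to be recorded and preserved.
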
> > >
> > > memory-backend-memfd doesn't replace either -file or -ram though. It's
> > > a specialized anonymous memory kind, linux-only atm, and not widely
> > > available.
> >
> > Well, neither libvirt nor qemu really support hugepages on anything else
> > than linux.
> >
> > Nor it ever will? Because if we merge these patches and expose it in
> > domain XML, there is no turning back. We can't stop supporting it.
> >
> > >
> > > -file should be used for nvram or complex hugepage/numa setup for ex.
> >
> > How come? I can see .host-nodes and .policy attributes for -memfd
> > backend too. Sure, nvram is special, but for plain hugepages use case
> > -file and -memfd are interchangeable, aren't they?
> 
> Sorry, I think I misunderstood the problem then. The qemu mbind()
> might do all the work.
> 
> David, didn't you point out a limitation of -memfd compared to -file for
> NUMA setup?

<thinks> I think we came to the conclusion they're mostly the same, but
with the gotcha that it's harder to control allocation with memfd.
I think for example you can create a fixed size hugetlbfs mount and
put a set of VMs in it and now they're limited to that size.
I think you can do similar things with /dev/shm like mounts.

Dave

> >
> > -object memory-backend-memfd,id=ram-node0,\
> > hugetlb=yes,hugetlbsize=2097152,\
> > share=yes,size=15032385536,host-nodes=3,policy=preferred
> >
> > -object memory-backend-file,id=ram-node0,\
> > path=/path/to/2M/hugetlfs,\
> > size=15032385536,host-nodes=3,policy=preferred
> >
> >
> > And for -ram there is no difference from usage/libvirt POV.
> >
> > -object memory-backend-memfd,id=ram-node0,\
> > share=yes,size=15032385536,host-nodes=3,policy=preferred
> >
> > -object memory-backend-ram,id=ram-node0,\
> > size=15032385536,host-nodes=3,policy=preferred
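
To make the comparison above concrete, here is a minimal sketch that rebuilds
the two hugepage `-object` values from a shared set of properties. The helper
`object_arg` and the variable names are made up for illustration; the property
names and values are copied from the command lines quoted above.

```python
def object_arg(backend, props):
    """Join a backend name and ordered (key, value) pairs into a -object value."""
    return ",".join([backend] + [f"{key}={value}" for key, value in props])


# Properties that are identical for both backends in the quoted example.
common = [("size", 15032385536), ("host-nodes", 3), ("policy", "preferred")]

memfd_huge = object_arg(
    "memory-backend-memfd",
    [("id", "ram-node0"), ("hugetlb", "yes"),
     ("hugetlbsize", 2097152), ("share", "yes")] + common,
)

file_huge = object_arg(
    "memory-backend-file",
    [("id", "ram-node0"), ("path", "/path/to/2M/hugetlfs")] + common,
)

print(memfd_huge)
print(file_huge)
```

Only the backend name and the leading backend-specific properties differ; the
NUMA-related `host-nodes` and `policy` properties are shared.
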
> >
> >
> > >
> > > But it's legitimate that a VM user requests memfd to be used.
> > >
> > > The point of this patch is not to say that we shouldn't try to use
> > > memfd when possible, but rather let the user request specifically
> > > memfd, for security reasons for example. If the setup cannot be
> > > satisfied with -memfd, the user should get an error.
> >
> > What security reasons do you have in mind?
> 
> grow/shrink sealing (and avoiding somewhat hazardous file system operations).
> 
> >
> > >
> > >>
> > >> This means we have to track what backend the domain was started with so
> > >> that we preserve that on migration (although, the fact that these
> > >> backends are not interchangeable makes me question 'backend' in their
> > >> name :-P). For that we can use status/migration XML as I suggested earlier.
> > >>
> > >> Once again, status XML is not editable by user [*] and is used solely by
> > >> libvirtd to store runtime information for a running domain (and backend
> > >> used falls into that category).
> > >
> > > Why not do this transparent memfd-usage in a separate series?
> >
> > Depends what we want libvirt to be. If we want it to be mere XML->qemu
> > cmd line generator, then we can expose all qemu settings as they are. If
> > we want it to have some logic built in (so that mgmt applications can
> > offload some decisions to it), then we can't expose all qemu settings.
> >
> > In my ideal world, I'd like to tell libvirt "I want a machine that uses
> > hugepages of this size" and let libvirt figure out the best command line
> > to fulfil my request (either use -file or -memfd or even -ram + -mem-path).
> >
> > On the other hand, I don't want to discourage you from posting patches,
> > so this is the point where I will no longer object. I pointed out my
> > objections enough :-)
> 
> I see the benefit in using memfd whenever possible. But I also see a
> benefit in being able to request its usage explicitly. That's why I
> think the 2 approaches are compatible.
> 
> Thanks!
--
Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK