[libvirt] [PATCH v2 3/3] qemu: add memfd source type

Wed Sep 19 12:04:50 UTC 2018

On 09/19/2018 07:10 AM, Michal Privoznik wrote:
> On 09/19/2018 12:02 PM, Pavel Hrdina wrote:
>> On Wed, Sep 19, 2018 at 11:41:11AM +0200, Michal Privoznik wrote:
>>> On 09/17/2018 03:14 PM, marcandre.lureau at redhat.com wrote:
>>>> From: Marc-André Lureau <marcandre.lureau at redhat.com>
>>>>
>>>> Add a new memoryBacking source type "memfd", supported by QEMU (when
>>>> the apability is available).
>>>>
>>>> A memfd is a specialized anonymous memory kind. As such, an anonymous
>>>> source type could be automatically using a memfd. However, there are
>>>> some complications when migrating from different memory backends in
>>>> qemu (mainly due to the internal object naming at this point, but
>>>> there could be more). For now, it is simpler and safer to simply
>>>> introduce a new source type "memfd". Eventually, the "anonymous" type
>>>> could learn to use memfd transparently in a seperate change.
>>>>
>>>> The main benefits are that it doesn't need to create filesystem files,
>>>> and it also enforces sealing, providing a bit more safety.
>>>>
>>>> Signed-off-by: Marc-André Lureau <marcandre.lureau at redhat.com>
>>>> ---
>>>>  docs/formatdomain.html.in                     |  9 +--
>>>>  docs/schemas/domaincommon.rng                 |  1 +
>>>>  src/conf/domain_conf.c                        |  3 +-
>>>>  src/conf/domain_conf.h                        |  1 +
>>>>  src/qemu/qemu_command.c                       | 69 +++++++++++++------
>>>>  src/qemu/qemu_domain.c                        | 12 +++-
>>>>  .../memfd-memory-numa.x86_64-latest.args      | 34 +++++++++
>>>>  tests/qemuxml2argvdata/memfd-memory-numa.xml  | 36 ++++++++++
>>>>  tests/qemuxml2argvtest.c                      |  2 +
>>>>  9 files changed, 140 insertions(+), 27 deletions(-)
>>>>  create mode 100644 tests/qemuxml2argvdata/memfd-memory-numa.x86_64-latest.args
>>>>  create mode 100644 tests/qemuxml2argvdata/memfd-memory-numa.xml
>>>>
>>>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>>>> index 1f12ab5b42..eeee1f6d40 100644
>>>> --- a/docs/formatdomain.html.in
>>>> +++ b/docs/formatdomain.html.in
>>>> @@ -1099,7 +1099,7 @@
>>>>      </hugepages>
>>>>      <nosharepages/>
>>>>      <locked/>
>>>> -    <source type="file|anonymous"/>
>>>> +    <source type="file|anonymous|memfd"/>
>>>
>>> I'm sorry but I do not think this is the way we should go. This
>>> effectively avoids libvirt making the decision and exposes the backend
>>> used directly. This puts unnecessary burden on mgmt applications because
>>> they have to make yet another decision (track another domain attribute).
>>>
>>> IIUC, memfd is like memory-backend-file and -ram combined. It can do
>>> hugepages or just plain malloc(). Therefore it should be our first
>>> choice for freshly started domains. And only if qemu doesn't support it
>>> we should fall back to either -file or -ram backends.
>>>
>>> This means we have to track what backend the domain was started with so
>>> that we preserve that on migration (although, the fact that these
>>> backends are not interchangeable makes me question 'backend' in their
>>> name :-P). For that we can use status/migration XML as I suggested earlier.
>>>
>>> Once again, status XML is not editable by user [*] and is used solely by
>>> libvirtd to store runtime information for a running domain (and backend
>>> used falls into that category).
>>
>> I have to agree with Michal, we should not expose this implementation
>> detail in domain XML if we can hide it in status/migratable XML.
>>
>> One thing about the migration though.  I'm not sure what are we
>> officially supporting in libvirt because it might cause us some issues.
>>
>> We need to make sure that if you live-migrate domain from old libvirt
>> to new libvirt you should be able to migrate that domain back to old
>> libvirt.  The question is, whether this applies if you destroy and
>> start the domain on the new libvirt before you live-migrate it back
>> to old libvirt.
>>
>> Without the restart there is no issue, because the backend would not
>> be changed, but once you start the same domain again we would pick
>> new backend which would prevent migrating it back to the old libvirt.
>>
> 
> This is not supported. Exactly because of this reason. If we were to
> preserve this forward compatibility (i.e. migration from newer libvirt
> to older) then we couldn't use any new feature qemu has. For instance,
> if qemu introduces a new device and a user starts a domain with it,
> migration to older qemu will not work then, obviously. This also applies
> for other qemu capabilities.
> 
> Migrating from newer version to older is not supported. It might work,
> but that's rather coincidence than intent.
> 

Tough time to decide which of these to reply to... So I'll go here.

Anyway, perhaps it should be noted that in V1 the "decision" was to
force the consumer to select "anonymous" as their source type. Go back
to patch 3:

+    if (!mem->nvdimmPath &&
+        def->mem.source == VIR_DOMAIN_MEMORY_SOURCE_ANONYMOUS &&
+        virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_MEMFD) &&
+        (!useHugepage || virQEMUCapsGet(qemuCaps,
QEMU_CAPS_OBJECT_MEMORY_MEMFD_HUGETLB)))

Since "anonymous" has to be selected (it's not automatic nor is it "set"
in the source type if nothing provided), then supplying "memfd" is only
a stretch because it's a new type. IOW, mgmt apps still would have had
to request "anonymous". If they didn't, then "file" would probably be
used for any hugepage backend while "ram" would be used for others
(anonymous or not provided). I say probably because there's the
conditions for mem->nvdimmPath OR memAccess which also could choose
"file" if source type was not supplied.

IIRC, the issue was migration if source type was "ram" and target was
"memfd"; however, that would be a source that didn't use hugepages (in
which which "file" would have been chosen). The issue was more qemu
based because some path wasn't fully supplied when using ram.

I think the "root cause" of the angst is because all the decision making
is left in the qemu_command code. If libvirtd is restarted, it's just
going to "find" the domain already started and wouldn't be updating any
status, so to that degree I think using status files won't work.

As for migration.... I think some of that just got answered while I was
typing... But, if "A" is a system that doesn't have memfd and "B" is a
system that has memfd *and* there's an automated decision to use memfd
when the source type is "anonymous", then for A -> B there's no
mechanism for A to tell B how it was started. If B -> A isn't supported,
then fine - what stops that?

What I had a hard time determining in the migration code was what
happens if the cookie bits the target has don't match what the source
has. I didn't see any sort of fail the migration operation, but there's
so much of that code I don't know. If we send a cookie that indicates
new default for anonymous backend, would that be automatically rejected
if the target doesn't know about it. Likewise, if a source sends it's
cookies without some bit, would that be rejected. It's this second case
where I came to an impasse on how to handle.

John