[PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model

Michal Prívozník mprivozn at redhat.com
Fri Sep 10 13:56:12 UTC 2021


On 7/7/21 12:30 PM, David Hildenbrand wrote:
> On 23.06.21 12:12, Michal Privoznik wrote:
>> v4 of:
>>
>> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
>>
>> diff to v3:
>> - Rebased code on the top of master
>> - Tried to work in all Peter's review suggestions
>> - Fixed a bug where adjusting <requested/> was treated as hotplug of
>>    a new <memory/> by the XML validator, and thus if <maxMemory/> was
>>    close enough to <currentMemory/> the validator reported an error
>>    (this was reported by David).
>>
> 
> Hi Michal,

Hi,

sorry for replying this late.

> 
> I just retested with this version and it mostly works as expected. I
> tested quite a few memory configurations and have some comments / reports :)
> 
> I tested successfully:
> - 1 node with one device
> - 2 nodes with one device on each node
> - 2 nodes with two devices on one node
> - "virsh update-memory-device" on live domains -- works great
> - huge pages and anonymous memory with access=private and access=shared.
>   There is only one issue with hugepages and memfd (prealloc=on gets
>   set).
> - shared memory on memfd and anonymous memory (-> shared file) with
>   access=shared
> 
> I only tested on a single host NUMA node so far, but don't expect
> surprises with host NUMA policies.
> 
> 
> 1. "virsh update-memory-device" and stopped domains
> 
> Once I have more than one virtio-mem device defined for a VM, "virsh
> update-memory-device" cannot be used anymore as aliases don't seem to be
> available on stopped VMs. If I manually define an alias on a stopped VM,
> the alias silently gets dropped. Is there any way to identify a
> virtio-mem device on a stopped domain?

Yes. You want what we call user aliases. They must have the "ua-"
prefix so that they don't clash with whatever libvirt generates.
Something like this:

<memory model="virtio-mem">
  <source>
    <pagesize unit="KiB">2048</pagesize>
  </source>
  <target>
    <size unit="KiB">2097152</size>
    <node>0</node>
    <block unit="KiB">2048</block>
    <requested unit="KiB">1048576</requested>
  </target>
  <alias name="ua-virtiomem0"/>
  <address type="pci" domain="0x0000" bus="0x00" slot="0x06"
function="0x0"/>
</memory>
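
With that in place the device can be addressed by its alias, e.g. (a
usage sketch following the syntax from your tests; values taken from
the XML above):

# virsh update-memory-device Fedora34 --alias ua-virtiomem0 --requested-size 2G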

But then you run into the fact that I haven't implemented update for
inactive domains yet. Will do in v5.

> 
> 
> 2. "virsh update-memory-device" with --config on a running domain
> 
> # virsh update-memory-device "Fedora34" --config --alias "virtiomem1"
> --requested-size 16G
> error: no memory device found
> 
> I guess the issue is, again, that aliases don't apply to the "!live"
> XML. So the "--config" option doesn't really work when more than one
> virtio-mem device is defined for a VM.

Good point. I wonder what piece of the input XML I can use to look up
the corresponding virtio-mem device. Since it has a PCI address, maybe
I can use that instead of the alias? But then we are back to the old
problem - in general, inactive and active XMLs can differ (due to
hot(un-)plug). So even if I found a device at the same PCI address, it
might actually be a different one. Therefore, I think the safest is to
use aliases. At any rate - this can be implemented afterwards.
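
E.g. once the lookup is implemented for the persistent XML too, I'd
expect something like this to work (a sketch, reusing the user alias
from the example above):

# virsh update-memory-device Fedora34 --config --alias ua-virtiomem0 --requested-size 2G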

> 
> 
> 3. "virsh update-memory-device" and nodes
> 
> In addition to "--alias", something like "--node" would also be nice to
> have -- assuming there is only a single virtio-mem device per NUMA node,
> which is usually the case. For example:
> 
> "virsh update-memory-device "Fedora34" --node 1 --requested-size 16G"
> could come in handy. This would also work on "!live" domains.

Yes, makes sense.

> 
> 
> 4. "actual" vs. "current"
> 
> "<actual unit='KiB'>16777216</actual>" I wonder if "current" instead of
> "actual" would be more in line with "currentMemory". But no strong opinion.

Yeah, I don't have any opinion either. I can change it.
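
Just to illustrate, the live XML would then report e.g.:

<current unit='KiB'>16777216</current>

instead of the <actual/> element.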

> 
> 
> 5. Slot handling.
> 
> As already discussed, virtio-mem and virtio-pmem don't need slots. Yet,
> the "slots" definition is required and libvirt reserves once slot for
> each such device ("error: unsupported configuration: memory device count
> '2' exceeds slots count '1'"). This is certainly future work, if we ever
> want to change that.

I can look into this.
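
For the record, the slot count in question comes from the <maxMemory/>
element, e.g.:

<maxMemory slots='2' unit='KiB'>33554432</maxMemory>

and currently every <memory/> device, virtio-mem included, counts
against it even though QEMU needs no slot for virtio-mem.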

> 
> 
> 6. 4k source results in an error
> 
>     <source>
>       <pagesize unit='KiB'>4096</pagesize>
>       <nodemask>0-1</nodemask>
>     </source>
> 
> "error: internal error: Unable to find any usable hugetlbfs mount for
> 4096 KiB"
> 
> This example is taken from https://libvirt.org/formatdomain.html for
> DIMMs. Not sure what the expected behavior is.

Ouch, this is a clear bug. Let me investigate and fix it in the next
version.

> 
> 
> 7. File source gets silently dropped
> 
>     <source>
>       <path>/dev/shmem/vm0</path>
>     </source>
> 
> The statement gets silently dropped, which is somewhat surprising.
> However, I did not test what happens with DIMMs, maybe it's the same.

Yeah, this is somewhat expected. I mean, the part that's expected is
that libvirt drops parts it doesn't parse. Sometimes that is pretty
obvious (<source someRandomAttribute='value'/>) and sometimes it's less
so (like in your example, where <path/> makes sense for other memory
models like virtio-pmem). But just to be sure - since virtio-mem can be
backed by any memory-backend-* object, does it make sense to have
<path/> there? So far my code would use memory-backend-file for
hugepages only.
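
For comparison, this is roughly how <path/> is used with virtio-pmem
(per formatdomain.html):

<memory model='virtio-pmem' access='shared'>
  <source>
    <path>/tmp/virtio_pmem</path>
  </source>
  <target>
    <size unit='KiB'>524288</size>
  </target>
</memory>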

> 
> 
> 8. Global preallocation of memory
> 
> With
> 
> <memoryBacking>
>     <allocation mode="immediate"/>
> </memoryBacking>
> 
> we also get "prealloc=on" set for the memory backends of the virtio-mem
> devices, which is sub-optimal, because we end up preallocating all
> memory of the memory backend (which is unexpected for a virtio-mem
> device) and virtio-mem will then discard all memory immediately again.
> So it's essentially a dangerous NOP -- dangerous because we temporarily
> consume a lot of memory.
> 
> In an ideal world, we would not set this for the memory backend used for
> the virtio-mem devices, but for the virtio-mem devices themselves, such
> that preallocation happens when new memory blocks are actually exposed
> to the VM.
> 
> As virtio-mem does not support "prealloc=on" for virtio-mem devices yet,
> this is future work. We might want to error out, though, if <allocation
> mode="immediate"/> is used along with virtio-mem devices for now. I'm
> planning on implementing this in QEMU soon. Until then, it might also be
> good enough to simply document that this setup should be avoided.

Right. Meanwhile this was implemented in QEMU and thus I can drop
prealloc=on. But then my question is: what happens when the user wants
to expose additional memory to the guest but doesn't have enough free
hugepages in the pool? Libvirt uses prealloc=on so that QEMU doesn't
get killed later, after the guest has booted up.
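
IOW, the plan would be to move preallocation from the backend onto the
device, along these lines (a sketch; I'm assuming the device-level
property ends up being named 'prealloc'):

-object memory-backend-file,id=memvirtiomem0,mem-path=/dev/hugepages/libvirt/qemu/2-Fedora34-2,share=on,size=17179869184 \
-device virtio-mem-pci,node=0,block-size=2097152,requested-size=0,memdev=memvirtiomem0,id=virtiomem0,prealloc=on

so that only blocks actually exposed to the guest get preallocated.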

> 
> 
> 9. Memfd and huge pages
> 
> <memoryBacking>
>     <source type="memfd"/>
> </memoryBacking>
> 
> and
> 
> <memory model='virtio-mem' access='shared'>
>   <source>
>     <pagesize unit='KiB'>2048</pagesize>
>   </source>
>   ...
> </memory>
> 
> 
> I get on the QEMU cmdline
> 
> "-object
> {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}"
> 
> 
> Dropping "the memfd" source I get on the QEMU cmdline:
> 
> -object {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}
> 
> 
> "prealloc":true should not have been added for virtio-mem in case of
> memfd. !memfd does what's expected.
> 


Okay, I will fix this. But can you shed more light here? I mean, why the
difference?
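
That is, after the fix I'd expect the memfd variant to be generated the
same way, just without the preallocation:

-object {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"size":17179869184}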

> 
> 10. Memory locking
> 
> With
> 
> <memoryBacking>
>     <locked/>
> </memoryBacking>
> 
> virtio-mem fails with
> 
> "qemu-system-x86_64: -device
> virtio-mem-pci,node=0,block-size=2097152,requested-size=0,memdev=memvirtiomem0,id=virtiomem0,bus=pci.0,addr=0x2:
> Incompatible with mlock"
> 
> Unfortunately, for example, on shmem like:
> 
> <memoryBacking>
>     <locked/>
>     <access mode="shared"/>
>     <source type="memfd"/>
> </memoryBacking>
> 
> it seems to fail after essentially (temporarily) preallocating all
> memory for the memory backend of the virtio-mem device. In the future,
> virtio-mem might be able to support mlock, until then, this is
> suboptimal but at least fails at some point.
> 
> 
> 11. Reservation of memory
> 
> With new QEMU versions we'll want to pass "reserve=off" for the memory
> backend used, especially with hugepages and private mappings. While this
> change was merged into QEMU, it's not part of an official release yet.
> Future work.
> 
> https://lore.kernel.org/qemu-devel/20210510114328.21835-1-david@redhat.com/
> 
> Otherwise, when the "size" isn't currently available in free and
> "unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable to
> map backing store for guest RAM: Cannot allocate memory". The same thing
> can easily happen on anonymous memory when memory overcommit isn't
> disabled.
> 
> So this is future work, but at least the QEMU part is already upstream.


So what's the difference between reserve and prealloc?

Michal



