[PATCH RFC v2 00/13] IOMMUFD Generic interface

Thu Sep 22 11:06:33 UTC 2022

On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> [Cc+ Steve, libvirt, Daniel, Laine]
> 
> On Tue, 20 Sep 2022 16:56:42 -0300
> Jason Gunthorpe <jgg at nvidia.com> wrote:
> > That really just leaves the accounting, and I'm still not convinced at
> > this must be a critical thing. Linus's latest remarks reported in lwn
> > at the maintainer summit on tracepoints/BPF as ABI seem to support
> > this. Let's see an actual deployed production configuration that would
> > be impacted, and we won't find that unless we move forward.
> 
> I'll try to summarize the proposed change so that we can get better
> advice from libvirt folks, or potentially anyone else managing locked
> memory limits for device assignment VMs.
> 
> Background: when a DMA range, ex. guest RAM, is mapped to a vfio device,
> we use the system IOMMU to provide GPA to HPA translation for assigned
> devices. Unlike CPU page tables, we don't generally have a means to
> demand fault these translations, therefore the memory target of the
> translation is pinned to prevent that it cannot be swapped or
> relocated, ie. to guarantee the translation is always valid.
> 
> The issue is where we account these pinned pages, where accounting is
> necessary such that a user cannot lock an arbitrary number of pages
> into RAM to generate a DoS attack.  Duplicate accounting should be
> resolved by iommufd, but is outside the scope of this discussion.
> 
> Currently, vfio tests against the mm_struct.locked_vm relative to
> rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur,
> where task is the current process.  This is the same limit set via the
> setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.
> 
> Note that in both cases above, we're dealing with a task, or process
> limit and both prlimit and ulimit man pages describe them as such.
> 
> iommufd supposes instead, and references existing kernel
> implementations, that despite the descriptions above these limits are
> actually meant to be user limits and therefore instead charges pinned
> pages against user_struct.locked_vm and also marks them in
> mm_struct.pinned_vm.
> 
> The proposed algorithm is to read the _task_ locked memory limit, then
> attempt to charge the _user_ locked_vm, such that user_struct.locked_vm
> cannot exceed the task locked memory limit.
> 
> This obviously has implications.  AFAICT, any management tool that
> doesn't instantiate assigned device VMs under separate users are
> essentially untenable.  For example, if we launch VM1 under userA and
> set a locked memory limit of 4GB via prlimit to account for an assigned
> device, that works fine, until we launch VM2 from userA as well.  In
> that case we can't simply set a 4GB limit on the VM2 task because
> there's already 4GB charged against user_struct.locked_vm for VM1.  So
> we'd need to set the VM2 task limit to 8GB to be able to launch VM2.
> But not only that, we'd need to go back and also set VM1's task limit
> to 8GB or else it will fail if a DMA mapped memory region is transient
> and needs to be re-mapped.
> 
> Effectively any task under the same user and requiring pinned memory
> needs to have a locked memory limit set, and updated, to account for
> all tasks using pinned memory by that user.

That is pretty unpleasant. Having to update all existing VMs, when
starting a new VM (or hotpluigging a VFIO device to an existing
VM) is something we would never want todo.

Charging this against the user weakens the DoS protection that
we have today from the POV of individual VMs.

Our primary risk is that a single QEMU is compromised and attempts
to impact the host in some way. We want to limit the damage that
an individual QEMU can cause.

Consider 4 VMs each locked with 4 GB. Any single compromised VM
is constrained to only use 4 GB of locked memory.

With the per-user accounting, now any single compromised VM can
use the cummulative 16 GB of locked memory, even though we only
want that VM to be able to use 4 GB.

For a cummulative memory limit, we would expect cgroups to be
the enforcement mechanism. eg consider a machine has 64 GB of
RAM, and we want to reserve 12 GB for host OS ensure, and all
other RAM is for VM usage. A mgmt app wanting such protecton
would set a limit on /machines.slice, at the 52 GB mark, to
reserve the 12 GB for non-VM usage.

Also, the mgmt app accounts for how many VMs it has started on
a host and will not try to launch more VMs than there is RAM
available to support them. Accounting at the user level instead
of the task level, is effectively trying to protect against the
bad mgmt app trying to overcommit the host. That is not really
important, as the mgmt app is so privileged it is already
assumed to be a trusted host component akin to a root acocunt.

So per-user locked mem accounting looks like a regression in
our VM isolation abilities compared to the per-task accounting.

> How does this affect known current use cases of locked memory
> management for assigned device VMs?

It will affect every single application using libvirt today, with
the possible exception of KubeVirt. KubeVirt puts each VM in a
separate container, and if they have userns enabled, those will
each get distinct UIDs for accounting purposes.

I expect every other usage of libvrit to be affected, unless the
mgmt app has gone out of its way to configure a dedicated UID for
each QEMU - I'm not aware of any which do this.

> Does qemu://system by default sandbox into per VM uids or do they all
> use the qemu user by default.  I imagine qemu://session mode is pretty
> screwed by this, but I also don't know who/where locked limits are
> lifted for such VMs.  Boxes, who I think now supports assigned device
> VMs, could also be affected.

In a out of the box config qemu:///system will always run all VMs
under the user:group pairing  qemu:qemu.

Mgmt apps can requested a dedicated UID per VM in the XML config,
but I'm not aware of any doing this. Someone might, but we should
assume the majority will not.

IOW, it affects essentially all libvirt usage of VFIO.

> > So, I still like 2 because it yields the smallest next step before we
> > can bring all the parallel work onto the list, and it makes testing
> > and converting non-qemu stuff easier even going forward.
> 
> If a vfio compatible interface isn't transparently compatible, then I
> have a hard time understanding its value.  Please correct my above
> description and implications, but I suspect these are not just
> theoretical ABI compat issues.  Thanks,

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|