[libvirt-users] RLIMIT_MEMLOCK in container environment

Dan Kenigsberg danken at redhat.com
Sat Aug 24 07:08:19 UTC 2019


On Fri, 23 Aug 2019 at 0:27, Laine Stump <laine at redhat.com> wrote:

> (Adding Alex Williamson to Cc so he can correct any mistakes)
>
> On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
> > On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine at redhat.com> wrote:
> >>
> >> On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
> >>> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange at redhat.com> wrote:
> >>>>
> >>>> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> >>>>> API resources. In this case, libvirtd is running inside an
> >>>>> unprivileged pod, with some host mounts / capabilities added to the
> >>>>> pod, needed by libvirtd and other services.
> >>>>>
> >>>>> One of the capabilities libvirtd requires for successful startup
> >>>>> inside a pod is SYS_RESOURCE. This capability is used to adjust
> >>>>> RLIMIT_MEMLOCK ulimit value depending on devices attached to the
> >>>>> managed guest, both on startup and during hotplug. AFAIU the need to
> >>>>> lock the memory is to avoid pages being pushed out from RAM into
> >>>>> swap.
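> >>>>>
> >>>>> For reference, granting that capability in the pod spec looks
> >>>>> roughly like this (a sketch of just the relevant securityContext
> >>>>> fields in the container spec):
> >>>>>
> >>>>>   securityContext:
> >>>>>     capabilities:
> >>>>>       add: ["SYS_RESOURCE"]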
> >>
> >>
> >> I recall successfully testing GPU assignment from an unprivileged
> >> libvirtd several years ago by setting a high enough ulimit for the uid
> >> used to run libvirtd in advance. I think we check whether the current
> >> setting is high enough, and don't try to set it unless we think we
> >> need to.
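> >>
> >> (Something like this in /etc/security/limits.conf, assuming the uid
> >> running libvirtd is named "qemu" (a made-up name here) and that a PAM
> >> session applies; in a container, the runtime's ulimit settings are the
> >> equivalent knob:
> >>
> >>   qemu  soft  memlock  unlimited
> >>   qemu  hard  memlock  unlimited
> >> )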
> >>
> >
> > The PR I linked to in the original email does just that: it starts
> > libvirtd; then, if the domain is going to use VFIO, sets the ulimit of
> > the libvirtd process to the VM memory size + 1 GiB (mimicking libvirt's
> > code) + 256 MiB (to stay conservative) using the prlimit() syscall;
> > then defines the domain.
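> >
> > Roughly, the logic is as follows (not the actual PR code, just a sketch
> > using golang.org/x/sys/unix; the pid and sizes below are illustrative):
> >
> >   package main
> >
> >   import (
> >       "fmt"
> >       "os"
> >
> >       "golang.org/x/sys/unix"
> >   )
> >
> >   const (
> >       gib = uint64(1) << 30
> >       mib = uint64(1) << 20
> >   )
> >
> >   // raiseMemlock bumps RLIMIT_MEMLOCK of a running libvirtd via
> >   // prlimit(2). The caller needs CAP_SYS_RESOURCE, which is exactly
> >   // the capability we want to avoid granting to libvirtd itself.
> >   func raiseMemlock(libvirtdPid int, guestMemBytes uint64) error {
> >       // VM memory + 1 GiB (libvirt's margin) + 256 MiB (ours)
> >       want := guestMemBytes + 1*gib + 256*mib
> >
> >       // Raise the limit only if it isn't already high enough,
> >       // mirroring libvirt's own check before setrlimit().
> >       var cur unix.Rlimit
> >       err := unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, nil, &cur)
> >       if err != nil {
> >           return fmt.Errorf("query RLIMIT_MEMLOCK: %w", err)
> >       }
> >       if cur.Cur >= want && cur.Max >= want {
> >           return nil
> >       }
> >
> >       lim := unix.Rlimit{Cur: want, Max: want}
> >       return unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, &lim, nil)
> >   }
> >
> >   func main() {
> >       // made-up pid and a 4 GiB guest, for illustration
> >       if err := raiseMemlock(1234, 4*gib); err != nil {
> >           fmt.Fprintln(os.Stderr, err)
> >           os.Exit(1)
> >       }
> >   }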
>
> So you're making an educated guess, which is essentially what libvirt is
> doing (based on advice from other people with better information than
> us, but still a guess).
>
> >
> >> If I understand you correctly, you're saying that in your case it's okay
> >> for the memlock limit to be lower than we try to set it to, because swap
> >> is disabled anyway, is that correct?
> >>
> >
> > I'm honestly not exactly sure about the reason why we need to set the
> > limit, but I assume it's because of swap. I may be totally confused on
> > that part, though.
>
>
> What I understand from an IRC conversation with Alex just now is that
> increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages
> being swapped out. It's done because "all GPAs (Guest Physical
> Addresses) that could potentially be DMA targets need to have fixed
> mappings through the iommu, therefore all need to be allocated and
> mappings fixed [...] setting rlimit allows us to perform all the
> necessary pins within the user's locked memory limit".
>
> So even if swap is disabled, it still needs to be done (either by
> libvirt, or by someone else who has the necessary privileges and control
> over the libvirtd process).
>
>
> >>>>
> >>>> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
> >>>> something in the XML that requires it - one of
> >>>
> >>> You are right, sorry. We add SYS_RESOURCE only for particular domains.
> >>>
> >>>>
> >>>>    - hard limit memory value is present
> >>>>    - host PCI device passthrough is requested
> >>>
> >>> We are using passthrough
> >>
> >> (If you want to make Alex happy, use the term "VFIO device assignment"
> >> rather than passthrough :-).)
> >>
> >
> > Not sure who Alex is but I'll try to make everyone happy! :)
>
> The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO
> maintainer.
>
>
> >>> to pass SR-IOV NIC VFs into guests. We also
> >>> plan to do the same for GPUs in the near future.
> >>
> >>   >>> I believe we would benefit from one of the following features
> >>   >>> on the libvirt side (or both):
> >>   >>>
> >>   >>> a) expose the memory lock value calculated by libvirtd through
> >>   >>> the libvirt ABI so that we can use it when calling prlimit() on
> >>   >>> the libvirtd process;
> >>   >>> b) allow disabling setrlimit() calls via a libvirtd config file
> >>   >>> knob or the domain definition.
> >>
> >> (b) sounds much more reasonable, as long as qemu doesn't complain (I
> >> don't know whether or not it checks)
> >>
> >> Slightly related to this - I'm currently working on patches to avoid
> >> making any ioctl calls that would fail in an unprivileged libvirtd when
> >> using tap/macvtap devices.


This is music to my ears, great to hear.

> >> ATM, I'm doing this by adding an attribute
> >> "unmanaged='yes'" to the interface <target> element. The idea is that if
> >> someone sets unmanaged='yes', they're stating that the caller (i.e.
> >> kubevirt) is responsible for all device setup, and that libvirt should
> >> just use it without further setup. A similar approach could be applied
> >> to hostdev devices - if unmanaged is set, we assume that the caller has
> >> done everything to make the associated device usable.
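> >>
> >> For a tap device the caller has pre-created, that would look something
> >> like this (a sketch; the exact attribute name is still up in the air):
> >>
> >>   <interface type='ethernet'>
> >>     <target dev='mytap0' unmanaged='yes'/>
> >>     <model type='virtio'/>
> >>   </interface>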
> >>
> >> (Of course this all makes me realize the inanity of adding a <target
> >> dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have
> >> <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So
> >> to prevent setting the locklimit for hostdev, would we make a new
> >> setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I
> >> *hate* trying to make config consistent :-/)
>

Sounds tough indeed. I'd try to avoid negatively-named knobs: managed=no
is easier to reason about than unmanaged=yes. It may be just me, but I'd
even assume managed=no whenever a target dev name is specified; if
libvirt manages the tap device, it should create a fresh one, too. But
all of this is a big digression.

> >>
> >> (alternatively, we could just let the attempt to set the lock limit
> >> fail gracefully and allow the guest to continue)
> >>
> >
> > If that's something maintainers feel good about, I am all for it since
> > it simplifies the implementation.
>
> Well, after talking to Alex, I think that since a) libvirt only attempts
> to increase the limit after determining that it isn't already high
> enough, and b) if it isn't high enough and we can't increase it, then
> qemu is going to fail anyway, c) we can't just fail gracefully and
> continue.
>
> So *somebody* needs to increase the limit, and if you want libvirt to be
> unprivileged, that means it needs to be you doing the increase. And
> since the amount that libvirt increases it is just some number based on
> oral folklore (and not on a specific value we learn by querying
> somewhere), I don't think it's worthwhile figuring out some way for
> libvirt to report it via an official API - that would end up just being
> this:
>
> "Hey, you know that number that you guys are just making a guess about
> based on some advice someone gave you once? Yeah, send me *that* number
> so I can claim to be basing my actions on real science instead of
> slightly educated voodoo! K THX BYE!" :-)
>

Well, it's more like: "you know that voodoo you do to guess the number?
If you ever educate yourself about it, e.g. by querying qemu, send me
*that* number. I'd rather not think about it ever again, BYE."


> >
> >> BTW, I'm guessing that you use <hostdev> to assign the SR-IOV VFs
> >> rather than <interface type='hostdev'>, correct? The latter would
> >> require that you have enough capabilities to set MAC addresses on the
> >> VFs (that's the entire point of using <interface type='hostdev'>
> >> instead of plain <hostdev>)
> >
> > Yes, we use <hostdev> precisely because <interface> sets the MAC
> > address: in the kubevirt scenario, the container running libvirtd has
> > its own network namespace and doesn't have access to the PF to set the
> > VF MAC address. Instead, we rely on a CNI plugin running in the root
> > namespace context to configure the VF interface as needed. (I've
> > contributed custom MAC support to the SR-IOV CNI plugin very recently.)
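> >
> > To illustrate the difference (PCI address and MAC made up):
> >
> >   <!-- plain <hostdev>: libvirt leaves the VF's MAC alone -->
> >   <hostdev mode='subsystem' type='pci' managed='no'>
> >     <source>
> >       <address domain='0x0000' bus='0x03' slot='0x00' function='0x2'/>
> >     </source>
> >   </hostdev>
> >
> >   <!-- <interface type='hostdev'>: libvirt would also program the MAC
> >        through the PF, which our container cannot reach -->
> >   <interface type='hostdev' managed='no'>
> >     <source>
> >       <address type='pci' domain='0x0000' bus='0x03' slot='0x00'
> >                function='0x2'/>
> >     </source>
> >     <mac address='52:54:00:12:34:56'/>
> >   </interface>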
> >
> > Ihar
> >
>

