<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 23 Aug 2019, 0:27 Laine Stump, <<a href="mailto:laine@redhat.com" rel="noreferrer noreferrer" target="_blank">laine@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">(Adding Alex Williamson to Cc so he can correct any mistakes)<br> <br> On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:<br> > On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <<a href="mailto:laine@redhat.com" rel="noreferrer noreferrer noreferrer" target="_blank">laine@redhat.com</a>> wrote:<br> >><br> >> On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:<br> >>> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <<a href="mailto:berrange@redhat.com" rel="noreferrer noreferrer noreferrer" target="_blank">berrange@redhat.com</a>> wrote:<br> >>>><br> >>>> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:<br> >>>>> Hi all,<br> >>>>><br> >>>>> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes<br> >>>>> API resources. In this case, libvirtd is running inside an<br> >>>>> unprivileged pod, with some host mounts / capabilities added to the<br> >>>>> pod, needed by libvirtd and other services.<br> >>>>><br> >>>>> One of the capabilities libvirtd requires for successful startup<br> >>>>> inside a pod is SYS_RESOURCE. This capability is used to adjust<br> >>>>> RLIMIT_MEMLOCK ulimit value depending on devices attached to the<br> >>>>> managed guest, both on startup and during hotplug. AFAIU the need to<br> >>>>> lock the memory is to avoid pages being pushed out from RAM into swap.<br> >><br> >><br> >> I recall successfully testing GPU assignment from an unprivileged<br> >> libvirtd several years ago by setting a high enough ulimit for the uid<br> >> used to run libvirtd in advance (. 
I think we check if the current<br> >> setting is high enough, and don't try to set it unless we think we need to.<br> >><br> > <br> > The PR I linked to in the original email does just that: it starts<br> > libvirtd; then, if the domain is going to use VFIO, sets the memlock limit of the<br> > libvirtd process to VM memory size + 1 GiB (mimicking libvirt code) +<br> > 256 MiB (to stay conservative) using the prlimit() syscall; then defines the<br> > domain.<br> <br> So you're making an educated guess, which is essentially what libvirt is <br> doing (based on advice from other people with better information than <br> us, but still a guess).<br> <br> > <br> >> If I understand you correctly, you're saying that in your case it's okay<br> >> for the memlock limit to be lower than we try to set it to, because swap<br> >> is disabled anyway, is that correct?<br> >><br> > <br> > I'm honestly not exactly sure about the reason why we need to set the<br> > limit, but I assume it's because of swap. I may be totally confused about<br> > that part, though.<br> <br> <br> What I understand from an IRC conversation with Alex just now is that <br> increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages <br> being swapped out. It's done because "all GPAs (Guest Physical <br> Addresses) that could potentially be DMA targets need to have fixed <br> mappings through the iommu, therefore all need to be allocated and <br> mappings fixed [...] setting rlimit allows us to perform all the <br> necessary pins within the user's locked memory limit".<br> <br> So even if swap is disabled, it still needs to be done (either by <br> libvirt, or by someone else who has the necessary privileges and control <br> over the libvirtd process).<br> <br> <br> >>>><br> >>>> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's<br> >>>> something in the XML that requires it - one of<br> >>><br> >>> You are right, sorry. 
We add SYS_RESOURCE only for particular domains.<br> >>><br> >>>><br> >>>> - hard limit memory value is present<br> >>>> - host PCI device passthrough is requested<br> >>><br> >>> We are using passthrough<br> >><br> >> (If you want to make Alex happy, use the term "VFIO device assignment"<br> >> rather than passthrough :-).)<br> >><br> > <br> > Not sure who Alex is but I'll try to make everyone happy! :)<br> <br> The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO <br> maintainer.<br> <br> <br> >>> to pass SR-IOV NIC VFs into guests. We also<br> >>> plan to do the same for GPUs in the near future.<br> >><br> >> >>> I believe we would benefit from one of the following features on<br> >> >>> the libvirt side (or both):<br> >> >>><br> >> >>> a) expose the memory lock value calculated by libvirtd through<br> >> >>> libvirt ABI so that we can use it when calling prlimit() on the libvirtd<br> >> >>> process;<br> >> >>> b) allow disabling setrlimit() calls via a libvirtd config file knob<br> >> >>> or domain definition.<br> >><br> >> (b) sounds much more reasonable, as long as qemu doesn't complain (I<br> >> don't know whether or not it checks)<br> >><br> >> Slightly related to this - I'm currently working on patches to avoid<br> >> making any ioctl calls that would fail in an unprivileged libvirtd when<br> >> using tap/macvtap devices. </blockquote></div></div><div dir="auto"><br></div><div dir="auto">This is music to my ears, great to hear.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">ATM, I'm doing this by adding an attribute<br> >> "unmanaged='yes'" to the interface <target> element. The idea is that if<br> >> someone sets unmanaged='yes', they're stating that the caller (i.e.<br> >> kubevirt) is responsible for all device setup, and that libvirt should<br> >> just use it without further setup. 
A similar approach could be applied<br> >> to hostdev devices - if unmanaged is set, we assume that the caller has<br> >> done everything to make the associated device usable.<br> >><br> >> (Of course this all makes me realize the inanity of adding a <target<br> >> dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have<br> >> <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So<br> >> to prevent setting the locklimit for hostdev, would we make a new<br> >> setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I<br> >> *hate* trying to make config consistent :-/)<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Sounds tough indeed. I'd try to avoid negatively-named knobs: managed='no' is easier to read than unmanaged='yes'. It may be just me, but I'd even assume managed='no' whenever the target dev name is specified. If libvirt manages the tap device, it should create a fresh one, too. But all of this is a big digression.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> >><br> >> (alternately, we could just automatically fail the attempt to set the<br> >> lock limit in a graceful manner and allow the guest to continue)<br> >><br> > <br> > If that's something maintainers feel good about, I am all for it since<br> > it simplifies the implementation.<br> <br> Well, after talking to Alex, I think that since a) libvirt only attempts <br> to increase the limit after determining that it isn't already high <br> enough, and b) if it isn't high enough and we can't increase it, then <br> qemu is going to fail anyway, it follows that c) we can't just fail gracefully and <br> continue.<br> <br> So *somebody* needs to increase the limit, and if you want libvirt to be <br> unprivileged, that means it needs to be you doing the increase. 
And <br> since the amount that libvirt increases it by is just some number based on <br> oral folklore (and not on a specific value we learn by querying <br> somewhere), I don't think it's worthwhile figuring out some way for <br> libvirt to report it via an official API - that would end up just being <br> this:<br> <br> "Hey, you know that number that you guys are just making a guess about <br> based on some advice someone gave you once? Yeah, send me *that* number <br> so I can claim to be basing my actions on real science instead of <br> slightly educated voodoo! K THX BYE!" :-)<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Well, it's more like: "you know that voodoo you do to guess the number? If you ever educate yourself about it, e.g. by querying qemu, send me *that* number. I'd rather not think about it ever again, BYE."</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br> > <br> >> BTW, I'm guessing that you use <hostdev> to assign the SR-IOV VFs rather<br> >> than <interface type='hostdev'>, correct? The latter would require that<br> >> you have enough capabilities to set MAC addresses on the VFs (that's the<br> >> entire point of using <interface type='hostdev'> instead of plain <hostdev>)<br> > <br> > Yes, we use <hostdev> exactly because interface sets the MAC address: in the<br> > kubevirt scenario, the container that is running libvirtd has its own<br> > network namespace and doesn't have access to the PF on which to set the VF MAC<br> > address. Instead, we rely on a CNI plugin that is running in the root<br> > namespace context to configure the VF interface as needed. 
(I've<br> > contributed custom MAC support to SR-IOV CNI plugin very recently.)<br> > <br> > Ihar<br> > <br></blockquote></div></div></div>
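<div dir="auto"><br></div><div dir="auto">For reference, the workaround discussed in this thread (raise the memlock limit of the running libvirtd process to VM memory size + 1 GiB + 256 MiB with prlimit(), but only when the current limit is not already high enough) could be sketched roughly as below. This is only an illustration, not the code from the PR: the function names are hypothetical, Python's resource.prlimit() stands in for the raw syscall, and the caller is assumed to hold CAP_SYS_RESOURCE whenever the hard limit has to go up.</div>

```python
import resource

GIB = 1024 ** 3
MIB = 1024 ** 2


def memlock_limit_bytes(vm_memory_bytes: int) -> int:
    """Guess a memlock limit: VM memory + 1 GiB (mimicking libvirt)
    + 256 MiB (extra margin to stay conservative)."""
    return vm_memory_bytes + 1 * GIB + 256 * MIB


def is_high_enough(current: int, wanted: int) -> bool:
    """True if an existing limit already covers the wanted value."""
    return current == resource.RLIM_INFINITY or current >= wanted


def raise_memlock_limit(pid: int, vm_memory_bytes: int) -> tuple:
    """Raise RLIMIT_MEMLOCK of process `pid` if it is too low.

    Hypothetical helper: returns the (soft, hard) pair in effect afterwards.
    Raising the hard limit of another process needs CAP_SYS_RESOURCE.
    """
    wanted = memlock_limit_bytes(vm_memory_bytes)
    # Calling prlimit() without new limits only reads the current values.
    soft, hard = resource.prlimit(pid, resource.RLIMIT_MEMLOCK)
    if is_high_enough(soft, wanted):
        return soft, hard  # nothing to do, mirror libvirt's "check first"
    new_hard = hard if is_high_enough(hard, wanted) else wanted
    resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (wanted, new_hard))
    return wanted, new_hard


if __name__ == "__main__":
    # Example: the limit that would be requested for a 4 GiB guest.
    print(memlock_limit_bytes(4 * GIB))
```

<div dir="auto">A privileged helper would call raise_memlock_limit(libvirtd_pid, vm_memory_bytes) before defining the domain. The number itself remains the same educated guess discussed above; nothing here queries qemu or libvirt for a real value.</div>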