[libvirt] [BUG] mlock support breakage

Mon Mar 13 16:08:58 UTC 2017

On Mon, Mar 13, 2017 at 11:58:24AM -0400, Luiz Capitulino wrote:
> 
> Libvirt commit c2e60ad0e51 added a new check to the XML validation
> logic where XMLs containing <memoryBacking><locked/> must also
> contain <memtune><hard_limit>. This causes two breakages where
> working guests won't start anymore:
> 
> 1. Systems where mlock limit was set in /etc/security/limits.conf

I'd be surprised if that had any effect, unless you were setting it
against the root user.

The limits.conf file is loaded by PAM, and when libvirtd spawns
QEMU, PAM is not involved, so limits.conf will never be activated.

This is why libvirt provides max_processes/max_files/max_core
settings in /etc/libvirt/qemu.conf - you can't set those in
limits.conf and have them work, unless you set them against
root, so that libvirtd itself gets the higher limits, which are
then inherited by QEMU.
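
To illustrate the inheritance point, here is a minimal sketch (plain
Python, not libvirt code) showing that a spawned child sees the same
RLIMIT_MEMLOCK as its parent, with no PAM involved. The helper name
child_memlock_limit is made up for this example.

```python
import resource
import subprocess
import sys


def child_memlock_limit():
    """Spawn a child process and return the RLIMIT_MEMLOCK (soft, hard) it sees."""
    code = "import resource; print(*resource.getrlimit(resource.RLIMIT_MEMLOCK))"
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    soft, hard = (int(x) for x in out.split())
    return soft, hard


# The parent's limits are whatever its own parent (shell, init system) set.
parent = resource.getrlimit(resource.RLIMIT_MEMLOCK)
# The child is spawned directly, so it simply inherits them.
child = child_memlock_limit()
print("parent:", parent)
print("child: ", child)
```

The same applies to libvirtd spawning QEMU: whatever limits libvirtd
runs with (or sets explicitly before exec) are what QEMU gets.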

> 2. Guests using hugeTLB pages. In this case, guests were relying
>    on the fact that libvirt automagically increases mlock
>    limit to 1GB

Yep, that's bad - we mustn't break previously working scenarios
like this, even if they were not following documented practice.

> While it's true that <memoryBacking><locked/> documentation
> says that <memtune><hard_limit> is required, this is actually
> an extremely bad request because:
> 
>  A. <memtune><hard_limit>'s own documentation strongly recommends
>     NOT using it

Yep, the hard limit is impossible to calculate reliably, since no one
has been able to provide an accurate way to predict QEMU's peak
memory usage. When libvirt previously set hard_limit by default,
we got many bug reports about guests killed by the OOM killer,
no matter what algorithm we tried.

>  B. <memtune><hard_limit> does more than setting memory locking
>     limit
> 
>  C. <memtune><hard_limit> does not support infinity, so you have
>     to guess a limit
> 
>  D. If <memtune><hard_limit> is less than 1GB, it will lower
>     VFIO's automatic limit of "guest memory + 1GB"
> 
> Here are two possible solutions to fix all of this:
> 
>  1. Drop change c2e60ad0e51 and drop automatic increases. Let
>     users configure limits in /etc/security/limits.conf
> 
>     pros: this is the most correct way to do it, and how
>           it should have been done originally IMHO
> 
>     cons: will break working VFIO setups, so probably undoable

limits.conf is useless - see above.

>  2. Drop change c2e60ad0e51 and automatically increase memory
>     locking limit to infinity when seeing <memoryBacking><locked/>
> 
>    pros: make all cases work, no more <hard_limit> requirement
> 
>    cons: allows guests with <locked/> to lock all memory
>          assigned to them plus QEMU allocations. While this seems
>          undesirable or even a security issue, using <hard_limit>
>          will have the same effect

I think this is the only viable approach, given that no one can
provide a way to reliably calculate QEMU peak memory usage.
Unless we want to take guest RAM + $LARGE_NUMBER, e.g. just blindly
assume that 2 GB is enough for the QEMU working set, so for an 8 GB
guest, just set 10 GB as the limit.
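
As a rough sketch of that arithmetic (illustrative Python only, not
what libvirt does; the function name and the 2 GiB default are
assumptions taken from the example figure above):

```python
GIB = 1024 ** 3


def memlock_limit_bytes(guest_ram_bytes, overhead_bytes=2 * GIB):
    """Guest RAM plus a fixed, blindly assumed allowance for QEMU itself."""
    return guest_ram_bytes + overhead_bytes


# An 8 GiB guest with the assumed 2 GiB allowance.
limit = memlock_limit_bytes(8 * GIB)
print(limit // GIB, "GiB")
```

The fixed allowance is a guess, which is exactly the weakness of this
variant compared with simply setting the limit to infinity.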

> Lastly, <locked/> doesn't belong to <memoryBacking>, it should
> be in <memtune>. I recommend deprecating it from <memoryBacking>
> and adding it where it belongs.

We never make this kind of change in libvirt XML. It is a sub-optimal
location, but it has no functional problem, so there's no functional
benefit to moving it and there are clear backcompat downsides.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|