[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)

Thu Feb 8 08:19:17 UTC 2018

On Wed, Feb 7, 2018 at 11:26 PM, David Hildenbrand <david at redhat.com> wrote:
> On 07.02.2018 16:31, Kashyap Chamarthy wrote:
>> [Cc: KVM upstream list.]
>>
>> On Tue, Feb 06, 2018 at 04:11:46PM +0100, Florian Haas wrote:
>>> Hi everyone,
>>>
>>> I hope this is the correct list to discuss this issue; please feel
>>> free to redirect me otherwise.
>>>
>>> I have a nested virtualization setup that looks as follows:
>>>
>>> - Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
>>> - L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
>>> - Nested guest: SLES 12, kernel 3.12.28-4-default
>>>
>>> The nested guest is configured with "<type arch='x86_64'
>>> machine='pc-i440fx-1.4'>hvm</type>".
>>>
>>> This is working just beautifully, except when the L0 guest wakes up
>>> from managed save (openstack server resume in OpenStack parlance).
>>> Then, in the L0 guest we immediately see this:
>>
>> [...] # Snip the call trace from Florian.  It is here:
>> https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
>>
>>> What does fix things, of course, is to switch the nested guest from
>>> KVM to QEMU, but that also makes things significantly slower.
>>>
>>> So I'm wondering: is there someone reading this who does run nested
>>> KVM and has managed to successfully live-migrate or managed-save? If
>>> so, would you be able to share a working host kernel / L0 guest kernel
>>> / nested guest kernel combination, or any other hints for tuning the
>>> L0 guest to support managed save and live migration?
>>
>> Following up from our IRC discussion (on #kvm, Freenode).  Re-posting my
>> comment here:
>>
>> So I just did a test of 'managedsave' (which is just "save the state of
>> the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
>> running, and I seem to have reproduced your case (see the call trace
>> attached).
>>
>>     # Ensure L2 (the nested guest) is running on L1.  Then, from L0, do
>>     # the following:
>>     [L0] $ virsh managedsave L1
>>     [L0] $ virsh start L1 --console
>>
>> Result: See the call trace attached to this bug.  But L1 goes on to
>> start "fine", and L2 keeps running, too.  But things start to seem
>> weird.  As in: I try to safely, read-only mount the L2 disk image via
>> libguestfs (by setting export LIBGUESTFS_BACKEND=direct, which uses
>> direct QEMU): `guestfish --ro -a -i ./cirros.qcow2`.  It throws the call
>> trace again on the L1 serial console.  And the `guestfish` command just
>> sits there forever.
>>
>>   - L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
>>   - L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
>>   - L2 is a CirrOS 3.5 image
>>
>> I have reproduced this at least 3 times with the above versions.
>>
>> I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
>> QEMU parlance) for both L1 and L2.
>>
>> My L0 CPU is:  Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.
>>
>> Thoughts?
>
> Sounds like a problem similar to
> https://bugzilla.kernel.org/show_bug.cgi?id=198621
>
> In short: there is no (live) migration support for nested VMX yet. So as
> soon as your guest is using VMX itself ("nVMX"), this is not expected to
> work.

Hi David, thanks for getting back to us on this.

I see your point, except the issue Kashyap and I are describing does
not occur with live migration, it occurs with savevm/loadvm (virsh
managedsave/virsh start in libvirt terms, nova suspend/resume in
OpenStack lingo). And it's not immediately self-evident that the
limitations for the former also apply to the latter. Even for the live
migration limitation, I've been unsuccessful at finding documentation
that warns users to not attempt live migration when using nesting, and
this discussion sounds like a good opportunity for me to help fix
that.

Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests,
without ever pointing out that those very features aren't expected to
work for them. :)

So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
their head on this issue. In particular:

- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?

- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?

- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?

- The same question for savevm/loadvm: is it always expected to break
on all architectures, and if not, which are safe?

- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs?  (All reports so far are on Intel CPUs.)

- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
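
As a side note to the Intel/AMD question above, here is a minimal
sketch of how one might check whether nesting is enabled for either
vendor module on a given hypervisor. The function name nested_enabled
and its optional second argument (an alternative sysfs root, there only
so the function can be exercised without KVM loaded) are my own
illustration and not part of any existing tool; the sysfs paths for the
"nested" parameter of kvm_intel and kvm_amd are the standard ones.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): report whether a KVM
# vendor module has nesting enabled. The second argument is an assumption
# added for testing; it defaults to the real sysfs module directory.
nested_enabled() {
    module="$1"
    sysfs_root="${2:-/sys/module}"
    param_file="$sysfs_root/$module/parameters/nested"

    # Module not loaded (or parameter absent): we cannot tell.
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 0
    fi

    # The parameter reads Y/N or 1/0, depending on module and kernel version.
    case "$(cat "$param_file")" in
        Y|y|1) echo "yes" ;;
        *)     echo "no" ;;
    esac
}

# Check both vendor modules; only one of them is normally loaded.
for module in kvm_intel kvm_amd; do
    printf '%s nested: %s\n' "$module" "$(nested_enabled "$module")"
done
```

Run on the bare-metal host, a "yes" only says that guests can be
offered VMX/SVM at all; it does not say whether a nested guest is
actually running at the moment, which is the situation that triggers
the problem described in this thread.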

Kashyap, can you think of any other limitations that would benefit
from improved documentation?

Cheers,
Florian