[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)

Florian Haas florian at hastexo.com
Thu Feb 8 13:29:46 UTC 2018


Hi David,

thanks for the added input! I'm taking the liberty to snip a few
paragraphs to trim this email down a bit.

On Thu, Feb 8, 2018 at 1:07 PM, David Hildenbrand <david at redhat.com> wrote:
>> Just to give an example,
>> https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
>> from just last September talks explicitly about how "guests can be
>> snapshot/resumed, migrated to other hypervisors and much more" in the
>> opening paragraph, and then talks at length about nested guests —
>> without ever pointing out that those very features aren't expected to
>> work for them. :)
>
> Well, it still is a kernel parameter "nested" that is disabled by
> default. So things should be expected to be shaky. :) While running
> nested guests usually works fine, migrating a nested hypervisor is the
> problem.
>
> Especially see e.g.
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt
>
> "However, note that nested virtualization is not supported or
> recommended in production user environments, and is primarily intended
> for development and testing. "

Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
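
For anyone following along: the knob in question is the "nested" module
parameter of kvm_intel (kvm_amd on AMD hosts). On my L0 host I check
and persistently enable it roughly like this; the file name under
/etc/modprobe.d is just what I happen to use:

  # check whether nested virtualization is enabled (Y or 1 means enabled)
  cat /sys/module/kvm_intel/parameters/nested

  # enable it persistently, then reload the module (or reboot)
  echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm-nested.conf
  modprobe -r kvm_intel && modprobe kvm_intel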

>> So to clarify things, could you enumerate the currently known
>> limitations when enabling nesting? I'd be happy to summarize those and
>> add them to the linux-kvm.org FAQ so others are less likely to hit
>> their head on this issue. In particular:
>
> The general problem is that migration of an L1 will not work while it
> is running an L2, that is, when L1 is using VMX ("nVMX").
>
> Migrating an L2 should work as before.
>
> The problem is that in order for L1 to make use of VMX to run L2, we
> have to run L2 in L0, simulating VMX -> nested VMX, a.k.a. nVMX. This
> requires additional state information about L1 ("nVMX" state), which is
> not properly migrated when migrating L1. Therefore, after migration,
> the CPU state of L1 might be screwed up, resulting in L1 crashes.
>
> In addition, certain VMX features might be missing on the target, which
> also still has to be handled via the CPU model in the future.

Thanks a bunch for the added detail. Now I got a primer today from
Kashyap on IRC on how savevm/loadvm is very similar to migration, but
I'm still struggling to wrap my head around it. What you say makes
perfect sense to me in that _migration_ might blow up in subtle ways,
but can you try to explain to me why the same considerations would
apply with savevm/loadvm?

> L0 should hopefully not crash; I hope that you are not seeing that.

No I am not; we're good there. :)

>> - Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
>> still accurate in that -cpu host (libvirt "host-passthrough") is the
>> strongly recommended configuration for the L2 guest?
>>
>> - If so, are there any recommendations for how to configure the L1
>> guest with regard to CPU model?
>
> You have to indicate the VMX feature to your L1 ("nested hypervisor");
> that is usually done automatically by using the "host-passthrough" or
> "host-model" value. If you're using a custom CPU model, you have to
> enable it explicitly.

Roger. Without that we can't do nesting at all.
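
For completeness, and if I'm reading the libvirt documentation
correctly, with a custom CPU model that would look something like this
in the L1 domain XML (the model name is just an example):

<cpu mode='custom' match='exact'>
    <model fallback='forbid'>Haswell-noTSX</model>
    <feature policy='require' name='vmx'/>
</cpu>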

>> - Is live migration with nested guests _always_ expected to break on
>> all architectures, and if not, which are safe?
>
> x86 VMX: running nested guests works, migrating nested hypervisors does
> not work
>
> x86 SVM: running nested guests works, migrating nested hypervisors does
> not work (somebody correct me if I'm wrong)
>
> s390x: running nested guests works, migrating nested hypervisors works
>
> power: running nested guests works only via KVM-PR ("trap and emulate").
> Migrating nested hypervisors therefore works. But we are not using
> hardware virtualization for L1->L2. (my latest status)
>
> arm: running nested guests is in the works (my latest status), so
> migration is not possible yet either.

Great summary, thanks!

>> - Idem, for savevm/loadvm?
>>
>
> savevm/loadvm is not expected to work correctly on an L1 if it is
> running L2 guests. It should, however, work on an L2.

Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.

>> - With regard to the problem that Kashyap and I (and Dennis, the
>> kernel.org bugzilla reporter) are describing, is this expected to work
>> any better on AMD CPUs?  (All reports are on Intel)
>
> No, remember that they are also still missing migration support for the
> nested SVM state.

Understood, thanks.

>> - Do you expect nested virtualization functionality to be adversely
>> affected by KPTI and/or other Meltdown/Spectre mitigation patches?
>
> Not an expert on this. I think it should be affected in a similar way
> to ordinary guests :)

Fair enough. :)

>> Kashyap, can you think of any other limitations that would benefit
>> from improved documentation?
>
> We should certainly document what I have summarized here properly in a
> central place!

I tried getting registered on the linux-kvm.org wiki to do exactly
that, and ran into an SMTP/DNS configuration issue with the
verification email. Kashyap said he was going to poke the site admin
about that.

Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that
when I ran into the previously described issue with
kernel.panic_on_oops=1 set, the L1 guest would enter a reboot loop. In
other words, I would savevm the L1 guest (with a running L2), then
loadvm it, and then the L1 would stack-trace, reboot, and keep doing
that indefinitely. I found that weird, because on the second reboot I
would expect the system to come up cleanly.
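
For reference, the sequence I'm using boils down to something like this
on the L0 host (the domain name "l1" is just from my test setup, and
I'm going through libvirt's managed save, as in the subject line):

  # with an L2 guest running inside the L1 guest "l1"
  virsh managedsave l1   # save the L1 guest's state to disk and stop it
  virsh start l1         # restore from the managed save; this is where L1 oopses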

I've now changed my L2 guest's CPU configuration so that libvirt (in
L1) starts the L2 guest with the following settings:

<cpu>
    <model fallback='forbid'>Haswell-noTSX</model>
    <vendor>Intel</vendor>
    <feature policy='disable' name='vme'/>
    <feature policy='disable' name='ss'/>
    <feature policy='disable' name='f16c'/>
    <feature policy='disable' name='rdrand'/>
    <feature policy='disable' name='hypervisor'/>
    <feature policy='disable' name='arat'/>
    <feature policy='disable' name='tsc_adjust'/>
    <feature policy='disable' name='xsaveopt'/>
    <feature policy='disable' name='abm'/>
    <feature policy='disable' name='aes'/>
    <feature policy='disable' name='invpcid'/>
</cpu>

Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
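
In case it helps anyone reproduce this, the feature list above is
simply what something like the following prints inside the L1 (the
xmllint invocation is just one way to extract the names):

  # list the host CPU features that virsh capabilities reports
  virsh capabilities | xmllint --xpath '//host/cpu/feature/@name' -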

If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with QEMU (without KVM
acceleration). But I would still love to understand the issue a little
bit better.

Cheers,
Florian



