[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)

Florian Haas florian at hastexo.com
Thu Feb 8 13:57:33 UTC 2018


On Thu, Feb 8, 2018 at 2:47 PM, David Hildenbrand <david at redhat.com> wrote:
>> Again, I'm somewhat struggling to understand this vs. live migration —
>> but it's entirely possible that I'm sorely lacking in my knowledge of
>> kernel and CPU internals.
>
> (savevm/loadvm is also called "migration to file")
>
> When we migrate to a file, it really is the same migration stream. You
> "dump" the VM state into a file, instead of sending it over to another
> (running) target.
>
> Once you load your VM state from that file, it is a completely fresh
> VM/KVM environment. So you have to restore all the state. Now, as nVMX
> state is not contained in the migration stream, you cannot restore that
> state. The L1 state is therefore "damaged" or incomplete.

*lightbulb* Thanks a lot, that's a perfectly logical explanation. :)
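
(To spell that out for anyone reading this in the archives, the two
flows look roughly like this; "l1-guest" and "other-host" are just
placeholder names.)

    # live migration: the state stream goes to a running target host
    virsh migrate --live l1-guest qemu+ssh://other-host/system

    # "migration to file": the same stream is written to disk, and on
    # start the state is loaded into a completely fresh QEMU/KVM process
    virsh managedsave l1-guest
    virsh start l1-guest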

>> Now, here's a bit more information on my continued testing. As I
>> mentioned on IRC, one of the things that struck me as odd was that if
>> I ran into the issue previously described, the L1 guest would enter a
>> reboot loop if configured with kernel.panic_on_oops=1. In other words,
>> I would savevm the L1 guest (with a running L2), then loadvm it, and
>> then the L1 would stack-trace, reboot, and then keep doing that
>> indefinitely. I found that weird because on the second reboot, I would
>> expect the system to come up cleanly.
>
> I guess the L1 state (in the kernel) is broken so badly that even a
> reset cannot fix it.

... which would also explain why, in contrast, a virsh destroy/virsh
start cycle does fix things.
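
(In concrete terms, the recovery that works here is a full power-cycle
of the L1 guest; "l1-guest" is again a placeholder, and the
managedsave-remove step is only needed if a stale managed-save image
is still lying around.)

    # hard power-off, then a fresh boot, discarding the broken state
    virsh destroy l1-guest
    virsh managedsave-remove l1-guest   # optional: drop a leftover image
    virsh start l1-guest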

>> I've now changed my L2 guest's CPU configuration so that libvirt (in
>> L1) starts the L2 guest with the following settings:
>>
>> <cpu>
>>     <model fallback='forbid'>Haswell-noTSX</model>
>>     <vendor>Intel</vendor>
>>     <feature policy='disable' name='vme'/>
>>     <feature policy='disable' name='ss'/>
>>     <feature policy='disable' name='f16c'/>
>>     <feature policy='disable' name='rdrand'/>
>>     <feature policy='disable' name='hypervisor'/>
>>     <feature policy='disable' name='arat'/>
>>     <feature policy='disable' name='tsc_adjust'/>
>>     <feature policy='disable' name='xsaveopt'/>
>>     <feature policy='disable' name='abm'/>
>>     <feature policy='disable' name='aes'/>
>>     <feature policy='disable' name='invpcid'/>
>> </cpu>
>
> Maybe one of these features is the root cause of the "messed up" state
> in KVM. So disabling it also makes the L1 state "less broken".

Would you venture a guess as to which of the above features is the
likely culprit?
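
(For the archives: that <cpu> element goes into the L2 guest's domain
definition inside L1; here's roughly how to apply and double-check it,
assuming the L2 domain is called "l2-guest".)

    # paste the <cpu> element into the L2 guest's definition (in L1)
    virsh edit l2-guest

    # after restarting the guest, verify what was actually defined
    virsh dumpxml l2-guest | grep -A 15 '<cpu'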

>> Basically, I am disabling every single feature that my L1's "virsh
>> capabilities" reports. Now this does not make my L1 come up happily
>> from loadvm. But it does seem to initiate a clean reboot after loadvm,
>> and after that clean reboot it lives happily.
>>
>> If this is as good as it gets (for now), then I can totally live with
>> that. It certainly beats running the L2 guest with Qemu (without KVM
>> acceleration). But I would still love to understand the issue a little
>> bit better.
>
> I mean the real solution to the problem is of course restoring the L1
> state correctly (migrating nVMX state, which people are working on
> right now). So what you are seeing is a bad "side effect" of that.
>
> For now, nested=true should never be used along with savevm/loadvm/live
> migration.

Yes, I gathered as much. :) Thanks again!
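
(Side note for anyone hitting this later: on an Intel L0 host you can
check, and if necessary turn off, nested VMX like this; reloading the
module of course requires that no VMs are currently using KVM.)

    # Y/1 means nested VMX is enabled on the L0 host
    cat /sys/module/kvm_intel/parameters/nested

    # disable it until the module is next reloaded
    modprobe -r kvm_intel
    modprobe kvm_intel nested=0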

Cheers,
Florian



