[edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept

Wed Oct 28 19:31:44 UTC 2020

Hello,

Dov Murik, James Bottomley, Hubertus Franke, and I have been working on 
a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's 
out and even hopefully Intel TDX) VMs. We have developed an approach 
that we believe is feasible and a demonstration that shows our solution 
to the most difficult part of the problem. In short, we have implemented 
a UEFI Application that can resume from a VM snapshot. We think this is 
the crux of SEV-ES live migration. After describing the context of our 
demo and how it works, we explain how it can be extended to a full 
SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live 
migration can be implemented in OVMF with minimal kernel changes. We 
provide a blueprint for doing so.

Typically the hypervisor facilitates live migration. AMD SEV excludes 
the hypervisor from the trust domain of the guest. When a hypervisor 
(HV) examines the memory of an SEV guest, it will find only 
ciphertext. If the HV moves the memory of an SEV guest, the ciphertext 
will be invalidated. Furthermore, with SEV-ES the hypervisor is largely 
unable to access guest CPU state. Thus, fast migration of SEV VMs 
requires support from inside the trust domain, i.e. the guest.

One approach is to add support for SEV Migration to the Linux kernel. 
This would allow the guest to encrypt/decrypt its own memory with a 
transport key. This approach has met some resistance. We propose a 
similar approach implemented not in Linux, but in firmware, specifically 
OVMF. Since OVMF runs inside the guest, it has access to the guest 
memory and CPU state. OVMF should be able to perform the manipulations 
required for live migration of SEV and SEV-ES guests.

The biggest challenge of this approach involves migrating the CPU state 
of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU 
state of the target before the target begins executing. In our approach, 
the HV starts the target and OVMF must resume to whatever state the 
source was in. We believe this to be the crux (or at least the most 
difficult part) of live migration for SEV and we hope that by 
demonstrating resume from EFI, we can show that our approach is 
generally feasible.

Our demo can be found at <https://github.com/secure-migration>. The 
tooling repository is the best starting point. It contains documentation 
about the project and the scripts needed to run the demo. There are two 
more repos associated with the project. One is a modified edk2 tree that 
contains our modified OVMF. The other is a modified QEMU that has a 
couple of temporary changes needed for the demo. Our demonstration is 
aimed only at resuming from a VM snapshot in OVMF. We provide the source 
CPU state and source memory to the destination using temporary plumbing 
that violates the SEV trust model. We explain the setup in more depth in 
README.md. We are showing only that OVMF can resume from a VM snapshot. 
At the end we will describe our plan for transferring CPU state and 
memory from source to target. To be clear, the temporary tooling used for 
this demo isn't built for encrypted VMs, but below we explain how this 
demo applies to and can be extended to encrypted VMs.

We implemented our resume code in a very similar fashion to the 
recommended S3 resume code. When the HV sets the CPU state of a guest, 
it can do so when the guest is not executing. Setting the state from 
inside the guest is a delicate operation. There is no way to atomically 
set all of the CPU state from inside the guest. Instead, we must set 
most registers individually and account for changes in control flow that 
doing so might cause. We do this with a three-phase trampoline. OVMF 
calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and 
jumps to it. Phase 2 switches to an intermediate map that reconciles the 
OVMF map and the source map. Phase 3 switches to the source map, 
restores the registers, and returns into execution of the source. We 
will go backwards through these phases in more depth.

The last thing that the resume code does is return. Specifically, we use 
IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a 
temporary stack and restores them atomically, thus returning to source 
execution. Prior to returning, we must manually restore most other 
registers to the values they had on the source. One particularly 
significant register is CR3. When we return to Linux, CR3 must be set to 
the source CR3 or the first instruction executed in Linux will cause a 
page fault. The code that we use to restore the registers and return 
must be mapped in the source page table or we would get a page fault 
executing the instructions prior to returning into Linux. The value of 
CR3 is so significant that it defines the three phases of the 
trampoline. Phase 3 begins when CR3 is set to the source CR3. After 
setting CR3, we set all the other registers and return.
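
To make the layout of that temporary stack concrete, here is a small 
Python model of the five quadwords that IRETQ pops in 64-bit mode, 
lowest address first. It is an illustration only and not code from the 
demo: the function names and the register values are hypothetical.

```python
import struct

# Order in which IRETQ pops values in 64-bit mode, lowest address first.
IRETQ_FIELDS = ("rip", "cs", "rflags", "rsp", "ss")

def build_iretq_frame(rip, cs, rflags, rsp, ss):
    """Pack the five 8-byte values IRETQ consumes (little-endian)."""
    return struct.pack("<5Q", rip, cs, rflags, rsp, ss)

def parse_iretq_frame(frame):
    """Inverse of build_iretq_frame, returning a dict keyed by register."""
    return dict(zip(IRETQ_FIELDS, struct.unpack("<5Q", frame)))

# Hypothetical source state, chosen only for illustration.
frame = build_iretq_frame(
    rip=0xFFFFFFFF81000000,
    cs=0x10,
    rflags=0x246,
    rsp=0xFFFFC90000003F00,
    ss=0x18,
)
restored = parse_iretq_frame(frame)
```

The frame is 40 bytes. Everything that is not in it (CR3, the general 
purpose registers, and so on) has to be restored by separate 
instructions before the IRETQ executes.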

Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning 
that virtual addresses are the same as physical addresses. The kernel 
page table uses an offset mapping, meaning that virtual addresses differ 
from physical addresses by a constant (for the most part). Crucially, 
this means that the virtual address of the page that is executed by 
phase 3 differs between the OVMF map and the source map. If we are 
executing code mapped in OVMF and we change CR3 to point to the source 
map, although the page may be mapped in the source map, the virtual 
address will be different, and we will face undefined behavior. To fix 
this, we construct intermediate page tables that map the pages for phase 
2 and 3 to the virtual address expected in OVMF and to the virtual 
address expected in the source map. Thus, we can switch CR3 from OVMF's 
map to the intermediate map and then from the intermediate map to the 
source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly 
responsible for switching to the intermediate map, flushing the TLB, and 
jumping to phase 3.
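
The role of the intermediate map can be sketched with a toy model in 
which a page table is just a dictionary from virtual page to physical 
page. This is not how the demo builds its tables. The offset, the 
physical addresses, and all function names below are assumptions made 
for illustration, and the model assumes the source map already covers 
the trampoline page.

```python
PAGE_SIZE = 0x1000

def identity_map(phys_pages):
    """OVMF-style map: virtual address == physical address."""
    return {pa: pa for pa in phys_pages}

def offset_map(phys_pages, offset):
    """Kernel-style map: virtual address == physical address + offset."""
    return {pa + offset: pa for pa in phys_pages}

def intermediate_map(trampoline_pages, offset):
    """Map the trampoline pages at both their OVMF and source addresses."""
    table = identity_map(trampoline_pages)
    table.update(offset_map(trampoline_pages, offset))
    return table

def translate(table, va):
    """Return the physical address for va, or raise LookupError (a fault)."""
    page = va & ~(PAGE_SIZE - 1)
    if page not in table:
        raise LookupError(f"page fault at {va:#x}")
    return table[page] | (va & (PAGE_SIZE - 1))

# Hypothetical values, chosen only for illustration.
OFFSET = 0xFFFF888000000000
TRAMPOLINE = [0x7F000000]          # physical page holding phases 2 and 3
ALL_PAGES = [0x1000, 0x7F000000]

ovmf = identity_map(ALL_PAGES)
source = offset_map(ALL_PAGES, OFFSET)
inter = intermediate_map(TRAMPOLINE, OFFSET)

pa = 0x7F000040                    # an instruction inside the trampoline
ovmf_va, source_va = pa, pa + OFFSET

# A CR3 switch is safe only if the executing address is mapped to the
# same physical page both before and after the switch.
steps = [
    (ovmf, ovmf_va),       # phase 1 runs on the OVMF map
    (inter, ovmf_va),      # phase 2: CR3 -> intermediate map
    (inter, source_va),    # jump to the source-map address of phase 3
    (source, source_va),   # phase 3: CR3 -> source map
]
results = [translate(table, va) for table, va in steps]
```

Every step resolves to the same physical address, whereas looking up 
the OVMF address directly in the source map raises LookupError, which 
is the failure that the intermediate map is there to avoid.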

Fortunately, phase 1 is even simpler than phase 2. Phase 1 has two 
duties. First, since phase 2 and 3 operate without a stack and can't 
access values defined in OVMF (such as the addresses of the pages 
containing phase 2 and 3), phase 1 must pass these values to phase 2 by 
putting them in registers. Second, phase 1 must start phase 2 by jumping 
to it.

Given that we can resume to a snapshot in OVMF, we should be able to 
migrate an SEV guest as long as we can securely communicate the VM 
snapshot from source to destination. For our demo, we do this with a 
handful of QMP commands. More sophisticated methods are required for a 
production implementation.

When we refer to a snapshot, what we really mean is the device state, 
memory, and CPU state of a guest. In live migration this is transmitted 
dynamically as opposed to being saved and restored. Device state is not 
protected by SEV and can be handled entirely by the HV. Memory, on the 
other hand, cannot be handled only by the HV. As mentioned previously, 
memory needs to be encrypted with a transport key. A Migration Handler 
on the source will coordinate with the HV to encrypt pages and transmit 
them to the destination. The destination HV will receive the pages over 
the network and pass them to the Migration Handler in the target VM so 
they can be decrypted. This transmission will occur continuously until 
the memory of the source and target converges.
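
The transfer loop described above can be sketched as follows. This is 
a rough model under stated assumptions, not the planned implementation: 
toy_crypt is a placeholder XOR keystream that is NOT secure (a real 
Migration Handler would need an authenticated cipher), the dirty-page 
schedule is supplied by hand, and all names are hypothetical.

```python
import hashlib

def _keystream(key, page_no, length):
    """Derive a per-page keystream from the transport key (toy only)."""
    out, counter = b"", 0
    while len(out) < length:
        block = key + page_no.to_bytes(8, "little") + counter.to_bytes(8, "little")
        out += hashlib.sha256(block).digest()
        counter += 1
    return out[:length]

def toy_crypt(key, page_no, data):
    """XOR data with the keystream. Encrypts and decrypts. NOT secure."""
    stream = _keystream(key, page_no, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

def migrate(source_mem, dirty_rounds, key):
    """Copy source_mem to a new dict, resending pages dirtied in each round.

    dirty_rounds is a list of {page_no: new_bytes} dicts that model the
    writes the running guest makes while each round is in flight.
    """
    target_mem = {}
    pending = list(dirty_rounds)
    dirty = set(source_mem)            # the first round sends every page
    rounds = 0
    while dirty:
        # Source Migration Handler encrypts; the HV only sees ciphertext.
        wire = {p: toy_crypt(key, p, source_mem[p]) for p in dirty}
        # The guest keeps running and dirties pages during the transfer.
        writes = pending.pop(0) if pending else {}
        source_mem.update(writes)
        # Target Migration Handler decrypts what the target HV received.
        for page_no, ciphertext in wire.items():
            target_mem[page_no] = toy_crypt(key, page_no, ciphertext)
        dirty = set(writes)
        rounds += 1
    return target_mem, rounds

key = b"hypothetical transport key"
source_mem = {0: b"a" * 16, 1: b"b" * 16, 2: b"c" * 16}
schedule = [{1: b"B" * 16}, {1: b"X" * 16, 2: b"C" * 16}]
target_mem, rounds = migrate(source_mem, schedule, key)
```

In this model the loop ends when a round completes with no new dirty 
pages. How convergence would be detected or forced in a real migration 
is outside the scope of the sketch.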

Plain SEV does not protect the CPU state of the guest and therefore does 
not require any special mechanism for transmission of the CPU state. We 
plan to implement an end-to-end migration with plain SEV first. In 
SEV-ES, the PSP (platform security processor) encrypts CPU state on each 
VMExit. The encrypted state is stored in memory. Normally this memory 
(known as the VMSA) is not mapped into the guest, but we can add an 
entry to the nested page tables that will expose the VMSA to the guest. 
This means that when the guest VMExits, the CPU state will be saved to 
guest memory. With the CPU state in guest memory, it can be transmitted 
to the target using the method described above.

In addition to the changes needed in OVMF to resume the VM, the 
transmission of the VM from source to target will require a new code 
path in the hypervisor. There will also need to be a few minor changes 
to Linux (adding a mapping for our Phase 3 pages). Despite all the 
moving pieces, we believe that this is a feasible approach for 
supporting live migration for SEV and SEV-ES.

For the sake of brevity, we have left out a few issues, including SMP 
support, generation of the intermediate mappings, and more. We have 
included some notes about these issues in the COMPLICATIONS.md file. We 
also have an outline of an end-to-end implementation of live migration 
for SEV-ES in END-TO-END.md. See README.md for info on how to run the 
demo. While this is not a full migration, we hope to show that fast live 
migration with SEV and SEV-ES is possible without major kernel changes.

-Tobin
