[edk2-devel] RFC: Fast Migration for SEV and SEV-ES - blueprint and proof of concept

Laszlo Ersek lersek at redhat.com
Tue Nov 3 14:59:38 UTC 2020


Hi Tobin,

(keeping full context -- I'm adding Dave)

On 10/28/20 20:31, Tobin Feldman-Fitzthum wrote:
> Hello,
> 
> Dov Murik, James Bottomley, Hubertus Franke, and I have been working on
> a plan for fast live migration of SEV and SEV-ES (and SEV-SNP when it's
> out and even hopefully Intel TDX) VMs. We have developed an approach
> that we believe is feasible and a demonstration that shows our solution
> to the most difficult part of the problem. In short, we have implemented
> a UEFI Application that can resume from a VM snapshot. We think this is
> the crux of SEV-ES live migration. After describing the context of our
> demo and how it works, we explain how it can be extended to a full
> SEV-ES migration. Our goal is to show that fast SEV and SEV-ES live
> migration can be implemented in OVMF with minimal kernel changes. We
> provide a blueprint for doing so.
> 
> Typically the hypervisor facilitates live migration. AMD SEV excludes
> the hypervisor from the trust domain of the guest. When a hypervisor
> (HV) examines the memory of an SEV guest, it will find only a
> ciphertext. If the HV moves the memory of an SEV guest, the ciphertext
> will be invalidated. Furthermore, with SEV-ES the hypervisor is largely
> unable to access guest CPU state. Thus, fast migration of SEV VMs
> requires support from inside the trust domain, i.e. the guest.
> 
> One approach is to add support for SEV Migration to the Linux kernel.
> This would allow the guest to encrypt/decrypt its own memory with a
> transport key. This approach has met some resistance. We propose a
> similar approach implemented not in Linux, but in firmware, specifically
> OVMF. Since OVMF runs inside the guest, it has access to the guest
> memory and CPU state. OVMF should be able to perform the manipulations
> required for live migration of SEV and SEV-ES guests.
> 
> The biggest challenge of this approach involves migrating the CPU state
> of an SEV-ES guest. In a normal (non-SEV) migration, the HV sets the CPU
> state of the target before the target begins executing. In our approach,
> the HV starts the target and OVMF must resume to whatever state the
> source was in. We believe this to be the crux (or at least the most
> difficult part) of live migration for SEV and we hope that by
> demonstrating resume from EFI, we can show that our approach is
> generally feasible.
> 
> Our demo can be found at <https://github.com/secure-migration>. The
> tooling repository is the best starting point. It contains documentation
> about the project and the scripts needed to run the demo. There are two
> more repos associated with the project. One is a modified edk2 tree that
> contains our modified OVMF. The other is a modified qemu, that has a
> couple of temporary changes needed for the demo. Our demonstration is
> aimed only at resuming from a VM snapshot in OVMF. We provide the source
> CPU state and source memory to the destination using temporary plumbing
> that violates the SEV trust model. We explain the setup in more depth in
> README.md. We are showing only that OVMF can resume from a VM snapshot.
> At the end we will describe our plan for transferring CPU state and
> memory from source to target. To be clear, the temporary tooling used for
> this demo isn't built for encrypted VMs, but below we explain how this
> demo applies to and can be extended to encrypted VMs.
> 
> We implemented our resume code in a very similar fashion to the
> recommended S3 resume code. When the HV sets the CPU state of a guest,
> it can do so when the guest is not executing. Setting the state from
> inside the guest is a delicate operation. There is no way to atomically
> set all of the CPU state from inside the guest. Instead, we must set
> most registers individually and account for changes in control flow that
> doing so might cause. We do this with a three-phase trampoline. OVMF
> calls phase 1, which runs on the OVMF map. Phase 1 sets up phase 2 and
> jumps to it. Phase 2 switches to an intermediate map that reconciles the
> OVMF map and the source map. Phase 3 switches to the source map,
> restores the registers, and returns into execution of the source. We
> will go backwards through these phases in more depth.
> 
> The last thing that resume from EFI does is return. Specifically, we use
> IRETQ, which reads the values of RIP, CS, RFLAGS, RSP, and SS from a
> temporary stack and restores them atomically, thus returning to source
> execution. Prior to returning, we must manually restore most other
> registers to the values they had on the source. One particularly
> significant register is CR3. When we return to Linux, CR3 must be set to
> the source CR3 or the first instruction executed in Linux will cause a
> page fault. The code that we use to restore the registers and return
> must be mapped in the source page table or we would get a page fault
> executing the instructions prior to returning into Linux. The value of
> CR3 is so significant that it defines the three phases of the
> trampoline. Phase 3 begins when CR3 is set to the source CR3. After
> setting CR3, we set all the other registers and return.
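
To make the IRETQ step concrete, here is a minimal sketch of the
temporary stack frame, written as a Python model rather than as the
project's actual assembly. In 64-bit mode IRETQ pops RIP, CS, RFLAGS,
RSP, and SS in that order, 8 bytes each. The helper name and all
register values below are hypothetical and only illustrate the layout.

```python
import struct

# Order in which IRETQ pops values in 64-bit mode, lowest address first.
IRETQ_FRAME_FIELDS = ("rip", "cs", "rflags", "rsp", "ss")


def build_iretq_frame(state):
    """Pack the five values IRETQ consumes into a 40-byte little-endian frame."""
    return struct.pack("<5Q", *(state[name] for name in IRETQ_FRAME_FIELDS))


# Hypothetical source CPU state, for illustration only.
source_state = {
    "rip": 0xFFFFFFFF81000000,
    "cs": 0x10,
    "rflags": 0x202,
    "rsp": 0xFFFFC90000003F00,
    "ss": 0x18,
}

frame = build_iretq_frame(source_state)
```

Every register that is not part of this frame (CR3 and the general
purpose registers, for example) has to be restored by separate
instructions before the IRETQ executes.
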
> 
> Phase 2 mainly exists to set up phase 3. OVMF uses a 1-1 mapping, meaning
> that virtual addresses are the same as physical addresses. The kernel
> page table uses an offset mapping, meaning that virtual addresses differ
> from physical addresses by a constant (for the most part). Crucially,
> this means that the virtual address of the page that is executed by
> phase 3 differs between the OVMF map and the source map. If we are
> executing code mapped in OVMF and we change CR3 to point to the source
> map, although the page may be mapped in the source map, the virtual
> address will be different, and we will face undefined behavior. To fix
> this, we construct intermediate page tables that map the pages for phase
> 2 and 3 to the virtual address expected in OVMF and to the virtual
> address expected in the source map. Thus, we can switch CR3 from OVMF's
> map to the intermediate map and then from the intermediate map to the
> source map. Phase 2 is much shorter than phase 3. Phase 2 is mainly
> responsible for switching to the intermediate map, flushing the TLB, and
> jumping to phase 3.
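
The role of the intermediate map can be illustrated with a toy model
in which a page table is a Python dict from virtual page address to
physical page address. This is a sketch under stated assumptions, not
the page-table code from the demo: the function names, the offset
constant, and the page addresses are all hypothetical.

```python
def identity_map(phys_pages):
    """OVMF-style 1-1 map: virtual address equals physical address."""
    return {p: p for p in phys_pages}


def offset_map(phys_pages, offset):
    """Kernel-style map: virtual address = physical address + constant."""
    return {p + offset: p for p in phys_pages}


def intermediate_map(trampoline_pages, offset):
    """Map each trampoline page at both its OVMF and its source address."""
    table = identity_map(trampoline_pages)
    table.update(offset_map(trampoline_pages, offset))
    return table


# Hypothetical values, chosen only for illustration.
KERNEL_OFFSET = 0xFFFF888000000000
phase3_page = 0x7F000000

ovmf = identity_map([phase3_page, 0x100000])
source = offset_map([phase3_page, 0x200000], KERNEL_OFFSET)
inter = intermediate_map([phase3_page], KERNEL_OFFSET)

ovmf_va = phase3_page                    # where OVMF sees the page
source_va = phase3_page + KERNEL_OFFSET  # where the source map sees it
```

In this model the OVMF virtual address is absent from the source map,
which is why a direct switch would fail. The intermediate map shares
the OVMF address with the OVMF map and the source address with the
source map, so each CR3 switch happens between two maps that both
contain the address being executed.
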
> 
> Fortunately phase 1 is even simpler than phase 2. Phase 1 has two
> duties. First, since phase 2 and 3 operate without a stack and can't
> access values defined in OVMF (such as the addresses of the pages
> containing phase 2 and 3), phase 1 must pass these values to phase 2 by
> putting them in registers. Second, phase 1 must start phase 2 by jumping
> to it.
> 
> Given that we can resume to a snapshot in OVMF, we should be able to
> migrate an SEV guest as long as we can securely communicate the VM
> snapshot from source to destination. For our demo, we do this with a
> handful of QMP commands. More sophisticated methods are required for a
> production implementation.
> 
> When we refer to a snapshot, what we really mean is the device state,
> memory, and CPU state of a guest. In live migration this is transmitted
> dynamically as opposed to being saved and restored. Device state is not
> protected by SEV and can be handled entirely by the HV. Memory, on the
> other hand, cannot be handled only by the HV. As mentioned previously,
> memory needs to be encrypted with a transport key. A Migration Handler
> on the source will coordinate with the HV to encrypt pages and transmit
> them to the destination. The destination HV will receive the pages over
> the network and pass them to the Migration Handler in the target VM so
> they can be decrypted. This transmission will occur continuously until
> the memory of the source and target converges.
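
The convergence loop described above can be sketched as a toy
simulation. Everything here is an assumption made for illustration:
the function names, the representation of memory as a dict of pages,
and above all the XOR "cipher", which is NOT real encryption and only
stands in for encryption with a transport key.

```python
def precopy_migrate(source_mem, writes_per_round, encrypt, decrypt):
    """Toy pre-copy loop: resend dirtied pages until a round dirties nothing."""
    target_mem = {}
    pending = set(source_mem)            # first round sends every page
    writes_iter = iter(writes_per_round)
    rounds = 0
    while pending:
        for page in pending:
            ciphertext = encrypt(source_mem[page])  # source Migration Handler
            target_mem[page] = decrypt(ciphertext)  # target Migration Handler
        rounds += 1
        # The guest keeps running, so some pages are dirtied during the round.
        writes = next(writes_iter, {})
        source_mem.update(writes)
        pending = set(writes)
    return target_mem, rounds


KEY = 0x5A  # placeholder key; XOR is a stand-in, not a real cipher


def toy_xor(data):
    """Symmetric placeholder transform standing in for transport encryption."""
    return bytes(b ^ KEY for b in data)


source = {0: b"boot", 1: b"heap", 2: b"data"}
writes = [{1: b"HEAP"}, {2: b"DATA"}]  # pages dirtied during rounds 1 and 2
target, rounds = precopy_migrate(source, writes, toy_xor, toy_xor)
```

In this model only ciphertext crosses the boundary between the two
handlers, which corresponds to the HV moving pages without being able
to read them.
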
> 
> Plain SEV does not protect the CPU state of the guest and therefore does
> not require any special mechanism for transmission of the CPU state. We
> plan to implement an end-to-end migration with plain SEV first. In
> SEV-ES, the PSP (platform security processor) encrypts CPU state on each
> VMExit. The encrypted state is stored in memory. Normally this memory
> (known as the VMSA) is not mapped into the guest, but we can add an
> entry to the nested page tables that will expose the VMSA to the guest.
> This means that when the guest VMExits, the CPU state will be saved to
> guest memory. With the CPU state in guest memory, it can be transmitted
> to the target using the method described above.
> 
> In addition to the changes needed in OVMF to resume the VM, the
> transmission of the VM from source to target will require a new code
> path in the hypervisor. There will also need to be a few minor changes
> to Linux (adding a mapping for our Phase 3 pages). Despite all the
> moving pieces, we believe that this is a feasible approach for
> supporting live migration for SEV and SEV-ES.
> 
> For the sake of brevity, we have left out a few issues, including SMP
> support, generation of the intermediate mappings, and more. We have
> included some notes about these issues in the COMPLICATIONS.md file. We
> also have an outline of an end-to-end implementation of live migration
> for SEV-ES in END-TO-END.md. See README.md for info on how to run the
> demo. While this is not a full migration, we hope to show that fast live
> migration with SEV and SEV-ES is possible without major kernel changes.
> 
> -Tobin

The one word that comes to my mind upon reading the above is,
"overwhelming".

(I have not been addressed directly, but:

- the subject says "RFC",

- and the documentation at

https://github.com/secure-migration/resume-from-edk2-tooling#what-changes-did-we-make

states that AmdSevPkg was created for convenience, and that the feature
could be integrated into OVMF. (Paraphrased.)

So I guess it's tolerable if I make a comment: )

I've checked out the "mh-state-dev" branch of
<https://github.com/secure-migration/resume-from-efi-edk2.git>. It has
80 commits on top of edk2 master (base commit: d5339c04d7cd,
"UefiCpuPkg/MpInitLib: Add missing explicit PcdLib dependency",
2020-04-23).

These commits were authored over the 6-7 months since April. It's
obviously huge work. To me, most of these commits clearly aim at getting
the demo / proof-of-concept functional, rather than guiding (more
precisely: hand-holding) reviewers through the construction of the feature.

In my opinion, the series is not upstreamable in its current format
(which is presently not much more readable than a single-commit code
drop). Upstreaming is probably not your intent, either, at this time.

I agree that getting feedback ("buy-in") at this level of maturity is
justified from your POV, before you invest more work into cleaning up /
restructuring the series.

My problem is that "hand-holding" is exactly what I'd need -- I cannot
dedicate one or two weeks, as an indivisible block, to understanding
your design. Nor can I approach the series patch-wise in its current
format. Personally I would need the patch series to lead me through the
whole design with baby steps ("ELI5"), meaning small code changes and
detailed commit messages. I'd *also* need the more comprehensive
guide-like documentation, as background material.

Furthermore, I don't have an environment where I can test this
proof-of-concept (and provide you with further incentive for cleaning up
the series, by reporting success).

So I hope others can spend the time discussing the design with you, and
testing / repeating the demo. For me to review the patches, the patches
should condense and replay your thinking process from the last 7 months,
in as small as possible logical steps. (On the list.)

I really don't want to be the bottleneck here, which is why I would
support introducing this feature as a separate top-level package
(AmdSevPkg).

Thanks
Laszlo


