[edk2-devel] [RFC PATCH 00/19] Add AMD Secure Nested Paging (SEV-SNP) support

Fri Apr 9 22:43:11 UTC 2021

On 4/9/21 7:24 AM, Laszlo Ersek wrote:
> On 04/08/21 13:59, Brijesh Singh wrote:
>> On 4/8/21 4:58 AM, Laszlo Ersek wrote:
>>> On 03/24/21 16:31, Brijesh Singh wrote:
>>>> At this time we only support the pre-validation. OVMF detects all the available
>>>> system RAM in the PEI phase. When SEV-SNP is enabled, the memory is validated
>>>> before it is made available to the EDK2 core.
>>> Can you describe this in a bit more detail, before I look at the
>>> individual patches? Specifically, what existing logic in the PEI phase
>>> was taken, and extended, and how?
>> One of the key requirement is that the guest private pages much be
>> validated before the access. If guest tries to access the pages before
>> the validation then it will result in #VC (page-not-validated)
>> exception. To avoid the #VC, we propose the validating the memory before
>> the access. We will incrementally add the support to lazy validate (i.e
>> validate on access).
> What is the primary threat (in simple terms please) that validation is
> supposed to prevent?

To protect against the memory re-mapping attack the guest pages must be
validated. The idea is that every guest page can map only to a single
physical memory page at one time.

> And, against that particular threat, does pre-validation offer some
> protection, or will that only come with lazy validation?

For the hardware it does not matter how the memory was validated -- lazy
vs prevalidate. Both approaches use the PVALIDATE instruction to
validate the page.

In the case of pre-validation, the memory is validated before the
access. Whereas in the lazy validation, the access will cause a fault
and fault handler should validate the page and retry the access. Its
similar to a page fault handler, OS can populate the page table or back
the pages on demand.

The only downside of pre-validation is, we will take a hit on the boot
time. The GHCB spec provides method by which we can batch multiple
requests at once to minimize the context switches.

>
>> Let me try explaining a bit, the page validation process consist of two
>> steps:
>>
>> 1. Add the pages in the RMP table -- must be done by the hypervisor
>> using the RMPUPDATE instruction. The guest can use VMGEXIT NAEs to ask
>> hypervisor to add or remove pages from the RMP table.
>>
>> 2. Guest issue the PVALIDATE instruction -- this sets the validate bit
>> in the RMP table.
>>
>> Similar to SEV, the OVMF_CODE.fd is encrypted through the SNP firmware
>> before the launch. The SNP firmware also validates the memory page after
>> encrypting. This allows us to boot the initial entry code without guest
>> going through the validation process.
>>
>> The OVMF reset vector uses few data pages (e.g page table, early Sec
>> stack). Access to these data pages will result in #VC. There are two
>> approaches we can take to validate these data pages:
>>
>> 1. Ask SNP firmware to pre-validate it -- SNP firmware provides an
>> special command that can be used to pre-validate the pages without
>> affecting the measurement.
> This means the two pflash chips, right?

This does not need to be two pflash chips. The SNP firmware command does
not know anything about the ROM or pflash chip. The command accepts the
system physical address that need to be validated by the firmware. In
patch #2, OVMF provides a range of data pages that need to be validated
by the SNP firmware before booting the guest.

>
>> 2. Enhance the reset vector code to validate the pages.
>>
>> For now I choose #1.
>>
>> The pre-validation performed by the SNP firmware is sufficient to boot
>> through the SEC phase. The SEC phase later decompress the Fv to a new
>> memory location. Now we need the OVMF to take over the validation
>> procedure.  The series extends the MemEncryptSevLib to add a new helper
>> MemEncryptSevSnpValidateRam(). The helper is used to validate the system
>> RAM. See patch #12. SEC phase calls the MemEncryptSevSnpValidateRam() to
>> validate the output buffer used for the decompression. This was
>> sufficient to boot into the PEI phase, see patch #13.
> Two questions here:
>
> - Is ACPI S3 in scope for now?

Its not in the scope yet. I have not looked at it.

>
> - We need more areas made accessible in SEC than just the decompression
> buffer; for example the temporary SEC/PEI heap and stack, and (IIRC)
> various special SEV-ES areas laid out via MEMFD. Where are all of those
> handled / enumerated?

Sorry, I simplified my response by saying just decompression. You are
right that its more than the decompression buffer. In my current patch,
the SEC phase validates all the memory up to
PcdOvmfDecompressionScratchEnd (which includes more than just output
buffer). See patch #13.

>
>> The PEI detects
>> all the available system RAM. After the memory detection is completed
>> the PlatformPei calls the AmdSevSnpInitialize(). The initialization
>> routine iterate through the HOB and calls the
>> MemEncryptSevSnpValidateRam() to validate all the system RAM. Is it
>> possible the more system ram can be detected after the PlatformPei is
>> completed ?
> That would cause problems even without SEV-SNP (i.e., with plain SEV),
> so I'm not worried about it.
>
>> One of the important thing is we should *never* validate the pages
>> twice.
> What are the symptoms / consequences of:
>
> - the guest accessing an unvalidated page (I understand it causes a #VC,
> but what is the direct result of that, when this series is applied?),

After the series it applied, an access to an unvalidated page will
result in #VC. The series does not extend the VC handler to handle the
page-not-validated error code. The current VC handler
[InternalVmgExitHandleVc()] will inject a #GP to terminate the guest if
#VC is raised for unvalidated page.

>
> - doubly-validating a page?
>
> The first question is relevant because we should crash as soon as we
> touch a page we forgot to validate (we shouldn't let any corruption or
> similar spread out until we finally realize there's a problem).
>
> The second question is relevant for security I guess. What attacks
> become possible, and/or what symptoms are produced, if we
> doubly-validate a page?

Assuming that guest validates its memory, this guarantees the injective
mapping between GPAs and SPAs.

As I noted that the page validation process consist of two steps #1
RMPUPDATE and 2#PVALIDATE. The RMPUPDATE is a hypervisor instruction and
PVALIDATE is a guest instruction. A malicious hypervisor can remap gpa
to different physical address in RMP table using the RMPUPDATE just
before the second PVALIDATE instruction is executed. The guest need to
ensure that it does not leave the security vulnerability that can be
exploited by a malicious hypervisor. The SNP white paper recommends that
a guest OS should keep track of what is has validate so that it can
avoid getting into these situation.

>
> Furthermore, IIRC, we have separate #VC handlers for the different
> firmware phases; do they behave consistently / identicall when a
> #VC(page-not-validated) occurs, when this patch set is applied?
>
> My first question is basically asking whether we can *exclusively* rely
> on #VC(page-not-validated) faults to tell us that we missed validating a
> particular page. If we can do that, then the job is a bit easier,
> because from the GPA, we can more or less also derive *when and where*
> we should pre-validate the page (at least until validation is done
> completely lazily).

Yes, we should be able to rely #VC to tell us that we missed validating
the page. In my this series so far I have not seen a #VC due to
page-not-validated because we have carefully validated the pages before
they are accessed. I am thinking that #VC(page-not-validated) should be
treated as an error in the first phase. Incrementally we will add the
lazy validation, in which the #VC(page-not-validated) will provide a
hint as to what has not been validated.

>
>> The MemEncryptSevSnpValidateRam() uses a interval search tree to
>> keep the record of what has been validated. Before validating the range,
>> it lookup in its tree and if it finds that range is already validated
>> then do nothing. If it detects an overlap then it will validate only non
>> overlapping regions -- see patch #14.
> What data structure is used for implementing the interval tree?
>
> I'm not necessarily looking for a data structure with "nice" asymptotic
> time complexity. With pre-validation especially, I think simplicity
> (ease of review) is more important for the data structure than
> performance. If it's not an actual "tree", we shouldn't call it a
> "tree". (An "interval tree" is usually an extension of a Red-Black Tree,
> and that's not the simplest data structure in existence; although edk2
> does offer an rbtree library.)

I am using binary search tree. Currently, we don't have many nodes to
track. Most of the guest memory is private and are validated before we
exit from the PEI phase.

> Furthermore, what you describe above is called idempotency. No matter
> how many times we attempt to validate a range, it may (or may not even)
> cause an actual change in the first action only. Is this property
> (=idempotency) an inherent requirement of the technology, or is it a
> simplification of the implementation? Put differently: if you called
> CpuDeadLoop() in the validation function any time an overlapping
> validation request were received, would that hugely complicate the call
> sites?

>From the technology point of view you can validate the same page again
and again. Hardware does not force any restriction. To protect against
time-based remap attack a guest also need to ensure that it does not
blindly validates the pages otherwise guest is taking risk. Avoiding a
double validation is strong recommendation. In current implementation,
the validation burden is not put on the caller. Caller calls standard
APIs to toggle the encryption attribute. Under the hood, if we see that
page is already validated then no need to issue the PVALIDATE. If the
page is getting changed from C=1 -> 0 then unvalidate it.

> I'm kind of "obsessing" about idempotency because you say we must
> *never* doubly-validate a page, so the *difference* between:
> - explicitly crashing on such an attempt,
> - and silently ignoring such an attempt,
> may be meaningful.

I said we should never doubly validate to avoid creating a vulnerability
that can later be exploited by the malicious hypervisor. Currently we
lookup address in our tree to ensure that we don't do a double
validation. The PVALIDATE instruction also can be used to determine if
we are attempting to validate an already validated page. In current
implementation after the PVALIDATE completion, I check the rflags.CF to
ensure that its not a double validation condition. If its double
validation then abort the guest. Its bug in the guest.

> It's kind of the difference between "oops this is a call site *bug*, but
> we patched it up", vs. "this is expected of call sites, we should just
> handle it internally".

Our attempt should be to patch to avoid the double validation. I would
still check the status of PVALIDATE, if instruction detects it as a
double validation then we have a bug in our tracking and should abort
the guest.

>
>> The patch #18 extend the MemEncrypt{Set,Clear}PageEncMask() to call the
>> SNP page state change during the C-bit toggle.
> - When exactly do we invalidate?

Basically before the page is transition from C=1 -> 0 in the page table.

>
> - Does the time and place of invalidation depend on whether we perform
> pre-validation, or lazy validation?
It does not matter. Both type of validation update the Validated bit in
the RMP table. The important thing is that invalidation can happen only
for a already validated private page. If the page is shared then
invalidation will result in #GP.
>
> - Is page invalidation idempotent too?
We should treat this as idempotent too. As I said, tacking the page
state is strongly recommended for the security reason. The SNP white
paper has a section dedicated for the validation and may able to clarify
things which is not covered in my response.
>
> - What is the consequence (security impact) if we forget invalidation?
I can't think of a security implication of it but if we forget to
invalidate then it can lead issues in the tracking data structure when
it comes to validation next time. e.g a page was mapped as C=1 and we
changed it to C=0 without invalidating it. Now if the same page is being
changed from C=0 -> 1 then tracking data structure may think that page
is already validated and will skip the validation process and thus
access will result in #VC (page-not-validated).
>
> - There are four page states I can imagine:
>   - encrypted for host access, valid for guest access
>   - encrypted for host access, invalid for guest access
>   - decrypted for host access, valid for guest access
>   - decrypted for host access, invalid for guest access
>
> Does a state exist, from these four, that's never used? (Potentially
> caught by the hardware?)

I would not mix the page state with the host access. Basically there is
two page state exist

- Guest-Valid

- Guest-Invalid

By default all the guest memory is in Guest-Invalid state on boot. The
PVALIDATE instruction is used to transition from Valid->Invalid. Guest
can use the PVALIDATE to change the state from Valid to Invalid.

A guest private page (i.e page mapped C=1) must be in Guest Valid state.
If hardware see that a private access page is not in the Guest-Valid
state then it will trigger #VC.

>
> Do the patches highlight / explain the validity transitions, in
> comments? My understanding is that the C-bit toggle and the guest access
> valid/invalid toggle are separate actions, so "middle" states do exist,
> but perhaps only temporarily.

I have tried to document the steps in the patch description and I can
add more comments as required. When SEV-SNP is active then we need to
invalidate the page before toggling the C-bit from C=1 -> 0 and
similarly after the page is toggled from C=0 -> 1 we must validate it.

> I'm curious how it works, for example, with variousvirtio transfers (bus
> master read, bus master write, bus master common buffer). In the
> IoMmuDxe driver minimally, we access memory both through PTEs with the
> C-bit set and through PTEs with the C-bit clear, meaning that
> "encrypted, valid", and "decrypted, valid" are both required states. But
> that seems to conflict with the notion that "C-bit toggle" be directly
> coupled with a "validity toggle".

That should not be an issue. Each page have entry in the RMP table. So,
you can have one page as guest invalid and another page as a guest valid.

> Put differently, I'm happy that modifying IoMmuDxe appears unnecessary,
> but then that tells me I'm missing something about the state transitions
> between the above *four* states.

Only thing which I would propose changing in IoMmuDxe driver is to use a
pre-allocated buffer and avoid toggling the C-bit on every read/write.
In past the C-bit toggle was all contained inside the guest. But when
SNP is enabled, each C-bit toggle causes two additional steps #1
RMPUPDATE and #2 PVALIDATE. The RMPUPDATE is requested using the VMEXIT
so it will cause a VM context switch. I would prefer to avoid it.

My current patch set does not have this optimization yet.

>
>> Please let me know if you have any questions. We can hash out the design
>> here before you taking a closure look at the code.
> Sorry that I've been (and am being) slow to start reviewing this series.

No worries, take your time. Thank you for all your feedback.

-Brijesh

>
> Thanks,
> Laszlo
>

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#73918): https://edk2.groups.io/g/devel/message/73918
Mute This Topic: https://groups.io/mt/81584576/1813853
Group Owner: devel+owner at edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [edk2-devel-archive at redhat.com]
-=-=-=-=-=-=-=-=-=-=-=-