[edk2-devel] [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression

Laszlo Ersek lersek at redhat.com
Fri Jan 20 12:55:05 UTC 2023


On 1/20/23 10:10, Ard Biesheuvel wrote:
> On Fri, 20 Jan 2023 at 09:50, Laszlo Ersek <lersek at redhat.com> wrote:
>>
>> a couple of requests to Oliver below:
>>
>> On 1/19/23 12:27, Ard Biesheuvel wrote:
>>> On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek <lersek at redhat.com> wrote:
>>>>
>>>> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
>>>> protocol is (effectively) broken such that it suggests that switching from
>>>> the legacy interface to the modern interface works, but in reality the
>>>> switch never happens. The symptom has been witnessed when using TCG
>>>> acceleration; KVM seems to mask the issue. The issue persists with the
>>>> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
>>>> there is no stable release that addresses the problem.
>>>>
>>>> The QEMU bug confuses the Present and Possible counting in function
>>>> PlatformMaxCpuCountInitialization(), in
>>>> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
>>>> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
>>>> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
>>>> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
>>>> the privilege level of SMM) is not that great.
>>>>
>>>> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
>>>> and *hang* if the issue is present.
>>>>
>>>> Users willing to take risks can override the hang with the experimental
>>>> QEMU command line option
>>>>
>>>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>>>
>>>> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
>>>> as far as it concerns the firmware, is experimental.)
>>>>
>>>> The problem was originally reported by Ard [0]. We analyzed it at [1] and
>>>> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
>>>> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
>>>> block", 2023-01-08), to be included in QEMU v8.0.0.
>>>>
>>>> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
>>>>
>>>> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
>>>>
>>>> [2] IO port write width clamping differs between TCG and KVM
>>>>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>>>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>>>>
>>>> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
>>>>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
>>>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
>>>>
>>>> NOTE: PlatformInitLib is used in the following platform DSCs:
>>>>
>>>>   OvmfPkg/AmdSev/AmdSevX64.dsc
>>>>   OvmfPkg/CloudHv/CloudHvX64.dsc
>>>>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
>>>>   OvmfPkg/Microvm/MicrovmX64.dsc
>>>>   OvmfPkg/OvmfPkgIa32.dsc
>>>>   OvmfPkg/OvmfPkgIa32X64.dsc
>>>>   OvmfPkg/OvmfPkgX64.dsc
>>>>
>>>> but I can only test this change with the last three platforms, running on
>>>> QEMU.
>>>>
>>>> Test results:
>>>>
>>>>   TCG  QEMU     OVMF     override  result
>>>>        patched  patched
>>>>   ---  -------  -------  --------  --------------------------------------
>>>>   0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
>>>>   0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
>>>>   0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
>>>>                                    the QEMU bug anyway)
>>>>   0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
>>>>                                    the QEMU bug anyway)
>>>>   1    0        0        0         boot with broken CPU counts (original
>>>>                                    QEMU bug)
>>>>   1    0        1        0         broken CPU count caught (boot hangs)
>>>>   1    0        1        1         broken CPU count caught, bug check
>>>>                                    overridden, boot continues
>>>>   1    1        0        0         CPU counts OK (QEMU fix)
>>>>   1    1        1        0         CPU counts OK (QEMU fix)
>>>>
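
The table above can be restated as a small decision function. This is only
a sketch for illustration: the enum, the function names and the boolean
parameters are hypothetical and do not appear in OVMF or QEMU. The one
combination the table does not list (TCG, unpatched QEMU, unpatched OVMF,
override given) is assumed to behave like the unpatched row, on the
reasoning that firmware without the bug check does not look at the option.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the "result" column of the test table. */
typedef enum {
  SketchCountsOk,               /* CPU counts OK */
  SketchBrokenCountsUnnoticed,  /* boot with broken CPU counts */
  SketchBootHangs,              /* broken CPU count caught, boot hangs */
  SketchOverriddenBootContinues /* caught, overridden, boot continues */
} SKETCH_OUTCOME;

static SKETCH_OUTCOME
SketchOutcome (bool Tcg, bool QemuPatched, bool OvmfPatched, bool Override)
{
  /* KVM masks the QEMU bug, and a patched QEMU does not have it. */
  if (!Tcg || QemuPatched) {
    return SketchCountsOk;
  }

  /* Assumption: unpatched firmware never consults the override. */
  if (!OvmfPatched) {
    return SketchBrokenCountsUnnoticed;
  }

  return Override ? SketchOverriddenBootContinues : SketchBootHangs;
}

/* Replays the three TCG rows of the table that run on unpatched QEMU. */
static void
SketchSelfCheck (void)
{
  assert (SketchOutcome (true, false, false, false) == SketchBrokenCountsUnnoticed);
  assert (SketchOutcome (true, false, true, false) == SketchBootHangs);
  assert (SketchOutcome (true, false, true, true) == SketchOverriddenBootContinues);
}
```
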
>>>> Cc: Ard Biesheuvel <ardb+tianocore at kernel.org>
>>>> Cc: Brijesh Singh <brijesh.singh at amd.com>
>>>> Cc: Erdem Aktas <erdemaktas at google.com>
>>>> Cc: Gerd Hoffmann <kraxel at redhat.com>
>>>> Cc: James Bottomley <jejb at linux.ibm.com>
>>>> Cc: Jiewen Yao <jiewen.yao at intel.com>
>>>> Cc: Jordan Justen <jordan.l.justen at intel.com>
>>>> Cc: Michael Brown <mcb30 at ipxe.org>
>>>> Cc: Min Xu <min.m.xu at intel.com>
>>>> Cc: Oliver Steffen <osteffen at redhat.com>
>>>> Cc: Sebastien Boeuf <sebastien.boeuf at intel.com>
>>>> Cc: Tom Lendacky <thomas.lendacky at amd.com>
>>>> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>>>> Signed-off-by: Laszlo Ersek <lersek at redhat.com>
>>>
>>> Thanks a lot for taking the time and investing the effort. I'm quite
>>> happy that we have this 'escape hatch' now, which we could arguably
>>> use temporarily in the VS2019 platform CI until its QEMU binary gets
>>> updated, right?
>>
>> Yes, I have to agree there.
>>
>> Right now, because those QEMU binaries are affected by the regression,
>> and because they use TCG, OVMF already sees Present=0 Possible=1. Due to
>> the interference of Present=0 with the QEMU v2.7 reset bug workaround,
>> we also get BootCpuCount=0. Furthermore, MaxCpuCount gets set to 1, from
>> Possible. Thus, we exit PlatformMaxCpuCountInitialization() with
>> PcdCpuBootLogicalProcessorNumber=0 (from BootCpuCount) and
>> PcdCpuMaxLogicalProcessorNumber=1 (from MaxCpuCount).
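
The arithmetic in the paragraph above, as a minimal sketch. This is not the
code in Platform.c: the struct, the function name SketchCpuCounts() and the
FwCfgBootCpuCount parameter are hypothetical stand-ins. The sketch assumes
that the reset bug workaround replaces BootCpuCount with Present whenever
the two differ, and that MaxCpuCount is taken from Possible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two PCD values named above. */
typedef struct {
  uint32_t BootLogicalProcessorNumber; /* PcdCpuBootLogicalProcessorNumber */
  uint32_t MaxLogicalProcessorNumber;  /* PcdCpuMaxLogicalProcessorNumber */
} SKETCH_CPU_PCDS;

static SKETCH_CPU_PCDS
SketchCpuCounts (
  uint32_t  FwCfgBootCpuCount,
  uint32_t  Present,
  uint32_t  Possible
  )
{
  SKETCH_CPU_PCDS  Pcds;
  uint32_t         BootCpuCount;

  /* Sketch-only sanity check: at least one CPU slot is reported. */
  assert (Possible >= 1);

  /* Assumed shape of the workaround: Present wins over the boot count. */
  BootCpuCount = FwCfgBootCpuCount;
  if (BootCpuCount != Present) {
    BootCpuCount = Present;
  }

  Pcds.BootLogicalProcessorNumber = BootCpuCount; /* from BootCpuCount */
  Pcds.MaxLogicalProcessorNumber  = Possible;     /* from MaxCpuCount */
  return Pcds;
}
```

With Present=0 and Possible=1, as seen under the regression, and a boot
count of 1 assumed for a guest started without -smp, the sketch yields
Boot=0 and Max=1, matching the values quoted above.
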
>>
>> Then, in the "predictable subset" of consequences of the QEMU
>> regression, we can say that MpInitLib interprets the above PCD values as
>> "uniprocessor system with the boot CPU count not exposed by the
>> platform". This (i.e., *just this*) does not fall outside of MpInitLib's
>> domain (again, note my qualification "predictable subset").
>>
>> Now, if we apply the patch and also add the -fw_cfg switch to the
>> Windows CI, *and* we also don't add any -smp flags (as far as I can
>> tell, no -smp flag is used now), then the new PCD state will be
>>
>> PcdCpuBootLogicalProcessorNumber=1 (changed from zero)
>> PcdCpuMaxLogicalProcessorNumber=1 (stays the same)
>>
>> As far as I can tell, *right now* this change should have no effect *in
>> MpInitLib*, IOW nothing gets worse or better there. Namely,
>> PcdCpuBootLogicalProcessorNumber is only consumed in WakeUpAP(), and
>> only when InitFlag == ApInitConfig. InitFlag is set like that only in
>> CollectProcessorCount(). However, CollectProcessorCount() is only called
>> if PcdCpuMaxLogicalProcessorNumber is >1 (see MaxLogicalProcessorNumber
>> in MpInitLibInitialize()). Meaning in effect that
>> PcdCpuMaxLogicalProcessorNumber=1 makes PcdCpuBootLogicalProcessorNumber
>> irrelevant, so its change from 0 to 1 is invisible *to MpInitLib*.
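
The call chain described above, reduced to a sketch. The function below is
hypothetical and only models the gating as I describe it in the quoted
paragraph; it is not MpInitLib code, and the intermediate variable names
are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if, in this simplified model, the boot CPU count PCD
 * would be read at all for the given maximum CPU count.
 */
static bool
SketchBootCountIsConsumed (uint32_t MaxLogicalProcessorNumber)
{
  bool  CollectProcessorCountCalled;
  bool  InitFlagIsApInitConfig;

  /* Sketch-only sanity check: the maximum covers at least the BSP. */
  assert (MaxLogicalProcessorNumber >= 1);

  /* CollectProcessorCount() is only called when the maximum is >1. */
  CollectProcessorCountCalled = (MaxLogicalProcessorNumber > 1);

  /* InitFlag is set to ApInitConfig only in CollectProcessorCount(). */
  InitFlagIsApInitConfig = CollectProcessorCountCalled;

  /* WakeUpAP() consumes the boot count only under ApInitConfig. */
  return InitFlagIsApInitConfig;
}
```

For a maximum of 1 the function returns false, which is the sense in which
the change of the boot count from 0 to 1 is invisible to MpInitLib.
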
>>
>> Oliver:
>>
>> (1) can you please post a patch for the Windows CI so that the following
>> option be passed to QEMU:
>>
>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>
>> (This option is harmless when the firmware does not detect the QEMU
>> bug, so it can be passed in advance; it will have no consequence at all.)
>>
>> In the patch, please reference
>>
>>   https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>>
> 
> Can I take the above as an ack on
> 
> https://edk2.groups.io/g/devel/message/98899
> 
> ?
> 
>> (2) Please file a separate TianoCore BZ for *backing out* the change (=
>> for removing the -fw_cfg switch), and assign it to yourself :)
>>
>> Once the Windows CI advances to a fixed QEMU binary, the "escape hatch"
>> should be welded shut.
>>
>> (3) Please give me a hint when the CI patch (1) has been merged; then I
>> can go ahead and merge this v3 series as well.
>>
> 
> I'll merge the whole lot once you're happy with the CI patch.
> 

(/me checks the timestamps of messages :) my tendency to work in batches
has its downsides as well, alas. Sorry about the confusion; I'll proceed
with the merge in the other thread.)


