[edk2-devel] [PATCH v2 2/2] ArmVirtPkg/ArmVirtQemu: Avoid early ID map on ThunderX

Oliver Steffen osteffen at redhat.com
Fri May 19 16:32:13 UTC 2023


Quoting Oliver Steffen (2023-03-02 14:29:43)
> On Thu, Mar 2, 2023 at 11:50 AM Ard Biesheuvel <ardb at kernel.org> wrote:
>
>     On Thu, 9 Feb 2023 at 16:15, Ard Biesheuvel <ardb at kernel.org> wrote:
>     >
>     > On Tue, 7 Feb 2023 at 13:58, Oliver Steffen <osteffen at redhat.com> wrote:
>     > >
>     > > On Tue, Feb 7, 2023 at 12:57 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>     > >>
>     > >> On Tue, 7 Feb 2023 at 11:51, Oliver Steffen <osteffen at redhat.com> wrote:
>     > >> >
>     > >> > On Thu, Feb 2, 2023 at 12:09 PM Oliver Steffen <osteffen at redhat.com> wrote:
>     > >> >>
>     > >> >>
>     > >> >> On Wed, Feb 1, 2023 at 2:29 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>     > >> >>>
>     > >> >>> On Wed, 1 Feb 2023 at 13:59, Oliver Steffen <osteffen at redhat.com> wrote:
>     > >> >>> >
>     > >> >>> > On Wed, Feb 1, 2023 at 12:52 PM Ard Biesheuvel <ardb at kernel.org> wrote:
>     > >> >>> >>
>     > >> >>> >> On Wed, 1 Feb 2023 at 10:14, Oliver Steffen <osteffen at redhat.com> wrote:
>     > >> >>> >> >
>     > >> >>
>     > >> >> [...]
>     > >> >>>
>     > >> >>> >> > I am sorry, this story does not seem to be over yet.
>     > >> >>> >> >
>     > >> >>> >> > We are using the Erratum patch and also included the commit 406504c7 in
>     > >> >>> >> > the kernel.
>     > >> >>> >> > Now the firmware crashes sometimes (10 out of 89 tests).
>     > >> >>> >> >
>     > >> >>> >>
>     > >> >>> >> Thanks for the report. Is this still on ThunderX2?
>     > >> >>> >>
>     > >> >>> >> > Any hints are very welcome!
>     > >> >>> >> >
>     > >> >>> >>
>     > >> >>> >> Do you have access to those build artifacts?
>     > >> >>> >
>     > >> >>> >
>     > >> >>> > https://kojihub.stream.centos.org/kojifiles/work/tasks/5251/1835251/edk2-aarch64-20221207gitfff6d81270b5-4.el9.test.noarch.rpm
>     > >> >>> >
>     > >> >>> > and/or here:
>     > >> >>> >
>     > >> >>> > https://kojihub.stream.centos.org/koji/taskinfo?taskID=1835251
>     > >> >>> >
>     > >> >>> > Source for reference:
>     > >> >>> > https://gitlab.com/redhat/centos-stream/src/edk2/-/merge_requests/24
>     > >> >>> >
>     > >> >>>
>     > >> >>> Any chance the .dll files (which are actually ELF executables) have
>     > >> >>> been preserved somewhere?
>     > >> >>
>     > >> >> Here is the build folder (~90MB):
>     > >> >> https://gitlab.com/osteffen/thunderx2-debug/-/raw/main/armvirt-thunderx2-issue.tar.xz
>     > >> >>
>     > >> >> I am waiting for the tests with the additional debug output to run.
>     > >> >
>     > >> >
>     > >> > We reran the test suite with the Erratum and the additional debug
>     > >> > output enabled.  Strangely, the problem does not occur anymore, the
>     > >> > firmware boots up normally.
>     > >> >
>     > >> > We retried the tests without the additional debug output.
>     > >> > RHEL ships two firmware flavors for AARCH64: a silent and a verbose
>     > >> > version.
>     > >>
>     > >> Are these RELEASE vs DEBUG builds?
>     > >
>     > >
>     > > All builds are DEBUG, just the amount of information printed on
>     > > the serial is different (almost zero for the "silent" one.)
>     > >
>     > >>
>     > >> > Both were tried. We see no problems with the verbose
>     > >> > one. The silent one fails noticeably more often if a software TPM device
>     > >> > is present.
>     > >> >
>     > >>
>     > >> This smells like some missing cache or TLB maintenance - the verbose
>     > >> one exits to the host much more often, and likely relies on cache/TLB
>     > >> maintenance occurring in the hypervisor.
>     > >>
>     > >> So the build always includes TPM support but the issue only occurs
>     > >> when the sw TPM is actually exposed by QEMU?
>     > >
>     > >
>     > > Yes.
>     > > All builds include support for TPM, but the issue occurs more frequently
>     > > if a sw TPM is exposed by QEMU.
>     > >
>     >
>     > Any chance you could provide a specific command line for launching
>     > QEMU? I am trying to reproduce this, but I am not making any progress.
>     >
>     > >>
>     > >> > Could this be related to how much stuff is going on in the early phase
>     > >> > of the firmware (when logging is enabled: formatting of messages and
>     > >> > sending to serial port...) ?
>     > >> >
>     > >>
>     > >> I'll try to see if I can rig something up that logs into a buffer
>     > >> rather than straight to the serial, and dump it all out when handling
>     > >> the crash
>     > >>
>     >
>     > This takes a bit more time than I can afford to spend on this atm, and
>     > I'd like to be able to reproduce before I go down this rabbit hole.
>
>     Have there been any developments regarding this issue?
>
>
> Nothing from my side.  I tried to come up with a more reliable/faster reproducer
> but then stopped because of other stuff.
>
> If you have any idea what I could try next let me know.
>
> -Oliver

Hi all,

I had another look at this and can now reproduce the issue consistently
with a fairly minimal setup on a recent Linux kernel, Qemu, and EDK2.
It requires rebooting the guest in a tight loop. It happens in silent and
verbose builds alike, but since the verbose builds are slowed down by the
serial output, it takes longer to hit the issue with them.
With the silent builds it can be reproduced within a few minutes.
For the verbose case I recommend running multiple Qemu instances in
parallel (as many as the machine allows, in my case ~100).

Details:

CPU: Cavium ThunderX2(R) CPU CN9975
Tested on 3 different machines:
    HPE apache, HPE apollo, Gigabyte R181
Kernels tested:
 - 6.2.15-100.fc36.aarch64
 - 5.14.0-312.el9.aarch64
   (contains 406504c7b0405d74d74c15a667cd4c4620c3e7a9,
   "KVM: arm64: Fix S1PTW handling on RO memslots")
Qemu v8.0.0 (RHEL version and build from upstream repo)
EDK2: master branch from 2023-05-16 (cafb4f3f)
gcc 11.3.1

EDK2 build command line:
build \
  -a AARCH64 \
  -p ArmVirtPkg/ArmVirtQemu.dsc \
  -t GCC5 -b DEBUG \
  -D NETWORK_IP6_ENABLE \
  -D NETWORK_HTTP_BOOT_ENABLE \
  -D NETWORK_TLS_ENABLE \
  -D NETWORK_ISCSI_ENABLE \
  -D NETWORK_ALLOW_HTTP_CONNECTIONS \
  -D CAVIUM_ERRATUM_27456=TRUE \
  -D TPM2_ENABLE=TRUE \
  -D TPM1_ENABLE=FALSE \
  -D DEBUG_PRINT_ERROR_LEVEL=0x80000000 \
  -D BUILD_SHELL=TRUE \
  --pcd="gEfiShellPkgTokenSpaceGuid.PcdShellDefaultDelay=0" \
  --pcd="gEfiMdePkgTokenSpaceGuid.PcdPlatformBootTimeOut=0" \
  --hash --cmd-len=65536

To reproduce the issue, I launch the firmware in Qemu and have it reboot
as soon as it has finished booting, via a startup.nsh on the ESP.

Qemu command line:
qemu-system-aarch64 \
    -machine virt,accel=kvm -m 13G \
    -boot menu=off \
    -cpu host \
    -blockdev node-name=code,driver=file,filename="${FW_CODE}",read-only=on \
    -blockdev node-name=vars,driver=file,filename="${FW_VARS}" \
    -machine pflash0=code \
    -machine pflash1=vars \
    -serial stdio \
    -net none \
    -drive file=esp.img,snapshot=on

Other factors, such as the number of CPUs or the presence of a vTPM, have
no influence. I have not tried different amounts of RAM yet.
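
The reboot loop with several parallel instances could be driven by a small
script along the following lines. This is only a sketch, not the harness
that was actually used: the function names, the per-instance copy of the
vars file, the /dev/null stdin, and the use of "Synchronous Exception" as
the crash marker are assumptions made for illustration. The Qemu arguments
are the ones from the command line above, and esp.img is assumed to carry
a startup.nsh that resets the machine.

```python
import shutil
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

# Text printed by the firmware's exception handler when the crash occurs.
CRASH_MARKER = "Synchronous Exception"


def qemu_argv(fw_code, fw_vars, esp="esp.img"):
    """Build the Qemu command line quoted in this mail."""
    return [
        "qemu-system-aarch64",
        "-machine", "virt,accel=kvm", "-m", "13G",
        "-boot", "menu=off",
        "-cpu", "host",
        "-blockdev", f"node-name=code,driver=file,filename={fw_code},read-only=on",
        "-blockdev", f"node-name=vars,driver=file,filename={fw_vars}",
        "-machine", "pflash0=code",
        "-machine", "pflash1=vars",
        "-serial", "stdio",
        "-net", "none",
        "-drive", f"file={esp},snapshot=on",
    ]


def saw_crash(lines):
    """Return True as soon as one serial line contains the crash marker."""
    return any(CRASH_MARKER in line for line in lines)


def run_instance(index, fw_code, fw_vars_template):
    """Run one guest until it crashes or Qemu exits; return (index, crashed)."""
    # Assumption: every instance gets its own writable copy of the vars file.
    fw_vars = f"vars-{index}.fd"
    shutil.copyfile(fw_vars_template, fw_vars)
    proc = subprocess.Popen(
        qemu_argv(fw_code, fw_vars),
        stdin=subprocess.DEVNULL,  # assumption: the guest needs no serial input
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, errors="replace",
    )
    try:
        crashed = saw_crash(proc.stdout)  # blocks while the guest keeps rebooting
    finally:
        proc.kill()
        proc.wait()
    return index, crashed


def main(fw_code, fw_vars_template, instances):
    with ThreadPoolExecutor(max_workers=instances) as pool:
        jobs = [pool.submit(run_instance, i, fw_code, fw_vars_template)
                for i in range(instances)]
        for job in as_completed(jobs):
            index, crashed = job.result()
            print(f"instance {index}: {'crashed' if crashed else 'Qemu exited'}")


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

Invoked as, for example, `python3 loop.py "${FW_CODE}" "${FW_VARS}" 100`
(the script name is hypothetical), it keeps each guest rebooting until the
marker shows up on its serial output.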

Serial output:
[...]
InitializeDxeNxMemoryProtectionPolicy: StackBase = 0x00000000476C5000 StackSize = 0x0000000000020000
InitializeDxeNxMemoryProtectionPolicy: applying strict permissions to active memory regions
SetUefiImageMemoryAttributes - 0x0000000040000000 - 0x00000000076E5000 (0x0000000000004000)
UpdateRegionMappingRecursive(0): 40000000 - 476E5000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(1): 40000000 - 476E5000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 40000000 - 476E5000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 47600000 - 476E5000 set 60000000000400 clr FF9F000000000B3F
SetUefiImageMemoryAttributes - 0x00000000476C5000 - 0x0000000000001000 (0x0000000000006000)
UpdateRegionMappingRecursive(0): 476C5000 - 476C6000 set 60000000000000 clr FF9F000000000B3F
UpdateRegionMappingRecursive(1): 476C5000 - 476C6000 set 60000000000000 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 476C5000 - 476C6000 set 60000000000000 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 476C5000 - 476C6000 set 60000000000000 clr FF9F000000000B3F
SetUefiImageMemoryAttributes - 0x000000004772B000 - 0x00000000007C0000 (0x0000000000004000)
UpdateRegionMappingRecursive(0): 4772B000 - 47EEB000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(1): 4772B000 - 47EEB000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 4772B000 - 47EEB000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 4772B000 - 47800000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 47E00000 - 47EEB000 set 60000000000400 clr FF9F000000000B3F
SetUefiImageMemoryAttributes - 0x0000000047EF3000 - 0x0000000000101000 (0x0000000000004000)
UpdateRegionMappingRecursive(0): 47EF3000 - 47FF4000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(1): 47EF3000 - 47FF4000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 47EF3000 - 47FF4000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 47EF3000 - 47FF4000 set 60000000000400 clr FF9F000000000B3F
SetUefiImageMemoryAttributes - 0x0000000047FFA000 - 0x0000000334AA6000 (0x0000000000004000)
UpdateRegionMappingRecursive(0): 47FFA000 - 37CAA0000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(1): 47FFA000 - 37CAA0000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 47FFA000 - 80000000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 47FFA000 - 48000000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 340000000 - 380000000 set 70C clr 0
UpdateRegionMappingRecursive(3): 37F000000 - 37F200000 set 70C clr 0
UpdateRegionMappingRecursive(2): 340000000 - 37CAA0000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 37CA00000 - 37CC00000 set 70C clr 0
UpdateRegionMappingRecursive(3): 37CA00000 - 37CAA0000 set 60000000000400 clr FF9F000000000B3F
SetUefiImageMemoryAttributes - 0x000000037CB40000 - 0x00000000031F9000 (0x0000000000004000)
UpdateRegionMappingRecursive(0): 37CB40000 - 37FD39000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(1): 37CB40000 - 37FD39000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(2): 37CB40000 - 37FD39000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 37CB40000 - 37CC00000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 37F000000 - 37F200000 set 60000000000400 clr FF9F000000000B3F
UpdateRegionMappingRecursive(3): 37FC00000 - 37FE00000 set 70C clr 0
UpdateRegionMappingRecursive(3): 37FC00000 - 37FD39000 set 60000000000400 clr FF9F000000000B3F


Synchronous Exception at 0x000000037FD3C0A8
PC 0x00037FD3C0A8 (0x00037FD39000+0x000030A8) [ 0] ArmCpuDxe.dll
PC 0x00037FD3C0A8 (0x00037FD39000+0x000030A8) [ 0] ArmCpuDxe.dll
PC 0x00037FD3BE70 (0x00037FD39000+0x00002E70) [ 0] ArmCpuDxe.dll
PC 0x00037FD3BE70 (0x00037FD39000+0x00002E70) [ 0] ArmCpuDxe.dll
PC 0x00037FD3C2E4 (0x00037FD39000+0x000032E4) [ 0] ArmCpuDxe.dll
PC 0x0000476E78F8 (0x0000476E5000+0x000028F8) [ 1] DxeCore.dll
PC 0x0000476ED680 (0x0000476E5000+0x00008680) [ 1] DxeCore.dll
PC 0x0000476F2744 (0x0000476E5000+0x0000D744) [ 1] DxeCore.dll
PC 0x0000476ECDE8 (0x0000476E5000+0x00007DE8) [ 1] DxeCore.dll
PC 0x00037FD3D2DC (0x00037FD39000+0x000042DC) [ 2] ArmCpuDxe.dll
PC 0x0000476EC788 (0x0000476E5000+0x00007788) [ 3] DxeCore.dll
PC 0x0000476F9CA8 (0x0000476E5000+0x00014CA8) [ 3] DxeCore.dll
PC 0x0000476EFEF0 (0x0000476E5000+0x0000AEF0) [ 3] DxeCore.dll

[ 0] /root/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/ArmPkg/Drivers/CpuDxe/CpuDxe/DEBUG/ArmCpuDxe.dll
[ 1] /root/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 2] /root/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/ArmPkg/Drivers/CpuDxe/CpuDxe/DEBUG/ArmCpuDxe.dll
[ 3] /root/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll

  X0 0x000000037F10BFF0   X1 0x000000037F106003   X2 0x000000000037FC00   X3 0x0000000000000000
  X4 0x0000000000000200   X5 0x0000000000000004   X6 0x0000000000000000   X7 0x000000037FD3F4B5
  X8 0x0000000000000000   X9 0x0000000000000002  X10 0x0000000000000000  X11 0x0000000000000000
 X12 0x0000000000000002  X13 0x0000000000000002  X14 0x0000000000000001  X15 0x0000000000000002
 X16 0x000000037FD3A268  X17 0x00000000007AFA10  X18 0x0000000000000000  X19 0x000000037FC00000
 X20 0x0000000000000002  X21 0x000000037F106003  X22 0x000000037F10B000  X23 0x000000037FD42000
 X24 0x00000000001FFFFF  X25 0x000000037FD39000  X26 0x000000037F106000  X27 0x0000000000000003
 X28 0x000000037F10BFF0   FP 0x00000000476E4780   LR 0x000000037FD3C0A8

  V0 0x0000000000000000 0000000000000000   V1 0x0000000000000000 0000000000000000
  V2 0x0000000000000000 0000000000000000   V3 0x0000000000000000 0000000000000000
  V4 0x0000000000000000 0000000000000000   V5 0x0000000000000000 0000000000000000
  V6 0x0000000000000000 0000000000000000   V7 0x0000000000000000 0000000000000000
  V8 0x0000000000000000 0000000000000000   V9 0x0000000000000000 0000000000000000
 V10 0x0000000000000000 0000000000000000  V11 0x0000000000000000 0000000000000000
 V12 0x0000000000000000 0000000000000000  V13 0x0000000000000000 0000000000000000
 V14 0x0000000000000000 0000000000000000  V15 0x0000000000000000 0000000000000000
 V16 0x0000000000000000 0000000000000000  V17 0x0000000000000000 0000000000000000
 V18 0x0000000000000000 0000000000000000  V19 0x0000000000000000 0000000000000000
 V20 0x0000000000000000 0000000000000000  V21 0x0000000000000000 0000000000000000
 V22 0x0000000000000000 0000000000000000  V23 0x0000000000000000 0000000000000000
 V24 0x0000000000000000 0000000000000000  V25 0x0000000000000000 0000000000000000
 V26 0x0000000000000000 0000000000000000  V27 0x0000000000000000 0000000000000000
 V28 0x0000000000000000 0000000000000000  V29 0x0000000000000000 0000000000000000
 V30 0x0000000000000000 0000000000000000  V31 0x0000000000000000 0000000000000000

  SP 0x00000000476E4780  ELR 0x000000037FD3C0A8  SPSR 0x80000205  FPSR 0x00000000
 ESR 0x86000006          FAR 0x000000037FD3C0A8

 ESR : EC 0x21  IL 0x1  ISS 0x00000006

Instruction abort: Translation fault, second level

Stack dump:
  00000476E4680: 0000000000000001 0000000000000004 00000000476E4700 00000000476F3980
  00000476E46A0: 000000037FD40CBD 0000000000000003 000000037FC00000 000000037FD39000
  00000476E46C0: 0060000000000400 FF9F000000000B3F 00000000476E4780 000000037FD3BE70
  00000476E46E0: 000000037FC00000 0000000000000002 000000037F106000 000000037F10B000
  00000476E4700: 0000000000000FF0 00000000001FFFFF 000000037FD39000 000000037F106000
  00000476E4720: 0000000000000003 000000037F10BFF0 0060000000000400 FF9F000000000B3F
  00000476E4740: 000000037FD39000 000000037FD39000 00000000476E4780 0060000000000403
  00000476E4760: 0000000C00000001 000000037FD3F90E 0000000000000400 000000037F10B000
> 00000476E4780: 00000000476E4830 000000037FD3BE70 000000037CB40000 0000000000000001
  00000476E47A0: 000000037F10B000 0000000047FFE000 0000000000000068 000000003FFFFFFF
  00000476E47C0: 000000037FD39000 000000037F10C528 0000000000000002 0000000047FFE068
  00000476E47E0: 0060000000000400 FF9F000000000B3F 0000000300000001 000000037FD39000
  00000476E4800: 000000017FD40CBD 0060000000000401 0000001500000001 000000037FD3F90E
  00000476E4820: 0060000000000400 000000037F106000 00000000476E48E0 000000037FD3BE70
  00000476E4840: 000000037CB40000 0000000000000000 0000000047FFE000 0000000047FFF000
  00000476E4860: 0000000000000000 0000007FFFFFFFFF 000000037FD39000 000000037F10C528
ASSERT [ArmCpuDxe] /root/edk2/ArmPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c(333): ((BOOLEAN)(0==1))
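
For anyone who wants to double-check the "ESR : EC 0x21  IL 0x1  ISS
0x00000006" line in the dump above, the fields can be recomputed from the
raw ESR value. The following is a minimal sketch, assuming the standard
AArch64 ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits
[24:0]) and covering only the four fault classes that carry a lookup level.
The function and table names are made up for illustration.

```python
# Fault classes keyed by bits [5:2] of the fault status code; the low two
# bits of the status code give the translation table level.
FAULT_CLASSES = {
    0b0000: "Address size fault",
    0b0001: "Translation fault",
    0b0010: "Access flag fault",
    0b0011: "Permission fault",
}


def decode_esr(esr):
    """Split a 32-bit ESR_ELx value into its EC, IL and ISS fields."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    return ec, il, iss


def describe_abort(iss):
    """Describe the fault status code of an instruction or data abort."""
    fsc = iss & 0x3F
    kind = FAULT_CLASSES.get(fsc >> 2)
    if kind is None:
        return f"FSC 0x{fsc:02X} (not decoded by this sketch)"
    return f"{kind}, level {fsc & 0x3}"


ec, il, iss = decode_esr(0x86000006)
print(f"EC 0x{ec:02X}  IL 0x{il:X}  ISS 0x{iss:08X}")
print(describe_abort(iss))
```

EC 0x21 is an instruction abort taken without a change in exception level,
and the status code decodes to a level 2 translation fault, which matches
what the firmware's exception handler printed.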



The full log is available here:
https://gitlab.com/osteffen/thunderx2-debug/-/raw/main/2023-05-19/85.log?inline=false

Debug files, firmware binaries, and the full build tree are here:
https://gitlab.com/osteffen/thunderx2-debug/-/tree/main/2023-05-19
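
To look up the backtrace frames in those debug files, the per-image
offsets can be pulled out of the "PC ... (base+offset) [n] image" lines
with something like the sketch below. The regular expression and function
name are assumptions made for illustration. Whether an offset can be
handed unchanged to a symbolizer run against the .dll (ELF) file depends
on how the image was linked and converted, so that step is left out.

```python
import re

# Matches e.g. "PC 0x00037FD3C0A8 (0x00037FD39000+0x000030A8) [ 0] ArmCpuDxe.dll"
FRAME_RE = re.compile(
    r"PC 0x(?P<pc>[0-9A-Fa-f]+) "
    r"\(0x(?P<base>[0-9A-Fa-f]+)\+0x(?P<off>[0-9A-Fa-f]+)\) "
    r"\[\s*\d+\] (?P<image>\S+)"
)


def parse_frames(text):
    """Return (image, offset) for every backtrace line found in text."""
    frames = []
    for m in FRAME_RE.finditer(text):
        pc, base, off = (int(m[name], 16) for name in ("pc", "base", "off"))
        if pc != base + off:
            raise ValueError(f"inconsistent frame: {m.group(0)}")
        frames.append((m["image"], off))
    return frames


SAMPLE = """\
PC 0x00037FD3C0A8 (0x00037FD39000+0x000030A8) [ 0] ArmCpuDxe.dll
PC 0x0000476E78F8 (0x0000476E5000+0x000028F8) [ 1] DxeCore.dll
"""

for image, off in parse_frames(SAMPLE):
    print(f"{image}+0x{off:X}")
```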

I am able to reproduce this quickly, so any ideas for what I can try
are welcome :-)

Thanks
-Oliver

View/Reply Online (#105084): https://edk2.groups.io/g/devel/message/105084