[edk2-devel] [PATCH 1/1] OvmfPkg/NestedInterruptTplLib: replace ASSERT() with a warning logged.

Fri Apr 28 14:50:04 UTC 2023

On 28/04/2023 14:38, Gerd Hoffmann wrote:
> I suspect the windows boot loader does something fishy here, but I can't
> proof it, I have not yet pinned down the exact location where interrupts
> get enabled while running at IPL=TPL_HIGH_LEVEL (which is what I suspect
> is happening, but of course this is not the only possible theory how
> that ASSERT got triggered).
> 
> Not fully sure how to best continue debugging this, I don't think gdb
> can set an watchpoint on eflags.if ...

In the absence of any better ideas, I'd be tempted to run QEMU via TCG 
instead of KVM, and hack the STI instruction definition to check the 
location in guest memory where gEfiCurrentTpl gets stored in your test 
build.  (It's brute force debugging, but it should find the culprit.)

>> This workaround looks wrong to me: it will cause the subsequent call to
>> NestedInterruptRestoreTPL() to drop the TPL back down to TPL_HIGH_LEVEL-1,
>> which is lower than it would have been when the interrupt occurred.  The
>> completed hardware interrupt would then result in an overall change of TPL,
>> which cannot be correct.
> 
> The idea was to set InterruptedTPL to the highest level which is allowed
> to run with interrupts enabled and continue running with interrupts
> enabled, hoping that things get back into normal once the next timer
> interrupt arrives.
> 
> Which -- now that I'm thinking about it again -- is clearly bogus, we
> are in the timer interrupt so IRQs have already been disabled at that
> point, so the fixup idea doesn't make much sense.
> 
> So just leave InterruptedTPL alone and hope for the best?

I think so.  CoreRestoreTpl(TPL_HIGH_LEVEL) should be a no-op, so it 
would be reasonable to expect that 
NestedInterruptRestoreTPL(TPL_HIGH_LEVEL, ...) should also be a no-op.

 From a quick check of the implementation, I'm pretty sure this will be 
the case.  However, the logic is already (necessarily) pretty complex 
and so I am not 100% sure of this.  The reasoning behind the logic in 
NestedInterruptTplLib relies on certain axioms and invariants (e.g. that 
there are a finite number of distinct TPLs, and that there are certain 
places within the code that further interrupts provably cannot occur), 
and so it gets very difficult to reason about when one of those 
invariants is violated, as it seems to be in this situation.

If it seems to work, then I think it would be plausible to replace the

   ASSERT (InterruptedTPL < TPL_HIGH_LEVEL);

with a warning message, and to return with InterruptedTPL unmodified. 
But I'd prefer to understand how this invariant violation arises in the 
first place.

Good luck!

Michael

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#103780): https://edk2.groups.io/g/devel/message/103780
Mute This Topic: https://groups.io/mt/98554842/1813853
Group Owner: devel+owner at edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [edk2-devel-archive at redhat.com]
-=-=-=-=-=-=-=-=-=-=-=-