[edk2-devel] [PATCH 1/1] MdeModulePkg/Ata: Fix command status reporting

Mon Jan 2 16:30:46 UTC 2023

Hello,

I've done some investigation into the password issue and I think I have the proposition on how to solve it. The issue arises due to the driver checking the PxIS.TFES(task file error status) and if TFES is set it performs recovery steps(this is correct according to AHCI spec 6.2.2) and restarts the command(this is left up to the SW according to AHCI spec). In the case of SECURITY UNLOCK command if the password is incorrect device will set the abort bit in ERROR register and error bit in status register which will cause the controller to set the PxIS.TFES bit which will  trigger command retires. In case of security commands retries are particularly bad since every incorrect password unlock attempt is decreasing the retry counter on the device side which eventually leads to device locking itself and aborting further unlock commands(power cycle is required to recover).

My proposition for a fix is as follows:

1. The logic in AhciRecoverPortError seems to be largely correct. Even if the TFES error bit shouldn't necessarily mean that the command should be retried, according to AHCI spec when TFES is set controller will enter the ERR:Fatal state and PxCMD.CR needs to be restarted. The only thing that we need to change in this function is to return EFI_DEVICE_ERROR when the port restart failed(as discussed below).
2. We need to add a logic which will decide if the cmd should be retried. This logic should check following conditions:

a) If HBDS is set command should be attempted again. HBDS indicates that there was a CRC error during accessing the system memory during DMA transfer(AHCI spec 6.1.1).
b) If HBFS is set do not attempt to send the command again. HBFS indicates that the pointer given to the AHCI controller is incorrect. Retries won't solve that(AHCI spec 6.1.1)
c) If IFS or INFS is set further check PxSERR.ERR and PxSERR.DIAG(based on AHCI spec 6.1.2).
i) for PxSERR.DIAG.C(CRC error) -> restart command
ii) for PxSERR.DIAG.B(disparity error) -> CRC error should also be reported so it is covered by above
iii) for PxSERR.ERR.P(protocol error) -> do not restart. Protocol errors are unlikely to be transient although I am not sure if Internal buffer overflows can't be caused by transient conditions on the devices side.
iv) for PxSERR.DIAG.H(handshake errors) -> do not restart. Acording to spec handshake errors might be caused by transient conditions such as CRC errors but I hope in such case HBA will also signal CRC in PxSERR.DIAG.C(spec is not super clear on that point though)
d) If TFES is set further check PxTFD.ERR if PxTFD.ERR bit 7 is set(INTERFACE CRC according to ACS-3 spec). For others do not retry.

Hopefully this will catch majority of errors caused by transient conditions while also avoiding restarts on commands that are simply malformed. I will start working on a patch and if anyone has any suggestions/objections please respond.

I will also update bz with this data: https://bugzilla.tianocore.org/show_bug.cgi?id=4011 ( https://bugzilla.tianocore.org/show_bug.cgi?id=4011 )

Thanks,
Mateusz

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#97840): https://edk2.groups.io/g/devel/message/97840
Mute This Topic: https://groups.io/mt/94411389/1813853
Group Owner: devel+owner at edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [edk2-devel-archive at redhat.com]
-=-=-=-=-=-=-=-=-=-=-=-

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/edk2-devel-archive/attachments/20230102/30fcf0b1/attachment.htm>