[Crash-utility] [PATCH V2] take Hardware Error & kernel pointer bug as separate panicmsg

Thu Feb 5 14:31:48 UTC 2015

----- Original Message -----
> Too many kinds of panic are categorized under the same "Oops: xxxx"
> message, which makes this field ambiguous and not very useful:
> 
>        PANIC: "Oops: 0000 [#1] SMP " (check log for details)
> 
> This patch separates out 3 kinds of panicmsg, the most frequent cases
> among the machines I manage. The match strings are copied exactly
> from the kernel source code. After applying it, I get panicmsg output like:
> 
>  include/linux/kernel.h:#define HW_ERR
>           panicmsg: "[Hardware Error]: CPU 7: Machine Check Exception: 5 Bank
>           11: f200003f000100b2"
>  drivers/char/sysrq.c:__handle_sysrq
>           panicmsg: "SysRq : Trigger a crash"
>  arch/x86/kernel/traps.c:do_general_protection
>           panicmsg: "general protection fault: 8800 [#1] SMP"
>  arch/x86/mm/fault.c:show_fault_oops
>           panicmsg: "BUG: unable to handle kernel paging request at
>           00001248a68eb328"
> 
> We need to move the SysRq matching lines before the "Oops" matching, because
> a SysRq-triggered crash usually also logs an Oops line, and the SysRq match
> needs to take precedence.
> 
> Signed-off-by: Derek Che <drc at yahoo-inc.com>

Hi Derek,

As I mentioned earlier, in addition to checking for the general 
protection faults, in my testing I found several other instances 
where the "Oops" message could be replaced with the more meaningful
messages that preceded it, such as double faults, divide errors, 
stack segment faults, "Kernel BUG" (with a capital K), "Unable to 
handle kernel ..." (with a capital U), etc.  I also added a few
break statements so that the search stops once a searched-for message
is found instead of continuing to parse the kernel log.
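
As a rough illustration of the precedence idea described above (this is
not the crash source, which is written in C), the sketch below tries the
more specific messages before the generic "Oops: " string and stops at the
first hit.  The names PANIC_PATTERNS and find_panicmsg are hypothetical,
and the pattern list is limited to strings mentioned in this thread; the
real list and its ordering may differ.

```python
# Hypothetical priority list: specific messages first, generic "Oops: " last.
# The strings are taken from the messages quoted in this thread.
PANIC_PATTERNS = [
    "SysRq : Trigger a crash",
    "general protection fault: ",
    "Kernel BUG",
    "BUG: unable to handle kernel ",
    "Unable to handle kernel ",
    "Oops: ",
]


def find_panicmsg(log_lines, patterns=PANIC_PATTERNS):
    """Return the first log text matching the highest-priority pattern.

    Patterns are tried in list order, so an earlier pattern wins even if a
    later pattern matches a line that appears sooner in the log.
    """
    for pattern in patterns:
        for line in log_lines:
            idx = line.find(pattern)
            if idx != -1:
                # Drop any timestamp/loglevel prefix before the match and
                # stop searching, analogous to breaking out of the loop.
                return line[idx:].rstrip()
    return None


# Made-up log excerpts for illustration only.
paging_log = [
    "BUG: unable to handle kernel paging request at 00001248a68eb328",
    "Oops: 0000 [#1] SMP ",
]
sysrq_log = [
    "SysRq : Trigger a crash",
    "BUG: unable to handle kernel NULL pointer dereference at (null)",
    "Oops: 0002 [#1] SMP ",
]

paging_msg = find_panicmsg(paging_log)
sysrq_msg = find_panicmsg(sysrq_log)
```

With these inputs the paging-request line is chosen over the "Oops"
line, and the SysRq line is chosen over both of the lines that follow it.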

However, the machine check string search does follow the "kernel panic - "
check, which I understand you would prefer to be the opposite.  The 
fatal error strings being searched for come from die()
calls, or from other message sources that are part of the kernel crash
sequence.  On the other hand, the machine check messages are generated
from a stream of pr_emerg(HW_ERR) calls, and are not necessarily
(although likely) precursors to a crash.  But since the kernel panic
message does contain the "Fatal machine check" message, the reason
behind the crash is readily evident.  
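
To make the ordering trade-off concrete, here is a small sketch that
runs both orderings over the same made-up log.  The helper first_match
and the log contents are assumptions for illustration only, and the exact
wording of the panic line can vary between kernel versions.

```python
def first_match(log_lines, patterns):
    """Return the log text for the first pattern (in order) found in the log."""
    for pattern in patterns:
        for line in log_lines:
            idx = line.find(pattern)
            if idx != -1:
                return line[idx:].rstrip()
    return None


# Made-up log: HW_ERR-prefixed output followed by the panic it led to.
log = [
    "[Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 11: f200003f000100b2",
    "Kernel panic - not syncing: Fatal machine check",
]

# Ordering described in the reply: panic check first, machine check second.
panic_first = first_match(log, ["Kernel panic - ", "[Hardware Error]: "])

# Ordering preferred in the original patch: machine check first.
mce_first = first_match(log, ["[Hardware Error]: ", "Kernel panic - "])
```

Either ordering points at a machine check for this log: the first via the
"Fatal machine check" text in the panic line, the second via the
[Hardware Error] line itself.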

I appreciate your getting the ball rolling here, as it was certainly
due for an update/improvement.

Queued for crash-7.1.0:

  https://github.com/crash-utility/crash/commit/c3840016bf1770b6b1cf571202f2c554fcd1cf55

Thanks,
  Dave