[Crash-utility] Interpreting bt

Thu Jan 24 14:33:38 UTC 2013

----- Original Message -----
> 
> Hello,
> 
> 
> I am using crash version: 6.0.4-2.el6 on CentOS 6.3 (kernel
> 2.6.32-279.el6.x86_64). I apologize for my newbie questions, but
> googling did not help much.
> 
> When analyzing a kernel dump, I am getting the following bt.
> 
> crash> bt
> PID: 12663 TASK: ffff88036304f500 CPU: 0 COMMAND: "bash"
> #0 [ffff88035b949570] machine_kexec at ffffffff8103281b
> #1 [ffff88035b9495d0] crash_kexec at ffffffff810ba662
> #2 [ffff88035b9496a0] oops_end at ffffffff81501290
> #3 [ffff88035b9496d0] no_context at ffffffff81043bab
> #4 [ffff88035b949720] __bad_area_nosemaphore at ffffffff81043e35
> #5 [ffff88035b949770] bad_area at ffffffff81043f5e
> #6 [ffff88035b9497a0] __do_page_fault at ffffffff81044710
> #7 [ffff88035b9498c0] do_page_fault at ffffffff8150326e
> #8 [ffff88035b9498f0] page_fault at ffffffff81500625
> [exception RIP: ahaann+47]
> RIP: ffffffffa06ce48f RSP: ffff88035b9499a8 RFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88035daef4e0
> RBP: ffff88035b9499b8 R8: 0000000004a47daf R9: ffffffffa06dae99
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
> R13: 00007fc82f4b8000 R14: 000000000000000a R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #9 [ffff88035b9499c0] ahaecho at ffffffffa06d2899 [ahadrv]
> #10 [ffff88035b949a00] writectl at ffffffffa06c366e [ahadrv]
> #11 [ffff88035b949e40] writeaha at ffffffffa06d3e7b [ahadrv]
> #12 [ffff88035b949e60] proc_file_write at ffffffff811e6e44
> #13 [ffff88035b949ea0] proc_reg_write at ffffffff811e0abe
> #14 [ffff88035b949ef0] vfs_write at ffffffff8117b068
> #15 [ffff88035b949f30] sys_write at ffffffff8117ba81
> #16 [ffff88035b949f80] system_call_fastpath at ffffffff8100b0f2
> RIP: 0000003a29ada3c0 RSP: 00007ffffaec6830 RFLAGS: 00010202
> RAX: 0000000000000001 RBX: ffffffff8100b0f2 RCX: 0000000000000065
> RDX: 000000000000000a RSI: 00007fc82f4b8000 RDI: 0000000000000001
> RBP: 00007fc82f4b8000 R8: 000000000000000a R9: 00007fc82f4aa700
> R10: 00000000fffffff7 R11: 0000000000000246 R12: 000000000000000a
> R13: 0000003a29d8c780 R14: 000000000000000a R15: 0000000001e18460
> ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
> crash>
> 
> 
> 1. Are the hex addr in [] right before the function name the stack
> frame ptr for that function?

On x86_64 machines, the "at <address>" shown is a return address: the
location in that frame's function to which the call it made will return.
So, for example, taking frame #15, where "sys_write at ffffffff8117ba81"
has called vfs_write(), you can disassemble all instructions from the
beginning of sys_write() up to that address, as in this example (taken
from a different kernel, so the addresses differ from yours):

 crash> dis -r ffffffff80016e6b
 0xffffffff80016e26 <sys_write>: push   %r13
 0xffffffff80016e28 <sys_write+2>:       mov    %rsi,%r13
 0xffffffff80016e2b <sys_write+5>:       push   %r12
 0xffffffff80016e2d <sys_write+7>:       mov    $0xfffffffffffffff7,%r12
 0xffffffff80016e34 <sys_write+14>:      push   %rbp
 0xffffffff80016e35 <sys_write+15>:      mov    %rdx,%rbp
 0xffffffff80016e38 <sys_write+18>:      push   %rbx
 0xffffffff80016e39 <sys_write+19>:      sub    $0x18,%rsp
 0xffffffff80016e3d <sys_write+23>:      lea    0x14(%rsp),%rsi
 0xffffffff80016e42 <sys_write+28>:      callq  0xffffffff8000b5b4 <fget_light>
 0xffffffff80016e47 <sys_write+33>:      test   %rax,%rax
 0xffffffff80016e4a <sys_write+36>:      mov    %rax,%rbx
 0xffffffff80016e4d <sys_write+39>:      je     0xffffffff80016e86 <sys_write+96>
 0xffffffff80016e4f <sys_write+41>:      mov    0x38(%rax),%rax
 0xffffffff80016e53 <sys_write+45>:      lea    0x8(%rsp),%rcx
 0xffffffff80016e58 <sys_write+50>:      mov    %rbp,%rdx
 0xffffffff80016e5b <sys_write+53>:      mov    %r13,%rsi
 0xffffffff80016e5e <sys_write+56>:      mov    %rbx,%rdi
 0xffffffff80016e61 <sys_write+59>:      mov    %rax,0x8(%rsp)
 0xffffffff80016e66 <sys_write+64>:      callq  0xffffffff800164d0 <vfs_write>
 0xffffffff80016e6b <sys_write+69>:      mov    %rax,%r12
 crash>

And the stack address shown in brackets for the frame is the stack
location that contains that return address.
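
To make the layout of a frame line concrete, here is a minimal sketch
(not part of crash itself) that splits one "bt" frame line into its
fields.  The regular expression and the names parse_frame, stack_addr
and return_addr are my own, chosen for illustration, and the pattern
only assumes the x86_64 line format shown in the output above.

```python
import re

# Pattern for one crash "bt" frame line on x86_64, e.g.
#   #15 [ffff88035b949f30] sys_write at ffffffff8117ba81
# An optional trailing "[module]" names the module the symbol belongs to.
FRAME_RE = re.compile(
    r"#(?P<num>\d+)\s+\[(?P<stack>[0-9a-f]+)\]\s+"
    r"(?P<func>\S+)\s+at\s+(?P<ret>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)

def parse_frame(line):
    """Return a dict describing one bt frame line, or None if it is not one."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "num": int(m.group("num")),
        # Bracketed value: where on the stack the return address is stored.
        "stack_addr": int(m.group("stack"), 16),
        "func": m.group("func"),
        # "at" value: the return address inside func.
        "return_addr": int(m.group("ret"), 16),
        "module": m.group("module"),
    }

frame = parse_frame("#15 [ffff88035b949f30] sys_write at ffffffff8117ba81")
print(frame["func"], hex(frame["stack_addr"]), hex(frame["return_addr"]))
```

Lines that are not frame lines, such as the register dump or the
"[exception RIP: ...]" line, simply return None.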

> 
> 2. I am assuming the panic occurred in function ahaann() (and not in
> ahaecho() ). Is that right?

That's correct.  The exception occurred precisely when executing the
instruction shown here: [exception RIP: ahaann+47], which is at RIP
ffffffffa06ce48f.

You can do a "dis -r ahaann+47" to see the instructions leading up
to the fatal one.  If you load the ahadrv module's debug data with
"mod -s ahadrv", you can also get line numbers interspersed with
"dis -rl ahaann+47".

> 
> 3. What is puzzling me is why there is no frame associated with call
> to ahaann(). Or is frame #8 associated to ahaann(). From the display
> it seems frame #8 is associated to page_fault() since 0xffffffff81500625
> is an address in page_fault(). Or am totally misinterpreting the call stack.
> 
> crash> dis ffffffff81500625
> 0xffffffff81500625 <page_fault+37>: jmpq 0xffffffff81500830

The ahaann() function didn't lay down a full frame because while it
was executing, it took a page fault exception.  As soon as that
occurred, an exception frame was dumped onto the stack at that
point (the register dump).  Control at that point was transferred
to page_fault() to handle the exception.  Normally the handler
would quietly resolve the page fault and return to ahaann(),
and the function would continue on.  But the address that caused
the page fault was bogus/unresolvable, so it never returned, and
instead crashed the system.

So again, what you should do is:

 crash> mod -s ahadrv   (presuming you've got the kernel-debuginfo package installed)
 ...
 crash> dis -rl ahaann+47

And look at the last instruction shown.  My guess is that it's
referencing a location through a NULL pointer (probably via one of
the NULL-filled RAX, RBX, RCX, RDX or RSI registers).
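
As a quick way to see which registers are candidates, the following
rough sketch lists the registers in the exception frame that hold
zero.  It only parses the register dump pasted from the bt output
above; the helper names parse_registers and zero_registers are
hypothetical, and the result says nothing about which register the
faulting instruction actually used.  Only the disassembly can tell
you that.

```python
import re

# Exception-frame register dump copied from the bt output in this thread.
DUMP = """
RIP: ffffffffa06ce48f RSP: ffff88035b9499a8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88035daef4e0
RBP: ffff88035b9499b8 R8: 0000000004a47daf R9: ffffffffa06dae99
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
R13: 00007fc82f4b8000 R14: 000000000000000a R15: 0000000000000000
"""

def parse_registers(text):
    """Map each 'NAME: hexvalue' pair in the dump to an integer."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9_]*):\s+([0-9a-f]+)\b", text)
    return {name: int(value, 16) for name, value in pairs}

def zero_registers(regs):
    """Return the names of registers whose value is 0, in dump order."""
    return [name for name, value in regs.items() if value == 0]

regs = parse_registers(DUMP)
print(zero_registers(regs))
```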

> 
> 4. I can understand the value of register dump for frame #8, due to
> the panic. What is the significance of the register dump for frame
> #16.

Whenever a program running in user-space enters the kernel, it does
so as the result of an exception, be it a system call, page fault,
interrupt, etc.  And as with the in-kernel page fault exception, the
user's register set is laid down at the top of the stack so that it
can be restored upon return to user-space.
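
One way to tell the two register dumps apart is the CS value: on
x86, the low two bits of the code segment selector give the privilege
level that was in effect when the registers were saved.  A small
sketch, with function names of my own choosing:

```python
def privilege_level(cs):
    """Low two bits of an x86 CS selector: 0 is kernel mode, 3 is user mode."""
    return cs & 0x3

def describe(cs):
    """Return a short label for the mode the saved registers came from."""
    level = privilege_level(cs)
    if level == 0:
        return "kernel"
    if level == 3:
        return "user"
    return "ring %d" % level

print(describe(0x0010))  # CS from the exception frame after frame #8
print(describe(0x0033))  # CS from the register dump after frame #16
```

So the dump after frame #8 was saved while running in the kernel, and
the one after frame #16 was saved on entry from user-space.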

Dave