[Crash-utility] [PATCH] ppc64: fix 'bt' command for vmcore captured with fadump.

Mon Jan 23 18:13:24 UTC 2017

----- Original Message -----
> 
> 
> On Saturday 21 January 2017 02:00 AM, Dave Anderson wrote:
> >
> > ----- Original Message -----
> >
> > ... [cut] ...
> >
> >>> Also, the exception frame doesn't even show the [bracketed] type of
> >>> exception that occurred -- it's just a register dump followed by the
> >>> remainder of the backtrace.  Upon a quick glance, it's not obvious that
> >>> they are even active tasks.  And traditionally, all of the other
> >>> architectures have always dumped a full trace.
> >>>
> >>> I'm not sure what the mechanism is for shutting down the non-active
> >>> FADUMP tasks, so that's why I asked if you could restrict this change
> >>> to just those types of dumps.  (For that matter, is it even possible to
> >>> differentiate a real kdump from an FADUMP dumpfile --  aside from a
> >> Hi Dave,
> >>
> >> Differentiating a kdump from an fadump dumpfile is not possible, except
> >> that the stack search would invariably fail and ptregs are guaranteed to
> >> be saved by firmware in case of fadump. I posted a v2 that doesn't change
> >> bt output for anything but active tasks in case of fadump.
> >
> > Ok, so let me get this straight.  The only difference I see with the v2
> > patch is that fadump non-panicking active tasks change from something
> > like this:
> >    
> >    PID: 0      TASK: c000000000e7f6d0  CPU: 0   COMMAND: "swapper"
> >     #0 [c000000000f2ba30] (null) at 3aae291c67  (unreliable)
> >     #1 [c000000000f2bae0] .tick_dev_program_event at c0000000000d16fc
> >     #2 [c000000000f2bb90] .__hrtimer_start_range_ns at c0000000000c4bcc
> >     #3 [c000000000f2bcb0] .tick_nohz_stop_sched_tick at c0000000000d2d30
> >     #4 [c000000000f2bdc0] .cpu_idle at c000000000015bf0
> >     #5 [c000000000f2be70] .rest_init at c000000000009de4
> >     #6 [c000000000f2bef0] .start_kernel at c000000000850eb4
> >     #7 [c000000000f2bf90] .start_here_common at c0000000000083d8
> >    
> > to this:
> >    
> >    PID: 0      TASK: c000000000e7f6d0  CPU: 0   COMMAND: "swapper"
> >     #0 [c000000000f2bd50] (null) at 0  (unreliable)
> >     #1 [c000000000f2bdc0] .cpu_idle at c000000000015bf0
> >     #2 [c000000000f2be70] .rest_init at c000000000009de4
> >     #3 [c000000000f2bef0] .start_kernel at c000000000850eb4
> >     #4 [c000000000f2bf90] .start_here_common at c0000000000083d8
> >    
> > But with your v1 patch, you also dumped the exception frame:
> >    
> >    PID: 0      TASK: c000000000e7f6d0  CPU: 0   COMMAND: "swapper"
> >     R0:  0000000000000000    R1:  c000000000f2bd50    R2:  c000000000f27628
> >     R3:  0000000000000000    R4:  0000000000000000    R5:  8000000002144400
> >     R6:  800000001314c4f8    R7:  0000000000000000    R8:  0000000000000000
> >     R9:  ffffffffffffffff    R10: 0000000000000000    R11: 80003fbff901700c
> >     R12: 0000000000000000    R13: c000000000ff2500    R14: 0000000001a3fa58
> >     R15: 00000000002230a8    R16: 0000000000223150    R17: 0000000000223144
> >     R18: 0000000000c8a098    R19: 0000000002b13a58    R20: 0000000000000000
> >     R21: 0000000002b135d8    R22: 0000000002b13530    R23: 0000000002280000
> >     R24: 0000000002b135f0    R25: c000000000fd5c48    R26: c0000000010942f0
> >     R27: c0000000010942f0    R28: c0000000005fd168    R29: 0000000000000008
> >     R30: c000000000eb1d68    R31: c000000000f28080
> >     NIP: c000000000055730    MSR: 8000000000009032    OR3: 0000000000000000
> >     CTR: 0000000000000000    LR:  c000000000057350    XER: 0000000000000000
> >     CCR: 0000000024000048    MQ:  0000000000000000    DAR: 000001000ad763b0
> >     DSISR: 0000000000000000     Syscall Result: 0000000000000000
> >     NIP [c000000000055730] .plpar_hcall_norets
> >     LR  [c000000000057350] .pseries_shared_idle_sleep
> >     #0 [c000000000f2bd50] (null) at 0  (unreliable)
> >     #1 [c000000000f2bdc0] .cpu_idle at c000000000015bf0
> >     #2 [c000000000f2be70] .rest_init at c000000000009de4
> >     #3 [c000000000f2bef0] .start_kernel at c000000000850eb4
> >     #4 [c000000000f2bf90] .start_here_common at c0000000000083d8
> >    
> > Again, I don't understand how the non-panicking active tasks are stopped
> > by the fadump facility, but is it because you cannot differentiate kdumps
> > from fadumps that you don't show the exception frame with the v2 patch?
> 
> Hi Dave,
> 
> The crashing cpu makes the rtas call ibm,os-term to the firmware, which
> saves the regs info of all online cpus. AFAIK, the kernel does not set an
> exception frame marker (which we use to detect an exception frame) for
> these stack frames. With v1, I was printing the registers without looking
> for the exception frame marker, as long as the registers were saved.
> 
> > Would it be possible to also show the exception frame type in brackets and
> > the register dump for those fadump non-panicking active tasks?
> >
> 
> Hmmm... Let me have a hard look at this.
> I will try to improve this.

Hari,

I was tinkering around with ppc64_get_dumpfile_stack_frame() from your v2 patch,
and this seems to work:

        else {
                *ksp = pt_regs->gpr[1];
                if (IS_KVADDR(*ksp)) {
                        readmem(*ksp+16, KVADDR, nip, sizeof(ulong),
                                "Regs NIP value", FAULT_ON_ERROR);
+                       ppc64_print_regs(pt_regs);
                        return TRUE;
                } else {
                        if (IN_TASK_VMA(bt_in->task, *ksp))
                                fprintf(fp, "%0lx: Task is running in user space\n",
                                        bt_in->task);
                        else
                                fprintf(fp, "%0lx: Invalid Stack Pointer %0lx\n",
                                        bt_in->task, *ksp);
                        *nip = pt_regs->nip;
                        ppc64_print_regs(pt_regs);
                        return FALSE;
                }
        }

And if the task was running in user space, the "else" section above already
dumps the registers.

I see that regs->trap is 0, so I understand now that there's nothing to
translate with respect to the exception frame type, but a follow-up translation
of the NIP and LR would at least show that there was some kind of hypercall
involved.  (Whether it can be firmly determined that FADUMP was responsible
is another question.)
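
To illustrate the idea, here is a minimal standalone sketch, not crash code.
It assumes a tiny made-up symbol table: the two symbol names come from the
register dump quoted above, but their start addresses are invented for the
example.  The trap labels are the conventional PowerPC vector names and may
not match crash's exact strings.  The names trap_name(), lookup_sym() and
print_nip_lr() are hypothetical helpers, not existing crash functions.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical symbol table entry: start address and symbol name. */
struct sym {
    unsigned long long addr;
    const char *name;
};

/*
 * ASSUMPTION: these start addresses are invented for illustration.
 * Only the names are taken from the register dump in this thread.
 * The table must be sorted by ascending address.
 */
static const struct sym symtab[] = {
    { 0xc000000000055700ULL, ".plpar_hcall_norets" },
    { 0xc000000000057300ULL, ".pseries_shared_idle_sleep" },
};

/* Map a trap vector to a bracketed label; NULL when there is none (e.g. 0). */
static const char *trap_name(unsigned long trap)
{
    switch (trap) {
    case 0x300: return "Data Access";
    case 0x400: return "Instruction Access";
    case 0x500: return "Hardware Interrupt";
    case 0x700: return "Program Check";
    case 0x900: return "Decrementer";
    case 0xc00: return "System Call";
    default:    return NULL;
    }
}

/* Return the nearest symbol at or below addr, or NULL if none precedes it. */
static const struct sym *lookup_sym(unsigned long long addr)
{
    const struct sym *best = NULL;
    size_t i;

    for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
        if (symtab[i].addr > addr)
            break;
        best = &symtab[i];
    }
    return best;
}

/* Print the exception label only if one exists, then translate NIP and LR. */
static void print_nip_lr(FILE *out, unsigned long trap,
                         unsigned long long nip, unsigned long long lr)
{
    const char *label = trap_name(trap);
    const struct sym *s;

    if (label)
        fprintf(out, " [%s]\n", label);

    s = lookup_sym(nip);
    fprintf(out, " NIP [%016llx] %s\n", nip, s ? s->name : "(unknown)");
    s = lookup_sym(lr);
    fprintf(out, " LR  [%016llx] %s\n", lr, s ? s->name : "(unknown)");
}
```

With trap set to 0, as in the fadump case, no bracketed label is printed, but
the NIP and LR lines still point at the hypercall path.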

Thanks,
  Dave