[Crash-utility] [PATCH] crash: Do not use bt -t flag in panic_search()

Thu Aug 6 15:25:29 UTC 2015

Hi Michael,

Re: your dumpfile where the erroneous "panic" address in a random user
task's exception frame register set gets picked up by mistake.  

Your original patch request modified the "bt" command used for the
kernel stack searches in panic_search().  But that piece of code
is the last-ditch effort for finding a panic task, which follows 
this path:

  get_panic_context()
    panic_search()
      get_dumpfile_panic_task()
        get_kdump_panic_task()       (requires kdump "crashing_cpu" symbol)
        get_diskdump_panic_task()    (requires kdump "crashing_cpu" symbol)
        get_active_set_panic_task()  (bt -r raw stack dump of active cpus)
    ...

Only if all of the above fail does panic_search() initiate the
exhaustive walkthrough of all kernel stacks for evidence.
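
To illustrate why that walkthrough can produce a false hit, here is a
minimal sketch of a raw stack scan. It is not the crash code:
scan_stack_for_text() and struct text_range are hypothetical names, and
the bounds used for panic() in the usage note are made up. Only the
stack words are taken from the "bt -t" output quoted below.

```c
#include <stddef.h>

typedef unsigned long ulong;

/* Hypothetical [start, end) address range covering one text symbol. */
struct text_range {
	ulong start;
	ulong end;
};

/*
 * Walk raw stack memory word by word and return the index of the first
 * word that falls inside the given text range, or -1 if there is none.
 * No frame unwinding is done, so a stale value left behind in a saved
 * register set matches just as readily as a genuine return address.
 */
static long
scan_stack_for_text(const ulong *stack, size_t nwords, struct text_range r)
{
	size_t i;

	for (i = 0; i < nwords; i++) {
		if (stack[i] >= r.start && stack[i] < r.end)
			return (long)i;
	}

	return -1;
}
```

Given the words { 0, 0, 0x83f650, 0, 0x83679a, 0x280da } and an assumed
panic() range of 0x836700-0x836900, the scan reports a hit on the fifth
word, even though the task's real backtrace never went through panic().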

Since you have gotten that far, I'm wondering whether your
target dumpfile with the faulty "panic" address is from an
s390x "live dump"?  In that case, there can never be any task 
with any such evidence, making the backtrace search a waste of 
time to begin with.

And if so, I'm thinking that since s390x will have set the LIVE_DUMP
flag, if get_dumpfile_panic_task() returns NO_TASK for a live dump,
then panic_search() should just return NULL to get_panic_context(),
which will then default to the idle task on cpu 0.
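
Roughly, the decision I have in mind looks like the sketch below. It is
a standalone model, not a patch: panic_search_action() and the enum are
hypothetical, the LIVE_DUMP bit value is a placeholder, and where the
flag actually lives is left out.

```c
typedef unsigned long ulong;

#define NO_TASK    0UL
#define LIVE_DUMP  0x1UL   /* placeholder bit, not the real definition */

/* Hypothetical outcomes of the proposed check. */
enum search_action {
	USE_DUMPFILE_TASK,   /* the dumpfile-specific lookup found a task */
	SKIP_SEARCH,         /* live dump: no panic evidence can exist */
	SEARCH_ALL_STACKS    /* last resort: exhaustive stack walkthrough */
};

/*
 * dumpfile_task stands in for the value that get_dumpfile_panic_task()
 * returned; flags stands in for wherever LIVE_DUMP is recorded.
 */
static enum search_action
panic_search_action(ulong dumpfile_task, ulong flags)
{
	if (dumpfile_task != NO_TASK)
		return USE_DUMPFILE_TASK;

	if (flags & LIVE_DUMP)
		return SKIP_SEARCH;   /* caller falls back to idle task on cpu 0 */

	return SEARCH_ALL_STACKS;
}
```

With NO_TASK and LIVE_DUMP set, the result is SKIP_SEARCH, so the
kernel stacks are never scanned for a live dump.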

Dave

----- Original Message -----
> 
> 
> ----- Original Message -----
> > Hi Dave,
> > 
> > I got a dump where a process "gmain" was incorrectly marked as running:
> > 
> > crash> ps | grep gmain
> > >   217      1   5      8bec23420     IN   0.0  463276  18240  gmain
> > 
> > The reason was that the "brute force" way parsing the "bt -t -o"
> > output in panic_search() found the symbol "panic" on the stack:
> > 
> > crash> bt -t -o 8bec23420
> > PID: 217    TASK: 8bec23420         CPU: 5   COMMAND: "gmain"
> >               START: __schedule at 83f650
> >   [       8b662b900] (null) at 0
> >   [       8b662b950] (null) at 0
> >   [       8b662b978] __schedule at 83f650
> >   [       8b662b990] (null) at 0
> > ...
> >   [       8b662bb18] (null) at 0
> >   [       8b662bb40] panic at 83679a  <<<<<--------------
> >   [       8b662bb58] _ehead at 280da
> 
> 
> I guess the obvious question is why "panic" was on the stack?
> 
> > 
> > The real stack trace was as follows:
> > 
> > crash> bt  8bec23420
> > Detaching after fork from child process 15508.
> > PID: 217    TASK: 8bec23420         CPU: 5   COMMAND: "gmain"
> >  #0 [8b662b8f0] __schedule at 83f650
> >  #1 [8b662b958] schedule at 83fade
> >  #2 [8b662b970] schedule_hrtimeout_range_clock at 842fc8
> >  #3 [8b662ba10] poll_schedule_timeout at 2c6e8a
> >  #4 [8b662ba30] do_sys_poll at 2c8604
> >  #5 [8b662be40] sys_poll at 2c8852
> >  #6 [8b662bea8] system_call at 843a66
> > 
> > IMHO the "-t" method is quite risky (at least on s390). What about using
> > the "normal" stack backtrace without the "-t" bt option?
> 
> That really worries me -- introducing the usage of normal backtrace on all
> tasks instead of simply walking the stack memory looking for text addresses
> is a huge change.
> 
> Dave
>  
> 
> > ---
> >  task.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- a/task.c
> > +++ b/task.c
> > @@ -6633,7 +6633,7 @@ panic_search(void)
> >          fd = &foreach_data;
> >  	fd->keys = 1;
> >  	fd->keyword_array[0] = FOREACH_BT;
> > -	fd->flags |= (FOREACH_t_FLAG|FOREACH_o_FLAG);
> > +	fd->flags |= FOREACH_o_FLAG;
> >  
> >  	dietask = lasttask = NO_TASK;
> >  	
> > 
>