[Crash-utility] [External Mail]Re: [PATCH] optimize the way to find the panic task.

Tue Oct 22 20:49:47 UTC 2019

Hi Qiwu,

I don't have any sample vmcores on hand in which this patch actually
detected the panic task, primarily because it's highly unlikely 
that the panic task cannot be determined by the normal means.  

So to effectively test the patch, in cmd_test() I added a call
to get_log_panic_task(), and ran it on ~300 dumpfiles.  For the 
most part it worked, but for older kernels, your search for "CPU: " 
within the informational string doesn't work, because the CPU 
number gets put on a line by itself, like this example:

  ...
  kernel BUG at block/blk-core.c:2045!
  invalid opcode: 0000 [#1] SMP 
  last sysfs file: /sys/class/firmware/0000:03:0d.0/loading
  CPU 1 
  Modules linked in: ...

So for backwards compatibility, I added an additional check for
the "CPU " string at the beginning of a line: 

  static ulong
  search_panic_task_by_cpu(char *buf)
  {
          int crashing_cpu;
          char *p1, *p2;
          ulong task = NO_TASK;

          p1 = NULL;

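          /* newer kernels: "CPU: N" appears within the line */
          if ((p1 = strstr(buf, "CPU: ")))
                  p1 += strlen("CPU: ");
          /* older kernels: "CPU N" is on a line by itself */
          else if (STRNEQ(buf, "CPU "))
                  p1 = buf + strlen("CPU ");

          if (p1) {
                  p2 = p1;
                  while (!whitespace(*p2) && (*p2 != '\n'))
                          p2++;
                  *p2 = NULLCHAR;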
                  crashing_cpu = dtol(p1, RETURN_ON_ERROR, NULL);
                  if ((crashing_cpu >= 0) && in_cpu_map(ONLINE_MAP, crashing_cpu)) {
                          task = tt->active_set[crashing_cpu];
                          if (CRASHDEBUG(1))
                                  error(WARNING,
                                          "get_log_panic_task: active_set[%d]: %lx\n",
                                          crashing_cpu, tt->active_set[crashing_cpu]);
                  }
          }
          return task;
  }
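
The two-format check can be tried outside crash with a minimal standalone
sketch.  It uses only the C library in place of crash's STRNEQ() and dtol()
helpers, and parse_cpu_from_line() is a hypothetical name, not a function
in the patch:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of crash: return the CPU number found in
 * one log line, or -1 if the line carries none.
 */
static int
parse_cpu_from_line(const char *line)
{
        const char *p;

        if ((p = strstr(line, "CPU: ")))            /* "CPU: N" in the line */
                p += strlen("CPU: ");
        else if (strncmp(line, "CPU ", 4) == 0)     /* "CPU N" at line start */
                p = line + strlen("CPU ");
        else
                return -1;

        /* Reject lines where no digit follows the prefix. */
        if (!isdigit((unsigned char)*p))
                return -1;

        return (int)strtol(p, NULL, 10);
}
```

With this sketch, both the "CPU 1 " line from the example above and a line
containing "CPU: 4" yield a CPU number, while a line such as
"Modules linked in: ..." yields -1.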

There are still a number of dumpfiles for which the patch doesn't find
the panic task, for example, x86_64-specific "general protection fault" 
dumps that aren't in your panic_keywords[] array.  There are several of
those x86_64 fault types, but they shouldn't necessarily be in the keywords 
array because they may be generated in user-space, get fixed up, and 
therefore would not preface a kernel crash.  But as I mentioned
before, in those cases the panic task was determined by the normal
means.  So let's keep the keywords array as you have done.

Queued for crash-7.2.8:

  https://github.com/crash-utility/crash/commit/869f3b24fc3f1dd236b58e1cff86fb4e68da76cf

Thanks,
  Dave

----- Original Message -----
> Hi Dave,
> Thanks for your review. It's a great honor to receive your valuable
> suggestions about my patch.
> I have a different point of view about the definition of the max logbuf
> length. In the upstream kernel, the max logbuf length is still determined
> by the arch-specific CONFIG_LOG_BUF_SHIFT definition.
> [kernel/printk/printk.c]
> #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> 
> LOG_BUF_LEN_MAX comes from commit e6fe3e5b7d16e8f146a4ae7fe481bc6e97acde1e,
> which gives an error on an attempt to set the log buffer length to over 2G.
> 
> As for the question of where I got the MAX_BUFSIZE of 2MB:
> I'm working on QCOM's ARM64 arch. In QCOM's kernel-4.14 code, the max logbuf
> length is set to 2MB.
> [arch/arm64/configs/xxx_config]
> CONFIG_LOG_BUF_SHIFT=21
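
As a quick check of the arithmetic, here is a minimal sketch, assuming the
buffer length is simply 1 shifted left by CONFIG_LOG_BUF_SHIFT as in the
#define quoted above.  log_buf_len() is a hypothetical helper, not a kernel
function:

```c
/* Hypothetical helper: size in bytes of a log buffer for a given shift. */
static unsigned long long
log_buf_len(unsigned int shift)
{
        return 1ULL << shift;
}
```

A shift of 21 gives 2097152 bytes (2MB), and a shift of 31 gives
2147483648 bytes (2GB).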
> 
> Following your suggestions, I have corrected the logic error and made some
> significant changes to my patch.
> The new patch file is attached.
> Thanks for your review. I’m looking forward to your favourable reply!
> 
> Best regards,
> Qiwu
> 
> 
> -----Original Message-----
> From: Dave Anderson <anderson at redhat.com>
> Sent: Thursday, October 17, 2019 4:33 AM
> To: 陈启武 <chenqiwu at xiaomi.com>
> Subject: [External Mail]Re: [PATCH] optimize the way to find the panic task.
> 
> 
> Hi Qiwu,
> 
> I tested your patch against several ARM64 dumpfiles that I have on hand, and
> a couple of them that were created with "virsh dump" generated a
> segmentation violation like this:
> 
> ...
> please wait... (determining panic task)
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff6bba0b0 in __strstr_sse42 () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.17-260.el7_6.6.x86_64
> libgcc-4.8.5-36.el7_6.2.x86_64 libstdc++-4.8.5-36.el7_6.2.x86_64
> lzo-2.06-8.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64
> snappy-1.1.0-3.el7.x86_64 zlib-1.2.7-18.el7.x86_64
> (gdb) bt
> #0  0x00007ffff6bba0b0 in __strstr_sse42 () from /lib64/libc.so.6
> #1  0x00000000004b6389 in get_log_panic_task () at task.c:7485
> #2  0x00000000004ccefe in panic_search () at task.c:7361
> #3  get_panic_context () at task.c:6205
> #4  task_init () at task.c:642
> #5  0x0000000000460be5 in main_loop () at main.c:774
> #6  0x0000000000659383 in captured_command_loop (data=data@entry=0x0) at
> main.c:258
> #7  0x00000000006580aa in catch_errors (func=func@entry=0x659370
> <captured_command_loop>, func_args=func_args@entry=0x0,
>     errstring=errstring@entry=0x890f87 "", mask=mask@entry=6) at
>     exceptions.c:557
> #8  0x000000000065a316 in captured_main (data=data@entry=0x7fffffffdb30) at
> main.c:1064
> #9  0x00000000006580aa in catch_errors (func=func@entry=0x659650
> <captured_main>, func_args=func_args@entry=0x7fffffffdb30,
>     errstring=errstring@entry=0x890f87 "", mask=mask@entry=6) at
>     exceptions.c:557
> #10 0x000000000065a677 in gdb_main (args=0x7fffffffdb30) at main.c:1079
> #11 gdb_main_entry (argc=<optimized out>, argv=argv@entry=0x7fffffffdc98) at
> main.c:1099
> #12 0x00000000004eeab4 in gdb_main_loop (argc=<optimized out>, argc@entry=3,
> argv=argv@entry=0x7fffffffdc98) at gdb_interface.c:76
> #13 0x000000000045f03a in main (argc=3, argv=0x7fffffffdc98) at main.c:707
> (gdb)
> 
> The SIGSEGV is the strstr() call in get_log_panic_task(), where the "buf"
> pointer must be OK, but for some reason "i" is not available in the gdb
> session.  So I added the following debug line:
> 
>    7479         BZERO(buf, MAX_BUFSIZE);
>    7480         open_tmpfile();
>    7481         dump_log(SHOW_LOG_TEXT);
>    7482         rewind(pc->tmpfile);
>    7483         if (fread(buf, 1, MAX_BUFSIZE, pc->tmpfile)) {
>    7484                 while (panic_keywords[i++]) {
>    7485 fprintf(stderr, "[%d][%s]\n", i, panic_keywords[i]);
>    7486                         if ((p1 = strstr(buf, panic_keywords[i]))) {
>    7487                                 if ((p1 = strstr(p1, "CPU: "))) {
>    7488                                         p1 += strlen("CPU: ");
>    7489                                         p2 = p1;
>    7490
> 
> 
> and as expected, it runs off the end of the panic_keywords[] array:
> 
>   ...
> 
>   please wait... (determining panic task)[1][BUG: unable to handle kernel]
>   [2][Kernel BUG at]
>   [3][kernel BUG at]
>   [4][Bad mode in]
>   [5][Oops]
>   [6][Kernel panic]
>   [7][(null)]
>   Segmentation fault (core dumped)
>   $
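
The overrun comes from incrementing "i" in the loop condition, so the body
skips element 0 and eventually passes the terminating NULL to strstr().  A
minimal sketch of an iteration that avoids this is shown here.  It assumes
the array ends with a NULL entry, as the "[7][(null)]" output suggests, and
match_panic_keyword() is a hypothetical name, not a function in the patch:

```c
#include <stddef.h>
#include <string.h>

/* Keywords copied from this thread; the NULL terminator is an assumption. */
static const char *panic_keywords[] = {
        "Unable to handle kernel",
        "BUG: unable to handle kernel",
        "Kernel BUG at",
        "kernel BUG at",
        "Bad mode in",
        "Oops",
        "Kernel panic",
        NULL
};

/*
 * Hypothetical helper: return the index of the first keyword found in
 * line, or -1 if none matches.  The index is tested before it is used
 * and advanced only after the body has run.
 */
static int
match_panic_keyword(const char *line)
{
        int i;

        for (i = 0; panic_keywords[i]; i++) {
                if (strstr(line, panic_keywords[i]))
                        return i;
        }
        return -1;
}
```

Because the comparison is case-sensitive, "kernel BUG at ..." matches
index 3 rather than index 2.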
> 
> But anyway, aside from the logic error above, a couple other comments:
> 
> (1) I do not want to change the order in which the panic task
>     search is made -- it should still try "foreach bt" first,
>     and only if that fails, search the log.
> 
> (2) The upstream kernel has a LOG_BUF_LEN_MAX that is 2GB,
>     so I'm not sure where you got the MAX_BUFSIZE of 2MB?
> 
> (3) But regardless of the log buffer size, I don't like the idea
>     of reading the whole log into a buffer.  It's already captured
>     into a temporary file that can be searched, so why bother copying
>     it into another buffer?
> 
> I would suggest using "while (fgets(buf, BUFSIZE, pc->tmpfile))"
> instead.  BUFSIZE should be large enough to contain any line in the log
> buffer, or certainly any line that contains one of the panic_keywords[]
> strings.
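
A line-by-line search along those lines could look like the following
minimal sketch.  It is a standalone illustration rather than crash code:
LINE_BUFSIZE stands in for crash's BUFSIZE, the stream stands in for
pc->tmpfile, and scan_log_for_panic_cpu() is a hypothetical name:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_BUFSIZE 1500   /* assumption: stand-in for crash's BUFSIZE */

/*
 * Hypothetical helper: scan an already-written log stream line by line
 * and return the CPU number from the first "CPU: " field seen at or after
 * a line containing keyword.  Returns -1 if nothing is found.
 */
static int
scan_log_for_panic_cpu(FILE *fp, const char *keyword)
{
        char buf[LINE_BUFSIZE];
        char *p, *end;
        long cpu;
        int found_keyword = 0;

        rewind(fp);
        while (fgets(buf, sizeof(buf), fp)) {
                if (!found_keyword && strstr(buf, keyword))
                        found_keyword = 1;
                if (found_keyword && (p = strstr(buf, "CPU: "))) {
                        p += strlen("CPU: ");
                        cpu = strtol(p, &end, 10);
                        if (end != p)           /* at least one digit read */
                                return (int)cpu;
                }
        }
        return -1;
}
```

Only one line is held in memory at a time, so the size of the log buffer
no longer matters to the search.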
> 
> Also, can you please post any patches to the crash-utility mailing list
> instead of emailing me directly?
> 
> Thanks,
>   Dave
> 
> 
> 
> 
> ----- Original Message -----
> > Hi Dave,
> > I'm working on arm64 kdump with crash-7.2.7, and the warning message
> > "panic task not found" is generated, as shown below:
> >
> > please wait... (determining panic task)
> >       KERNEL: vmlinux
> >    DUMPFILES: /var/tmp/ramdump_elf_8mA3xU [temporary ELF header]
> >               DDRCS0_0.BIN
> >               DDRCS1_0.BIN
> >               DDRCS1_1.BIN
> >         CPUS: 8
> >         DATE: Sat Feb  6 10:11:39 1971
> >       UPTIME: 00:00:07
> > LOAD AVERAGE: 0.64, 0.13, 0.04
> >        TASKS: 624
> >     NODENAME: localhost
> >      RELEASE: 4.4.184-perf-gdaa9cd595d7e-dirty
> >      VERSION: #1 SMP PREEMPT Thu Aug 22 14:41:16 CST 2019
> >      MACHINE: aarch64  (unknown Mhz)
> >       MEMORY: 5.7 GB
> >        PANIC: "Unable to handle kernel paging request at virtual address
> >        ffffffd532a1b2f8"
> >          PID: 0
> >      COMMAND: "swapper/0"
> >         TASK: ffffff803ec15390  (1 of 8)  [THREAD_INFO: ffffff803ec15390]
> >          CPU: 0
> >        STATE: TASK_RUNNING
> >      WARNING: panic task not found
> >
> > The panic task cannot be found from the following backtrace, so the
> > overview above shows the wrong running task:
> > [    7.630611] Process swapper/4 (pid: 0, stack limit = 0xffffffd536704000)
> > [    7.630614] Call trace:
> > [    7.630661] [<ffffffd532a1b2f8>] 0xffffffd532a1b2f8
> > [    7.630666] [<ffffff803c71d92c>] run_timer_softirq+0x508/0x554
> > [    7.630671] [<ffffff803c6835e4>] __do_softirq+0x1fc/0x3e4
> > [    7.630676] [<ffffff803c6aa0f0>] irq_exit+0x88/0xd0
> > [    7.630681] [<ffffff803c70b330>] __handle_domain_irq+0x8c/0xac
> > [    7.630685] [<ffffff803c681154>] gic_handle_irq+0xc8/0x190
> >
> > So I am introducing this patch to improve the way the panic task is found.
> > We can find the panic task by searching the kernel log for arch-specific
> > panic keywords.
> > I define some arch-specific panic keywords in a const array, in the
> > order in which a panic prints them:
> > const char* panic_keywords[] = {
> >         "Unable to handle kernel",
> >         "BUG: unable to handle kernel",
> >         "Kernel BUG at",
> >         "kernel BUG at",
> >         "Bad mode in",
> >         "Oops",
> >         "Kernel panic"
> > };
> > We can search the kernel log for these panic keywords in order.
> > Generally, these panic keywords are followed by the stack trace
> > of the panic. Arch-specific dump_stack() implementations can use the
> > dump_stack_print_info() function to print the same generic debug
> > info, so we can determine the panic task by finding the first
> > "CPU: " keyword after the panic keyword that was found.
> >
> > The patch file is attached.
> > Thanks for your review. I’m looking forward to your favourable reply!
> >
> > Best regards,
> > Qiwu
> > #/****** This e-mail and its attachments contain confidential
> > information from
> > XIAOMI, which is intended only for the person or entity whose address
> > is listed above. Any use of the information contained herein in any
> > way (including, but not limited to, total or partial disclosure,
> > reproduction, or dissemination) by persons other than the intended
> > recipient(s) is prohibited. If you receive this e-mail in error,
> > please notify the sender by phone or email immediately and delete
> > it!******/#
> >
>