[Crash-utility] [External Mail]Re: [PATCH] optimize the way to find the panic task.

Mon Oct 21 07:38:21 UTC 2019

Hi Dave,
Thanks for your review, and for the valuable suggestions about my patch.
I have a different view on the definition of the max logbuf length.
In the upstream kernel, the max logbuf length is still determined by the arch-specific CONFIG_LOG_BUF_SHIFT definition:
[kernel/printk/printk.c]
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);

LOG_BUF_LEN_MAX comes from commit e6fe3e5b7d16e8f146a4ae7fe481bc6e97acde1e, which gives an error on
an attempt to set the log buffer length to more than 2G.

As for the question of where I got the MAX_BUFSIZE of 2MB:
I'm working on QCOM's ARM64 arch. In QCOM's kernel-4.14 code, the max logbuf length is set to 2MB:
[arch/arm64/configs/xxx_config]
CONFIG_LOG_BUF_SHIFT=21
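
As a quick check of the arithmetic, a shift of 21 gives a 2 MiB buffer. The sketch below is standalone C, not kernel code: LOG_BUF_SHIFT and log_buf_len() are illustrative names that stand in for CONFIG_LOG_BUF_SHIFT and __LOG_BUF_LEN.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative stand-in for CONFIG_LOG_BUF_SHIFT from the config above. */
#define LOG_BUF_SHIFT 21
#define LOG_BUF_LEN   (1UL << LOG_BUF_SHIFT)

/* Buffer length in bytes for a given shift (hypothetical helper). */
static unsigned long log_buf_len(unsigned int shift)
{
        /* Shifting by the full width of the type would be undefined. */
        assert(shift < sizeof(unsigned long) * CHAR_BIT);
        return 1UL << shift;
}
```

With a shift of 21, log_buf_len() returns 2097152 bytes, which is 2 MiB.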

Following your suggestions, I have corrected the logic error and made some significant changes to my patch.
The new patch file is attached.
Thanks for your review. I’m looking forward to your favourable reply!

Best regards,
Qiwu

-----Original Message-----
From: Dave Anderson <anderson at redhat.com>
Sent: Thursday, October 17, 2019 4:33 AM
To: 陈启武 <chenqiwu at xiaomi.com>
Subject: [External Mail]Re: [PATCH] optimize the way to find the panic task.

Hi Qiwu,

I tested your patch against several ARM64 dumpfiles that I have on hand, and a couple of them that were created with "virsh dump" generated segmentation violations like this:

...
please wait... (determining panic task)
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6bba0b0 in __strstr_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libstdc++-4.8.5-36.el7_6.2.x86_64 lzo-2.06-8.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 snappy-1.1.0-3.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007ffff6bba0b0 in __strstr_sse42 () from /lib64/libc.so.6
#1  0x00000000004b6389 in get_log_panic_task () at task.c:7485
#2  0x00000000004ccefe in panic_search () at task.c:7361
#3  get_panic_context () at task.c:6205
#4  task_init () at task.c:642
#5  0x0000000000460be5 in main_loop () at main.c:774
#6  0x0000000000659383 in captured_command_loop (data=data@entry=0x0) at main.c:258
#7  0x00000000006580aa in catch_errors (func=func@entry=0x659370 <captured_command_loop>, func_args=func_args@entry=0x0,
    errstring=errstring@entry=0x890f87 "", mask=mask@entry=6) at exceptions.c:557
#8  0x000000000065a316 in captured_main (data=data@entry=0x7fffffffdb30) at main.c:1064
#9  0x00000000006580aa in catch_errors (func=func@entry=0x659650 <captured_main>, func_args=func_args@entry=0x7fffffffdb30,
    errstring=errstring@entry=0x890f87 "", mask=mask@entry=6) at exceptions.c:557
#10 0x000000000065a677 in gdb_main (args=0x7fffffffdb30) at main.c:1079
#11 gdb_main_entry (argc=<optimized out>, argv=argv@entry=0x7fffffffdc98) at main.c:1099
#12 0x00000000004eeab4 in gdb_main_loop (argc=<optimized out>, argc@entry=3, argv=argv@entry=0x7fffffffdc98) at gdb_interface.c:76
#13 0x000000000045f03a in main (argc=3, argv=0x7fffffffdc98) at main.c:707
(gdb)

The SIGSEGV is the strstr() call in get_log_panic_task(), where the "buf"
pointer must be OK, but for some reason "i" is not available in the gdb session.  So I added the following debug line:

   7479         BZERO(buf, MAX_BUFSIZE);
   7480         open_tmpfile();
   7481         dump_log(SHOW_LOG_TEXT);
   7482         rewind(pc->tmpfile);
   7483         if (fread(buf, 1, MAX_BUFSIZE, pc->tmpfile)) {
   7484                 while (panic_keywords[i++]) {
   7485 fprintf(stderr, "[%d][%s]\n", i, panic_keywords[i]);
   7486                         if ((p1 = strstr(buf, panic_keywords[i]))) {
   7487                                 if ((p1 = strstr(p1, "CPU: "))) {
   7488                                         p1 += strlen("CPU: ");
   7489                                         p2 = p1;
   7490

and as expected, it runs off the end of the panic_keywords[] array:

  ...

  please wait... (determining panic task)[1][BUG: unable to handle kernel]
  [2][Kernel BUG at]
  [3][kernel BUG at]
  [4][Bad mode in]
  [5][Oops]
  [6][Kernel panic]
  [7][(null)]
  Segmentation fault (core dumped)
  $
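
For illustration, the overrun comes from "while (panic_keywords[i++])": the index is bumped after the NULL test, so the loop body always sees the next entry, skips entry 0, and finally passes the terminator to strstr(). A minimal sketch of a loop that tests and uses the same entry is below. It assumes the array is NULL-terminated, and first_panic_keyword() is a hypothetical helper name, not something from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative copy of the keyword table; the NULL terminator is assumed. */
static const char *panic_keywords[] = {
        "Unable to handle kernel",
        "BUG: unable to handle kernel",
        "Kernel BUG at",
        "kernel BUG at",
        "Bad mode in",
        "Oops",
        "Kernel panic",
        NULL
};

/*
 * Return the index of the first keyword found in buf, or -1 if none match.
 * The index advances in the for-increment, so the entry checked against
 * NULL is the same entry that is passed to strstr().
 */
static int first_panic_keyword(const char *buf)
{
        int i;

        assert(buf != NULL);
        for (i = 0; panic_keywords[i]; i++) {
                if (strstr(buf, panic_keywords[i]))
                        return i;
        }
        return -1;
}
```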

But anyway, aside from the logic error above, a couple other comments:

(1) I do not want to change the order in which the panic task
    search is made -- it should still try "foreach bt" first,
    and only if that fails, search the log.

(2) The upstream kernel has a LOG_BUF_LEN_MAX that is 2GB,
    so I'm not sure where you got the MAX_BUFSIZE of 2MB?

(3) But regardless of the log buffer size, I don't like the idea
    of reading the whole log into a buffer.  It's already captured
    into a temporary file that can be searched, so why bother copying
    it into another buffer?

I would suggest using "while (fgets(buf, BUFSIZE, pc->tmpfile))"
instead.  BUFSIZE should be large enough to contain any line in the log buffer, or certainly any line that contains one of the panic_keywords[] strings.
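
A rough sketch of that line-by-line approach follows. It is standalone C rather than crash code: LINE_BUFSIZE stands in for crash's BUFSIZE (the value here is an assumption), find_panic_cpu() is a hypothetical name, and the caller passes a plain FILE pointer in place of pc->tmpfile. After the first line that contains a keyword, it returns the number that follows the next "CPU: " marker.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_BUFSIZE 1500   /* assumed stand-in for crash's BUFSIZE */

/*
 * Scan fp one line at a time. Once a line containing any keyword has
 * been seen, return the number after the next "CPU: " marker.
 * Returns -1 if no keyword or no "CPU: " marker is found.
 * keywords must be a NULL-terminated array.
 */
static int find_panic_cpu(FILE *fp, const char *const *keywords)
{
        char buf[LINE_BUFSIZE];
        int seen_panic = 0;
        int i;
        char *p;

        assert(fp != NULL && keywords != NULL);
        rewind(fp);
        while (fgets(buf, sizeof(buf), fp)) {
                if (!seen_panic) {
                        for (i = 0; keywords[i]; i++) {
                                if (strstr(buf, keywords[i])) {
                                        seen_panic = 1;
                                        break;
                                }
                        }
                }
                /* Only trust a "CPU: " marker at or after the panic line. */
                if (seen_panic && (p = strstr(buf, "CPU: ")))
                        return (int)strtol(p + strlen("CPU: "), NULL, 10);
        }
        return -1;
}
```

Only one line is held in memory at a time, so the size of the log buffer no longer matters.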

Also, can you please post any patches to the crash-utility mailing list instead of emailing me directly?

Thanks,
  Dave

----- Original Message -----
> Hi Dave,
> I'm working on an arm64 kdump with crash-7.2.7, and the warning message
> "panic task not found" is generated, as shown below:
>
> please wait... (determining panic task)
>       KERNEL: vmlinux
>    DUMPFILES: /var/tmp/ramdump_elf_8mA3xU [temporary ELF header]
>               DDRCS0_0.BIN
>               DDRCS1_0.BIN
>               DDRCS1_1.BIN
>         CPUS: 8
>         DATE: Sat Feb  6 10:11:39 1971
>       UPTIME: 00:00:07
> LOAD AVERAGE: 0.64, 0.13, 0.04
>        TASKS: 624
>     NODENAME: localhost
>      RELEASE: 4.4.184-perf-gdaa9cd595d7e-dirty
>      VERSION: #1 SMP PREEMPT Thu Aug 22 14:41:16 CST 2019
>      MACHINE: aarch64  (unknown Mhz)
>       MEMORY: 5.7 GB
>        PANIC: "Unable to handle kernel paging request at virtual address
>        ffffffd532a1b2f8"
>          PID: 0
>      COMMAND: "swapper/0"
>         TASK: ffffff803ec15390  (1 of 8)  [THREAD_INFO: ffffff803ec15390]
>          CPU: 0
>        STATE: TASK_RUNNING
>      WARNING: panic task not found
>
> The panic task cannot be found from the following backtrace, which results in
> the wrong running-task info being shown in the overview:
> [    7.630611] Process swapper/4 (pid: 0, stack limit = 0xffffffd536704000)
> [    7.630614] Call trace:
> [    7.630661] [<ffffffd532a1b2f8>] 0xffffffd532a1b2f8
> [    7.630666] [<ffffff803c71d92c>] run_timer_softirq+0x508/0x554
> [    7.630671] [<ffffff803c6835e4>] __do_softirq+0x1fc/0x3e4
> [    7.630676] [<ffffff803c6aa0f0>] irq_exit+0x88/0xd0
> [    7.630681] [<ffffff803c70b330>] __handle_domain_irq+0x8c/0xac
> [    7.630685] [<ffffff803c681154>] gic_handle_irq+0xc8/0x190
>
> So I introduce this patch to optimize the way the panic task is found.
> We can find the panic task by searching the kernel log for arch-specific
> panic keywords.
> I define some arch-specific panic keywords in a const array, in the
> order in which they are printed during a panic:
> const char* panic_keywords[] = {
>         "Unable to handle kernel",
>         "BUG: unable to handle kernel",
>         "Kernel BUG at",
>         "kernel BUG at",
>         "Bad mode in",
>         "Oops",
>         "Kernel panic"
> };
> We can search the kernel log for these panic keywords in order.
> Generally, these panic keywords are followed by the stack trace
> info of the panic. Arch-specific dump_stack() implementations can use
> the dump_stack_print_info() function to print out the same generic debug
> info. So we can determine the panic task by finding the first
> "CPU: " keyword after the panic keyword that was found.
>
> The patch file is attached.
> Thanks for your review. I’m looking forward to your favourable reply!
>
> Best regards,
> Qiwu
> #/******This e-mail and its attachments contain confidential information from
> XIAOMI, which is intended only for the person or entity whose address
> is listed above. Any use of the information contained herein in any
> way (including, but not limited to, total or partial disclosure,
> reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error,
> please notify the sender by phone or email immediately and delete
> it!******/#
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Optimize-the-way-to-find-the-panic-task.patch
Type: application/octet-stream
Size: 5594 bytes
Desc: Optimize-the-way-to-find-the-panic-task.patch
URL: <http://listman.redhat.com/archives/crash-utility/attachments/20191021/3620b7c4/attachment.obj>