[Crash-utility] [PATCH 00/11] sadump: Incremental update patches

Thu Oct 20 21:06:54 UTC 2011

----- Original Message -----
> Hello Dave,
> 
> The following series fixes minor bugs, cleans up the sadump module, and
> addresses the issue with kdump's backup of the first 640kB.
> 
> The last patch is preparation for makedumpfile's support of
> sadump-related formats (still a work in progress), which will produce
> a dumpfile in kdump-compressed format from sadump-related formats.
> 
> This patch set is based on crash 5.1.9.

Hello Daisuke,

As I have stated in our previous sadump-related discussions, you have
free rein to make whatever changes you like in sadump-specific
files, or in functions that deal with sadump-specific issues.  However, 
if your changes modify behavior when used with non-sadump dumpfiles
then I may have a problem with them.  So when you post a patch-set 
such as this last set, I would prefer that you post two separate 
patch-sets.

This 1/11 patchset is a good example of what I mean.  I have no
problem with the sadump-specific patches.  But I do have a big
problem with the last one, which is not necessarily sadump-specific:

  use_regs_in_elf_notes_on_kdump_fmt_from_sadump.patch.patch

BTW, these are the names of the patches as they were attached.  Note that
the second one doesn't have "0002-" prepended to it, and that there are
no "0008-" or "0009-" patches:

  0001-sadump-bug-close-receives-unintened-value.patch.patch
  cleanup_is_sadump.patch.patch
  0002-sadump-bug-specify-wrong-type.patch.patch
  0003-sadump-bugfix-time-stamp-values-displayed-are-same.patch.patch
  0004-sadump-don-t-exit-if-time-stamps-mismatch.patch.patch
  0005-sadump-debug-messages-at-the-beginning-of-open_disk-.patch.patch
  0006-sadump-Allow-arbitrary-number-of-disk-set-configurat.patch.patch
  0007-sadump-refer-to-eip-and-esp-on-x86-kernels.patch.patch
  0010-Make-data-relevant-to-physical-memory-have-64-bits-l.patch.patch
  0011-Read-kexec-backup-region-if-read-to-the-first-640kB-.patch.patch
  use_regs_in_elf_notes_on_kdump_fmt_from_sadump.patch.patch

Anyway, I tested this by running "bt -a" on a large set of sample dumpfiles, 
first without, and then with, your patchset.  When your patches are applied, I see 
numerous examples where the backtraces are missing huge pieces of information.

Here are typical examples:

Here, with un-patched crash-5.1.9, is a RHEL6 crashing process:

 PID: 14187  TASK: ffff88012b98e040  CPU: 0   COMMAND: "runtest.sh"
  #0 [ffff88012b2739e0] machine_kexec at ffffffff810310fb
  #1 [ffff88012b273a40] crash_kexec at ffffffff810b6632
  #2 [ffff88012b273b10] oops_end at ffffffff814df320
  #3 [ffff88012b273b40] no_context at ffffffff81040cbb
  #4 [ffff88012b273b90] __bad_area_nosemaphore at ffffffff81040f45
  #5 [ffff88012b273be0] bad_area at ffffffff8104106e
  #6 [ffff88012b273c10] __do_page_fault at ffffffff81041793
  #7 [ffff88012b273d30] do_page_fault at ffffffff814e132e
  #8 [ffff88012b273d60] page_fault at ffffffff814de6b5
     [exception RIP: sysrq_handle_crash+22]
     RIP: ffffffff8131b566  RSP: ffff88012b273e18  RFLAGS: 00010096
     RAX: 0000000000000010  RBX: 0000000000000063  RCX: 0000000000000f95
     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000063
     RBP: ffff88012b273e18   R8: ffffffff81b9e5c0   R9: 0000000000000000
     R10: 00007fff7b178160  R11: 0000000000000000  R12: 0000000000000000
     R13: ffffffff81a9a1a0  R14: 0000000000000286  R15: 0000000000000007
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #9 [ffff88012b273e20] __handle_sysrq at ffffffff8131b822
 #10 [ffff88012b273e70] write_sysrq_trigger at ffffffff8131b8de
 #11 [ffff88012b273ea0] proc_reg_write at ffffffff811d5bce
 #12 [ffff88012b273ef0] vfs_write at ffffffff811730c8
 #13 [ffff88012b273f30] sys_write at ffffffff81173ad1
 #14 [ffff88012b273f80] system_call_fastpath at ffffffff8100b0b2

With crash-5.1.9 plus your patch -- nothing is shown below the page fault
exception frame, i.e., the machine_kexec through page_fault frames are missing:

 PID: 14187  TASK: ffff88012b98e040  CPU: 0   COMMAND: "runtest.sh"
     [exception RIP: sysrq_handle_crash+22]
     RIP: ffffffff8131b566  RSP: ffff88012b273e18  RFLAGS: 00010096
     RAX: 0000000000000010  RBX: 0000000000000063  RCX: 0000000000000f95
     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000063
     RBP: ffff88012b273e18   R8: ffffffff81b9e5c0   R9: 0000000000000000
     R10: 00007fff7b178160  R11: 0000000000000000  R12: 0000000000000000
     R13: ffffffff81a9a1a0  R14: 0000000000000286  R15: 0000000000000007
     CS: 0010  SS: 0018
  #0 [ffff88012b273e20] __handle_sysrq at ffffffff8131b822
  #1 [ffff88012b273e70] write_sysrq_trigger at ffffffff8131b8de
  #2 [ffff88012b273ea0] proc_reg_write at ffffffff811d5bce
  #3 [ffff88012b273ef0] vfs_write at ffffffff811730c8
  #4 [ffff88012b273f30] sys_write at ffffffff81173ad1
  #5 [ffff88012b273f80] system_call_fastpath at ffffffff8100b0b2
     RIP: 00007fad3a2f45e0  RSP: 00007fff7b1783d8  RFLAGS: 00010206
     RAX: 0000000000000001  RBX: ffffffff8100b0b2  RCX: 0000000000000000
     RDX: 0000000000000002  RSI: 00007fad3abe6000  RDI: 0000000000000001
     RBP: 00007fad3abe6000   R8: 000000000000000a   R9: 00007fad3abe2700
     R10: 00007fff7b178160  R11: 0000000000000246  R12: 0000000000000002
     R13: 00007fad3a5a6780  R14: 0000000000000002  R15: 0000000000000001
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Again with un-patched crash-5.1.9, here are examples of two non-crashing cpus
that received shutdown NMI interrupts from the crashing task:

 PID: 0      TASK: ffff88012cd2f580  CPU: 1   COMMAND: "swapper"
  #0 [ffff880028227e90] crash_nmi_callback at ffffffff81028a96
  #1 [ffff880028227ea0] notifier_call_chain at ffffffff814e13e5
  #2 [ffff880028227ee0] atomic_notifier_call_chain at ffffffff814e144a
  #3 [ffff880028227ef0] notify_die at ffffffff810942fe
  #4 [ffff880028227f20] do_nmi at ffffffff814df033
  #5 [ffff880028227f50] nmi at ffffffff814de940
     [exception RIP: intel_idle+177]
     RIP: ffffffff812bc291  RSP: ffff88012cd31e68  RFLAGS: 00000046
     RAX: 0000000000000020  RBX: 0000000000000008  RCX: 0000000000000001
     RDX: 0000000000000000  RSI: ffff88012cd31fd8  RDI: ffffffff81a34040
     RBP: ffff88012cd31ed8   R8: 0000000000000000   R9: 00000000000000c8
     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000020
     R13: 12257c81ed7a34e6  R14: 0000000000000003  R15: 0000000000000001
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 --- <NMI exception stack> ---
  #6 [ffff88012cd31e68] intel_idle at ffffffff812bc291
  #7 [ffff88012cd31ee0] cpuidle_idle_call at ffffffff813ed4b7
  #8 [ffff88012cd31f00] cpu_idle at ffffffff81009de6

 PID: 37     TASK: ffff88012ce360c0  CPU: 2   COMMAND: "events/2"
  #0 [ffff880028247e90] crash_nmi_callback at ffffffff81028a96
  #1 [ffff880028247ea0] notifier_call_chain at ffffffff814e13e5
  #2 [ffff880028247ee0] atomic_notifier_call_chain at ffffffff814e144a
  #3 [ffff880028247ef0] notify_die at ffffffff810942fe
  #4 [ffff880028247f20] do_nmi at ffffffff814df033
  #5 [ffff880028247f50] nmi at ffffffff814de940
     [exception RIP: io_serial_in+22]
     RIP: ffffffff813324f6  RSP: ffff88012ce5fc70  RFLAGS: 00000006
     RAX: ffffffffab364400  RBX: ffffffff81f2cca0  RCX: 0000000000000000
     RDX: 000000000000d055  RSI: 0000000000000005  RDI: ffffffff81f2cca0
     RBP: ffff88012ce5fc70   R8: ffffffff81b9e5c0   R9: 0000000000000000
     R10: ffff880127498a60  R11: 0000000000000001  R12: 000000000000270c
     R13: 0000000000000020  R14: 0000000000000000  R15: ffffffff81332ba0
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 --- <NMI exception stack> ---
  #6 [ffff88012ce5fc70] io_serial_in at ffffffff813324f6
  #7 [ffff88012ce5fc78] wait_for_xmitr at ffffffff81332b03
  #8 [ffff88012ce5fca8] serial8250_console_putchar at ffffffff81332bc6
  #9 [ffff88012ce5fcc8] uart_console_write at ffffffff8132e55e
 #10 [ffff88012ce5fd08] serial8250_console_write at ffffffff81332f2d
 #11 [ffff88012ce5fd58] __call_console_drivers at ffffffff81067495
 #12 [ffff88012ce5fd88] _call_console_drivers at ffffffff810674fa
 #13 [ffff88012ce5fda8] release_console_sem at ffffffff81067ac8
 #14 [ffff88012ce5fde8] fb_flashcursor at ffffffff812abb4a
 #15 [ffff88012ce5fe38] worker_thread at ffffffff81088a40
 #16 [ffff88012ce5fee8] kthread at ffffffff8108dff6
 #17 [ffff88012ce5ff48] kernel_thread at ffffffff8100c10a

But when running crash-5.1.9 plus your patch -- the transitions to the NMI exception
stack are not shown at all:

 PID: 0      TASK: ffff88012cd2f580  CPU: 1   COMMAND: "swapper"
     [exception RIP: intel_idle+177]
     RIP: ffffffff812bc291  RSP: ffff88012cd31e68  RFLAGS: 00000046
     RAX: 0000000000000020  RBX: 0000000000000008  RCX: 0000000000000001
     RDX: 0000000000000000  RSI: ffff88012cd31fd8  RDI: ffffffff81a34040
     RBP: ffff88012cd31ed8   R8: 0000000000000000   R9: 00000000000000c8
     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000020
     R13: 12257c81ed7a34e6  R14: 0000000000000003  R15: 0000000000000001
     CS: 0010  SS: 0018
  #0 [ffff88012cd31e70] sched_clock_cpu at ffffffff8109539d
  #1 [ffff88012cd31ee0] cpuidle_idle_call at ffffffff813ed4b7
  #2 [ffff88012cd31f00] cpu_idle at ffffffff81009de6

 PID: 37     TASK: ffff88012ce360c0  CPU: 2   COMMAND: "events/2"
     [exception RIP: io_serial_in+22]
     RIP: ffffffff813324f6  RSP: ffff88012ce5fc70  RFLAGS: 00000006
     RAX: ffffffffab364400  RBX: ffffffff81f2cca0  RCX: 0000000000000000
     RDX: 000000000000d055  RSI: 0000000000000005  RDI: ffffffff81f2cca0
     RBP: ffff88012ce5fc70   R8: ffffffff81b9e5c0   R9: 0000000000000000
     R10: ffff880127498a60  R11: 0000000000000001  R12: 000000000000270c
     R13: 0000000000000020  R14: 0000000000000000  R15: ffffffff81332ba0
     CS: 0010  SS: 0018
  #0 [ffff88012ce5fc78] wait_for_xmitr at ffffffff81332b03
  #1 [ffff88012ce5fca8] serial8250_console_putchar at ffffffff81332bc6
  #2 [ffff88012ce5fcc8] uart_console_write at ffffffff8132e55e
  #3 [ffff88012ce5fd08] serial8250_console_write at ffffffff81332f2d
  #4 [ffff88012ce5fd58] __call_console_drivers at ffffffff81067495
  #5 [ffff88012ce5fd88] _call_console_drivers at ffffffff810674fa
  #6 [ffff88012ce5fda8] release_console_sem at ffffffff81067ac8
  #7 [ffff88012ce5fde8] fb_flashcursor at ffffffff812abb4a
  #8 [ffff88012ce5fe38] worker_thread at ffffffff81088a40
  #9 [ffff88012ce5fee8] kthread at ffffffff8108dff6
 #10 [ffff88012ce5ff48] kernel_thread at ffffffff8100c10a
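
A comparison like the one above could be automated.  The following is only a
minimal Python sketch, not part of crash: it assumes the "bt -a" output from
the un-patched and patched builds has been saved as text, and the helper names
parse_bt() and missing_frames() are hypothetical.  It reports, per task, the
frames and stack-transition markers that appear in the first output but not in
the second:

```python
import re

# Task header, e.g.:  PID: 0      TASK: ffff88012cd2f580  CPU: 1   COMMAND: "swapper"
TASK_RE = re.compile(r'^\s*PID:\s*(\d+)\s+TASK:\s*([0-9a-f]+)\s+CPU:\s*(\d+)')
# Frame line, e.g.:   #0 [ffff880028227e90] crash_nmi_callback at ffffffff81028a96
FRAME_RE = re.compile(r'^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+')
# Stack marker, e.g.: --- <NMI exception stack> ---
MARK_RE = re.compile(r'^\s*---\s*<(.+?)>\s*---')


def parse_bt(text):
    """Map (pid, cpu) -> ordered list of function names and stack markers."""
    tasks = {}
    current = None
    for line in text.splitlines():
        m = TASK_RE.match(line)
        if m:
            key = (int(m.group(1)), int(m.group(3)))
            current = tasks.setdefault(key, [])
            continue
        if current is None:
            continue  # ignore anything before the first task header
        m = FRAME_RE.match(line)
        if m:
            current.append(m.group(1))
            continue
        m = MARK_RE.match(line)
        if m:
            current.append('<' + m.group(1) + '>')
    return tasks


def missing_frames(before, after):
    """Return {(pid, cpu): entries present in `before` but absent in `after`}."""
    old, new = parse_bt(before), parse_bt(after)
    report = {}
    for task, entries in old.items():
        seen = set(new.get(task, []))
        lost = [e for e in entries if e not in seen]
        if lost:
            report[task] = lost
    return report
```

Register dumps and "[exception RIP: ...]" lines are deliberately ignored, so
the check only flags lost frames and lost exception-stack transitions.  A
non-empty result for a non-sadump dumpfile would indicate a regression of the
kind shown here.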

If I remove the "use_regs_in_elf_notes_on_kdump_fmt_from_sadump.patch.patch" patch,
the backtraces are correct.  Now, it may be true that the changes you made make
sense with respect to sadump dumpfiles, where the register set stored in the header
is a reflection of the last location at which each cpu was running (?).

But those changes are totally unacceptable for compressed kdump dumpfiles.

Dave