[Crash-utility] Unable to switch stack frames while using crash

Thu Jun 23 15:29:56 UTC 2011

----- Original Message -----
> Hi Dave,
> 
> Thanks for the help.
> 
> I have some doubts regarding kdump and crash utility.
> 
> I am analyzing a vmcore dump caused by an oops in customer location
> using crash utility.The oops report is below
> 
> [345132.723424] BUG: unable to handle kernel NULL pointer dereference
> at
> 0000000000000005
> [345132.724928] IP: [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
> [345132.726100] PGD 2c8e03067 PUD 2cbd88067 PMD 0
> [345132.727187] Oops: 0000 [#1] SMP
> [345132.727879] last sysfs file: /sys/block/loop7/dev
> [345132.728935] CPU 1
> [345132.729396] Modules linked in: xt_tcpudp iptable_filter ip_tables
> x_tables strmfs_mod bond0 ipmi_devintf hpwdt sctp ipv6 crc32c
> libcrc32c
> loop ipmi_si tpm_tis ipmi_msghandler hpilo tpm tpm_bios psmouse
> serio_raw shpchp pci_hotplug container processor evdev ext3 jbd
> mbcache
> dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom
> ide_pci_generic ide_core usbhid hid ata_piix ata_generic libata
> ehci_hcd
> bnx2 uhci_hcd e1000e cciss scsi_mod button thermal fan thermal_sys edd
> [last unloaded: scsi_wait_scan]
> [345132.739511] Pid: 13366, comm: telnet Not tainted
> 2.6.32-cdma-18-amd64 #1 ProLiant DL380 G6
> [345132.741423] RIP: 0010:[<ffffffff811f03b3>] [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
> [345132.743220] RSP: 0018:ffff88031ce75da8 EFLAGS: 00010246
> [345132.744469] RAX: 0000000000000000 RBX: ffff8802cbd54a68 RCX: 000000000061c044
> [345132.746061] RDX: 0000000000000005 RSI: ffff88031ce75e87 RDI: ffff8802cbd54d1c
> [345132.747726] RBP: ffff88031ce75eb8 R08: 0000000000000000 R09: 0000000000000000
> [345132.749391] R10: 0000000000616680 R11: 0000000000000246 R12: 000000000061c044
> [345132.750981] R13: ffff8802cbd54800 R14: 0000000000000000 R15: 7fffffffffffffff
> [345132.752650] FS: 00007ffff7fee6f0(0000) GS:ffff880033020000(0000) knlGS:0000000000000000
> [345132.754569] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [345132.755915] CR2: 0000000000000005 CR3: 000000030c408000 CR4: 00000000000006e0
> [345132.757579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [345132.759169] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [345132.760778] Process telnet (pid: 13366, threadinfo ffff88031ce74000, task ffff88031b60d580)
> [345132.762707] Stack:
> [345132.763162] ffff88031b60d580 ffff88031b60d580 ffff88031b60d580 ffff88031b60d580
> [345132.764791] <0> 000000000061c02b 0000000000000000 0000000000000000 000000000061c02a
> [345132.766510] <0> ffff8802de651a40 ffff8802cbd549c0 ffff8802cbd54c90 ffff8802cbd54d1c
> [345132.768270] Call Trace:
> [345132.768877] [<ffffffff81045f84>] ? default_wake_function+0x0/0xf
> [345132.770309] [<ffffffff811ebf7e>] tty_read+0x7d/0xba
> [345132.771526] [<ffffffff810ebcc8>] vfs_read+0xab/0x167
> [345132.772541] [<ffffffff810ebe48>] sys_read+0x47/0x6f
> [345132.773526] [<ffffffff8100bbc2>] system_call_fastpath+0x16/0x1b
> [345132.774652] Code: 00 41 8b 85 5c 02 00 00 48 8b 9d 78 ff ff ff f0 0f
> b3 03 45 19 f6 49 63 95 5c 02 00 00 49 8b 85 50 02 00 00 48 8b bd 48 ff
> ff ff <0f> be 1c 10 e8 fc 6b 0e 00 48 89 c6 41 8b 85 5c 02 00 00 41 ff
> [345132.778840] RIP [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
> [345132.780107] RSP <ffff88031ce75da8>
> [345132.780969] CR2: 0000000000000005
> [345132.781786] hpwdt: New timer passed in is 120 seconds.
> [345132.782942] hpwdt: timer reset to 120 for kdump
> 
> 
> After analysis, we figured out that the crash occurs in the function
> n_read_tty of kernel-source/drivers/char/n_tty.c . The oops occurred on
> linux kernel 2.6.32. Below is the code fragment where the page fault
> occurred. The page fault occurs when executing the statement c =
> tty->read_buf[tty->read_tail] .
> 
> /* N.B. avoid overrun if nr == 0 */
> while (nr && tty->read_cnt) {
> 
> int eol;
> 
> eol = test_and_clear_bit(tty->read_tail,
> tty->read_flags);
> c = tty->read_buf[tty->read_tail]; //
> page fault statement after analyzing oops
> 
> 
> spin_lock_irqsave(&tty->read_lock, flags);
> tty->read_tail = ((tty->read_tail+1) &
> (N_TTY_BUF_SIZE-1));
> tty->read_cnt--;
> if (eol) {
> /* this test should be
> redundant:
> * we shouldn't be reading data
> if
> * canon_data is 0
> */
> if (--tty->canon_data < 0)
> tty->canon_data = 0;
> }
> spin_unlock_irqrestore(&tty->read_lock,
> flags);
> 
> 
> 
> 
> Below is the contents of the structure tty_struct ( at the time of
> oops
> ). This was passed as an argument to the function n_read_tty().
> 
> tty_struct ffff8802cbd54800
> struct tty_struct { ...
> magic = 21505,
> driver = 0xffff88031b54ea00,
> ops = 0xffffffff8130f650,
> name = "pts9\000\...",
> driver_data = 0xffff88029c8a9668,
> icanon = 1 '\001',
> read_buf = 0xffff8802cbfe6000 "",
> read_head = 0,
> read_tail = 0,
> read_cnt = 0,
> read_flags = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
> canon_data = 0,
> ......................................
> 
> As per crash utility the field read_cnt is 0 when kernel oopsed.In that
> case, the statement while (nr && tty->read_cnt) in the above code
> fragment should have failed. This leads me to think that there was some
> other thread/task in kernel which should have updated the read_cnt
> field in parallel. However the crash utility reports that the runqueue
> of all CPUs at the time of crash as idle. Except CPU1 which was
> executing the user program telnet in kernel context ( system call ).
> Below is the runqueue output.
> 
> CPU 0 RUNQUEUE: ffff880033012d80
> CURRENT: PID: 0 TASK: ffffffff814204b0 COMMAND: "swapper"
> RT PRIO_ARRAY: ffff880033012e98
> [no tasks queued]
> CFS RB_ROOT: ffff880033012e10
> [no tasks queued]
> 
> CPU 1 RUNQUEUE: ffff880033032d80
> CURRENT: PID: 13366 TASK: ffff88031b60d580 COMMAND: "telnet"
> RT PRIO_ARRAY: ffff880033032e98
> [no tasks queued]
> CFS RB_ROOT: ffff880033032e10
> [no tasks queued]
> 
> CPU 2 RUNQUEUE: ffff880033052d80
> CURRENT: PID: 0 TASK: ffff88031e0e3540 COMMAND: "swapper"
> RT PRIO_ARRAY: ffff880033052e98
> [no tasks queued]
> CFS RB_ROOT: ffff880033052e10
> [no tasks queued]
> 
> CPU 3 RUNQUEUE: ffff880033072d80
> CURRENT: PID: 0 TASK: ffff88031e113580 COMMAND: "swapper"
> RT PRIO_ARRAY: ffff880033072e98
> [no tasks queued]
> CFS RB_ROOT: ffff880033072e10
> [no tasks queued]
> 
> 
> How is this logically possible. Crash reports there are no tasks running
> currently. Or before the oops trigger and kdump capturing the memory
> image, some process/thread ran which could have updated the data
> structure. I wanted to know if this scenario is possible. I kindly
> request your suggestion/guidance. Please let me know if you need any
> other details.

It's not clear to me, but I'm not at all familiar with this code area.
Maybe during hard or soft IRQ handling on this or another cpu?  Presumably 
there would be protection against that happening, and maybe it's of interest
that the very next instruction after the fault is a spin_lock_irqsave()
call, but that's just a wild guess on my part...

Anyway, the crash utility shows what the state of memory was at the point
when the "telnet" process (indirectly) issued NMI interrupts to all of
the other cpus.  You can verify where the other cpus were (in idle) by
"bt -a", which shows/verifies the reception of the NMI shutdown interrupt.

Dave