[Crash-utility] timer: invalid list entry: 1

Fri Mar 1 17:19:05 UTC 2013

----- Original Message -----

> I can give some more info.
>
> It is a dual-core ARM system.
> Both cores are stuck at wfi (wait for interrupt),
> and we observe that the timer counter is far ahead of the comparators,
> so we never get a local timer interrupt, and nothing is there to wake the CPUs up.
>
> So we observe the freeze.
>
> Regards,
> Oza.

I don't know much about the ARM architecture, and the only sample
SMP ARM dumpfile I have on hand shows the non-panicking cpu blocked
in default_idle().  So I don't understand how "wfi" would come
into play. 

What does "bt -a" show?

> 
> Some more info:
> I am debugging the crash utility with gdb, and I get the following stack trace.
> 
> crash> timer
> TVEC_BASES[0]: c0a419c0
> JIFFIES
> 4297762
>    EXPIRES  TIMER_LIST  FUNCTION
>        128  c1621ea8    c007260c  <idle_worker_timeout>
>      30208  c0b81f04    c04e4244  <inet_frag_secret_rebuild>
>      30720  c0b7f264    c0461440  <flow_cache_new_hashrnd>
>      30840  dba2be04    c0068ebc  <process_timeout>
>      38228  dbae5e04    c0068ebc  <process_timeout>
>   11796480  c097cb64    c0010aa4  <sched_clock_poll>
> 4294937694  c0a6f118    c026f820  <rx_timeout_handler>
> 4294945658  c16238fc    c007412c  <delayed_work_timer_fn>
> 4294945667  d811be14    c0068ebc  <process_timeout>
> 4294945700  c16237cc    c007412c  <delayed_work_timer_fn>
> 4294945700  c16236e0    c007412c  <delayed_work_timer_fn>
> 4294946020  c0a1dcbc    c007412c  <delayed_work_timer_fn>
> 4294946029  dca8f884    c007412c  <delayed_work_timer_fn>
> 4294946504  c0b871c4    c007412c  <delayed_work_timer_fn>
> 4294950720  c0b81d6c    c007412c  <delayed_work_timer_fn>
> 
> Breakpoint 2, do_list (ld=0xff961c78) at tools.c:3507
> 3507 error(INFO, "\ninvalid list entry: %lx\n", next);
> (gdb) bt
> #0 do_list (ld=0xff961c78) at tools.c:3507
> #1 0x0811de03 in do_timer_list (vec_kvaddr=3699761524, size=256, vec=0x85c9f40, option=0x0, highest=0x0, tv=0xff962ec4) at kernel.c:6983
> #2 0x0811c9d3 in dump_timer_data_tvec_bases_v2 () at kernel.c:6678
> #3 0x0811afac in dump_timer_data () at kernel.c:6370
> #4 0x0811af8a in cmd_timer () at kernel.c:6329
> #5 0x080910a1 in exec_command () at main.c:818
> #6 0x08090ec7 in main_loop () at main.c:766
> #7 0x081bf35a in current_interp_command_loop ()
> #8 0x081bfbcf in captured_command_loop ()
> #9 0x081beddc in catch_errors ()
> #10 0x081c0a9a in captured_main ()
> #11 0x081beddc in catch_errors ()
> #12 0x081c0adc in gdb_main ()
> #13 0x081c0b29 in gdb_main_entry ()
> #14 0x08121590 in gdb_main_loop (argc=2, argv=0xff964014) at gdb_interface.c:76
> #15 0x08090c01 in main (argc=3, argv=0xff964014) at main.c:671
> 
> This is exactly where I hit the invalid entry.

Right, I understand where the error message came from.

The crash utility's do_list() function is simply reporting what
it sees in the list_head-type linked list that it was following.
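
As a rough illustration of that behaviour (a toy model, not the code in
tools.c), the sketch below follows "next" pointers through a simulated
memory image and stops as soon as a pointer leads somewhere unreadable.
The walk_list() name, the dict-based memory, and the "duplicate" check
are all assumptions made for the example; only the addresses and the
"invalid list entry" wording are taken from the output quoted above.

```python
def walk_list(memory, head):
    """Follow 'next' pointers from head until the list wraps back to head.

    memory maps an entry address to its 'next' pointer, a toy stand-in
    for reading a list_head out of the dumpfile.
    Returns (entries_visited, error_message_or_None).
    """
    entries = []
    seen = {head}
    cur = memory[head]
    while cur != head:
        if cur not in memory:
            # The pointer does not refer to readable memory: report and stop.
            return entries, "invalid list entry: %x" % cur
        if cur in seen:
            # The list loops back on itself without passing through head.
            return entries, "duplicate list entry: %x" % cur
        seen.add(cur)
        entries.append(cur)
        cur = memory[cur]
    return entries, None


# A consistent circular list: head -> a -> b -> head.
healthy = {0xc0a419c0: 0xc1621ea8, 0xc1621ea8: 0xc0b81f04, 0xc0b81f04: 0xc0a419c0}

# The same list caught mid-update: one 'next' pointer holds the value 1.
corrupt = {0xc0a419c0: 0xc1621ea8, 0xc1621ea8: 0x1}

print(walk_list(healthy, 0xc0a419c0))
print(walk_list(corrupt, 0xc0a419c0))
```

The point of the sketch is only that the walker has no opinion about why
a pointer is bad; it reports the first value it cannot follow.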

I have only seen these types of timer command errors in
vmcores that were generated with the "snap.so" extension
module, or when running the command on a live system.  
And both of those scenarios make perfect sense because the
underlying kernel was running/modifying the timer-related
data structures while the memory was being copied. 

Presuming that the vmcore was taken with kdump, you would
typically expect that the timer data structures would
be stable.
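
One rough way to sanity-check the timer output quoted above is to
compare each EXPIRES value against JIFFIES with wrap-aware arithmetic.
The sketch below assumes a 32-bit unsigned long on this ARM system and
mirrors the signed-difference idiom of the kernel's time_before() macro;
the helper names are made up for the example, and the values are copied
from the quoted output.

```python
MASK32 = 0xFFFFFFFF  # assumption: jiffies is a 32-bit unsigned long here


def signed32_delta(expires, jiffies):
    """Return (expires - jiffies) interpreted as a signed 32-bit value."""
    d = (expires - jiffies) & MASK32
    return d - (1 << 32) if d & 0x80000000 else d


def is_past_due(expires, jiffies):
    """True if expires lies before jiffies, allowing for counter wrap."""
    return signed32_delta(expires, jiffies) < 0


JIFFIES = 4297762
for expires in (128, 38228, 11796480, 4294937694, 4294950720):
    state = "past due" if is_past_due(expires, JIFFIES) else "pending"
    print("%10d  %s" % (expires, state))
```

A large EXPIRES value such as 4294937694 is not necessarily far in the
future: with a 32-bit counter it can equally be a time from before the
counter wrapped, which is why the signed difference is used.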

Dave