[Crash-utility] timer: invalid list entry: 1

Sat Mar 2 10:21:06 UTC 2013

wfi is "wait for interrupt": we let the CPU go idle/dormant when it has nothing to do.
The timer is programmed for whichever thread is due to run earliest, and its interrupt then wakes the CPU up.
Here the timer interrupt is missing on both CPUs.
That means the timer counter has run well ahead of the programmed compare values and will never match them.
So the system freezes, because no interrupts are arriving.
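
To make the counter/compare point concrete, here is a minimal sketch of an equality-match comparator. It is only a model: the 32-bit counter width, the helper names, and the sample values are assumptions for illustration and are not taken from the actual timer hardware on this board.

```python
# Hypothetical model of an equality-match timer comparator (not real hardware code).
MASK32 = 0xFFFFFFFF  # assumption: a 32-bit free-running counter


def ticks_until_match(counter, compare):
    """Ticks until the counter equals the compare value, with 32-bit wraparound."""
    return (compare - counter) & MASK32


def program_compare(counter, delta):
    """Program the comparator 'delta' ticks ahead of the current counter."""
    return (counter + delta) & MASK32


counter = 1000
compare = program_compare(counter, 50)
print(ticks_until_match(counter, compare))       # 50: the interrupt is close

# If the counter has already moved past the compare value, an equality
# match can only happen after the counter wraps all the way around.
counter_past = (compare + 1) & MASK32
print(ticks_until_match(counter_past, compare))  # 4294967295
```

In this model a comparator that is only one tick behind the counter has to wait almost a full wrap period, which for a slow or wide counter looks like "never" from the system's point of view.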

In that freeze, we have a special keyboard interrupt to take a task dump and other dumps.
On the ramdump I have, the crash utility shows:

crash> bt -a
PID: 0      TASK: c097b8b0  CPU: 0   COMMAND: "swapper/0"
bt: WARNING: cannot get stackframe for task

PID: 0      TASK: dc84ca40  CPU: 1   COMMAND: "swapper/1"
bt: WARNING: cannot get stackframe for task

and timer

crash> timer
TVEC_BASES[0]: c0a419c0
JIFFIES
4297762
EXPIRES  TIMER_LIST  FUNCTION
    128   c1621ea8   c007260c  <idle_worker_timeout>
  30208   c0b81f04   c04e4244  <inet_frag_secret_rebuild>
  30720   c0b7f264   c0461440  <flow_cache_new_hashrnd>
  30840   dba2be04   c0068ebc  <process_timeout>
  38228   dbae5e04   c0068ebc  <process_timeout>
11796480   c097cb64   c0010aa4  <sched_clock_poll>
4294937694   c0a6f118   c026f820  <rx_timeout_handler>
4294945658   c16238fc   c007412c  <delayed_work_timer_fn>
4294945667   d811be14   c0068ebc  <process_timeout>
4294945700   c16237cc   c007412c  <delayed_work_timer_fn>
4294945700   c16236e0   c007412c  <delayed_work_timer_fn>
4294946020   c0a1dcbc   c007412c  <delayed_work_timer_fn>
4294946029   dca8f884   c007412c  <delayed_work_timer_fn>
4294946504   c0b871c4   c007412c  <delayed_work_timer_fn>
4294950720   c0b81d6c   c007412c  <delayed_work_timer_fn>

timer: invalid list entry: 1
timer: ignoring faulty timer list at index 44 of timer array

timer: invalid list entry: 1
timer: ignoring faulty timer list at index 44 of timer array
TVEC_BASES[1]: dc85e000
JIFFIES
4297762
EXPIRES  TIMER_LIST  FUNCTION
    384   c0a42ba8   c007260c  <idle_worker_timeout>
4297862   dbec0dfc   c007412c  <delayed_work_timer_fn>
4297897   c162c6e0   c007412c  <delayed_work_timer_fn>
4297962   dbec0ea0   c04a7cec  <estimation_timer>
4297997   c162c7cc   c007412c  <delayed_work_timer_fn>
4300768   dcb36654   c007412c  <delayed_work_timer_fn>
4309824   c0a20024   c0516718  <addrconf_verify>
4327762   dcaabf54   c0068ebc  <process_timeout>
4327808   c162aea8   c007260c  <idle_worker_timeout>
4357762   dbaa3e04   c0068ebc  <process_timeout>
4357762   dbaa3e04   c0068ebc  <process_timeout>
4357888   c0b83fa4   c04e4244  <inet_frag_secret_rebuild>
4357888   c0b84694   c04e4244  <inet_frag_secret_rebuild>
4357888   c0b83fa4   c04e4244  <inet_frag_secret_rebuild>
4357888   c0b84694   c04e4244  <inet_frag_secret_rebuild>
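
When reading the EXPIRES column against JIFFIES, it may help to compare the values the way the kernel's time_after() does, as a signed difference that tolerates wraparound. The sketch below assumes jiffies is a 32-bit value on this ARM target; the helper names are made up for illustration, and the three EXPIRES values are simply picked from the output above.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit jiffies (unsigned long on 32-bit ARM)


def to_signed32(x):
    """Reinterpret a 32-bit unsigned value as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def ticks_from_now(expires, jiffies):
    """Signed distance from jiffies to expires; negative means already due."""
    return to_signed32(expires - jiffies)


JIFFIES = 4297762  # from the timer output above

for expires in (4294950720, 11796480, 4297862):
    print(expires, ticks_from_now(expires, JIFFIES))
# 4294950720 -4314338
# 11796480 7498718
# 4297862 100
```

Under that assumption, the TVEC_BASES[0] entries near 2^32 would be millions of ticks overdue, while the TVEC_BASES[1] entries just above JIFFIES are still in the future. That reading would be consistent with timers not being serviced, but it is only an interpretation of the numbers shown.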

Regards,
Oza.

________________________________
 From: Dave Anderson <anderson at redhat.com>
To: paawan oza <paawan1982 at yahoo.com> 
Cc: "Discussion list for crash utility usage, maintenance and development" <crash-utility at redhat.com> 
Sent: Friday, 1 March 2013 10:49 PM
Subject: Re: [Crash-utility]  timer: invalid list entry: 1

----- Original Message -----

> I would give some more info.
>
> It is dual core system.  (ARM)
> both core are stuck at wfi (wait for interrupt)
> and we observe that the timer counter has gone much ahead of the comparators.
> so we never get a local timer interrupt, and nobody is there to wake the cpu up.
>
> so we observe the freeze.
>
> Regards,
> Oza.

I don't know much about the ARM architecture, and the only sample
SMP ARM dumpfile I have on hand shows the non-panicking cpu blocked
in default_idle().  So I don't understand how "wfi" would come
into play. 

What does "bt -a" show?

> 
> some more info:
> I am debugging the crash utility with gdb, and getting the following stack trace.
> 
> crash> timer
> TVEC_BASES[0]: c0a419c0
> JIFFIES
> 4297762
> EXPIRES TIMER_LIST FUNCTION
> 128 c1621ea8 c007260c <idle_worker_timeout>
> 30208 c0b81f04 c04e4244 <inet_frag_secret_rebuild>
> 30720 c0b7f264 c0461440 <flow_cache_new_hashrnd>
> 30840 dba2be04 c0068ebc <process_timeout>
> 38228 dbae5e04 c0068ebc <process_timeout>
> 11796480 c097cb64 c0010aa4 <sched_clock_poll>
> 4294937694 c0a6f118 c026f820 <rx_timeout_handler>
> 4294945658 c16238fc c007412c <delayed_work_timer_fn>
> 4294945667 d811be14 c0068ebc <process_timeout>
> 4294945700 c16237cc c007412c <delayed_work_timer_fn>
> 4294945700 c16236e0 c007412c <delayed_work_timer_fn>
> 4294946020 c0a1dcbc c007412c <delayed_work_timer_fn>
> 4294946029 dca8f884 c007412c <delayed_work_timer_fn>
> 4294946504 c0b871c4 c007412c <delayed_work_timer_fn>
> 4294950720 c0b81d6c c007412c <delayed_work_timer_fn>
> 
> Breakpoint 2, do_list (ld=0xff961c78) at tools.c:3507
> 3507 error(INFO, "\ninvalid list entry: %lx\n", next);
> (gdb) bt
> #0 do_list (ld=0xff961c78) at tools.c:3507
> #1 0x0811de03 in do_timer_list (vec_kvaddr=3699761524, size=256,
> vec=0x85c9f40, option=0x0, highest=0x0, tv=0xff962ec4) at
> kernel.c:6983
> #2 0x0811c9d3 in dump_timer_data_tvec_bases_v2 () at kernel.c:6678
> #3 0x0811afac in dump_timer_data () at kernel.c:6370
> #4 0x0811af8a in cmd_timer () at kernel.c:6329
> #5 0x080910a1 in exec_command () at main.c:818
> #6 0x08090ec7 in main_loop () at main.c:766
> #7 0x081bf35a in current_interp_command_loop ()
> #8 0x081bfbcf in captured_command_loop ()
> #9 0x081beddc in catch_errors ()
> #10 0x081c0a9a in captured_main ()
> #11 0x081beddc in catch_errors ()
> #12 0x081c0adc in gdb_main ()
> #13 0x081c0b29 in gdb_main_entry ()
> #14 0x08121590 in gdb_main_loop (argc=2, argv=0xff964014) at gdb_interface.c:76
> #15 0x08090c01 in main (argc=3, argv=0xff964014) at main.c:671
> 
> This is exactly where I hit the invalid entry.

Right, I understand where the error message came from.

The crash utility's do_list() function is simply reporting what
it sees in the list_head-type linked list that it was following.
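
As a rough illustration of that kind of check (not the actual do_list() code in tools.c), the sketch below walks a circular list by following "next" pointers and stops when a pointer does not lead to a readable entry. The dictionary standing in for dump memory, the function name, and the addresses are all hypothetical.

```python
def walk_list(memory, head):
    """Follow 'next' pointers from head until the list loops back to head.

    memory maps an address to its (next, prev) pair, standing in for
    reads from the dump. Returns (entries, bad_pointer_or_None).
    """
    entries = []
    seen = {head}
    node = memory[head][0]
    while node != head:
        # A pointer that cannot be read, or that revisits an entry
        # without returning to head, means the list is not intact.
        if node not in memory or node in seen:
            return entries, node
        entries.append(node)
        seen.add(node)
        node = memory[node][0]
    return entries, None


# Intact circular list: head -> 0x200 -> 0x300 -> head
intact = {0x100: (0x200, 0x300), 0x200: (0x300, 0x100), 0x300: (0x100, 0x200)}
print(walk_list(intact, 0x100))           # ([512, 768], None)

# Corrupted list: the entry at 0x200 has a 'next' pointer of 1
corrupt = {0x100: (0x200, 0x300), 0x200: (0x1, 0x100)}
entries, bad = walk_list(corrupt, 0x100)
print("invalid list entry: %x" % bad)     # invalid list entry: 1
```

A "next" value of 1 is not a plausible kernel address, so a walker like this can only report it and give up on that list.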

I have only seen these types of timer command errors in
vmcores that were generated with the "snap.so" extension
module, or when running the command on a live system.  
And both of those scenarios make perfect sense because the
underlying kernel was running/modifying the timer-related
data structures while the memory was being copied. 

Presuming that the crash was taken with kdump, you would
typically expect that the timer data structures would
be stable.

Dave