[Crash-utility] Questions on multi-thread for crash

Fri Feb 10 09:04:36 UTC 2023

On Fri, Feb 10, 2023 at 12:42:36AM +0000, HAGIO KAZUHITO(萩尾 一仁) wrote:
> Hi Tao,
> 
> On 2023/02/08 13:34, Tao Liu wrote:
> > Hello,
> > 
> > Recently I made an attempt to introduce a thread pool for crash utility, to
> > optimize the performance of crash.
> 
> Thank you for the attempt, interesting.
> What data made you try to speed up the collection of member offsets?
> First I'm interested in which routines weigh with crash startup.
> 
> To be honest, personally I'm fairly satisfied with the current
> crash-8 startup time :) with commit cd8954023b (thanks to Hatayama-san)
> and maybe the benefit of GDB's parallel loading, which Andrew said.
> 
> [root at rhel91u ~]# time echo quit | crash > /dev/null
> 
> real    0m2.621s
> user    0m2.574s
> sys     0m0.112s
> 
> So I would like to know first whether it's likely to worth looking
> into the slow part, especially if we have such complexity of
> multi-threading or another way.
> 
> Thanks,
> Kazu
> 

Hi Kazu,

The startup time of crash and drgn comparison:

  $ time echo "quit()" | drgn -c /tmp/maple/vmcore -s /tmp/maple/vmlinux > /dev/null
  drgn 0.0.22+2.g33c3e36 (using Python 3.10.0, elfutils 0.185, with libkdumpfile)
  For help, type help(drgn).
  >>> import drgn
  >>> from drgn import NULL, Object, cast, container_of, execscript, offsetof, reinterpret, sizeof
  >>> from drgn.helpers.common import *
  >>> from drgn.helpers.linux import *
  warning: could not get debugging information for:
  kernel modules (could not read depmod: open: /lib/modules/5.19.0-uek+/modules.dep.bin:
  No such file or directory)

  real	0m0.222s
  user	0m1.032s
  sys	0m0.041s

  $ time echo quit | crash /tmp/maple/vmcore /tmp/maple/vmlinux > /dev/null

  real	0m8.266s
  user	0m7.853s
  sys	0m0.248s

As you see, there is a startup time difference between crash and drgn. I haven't
look into drgn's source code to see if drgn have the same amount of startup work
as crash. Then I measured the time consuming status of xx_init() functions of
crash, and gprof the time status:

  $ crash /tmp/maple/vmcore /tmp/maple/vmlinux
  gdb_session_init 		23.761000
  machdep_init POST_RELOC 	0.060000
  show_untrusted_files 		0.000000
  kdump_backup_region_init 	0.001000
  read_in_kernel_config 	1.287000
  kernel_init	 		1724.044000 <<--- a lot of symbols resolving
  machdep_init POST_GDB 	851.425000
  vm_init 			1993.168000 <<--- a lot of symbols resolving                           
  machdep_init POST_VM 		414.680000
  module_init 			287.185000                      
  help_init 			0.005000
  task_init 			459.189000                     
  vfs_init 			301.908000
  net_init 			116.945000
  dev_init 			6.494000
  machdep_init POST_INIT 	9.176000

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 14.47      1.12     1.12                             skip_ws(char const*&, char const*&, char const*)
 14.08      2.21     1.09                             lookup_partial_symbol(objfile*, partial_symtab*, lookup_name_info const&, int, domain_enum_tag)
 13.44      3.25     1.04                             strncmp_iw_with_mode(char const*, char const*, unsigned long, strncmp_iw_mode, language, completion_match_for_lcd*)
  5.81      3.70     0.45                             symbol_matches_search_name(general_symbol_info const*, lookup_name_info const&)
  4.52      4.05     0.35                             gdb::bcache::insert(void const*, int, bool*)
  4.26      4.38     0.33                             default_symbol_name_matcher(char const*, lookup_name_info const&, completion_match_result*)
  4.26      4.71     0.33                             language_defn::get_symbol_name_matcher(lookup_name_info const&) const
  3.23      4.96     0.25                             symbol_matches_domain(language, domain_enum_tag, domain_enum_tag)
  2.84      5.18     0.22        1   220.00   220.00  symval_hash_init
  2.20      5.35     0.17                             read_attribute_value(die_reader_specs const*, attribute*, unsigned int, long, unsigned char const*, bool*)
  2.20      5.52     0.17                             iterative_hash
  1.81      5.66     0.14                             strcmp_iw_ordered(char const*, char const*)
  1.81      5.80     0.14                             psym_lookup_symbol(objfile*, block_enum, char const*, domain_enum_tag)
  1.68      5.93     0.13                             dwarf2_psymtab::readin_p(objfile*) const
  1.55      6.05     0.12   133543     0.00     0.00  symname_hash_install
...

It looks to me that crash takes a lot of time on symbol resolving, so I thought
maybe I can parallel some of those to shorten the startup time.

Thanks,
Tao Liu

> > 
> > One obvious point which can benefit from multi-threading is memory.c:vm_init().
> > There are hundreds of MEMBER_OFFSET_INIT() related symbol resolving functions,
> > and most of the symbols are independent from each other, by careful arrangement,
> > they can be invoked parallelly. By doing so, we can shorten the waiting time of
> > crash starting.
> > 
> > The implementation is abstracted as the following:
> > 
> > Before multi-threading:
> > 	MEMBER_OFFSET_INIT(task_struct_mm, "task_struct", "mm");
> > 	MEMBER_OFFSET_INIT(mm_struct_mmap, "mm_struct", "mmap");
> > 
> > After multi-threading:
> > 	create_threadpool(&pool, 3);
> > 	...
> > 	MEMBER_OFFSET_INIT_PARA(pool, task_struct_mm, "task_struct", "mm");
> > 	MEMBER_OFFSET_INIT_PARA(pool, mm_struct_mmap, "mm_struct", "mmap");
> > 	...
> > 	wait_and_destroy_threadpool(pool);
> > 
> > MEMBER_OFFSET_INIT_PARA just append the task to the work queue of thread pool
> > and continues, it's up to the pool to schedule the worker thread to do the
> > symbol resolving work.
> > 
> > However, after enable multi-threading, I noticed there are always random errors
> > from gdb. From segfault to broken stack, it seems gdb is not thread safe at
> > all...
> > 
> > For example one error listed as follows:
> > 
> > 	Thread 10 "crash" received signal SIGSEGV, Segmentation fault.
> > 	[Switching to Thread 0x7fffc4f00640 (LWP 72950)]
> > 	c_yylex () at /sources/up-crash/gdb-10.2/gdb/c-exp.y:3250
> > 	3250   if (pstate->language ()->la_language != language_cplus
> > 	(gdb) bt
> > 	#0  c_yylex () at /sources/up-crash/gdb-10.2/gdb/c-exp.y:3250
> > 	#1  c_yyparse () at /sources/up-crash/gdb-10.2/gdb/c-exp.c.tmp:2092
> > 	#2  0x00000000006f62d7 in c_parse (par_state=<optimized out>) at /sources/
> >              up-crash/gdb-10.2/gdb/c-exp.y:3414
> > 	#3  0x0000000000894eac in parse_exp_in_context (stringptr=0x7fffc4efeff8,
> >              pc=<optimized out>, block=<optimized out>, comma=0, out_subexp=0x0,
> >              tracker=0x7fffc4efef10, cstate=0x0, void_context_p=0) at parse.c:1122
> > 	#4  0x00000000008951d6 in parse_exp_1 (tracker=0x0, comma=0, block=0x0,
> >              pc=0, stringptr=0x7fffc4efeff8) at parse.c:1031
> > 	#5  parse_expression (string=<optimized out>, string at entry=0x7fffc4eff140
> >              "slab_s", tracker=tracker at entry=0x0) at parse.c:1166
> > 	#6  0x000000000092039a in gdb_get_datatype (req=0x7fffc4eff720) at symtab.c:7239
> > 	#7  gdb_command_funnel_1 (req=0x7fffc4eff720) at symtab.c:7018
> > 	#8  0x00000000009206de in gdb_command_funnel (req=0x7fffc4eff720) at symtab.c:6956
> > 	#9  0x00000000005ad137 in gdb_interface (req=0x7fffc4eff720) at gdb_interface.c:409
> > 	#10 0x00000000005fe76c in datatype_info (name=0xab9700 "slab_s",
> >              member=0xaba8d8 "list", dm=0x0) at symbols.c:5708
> > 	#11 0x0000000000517a85 in member_offset_init_slab_s_list_slab_s_list ()
> >              at memory.c:659
> > 	#12 0x000000000068168f in group_routine (args=<optimized out>) at thpool.c:81
> > 	#13 0x00007ffff7a48b17 in start_thread () from /lib64/libc.so.6
> > 	  #14 0x00007ffff7acd6c0 in clone3 () from /lib64/libc.so.6
> > 	(gdb) p pstate
> > 	$1 = (parser_state *) 0x0
> > 
> > 	$ cat -n /sources/up-crash/gdb-10.2/gdb/c-exp.y
> > 	66  /* The state of the parser, used internally when we are parsing the
> > 	67     expression.  */
> > 	68
> > 	69  static struct parser_state *pstate = NULL;
> > 
> > pstate is a global variable and not thread safe, the value must be changed by
> > someone else...
> > 
> > Now the project has reached a dead end. Because making gdb thread safe is an
> > impossible mission to me. Is there any advice or suggestions? Thanks in advance!
> > 
> > Thanks!
> > Tao Liu
> > 
> > --
> > Crash-utility mailing list
> > Crash-utility at redhat.com
> > https://listman.redhat.com/mailman/listinfo/crash-utility
> > Contribution Guidelines: https://github.com/crash-utility/crash/wiki