[Crash-utility] [PATCH 0/5] [RFC] Multi-thread support for search cmd

HAGIO KAZUHITO(萩尾 一仁) k-hagio-ab at nec.com
Wed Apr 5 08:35:59 UTC 2023


On 2023/03/25 13:12, Tao Liu wrote:
> The primary part of the patchset introduces multithread support for the search
> cmd to improve its performance. A search operation is mainly made up of 2
> steps: 1) readmem the data into a pagebuf, 2) search for specific values within
> the pagebuf. A typical search workflow is as follows:
> 
> for addr from low to high:
> do
> 	readmem(addr, pagebuf)
> 	search_value(value, pagebuf)
> 	addr += pagesize
> done
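> 
> In C terms the single-threaded loop is roughly the following (PAGESIZE,
> readmem() and search_value() are simplified stand-ins for the real crash
> internals):
> 
> 	static void search_linear(unsigned long low, unsigned long high, unsigned long value)
> 	{
> 		static char pagebuf[PAGESIZE];
> 		unsigned long addr;
> 
> 		for (addr = low; addr < high; addr += PAGESIZE) {
> 			/* step 1: read one page of dump memory into the buffer */
> 			if (!readmem(addr, pagebuf, PAGESIZE))
> 				continue;
> 			/* step 2: scan the buffer for the requested value */
> 			search_value(value, pagebuf, PAGESIZE);
> 		}
> 	}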
> 
> There are 2 points where we can accelerate: 1) readmem doesn't have to wait
> for search_value; while search_value is working, readmem can read the next
> pagebuf at the same time. 2) higher addrs don't have to wait for lower addrs;
> they can be processed at the same time if we carefully arrange the output order.
> 
> For point 1, we introduce zones for the pagebuf, e.g. search_value can work on
> zone 0 while readmem prepares the data for zone 1. For point 2, we run
> search_value in multiple threads, e.g. readmem prepares 100 pages as a batch,
> then 4 search_value threads split it: thread 0 handles pages 1~25, thread 1
> handles pages 26~50, thread 2 handles pages 51~75, and thread 3 handles pages
> 76~100.
> 
> A typical workflow of the multithread search implemented in this patchset is
> as follows, with the thread synchronization omitted for brevity:
> 
> pagebuf[ZONE][BATCH]
> zone_index = buf_index = 0
> create_thread(4, search_value)
> for addr from low to high:
> do
> 	if buf_index < BATCH
> 		readmem(addr, pagebuf[zone_index][buf_index++])
> 		addr += pagesize
> 	else
> 		start_thread(pagebuf[zone_index], 0/4 * BATCH, 1/4 * BATCH)
> 		start_thread(pagebuf[zone_index], 1/4 * BATCH, 2/4 * BATCH)
> 		start_thread(pagebuf[zone_index], 2/4 * BATCH, 3/4 * BATCH)
> 		start_thread(pagebuf[zone_index], 3/4 * BATCH, 4/4 * BATCH)
> 		zone_index = (zone_index + 1) % ZONE
> 		buf_index = 0
> 	fi
> done
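> 
> For reference, here is a minimal self-contained pthread sketch of the batch
> partitioning above (readmem_page(), search_page() and the constants are
> illustrative stand-ins, not the actual crash code; the real patches also keep
> the worker threads alive across batches and double-buffer the pages with
> zones):
> 
> 	#include <pthread.h>
> 
> 	#define BATCH      100		/* pages read per batch */
> 	#define NR_THREADS 4		/* search_value threads */
> 	#define PAGESIZE   4096
> 
> 	static char pages[BATCH][PAGESIZE];	/* one zone of page buffers */
> 	static unsigned long page_addr[BATCH];	/* address each slot was read from */
> 	static unsigned long target;		/* value being searched for */
> 
> 	/* assumed helpers: read one page of dump memory / scan one page buffer */
> 	extern int  readmem_page(unsigned long addr, void *buf);
> 	extern void search_page(const void *buf, unsigned long addr, unsigned long value);
> 
> 	struct worker_arg {
> 		int first, last;	/* half-open page range [first, last) */
> 	};
> 
> 	static void *search_worker(void *p)
> 	{
> 		struct worker_arg *w = p;
> 		int i;
> 
> 		/* each thread scans its own disjoint slice: no locking needed */
> 		for (i = w->first; i < w->last; i++)
> 			search_page(pages[i], page_addr[i], target);
> 		return NULL;
> 	}
> 
> 	static void search_range(unsigned long low, unsigned long high, unsigned long value)
> 	{
> 		pthread_t tids[NR_THREADS];
> 		struct worker_arg args[NR_THREADS];
> 		unsigned long addr;
> 		int n = 0, t;
> 
> 		target = value;
> 
> 		for (addr = low; addr < high; addr += PAGESIZE) {
> 			/* the main thread does all the readmem work, page by page */
> 			page_addr[n] = addr;
> 			if (readmem_page(addr, pages[n]))
> 				n++;
> 
> 			if (n < BATCH && addr + PAGESIZE < high)
> 				continue;	/* keep filling the current batch */
> 
> 			/* batch full or range exhausted: fan the scanning out */
> 			for (t = 0; t < NR_THREADS; t++) {
> 				args[t].first = t * n / NR_THREADS;
> 				args[t].last  = (t + 1) * n / NR_THREADS;
> 				pthread_create(&tids[t], NULL, search_worker, &args[t]);
> 			}
> 			for (t = 0; t < NR_THREADS; t++)
> 				pthread_join(tids[t], NULL);
> 			n = 0;
> 		}
> 	}
> 
> The per-batch pthread_create/pthread_join keeps the sketch short; as the
> pseudocode above shows, the patchset instead creates the threads once up
> front and only hands them a new range per batch.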
> 
> readmem works in the main process and is not multi-threaded, because readmem
> not only reads data from the vmcore and decompresses it, but also walks page
> tables if a virtual address is given. It is hard to reimplement it in a
> thread-safe way, whereas search_value is easier to make thread-safe. By
> carefully choosing the batch size and thread number, we can maximize the
> concurrency.
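> 
> To illustrate why search_value is the easier half to make thread-safe: its
> shared inputs are read-only (the target value and the page data), and one way
> to keep the output lock-free as well (not necessarily what these patches do)
> is to give each thread a private hit list that the main thread prints in
> thread order after joining:
> 
> 	/* per-thread match bookkeeping; realloc() is from <stdlib.h>,
> 	 * error handling omitted for brevity */
> 	struct match {
> 		unsigned long addr;		/* where the value was found */
> 	};
> 
> 	struct worker_result {
> 		struct match *hits;		/* private to one worker thread */
> 		int nr, max;			/* used / allocated entries */
> 	};
> 
> 	static void record_match(struct worker_result *r, unsigned long addr)
> 	{
> 		if (r->nr == r->max) {
> 			r->max = r->max ? r->max * 2 : 64;
> 			r->hits = realloc(r->hits, r->max * sizeof(*r->hits));
> 		}
> 		r->hits[r->nr++].addr = addr;
> 	}
> 
> Since thread N always covers lower pages than thread N+1, printing the
> buffers in thread order preserves the ascending-address output of the
> single-threaded search.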
> 
> The last part of the patchset replaces lseek/read with pread for kcore and
> diskdump vmcores.
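> 
> That change is mechanical; each lseek+read pair becomes a single pread,
> roughly as follows (error returns shown in crash's SEEK_ERROR/READ_ERROR
> style, details simplified):
> 
> 	/* before: two syscalls per read, and the file offset is shared state */
> 	if (lseek(fd, offset, SEEK_SET) == -1)
> 		return SEEK_ERROR;
> 	if (read(fd, buf, cnt) != cnt)
> 		return READ_ERROR;
> 
> 	/* after: one syscall, independent of the fd's current offset */
> 	if (pread(fd, buf, cnt, offset) != cnt)
> 		return READ_ERROR;
> 
> Besides saving a syscall per page, pread does not depend on the file
> descriptor's shared offset.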
> 
> Here is the performance test result chart. Please note that the vmcore and
> kcore were tested separately on 2 different machines. crash-orig is crash
> compiled from clean upstream code, crash-pread is the code with only the
> pread patch applied (patch 5), and crash-multi is the code with only the
> multithread patches applied (patches 1~4).
> 
> ulong search:
> 
>      $ time echo "search abcd" | ./crash-orig vmcore vmlinux > /dev/null
>      $ time echo "search abcd -f 4 -n 4" | ./crash-multi vmcore vmlinux > /dev/null
> 
> 			 45G vmcore				64G kcore
> 		real        user        sys    		real       user       sys
> crash-orig	16m56.595s  15m57.188s  0m56.698s	1m37.982s  0m51.625s  0m46.266s
> crash-pread	16m46.366s  15m55.790s  0m48.894s	1m9.179s   0m36.646s  0m32.368s
> crash-multi	16m26.713s  19m8.722s   1m29.263s	1m27.661s  0m57.789s  0m54.604s
> 
> string search:
> 
>      $ time echo "search -c abcddbca" | ./crash-orig vmcore vmlinux > /dev/null
>      $ time echo "search -c abcddbca -f 4 -n 4" | ./crash-multi vmcore vmlinux > /dev/null
> 
> 			45G vmcore				64G kcore
> 		real        user        sys    		real       user       sys
> crash-orig	33m33.481s  32m38.321s  0m52.771s	8m32.034s  7m50.050s  0m41.478s
> crash-pread	33m25.623s  32m35.019s  0m47.394s	8m4.347s   7m35.352s  0m28.479s
> crash-multi	16m31.016s  38m27.456s  1m11.048s	5m11.725s  7m54.224s  0m44.186s
> 
> Discussion:
> 
> 1) The multithread and pread patches can each improve the performance a
>     bit on their own, so with both patches applied, the performance can be
>     better still.
> 
> 2) Multi-thread search performs much better on tasks where the scanning
>     itself dominates the time, such as string search.

Thank you for the improvement!  Sorry, I've not had time to look at this and 
the cmd_search() code yet due to other tasks.

I think that multithreading is ultimately needed to speed up the search 
processing for huge memory beyond a certain size, so we may well have it 
eventually.  Nice work, but it's complicated, and I'm still not sure whether 
it should be done first.  I would like to see first whether there is room 
for optimization, cleanup, etc.

So if you have any analysis or trial results, please let me know.

Thanks,
Kazu

