[Crash-utility] [PATCH] Speed up "kmem -[sS]" by optimizing is_page_ptr()

Dave Anderson anderson at redhat.com
Tue Feb 20 16:32:44 UTC 2018



----- Original Message -----
> Hi Dave,
> 
> On 2/16/2018 4:18 PM, Dave Anderson wrote:
> ...
> >>>> OK, I understand your point.  But what concerns me is that the
> >>>> function's purpose is to absolutely identify whether the incoming page
> >>>> structure address is a correct page structure address.  But if an
> >>>> invalid address gets passed into is_page_ptr(), your patch would take
> >>>> the invalid address, calculate an invalid "nr", and continue from
> >>>> there, right?
> >>
> >> Yes, if an invalid "nr" is the number of a section that does not exist,
> >> valid_section_nr() would return 0.  Even if it happens to be the number
> >> of an existing section, the invalid "addr" is either not between mem_map
> >> and end_mem_map, or not page-aligned -- because if it were, it would be
> >> a page structure address.
> >>
> >> Also, without this patch, when an invalid address comes in, the loop
> >> could try many invalid "nr"s less than NR_MEM_SECTIONS().
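> >>
> >> In code terms, the per-section check already rejects such an address
> >> (a paraphrase of the existing logic in memory.c, not a verbatim quote):
> >>
> >>   if ((sec_addr = valid_section_nr(nr))) {  /* 0 => no such section */
> >>           ...
> >>           if ((addr >= mem_map) && (addr < end_mem_map) &&
> >>               !((addr - mem_map) % SIZE(page)))
> >>                   /* only then is addr accepted as a page structure */
> >>   }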
> >>
> >> I hope this answers your concern..
> >>
> >>>
> >>> Another suggestion/question -- if is_page_ptr() is called with a NULL
> >>> phys argument (as is done most of the time), could it skip the "if
> >>> IS_SPARSEMEM()" section at the top, and still utilize the part at the
> >>> bottom, where it walks through the vt->node_table[x] array?  I'm not
> >>> sure about the "ppend" calculation though -- even if there are holes in
> >>> the node's address space, is it still a contiguous chunk of page
> >>> structure addresses per-node?
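> >>>
> >>> For reference, the part at the bottom is roughly the following walk,
> >>> with the "ppend" calculation in question (paraphrased from memory.c,
> >>> not a verbatim quote):
> >>>
> >>>   for (x = 0; x < vt->numnodes; x++) {
> >>>           nt = &vt->node_table[x];
> >>>           ppstart = nt->mem_map;
> >>>           ppend = ppstart + (nt->size * SIZE(page));  /* in question */
> >>>           if ((addr >= ppstart) && (addr < ppend))
> >>>                   /* candidate page structure address in this node */
> >>>   }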
> >>
> >> I'm still investigating and not sure yet, but I think that SPARSEMEM
> >> using mem_section instead of node_mem_map means page structures could be
> >> non-contiguous per-node, depending on the architecture or configuration.
> >>
> >>   typedef struct pglist_data {
> >>   ...
> >>   #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> >>           struct page *node_mem_map;
> >>
> >> I'll continue to check it.
> > 
> > You are right, but in the case where pglist_data.node_mem_map does *not*
> > exist, the crash utility initializes each vt->node_table[node].mem_map
> > with the node's starting mem_map address by using the return value from
> > phys_to_page() of the node's starting physical address -- which uses the
> > sparsemem functions.
> >  
> > The question is whether the current "ppend" calculation is correct for
> > the last physical page in a node.  If it is not correct, then perhaps a
> > "mem_map_end" value can be added to the node_table structure, initialized
> > by using phys_to_page() to get the page address of the last physical
> > address in the node.  And then in that case, the question is whether the
> > mem_map range of virtual addresses is contiguous -- even if there are
> > holes in the mem_map virtual address range.
> 
> "node_size" is set to pglist_data.node_spanned_pages, which includes holes.
> So I think that if VMEMMAP, which a page address is linear against its pfn,
> the current "ppend" calculation is correct for the last page in a node.
> But if not VMEMMAP, since there is no guarantee of the linearity, the
> calculation could be incorrect.
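> 
> To illustrate the two cases (a sketch, not actual crash code):
> 
>   /* VMEMMAP: a page address is linear in its pfn, so this holds:  */
>   page  = VMEMMAP_VADDR + pfn * SIZE(page);
>   ppend = nt->mem_map + (nt->size * SIZE(page));   /* correct      */
> 
>   /* !VMEMMAP: each section's mem_map is allocated separately, so  */
>   /* consecutive sections need not be virtually adjacent, and the  */
>   /* same ppend formula can land far away from the real last page. */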
> 
> I found an example with RHEL5:
> 
> crash> help -o
> ...
>                     size_table:
>                           page: 56
> ...
> crash> kmem -n
> NODE    SIZE      PGLIST_DATA       BOOTMEM_DATA       NODE_ZONES
>   0    524279   ffff810000014000  ffffffff804e1900  ffff810000014000
>                                                     ffff810000014b00
>                                                     ffff810000015600
>                                                     ffff810000016100
>     MEM_MAP       START_PADDR  START_MAPNR
> ffff8100007da000       0            0
> 
> ZONE  NAME         SIZE       MEM_MAP      START_PADDR  START_MAPNR
>   0   DMA          4096  ffff8100007da000            0            0
>   1   DMA32      520183  ffff810000812000      1000000         4096
>   2   Normal          0                 0            0            0
>   3   HighMem         0                 0            0            0
> 
> -------------------------------------------------------------------
> 
> NR      SECTION        CODED_MEM_MAP        MEM_MAP       PFN
>  0  ffff810009000000  ffff8100007da000  ffff8100007da000  0
>  1  ffff810009000008  ffff8100007da000  ffff81000099a000  32768
>  2  ffff810009000010  ffff8100007da000  ffff810000b5a000  65536
>  3  ffff810009000018  ffff8100007da000  ffff810000d1a000  98304   <= there is a
>  4  ffff810009000020  ffff810008901000  ffff810009001000  131072  <= mem_map gap.
>  5  ffff810009000028  ffff810008901000  ffff8100091c1000  163840
>  :
> 14  ffff810009000070  ffff810008901000  ffff81000a181000  458752
> 15  ffff810009000078  ffff810008901000  ffff81000a341000  491520
> crash>
> 
> In this case, the "ppend" will be
> 
>   0xffff8100007da000 + (524279 * 56)
>   = 0xffff8100023d9e08
> 
> but it looks like the actual value is around 0xffff81000a501000.

Right, I understand that the current "ppend" calculation wouldn't work.

> And also, we can see the gap between NR=3 and 4.  This means that even if
> the correct "mem_map_end" is added to the node_table structure, it would
> not be enough to check whether an address is a page structure.

Why?  Wouldn't it still give us an ascending range of page structure addresses
on a per-node basis?  (even if there was a physical and/or virtual memory hole?) 
AFAICT, for each section NR, the MEM_MAP and PFN values always increment.
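
If so, a hypothetical per-node check could look like this (sketch only;
"mem_map_end" would be the new node_table member discussed above):

  for (x = 0; x < vt->numnodes; x++) {
          struct node_table *nt = &vt->node_table[x];
          if ((addr >= nt->mem_map) && (addr < nt->mem_map_end) &&
              !((addr - nt->mem_map) % SIZE(page)))
                  return TRUE;  /* inside the node's mem_map range */
  }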

Dave

> Thanks,
> Kazuhito Hagio
> 
> > 
> > Thanks,
> >   Dave
> > 
> > 
> > 
> >>
> >> Thanks,
> >> Kazuhito Hagio
> >>
> >>>
> >>>>
> >>>> Dave
> >>>>
> >>>>>
> >>>>>>
> >>>>>> There is really no compelling reason that count_partial() absolutely
> >>>>>> *must* use is_page_ptr(), and so I'm thinking that perhaps you could
> >>>>>> come up with a less heavy-handed method for simply testing whether a
> >>>>>> page.lru entry points to another vmemmap'd page.  Something along
> >>>>>> the lines of adding this for vmemmap-enabled kernels:
> >>>>>>
> >>>>>>   #define IN_VMEMMAP_RANGE(page) \
> >>>>>>           ((page >= VMEMMAP_VADDR) && (page <= VMEMMAP_END))
> >>>>>>
> >>>>>> and then have count_partial() replace the is_page_ptr() call with
> >>>>>> another slub function that does something like this for
> >>>>>> vmemmap-enabled kernels:
> >>>>>>
> >>>>>>    (IN_VMEMMAP_RANGE(next) && accessible(next))
> >>>>>>
> >>>>>> Or instead of accessible(), it could read "next" as a list_head with
> >>>>>> RETURN_ON_ERROR, and verify that next->prev points back to the
> >>>>>> current list_head.
> >>>>>>
> >>>>>> Non-vmemmap-enabled kernels could still use is_page_ptr().
> >>>>>>
> >>>>>> What do you think of doing something like that?
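> >>>>>>
> >>>>>> To be concrete, the list_head variant could be a sketch like this,
> >>>>>> where "curr" is the address of the current list_head (untested, and
> >>>>>> assuming crash's kernel_list_head and readmem()):
> >>>>>>
> >>>>>>   struct kernel_list_head lh;
> >>>>>>
> >>>>>>   if (!IN_VMEMMAP_RANGE(next))
> >>>>>>           return FALSE;
> >>>>>>   if (!readmem(next, KVADDR, &lh, sizeof(lh), "page.lru",
> >>>>>>       RETURN_ON_ERROR|QUIET))
> >>>>>>           return FALSE;
> >>>>>>   return ((ulong)lh.prev == curr);  /* next->prev points back */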
> >>>>>
> >>>>> Given the possible compatibility issues you mentioned, I think that
> >>>>> the way you suggested might well be enough for now.  I'll try a
> >>>>> method like the above.
> >>>>>
> >>>>> Thanks,
> >>>>> Kazuhito Hagio
> >>>>>
> >>>>>>
> >>>>>> Dave
> >>>>>>
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> The "kmem -[sS]" commands can take several minutes to complete with
> >>>>>>> the following conditions:
> >>>>>>>   * The system has a lot of memory sections with CONFIG_SPARSEMEM.
> >>>>>>>   * The kernel uses SLUB and it has a very long partial slab list.
> >>>>>>>
> >>>>>>>   crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
> >>>>>>>   10:18:34 CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL   SLABS  SSIZE
> >>>>>>>   10:19:41 ffff88017fc78a00 dentry                   192    9038949  10045728  239184     8k
> >>>>>>>   crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
> >>>>>>>   334
> >>>>>>>
> >>>>>>> One of the causes is that is_page_ptr() in count_partial() checks if
> >>>>>>> a given slub page address is a page struct by searching all memory
> >>>>>>> sections linearly for the one which includes it.
> >>>>>>>
> >>>>>>>         nr_mem_sections = NR_MEM_SECTIONS();
> >>>>>>>         for (nr = 0; nr < nr_mem_sections ; nr++) {
> >>>>>>>                 if ((sec_addr = valid_section_nr(nr))) {
> >>>>>>>                         ...
> >>>>>>>
> >>>>>>> With CONFIG_SPARSEMEM{_VMEMMAP}, we can calculate the memory
> >>>>>>> section which includes a page struct from its page.flags, or from
> >>>>>>> its address and VMEMMAP_VADDR.  By doing so, this patch
> >>>>>>> significantly reduces the amount of computation in that case.
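> >>>>>>>
> >>>>>>> As a worked example of the VMEMMAP case (hypothetical numbers,
> >>>>>>> assuming x86_64 with 4k pages and SECTION_SIZE_BITS of 27, i.e.
> >>>>>>> 32768 pages per section):
> >>>>>>>
> >>>>>>>   /* addr of the page struct for pfn 98304:                       */
> >>>>>>>   /*   addr = VMEMMAP_VADDR + 98304 * SIZE(page)                  */
> >>>>>>>   pfn = (addr - VMEMMAP_VADDR) / SIZE(page);  /* = 98304          */
> >>>>>>>   nr  = pfn_to_section_nr(pfn);               /* = 98304/32768 = 3 */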
> >>>>>>>
> >>>>>>>   crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
> >>>>>>>   10:34:55 CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL   SLABS  SSIZE
> >>>>>>>   10:34:55 ffff88017fc78a00 dentry                   192    9038949  10045728  239184     8k
> >>>>>>>   crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
> >>>>>>>   2
> >>>>>>>
> >>>>>>> This patch uses VMEMMAP_VADDR.  It is not defined on PPC64, but it
> >>>>>>> looks like PPC64 supports the VMEMMAP flag and that
> >>>>>>> machdep->machspec->vmemmap_base serves that purpose, so this patch
> >>>>>>> also defines it for PPC64.  This might need some help from PPC
> >>>>>>> folks.
> >>>>>>>
> >>>>>>> Signed-off-by: Kazuhito Hagio <k-hagio at ab.jp.nec.com>
> >>>>>>> ---
> >>>>>>>  defs.h   |  2 ++
> >>>>>>>  memory.c | 15 +++++++++++++++
> >>>>>>>  2 files changed, 17 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/defs.h b/defs.h
> >>>>>>> index aa17792..84e68ca 100644
> >>>>>>> --- a/defs.h
> >>>>>>> +++ b/defs.h
> >>>>>>> @@ -3861,6 +3861,8 @@ struct efi_memory_desc_t {
> >>>>>>>  #define IS_VMALLOC_ADDR(X) machdep->machspec->is_vmaddr(X)
> >>>>>>>  #define KERNELBASE      machdep->pageoffset
> >>>>>>>  
> >>>>>>> +#define VMEMMAP_VADDR   (machdep->machspec->vmemmap_base)
> >>>>>>> +
> >>>>>>>  #define PGDIR_SHIFT     (machdep->pageshift + (machdep->pageshift -3) + (machdep->pageshift - 2))
> >>>>>>>  #define PMD_SHIFT       (machdep->pageshift + (machdep->pageshift - 3))
> >>>>>>>  
> >>>>>>> diff --git a/memory.c b/memory.c
> >>>>>>> index 0df8ecc..0696763 100644
> >>>>>>> --- a/memory.c
> >>>>>>> +++ b/memory.c
> >>>>>>> @@ -13348,10 +13348,25 @@ is_page_ptr(ulong addr, physaddr_t *phys)
> >>>>>>>  	ulong nr_mem_sections;
> >>>>>>>  	ulong coded_mem_map, mem_map, end_mem_map;
> >>>>>>>  	physaddr_t section_paddr;
> >>>>>>> +#ifdef VMEMMAP_VADDR
> >>>>>>> +	ulong flags;
> >>>>>>> +#endif
> >>>>>>>  
> >>>>>>>  	if (IS_SPARSEMEM()) {
> >>>>>>>  		nr_mem_sections = NR_MEM_SECTIONS();
> >>>>>>> +#ifdef VMEMMAP_VADDR
> >>>>>>> +		nr = nr_mem_sections;
> >>>>>>> +		if (machdep->flags & VMEMMAP)
> >>>>>>> +			nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
> >>>>>>> +		else if (readmem(addr + OFFSET(page_flags), KVADDR, &flags,
> >>>>>>> +			sizeof(ulong), "page.flags", RETURN_ON_ERROR|QUIET))
> >>>>>>> +			nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT())
> >>>>>>> +				& ((1UL << SECTIONS_SHIFT()) - 1));
> >>>>>>> +
> >>>>>>> +		if (nr < nr_mem_sections) {
> >>>>>>> +#else
> >>>>>>>  	        for (nr = 0; nr < nr_mem_sections ; nr++) {
> >>>>>>> +#endif
> >>>>>>>  	                if ((sec_addr = valid_section_nr(nr))) {
> >>>>>>>  	                        coded_mem_map = section_mem_map_addr(sec_addr);
> >>>>>>>  	                        mem_map = sparse_decode_mem_map(coded_mem_map, nr);
> >>>>>>> --
> >>>>>>> 1.8.3.1