[Crash-utility] [PATCH] Speed up "kmem -[sS]" by optimizing is_page_ptr()

Tue Feb 20 21:08:05 UTC 2018

Hi Dave,

On 2/20/2018 11:32 AM, Dave Anderson wrote:
...
>>>>> Another suggestion/question -- if is_page_ptr() is called with a NULL
>>>>> phys argument (as is done most of the time), could it skip the
>>>>> "if IS_SPARSEMEM()" section at the top, and still utilize the part at
>>>>> the bottom, where it walks through the vt->node_table[x] array?  I'm
>>>>> not sure about the "ppend" calculation though -- even if there are
>>>>> holes in the node's address space, is it still a contiguous chunk of
>>>>> page structure addresses per-node?
>>>>
>>>> I'm still investigating and not sure yet, but I think that, since
>>>> SPARSEMEM uses mem_section instead of node_mem_map, page structures could
>>>> be non-contiguous per node, depending on the architecture or conditions.
>>>>
>>>>   typedef struct pglist_data {
>>>>   ...
>>>>   #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
>>>>           struct page *node_mem_map;
>>>>
>>>> I'll continue to check it.
>>>
>>> You are right, but in the case where pglist_data.node_mem_map does *not*
>>> exist, the crash utility initializes each vt->node_table[node].mem_map
>>> with the node's starting mem_map address by using the return value from
>>> phys_to_page() of the node's starting physical address -- which uses the
>>> sparsemem functions.
>>>
>>> The question is whether the current "ppend" calculation is correct for the
>>> last physical page in a node.  If it is not correct, then perhaps a
>>> "mem_map_end" value can be added to the node_table structure, initialized
>>> by using phys_to_page() to get the page address of the last physical
>>> address in the node.  And then in that case, the question is whether the
>>> mem_map range of virtual addresses is contiguous -- even if there are
>>> holes in the mem_map virtual address range.
>>
>> "node_size" is set to pglist_data.node_spanned_pages, which includes holes.
>> So I think that with VMEMMAP, where a page address is linear in its pfn,
>> the current "ppend" calculation is correct for the last page in a node.
>> But without VMEMMAP, since there is no guarantee of that linearity, the
>> calculation could be incorrect.
>>
>> I found an example with RHEL5:
>>
>> crash> help -o
>> ...
>>                     size_table:
>>                           page: 56
>> ...
>> crash> kmem -n
>> NODE    SIZE      PGLIST_DATA       BOOTMEM_DATA       NODE_ZONES
>>   0    524279   ffff810000014000  ffffffff804e1900  ffff810000014000
>>                                                     ffff810000014b00
>>                                                     ffff810000015600
>>                                                     ffff810000016100
>>     MEM_MAP       START_PADDR  START_MAPNR
>> ffff8100007da000       0            0
>>
>> ZONE  NAME         SIZE       MEM_MAP      START_PADDR  START_MAPNR
>>   0   DMA          4096  ffff8100007da000            0            0
>>   1   DMA32      520183  ffff810000812000      1000000         4096
>>   2   Normal          0                 0            0            0
>>   3   HighMem         0                 0            0            0
>>
>> -------------------------------------------------------------------
>>
>> NR      SECTION        CODED_MEM_MAP        MEM_MAP       PFN
>>  0  ffff810009000000  ffff8100007da000  ffff8100007da000  0
>>  1  ffff810009000008  ffff8100007da000  ffff81000099a000  32768
>>  2  ffff810009000010  ffff8100007da000  ffff810000b5a000  65536
>>  3  ffff810009000018  ffff8100007da000  ffff810000d1a000  98304   <= there is a
>>  4  ffff810009000020  ffff810008901000  ffff810009001000  131072  <= mem_map gap.
>>  5  ffff810009000028  ffff810008901000  ffff8100091c1000  163840
>>  :
>> 14  ffff810009000070  ffff810008901000  ffff81000a181000  458752
>> 15  ffff810009000078  ffff810008901000  ffff81000a341000  491520
>> crash>
>>
>> In this case, the "ppend" will be
>>
>>   0xffff8100007da000 + (524279 * 56)
>>   = 0xffff8100023d9e08
>>
>> but it looks like the actual value is around 0xffff81000a501000.
> 
> Right, I understand that the current "ppend" calculation wouldn't work.
> 
>> And also, we can see the gap between NR=3 and 4.  This means that even if
>> the correct "mem_map_end" is added to the node_table structure, it would
>> not be enough to check whether an address is a page structure.
> 
> Why?  Wouldn't it still give us an ascending range of page structure addresses
> on a per-node basis?  (even if there was a physical and/or virtual memory hole?) 
> AFAICT, for each section NR, the MEM_MAP and PFN values always increment.

Sorry if I misunderstood something.
First, I assume that we are talking about kernels with SPARSEMEM, and about
using the vt->numnodes loop after skipping the IS_SPARSEMEM() section.

The "mem_map_end" I mean here is the page address of the last physical
address in the node, and the example system has only one node.  So I think
that the "kmem -n" output above suggests that it could return TRUE for an
incoming "addr" between the end of NR=3 and the start of NR=4, but it's
not a page address.

 NR                 MEM_MAP
  0 +---------+ ffff8100007da000 = nt->mem_map
  : | pages.. |        :
  2 +---------+ ffff810000b5a000
  3 +---------+ ffff810000d1a000
    +---------+ ffff810000eda000 = ffff810000d1a000 + (32768 * 56)
    |   ???   |            <-- for an "addr" here, it could return TRUE.
  4 +---------+ ffff810009001000
  5 +---------+ ffff8100091c1000
  : | pages.. |        :
 15 +---------+ ffff81000a341000
    +---------+ ffff81000a501000 = nt->mem_map_end
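
To make the diagram concrete, here is a minimal sketch in C. It is not crash
code: in_node_range(), in_listed_section() and check_example() are
hypothetical names used only for illustration. The addresses, the page
structure size of 56 and the 32768 pages per section are taken from the
"help -o" and "kmem -n" output above. Sections 6 to 13 are elided in that
output, so they are left out here as well, and the alignment check on the
page structure size is omitted to keep the range logic visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_STRUCT_SIZE   56ULL     /* size_table.page from "help -o" */
#define PAGES_PER_SECTION  32768ULL  /* PFN step between sections in "kmem -n" */

/* MEM_MAP of each section that is actually listed in the "kmem -n" output. */
static const uint64_t section_mem_map[] = {
    0xffff8100007da000ULL,  /* NR 0  */
    0xffff81000099a000ULL,  /* NR 1  */
    0xffff810000b5a000ULL,  /* NR 2  */
    0xffff810000d1a000ULL,  /* NR 3  */
    0xffff810009001000ULL,  /* NR 4  */
    0xffff8100091c1000ULL,  /* NR 5  */
    0xffff81000a181000ULL,  /* NR 14 */
    0xffff81000a341000ULL,  /* NR 15 */
};

/* Node-wide check: a single [mem_map, mem_map_end) range for the node. */
static int in_node_range(uint64_t addr, uint64_t mem_map, uint64_t mem_map_end)
{
    return addr >= mem_map && addr < mem_map_end;
}

/* Per-section check: addr must fall inside one listed section's mem_map. */
static int in_listed_section(uint64_t addr)
{
    const uint64_t span = PAGES_PER_SECTION * PAGE_STRUCT_SIZE; /* 0x1c0000 */
    size_t i;

    for (i = 0; i < sizeof(section_mem_map) / sizeof(section_mem_map[0]); i++) {
        uint64_t start = section_mem_map[i];
        if (addr >= start && addr < start + span)
            return 1;
    }
    return 0;
}

/* An address in the "???" area passes the node-wide check only. */
static void check_example(void)
{
    const uint64_t mem_map     = 0xffff8100007da000ULL; /* nt->mem_map */
    const uint64_t mem_map_end = 0xffff81000a501000ULL; /* proposed mem_map_end */
    const uint64_t hole_addr   = 0xffff810001000000ULL; /* between NR=3 and NR=4 */

    assert(in_node_range(hole_addr, mem_map, mem_map_end) == 1);
    assert(in_listed_section(hole_addr) == 0);
}
```

Calling check_example() from a test program passes both assertions: the
address in the gap is accepted by the single node range but by none of the
listed sections.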

Because of such mem_map holes in a node, I don't think that the vt->numnodes
loop could be utilized as it is for kernels with SPARSEMEM.
Is this "mem_map_end" different from the one you assumed?

Thanks,
Kazuhito Hagio