[libvirt] [PATCH 01/12] Introduce virNodeHugeTLB

Fri May 30 13:42:01 UTC 2014

On 30.05.2014 14:46, Daniel P. Berrange wrote:
> On Fri, May 30, 2014 at 01:41:06PM +0200, Michal Privoznik wrote:
>> On 30.05.2014 10:52, Daniel P. Berrange wrote:
>>> On Thu, May 29, 2014 at 10:32:35AM +0200, Michal Privoznik wrote:
>>>>   /**
>>>> + * virNodeHugeTLB:
>>>> + * @conn: pointer to the hypervisor connection
>>>> + * @type: type
>>>> + * @params: pointer to memory parameter object
>>>> + *          (return value, allocated by the caller)
>>>> + * @nparams: pointer to number of memory parameters; input and output
>>>> + * @flags: extra flags; not used yet, so callers should always pass 0
>>>> + *
>>>> + * Get information about host's huge pages. On input, @nparams
>>>> + * gives the size of the @params array; on output, @nparams gives
>>>> + * how many slots were filled with parameter information, which
>>>> + * might be less but will not exceed the input value.
>>>> + *
>>>> + * As a special case, calling with @params as NULL and @nparams
>>>> + * as 0 on input will cause @nparams on output to contain the
>>>> + * number of parameters supported by the hypervisor. The caller
>>>> + * should then allocate @params array, i.e.
>>>> + * (sizeof(@virTypedParameter) * @nparams) bytes and call the API
>>>> + * again.  See virDomainGetMemoryParameters() for an equivalent
>>>> + * usage example.
>>>> + *
>>>> + * Returns 0 in case of success, and -1 in case of failure.
>>>> + */
>>>> +int
>>>> +virNodeHugeTLB(virConnectPtr conn,
>>>> +               int type,
>>>> +               virTypedParameterPtr params,
>>>> +               int *nparams,
>>>> +               unsigned int flags)
>>>
>>> What is the 'type' parameter doing ?
>>
>> Ah, it should be named numa_node rather than type. If type==-1, then overall
>> statistics are returned (number of {available,free} pages accumulated across
>> all NUMA nodes), if type >= 0, info on the specific NUMA node is returned.
>>
>>>
>>> I think in general this API needs a different design. I'd like to have
>>> an API that can request info for all page sizes on all NUMA nods in a
>>> single call. I also think the static unchanging data should be part of
>>> the cpu + NUMA info in the capabilities XML. So the API only reports
>>> info which is changing - ie the available pages.
>>
>> The only problem is, the size of huge pages pool is not immutable. Now it's
>> possible for 2M huge pages to be allocated dynamically:
>>
>> # echo 8 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>
>> and it may be possible for 1GB too in future (what if kernel learns how to
>> do it?). In general, the only thing that we can take unalterable for now is
>> the default size of huge pages. And I wouldn't bet on that either.
>
> Yes, you can in theory change the number of huge pages at an arbitrary
> time, but realistically people mostly only do it immediately at boot.
> With 1 GB pages is will be impossible todo it any time except immediately
> at boot. If you wait a little while, then memory will be too fragmented
> for you to be able to dynamically allocate more 1 GB pages. The same
> is true of 2MB pages to a lesser degree.

IMO no. Processes never ever see physical address (PA). All they see are 
virtual addresses (VA). So there is possibility for kernel to rearrange 
physical memory without effect on the processes in order to gain bigger 
segments of free memory. The only problem are DMA transfers which access 
PA directly but that can be resolved by reserving first X {M,G}B of RAM 
for the transfers. And kernel is doing something similar already - 
Kernel SamePage Merging (KSM). On the other hand, I don't think huge 
page pool size is going to be changed too often. Especially when 
compared to huge page allocation in a userspace processes. So whenever 
mgmt application changes the pool size, it knows it has to refetch 
capabilities. Well, if we document it :-)

>
>>> In the <topology> element we should report the number of pages of each
>>> size that are present in that node. We shouldn't treat huge pages
>>> separately from small pages in this respect.
>>
>> Well, if we do that you'll still need two API calls (and hence you'll lack
>> atomicity): virConnectGetCapabilities() followed by virNodeGetFreePages().
>> On the other hand, the virNodeGetFreePages() itself is not atomic either
>> (from system POV - other processes can jump in and allocate huge pages as
>> libvirtd is executing the API).
>
> I'm not worried about atomicity wrt to free pages - no matter what
> you do you will always race, because there's inherantly a gap between
> when you check the free pages and when the VM boots. I'm more interested
> in the fact that the capabilities shows you the overall NUMA topology
> of the host, and the pages allocated per node are inherantly part of
> that model IMHO.
>
>>>    <host>
>>>      <cpu>
>>>        <arch>x86_64</arch>
>>>        <model>Westmere</model>
>>>        <vendor>Intel</vendor>
>>>        <topology sockets='1' cores='6' threads='2'/>
>>>        <feature name='rdtscp'/>
>>>        <feature name='pdpe1gb'/>
>>>        <feature name='dca'/>
>>>        <feature name='pdcm'/>
>>>        <feature name='xtpr'/>
>>>        <pages units="KiB" size="4"/>
>>>        <pages units="MiB" size="2"/>
>>>        <pages units="GiB" size="1"/>
>>>      </cpu>
>>
>> Right. This makes sense.
>>
>>>
>>>      <topology>
>>>        <cells num='2'>
>>>          <cell id='0'>
>>>            <memory unit='KiB'>3221225472</memory>
>>>            <pages unit="KiB"  size="4">262144</pages>
>>>            <pages unit="MiB"  size="2">0</pages>
>>>            <pages unit="GiB"  size="1">2</pages>
>>>            <cpus num='4'>
>>>              <cpu id='0'/>
>>>              <cpu id='2'/>
>>>              <cpu id='4'/>
>>>              <cpu id='6'/>
>>>            </cpus>
>>>          </cell>
>>>          <cell id='1'>
>>>            <memory unit='KiB'>3221225472</memory>
>>>            <pages unit="KiB"  size="4">0</pages>
>>>            <pages unit="MiB"  size="2">1536</pages>
>>>            <pages unit="GiB"  size="1">2</pages>
>>>            <cpus num='4'>
>>>              <cpu id='1'/>
>>>              <cpu id='3'/>
>>>              <cpu id='5'/>
>>>              <cpu id='7'/>
>>>            </cpus>
>>>          </cell>
>>>        </cells>
>>>      </topology
>>>
>>
>> Well, and this to some extent too. We can report the pool size, but since it
>> may vary it's on the same level as number of free pages. What about:
>>
>> <pages unit='MiB' size='2' free='8' avail='8'/>
>>
>> That way we don't even need a new API (not saying we shouldn't implement it
>> though - API is still better than updating capabilities each single time).
>
> I'm reluctant to include the free count in the capabilities because
> that will encourage people to query the capabilities overly frequently
> which will be pretty inefficient. This is why we only provide free
> memory info there currently and have the separate API call.
>
> If we report the used/free count in the capabilities we're forced into
> refreshing this data frequently. If we only report the overall count
> per node, even if there's the possibility to dynamically change that,
> we have more potential to avoid refreshing it frequently.
>
>>> So then an API call to request the available pages on all nodes
>>> would look something like
>>>
>>>     virNodeGetFreePages(virConnectPtr conn,
>>>                         unsigned int *pages,
>>>                         unsigned int npages,
>>>                         unsigned int startcell,
>>>                         unsigned int cellcount,
>>>                         unsigned long long *counts);
>>>
>>> In this API
>>>
>>>   @pages - array whose elements are the page sizes to request info for
>>>   @npages - number of elements in @pages
>>>   @startcell - ID of first NUMA cell to request data for
>>>   @cellcount - number of cells to request data for
>>>   @counts - array which is @npages * @cellcount in length
>>
>> I wanted to use virTypedParam to make the API open for future addition (e.g.
>> if somebody needs count of surplus or reserved huge pages).
>
> I'm not sure what you mean by 'reserved huge pages' - all non-4k page
> counts we are reporting are "reserved pages".

I mean any of  /sys/kernel/mm/hugepages/hugepages-2048kB/*.
But we can invent virNodeGetResvHugepages() for that.

Michal