[libvirt] [RFC PATCH] NUMA tuning support

Osier Yang jyang at redhat.com
Fri May 6 03:45:23 UTC 2011


On 05/05/2011 22:33, Lee Schermerhorn wrote:
> On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
>> Hi, All,
>>
>> This is a simple implementation of NUMA tuning support based on the
>> binary program 'numactl'. Currently it only supports binding memory to
>> specified nodes via the "--membind" option; perhaps it needs to support
>> more, but I'd like to send it out early to make sure the principle is
>> correct.
>>
>> Ideally, NUMA tuning support would be added to qemu-kvm first, so that
>> it could provide command-line options and all libvirt would need to do
>> is pass those options through. Unfortunately qemu-kvm doesn't support
>> it yet, so all we can do for now is use numactl. It forks a process,
>> which is a bit more expensive than qemu-kvm doing the NUMA tuning
>> internally with libnuma, but I guess it shouldn't affect things much.
>>
>> The NUMA tuning XML is like:
>>
>> <numatune>
>>    <membind nodeset='+0-4,8-12'/>
>> </numatune>
>>
>> Any thoughts/feedback is appreciated.
>
> Osier:
>
> A couple of thoughts/observations:
>
> 1) you can accomplish the same thing -- restricting a domain's memory to
> a specified set of nodes -- using the cpuset cgroup that is already
> associated with each domain.  E.g.,
>
> 	cgset -r cpuset.mems=<nodeset>  /libvirt/qemu/<domain>
>
> Or the equivalent libcgroup call.
>
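
A side note: the value can be read back with cgget from the same libcgroup
tools; a small sketch, assuming the domain's cpuset group lives at the path
used above:

    cgget -r cpuset.mems /libvirt/qemu/<domain>
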
> However, numactl is more flexible; especially if you intend to support
> more policies:  preferred, interleave.  Which leads to the question:
>
> 2) Do you really want the full "membind" semantics as opposed to
> "preferred" by default?  Membind policy will restrict the VMs pages to
> the specified nodeset and will initiate reclaim/stealing and wait for
> pages to become available or the task is OOM-killed because of mempolicy
> when all of the nodes in nodeset reach their minimum watermark.  Membind
> works the same as cpuset.mems in this respect.  Preferred policy will
> keep memory allocations [but not vcpu execution] local to the specified
> set of nodes as long as there is sufficient memory, and will silently
> "overflow" allocations to other nodes when necessary.  I.e., it's a
> little more forgiving under memory pressure.

Thanks for the thoughts, Lee,

Yes, we might support "preferred" too, once it's needed.
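
Just to illustrate how the XML above would be used (a rough sketch, the
qemu-kvm path and guest name are only placeholders), the generated command
line would basically be wrapped like:

    # membind: restrict the guest's allocations to the given nodeset
    numactl --membind=+0-4,8-12 /usr/bin/qemu-kvm -name guest1 ...

and a "preferred" mode could map to numactl's --preferred option in the
same way:

    # preferred: prefer one node, silently fall back under memory pressure
    numactl --preferred=1 /usr/bin/qemu-kvm -name guest1 ...

Note that --preferred takes a single node rather than a nodeset, so its
XML form would probably need to differ from the membind one.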

>
> But then pinning a VM's vcpus to the physical cpus of a set of nodes and
> retaining the default local allocation policy will have the same effect
> as "preferred" while ensuring that the VM component tasks execute
> locally to the memory footprint.  Currently, I do this by looking up the
> cpulist associated with the node[s] from  e.g.,
> /sys/devices/system/node/node<i>/cpulist and using that list with the
> vcpu.cpuset attribute.  Adding a 'nodeset' attribute to the
> cputune.vcpupin element would simplify specifying that configuration.

Yes, binding to a specified nodeset can be achieved with the current
<vcpu cpuset="">, but it's not very clear, e.g. here you need to look up
/sys/devices/system/node/node<i>/cpulist manually (see the sketch below).
But I'm not sure it's a good idea to add another attribute "nodeset":
semantically, "nodeset" is already implied by "cpuset", since CPU IDs are
unique regardless of which node they belong to.
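
E.g. with the current syntax one has to do roughly the following (just a
sketch, the node number and cpu range are made up):

    # cat /sys/devices/system/node/node1/cpulist
    4-7

and then copy that range into the domain XML by hand:

    <vcpu cpuset='4-7'>2</vcpu>   <!-- 2 vcpus pinned to node 1's cpus -->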

Regards
Osier



