[libvirt] "[V3] RFC for support cache tune in libvirt"

Marcelo Tosatti mtosatti at redhat.com
Thu Jan 12 10:51:14 UTC 2017


On Thu, Jan 12, 2017 at 08:47:58AM -0200, Marcelo Tosatti wrote:
> On Thu, Jan 12, 2017 at 09:44:36AM +0800, 乔立勇(Eli Qiao) wrote:
> > Hi, it's really good to have you involved in supporting CAT in
> > libvirt/OpenStack.
> > Replies inline.
> > 
> > 2017-01-11 20:19 GMT+08:00 Marcelo Tosatti <mtosatti at redhat.com>:
> > 
> > >
> > > Hi,
> > >
> > > Comments/questions related to:
> > > https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
> > >
> > > 1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
> > >
> > > What does the allocation of code/data look like?
> > >
> > 
> > My plan is to expose new options:
> > 
> > virsh cachetune kvm02 --l3data.count 2 --l3code.count 2
> > 
> > Please note: you can use either l3 alone, or l3data/l3code (if CDP is
> > enabled when mounting the resctrl fs).
> 
> Fine. However, you should be able to emulate a type=both reservation
> (non-CDP) by writing a schemata file with the same CBM bits:
> 
> 		L3code:0=0x000ff;1=0x000ff
> 		L3data:0=0x000ff;1=0x000ff
> 
> (*)
> 
> I don't see how this interface enables that possibility.
> 
> I suppose it would be easier for mgmt software to have it
> done automatically: 
> 
> virsh cachetune kvm02 --l3 size_in_kbytes.
> 
> This would create the reservations as in (*) in resctrlfs, in
> case the host is CDP enabled.
> 
> (also please use kbytes, or give a reason to not use
> kbytes).
> 
> Note: exposing the unit size is fine, as mgmt software might
> decide on a placement of VMs which reduces the amount of L3
> cache reservation rounding (although I doubt anyone is going
> to care about that in practice).
> 
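
To make the unit-size/rounding point concrete, here is a minimal sketch of
the KiB-to-CBM-bits conversion libvirt (or mgmt software) would end up
doing; the cache size and CBM length below are illustrative assumptions,
not values read from a real host:

        l3_size_kb=56320                                 # L3 size per cache-id, in KiB
        cbm_len=20                                       # number of bits in the CBM
        unit_kb=$(( l3_size_kb / cbm_len ))              # allocation granularity: 2816 KiB here
        request_kb=8192                                  # what the user asked for
        bits=$(( (request_kb + unit_kb - 1) / unit_kb )) # round up -> 3 bits here
        printf 'reserve %d contiguous CBM bits (~%d KiB)\n' "$bits" "$(( bits * unit_kb ))"
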
> > > 2) 'nodecachestats' command:
> > >
> > >         3. Add new virsh command 'nodecachestats':
> > >         This API is to expose how much cache resource is left on each
> > >         hardware (cpu socket).
> > >         It will be formatted as:
> > >         <resource_type>.<resource_id>: left size KiB
> > >
> > > Does this take into account that only contiguous regions of cbm masks
> > > can be used for allocations?
> > >
> > >
> > Yes, it is the contiguous region of the CBM; in other words, it's the
> > cache value represented by the default group's CBM.
> > 
> > resctrl doesn't allow setting a non-contiguous CBM (which is a hardware
> > restriction).
> 
> OK.
> 
> > 
> > 
> > > Also, it should return the amount of free cache on each cacheid.
> > >
> > 
> > yes, it is.  resource_id == cacheid
> 
> OK.
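
For illustration, a run of the proposed command on a two-socket host might
then look like the following; the numbers are purely hypothetical (assuming
2816 KiB per CBM bit and 10 contiguous unreserved bits left per cache-id):

        # virsh nodecachestats
        L3.0 : 28160 KiB
        L3.1 : 28160 KiB
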
> > >
> > > 3) The interface should support different sizes for different
> > > cache-ids. See the KVM-RT use case at
> > > https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html
> > > "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
> > >
> > 
> > I don't think it's good to let the user specify cache-ids when doing
> > cache allocation.
> 
> This is necessary for our usecase.
> 
> > The cache-ids used should be derived from the VM's cpu affinity settings.
> 
> The cache-id configuration should match the cpu affinity configuration.
> 
> > eg.
> > 
> > 1. For hosts that have only one cache-id (single-socket hosts), we don't
> > need to set a cache-id.
> 
> Right.
> 
> > 2. If there are multiple cache-ids (sockets), the user should set the
> > vcpu -> pcpu mapping (define a cpuset for the VM), and then we (libvirt)
> > need to compute how much cache to set on which cache-id.
> > Which is to say, the user should set the cpu affinity before cache allocation.
> > 
> > I know that most use cases of CAT are for NFV. As far as I know, NFV uses
> > NUMA and cpu pinning (vcpu -> pcpu mapping), so we don't need to worry
> > about which cache-id we set the cache size on.
> > 
> > So, just let the user specify the cache size (here my proposal is a count
> > of cache units) and let libvirt detect how much cache to set on which
> > cache-id.
> 
> OK, fine, it's OK to not expose this to the user but to calculate it
> internally in libvirt, as long as you recompute the schematas whenever
> the cpu affinity changes. But using different cache-ids in the schemata
> is necessary for our use case.

Hum, thinking again about this, it needs to be per-vcpu. So for the NFV
use-case you want:

	vcpu0: no reservation (belongs to the default group).
	vcpu1: reservation with particular size.

Then, if a vcpu is pinned, "trim" the reservation down to the
particular cache-id it is pinned to.

This is important because it keeps the vcpu0 workload from
interfering with the realtime workload running on vcpu1.
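
A minimal sketch of how that per-vcpu layout could be expressed in resctrl,
assuming vcpu1's host thread id is in $VCPU1_TID and vcpu1 is pinned to a
pcpu on cache-id 1 (group name and mask values are illustrative; for an
exclusive reservation the same bits would also have to be cleared from the
other groups' cache-id 1 masks):

        # vcpu0's thread stays in the default group: no reservation
        mkdir /sys/fs/resctrl/kvm02-vcpu1
        # reserve 2 bits on cache-id 1 only; keep the full mask on cache-id 0
        echo "L3:0=fffff;1=3" > /sys/fs/resctrl/kvm02-vcpu1/schemata
        echo "$VCPU1_TID" > /sys/fs/resctrl/kvm02-vcpu1/tasks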



