[libvirt] [PATCH 0/9] Yet another version of CAT stuff (no idea about the version number)

Fri Dec 15 15:11:48 UTC 2017


On 2017-12-15 17:06, Martin Kletzander wrote:
> On Thu, Dec 14, 2017 at 07:46:27PM +0800, Eli wrote:
>>
>>>>
>>>>     @Eli: Can you help with the testing?
>>>>
>>
>> It seems the interface only implements the isolated case. I remember
>> that you proposed something for the overlapping case as well?
>>
>
> Hi, yes.  It got a bit more complicated so I want to do this
> incrementally.  First enable the easiest cases, then add APIs to manage
> the system's default group, make type='both' allocation work on
> CDP-enabled hosts, add APIs for modifying cachetunes for live and
> stopped domains, add support for memory bandwidth allocation, and so
> on.  This is too much stuff to add in one go.
>
> I guess I forgot to add this info to the cover letter (I think I did at
> least for the previous version).
>
> I also wasted some time on the tests; some of them are not even in the
> patchset.  Have a look at the previous version if you want to see them.
>
OK, sorry for not watching the libvirt list for some time.
>> I have not seen the whole patch set yet, but I did some quick testing on
>> your patches and will try to find more time to review them. (Currently I
>> maintain another daemon dedicated to the RDT feature, called RMD.)
>>
>> Only issue 1 is a true issue; the others, I think, should
>> be discussed or treated as 'known issues'.
>>
>> My env:
>>
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              56320K
>> NUMA node0 CPU(s):     0-21,44-65
>> NUMA node1 CPU(s):     22-43,66-87
>>
>>
>> virsh capabilities:
>>
>> <cache>
>>   <bank id='0' level='3' type='both' size='55' unit='MiB' cpus='0-21,44-65'>
>>     <control granularity='2816' unit='KiB' type='both' maxAllocs='16'/>
>>   </bank>
>>   <bank id='1' level='3' type='both' size='55' unit='MiB' cpus='22-43,66-87'>
>>     <control granularity='2816' unit='KiB' type='both' maxAllocs='16'/>
>>   </bank>
>> </cache>
>>
>> Issues:
>>
>> 1. It doesn't support asynchronous cache allocation. E.g., I have to
>> provide the required ways for every cache, but I only care about the
>> allocation on one of the cache ids, because the VM won't be scheduled
>> on another cache (socket).
>>
>
> Oh, really?  This is not written in the kernel documentation. Can't the
> unspecified caches just inherit the setting from the default group?
> That would make sense.  It would also automatically adjust if the
> default system one is changed.
>
Maybe I did not express myself clearly. Yes, the caches will be added to
the default resource group.
> Do you have any contact to anyone working on the RDT in the kernel? I
> think this would save time and effort to anyone who will be using the
> feature.
Sure, fenghua.yu at intel.com and Tony Luck <tony.luck at intel.com>

Kernel doc:
https://github.com/torvalds/linux/blob/master/Documentation/x86/intel_rdt_ui.txt

>
>> So I got this error if I define the domain like this:
>>
>>   <vcpu placement='static'>6</vcpu>
>>   <cputune>
>>     <emulatorpin cpuset='0,37-38,44,81-82'/>
>>     <cachetune vcpus='0-4'>
>>       <cache id='0' level='3' type='both' size='2816' unit='KiB'/>
>>       ^^^ cache id='1' is not provided
>>     </cachetune>
>>
>>
>> root@s2600wt:~# virsh start kvm-cat
>> error: Failed to start domain kvm-cat
>> error: Cannot write into schemata file
>> '/sys/fs/resctrl/qemu-qemu-13-kvm-cat-0-4/schemata': Invalid argument
>>
>
> Oh, I have to figure out why there is 'qemu-qemu' :D
>
>> This behavior is not correct.
>>
>> I expect the CBM to look like:
>>
>> root@s2600wt:/sys/fs/resctrl# cat qemu-qemu-14-kvm-cat-0-4/*
>> 000000,00000000,00000000
>> L3:0=80;1=fffff (no matter what it is, because my VM won't be scheduled
>> on it: either I have defined the vcpu->cpu pinning, or I assume that the
>> kernel won't schedule it on cache 1)
>>
>
> Well, it matters.  It would have to have all zeros there so that that
> part of the cache is not occupied.
Well, the hardware won't allow you to specify 0 ways; the minimum is 1 (on
some platforms it's 2 ways).
From my previous experience, I set it to fffff (it will be treated as 0 in
the code).

The minimum is decided by min_cbm_bits,

see 
https://github.com/torvalds/linux/blob/master/Documentation/x86/intel_rdt_ui.txt#L48:14
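
A minimal sketch of the constraint discussed here, assuming the rule from
the kernel documentation that a cache bit mask on Intel must be a single
contiguous run of set bits containing at least min_cbm_bits bits. The
function name and the cbm_len parameter (width of the full mask, 20 for
fffff) are hypothetical and not taken from libvirt or the kernel.

```python
def is_valid_cbm(mask: int, cbm_len: int, min_cbm_bits: int) -> bool:
    """Check a cache bit mask against the assumed resctrl rules.

    Assumptions (not libvirt code): the mask must fit in cbm_len bits,
    be one contiguous run of set bits, and have >= min_cbm_bits bits set.
    """
    # An empty mask (0 ways) or a mask wider than the hardware mask is invalid.
    if mask <= 0 or mask >> cbm_len:
        return False
    # Drop trailing zero bits; a contiguous run then looks like 0b0111...1,
    # so adding 1 gives a power of two and the AND below is zero.
    lowest_set_bit = (mask & -mask).bit_length() - 1
    run = mask >> lowest_set_bit
    if run & (run + 1):
        return False
    return bin(mask).count("1") >= min_cbm_bits
```

With min_cbm_bits=1 this accepts 80 and fffff, and rejects 0 as well as a
mask with a hole such as binary 101.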
>
>> Or at least, restrict the XML when I define this domain and tell me I
>> need to provide all cache ids (even if I have 4 caches but only run my
>> VM on 'cache 0').
>>
>
> We could do that.  It would allow us to make this better (or lift the
> restriction) in case this is "fixed" in the kernel.
>
> Or at least in the future we could do this to meet the users half-way:
>
> - Any vcpus that have cachetune enabled for them must also be pinned
>
> - Users need to specify allocations for all cache ids that the vcpu
>  might run on (according to the pinning acquired from before), for all
>  others we'd just simply set it to all zeros or the same bitmask as the
>  system's default group.
>
> But for now we could just copy the system's setting to unspecified
> caches or request the user to specify everything.
>
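
The "copy the system's setting to unspecified caches" option could be
sketched roughly as follows. This is only an illustration under stated
assumptions: the helper names are hypothetical, the default group's
schemata line is passed in as a string rather than read from
/sys/fs/resctrl, and only a single resource line such as L3 is handled.

```python
from typing import Dict, Tuple


def parse_schemata_line(line: str) -> Tuple[str, Dict[int, int]]:
    """Parse 'L3:0=80;1=fffff' into ('L3', {0: 0x80, 1: 0xfffff})."""
    resource, _, domains = line.strip().partition(":")
    masks = {}
    for item in domains.split(";"):
        cache_id, _, mask = item.partition("=")
        masks[int(cache_id)] = int(mask, 16)
    return resource, masks


def fill_from_default(requested: Dict[int, int], default_line: str) -> str:
    """Build a full schemata line, taking unspecified cache ids from the
    default group's line (hypothetical helper, not libvirt code)."""
    resource, defaults = parse_schemata_line(default_line)
    unknown = set(requested) - set(defaults)
    if unknown:
        raise ValueError(f"unknown cache ids: {sorted(unknown)}")
    # Requested masks win; every other cache id inherits the default mask.
    merged = {cid: requested.get(cid, mask) for cid, mask in defaults.items()}
    body = ";".join(f"{cid}={mask:x}" for cid, mask in sorted(merged.items()))
    return f"{resource}:{body}"
```

For the domain above, fill_from_default({0: 0x80}, "L3:0=fffff;1=fffff")
yields "L3:0=80;1=fffff", which is the line Eli expected to see.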
>> 2. Cache way fragmentation (no good answers)
>>
>> I see that for now we allocate cache ways starting from the low bits, and
>> a newly created VM allocates cache from the next way. If a VM whose ways
>> were allocated in the middle (e.g. its schemata is 00100) is destroyed,
>> that slot (1 cache way) may not fit others and will be wasted. But how
>> can we handle this? There seems to be no good way. Rearrangement? That
>> would lead to cache misses for a time window, I think.
>>
>
> Avoiding fragmentation is not a simple thing.  It's impossible to do
> without any moving, which might be unwanted.  This will be solved by
> providing an API that will let you move the allocation if you so
> desire.  For now I at least try allocating the smallest region into
> which the requested allocation fits, so that the unallocated parts are
> as big as possible.
>
Agree
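
The "smallest region into which the requested allocation fits" strategy
Martin describes is essentially best-fit. A rough sketch, assuming the
used ways for one cache id are already combined into a single bitmask
(for example by OR-ing the masks of all resource groups); the function
names are hypothetical and this is not the code from the patch set.

```python
def free_runs(used: int, cbm_len: int):
    """Yield (start_bit, length) for each maximal run of free (zero) bits."""
    start = None
    # Iterate one position past the end so a trailing free run is closed.
    for bit in range(cbm_len + 1):
        free = bit < cbm_len and not (used >> bit) & 1
        if free and start is None:
            start = bit
        elif not free and start is not None:
            yield start, bit - start
            start = None


def best_fit(used: int, cbm_len: int, ways: int):
    """Return a contiguous mask of `ways` bits placed in the smallest free
    run that can hold it, or None if no run is large enough."""
    candidates = [(length, start)
                  for start, length in free_runs(used, cbm_len)
                  if length >= ways]
    if not candidates:
        return None
    _, start = min(candidates)  # smallest run first, lowest start on ties
    return ((1 << ways) - 1) << start
```

With 8 ways and used = binary 00011011, a request for 1 way lands in the
single-bit hole (mask 00000100) and leaves the 3-way run at the top
intact, while a request for 4 ways returns None even though 4 ways are
free in total.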
>> 3. The admin/user has to manually operate the default resource group,
>> that is to say, after resctrl is mounted, the admin/user has to
>> manually change the schemata of the default group. Will libvirt provide
>> an interface/API to handle it?
>>
>
> Yes, this is planned.
>
>> 4. Will libvirt provide an API like `FreeCacheWay` for the end user to
>> see how many cache ways can be allocated on the host?
>>
>
> Yes, this should be provided by an API as well.
>
>>     Other users/orchestrators (nova) may need to know whether a VM
>> can be scheduled on the host, but the free cache ways are not linear;
>> they may be fragmented.
>>
>> 5. What if another application wants to share some cache ways with
>> some of the VMs?
>>     For now libvirt tries to read all of the resource groups (instead of
>> maintaining the consumed cache ways itself), so if another resource group
>> is created under /sys/fs/resctrl and its schemata is "FFFFF",
>> then libvirt will report that there is not enough room for a new VM. But
>> the user actually wants another application (e.g. ovs, dpdk pmds) to
>> share cache ways with the VM created by libvirt.
>
> Adding support for shared allocations is planned as I said before,
> however, this is something that will need to be taken care of
> differently anyway.  I don't know how specific the use case would be,
> but let's say you want to have 8 cache ways allocated for the VM, but
> share only 4 of them with some DPDK PMD.  You can't use "shared" because
> that would just take some 8 bits even when some of them might be shared
> with the system's default group.  Moreover it means that the allocation
> can be shared with machines run in the future.  So in this case you need
> to have the 8 bits exclusively allocated and then (only after the
> machine is started) pin the PMD process to those 4 cache ways.
>
> For the missing issue from the other email:
>
>> If the host has CDP enabled, that is to say the host reports l3 cache
>> types code and data, then when the user doesn't want code/data cache
>> ways allocated separately, the current implementation reports that
>> `both` type l3 cache is not supported.
>
>> But we can improve this by making the code and data schemata the same,
>> e.g. if the host has CDP enabled but the user requests 2 ways of `both`
>> type l3 cache.
>
>> We can write a schemata that looks like:
>
>> L3DATA:0=3
>> L3CODE:0=3
>
> Yes, that's what we want to achieve, but again, in a future patchset.
>
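
Mirroring a `both` allocation into the code and data lines on a
CDP-enabled host could look roughly like this. It is a sketch of the idea
only: the function name is hypothetical, the cdp_enabled flag is assumed
to be determined elsewhere, and nothing here is from the patch set.

```python
from typing import Dict, List


def both_schemata(masks: Dict[int, int], cdp_enabled: bool) -> List[str]:
    """Return the schemata lines for a type='both' L3 allocation.

    masks maps cache id to bitmask. With CDP enabled the same mask is
    written to both L3DATA and L3CODE; otherwise a single L3 line is used.
    """
    body = ";".join(f"{cid}={mask:x}" for cid, mask in sorted(masks.items()))
    if cdp_enabled:
        return [f"L3DATA:{body}", f"L3CODE:{body}"]
    return [f"L3:{body}"]
```

For the example above, both_schemata({0: 0x3}, True) gives the two lines
"L3DATA:0=3" and "L3CODE:0=3".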
> Hope that answers your questions.  Thanks for trying it out, it is
> really complicated to develop something like this without the actual
> hardware to test it on.
Yep.
>
> Have a nice day,
> Martin
