[libvirt] OpenStack/libvirt CAT interface

Wed Jan 11 10:39:22 UTC 2017

On Tue, Jan 10, 2017 at 02:18:41PM -0200, Marcelo Tosatti wrote:
> 
> There have been queries about the OpenStack interface 
> for CAT:
> 
> http://bugzilla.redhat.com/show_bug.cgi?id=1299678
> 
> Comment 2 says:
> Sahid Ferdjaoui 2016-01-19 10:58:48 EST
> A spec will have to be addressed, after a first look this feature needs
> some work in several components of Nova to maintain/schedule/consume
> host's cache. I can work on that spec and implementation it when libvirt
> will provides information about cache and feature to use it for guests.
> 
> I could add a comment about parameters to resctrltool, but since
> this depends on the libvirt interface, it would be good to know
> what the libvirt interface exposes first.
> 
> I believe it should be essentially similar to OpenStack's
> "reserved_host_memory_mb":
> 
>         Set the reserved_host_memory_mb to reserve RAM for host
> processes. For
>         the purposes of testing I am going to use the default of 512 MB:
>         reserved_host_memory_mb=512
> 
> But rather use:
> 
>         rdt_cat_cache_reservation=type=code/data/both,size=10mb,cacheid=2;
>                                   type=code/data/both,size=2mb,cacheid=1;...
> 
> (per-vcpu).
> 
> Where cache-id is optional.
> 
> What is cache-id (from Documentation/x86/intel_rdt_ui.txt on recent
> kernel sources):
> Cache IDs
> ---------
> On current generation systems there is one L3 cache per socket and L2
> caches are generally just shared by the hyperthreads on a core, but this
> isn't an architectural requirement. We could have multiple separate L3
> caches on a socket, multiple cores could share an L2 cache. So instead
> of using "socket" or "core" to define the set of logical cpus sharing
> a resource we use a "Cache ID". At a given cache level this will be a
> unique number across the whole system (but it isn't guaranteed to be a
> contiguous sequence, there may be gaps).  To find the ID for each
> logical
> CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
> 
> 
> WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)
> ==========================================================
> 
> For virtualization the following scenario is desired,
> on a given socket:
> 
>         * VM-A with VCPUs VM-A.vcpu-1, VM-A.vcpu-2.
>         * VM-B with VCPUs VM-B.vcpu-1, VM-B.vcpu-2.
> 
> With one realtime workload on each vcpu-2.
> 
> Assume VM-A.vcpu-2 on pcpu 3.
> Assume VM-B.vcpu-2 on pcpu 5.
> 
> Assume pcpus 0-5 on cacheid 0.
> 
> We want VM-A.vcpu-2 to have a certain region of cache reserved,
> and VM-B.vcpu-2 as well. vcpu-1 for both VMs can use the default group
> (that is not have reserved L3 cache).
> 
> This translates to the following resctrltool-style reservations:
> 
>         res.vm-a.vcpu-2
> 
>                 type=both,size=VM-A-RESSIZE,cache-id=0
> 
>         res.vm-b.vcpu-2
> 
>                 type=both,size=VM-B-RESSIZE,cache-id=0
> 
> Which translate to the following in resctrlfs:
> 
>         res.vm-a.vcpu-2
> 
>                 type=both,size=VM-A-RESSIZE,cache-id=0
>                 type=both,size=default-size,cache-id=1
>                 ...
> 
>         res.vm-b.vcpu-2
> 
>                 type=both,size=VM-B-RESSIZE,cache-id=0
>                 type=both,size=default-size,cache-id=1
>                 ...
> 
> Which is what we want, since the VCPUs are pinned.
> 
> 
> res.vm-a.vcpu-1 and res.vm-b.vcpu-1 don't need to
> be assigned to any reservation, which means they'll
> remain on the default group.
> 
> RESTRICTIONS TO THE SYNTAX ABOVE
> ================================
> 
> Rules for the parameters:
> * type=code must be paired with type=data entry.
> 
> ABOUT THE LIST INTERFACE
> ========================
> 
> About an interface for listing the reservations
> of the system to OpenStack.
> 
> I think that what OpenStack needs is to check, before
> starting a guest on a given host, that there is sufficient
> space available for the reservation.
> 
> To do that, it can:
> 
>         1) resctrltool list (the end of the output mentions
>            how much free space available there is), or
>            via resctrlfs directly (have to lock the filesystem,
>            read each directory, AND each schemata, and count
>            number of zero bits).
>         2) Via libvirt
> 
> Should fix resctrltool/API to list amount of contiguous free space
> BTW.

Elements of the libvirt CAT interface:

1) Conversion of kbytes (user specification) --> number of CBM bits
for host.

resctrlfs exposes the CBM bitmask in HW format, where every bit
indicates a portion of L3 cache. Each bit therefore refers
to a number of ways of L3 cache, and hence to a number of kbytes.

Users measure or determine the CAT size per VM, so the specification
should be in kbytes and not in number of bits on any particular host.

If you expose the "schemata" interface to users, they need to
convert between kbytes --> bits of CBM for that particular host.

IMO there is no benefit in exposing this information to higher layers
(in fact you only want to think about it when programming the
HW interface).
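
A minimal sketch of the conversion in item 1, assuming each CBM bit
covers an equal share of the L3 cache (cache size / number of CBM bits)
and that the request is rounded up to whole bits. The function names
and the example numbers (20480 kbytes of L3, 20 CBM bits) are
hypothetical and not taken from resctrlfs or resctrltool:

```python
def kbytes_to_cbm_bits(size_kb, cache_size_kb, cbm_len):
    """Return the number of CBM bits needed to cover size_kb.

    Assumes every CBM bit maps to cache_size_kb / cbm_len kbytes.
    """
    if size_kb <= 0 or cache_size_kb <= 0 or cbm_len <= 0:
        raise ValueError("sizes and CBM length must be positive")
    # Integer ceiling of size_kb / (cache_size_kb / cbm_len).
    bits = -(-size_kb * cbm_len // cache_size_kb)
    if bits > cbm_len:
        raise ValueError("request is larger than the whole cache")
    return bits


def contiguous_mask(nbits, shift=0):
    """Build a mask of nbits consecutive set bits, starting at bit 'shift'."""
    return ((1 << nbits) - 1) << shift


# Hypothetical host: 20480 kbytes of L3 and a 20-bit CBM (1024 kbytes per bit).
bits = kbytes_to_cbm_bits(2500, 20480, 20)
mask = contiguous_mask(bits, shift=4)
```

The same request in kbytes yields a different bit count on a host with
a different cache size or CBM length, which is why the kbytes value is
the portable one to keep in the higher layers.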

2) Sharing of groups.

It is possible that two groups share a certain portion of cache, that is:

	   1   2   3   4   5   6   7  8    (CBM bits)
	[  0   0   1   1   1   1   0  0	  ]		process-A
	[  0   0   0   0   1   1   1  1	  ]		process-B

In this example, processes A and B share bits 5 and 6 of the CBM mask,
which indicate a certain portion of L3 cache.

That scheme could be generalized in a format as follows:

	GroupA.size = X kbytes,
	GroupB.size = Y kbytes,
	(GroupA,GroupB) share Z kbytes.

However, for VMs (and even for normal CAT usage), I don't see any use
for that configuration, because:

	* Determinism is lost: in the shared regions of L3 cache,
	  process-A can reclaim cachelines from process-B's L3 cache.
	* Both applications have to be measured together when determining
	  the shared size.
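
To make the overlap in the diagram concrete, here is a small sketch
that finds the CBM bits two groups share. The helper names are
hypothetical, and mapping "CBM bit n" to integer bit n-1 is only an
assumption for illustration (the overlap itself does not depend on it):

```python
def mask_from_bits(bit_numbers):
    """Build an integer mask from 1-based CBM bit numbers."""
    mask = 0
    for n in bit_numbers:
        mask |= 1 << (n - 1)
    return mask


def shared_bits(mask_a, mask_b):
    """Return the 1-based CBM bit numbers set in both masks."""
    overlap = mask_a & mask_b
    return [i + 1 for i in range(overlap.bit_length()) if (overlap >> i) & 1]


process_a = mask_from_bits([3, 4, 5, 6])
process_b = mask_from_bits([5, 6, 7, 8])
overlap = shared_bits(process_a, process_b)
```

An empty result means the two groups are fully isolated from each
other, which is the case argued for here.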

3) CAT allocation type: both or code/data separation.

Older CAT enabled processors support a CBM bitmask without
separation of code/data, that is, both code and data cachelines
can be reclaimed from a given L3 cache reservation.

This means that an application with the following pattern:

	NR OF ACCESSES	| TYPE OF ACCESS
	10000		| DATA
	100		| CODE
	10000		| DATA
	100		| CODE

Can have a high rate of code memory cache-misses, even
with cache allocation, because the far more frequent data accesses
reclaim the cachelines that held code.

So newer CAT enabled processors support CBM bitmask separation, that is:
you can reserve a certain portion of L3 cache for code and another
portion of L3 cache for data. This is called CDP (Code and Data
Prioritization).

Given a {type=code, type=data} reservation request from a user, with 
different sizes, the host can be:

CDP enabled host: no problem.
Non-CDP enabled host: the reservation can only be shared between code
and data.

Which means that a high rate of code or data misses can be observed.
What is done in resctrlfs, when converting a {type=code, type=data}
reservation to type=both, is to reserve a type=both reservation
with size equal to the sum of the type=code and type=data reservations.

However, it is useful to expose whether the host is CDP enabled or not
to OpenStack, so it can decide whether or not to fail initialization
of a VM with a {type=code,type=data} reservation on a non-CDP host.
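
A sketch of the decision described in item 3, under the assumption
that the caller knows whether the host is CDP enabled. The function
name and the "strict" flag are hypothetical; "strict" stands for the
policy choice of failing instead of falling back to a shared
reservation:

```python
def resolve_reservation(code_kb, data_kb, host_has_cdp, strict=False):
    """Map a {type=code, type=data} request onto what the host supports.

    Returns a list of (type, size_kb) tuples.
    """
    if host_has_cdp:
        # CDP host: code and data keep their separate reservations.
        return [("code", code_kb), ("data", data_kb)]
    if strict:
        # Policy choice: refuse to start rather than lose the separation.
        raise RuntimeError("host is not CDP enabled")
    # Non-CDP host: one shared reservation sized as the sum of both.
    return [("both", code_kb + data_kb)]


on_cdp_host = resolve_reservation(512, 2048, host_has_cdp=True)
on_plain_host = resolve_reservation(512, 2048, host_has_cdp=False)
```

Whether the fallback or the failure is the right default is a policy
question for the layer above, which is the reason to expose the CDP
capability to it.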

4) Size of allocatable reservation:

Other than exposing the L3 cache size, exposing the amount of
reservable L3 cache is also required to determine whether a VM
is eligible to run on a particular host.
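
As an illustration of how the reservable amount could be derived, the
sketch below counts the CBM bits not claimed by any reservation and
the longest contiguous free run. It assumes the caller passes only the
masks of the reserved groups (not the default group) for one cache-id;
the function name and the example masks are hypothetical:

```python
def free_space(cbm_len, group_masks):
    """Return (free_bits, longest_contiguous_free_run) for one cache."""
    used = 0
    for mask in group_masks:
        used |= mask
    full = (1 << cbm_len) - 1
    free = full & ~used

    free_bits = bin(free).count("1")
    longest = run = 0
    for i in range(cbm_len):
        if (free >> i) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return free_bits, longest


# Hypothetical 8-bit CBM with two existing reservations.
total_free, contiguous_free = free_space(8, [0b00001100, 0b01000000])
```

Multiplying the contiguous run by the kbytes covered by one CBM bit
gives the largest single reservation that would still fit, which is
the number a scheduler would compare against the request.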

Options for the libvirt interface:

OPTION-1: expose the full resctrlfs interface
=============================================

There is no point in having OpenStack perform
"1) Conversion of kbytes (user specification) --> number of CBM bits
for host." as detailed above.

So we want to expose kbytes to OpenStack.

OPTION-2: expose sharing of groups
==================================

As noted above, sharing of L3 portions by VMs is not beneficial.

OPTION-3: don't expose cbm bits and don't expose sharing of groups
==================================================================

What remains is the

	"type={both,data,code}, size=X, cache-id= Z"

format.

Together with an interface to expose whether the host is CDP capable,
and another to expose the L3 cache size allocatable at that moment.
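
For illustration, a rough parser for that format. Several details are
assumptions and not settled in this thread: entries are separated by
";", sizes are in kbytes unless suffixed with "kb" or "mb", cache-id
is optional, and a type=code entry must come with a type=data entry
for the same cache-id. All names are hypothetical:

```python
import re

_SIZE_RE = re.compile(r"^(\d+)\s*(kb|mb)?$", re.IGNORECASE)


def parse_size_kb(text):
    """Parse '512', '512kb' or '10mb' into kbytes (assumed units)."""
    match = _SIZE_RE.match(text.strip())
    if not match:
        raise ValueError("bad size: %r" % text)
    value = int(match.group(1))
    unit = (match.group(2) or "kb").lower()
    return value * 1024 if unit == "mb" else value


def parse_reservations(spec):
    """Parse 'type=...,size=...,cache-id=...;...' into a list of dicts."""
    result = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        fields = {}
        for item in entry.split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError("expected key=value, got %r" % item)
            fields[key.strip().lower()] = value.strip()
        if fields.get("type") not in ("both", "code", "data"):
            raise ValueError("type must be both, code or data")
        if "size" not in fields:
            raise ValueError("size is required")
        cache_id = fields.get("cache-id")
        result.append({
            "type": fields["type"],
            "size_kb": parse_size_kb(fields["size"]),
            "cache_id": int(cache_id) if cache_id is not None else None,
        })
    # Rule from the earlier mail: type=code must be paired with type=data.
    for cid in {r["cache_id"] for r in result}:
        types = {r["type"] for r in result if r["cache_id"] == cid}
        if "code" in types and "data" not in types:
            raise ValueError("type=code without type=data")
    return result


parsed = parse_reservations("type=both,size=10mb,cache-id=0")
```

The result carries sizes in kbytes only; turning them into CBM bits
stays a host-side concern, in line with OPTION-3.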