[libvirt] Designing XML for HMAT

Thu Jan 9 16:18:02 UTC 2020

Dear list,

QEMU recently gained support for configuring HMAT (see v4.2.0-415-g9b12dfa03a
and friends). HMAT stands for Heterogeneous Memory Attribute Table and
describes various attributes of NUMA nodes. The guest OS or applications can
read this information and fine-tune their optimizations. See [1] for more
info (esp. links in the transcript).

QEMU defines a so-called initiator, which is an attribute of a NUMA node and,
if specified, points to another node that has the best performance to this
node.

For instance:

   -machine hmat=on \
   -m 2G,slots=2,maxmem=4G \
   -object memory-backend-ram,size=1G,id=m0 \
   -object memory-backend-ram,size=1G,id=m1 \
   -numa node,nodeid=0,memdev=m0 \
   -numa node,nodeid=1,memdev=m1,initiator=0 \
   -smp 2,sockets=2,maxcpus=2 \
   -numa cpu,node-id=0,socket-id=0 \
   -numa cpu,node-id=0,socket-id=1

creates a machine with 2 NUMA nodes: node 0 has CPUs, while node 1 has memory
only and its initiator is node 0 (yes, HMAT allows you to create CPU-less
"NUMA" nodes). The initiator of node 0 is not specified, but since the node
has at least one CPU it is its own initiator (and has to be, per the specs).

This could be represented by an attribute of our /domain/cpu/numa/cell
element. For instance like this:

   <domain>
     <vcpu>2</vcpu>
     <cpu>
       <numa>
         <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
         <cell id='1'            memory='1' unit='GiB' initiator='0'/>
       </numa>
     </cpu>
   </domain>
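
To make the mapping concrete, here is a rough Python sketch (not libvirt
code) of how the proposed @initiator attribute could translate into the
-numa node arguments shown earlier. The function name numa_node_args and the
m<id> memdev naming are assumptions made for illustration only; the
@initiator attribute itself is just the proposal from this mail.

```python
import xml.etree.ElementTree as ET

# The proposed XML from above; 'initiator' is the new, proposed attribute.
DOMAIN_XML = """
<domain>
  <vcpu>2</vcpu>
  <cpu>
    <numa>
      <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
      <cell id='1' memory='1' unit='GiB' initiator='0'/>
    </numa>
  </cpu>
</domain>
"""


def numa_node_args(domain_xml):
    """Build '-numa node,...' arguments from /domain/cpu/numa/cell."""
    root = ET.fromstring(domain_xml.strip())
    args = []
    for cell in root.findall("./cpu/numa/cell"):
        node_id = cell.get("id")
        # Assumption: memory backends are named m<id>, as in the example.
        opt = f"node,nodeid={node_id},memdev=m{node_id}"
        initiator = cell.get("initiator")
        if initiator is not None:
            # Only emitted when the cell carries the proposed attribute.
            opt += f",initiator={initiator}"
        args += ["-numa", opt]
    return args


NODE_ARGS = numa_node_args(DOMAIN_XML)
print(" ".join(NODE_ARGS))
```

Node 0 gets no initiator= here, matching the command line above where the
initiator of node 0 is left unspecified.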

Then, QEMU allows us to control two other important memory attributes:

   1) hmat-lb for Latency and Bandwidth

   2) hmat-cache for cache attributes

For example:

   -machine hmat=on \
   -m 2G,slots=2,maxmem=4G \
   -object memory-backend-ram,size=1G,id=m0 \
   -object memory-backend-ram,size=1G,id=m1 \
   -smp 2,sockets=2,maxcpus=2 \
   -numa node,nodeid=0,memdev=m0 \
   -numa node,nodeid=1,memdev=m1,initiator=0 \
   -numa cpu,node-id=0,socket-id=0 \
   -numa cpu,node-id=0,socket-id=1 \
   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
   -numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=8 \
   -numa hmat-cache,node-id=1,size=10K,level=1,associativity=direct,policy=write-back,line=8

This extends the previous example by defining some latencies and cache
attributes. Node 0 has an access latency of 5 ns and a bandwidth of 200MB/s,
while node 1 has an access latency of 10 ns and a bandwidth of only 100MB/s.
The level 1 memory cache on both nodes is 10KB, with an 8B cache line,
write-back policy and direct associativity (whatever that means).

For better future extensibility I'd express these as separate elements,
rather than attributes of the <cell/> element. For instance like this:

   <domain>
     <vcpu>2</vcpu>
     <cpu>
       <numa>
         <cell id='0' cpus='0,1' memory='1' unit='GiB'>
           <latencies>
             <latency type='access' value='5'/>
             <bandwidth type='access' unit='MiB' value='200'/>
           </latencies>
           <caches>
             <cache level='1' associativity='direct' policy='write-back'>
               <size unit='KiB' value='10'/>
               <line unit='B' value='8'/>
             </cache>
           </caches>
         </cell>
         <cell id='1' memory='1' unit='GiB' initiator='0'>
           <latencies>
             <latency type='access' value='10'/>
             <bandwidth type='access' unit='MiB' value='100'/>
           </latencies>
           <caches>
             <cache level='1' associativity='direct' policy='write-back'>
               <size unit='KiB' value='10'/>
               <line unit='B' value='8'/>
             </cache>
           </caches>
         </cell>
       </numa>
     </cpu>
   </domain>
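
As a sanity check that this XML carries enough information, here is a
rough Python sketch (again, not libvirt code) that turns the <numa> part of
it back into the hmat-lb and hmat-cache arguments from the QEMU example.
The helper names hmat_args and sized, and the unit-to-suffix table, are made
up for illustration. It also assumes hierarchy=memory everywhere and that a
cell without @initiator is its own initiator.

```python
import xml.etree.ElementTree as ET

NUMA_XML = """
<numa>
  <cell id='0' cpus='0,1' memory='1' unit='GiB'>
    <latencies>
      <latency type='access' value='5'/>
      <bandwidth type='access' unit='MiB' value='200'/>
    </latencies>
    <caches>
      <cache level='1' associativity='direct' policy='write-back'>
        <size unit='KiB' value='10'/>
        <line unit='B' value='8'/>
      </cache>
    </caches>
  </cell>
  <cell id='1' memory='1' unit='GiB' initiator='0'>
    <latencies>
      <latency type='access' value='10'/>
      <bandwidth type='access' unit='MiB' value='100'/>
    </latencies>
    <caches>
      <cache level='1' associativity='direct' policy='write-back'>
        <size unit='KiB' value='10'/>
        <line unit='B' value='8'/>
      </cache>
    </caches>
  </cell>
</numa>
"""

# Assumption: XML units map onto the size suffixes used in the example.
UNIT_SUFFIX = {"B": "", "KiB": "K", "MiB": "M", "GiB": "G"}


def sized(elem):
    """Render an element's value plus unit, e.g. '10' + 'KiB' as '10K'."""
    return elem.get("value") + UNIT_SUFFIX[elem.get("unit")]


def hmat_args(numa_xml):
    """Return hmat-lb options first, then hmat-cache options."""
    numa = ET.fromstring(numa_xml.strip())
    lb_args, cache_args = [], []
    for cell in numa.findall("cell"):
        target = cell.get("id")
        # Assumption: a cell without @initiator is its own initiator.
        initiator = cell.get("initiator", target)
        prefix = (f"hmat-lb,initiator={initiator},target={target},"
                  "hierarchy=memory")
        for entry in cell.findall("latencies/*"):
            kind = entry.get("type")
            if entry.tag == "latency":
                lb_args.append(f"{prefix},data-type={kind}-latency,"
                               f"latency={entry.get('value')}")
            elif entry.tag == "bandwidth":
                lb_args.append(f"{prefix},data-type={kind}-bandwidth,"
                               f"bandwidth={sized(entry)}")
        for cache in cell.findall("caches/cache"):
            cache_args.append(
                f"hmat-cache,node-id={target},"
                f"size={sized(cache.find('size'))},"
                f"level={cache.get('level')},"
                f"associativity={cache.get('associativity')},"
                f"policy={cache.get('policy')},"
                f"line={sized(cache.find('line'))}"
            )
    return lb_args + cache_args


HMAT_ARGS = hmat_args(NUMA_XML)
for opt in HMAT_ARGS:
    print("-numa", opt)
```

The hard-coded hierarchy=memory is exactly the part the XML above cannot
express yet.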

Thing is, the @hierarchy argument accepts: memory (referring to the whole
memory), or first-level|second-level|third-level (referring to side caches
for each domain). I haven't figured out yet how to express the levels in XML.

The @data-type argument accepts access|read|write (this is expressed by the
@type attribute of the <latency/> and <bandwidth/> elements). Latency and
bandwidth can be combined with each type: access-latency, read-latency,
write-latency, access-bandwidth, read-bandwidth, write-bandwidth. And these 6
can then be combined with the 4 values of the aforementioned @hierarchy,
producing 24 combinations (if I read the qemu cmd line specs correctly [2]).
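
For the record, the count can be checked by enumerating the values listed
above. This is only an illustrative Python sketch; the variable names are
made up, and the value lists are just the ones quoted in this mail.

```python
from itertools import product

# Values as listed in this mail.
HIERARCHIES = ["memory", "first-level", "second-level", "third-level"]
ACCESS_KINDS = ["access", "read", "write"]
METRICS = ["latency", "bandwidth"]

# Each combination is a (hierarchy, data-type) pair, e.g.
# ("memory", "access-latency").
COMBINATIONS = [
    (hierarchy, f"{kind}-{metric}")
    for hierarchy, kind, metric in product(HIERARCHIES, ACCESS_KINDS, METRICS)
]

print(len(COMBINATIONS))  # 4 * 3 * 2 = 24
```

Whatever XML we pick has to be able to name any of these pairs per
initiator/target.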

What are your thoughts?

Michal

1: https://bugzilla.redhat.com/show_bug.cgi?id=1786303
2: https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-options.hx;h=d4b73ef60c1d4589148169ac658a34eee5f54522;hb=HEAD#l174