[PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes

Dario Faggioli dfaggioli at suse.com
Thu May 26 14:00:28 UTC 2022


On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
> Thoughts?
> 
Oh, and there are even a couple of other (potential) use cases for
having (even more!) fine-grained control of core-scheduling.

So, right now, giving a virtual topology to a VM pretty much only
makes sense if the VM has its vcpus pinned. Well, actually, there's
something that we can do even if that is not the case, especially if we
define at least *some* constraints on where the vcpus can run, even if
we don't have strict and static 1-to-1 pinning... But we surely
shouldn't define an SMT topology if we don't have that (i.e., if we
don't have strict and static 1-to-1 pinning). And yet, the vcpus will
run on cores and threads!

Now, if we implement per-vcpu core-scheduling (which means being able
to put in trusted groups not necessarily whole VMs, but single vcpus
[of the same VM, that is]; see the sketch right after this list), then
we can:
- put vcpu0 and vcpu1 of VM1 in a group
- put vcpu2 and vcpu3 of VM1 in a(nother!) group
- define, in the virtual topology of VM1, vcpu0 and vcpu1 as
  SMT-threads of the same core
- define, in the virtual topology of VM1, vcpu2 and vcpu3 as
  SMT-threads of the same core
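
Just to make the idea concrete, here is a minimal, purely illustrative
sketch (not code from this series) of how a management process could
build such a per-vcpu trusted group with the kernel's PR_SCHED_CORE
prctl() interface (Linux >= 5.14). The vcpu thread IDs are hypothetical
placeholders; a real implementation would discover them (e.g., from
QEMU) and needs ptrace-level permission on the target threads:

/*
 * Hypothetical helper, only a sketch: put two vcpu threads of the same
 * VM into one core-scheduling trusted group via PR_SCHED_CORE.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <linux/prctl.h>   /* PR_SCHED_CORE_* (Linux >= 5.14 headers) */

static int group_vcpu_pair(pid_t vcpu_a_tid, pid_t vcpu_b_tid)
{
    /* Give the first vcpu thread a fresh core-scheduling cookie. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, vcpu_a_tid,
              PR_SCHED_CORE_SCOPE_THREAD, 0) < 0) {
        perror("PR_SCHED_CORE_CREATE");
        return -1;
    }

    /* Pull that cookie into the calling (manager) thread... */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, vcpu_a_tid,
              PR_SCHED_CORE_SCOPE_THREAD, 0) < 0) {
        perror("PR_SCHED_CORE_SHARE_FROM");
        return -1;
    }

    /* ...and push it onto the second vcpu thread, so both share it.
     * Doing this from a short-lived helper process means the helper's
     * own copy of the cookie goes away when it exits. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, vcpu_b_tid,
              PR_SCHED_CORE_SCOPE_THREAD, 0) < 0) {
        perror("PR_SCHED_CORE_SHARE_TO");
        return -1;
    }

    return 0;
}

Grouping vcpu2 and vcpu3 would then just be a second call like this,
ending up with a different cookie. Keeping the vcpu<->TID mapping
consistent with the SMT layout declared in the domain XML topology
element is, of course, the part that libvirt would have to own.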

From the perspective of the accuracy of the mapping between virtual and
physical topology (and hence, most likely, of performance), it's still
a mixed bag. I.e., on an idle or lightly loaded system, vcpu0 and vcpu1
can still run on two different cores. So, if the guest kernel and apps
assume that the two vcpus are SMT-siblings, and optimize for that,
well, that assumption might still be false/wrong (just as it would be
without any core-scheduling, without any pinning, etc.). At least, when
they run on
different cores, they run there alone, which is nice (but achievable
with per-VM core-scheduling already).

On a heavily loaded system, instead, vcpu0 and vcpu1 should (when they
both want to run) have much higher chances of actually ending up
running on the same core. [Of course, not necessarily always on the
same specific core --like when we do pinning-- but always together on
one core.] So, in-guest workloads operating under the assumption that
those two vcpus are SMT-siblings will hopefully benefit from that.

And for the lightly loaded case, well, I believe that combining
per-vcpu core-scheduling + SMT virtual topology with *some* kind of
vcpu affinity (and I mean something more flexible and less wasteful
than 1-to-1 pinning, of course!) and/or with something like numad, will
actually bring some performance and determinism benefits, even in such
a scenario... But, of course, we need data for that, and I don't have
any yet. :-)

Anyway, let's now consider the case where the user/customer wants to be
able to use core-scheduling _inside_ the guest, e.g., for protecting
and/or shielding some sensitive workload that he/she is running inside
the VM itself from all the other tasks. But for using core-scheduling
inside the guest we need the guest to have cores and threads. And for
the protection/shielding to be effective, we need to be sure that, say,
if two guest tasks are in the same trusted group and are running on two
vcpus that are virtual SMT-siblings, these two vcpus either (1) run on
two actual physical SMT-sibling pCPUs on the host (i.e., they run on
the same core), or (2) run on different host cores, each one on a
thread, with no other vCPU from any other VM (and no host task, for
that matter) running on the other thread. And this is exactly what
per-vcpu core-scheduling + SMT virtual topology gives us. :-D
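
FTR, and again only as an illustration (not part of this series, and
assuming a guest kernel with core scheduling, i.e., Linux >= 5.14, plus
the virtual SMT topology discussed above), this is all the sensitive
workload would have to do inside the guest to get its own private
trusted group:

/* Guest-side sketch: the sensitive process isolates itself (all of its
 * threads) in its own core-scheduling trusted group, so that nothing
 * else is ever co-scheduled with it on a virtual SMT sibling. */
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

int main(void)
{
    /* pid == 0 means "the calling task"; scope THREAD_GROUP covers the
     * whole process. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0) {
        perror("PR_SCHED_CORE_CREATE");
        return 1;
    }

    /* ... run the workload to be protected/shielded here ... */
    return 0;
}

Whether that in-guest cookie actually buys anything then depends
exactly on the host-side guarantees (1) and (2) above.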

Of course, as in the previous message, I think that it's perfectly fine
for something like this to not be implemented immediately, and to come
later. At least as long as, at this stage, we don't do anything that
would prevent such extensions, or make them hard to implement, in the
future.

Which I guess is, after all, the main point of these very long emails
(sorry!) that I am writing. I.e., _if_ we agree that it might be
interesting to have per-VM, or even per-vcpu, core-scheduling in the
future, let's just try to make sure that what we put together now
(especially at the interface level) is easy to extend in that
direction. :-)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

