[PATCH RFC 10/10] qemu: Place helper processes into the same trusted group

Michal Prívozník mprivozn at redhat.com
Tue May 24 15:35:03 UTC 2022


On 5/24/22 12:33, Daniel P. Berrangé wrote:
> On Tue, May 24, 2022 at 11:50:50AM +0200, Michal Prívozník wrote:
>> On 5/23/22 18:30, Daniel P. Berrangé wrote:
>>> On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
>>>> Since the level of trust that QEMU has is the same level of trust
>>>> that helper processes have there's no harm in placing all of them
>>>> into the same group.
>>>
>>> This assumption feels like it might be a bit of a stretch. I
>>> recall discussing this with Paolo to some extent a long time
>>> back, but let me recap my understanding.
>>>
>>> IIUC, the attack scenario is that a guest vCPU thread is scheduled
>>> on an SMT sibling with another thread that is NOT running guest OS
>>> code. "another thread" in this context refers to many things:
>>>
>>>   - Random host OS processes
>>>   - QEMU vCPU threads from a different guest
>>>   - QEMU emulator threads from any guest
>>>   - QEMU helper process threads from any guest
>>>
>>> Consider, for example, if the QEMU emulator thread contains a password
>>> used for logging into a remote RBD/Ceph server. That is a secret
>>> credential that the guest OS should not have permission to access.
>>>
>>> Consider alternatively that the QEMU emulator is making a TLS connection
>>> to some service, and there are keys negotiated for the TLS session. While
>>> some of the data transmitted over the session is known to the guest OS,
>>> we shouldn't assume it all is.
>>>
>>> Now in the case of QEMU emulator threads I think you can make a somewhat
>>> decent case that we don't have to worry about it. Most of the keys/passwds
>>> are used once at cold boot, so there's no attack window for vCPUs at that
>>> point. There is a small window of risk when hotplugging. If someone is
>>> really concerned about this though, they shouldn't have let QEMU have
>>> these credentials in the first place, as it's already vulnerable to a
>>> guest escape. eg use kernel RBD instead of letting QEMU directly log
>>> in to RBD.
>>>
>>> IOW, on balance of probabilities it is reasonable to let QEMU emulator
>>> threads be in the same core scheduling domain as vCPU threads.
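>>>
>>> For reference, the mechanism behind a core scheduling domain is the
>>> PR_SCHED_CORE prctl() added in Linux 5.14; a minimal sketch of
>>> grouping QEMU with another task (the function name is illustrative,
>>> not what the patches use):
>>>
>>>   #include <sys/prctl.h>  /* PR_SCHED_CORE_* (Linux >= 5.14 headers) */
>>>   #include <unistd.h>
>>>
>>>   /* Give the calling process a fresh core scheduling cookie and
>>>    * copy it onto @task's thread group; afterwards only tasks
>>>    * sharing the cookie may run concurrently on SMT siblings of
>>>    * the same core. */
>>>   static int
>>>   demoSchedCoreGroup(pid_t task)
>>>   {
>>>       if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
>>>                 0, PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0)
>>>           return -1;
>>>
>>>       return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO,
>>>                    task, PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0);
>>>   }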
>>>
>>> In the case of external QEMU helper processes though, I think it is
>>> a far less clearcut decision.  There are a number of reasons why helper
>>> processes are used, but at least one significant motivating factor is
>>> security isolation between QEMU & the helper - they can only communicate
>>> and share information through certain controlled mechanisms.
>>>
>>> With this in mind I think it is risky to assume that it is safe to
>>> run QEMU and helper processes in the same core scheduling group. At
>>> the same time there are likely cases where it is also just fine to
>>> do so.
>>>
>>> If we separate helper processes from QEMU vCPUs this is not as wasteful
>>> as it sounds. Since the helper processes are running trusted code, there
>>> is no need for helper processes from different guests to be isolated.
>>> They can all just live in the default core scheduling domain.
>>>
>>> I feel like I'm talking myself into suggesting the core scheduling
>>> host knob in qemu.conf needs to be more than just a single boolean.
>>> Either have two knobs - one to turn it on/off and one to control
>>> whether helpers are split or combined - or have one knob and make
>>> it an enumeration.
>>
>> Seems reasonable. And the default should be QEMU's emulator + vCPU
>> threads in one sched group, and all helper processes in another, right?
> 
> Not quite. I'm suggesting that helper processes can remain in the
> host's default core scheduling group, since the helpers are all
> executing trusted machine code.
> 
>>> One possible complication comes if we consider a guest that is
>>> pinned, but not on a fine-grained per-vCPU basis.
>>>
>>> eg if the guest is set to allow floating over a sub-set of host CPUs
>>> we need to make sure that it is still possible to actually execute
>>> the guest. ie if the entire guest is pinned to 1 host CPU but our
>>> config implies the use of 2 distinct core scheduling domains, we have
>>> an unsolvable constraint.
>>
>> Do we? Since we're placing emulator + vCPUs into one group and helper
>> processes into another, these would never run at the same time, but that
>> would be the case anyway - if the emulator write()-s into a helper's
>> socket it would block because the helper isn't running. This
>> "bottleneck" is a result of pinning everything onto a single CPU and
>> exists regardless of scheduling groups.
>>
>> The only case where scheduling groups would make the bottleneck worse is
>> if the emulator and vCPUs were in different groups, but we don't intend to
>> allow that.
> 
> Do we actually pin the helper processes at all ?

Yes, we do. They are placed into the same CGroup as the emulator
thread; see qemuSetupCgroupForExtDevices().
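
For illustration, on cgroup v2 that placement boils down to writing the
helper's PID into the emulator cgroup's cgroup.procs file. A rough
sketch with an illustrative name (the real code goes through the
virCgroup helpers rather than touching cgroupfs directly):

  #include <stdio.h>
  #include <unistd.h>

  /* Illustrative only -- move @helper into the cgroup rooted at
   * @emulatorCgroup by appending its PID to cgroup.procs. */
  static int
  demoAddHelperToCgroup(const char *emulatorCgroup, pid_t helper)
  {
      char path[512];
      FILE *fp;
      int ret = -1;

      snprintf(path, sizeof(path), "%s/cgroup.procs", emulatorCgroup);
      if (!(fp = fopen(path, "w")))
          return -1;
      if (fprintf(fp, "%lld\n", (long long) helper) > 0)
          ret = 0;
      if (fclose(fp) != 0)
          ret = -1;
      return ret;
  }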

> 
> I was thinking of a scenario where we implicitly pin helper processes to
> the same CPUs as the emulator threads and/or QEMU process-global pinning
> mask. eg
> 
> If we only had
> 
>   <vcpu placement='static' cpuset="2-3" current="1">2</vcpu>
> 
> Traditionally the emulator threads, I/O threads, and vCPU threads will
> all float across host CPUs 2 & 3. I was assuming we also placed
> helper processes in these same 2 host CPUs. Not sure if that's right
> or not. Assuming we do, then...
> 
> Let's say CPUs 2 & 3 are SMT siblings.
> 
> We have helper processes in the default core scheduling
> domain and QEMU in a dedicated core scheduling domain. We
> lose 100% of concurrency between the vCPUs and helper
> processes.
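>
> Purely for illustration (not part of the patches), whether two host
> CPUs are SMT siblings can be read from sysfs:
>
>   #include <stdio.h>
>
>   /* Print the SMT siblings of @cpu as reported by the kernel,
>    * e.g. "2-3" for the example above. */
>   static void
>   demoPrintSiblings(int cpu)
>   {
>       char path[128], buf[64];
>       FILE *fp;
>
>       snprintf(path, sizeof(path),
>                "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
>                cpu);
>       if ((fp = fopen(path, "r")) != NULL) {
>           if (fgets(buf, sizeof(buf), fp) != NULL)
>               printf("cpu%d siblings: %s", cpu, buf);
>           fclose(fp);
>       }
>   }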

So in this case users might want to have the helpers and the emulator
in the same group. Therefore, in qemu.conf we should allow something
like:

  sched_core = "none" // off, no SCHED_CORE
               "emulator" // default, place only emulator & vCPU threads
                          // into the group
               "helpers" // place emulator & vCPU & helpers into the
                         // group

I agree that "helpers" is a terrible name; maybe "emulator+helpers"? Or
something completely different? Maybe:

  sched_core = [] // off
               ["emulator"] // enumlator & vCPU threads
               ["emulator","helpers"] // emulator + helpers

We can refine "helpers" in the future (if needed) to, say, "virtiofsd",
"dbus", or "swtpm", allowing users to fine-tune which helper processes
are part of the group.
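
To make that concrete, the knob could map onto an internal mode roughly
like this (a hypothetical sketch; none of these names exist in libvirt
yet):

  #include <string.h>

  typedef enum {
      QEMU_SCHED_CORE_NONE,      /* no SCHED_CORE at all */
      QEMU_SCHED_CORE_EMULATOR,  /* emulator & vCPU threads only */
      QEMU_SCHED_CORE_HELPERS,   /* emulator & vCPUs & helpers */
  } qemuSchedCoreMode;

  /* Map the qemu.conf string onto the internal mode; returns -1 on
   * an unknown value. */
  static int
  qemuSchedCoreModeFromString(const char *str, qemuSchedCoreMode *mode)
  {
      if (strcmp(str, "none") == 0)
          *mode = QEMU_SCHED_CORE_NONE;
      else if (strcmp(str, "emulator") == 0)
          *mode = QEMU_SCHED_CORE_EMULATOR;
      else if (strcmp(str, "helpers") == 0)
          *mode = QEMU_SCHED_CORE_HELPERS;
      else
          return -1;
      return 0;
  }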

Michal


