[libvirt] Matching the type of mediated devices in the migration

Tue Aug 21 18:42:50 UTC 2018

On 08/21/18 07:08, Alex Williamson wrote:
> On Sun, 19 Aug 2018 22:25:19 +0800
> Zhi Wang <zhi.a.wang at intel.com> wrote:
> 
>> Sharing some recent updates on my work on this topic:
>>
>> Thanks to Erik for his guidance and advice. My PoC patches almost work now.
>> I will send the RFC soon.
>>
>> The approach is mostly based on Alex's idea: a match between a device
>> state version and a minimum required version.
>>
>>
>> "Match of versions" in Libvirt
>>
>> Initialization stage:
>>
>> - Libvirt would detect whether the "mdev_type" of a mediated device has a
>> device state version when creating an mdev node in the node device tree.
>> 	- If the "mdev_type" of a mediated device *has* a device state version,
>> then this mediated device supports migration.
>> 	- If not (the compatibility case, mostly for old vendor drivers that
>> don't support migration), this mediated device doesn't support migration.
>>
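
A minimal sketch of the detection step described above, in Python for
brevity. The attribute name "device_state_version" is only an assumption
made for illustration; no such sysfs attribute is defined in this thread,
and the real name and format would be up to the kernel interface.

```python
from pathlib import Path
from typing import Optional

# Hypothetical attribute name inside an mdev type directory.
VERSION_ATTR = "device_state_version"


def read_state_version(type_dir: Path) -> Optional[str]:
    """Return the device state version string, or None if the type has none."""
    try:
        return (type_dir / VERSION_ATTR).read_text().strip()
    except FileNotFoundError:
        # Compatibility case: an old vendor driver exposes no version.
        return None


def supports_migration(type_dir: Path) -> bool:
    """A type is treated as migratable only if it exposes a state version."""
    return read_state_version(type_dir) is not None
```

Here type_dir stands for one mdev type directory under
mdev_supported_types; a missing attribute maps to "migration not
supported" rather than to an error.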
>> Migration stage:
>>
>> - Libvirt would put the mdev information inside cookies and send them
>> between src machine and dst machine. So a new type of cookie would be
>> added here.
>>
>> There are different versions of the migration protocol in libvirt, and each
>> of them sends cookies in a different sequence. The idea here is to let the
>> match happen as early as possible. It looks like the QEMU driver in libvirt
>> only supports the V2/V3 protocols.
>>
>>
>> V2 proto:
>>
>> - The match would happen on the SRC machine after the DST machine transfers
>> the cookies with mdev information back to the SRC machine during the
>> "preparation" stage. The disadvantage is that the DST virtual machine has
>> already been created in the "preparation" stage. If the match fails, the
>> virtual machine on the DST machine has to be killed as well, which would
>> waste some time.
>>
>> V3 proto:
>>
>> - The match would happen on the DST machine after the SRC machine transfers
>> the cookies to the DST machine during the "begin" stage. As the DST
>> machine hasn't entered the "preparation" stage at this time, the
>> virtual machine hasn't been created on the DST machine yet. No
>> extra VM destroy is needed if the match fails. This would be the ideal
>> place for a match.
>>
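
To make the cookie-based match concrete, here is a rough Python sketch.
Everything in it is an assumption for illustration: the <mdev> element, its
attribute names, the integer versions and the comparison rule (same
mdev_type, and the source's state version is not older than the
destination's minimum required version) are not taken from libvirt or from
the PoC patches.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class MdevInfo:
    mdev_type: str
    state_version: int  # version of the device state this side produces
    min_required: int   # oldest state version this side can load


def to_cookie(info: MdevInfo) -> str:
    """Serialize mdev information into a (hypothetical) cookie element."""
    elem = ET.Element("mdev", {
        "type": info.mdev_type,
        "version": str(info.state_version),
        "minRequired": str(info.min_required),
    })
    return ET.tostring(elem, encoding="unicode")


def from_cookie(text: str) -> MdevInfo:
    """Parse the cookie element back into an MdevInfo."""
    elem = ET.fromstring(text)
    return MdevInfo(elem.get("type"),
                    int(elem.get("version")),
                    int(elem.get("minRequired")))


def can_migrate(src: MdevInfo, dst: MdevInfo) -> bool:
    """Assumed rule: same type, and src state is new enough for dst."""
    return (src.mdev_type == dst.mdev_type
            and src.state_version >= dst.min_required)
```

With the V3 protocol, the DST side would run something like
can_migrate(from_cookie(cookie), local_info) on the cookie received in the
"begin" stage, before any virtual machine has been created there.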
>> "Match of versions" at the QEMU level
>>
>> There are several different types of migration in libvirt. In a
>> migration with hypervisor-native transport, the target machine might
>> not even run libvirtd; the migration happens directly between device
>> models. So we need a match at the QEMU level as well. We might still need
>> Kirti's approach as the last-level match.
> 
> The kernel and vendor driver will always have a last opportunity to nak
> a migration; the purpose of making certain information readily
> available to libvirt is only to allow userspace some insight into where
> a migration is likely to be successful.  Even if we expose these things
> to userspace, it's the kernel's responsibility to validate the
> migration data.  

Yes. The vendor driver should be the last gatekeeper that can nak a
migration, and that check should be implemented inside the vendor driver.

> In fact, pushing state information for a device into
> the kernel would seem to be a massive security target.  For instance
> how many vulnerabilities might a malicious user be able to exploit in
> the code that parses the device specific state information?  How do we
> even detect non-malicious user errors, like trying to migrate GVTg
> device state to an NVIDIA vGPU?

For now, we depend only on mdev_type, following the earlier discussion about
vendor ID or device ID.
> 
> The latter at least suggests that the kernel needs to perform the same
> set of validation that we're trying to enable userspace to do.
> Cornelia also mentioned that some mdev devices are more or less shells
> within which a device is configured, such as ccw and likely the crypto
> ap devices.  In those cases the mdev type might not be sufficient meta
> data about what we're dealing with.  This might suggest some sort of
> header within the migration region parsed by common code for basic
> validation.
Yes. The earlier we can validate it, the better, since then we don't need to
wait until the DST machine starts the VM and tries to load the first state.
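
As a rough illustration of what such a header check by common code could
look like, here is a Python sketch. The layout (magic, header version,
state version, mdev type, payload length) is purely an assumption; no
header format has been defined in this discussion.

```python
import struct

# Hypothetical little-endian header: magic, header version, state version,
# NUL-padded mdev type name, payload length in bytes.
HEADER = struct.Struct("<4sHH32sI")
MAGIC = b"MDEV"


def build_image(mdev_type: str, state_version: int, payload: bytes) -> bytes:
    """Prefix an opaque vendor payload with the hypothetical header."""
    return HEADER.pack(MAGIC, 1, state_version,
                       mdev_type.encode("ascii"), len(payload)) + payload


def check_header(image: bytes, expected_type: str, min_required: int) -> int:
    """Do basic validation only; the vendor payload is never parsed here."""
    if len(image) < HEADER.size:
        raise ValueError("image shorter than header")
    magic, hdr_ver, state_ver, raw_type, payload_len = HEADER.unpack_from(image)
    if magic != MAGIC or hdr_ver != 1:
        raise ValueError("unknown header")
    if raw_type.rstrip(b"\0").decode("ascii", "replace") != expected_type:
        raise ValueError("mdev type mismatch")
    if state_ver < min_required:
        raise ValueError("device state version too old")
    if payload_len != len(image) - HEADER.size:
        raise ValueError("payload length mismatch")
    return state_ver
```

A check like this only rejects obviously wrong images early (for example
state from a different mdev type); it does not make the vendor payload
itself trustworthy.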
> 
> Are there any suggestions how we can deal with security issues?
> Allowing userspace to provide a data stream representing the internal
> state of a virtual device model living within the kernel seems
> troublesome.  If we need to trust the data stream, do we need to
> somehow make the operation more privileged than what a vfio user might
> have otherwise?  Does the data stream need to be somehow signed and how
> might we do that?  How can we build in protection against an untrusted
> restore image?  Thanks,
What a good point!

I looked into the kernel module security case, which seems similar to this
one. The security of loading a kernel module relies on root privilege and
module signatures.

As for root privilege, QEMU can run as non-root under libvirtd, so this
wouldn't be an option.

As for signatures, I am wondering whether there are similar cases in other
kernel components, like KVM or other modules that provide ioctls to
userspace. Maybe they don't load any binary from userspace, but they could
still suffer from a DoS flood from userspace. Maybe some ioctls or
interfaces in the kernel should only allow signed/trusted userspace
applications to call them (analogous to "only allow signed kernel modules
to load").

Thanks,
Zhi.

> 
> Alex
>