[libvirt] Matching the type of mediated devices in the migration

Thu Aug 23 04:23:54 UTC 2018

On Thu, 23 Aug 2018 04:02:43 +0000
"Tian, Kevin" <kevin.tian at intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson at redhat.com]
> > Sent: Thursday, August 23, 2018 11:47 AM
> > 
> > On Wed, 22 Aug 2018 02:30:12 +0000
> > "Tian, Kevin" <kevin.tian at intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson at redhat.com]
> > > > Sent: Wednesday, August 22, 2018 10:08 AM
> > > >
> > > > On Wed, 22 Aug 2018 01:27:05 +0000
> > > > "Tian, Kevin" <kevin.tian at intel.com> wrote:
> > > >  
> > > > > > From: Wang, Zhi A
> > > > > > Sent: Wednesday, August 22, 2018 2:43 AM  
> > > > > > >
> > > > > > > Are there any suggestions how we can deal with security issues?
> > > > > > > Allowing userspace to provide a data stream representing the  
> > internal  
> > > > > > > state of a virtual device model living within the kernel seems
> > > > > > > troublesome.  If we need to trust the data stream, do we need to
> > > > > > > somehow make the operation more privileged than what a vfio  
> > user  
> > > > > > might  
> > > > > > > have otherwise?  Does the data stream need to be somehow  
> > signed  
> > > > and  
> > > > > > how  
> > > > > > > might we do that?  How can we build in protection against an  
> > > > untrusted  
> > > > > > > restore image?  Thanks,  
> > > > >
> > > > > imo it is not necessary. restoring mdev state should be handled as if
> > > > > guest is programming the mdev.  
> > > >
> > > > To me this suggests that a state save/restore is just an algorithm
> > > > executed by userspace using the existing vfio device accesses.  This is
> > > > not at all what we've been discussing for migration.  I believe the  
> > >
> > > not algorithm by userspace. It's kernel driver to apply the audit when
> > > receiving opaque state data.  
> > 
> > And a kernel driver receiving and processing opaque state date from a
> > user doesn't raise security concerns for you?  
> 
> opaque is from userspace p.o.v. kernel driver understands the actual
> format and thus can audit when restoring the state.

Which only means that we risk having untold security issues within each
separate mdev vendor driver.

> > > > interface we've been hashing out exposes opaque device state through  
> > a  
> > > > vfio region.  We therefore must assume that that opaque data contains
> > > > not only device state, but also emulation state, similar to what we see
> > > > for any QEMU device.  Not only is there internal emulation state, but
> > > > we have no guarantee that the device state goes through the same
> > > > auditing as it does through the vfio interface.  Since this device and
> > > > emulation state live inside the kernel and not just within the user's
> > > > own process, a malicious user can do far more than shoot themselves.  
> > It  
> > > > would be one thing devices were IOMMU isolated, but they're not,
> > > > they're isolated through vendor and device specific mechanism, and for
> > > > all we know the parameters of that isolation are included in the
> > > > restore state.  I don't see how we can say this is not an issue.  
> > >
> > > I didn't quite get this. My understanding is that isolation configuration
> > > is completed when a mdev is created on DEST machine given a type
> > > definition. The state image contains just runtime data reflecting what
> > > guest driver does on SRC machine. Restoring such state shouldn't
> > > change the isolation policy.  
> > 
> > Let's invent an example where the mdev vendor driver has a set of
> > pinned pages which are the current working set for the device at the
> > time of migration.  Information about that pinning might be included in
> > the opaque migration state.  If a malicious user discovers this, they
> > can potentially also craft a modified state which can exploit the host
> > kernel isolation.  
> 
> pinned pages may be not a good example. the pin knowledge could be
> reconstructed when restoring the state (e.g. in GVT-g pinning is triggered
> by shadowing GPU page table which has to be recreated on DEST). 

There are always ways for vendor drivers to do this correctly, but
again, one vendor doing it correctly doesn't prevent this from being a
gaping security issue with unending vulnerabilities for other vendors.

> > > > > Then all the audits/security checks
> > > > > enforced in normal emulation path should still apply. vendor driver
> > > > > may choose to audit every state restore operation one-by-one, and
> > > > > do it altoghter at a synchronization point (e.g. when the mdev is re-
> > > > > scheduled, similar to what we did before VMENTRY).  
> > > >
> > > > Giving the vendor driver the choice of whether to be secure or not is
> > > > exactly what I'm trying to propose we spend some time thinking about.
> > > > For instance, what if instead of allowing the user to load device state
> > > > through a region, the kernel could side load it using sometime similar
> > > > to the firmware loading path.  The user could be provided with a file
> > > > name token that they push through the vfio interface to trigger the
> > > > state loading from a location with proper file level ACLs such that the
> > > > image can be considered trusted.  Unfortunately the collateral is that
> > > > libvirt would need to become the secure delivery entity, somehow
> > > > stripping this section of the migration stream into a file and
> > > > providing a token for the user to ask the kernel to load it.  What are
> > > > some other options?  Could save/restore be done simply as an
> > > > algorithmic script matched to stack of data, as I read into your first
> > > > statement above?  I have doubts that we can achieve the internal state
> > > > we need, or maybe even the performance we need using such a process.
> > > > Thanks,
> > > >  
> > >
> > > for GVT-g I think we invoke common functions as used in emulation path
> > > to recover vGPU state, e.g. gtt rw handler, etc. Zhi can correct me if
> > > I'm wrong.  
> > 
> > One example of migration state being restored in a secure manner does
> > not prove that such an interface is universally secure or a good idea.
> >   
> > > Can you elaborate the difference between device state and emulation
> > > state which you mentioned earlier? We may need look at some concrete
> > > example to understand the actual problem here.  
> > 
> > See my example above, or imagine that the migration state information
> > includes any sort of index field where a user might be able to modify
> > the index and trick the driver into inserting malicious code elsewhere
> > in the host kernel stack.  It's a security nightmare waiting to
> > happen.  Thanks,
> >   
> 
> one thing which I'm not sure is why this becomes a new concern but
> not applied to existing VM state maintained by kernel, e.g. various
> vCPU states...

vCPU state follows a processor specification, it can be audited and
there aren't that many CPU vendors.  We don't have each hardware OEM
plugging in a new vCPU save/restore blob.  For mdev devices, we have
both closed and open source implementations and new proposals for mdev
drivers seemingly on a regular basis.  "It's worked so far" is also
rarely a valid rebuttal to security concerns ;)  Thanks,

Alex