[libvirt] [PATCH] qemu: process: Fix automatic setting of locked memory limit for VFIO

Tue Nov 10 16:40:34 UTC 2015

On Fri, 2015-11-06 at 08:30 +0100, Peter Krempa wrote:
> On Thu, Nov 05, 2015 at 08:41:46 -0700, Alex Williamson wrote:
> > On Thu, 2015-11-05 at 16:27 +0100, Peter Krempa wrote:
> > > On Wed, Nov 04, 2015 at 17:16:53 -0700, Alex Williamson wrote:
> 
> [...]
> 
> > The power devs will need to speak to what their locked memory
> > requirements are and maybe we can come up with a combined algorithm that
> > works good enough for both, or maybe libvirt needs to apply different
> > algorithms based on the machine type.  The x86 limit has been workinga
> 
> Okay, that seems reasonable. I'll extract the code that sets the limit
> into a helper, so that there's a central point where this stuff can be
> tweaked and made machine dependent.
> 
> > well for us, so I see no reason to abandon it simply because we tried to
> > apply it to a different platform and it didn't work.  Thanks,
> 
> As of the x86 lock limit. Currently the code states the following:
> 
>         /* VFIO requires all of the guest's memory to be
>          * locked resident, plus some amount for IO
>          * space. Alex Williamson suggested adding 1GiB for IO
>          * space just to be safe (some finer tuning might be
>          * nice, though).
>          */
> 
> Disregarding the empirical evidence that this worked until now on x86
> the comment's message here is rather unsure about the actual size used
> at this point and hints it might need tweaking. If I'm going to move
> this I'd really like to remove the uncertainity from the message.
> 
> So will I be able to justify:
> - dropping "some" from some ammount for IO space
> - replacing "Alex Williamson suggested adding" by something more
>   scientifical
> 
> I think the mlock size will be justified far more with comment like:
> 
>   /* VFIO requires all of guest's memory and memory for the IO space to
>    * be locked persistent.
>    *
>    * On x86 the IO space size is 1GiB.
>    * [some lines for possibly other platforms]
>    */
> 
> Is it valid to make such change or do we need to keep the ambiguity and
> just reference the empirical fact that it didn't break yet?

The reason we use the 1G "fudge factor" is that VFIO maps MMIO for other
devices through the IOMMU.  This theoretically allows peer-to-peer DMA
between devices in the VM, for instance, an assigned GPU could DMA
directly into the emulated VGA framebuffer.  The reason we picked 1G is
that it's often the size of reserved MMIO space below 4G on x86 systems,
and so far we've never hit it.  We do have 64bit MMIO space, but an
important detail is that MMIO space for assigned devices doesn't count
against the limit because it maps mmap'd MMIO space through the IOMMU,
so there's not actually any RAM locked, it's MMIO space on the device
that the user already owns.  So things like a Tesla card with 4G PCI
BARs is still going to work just fine w/ that 1G fudge factor.  The 1G
would need to come entirely from emulated/virtio devices.

Now the next question is whether this peer-to-peer DMA is useful, or
even used.  I have heard of users assigning multiple AMD cards to a
system and making use of AMD's XDMA crossfire support, which certainly
seems like it's using this capability.  NVIDIA may follow along, but
they seem to limit SLI in their driver.  But as I said, these aren't
counting against the locked memory limit and there's probably very
little potential use of peer-to-peer between assigned and emulated
devices.  At the time I first implemented VFIO, I couldn't see a way to
tell the difference between mappings for emulated and assigned devices,
they're all just opaque mappings.  However, I think we could now be a
bit smarter to figure out the device associated with the MemorySection
and skip emulated device regions.  If we were to do that, we could
expose it as a new option to vfio-pci, so libvirt could probe for it and
maybe remove the fudge factor on the lock limit.  Thanks,

Alex