[vfio-users] qemu stuck when hot-add memory to a virtual machine with a device passthrough

Alex Williamson alex.williamson at redhat.com
Mon Apr 23 16:37:27 UTC 2018


On Sat, 21 Apr 2018 09:02:14 +0000
"Wuzongyong (Euler Dept)" <cordius.wu at huawei.com> wrote:

> > > > > Hi,
> > > > >
> > > > > The qemu process gets stuck when hot-adding a large amount of memory
> > > > > to a virtual machine with a device passthrough.
> > > > > We found it is too slow to pin and map pages in vfio_dma_do_map.
> > > > > Is there any method to improve this process?
> > > >
> > > > At what size do you start to see problems?  The time to map a
> > > > section of memory should be directly proportional to the size.  As
> > > > the size is increased, it will take longer, but I don't know why
> > > > you'd reach a point of not making forward progress.  Is it actually
> > > > stuck or is it just taking longer than you want?  Using hugepages
> > > > can certainly help; we still need to pin each PAGE_SIZE page within
> > > > the hugepage, but we'll have larger contiguous regions and therefore
> > > > call iommu_map() less frequently.  Please share more data.  Thanks,
> > > >
> > > > Alex  
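
(For reference, vfio_dma_do_map is reached through the VFIO_IOMMU_MAP_DMA
ioctl on the VFIO container file descriptor.  A minimal userspace sketch of
that call follows; the map_guest_ram helper and its container_fd/vaddr/iova/
size arguments are illustrative placeholders, not QEMU code, and error
handling is omitted.)

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map 'size' bytes of already-mmap'ed guest RAM at 'vaddr' so the
     * passthrough device can DMA to guest-physical address 'iova'.  On the
     * kernel side (vfio_dma_do_map) every PAGE_SIZE page in this range is
     * pinned before the IOMMU is programmed, which is where the hot-add
     * stall comes from. */
    static int map_guest_ram(int container_fd, void *vaddr,
                             uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uint64_t)(uintptr_t)vaddr,
            .iova  = iova,
            .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }
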
> > > It just takes a longer time rather than being actually stuck.
> > > We found that the problem shows up when we hot-add 16GB of memory, and
> > > it takes tens of minutes when we hot-add 1TB of memory.
> > 
> > Is the stall adding 1TB roughly 64 times the stall adding 16GB, or is
> > there some inflection in the size-vs-time curve?  There is a cost to
> > pinning and mapping through the IOMMU; perhaps we can improve that, but I
> > don't see how we can eliminate it, or how it wouldn't be at least linear
> > in the size of memory added without moving to a page request
> > model, which hardly any hardware currently supports.  A workaround might
> > be to incrementally add memory in smaller chunks, which generates less
> > noticeable stalls.  Thanks,
> > 
> > Alex  
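
(The workaround above operates at the memory hot-plug level, for example
adding several smaller DIMMs instead of one huge one.  Purely to illustrate
the same chunking principle at the VFIO layer, a sketch might look like the
following; it is not what QEMU actually does, the map_in_chunks helper is
hypothetical, and the total pin/clear work is unchanged, only each
individual stall gets shorter.)

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Illustration only: issue several smaller VFIO_IOMMU_MAP_DMA calls
     * instead of one covering the whole hot-added region.  'chunk' should
     * be a multiple of the backing page size. */
    static int map_in_chunks(int container_fd, void *vaddr,
                             uint64_t iova, uint64_t size, uint64_t chunk)
    {
        for (uint64_t off = 0; off < size; off += chunk) {
            struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uint64_t)(uintptr_t)vaddr + off,
                .iova  = iova + off,
                .size  = (size - off < chunk) ? size - off : chunk,
            };

            if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0)
                return -1;
        }
        return 0;
    }
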
> I collected part of a perf report, shown below, recorded while hot-adding 24GB of memory:
> +   63.41%     0.00%  qemu-kvm         qemu-kvm-2.8.1-25.127       [.] 0xffffffffffc7534a
> +   63.41%     0.00%  qemu-kvm         [kernel.vmlinux]            [k] do_vfs_ioctl
> +   63.41%     0.00%  qemu-kvm         [kernel.vmlinux]            [k] sys_ioctl
> +   63.41%     0.00%  qemu-kvm         libc-2.17.so                [.] __GI___ioctl
> +   63.41%     0.00%  qemu-kvm         qemu-kvm-2.8.1-25.127       [.] 0xffffffffffc71c59
> +   63.10%     0.00%  qemu-kvm         [vfio]                      [k] vfio_fops_unl_ioctl
> +   63.10%     0.00%  qemu-kvm         qemu-kvm-2.8.1-25.127       [.] 0xffffffffffcbbb6a
> +   63.10%     0.02%  qemu-kvm         [vfio_iommu_type1]          [k] vfio_iommu_type1_ioctl
> +   60.67%     0.31%  qemu-kvm         [vfio_iommu_type1]          [k] vfio_pin_pages_remote
> +   60.06%     0.46%  qemu-kvm         [vfio_iommu_type1]          [k] vaddr_get_pfn
> +   59.61%     0.95%  qemu-kvm         [kernel.vmlinux]            [k] get_user_pages_fast
> +   54.28%     0.02%  qemu-kvm         [kernel.vmlinux]            [k] get_user_pages_unlocked
> +   54.24%     0.04%  qemu-kvm         [kernel.vmlinux]            [k] __get_user_pages
> +   54.13%     0.01%  qemu-kvm         [kernel.vmlinux]            [k] handle_mm_fault
> +   54.08%     0.03%  qemu-kvm         [kernel.vmlinux]            [k] do_huge_pmd_anonymous_page
> +   52.09%    52.09%  qemu-kvm         [kernel.vmlinux]            [k] clear_page
> +    9.42%     0.12%  swapper          [kernel.vmlinux]            [k] cpu_startup_entry
> +    9.20%     0.00%  swapper          [kernel.vmlinux]            [k] start_secondary
> +    8.85%     0.02%  swapper          [kernel.vmlinux]            [k] arch_cpu_idle
> +    8.79%     0.07%  swapper          [kernel.vmlinux]            [k] cpuidle_idle_call
> +    6.16%     0.29%  swapper          [kernel.vmlinux]            [k] apic_timer_interrupt
> +    5.73%     0.07%  swapper          [kernel.vmlinux]            [k] smp_apic_timer_interrupt
> +    4.34%     0.99%  qemu-kvm         [kernel.vmlinux]            [k] gup_pud_range
> +    3.56%     0.16%  swapper          [kernel.vmlinux]            [k] local_apic_timer_interrupt
> +    3.32%     0.41%  swapper          [kernel.vmlinux]            [k] hrtimer_interrupt
> +    3.25%     3.21%  qemu-kvm         [kernel.vmlinux]            [k] gup_huge_pmd
> +    2.31%     0.01%  qemu-kvm         [kernel.vmlinux]            [k] iommu_map
> +    2.30%     0.00%  qemu-kvm         [kernel.vmlinux]            [k] intel_iommu_map
> 
> It seems that the bottleneck is pinning pages through get_user_pages rather than doing the IOMMU mapping.

Sure, the IOMMU mapping is more lightweight than the page pinning, but
both are required.  We're pinning the pages for the purpose of IOMMU
mapping them.  It also seems the bulk of the time is spent clearing
pages, which is necessary so as not to leak data from the kernel or
other users to this process.  Perhaps there are ways to take further
advantage of hugepages in the pinning process, but as far as I'm aware
we still need to pin at PAGE_SIZE granularity rather than per hugepage.
Thanks,

Alex
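
For scale, assuming a 4KiB PAGE_SIZE and 2MiB hugepages: hugepage backing
cuts the number of iommu_map() calls by a factor of 512, but the number of
4KiB pages that must be pinned, and the total amount of memory that must be
cleared on first touch, stays the same.  A small back-of-the-envelope
program for the sizes mentioned in this thread:

    #include <stdio.h>

    /* Print pin and map counts for the hot-added sizes discussed above. */
    int main(void)
    {
        unsigned long long sizes[] = { 16ULL << 30, 1ULL << 40 };  /* 16GB, 1TB */

        for (int i = 0; i < 2; i++) {
            unsigned long long sz = sizes[i];

            printf("%4lluGB: %10llu 4KiB pages to pin, %7llu 2MiB iommu_map() calls\n",
                   sz >> 30, sz >> 12, sz >> 21);
        }
        return 0;
    }
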
