[Virtio-fs] [PATCH 0/9] virtio-fs fixes
Liu Bo
bo.liu at linux.alibaba.com
Tue Apr 30 01:38:14 UTC 2019
On Mon, Apr 29, 2019 at 09:18:22AM -0400, Vivek Goyal wrote:
> On Fri, Apr 26, 2019 at 05:58:39PM -0700, Liu Bo wrote:
> > On Thu, Apr 25, 2019 at 11:10:08AM -0700, Liu Bo wrote:
> > > On Thu, Apr 25, 2019 at 10:59:50AM -0400, Vivek Goyal wrote:
> > > > On Wed, Apr 24, 2019 at 04:12:59PM -0700, Liu Bo wrote:
> > > > > Hi Vivek,
> > > > >
> > > > > On Wed, Apr 24, 2019 at 02:41:30PM -0400, Vivek Goyal wrote:
> > > > > > Hi Liubo,
> > > > > >
> > > > > > I have made some fixes and took some of yours and pushed latest snapshot
> > > > > > of my internal tree here.
> > > > > >
> > > > > > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > > > > >
> > > > > > Patches have been rebased to 5.1-rc5 kernel. I am thinking of updating
> > > > > > this branch frequently with latest code.
> > > > >
> > > > > With this branch, generic/476 still hangs, and yes, it's related to
> > > > > "async page fault related events", just as you mentioned on #irc.
> > > > >
> > > > > I confirmed this with kvm and kvmmmu tracepoints.
> > > > >
> > > > > The tracepoints[1] showed that
> > > > > [1]: https://paste.ubuntu.com/p/N9ngrthKCf/
> > > > >
> > > > > ---
> > > > > handle_ept_violation
> > > > > kvm_mmu_page_fault(error_code=182)
> > > > > tdp_page_fault
> > > > > fast_page_fault # spte not present
> > > > > try_async_pf # queue an async_pf work and return RETRY
> > > > >
> > > > > vcpu_run
> > > > > kvm_check_async_pf_completion
> > > > > kvm_arch_async_page_ready
> > > > > tdp_page_fault(vcpu, work->gva, 0, true);
> > > > > fast_page_fault(error_code == 0);
> > > > > try_async_pf # found hpa
> > > > > __direct_map()
> > > > > set_spte(error_code == 0) # won't set the write bit
> > > > >
> > > > > handle_ept_violation
> > > > > kvm_mmu_page_fault(error_code=1aa)
> > > > > tdp_page_fault
> > > > > fast_page_fault # spte present but no write bit
> > > > > try_async_pf # no hpa; queue an async_pf work again and return RETRY
> > > >
> > > > So why is there no "hpa"?
> > > >
> > >
> > > TBH, I have no idea. __gfn_to_pfn_memslot() did return a pfn
> > > successfully after the async pf, but during the following EPT_VIOLATION,
> > > __gfn_to_pfn_memslot() returned KVM_PFN_ERR_FAULT and told its
> > > callers to do another async pf, over and over again.
> > >
> >
> > So I think I've figured it out; here is the summary:
> >
> > virtio-fs's DAX write implementation sends a fallocate request to extend the
> > inode size and allocate space on the underlying fs, so that the underlying
> > mmap can fault in pages on demand.
> >
> > There are two problems here:
>
> >
> > 1) virtio-fs write(2) only checks whether the write range is within the
> > inode size. However, this doesn't hold all the time: besides write(2) and
> > fallocate(2), the inode size can also be extended by truncate(2), which
> > doesn't allocate space on the underlying fs. So when the guest VM writes to
> > such an address, it causes an EPT_VIOLATION, which faults in the necessary
> > page from the underlying %vma; if it's a write fault, page_mkwrite() is
> > called, and if the required space is not yet allocated, page_mkwrite() then
> > tries to allocate it, which may fail with ENOSPC if the underlying fs is
> > already full.
> >
> > 2) async pf doesn't check whether gup (get_user_pages) succeeded.
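The truncate(2)-vs-fallocate(2) distinction in 1) can be reproduced on any ordinary host filesystem: extending a file with ftruncate() only bumps i_size, while fallocate() reserves the blocks up front. A minimal userspace sketch (file path and size are illustrative, not from the virtio-fs code):

```c
/* Sketch: ftruncate() extends i_size without allocating fs blocks,
 * while fallocate() allocates them eagerly, so only the latter can
 * report ENOSPC at extend time rather than at fault time. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define EXTEND_SIZE (1024 * 1024)  /* 1 MiB, illustrative */

/* Extend a fresh temp file by EXTEND_SIZE bytes, with or without
 * block allocation, and return the number of 512-byte blocks the
 * filesystem actually reserved for it. */
long blocks_after(int use_fallocate)
{
    char tmpl[] = "/tmp/extendXXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0) { perror("mkstemp"); exit(1); }
    unlink(tmpl);

    int ret = use_fallocate
        ? fallocate(fd, 0, 0, EXTEND_SIZE)   /* allocates blocks */
        : ftruncate(fd, EXTEND_SIZE);        /* only bumps i_size */
    if (ret < 0) { perror("extend"); exit(1); }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }
    close(fd);
    return (long)st.st_blocks;
}
```

On e.g. ext4 or xfs, the ftruncate() case leaves the file fully sparse, which is exactly the state in which a later write fault must allocate blocks inside page_mkwrite().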
>
> Ok. So the filesystem on the host is full, but truncate still succeeds (as
> it did not require fs block allocation). But later, when a write from a
> guest process happens, it results in an async pf on the host, and that
> fails because an fs block can't be allocated.
>
> But this still sounds like an issue with async pf, where an error needs
> to be captured and somehow communicated back to the guest OS, in this
> case -ENOSPC.
I have a question about how the guest responds to this kind of error: the
guest VM is doing dax_copy_from_iter() (in the write case), and eventually
it's memory-copying an iovec, right?
I'm not sure how the guest can exit gracefully from there. Can copy_in()
return -EFAULT somehow?
My workaround is to ensure that enough fs space is allocated to the dax
mapping range when doing SETUPMAPPING; in other words, we can do a plain
fallocate on the range before sending messages to the vhost-user backend.
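That workaround might look roughly like the following on the daemon side; preallocate_for_mapping() is a hypothetical helper for illustration, not the actual virtiofsd code:

```c
/* Hypothetical sketch of the workaround: before a DAX mapping for
 * [offset, offset + len) is installed on behalf of the guest, make
 * sure the backing file actually has blocks allocated there, so a
 * later page_mkwrite() on the host cannot fail with -ENOSPC. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 on success, -errno on failure (e.g. -ENOSPC when the
 * backing filesystem is full), so SETUPMAPPING can be rejected up
 * front instead of faulting endlessly later. */
int preallocate_for_mapping(int fd, off_t offset, off_t len)
{
    /* mode 0: allocate blocks and extend i_size if needed */
    if (fallocate(fd, 0, offset, len) < 0)
        return -errno;
    return 0;
}
```

This trades some allocation eagerness for a well-defined error path: the guest gets a failed SETUPMAPPING reply rather than an unresolvable page fault.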
thanks,
-liubo