[Virtio-fs] Status of DAX for virtio-fs/virtiofsd?

Vivek Goyal vgoyal at redhat.com
Thu May 18 19:45:46 UTC 2023


On Wed, May 17, 2023 at 12:26:18PM -0400, Stefan Hajnoczi wrote:
> On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee at linaro.org> wrote:
> Hi Alex,
> There were two unresolved issues:
> 
> 1. How to inject SIGBUS when the guest accesses a page that's beyond
> the end-of-file.
> 2. Implementing the vhost-user messages for mapping ranges of files to
> the vhost-user frontend.
> 
> The harder problem is SIGBUS. An mmap area may be larger than the
> length of the file. Or another process could truncate the file while
> it's mmapped, causing a previously correctly sized mmap to become
> longer than the actual file. When a page beyond the end of file is
> accessed, the kernel raises SIGBUS.
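
A minimal userspace sketch of the truncation case described above, assuming a Linux host and a regular temporary file. It is not taken from the thread or from any virtiofs code. The helper name `child_killed_by_sigbus` and the `CHILD` script are made up for illustration. The access runs in a child process because the signal kills the process that touches the page.

```python
import signal
import subprocess
import sys

# Script run in a child process (hypothetical demo, not virtiofs code).
CHILD = r"""
import mmap, os, tempfile

page = mmap.PAGESIZE
fd, path = tempfile.mkstemp()
os.unlink(path)                  # the file stays alive through fd only
os.ftruncate(fd, 2 * page)       # file is two pages long
m = mmap.mmap(fd, 2 * page)      # shared mapping, same size as the file
assert m[page] == 0              # second page is backed, so the read works
os.ftruncate(fd, page)           # shrink the file underneath the mapping
m[page]                          # second page is now beyond end-of-file
"""


def child_killed_by_sigbus() -> bool:
    """Run CHILD and report whether it was terminated by SIGBUS."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stderr=subprocess.DEVNULL,
    )
    # On POSIX, a negative return code is the number of the fatal signal.
    return result.returncode == -signal.SIGBUS


if __name__ == "__main__":
    print("killed by SIGBUS:", child_killed_by_sigbus())
```

In a plain process the kernel can deliver the signal directly. The difficulty discussed in this thread is that there is no equivalent, agreed-upon way to deliver it to a guest when the same thing happens inside the DAX Window.
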
> 
> When this scenario occurs in the DAX Window, kvm.ko gets some type of
> vmexit (fault) and the code currently enters an infinite loop because
> it expects KVM memory regions to resolve faults. Since there is no
> page backing that part of the vma, the fault handling fails and the
> code loops trying to do this forever.
> 
> There needs to be a way to inject this fault back into the guest.
> However, we did not find a way to do that. We considered Machine
> Check Exceptions (MCEs), x86 interrupts, and paravirtualized
> approaches. None of them looked like a clean and sane way to do this.
> The Linux maintainers for MCEs and kvm.ko were not excited about
> supporting this.
> 
> So in the end, SIGBUS was never solved. It leads to a DoS because the
> host kernel will enter an infinite loop. We decided that until there
> is progress on SIGBUS, we can't go ahead with DAX Windows in
> production.
> 
> The easier problem is adding new vhost-user messages. It does lead to
> a fundamental change in the vhost-user protocol: the presence of the
> DAX Window means there are memory ranges that cannot be accessed via
> shared memory. Imagine Device A has a DAX Window and Device B needs to
> DMA to/from it. That doesn't work because the mmaps happen inside the
> frontend (QEMU), so Device B doesn't have access to the current
> mappings. The fundamental change to vhost-user is that virtqueue
> descriptor mapping code must now deal with the situation where guest
> addresses are absent from the shared memory regions and instead send
> vhost-user protocol messages to read/write to/from bounce buffers
> instead. The rest of the device backend does not require modification.
> This is a slow path, but at least it works whereas currently the I/O
> would fail because the memory is absent. Other solutions to the
> vhost-user DMA problem exist, but this is the one that Dave and I last
> discussed.
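
The fallback described above can be sketched roughly as follows. This is only an illustration under assumptions: `SharedRegion`, `DescriptorMapper` and `FakeFrontend` are hypothetical names, the two frontend callbacks stand in for vhost-user messages whose format was never finalized, and no real vhost-user library API is used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SharedRegion:
    """A guest memory region the backend has mapped (hypothetical layout)."""
    guest_addr: int
    data: bytearray  # stands in for the mmapped shared memory

    def contains(self, addr: int, length: int) -> bool:
        end = self.guest_addr + len(self.data)
        return self.guest_addr <= addr and addr + length <= end


class DescriptorMapper:
    """Resolves guest addresses, falling back to the frontend when absent."""

    def __init__(self, regions: List[SharedRegion],
                 frontend_read: Callable[[int, int], bytes],
                 frontend_write: Callable[[int, bytes], None]) -> None:
        self.regions = regions
        self.frontend_read = frontend_read
        self.frontend_write = frontend_write

    def _find(self, addr: int, length: int) -> Optional[SharedRegion]:
        for region in self.regions:
            if region.contains(addr, length):
                return region
        return None

    def read(self, addr: int, length: int) -> bytes:
        region = self._find(addr, length)
        if region is not None:  # fast path: direct shared-memory access
            off = addr - region.guest_addr
            return bytes(region.data[off:off + length])
        # Slow path: ask the frontend to fill a bounce buffer.
        return self.frontend_read(addr, length)

    def write(self, addr: int, payload: bytes) -> None:
        region = self._find(addr, len(payload))
        if region is not None:  # fast path
            off = addr - region.guest_addr
            region.data[off:off + len(payload)] = payload
            return
        # Slow path: hand the bounce buffer to the frontend to copy in.
        self.frontend_write(addr, payload)


class FakeFrontend:
    """Stand-in for the frontend, which owns the DAX Window mappings."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.window = bytearray(size)
        self.messages = 0  # number of simulated protocol round trips

    def read(self, addr: int, length: int) -> bytes:
        self.messages += 1
        off = addr - self.base
        return bytes(self.window[off:off + length])

    def write(self, addr: int, payload: bytes) -> None:
        self.messages += 1
        off = addr - self.base
        self.window[off:off + len(payload)] = payload
```

Only the mapping layer changes in this sketch. Code that calls `read` and `write` does not need to know whether an address was served from shared memory or through a message, which matches the point that the rest of the device backend does not require modification.
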
> 
> In the end, there is still work to do to make the DAX Window
> supportable. There is experimental code out there that kind of works,
> but we felt it was incomplete.

I think it would be good if someone solved the vhost-user problem
first and got the patches upstream. Now that virtiofsd has been removed
from QEMU, someone will have to add DAX support to the Rust virtiofsd
(and make the corresponding vhost-user changes in QEMU).

Once that is done, someone can look into the MCE issue.

With the vhost-user problem solved, DAX will be usable in non-shared mode.
That is, the host filesystem is simply passed through into the guest and
not even the host can make modifications. That should steer us clear of
the truncation issue.

virtiofs DAX is a good piece of technology and provides a speedup in many
cases. It would be sad to see the patches lost.

People are now posting fixes to the kernel side of DAX and there is no good
way to test them. I will try to make it work with the old DAX branch David
had, so that I can test kernel changes, but I am sure it will stop working
at some point and I don't want the virtiofs kernel DAX code to become unstable.

It would be good if somebody took up this project and made it happen.

Thanks
Vivek

> 
> To your specific questions:
> 
> >  * What VMM/daemon combinations has DAX been tested on?
> 
> Only the experimental virtio-fs Kata Containers kernels and QEMU
> builds that were available a few years ago. I don't think the code has
> been rebased.
> 
> >  * Isn't it time the vhost-user spec is updated?
> 
> I don't know if Dave ever wrote the spec for or implemented the final
> version of the vhost-user protocol messages we discussed.
> 
> >  * Is anyone picking up Dave's patches for the QEMU side of support?
> 
> Not at the moment. It would be nice to support, but someone needs the
> energy/time/focus to deal with the outstanding issues I mentioned.
> 
> If you want to work on it, feel free to include me. I can help dig up
> old discussions and give input.
> 
> Stefan
> 

