[Linux-cachefs] Adventures in NFS re-exporting

Daire Byrne daire at dneg.com
Mon Oct 19 16:19:43 UTC 2020


----- On 16 Sep, 2020, at 17:01, Daire Byrne daire at dneg.com wrote:

> Trond/Bruce,
> 
> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy at hammerspace.com wrote:
> 
>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>> > (under memory cache churn) when those filehandles are still being
>>> > used by the knfsd's remote clients resulting in sporadic and random
>>> > stale filehandles. This seems to be mostly for directories from
>>> > what I've seen. Does the NFS client not know that knfsd is still
>>> > using those files/dirs? The workaround is to never drop inode &
>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>> > also helps to ensure that we actually make the most of our
>>> > actimeo=3600,nocto mount options for the full specified time.
>>> 
>>> I thought reexport worked by embedding the original server's
>>> filehandles in the filehandles given out by the reexporting server.
>>> 
>>> So, even if nothing's cached, when the reexporting server gets a
>>> filehandle, it should be able to extract the original filehandle
>>> from it and use that.
>>> 
>>> I wonder why that's not working?
>> 
>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>> callback for it.
> 
> So in terms of the ESTALE counter on the reexport server, we see it increase
> if the end client mounts the reexport using either NFSv3 or NFSv4. But the
> client experience differs: with NFSv3 we quickly get input/output errors,
> whereas with NFSv4 we don't. NFSv4 performance does drop significantly
> though, which makes me think that NFSv4 retries the lookups (which succeed)
> when an ESTALE is reported but NFSv3 does not?
> 
> This is the simplest reproducer I could come up with but it may still be
> specific to our workloads/applications and hard to replicate exactly.
> 
> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done
> 
> reexport-server # sysctl -w vm.vfs_cache_pressure=100
> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches; done
> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done
> 
> Where "application" is some big application with lots of paths to scan with libs
> to memory map and "/vol/software" is an NFS mount on the reexport-server from
> another originating NFS server. I don't know why this application loading
> workload shows this best, but perhaps the access patterns of memory mapped
> binaries and libs is particularly susceptible to estale?
> 
> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
> The ESTALE count increases, and the client running the application reports
> input/output errors (NFSv3) or the loading slows to a crawl (NFSv4).
> 
> As soon as we switch to vfs_cache_pressure=0, the repeated drop_caches on the
> reexport server no longer cull the dentry or nfs_inode_cache, the ESTALE
> counter stops increasing, and the client experiences no issues (NFSv3 & NFSv4).
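
To make the filehandle discussion above concrete, the re-export chain in our setup looks roughly like the sketch below. This is illustrative rather than our production config: the hostnames, paths, mount version and fsid value are all placeholders. Since an NFS mount has no filesystem UUID, knfsd needs an explicit fsid= in /etc/exports before it will export it again.

reexport-server # mount -t nfs -o vers=3,actimeo=3600,nocto originating-server:/vol/software /vol/software
reexport-server # cat /etc/exports
/vol/software  *(ro,no_subtree_check,fsid=1234)
reexport-server # exportfs -ra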

I don't suppose anyone has any more thoughts on this one? This is likely the first problem that anyone trying to re-export NFS will encounter. If they re-export NFSv3, they will just get lots of ESTALE errors as the NFS inodes are dropped from cache (with the default vfs_cache_pressure=100); if they re-export NFSv4, lookup performance will drop significantly because each ESTALE triggers re-lookups.
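
For reference, the ESTALE counter mentioned throughout is the first value on the "fh" line of /proc/net/rpc/nfsd on the re-export server (the stale filehandle count). A timestamped variant of the loop from the reproducer above:

reexport-server # while true; do echo "$(date +%T) $(awk '/^fh/ {print $2}' /proc/net/rpc/nfsd)"; sleep 10; done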

For our particular use case, it is actually desirable to have vfs_cache_pressure=0 so that the NFS client inode and dentry caches stay in memory and help us avoid expensive metadata lookups, but it would still be nice to have the option of a less drastic setting (such as vfs_cache_pressure=1) to help avoid OOM conditions.
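
For anyone wanting to pin the caches the same way, one way to persist the setting (the filename is arbitrary):

reexport-server # cat /etc/sysctl.d/90-nfs-reexport.conf
# 0 = never reclaim dentry/inode caches under memory pressure (OOM risk),
# 1 = reclaim them only very reluctantly, 100 = kernel default.
vm.vfs_cache_pressure = 0
reexport-server # sysctl --system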

Daire



