[Linux-cachefs] Adventures in NFS re-exporting

Daire Byrne daire at dneg.com
Thu Oct 1 00:09:29 UTC 2020


----- On 30 Sep, 2020, at 20:30, Jeff Layton jlayton at kernel.org wrote:

> On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
>> Hi,
>> 
>> I just thought I'd flesh out the other two issues I have found with re-exporting
>> that are ultimately responsible for the biggest performance bottlenecks. And
>> both of them revolve around the caching of metadata file lookups in the NFS
>> client.
>> 
>> Especially for the case where we are re-exporting a server many milliseconds
>> away (i.e. on-premise -> cloud), we want to be able to control how much the
>> client caches metadata and file data so that its many LAN clients all benefit
>> from the re-export server only having to do the WAN lookups once (within a
>> specified coherency time).
>> 
>> Keeping the file data in the vfs page cache or on disk using fscache/cachefiles
>> is fairly straightforward, but keeping the metadata cached is particularly
>> difficult. And without the cached metadata we introduce long delays before we
>> can serve the already present and locally cached file data to many waiting
>> clients.
>> 
>> ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire at dneg.com wrote:
>> > 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
>> > cut the network packets back to the origin server to zero for repeated lookups.
>> > However, if a client of the re-export server walks paths and memory maps those
>> > files (i.e. loading an application), the re-export server starts issuing
>> > unexpected calls back to the origin server again, ignoring/invalidating the
>> > re-export server's NFS client cache. We worked around this by patching an
>> > inode/iversion validity check in inode.c so that the NFS client cache on the
>> > re-export server is used. I'm not sure about the correctness of this patch but
>> > it works for our corner case.
>> 
>> If we use actimeo=3600,nocto (say) to mount a remote software volume on the
>> re-export server, we can successfully cache the loading of applications and
>> walking of paths directly on the re-export server such that after a couple of
>> runs, there are practically zero packets back to the originating NFS server
>> (great!). But, if we then do the same thing on a client which is mounting that
>> re-export server, the re-export server now starts issuing lots of calls back to
>> the originating server and invalidating its client cache (bad!).
>> 
>> I'm not exactly sure why, but the iversion of the inode gets changed locally
>> (due to atime modification?), most likely via inode_inc_iversion_raw(). Each
>> time it gets incremented, the next attribute validation detects a change and
>> causes the inode to be reloaded from the originating server.
>> 
> 
> I'd expect the change attribute to track what's in the actual inode on the
> "home" server. The NFS client is supposed to (mostly) keep the raw
> change attribute in its i_version field.
> 
> The only place we call inode_inc_iversion_raw is in
> nfs_inode_add_request, which I don't think you'd be hitting unless you
> were writing to the file while holding a write delegation.
> 
> What sort of server is hosting the actual data in your setup?

We mostly use RHEL7.6 NFS servers with XFS-backed filesystems, plus a couple of (older) Netapps. The re-export server is running the latest mainline kernel(s).

As far as I can make out, both of these originating (home) server types exhibit a similar (but not identical) effect on the Linux NFS client cache when it is being re-exported and accessed by other clients. I can replicate it using only read-only mounts at every hop, so I don't think writes are involved.

Our RHEL7 NFS servers actually mount XFS with noatime too, so any atime updates that might be causing this client invalidation (which is what I initially suspected) are ultimately wasted effort.


>> This patch helps to avoid this when applied to the re-export server but there
>> may be other places where this happens too. I accept that this patch is
>> probably not the right/general way to do this, but it helps to highlight the
>> issue when re-exporting and it works well for our use case:
>> 
>> --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c     2020-01-27 00:23:03.000000000 +0000
>> +++ new/fs/nfs/inode.c  2020-02-13 16:32:09.013055074 +0000
>> @@ -1869,7 +1869,7 @@
>>  
>>         /* More cache consistency checks */
>>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
>> -               if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> +               if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
>>                         /* Could it be a race with writeback? */
>>                         if (!(have_writers || have_delegation)) {
>>                                 invalid |= NFS_INO_INVALID_DATA
>> 
>> With this patch, the re-export server's NFS client attribute cache is maintained
>> and used by all the clients that then mount it. When many hundreds of clients
>> are all doing similar things at the same time, the re-export server's NFS
>> client cache is invaluable in accelerating the lookups (getattrs).
>> 
>> Perhaps a more correct approach would be to detect when it is knfsd that is
>> accessing the client mount and change the cache consistency checks accordingly?
> 
> Yeah, I don't think you can do this for the reasons Trond outlined.

Yeah, I kind of felt like it wasn't quite right, but I didn't know enough about the intricacies to say exactly why. So thanks to everyone for clearing that up for me.

We just followed the code and found that the re-export server spent a lot of time in this code block, even though we assumed we should be able to serve the same read-only metadata requests to multiple clients out of the re-export server's NFS client cache. I guess the patch was more for us to see if we could (incorrectly) engineer our desired behaviour with a dirty hack.
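
To make the behavioural difference of that one-line change concrete, here is a tiny userspace sketch (not kernel code; the function names and numbers below are invented for illustration). The upstream check invalidates on any mismatch between the cached raw i_version and the change attribute returned by the origin server, whereas the patched check only invalidates when the server's value has moved past what we have cached, so a purely local raw bump no longer throws the cache away:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model: cached_iversion is the raw change attribute the NFS client
 * last stored in i_version; server_change is the change_attr just returned
 * by the origin server in a GETATTR reply. */

/* Upstream behaviour: any difference at all marks the cache invalid, so a
 * raw i_version bump made locally on the re-export server (which the origin
 * server never saw) forces revalidation. */
static bool needs_invalidate_upstream(uint64_t cached_iversion,
                                      uint64_t server_change)
{
        return cached_iversion != server_change;
}

/* Patched behaviour from the diff above: only treat the cache as stale when
 * the origin server's change attribute has moved past the cached value. */
static bool needs_invalidate_patched(uint64_t cached_iversion,
                                     uint64_t server_change)
{
        return cached_iversion < server_change;
}

int main(void)
{
        uint64_t server_change = 100;    /* what the origin server reports */
        uint64_t cached_iversion = 101;  /* bumped locally on the re-export server */

        printf("upstream check invalidates: %d\n",
               needs_invalidate_upstream(cached_iversion, server_change));
        printf("patched check invalidates:  %d\n",
               needs_invalidate_patched(cached_iversion, server_change));
        return 0;
}

This also illustrates why it isn't safe as a general fix: the NFS change attribute is, in general, opaque and only guaranteed to differ when the file changes, not to increase monotonically, so an ordered comparison like this can't be relied on for every server.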

While the patch definitely helps to better utilise the re-export server's NFS client cache when exporting via knfsd, we still see many repeat getattrs per minute for the same files on the re-export server when hundreds of clients are all reading the same files. So this is probably not the only place where reading an NFS client mount through a knfsd export invalidates the re-export server's NFS client cache.

Ultimately, I guess we are willing to take some risks with cache coherency (similar to actimeo=large,nocto) if it means we can do the expensive metadata lookups to a remote (WAN) server once and re-export that result to hundreds of (LAN) clients. For read-only or "almost" read-only workloads like ours, where we repeatedly read the same files from many clients, that can lead to big savings over the WAN.
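
As a rough illustration of the saving we're chasing (all the numbers below are invented, not measurements), compare how many GETATTRs cross the WAN for one hot file when the re-export server's attribute cache is honoured versus when it is being invalidated underneath us:

#include <stdio.h>

/* Back-of-the-envelope sketch with assumed figures: WAN GETATTRs per hour
 * for a single hot file behind the re-export server. */
int main(void)
{
        const double clients = 500;          /* LAN clients behind the re-export server */
        const double stats_per_client = 60;  /* attribute lookups per client per hour */
        const double actimeo = 3600;         /* attribute cache timeout in seconds */

        /* Cache defeated: every client lookup that misses locally goes back
         * to the origin server. */
        double wan_getattrs_uncached = clients * stats_per_client;

        /* Cache honoured: the re-export server revalidates the file at most
         * once per actimeo window, however many clients ask. */
        double wan_getattrs_cached = 3600.0 / actimeo;

        printf("WAN GETATTRs/hour, cache defeated: %.0f\n", wan_getattrs_uncached);
        printf("WAN GETATTRs/hour, cache honoured: %.0f\n", wan_getattrs_cached);
        return 0;
}

In the first case the WAN traffic scales with the number of LAN clients; in the second it is bounded by 3600/actimeo per file per hour, regardless of how many clients are asking.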

But I accept that it is a coherency and locking nightmare when you want to do writes to shared files.

Daire



