[Linux-cachefs] Adventures in NFS re-exporting

Daire Byrne daire at dneg.com
Wed Nov 25 17:14:51 UTC 2020


----- On 24 Nov, 2020, at 21:15, bfields bfields at fieldses.org wrote:
> On Tue, Nov 24, 2020 at 08:35:06PM +0000, Daire Byrne wrote:
>> Sometimes I have seen clusters of 16 GETATTRs for the same file on the
>> wire with nothing else inbetween. So if the re-export server is the
>> only "client" writing these files to the originating server, why do we
>> need to do so many repeat GETATTR calls when using nconnect>1? And why
>> are the COMMIT calls required when the writes are coming via nfsd but
>> not from userspace on the re-export server? Is that due to some sort
>> of memory pressure or locking?
>> 
>> I picked the NFSv3 originating server case because my head starts to
>> hurt tracking the equivalent packets, stateids and compound calls with
>> NFSv4. But I think it's mostly the same for NFSv4. The writes through
>> the re-export server lead to lots of COMMITs and (double) GETATTRs but
>> using nconnect>1 at least doesn't seem to make it any worse like it
>> does for NFSv3.
>> 
>> But maybe you actually want all the extra COMMITs to help better
>> guarantee your writes when putting a re-export server in the way?
>> Perhaps all of this is by design...
> 
> Maybe that's close-to-open combined with the server's tendency to
> open/close on every IO operation?  (Though the file cache should have
> helped with that, I thought; as would using version >=4.0 on the final
> client.)
> 
> Might be interesting to know whether the nocto mount option makes a
> difference.  (So, add "nocto" to the mount options for the NFS mount
> that you're re-exporting on the re-export server.)

The nocto option didn't really seem to help, but an NFSv4.2 re-export of an NFSv3 server did. I also realised I had done some earlier tests with nconnect on the re-export server's client and consequently mixed things up a bit in my head, so I did some more tests and tried to make the results clear and simple. In all cases I'm just writing a big file with "dd" and capturing the traffic between the originating server and the re-export server.
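
For reference, the test itself is nothing fancy; roughly something like this (hostnames, paths and interface names are just placeholder examples):

  # write a big file through whichever mount is being tested
  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=10240

  # meanwhile on the re-export server, capture the back-end traffic
  # between it and the originating server
  tcpdump -i eth0 -s 0 -w /tmp/reexport.pcap host originserver and port 2049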

First off, writing directly from userspace on the re-export server to the originating server's mount shows the ideal behaviour for all combinations:

 originating server <- (vers=X,actimeo=1800,nconnect=X) <- reexport server writing = WRITE,WRITE .... repeating (good!)

Then re-exporting a NFSv4.2 server:

 originating server <- (vers=4.2) <- reexport server - (vers=3) <- client writing = GETATTR,COMMIT,WRITE .... repeating
 originating server <- (vers=4.2) <- reexport server - (vers=4.2) <- client writing = GETATTR,WRITE .... repeating

And re-exporting a NFSv3 server:

 originating server <- (vers=3) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
 originating server <- (vers=3) <- reexport server - (vers=3) <- client writing = WRITE,COMMIT .... repeating
  
So of all the combinations, an NFSv4.2 re-export of an NFSv3 server is the only one that matches the "ideal" case where we WRITE continuously without all the extra chatter.
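
For anyone wanting to reproduce that "good" combination, the chain is set up roughly like this (hostnames, paths and export options are only illustrative; the explicit fsid= is needed when re-exporting an NFS client mount):

  # re-export server: mount the NFSv3 originating server
  mount -t nfs -o vers=3 originserver:/export /srv/reexport

  # re-export server: /etc/exports
  /srv/reexport  *(rw,no_subtree_check,fsid=1000)

  # final client: mount the re-export server with NFSv4.2
  mount -t nfs -o vers=4.2 reexportserver:/srv/reexport /mnt/reexport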

And for completeness, taking that "good" case and making it bad with nconnect:

 originating server <- (vers=3,nconnect=16) <- reexport server - (vers=4.2) <- client writing = WRITE,WRITE .... repeating (good!)
 originating server <- (vers=3) <- reexport server <- (vers=4.2,nconnect=16) <- client writing = WRITE,COMMIT,GETATTR .... randomly repeating

So using nconnect on the final client's mount of the re-export server causes lots more metadata ops. There are good reasons for using nconnect to increase throughput, but it could be that the gain is offset by the extra metadata round trips.
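
To spell out which mount the nconnect ends up on in those two cases (hostnames and paths again only examples):

  # nconnect on the re-export server's mount of the origin: still a clean WRITE stream
  mount -t nfs -o vers=3,nconnect=16 originserver:/export /srv/reexport

  # nconnect on the final client's mount of the re-export server: extra COMMIT/GETATTR
  mount -t nfs -o vers=4.2,nconnect=16 reexportserver:/srv/reexport /mnt/reexport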

Similarly, we have mostly been using an NFSv4.2 re-export of an NFSv4.2 server over the WAN because of the reduced metadata ops for reading, but it looks like we incur extra metadata ops for writing.

Side note: it's hard to decode nconnect-enabled packet captures because Wireshark doesn't seem to like the extra port streams.
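
A possible (untested) workaround might be to split the capture up per TCP stream and dissect each nconnect connection on its own, e.g.:

  # pull out one of the nconnect streams and look at it in isolation
  tshark -r /tmp/reexport.pcap -Y "tcp.stream == 3" -w /tmp/stream3.pcap
  tshark -r /tmp/stream3.pcap -Y nfs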

> By the way I made a start at a list of issues at
> 
>	http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export
> 
> but I was a little vague on which of your issues remained and didn't
> take much time over it.

Cool. I'm glad there are some notes for others to reference - this thread is now too long for any human to read. The only things I'd consider adding are:

* re-export of an NFSv4.0 filesystem can give input/output errors when the cache is dropped
* a weird interaction with NFS client readahead such that all reads are limited to the default 128k unless you manually increase it to match rsize (a rough sketch of that follows below).
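
For the readahead one, manually increasing it means poking the NFS mount's bdi; roughly something like this (the mount point and device ID are just examples, and 1024k assumes rsize=1M):

  # find the bdi (MAJ:MIN) backing the NFS client mount on the re-export server
  findmnt -n -o MAJ:MIN /srv/reexport        # e.g. prints 0:52

  # bump readahead from the 128k default up to match the rsize
  echo 1024 > /sys/class/bdi/0:52/read_ahead_kb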

The only other things I can offer are tips & tricks for doing this kind of thing over the WAN (vfs_cache_pressure, actimeo, nocto) and using fscache.
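
To give a flavour of those (values, hostnames and paths are just examples, and the fsc option also needs cachefilesd running):

  # WAN-friendly client mount of the origin on the re-export server
  mount -t nfs -o vers=4.2,nocto,actimeo=1800,fsc originserver:/export /srv/reexport

  # hang on to the re-export server's dentry/inode caches for longer
  sysctl -w vm.vfs_cache_pressure=1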

Daire



