[Linux-cachefs] Adventures in NFS re-exporting

Sat Sep 19 11:08:26 UTC 2020

----- On 17 Sep, 2020, at 22:57, bfields bfields at fieldses.org wrote:

> On Thu, Sep 17, 2020 at 08:23:03PM +0000, Frank van der Linden wrote:
>> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>> > 
>> > On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>> > >
>> > > ----- On 15 Sep, 2020, at 18:21, bfields bfields at fieldses.org wrote:
>> > >
>> > > >> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>> > > >> second) quickly eat up the CPU on the re-export server and perf top
>> > > >> shows we are mostly in native_queued_spin_lock_slowpath.
>> > > >
>> > > > Any statistics on who's calling that function?
>> > >
>> > > With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2, we see
>> > > the CPU of the nfsd threads increase rapidly and by the time we have 100
>> > > clients, we have maxed out the 32 cores of the server with most of that in
>> > > native_queued_spin_lock_slowpath.
>> > 
>> > That sounds a lot like what Frank Van der Linden reported:
>> > 
>> >         https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>> > 
>> > It looks like a bug in the filehandle caching code.
>> > 
>> > --b.
>> 
>> Yes, that does look like the same one.
>> 
>> I still think that not caching v4 files at all may be the best way to go
>> here, since the intent of the filecache code was to speed up v2/v3 I/O,
>> where you end up doing a lot of opens/closes, but it doesn't make as
>> much sense for v4.
>> 
>> However, short of that, I tested a local patch a few months back, that
>> I never posted here, so I'll do so now. It just makes v4 opens into
>> 'long term' opens, which do not get put on the LRU, since that doesn't
>> make sense (they are in the hash table, so they are still cached).
> 
> That makes sense to me.  But I'm also not opposed to turning it off for
> v4 at this point.
> 
> --b.

Thank you both. That is exactly the issue with our (broken) production workload; I completely missed that thread while researching the archives.

I tried both of Frank's patches and the CPU usage returned to normal levels: native_queued_spin_lock_slowpath went from 88% to 2%, and the server performed much the same as it does for an NFSv3 export.
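
For anyone following along, the idea behind the "long term" open patch quoted above can be illustrated with a toy model. This is only a sketch of the behaviour as described in this thread, not the nfsd filecache code: the class and method names are made up, and the assumption that a long-term entry is dropped explicitly (for example when the client closes the file) is mine.

```python
from collections import OrderedDict


class ToyFileCache:
    """Toy model: a hash table of cached opens plus an LRU of short-term ones."""

    def __init__(self, lru_limit):
        self.table = {}           # every cached open is reachable from here
        self.lru = OrderedDict()  # only short-term (v2/v3-style) opens are listed
        self.lru_limit = lru_limit

    def acquire(self, key, long_term=False):
        """Look up or create an entry; only short-term entries touch the LRU."""
        entry = self.table.get(key)
        if entry is None:
            entry = {"key": key, "long_term": long_term}
            self.table[key] = entry
        if not entry["long_term"]:
            self.lru[key] = entry
            self.lru.move_to_end(key)  # mark as most recently used
            self._trim()
        return entry

    def release(self, key):
        """Assumed behaviour: long-term opens are dropped explicitly, not by LRU."""
        self.lru.pop(key, None)
        return self.table.pop(key, None)

    def _trim(self):
        """Evict the least recently used short-term entries over the limit."""
        while len(self.lru) > self.lru_limit:
            old_key, _ = self.lru.popitem(last=False)
            del self.table[old_key]
```

In this model a burst of long-term (v4-style) opens never grows or churns the LRU list, yet each one can still be found through the hash table, which matches the "they are still cached" remark in the quoted description.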

So, ultimately, this had nothing to do with NFS re-exporting; it's just that I was using a newer kernel with the filecache to do it. All our other NFSv4 originating servers run older kernels, which is why our (broken) workload never caused us any problems before. Thanks for clearing that up for me.

With regard to dropping the filecache feature completely for NFSv4, I do wonder whether it still saves a few precious network round-trips (which is especially important for my re-export scenario). We want to be able to choose the level of caching on the re-export server and minimise expensive lookups to originating servers that may be many milliseconds away (coherency be damned).

Seeing as there was some interest in issue #1 (drop caches = ESTALE re-exports) and this issue #4 (NFSv4 filecache vs ridiculous open/close counts), I'll post some more detail and reproducers next week for #2 (invalidating the re-export server's NFS client cache) and #3 (cached client metadata lookups not returned quickly enough when the client is busy with reads).

That way anyone trying to follow in my (re-exporting) footsteps is fully aware of all the potential performance pitfalls I have discovered so far.

Many thanks,

Daire