[Linux-cachefs] NFS conversion to new netfs and fscache APIs

Daire Byrne daire.byrne at gmail.com
Fri Dec 4 20:56:21 UTC 2020


I didn't knowingly extend the files.... But I had been using some old files
written months ago elsewhere.

So I quickly tried with some new files... To avoid confusion and caching, I
wrote them directly on the server to the local XFS filesystem that we are
then exporting to client1 & client2. The first thing I noticed is that there
is a difference in behaviour depending on whether we write zeros or random
data:

server # dd if=/dev/zero of=/serverxfs/test.file.zero bs=1M count=512
server # dd if=/dev/urandom of=/serverxfs/test.file.random bs=1M count=512
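
For reference, the two clients are mounted roughly as before, i.e. client1
without fsc and client2 with it (the export path below is illustrative and
vers=3 behaves the same):

client1 # mount -o vers=4.2 server:/serverxfs /mnt/server
client2 # mount -o vers=4.2,fsc server:/serverxfs /mnt/server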

client1 # md5sum /mnt/server/test.file.zero
aa559b4e3523a6c931f08f4df52d58f2
client1 # md5sum /mnt/server/test.file.random
b8ea132924f105d5acc27787d57a9aa2

client2 # for x in {1..10}; do (cat /mnt/server/test.file.zero > /dev/null &); done; wait
client2 # md5sum /mnt/server/test.file.zero
aa559b4e3523a6c931f08f4df52d58f2

client2 # for x in {1..10}; do (cat /mnt/server/test.file.random > /dev/null &); done; wait
client2 # md5sum /mnt/server/test.file.random
e0334bd762800ab7447bfeab033e030d

So the file full of zeros is okay but the random one is getting corrupted?
I'm scratching my head a bit, wondering whether the XFS backing filesystem
on the server and/or how its extents are laid out could in any way affect
this, but the NFS client shouldn't care about that, right?
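
Since the all-zeros file comes back clean, one thing I can try is checking
whether the bad ranges in the random file are simply reading back as zeros.
A rough check: hexdump -C collapses repeated 16-byte rows into a '*', and a
urandom file should never produce one, so any hit marks the offset where a
run of identical (presumably zero) bytes starts:

client2 # hexdump -C /mnt/server/test.file.random | grep -B1 '^\*' | head

I can also drop the page cache and re-read, to confirm the bad data really
comes back from the cachefiles backing store rather than from memory:

client2 # echo 3 > /proc/sys/vm/drop_caches
client2 # md5sum /mnt/server/test.file.random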

With regards to the NFS server kernel, it's 3.10.0-693.1.1.el7.x86_64, but
if you mean your patched kernel, I just checked out your fscache-iter-nfs
branch, made a git archive and then built an RPM out of it. I should mention
that there are a couple of nfs re-export patches (due for v5.11) that I have
also applied on top.
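
Roughly what the build amounted to, in case the packaging matters (the tree
URL and patch paths here are placeholders rather than the exact ones I used):

# git clone <your-tree> linux-fscache && cd linux-fscache
# git checkout fscache-iter-nfs
# git am /path/to/nfs-reexport-v5.11/*.patch
# git archive --format=tar.gz --prefix=linux-fscache-iter/ -o ../linux-fscache-iter.tar.gz HEAD

The resulting tarball was then fed into a standard kernel.spec / rpmbuild -bb
run.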

If you still can't reproduce, then I'll rip them out and test again.

Daire


On Fri, Dec 4, 2020 at 7:36 PM David Wysochanski <dwysocha at redhat.com>
wrote:

> On Fri, Dec 4, 2020 at 2:09 PM David Wysochanski <dwysocha at redhat.com>
> wrote:
> >
> > On Fri, Dec 4, 2020 at 1:03 PM Daire Byrne <daire.byrne at gmail.com>
> wrote:
> > >
> > > David,
> > >
> > > Okay, I spent a little more time on this today and I think we can
> forget about the re-export thing for a moment.
> > >
> > > I looked at what was happening and the issue seemed to be that once I
> had multiple clients of the re-export server (which runs the iter fscache
> kernel and has fsc-enabled mounts) all reading the same files at the same
> time (for the first time), we often ended up with a missing sequential chunk
> of data from the cached file.
> > >
> > > The size and apparent size seemed to be the same as the original file
> on the server but md5sum and hexdump against the client mounted file showed
> otherwise.
> > >
> > > So then I tried to replicate this scenario in the simplest way: just a
> single (fscache-iter) client with an fsc-enabled mountpoint, using multiple
> processes to read the same uncached file for the first time (no NFS
> re-exporting).
> > >
> > > * client1 mounts the NFS server without fsc
> > > * client2 mounts the NFS server with fsc (with fscache-iter).
> > >
> > > client1 # md5sum /mnt/server/file.1
> > > 9ca99335b6f75a300dc22e45a776440c
> > > client2 # cat /mnt/server/file.1
> > > client2 # md5sum /mnt/server/file.1
> > > 9ca99335b6f75a300dc22e45a776440c
> > >
> > > All good. The file was cached to disk and looks good. Now let's read an
> uncached file using multiple processes simultaneously:
> > >
> > > client1 # md5sum /mnt/server/file.2
> > > 9ca99335b6f75a300dc22e45a776440c
> > > client2 # for x in {1..10}; do (cat /mnt/server/file.2 > /dev/null &);
> done; wait
> > > client2 # md5sum /mnt/server/file.2
> > > 26dd67fbf206f734df30fdec72d71429
> > >
> > > The file is now different/corrupt. So in my re-export case it's just
> that we have multiple knfsd processes reading the same file into the cache
> simultaneously for the first time. It then stays corrupt in the cache and
> gets served out to multiple NFS clients.
> > >
> >
> > Hmmm, yeah that for sure shouldn't happen!
> >
> >
> > > In this case the backing filesystem was ext4 and the nfs client mount
> options were fsc,vers=4.2 (vers=3 is the same). The NFS server is running
> RHEL7.4.
> > >
> >
> > How big is '/mnt/server/file.2' and what is the NFS server kernel?
> > Also can you give me the mount options from /proc/mounts on 'client2'?
> > I'm not able to reproduce this yet but I'll keep trying.
> >
> >
>
> Ok I think I have a reproducer now, but it requires extending the file
> size.  Did you re-write the file with a new size by any chance?
> It doesn't reproduce for me on first go, but after extending the size
> of the file it does.
>
> # mount -o vers=4.2,fsc 127.0.0.1:/export/dir1 /mnt/dir1
> # dd if=/dev/urandom of=/export/dir1/file.bin bs=10M count=1
> 1+0 records in
> 1+0 records out
> 10485760 bytes (10 MB, 10 MiB) copied, 0.216783 s, 48.4 MB/s
> # for x in {1..10}; do (cat /mnt/dir1/file.bin > /dev/null &); done; wait
> # md5sum /export/dir1/file.bin /mnt/dir1/file.bin
> 94d2d0fe70f155211b5559bf7de27b34  /export/dir1/file.bin
> 94d2d0fe70f155211b5559bf7de27b34  /mnt/dir1/file.bin
> # dd if=/dev/urandom of=/export/dir1/file.bin bs=20M count=1
> 1+0 records in
> 1+0 records out
> 20971520 bytes (21 MB, 20 MiB) copied, 0.453869 s, 46.2 MB/s
> # for x in {1..10}; do (cat /mnt/dir1/file.bin > /dev/null &); done; wait
> # md5sum /export/dir1/file.bin /mnt/dir1/file.bin
> 32b9beb19b97655e9026c09bbe064dc8  /export/dir1/file.bin
> f05fe078fe65b4e5c54afcd73c97686d  /mnt/dir1/file.bin
> # uname -r
> 5.10.0-rc4-94e9633d98a5+
>
>
>
> >
> >
> > > Daire
> > >
> > > On Thu, Dec 3, 2020 at 4:27 PM David Wysochanski <dwysocha at redhat.com>
> wrote:
> > >>
> > >> On Wed, Dec 2, 2020 at 12:01 PM Daire Byrne <daire.byrne at gmail.com>
> wrote:
> > >> >
> > >> > David,
> > >> >
> > >> > First off, thanks for the work on this - we look forward to this
> landing.
> > >> >
> > >>
> > >> Yeah no problem - thank you for your interest and testing it!
> > >>
> > >> > I did some very quick tests of just the bandwidth using server
> class networking (40Gbit) and storage (NVMe).
> > >> >
> > >> > Comparing the old fscache with the new one, we saw a minimal
> degradation in reading back from the backing disk. But I am putting this
> more down to the more directio style of access in the new version.
> > >> >
> > >> > This can be seen when the cache is being written as we no longer
> use the writeback cache. I'm assuming something similar happens on reads so
> that we don't use readahead?
> > >> >
> > >>
> > >> Without getting into it too much and just guessing, I'd guess it's
> > >> either the use of directIO or the 1GB limitation in cachefiles, but
> > >> I'm not sure.  We of course need to drill down into it because it
> > >> could be a lot of things.
> > >>
> > >> > Anyway, the quick summary of performance using 10 threads of reads
> follows. I should mention that the NVMe has a physical limit of ~2,500MB/s
> writes & 5,000MB/s reads:
> > >> >
> > >> > iter fscache:
> > >> > uncached first reads ~2,500MB/s (writing to nvme ext4/xfs)
> > >> > cached subsequent reads ~4,200MB/s (reading from nvme ext4)
> > >> > cached subsequent reads ~3,500MB/s (reading from nvme xfs)
> > >> >
> > >> > old fscache:
> > >> > uncached first reads ~2,500MB/s (writing to nvme ext4/xfs)
> > >> > cached subsequent reads ~5,000MB/s (reading from nvme ext4)
> > >> > xfs crashes a lot ...
> > >> >
> > >> > I have not done a thorough analysis of CPU usage or perf top
> differences yet.
> > >> >
> > >> > Then I went on to test our rather unique NFS re-export workload
> where we take this fscache backed server and re-export the fsc mounts to
> many clients. At this point something odd appeared to be happening. The
> clients were loading software from the fscache backed mounts but were often
> segfaulting at various points. This suggested that they were getting
> corrupted data or the memory mapping (binaries, libraries) was failing in
> some way. Perhaps some odd interaction between fscache and knfsd?
> > >> >
> > >> > I did a quick test of re-export without the fsc caching enabled on
> the server mounts (with the same 5.10-rc kernel) and I didn't get any
> errors. That's as far as I got before I got drawn away by other things. I
> hope to dig into it a little more next week. But I just thought I'd give
> some quick feedback on one potential difference I'm seeing compared to the
> previous version.
> > >> >
> > >>
> > >> Hmmm, interesting.  So just to be clear, you ran my patches without
> > >> 'fsc' on the mount and it was fine, but with 'fsc' on the mount there
> > >> were data corruptions in this re-export use case?  I've not done any
> > >> tests with a re-export like that but off the top of my head I'm not
> > >> sure why it would be a problem.  What NFS version(s) are you using?
> > >>
> > >>
> > >> > I also totally accept that this is a very niche workload (and hard
> to reproduce)... I should have more details on it next week.
> > >> >
> > >>
> > >> Ok - thanks again Daire!
> > >>
> > >>
> > >>
> > >> > Daire
> > >> >
> > >> > On Sat, Nov 21, 2020 at 1:50 PM David Wysochanski <
> dwysocha at redhat.com> wrote:
> > >> >>
> > >> >> I just posted patches to linux-nfs but neglected to CC this list.
> For
> > >> >> anyone interested in patches which convert NFS to use the new netfs
> and
> > >> >> fscache APIs, please see the following series on linux-nfs:
> > >> >> [PATCH v1 0/13] Convert NFS to new netfs and fscache APIs
> > >> >> https://marc.info/?l=linux-nfs&m=160596540022461&w=2
> > >> >>
> > >> >> Thanks.
> > >> >>
> > >> >> --
> > >> >> Linux-cachefs mailing list
> > >> >> Linux-cachefs at redhat.com
> > >> >> https://www.redhat.com/mailman/listinfo/linux-cachefs
> > >> >>
> > >>
>
>


