[Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.

Nir Soffer nsoffer at redhat.com
Fri Aug 7 13:43:12 UTC 2020


On Fri, Aug 7, 2020, 16:16 Richard W.M. Jones <rjones at redhat.com> wrote:

> On Fri, Aug 07, 2020 at 07:53:13AM -0500, Eric Blake wrote:
> > >$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
> >
> > Hmm - the -W actually says that qemu-img is performing semi-random
> > access (there is no guarantee that the 16 coroutines are serviced in
> > linear order of the file), even though we really are making only one
> > pass through the file in bulk.  I don't know if fadvise=normal would
> > be any better; dropping -W but keeping -m 16 might also be an
> > interesting number to check (where qemu-img tries harder to do
> > in-order access, but still takes advantage of parallel threads).
> >
> > >               total        used        free      shared  buff/cache   available
> > >Mem:          32083        1188       27928           1        2966       30440
> > >Swap:         16135          16       16119
> > >     (100.00/100%)
> > >
> > >real 0m13.107s
> > >user 0m2.051s
> > >sys  0m37.556s
> > >               total        used        free      shared  buff/cache   available
> > >Mem:          32083        1196       27861           1        3024       30429
> > >Swap:         16135          16       16119
> > >pages in cache: 14533/8388608 (0.2%)  [filesize=33554432.0K, pagesize=4K]
>
> Without -W it's very similar:
>
> $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
>               total        used        free      shared  buff/cache   available
> Mem:          32083        1184       26113           1        4785       30444
> Swap:         16135          16       16119
>     (100.00/100%)
>
> real    0m13.308s
> user    0m1.961s
> sys     0m40.455s
>               total        used        free      shared  buff/cache   available
> Mem:          32083        1188       26049           1        4845       30438
> Swap:         16135          16       16119
> pages in cache: 14808/8388608 (0.2%)  [filesize=33554432.0K, pagesize=4K]
>
> With -W and using fadvise=random is also about the same:
>
> $ free -m; time ./nbdkit file /var/tmp/random fadvise=random cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
>               total        used        free      shared  buff/cache   available
> Mem:          32083        1187       26109           1        4785       30440
> Swap:         16135          16       16119
>     (100.00/100%)
>
> real    0m13.030s
> user    0m1.986s
> sys     0m37.498s
>               total        used        free      shared  buff/cache   available
> Mem:          32083        1187       26053           1        4842       30440
> Swap:         16135          16       16119
> pages in cache: 14336/8388608 (0.2%)  [filesize=33554432.0K, pagesize=4K]
>
> I'm going to guess that for this case readahead doesn't have much time
> to get ahead of qemu.
>
> > >+=item B<fadvise=normal>
> > >+
> > >+=item B<fadvise=random>
> > >+
> > >+=item B<fadvise=sequential>
> > >+
> > >+This optional flag hints to the kernel that you will access the file
> > >+normally, or in a random order, or sequentially.  The exact behaviour
> > >+depends on your operating system, but for Linux using C<normal> causes
> > >+the kernel to read-ahead, C<sequential> causes the kernel to
> > >+read-ahead twice as much as C<normal>, and C<random> turns off
> > >+read-ahead.
> >
> > Is it worth a mention of L<posix_fadvise(3)> here, to let the user
> > get some idea of what their operating system supports?
>
> Yes I had this at one point but I seem to have dropped it.  Will
> add it back, thanks.
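
For anyone following along, the mapping onto posix_fadvise(3) is roughly
the following.  This is only a sketch; the enum and function names are
illustrative, not copied from the plugin source:

  #include <fcntl.h>

  enum fadvise_mode { fadvise_normal, fadvise_random, fadvise_sequential };

  /* Apply the fadvise= mode to the whole file, just after open. */
  static int
  apply_fadvise (int fd, enum fadvise_mode mode)
  {
  #ifdef HAVE_POSIX_FADVISE
    int advice = POSIX_FADV_NORMAL;      /* default readahead */

    if (mode == fadvise_random)
      advice = POSIX_FADV_RANDOM;        /* on Linux, disables readahead */
    else if (mode == fadvise_sequential)
      advice = POSIX_FADV_SEQUENTIAL;    /* on Linux, doubles readahead */

    /* offset = 0, len = 0 means "from start to end of file". */
    return posix_fadvise (fd, 0, 0, advice);
  #else
    return 0;                            /* not supported, ignore */
  #endif
  }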
>
> > >+=head2 Reducing evictions from the page cache
> > >+
> > >+If the file is very large and you know the client will only
> > >+read/write the file sequentially one time (eg for making a single copy
> > >+or backup) then this will stop other processes' data from being evicted
> > >+from the page cache:
> > >+
> > >+ nbdkit file disk.img fadvise=sequential cache=none
> >
> > It's also possible to avoid polluting the page cache by using
> > O_DIRECT, but that comes with harder guarantees (aligned access
> > through aligned buffers), so we may add it as another mode later on.
> > But in the meantime, cache=none is fairly nice while still avoiding
> > O_DIRECT.
>
> I'm not sure if or even how we could ever do a robust O_DIRECT
>

We can let the plugin or a filter deal with that. The simplest solution is
to push it onto the user and require aligned requests.

Maybe a filter can handle alignment?
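
To make the alignment requirement concrete, something like this is what
O_DIRECT forces on every caller (a sketch only, assuming a 4096-byte
block size; real code would query the device, eg with BLKSSZGET):

  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define ALIGNMENT 4096   /* assumption; query the device in real code */

  /* Read from an fd that was opened with O_DIRECT.  The buffer, count
   * and offset must all be multiples of the block size, otherwise the
   * kernel fails the request with EINVAL.
   */
  static ssize_t
  direct_pread (int fd, size_t count, off_t offset, void **out)
  {
    void *buf;
    ssize_t r;

    if (posix_memalign (&buf, ALIGNMENT, count) != 0)
      return -1;

    r = pread (fd, buf, count, offset);
    if (r == -1)
      free (buf);
    else
      *out = buf;
    return r;
  }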

> implementation, but my idea was that it might be an alternate
> implementation of cache=none.  But if we thought we might use O_DIRECT
> as a separate mode, then maybe we should rename cache=none.
> cache=advise?  cache=dontneed?  I can't think of a good name!
>

Yes, don't call it none if it still uses the page cache.

How about advise=?

I would keep cache semantics similar to qemu.



> > >@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> > >  {
> > >    struct handle *h = handle;
> > >+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> > >+  uint32_t orig_count = count;
> > >+  uint64_t orig_offset = offset;
> > >+
> > >+  /* If cache=none we want to force pages we have just written to the
> > >+   * file to be flushed to disk so we can immediately evict them from
> > >+   * the page cache.
> > >+   */
> > >+  if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
> > >+#endif
> > >+
> > >    while (count > 0) {
> > >      ssize_t r = pwrite (h->fd, buf, count, offset);
> > >      if (r == -1) {
> > >@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> > >    if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
> > >      return -1;
> > >+#ifdef HAVE_POSIX_FADVISE
> > >+  /* On Linux this will evict the pages we just wrote from the page cache. */
> > >+  if (cache_mode == cache_none)
> > >+    posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
> > >+#endif
> >
> > So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed
> > doesn't help?  You did point out that the use of FUA for flushing
> > slows things down, but that's a fair price to pay to keep the cache
> > clean.
>
> On Linux POSIX_FADV_DONTNEED won't flush dirty buffers.  I expect (but
> didn't actually measure) that just after a medium sized write the
> buffers would all be dirty so the posix_fadvise(DONTNEED) call would
> do nothing at all.  The advice online does seem to be that you must
> flush before calling this.  (Linus advocates a complex
> double-buffering solution so that you can be reading into one buffer
> while flushing the other, so you don't have the overhead of waiting
> for the flush).
>
> I'm going to do a bit of benchmarking of the write side now.
>
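
For clarity, the pattern being discussed is roughly this (a sketch only,
ignoring short writes; not the plugin's actual code):

  #include <fcntl.h>
  #include <unistd.h>

  /* Write, flush, then drop the pages from the page cache.  On Linux
   * POSIX_FADV_DONTNEED does not evict dirty pages, so the flush must
   * come first.
   */
  static ssize_t
  pwrite_nocache (int fd, const void *buf, size_t count, off_t offset)
  {
    ssize_t r = pwrite (fd, buf, count, offset);
    if (r == -1)
      return -1;

    if (fdatasync (fd) == -1)            /* make the pages clean */
      return -1;

    /* Best effort: evict the now-clean pages. */
    posix_fadvise (fd, offset, count, POSIX_FADV_DONTNEED);
    return r;
  }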

We already tried this with dd and the results were not good.

Nir


> Thanks,
>
> Rich.
>
> > Patch looks good to me.
> >
> > --
> > Eric Blake, Principal Software Engineer
> > Red Hat, Inc.           +1-919-301-3226
> > Virtualization:  qemu.org | libvirt.org
>
> --
> Richard Jones, Virtualization Group, Red Hat
> http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines.  Supports shell scripting,
> bindings from many languages.  http://libguestfs.org
>
>