[Libguestfs] [PATCH libnbd] copy: Set POSIX_FADV_SEQUENTIAL and POSIX_FADV_DONTNEED.

Richard W.M. Jones rjones at redhat.com
Fri Feb 19 19:41:56 UTC 2021


On Fri, Feb 19, 2021 at 09:08:08PM +0200, Nir Soffer wrote:
> On Fri, Feb 19, 2021 at 7:57 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> >
> > Set POSIX_FADV_SEQUENTIAL flag on the file when copying to or from a
> > local file.  This tells the operating system that we intend to read or
> > write sequentially and to optimize for that case.
> >
> > Avoid polluting the page cache when copying to or from a local file or
> > block device by calling POSIX_FADV_DONTNEED.  This is similar to what
> > we do in nbdkit-file-plugin:
> > https://gitlab.com/nbdkit/nbdkit/-/commit/aa5a2183a6d16afd919174b5ba8222b2bccf4039
> >
> > Before this change:
> >
> > $ cachedel /var/tmp/random
> > $ cachestats /var/tmp/random
> > pages in cache: 0/8388608 (0.0%)  [filesize=33554432.0K, pagesize=4K]
> > $ free -m; time ./run nbdcopy /var/tmp/random null: ; free -m ; cachestats /var/tmp/random
> >               total        used        free      shared  buff/cache   available
> > Mem:          32080        1069       14432           3       16578       30550
> > Swap:         16135          86       16049
> >
> > real    0m21.484s
> > user    0m2.275s
> > sys     0m25.837s
> >               total        used        free      shared  buff/cache   available
> > Mem:          32080        1041         297           3       30741       30581
> > Swap:         16135          86       16049
> > pages in cache: 4971577/8388608 (59.3%)  [filesize=33554432.0K, pagesize=4K]
> >
> > After this change:
> >
> > $ cachedel /var/tmp/random
> > $ free -m; time ./run nbdcopy /var/tmp/random null: ; free -m ; cachestats /var/tmp/random
> >               total        used        free      shared  buff/cache   available
> > Mem:          32080        1071       14431           3       16578       30548
> > Swap:         16135          86       16049
> >
> > real    0m22.423s
> > user    0m2.180s
> > sys     0m31.059s
> >               total        used        free      shared  buff/cache   available
> > Mem:          32080        1053       14238           3       16789       30566
> > Swap:         16135          86       16049
> > pages in cache: 53888/8388608 (0.6%)  [filesize=33554432.0K, pagesize=4K]
> > ---
> >  configure.ac    |  3 +++
> >  copy/file-ops.c | 17 +++++++++++++++--
> >  copy/main.c     | 11 +++++++++++
> >  3 files changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/configure.ac b/configure.ac
> > index 58a7724..b36ae09 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -79,6 +79,9 @@ AC_CHECK_HEADERS([\
> >
> >  AC_CHECK_HEADERS([linux/vm_sockets.h], [], [], [#include <sys/socket.h>])
> >
> > +dnl posix_fadvise helps to optimise linear reads and writes (optional).
> > +AC_CHECK_FUNCS([posix_fadvise])
> > +
> >  dnl Check for strerrordesc_np (optional, glibc only).
> >  dnl Prefer this over sys_errlist.
> >  dnl https://lists.fedoraproject.org/archives/list/glibc@lists.fedoraproject.org/thread/WJHGG2OO7ABNAYICGA5WQZ2Q34Q2FEHU/
> > diff --git a/copy/file-ops.c b/copy/file-ops.c
> > index 2a239d0..1310f08 100644
> > --- a/copy/file-ops.c
> > +++ b/copy/file-ops.c
> > @@ -60,11 +60,12 @@ static size_t
> >  file_synch_read (struct rw *rw,
> >                   void *data, size_t len, uint64_t offset)
> >  {
> > +  int fd = rw->u.local.fd;
> >    size_t n = 0;
> >    ssize_t r;
> >
> >    while (len > 0) {
> > -    r = pread (rw->u.local.fd, data, len, offset);
> > +    r = pread (fd, data, len, offset);
> >      if (r == -1) {
> >        perror (rw->name);
> >        exit (EXIT_FAILURE);
> > @@ -72,6 +73,11 @@ file_synch_read (struct rw *rw,
> >      if (r == 0)
> >        return n;
> >
> > +#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> > +    /* On Linux this will evict the pages we just read from the page cache. */
> > +    posix_fadvise (fd, offset, r, POSIX_FADV_DONTNEED);
> 
> I don't think this is a good idea, since this affects the current page
> cache, for
> the entire host.
> 
> So if the host is having an image in cache for good reason, running nbdcopy
> will drop the cache since nbdcopy does not need it, but maybe the host will
> need that cache after running nbdcopy.
> 
> The right way to avoid polluting the page cache is to bypass the cache using
> O_DIRECT, so nbdcopy is not using or affecting the page cache.
> 
> This can be useful if the user can enable this with a flag.

The trouble with O_DIRECT is it's a pain to use correctly, and I guess
doesn't use readahead or the existing cache (kind of the opposite
problem).

However I take your point that we probably ought not to remove files
from the cache that are already there.  I somehow thought that
POSIX_FADV_DONTNEED only affected the current 'fd', but looking at the
Linux impl I see it affects the whole system.

This argues for having a flag to enable this.

Rich.

> > +#endif
> > +
> >      data += r;
> >      offset += r;
> >      len -= r;
> > @@ -85,14 +91,21 @@ static void
> >  file_synch_write (struct rw *rw,
> >                    const void *data, size_t len, uint64_t offset)
> >  {
> > +  int fd = rw->u.local.fd;
> >    ssize_t r;
> >
> >    while (len > 0) {
> > -    r = pwrite (rw->u.local.fd, data, len, offset);
> > +    r = pwrite (fd, data, len, offset);
> >      if (r == -1) {
> >        perror (rw->name);
> >        exit (EXIT_FAILURE);
> >      }
> > +
> > +#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> > +    /* On Linux this will evict the pages we just wrote from the page cache. */
> > +    posix_fadvise (fd, offset, r, POSIX_FADV_DONTNEED);
> 
> This is fine, since we are creating a new image, and nobody else needs it yet.
> But when I tested similar code in the past it had no effect until you
> call fsync().
> 
> > +#endif
> > +
> >      data += r;
> >      offset += r;
> >      len -= r;
> > diff --git a/copy/main.c b/copy/main.c
> > index 68a6030..502ad28 100644
> > --- a/copy/main.c
> > +++ b/copy/main.c
> > @@ -460,6 +460,15 @@ open_null (struct rw *rw)
> >    rw->size = INT64_MAX;
> >  }
> >
> > +/* Set the POSIX_FADV_SEQUENTIAL flag on a file descriptor, but don't fail. */
> > +static inline void
> > +fadvise_sequential (int fd)
> > +{
> > +#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_SEQUENTIAL)
> > +  posix_fadvise (fd, 0, 0, POSIX_FADV_SEQUENTIAL);
> > +#endif
> > +}
> > +
> >  /* Open a local (non-NBD) file, ie. a file, device, or "-" for stdio.
> >   * Returns the open file descriptor which the caller must close.
> >   *
> > @@ -524,12 +533,14 @@ open_local (const char *prog,
> >      if (ioctl (fd, BLKSSZGET, &rw->u.local.sector_size))
> >        fprintf (stderr, "warning: cannot get sector size: %s: %m", rw->name);
> >  #endif
> > +    fadvise_sequential (fd);
> >    }
> >    else if (S_ISREG (rw->u.local.stat.st_mode)) {
> >      /* Regular file. */
> >      rw->ops = &file_ops;
> >      rw->size = rw->u.local.stat.st_size;
> >      rw->u.local.seek_hole_supported = seek_hole_supported (fd);
> > +    fadvise_sequential (fd);
> >    }
> >    else {
> >      /* Probably stdin/stdout, a pipe or a socket.  Set size == -1
> > --
> > 2.29.0.rc2
> >

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v




More information about the Libguestfs mailing list