[Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.

Fri Aug 7 12:53:13 UTC 2020

On 8/7/20 6:31 AM, Richard W.M. Jones wrote:
> You can use these flags as described in the manual page to optimize
> access patterns, and to get better behaviour with the page cache in
> some scenarios.

And if you guess wrong, it is only a performance penalty, not a 
correctness issue.

> 
> For my testing I used the cachedel and cachestats utilities written by
> Julius Plenz (https://github.com/Feh/nocache).  I started with a 32 GB
> file of random data on a machine with about 32 GB of RAM.  At the
> beginning of the test I evicted the file from the page cache:
> 
> $ cachedel /var/tmp/random
> $ cachestats /var/tmp/random
> pages in cache: 0/8388608 (0.0%)  [filesize=33554432.0K, pagesize=4K]
> 
> Performing a normal sequential copy of the file to /dev/null shows
> that the file is almost entirely pulled into page cache (thus evicting
> useful programs and data):
> 
> $ free -m; time ./nbdkit file /var/tmp/random --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
>                total        used        free      shared  buff/cache   available
> Mem:          32083        1193       27816           1        3073       30435
> Swap:         16135          16       16119
>      (100.00/100%)
> 
> real	0m12.437s
> user	0m2.005s
> sys	0m31.339s
>                total        used        free      shared  buff/cache   available
> Mem:          32083        1190         313           1       30578       30433
> Swap:         16135          16       16119
> pages in cache: 7053276/8388608 (84.1%)  [filesize=33554432.0K, pagesize=4K]
> 
> Now we repeat the test using fadvise=sequential cache=none:
> 
> $ cachedel /var/tmp/random
> $ cachestats /var/tmp/random
> pages in cache: 106/8388608 (0.0%)  [filesize=33554432.0K, pagesize=4K]
> 
> $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random

Hmm - the -W actually says that qemu-img is performing semi-random 
access (there is no guarantee that the 16 coroutines are serviced in 
linear order of the file), even though we really are making only one 
pass through the file in bulk.  I don't know if fadvise=normal would be 
any better; dropping -W but keeping -m 16 might also be an interesting 
number to check (where qemu-img tries harder to do in-order access, but 
still takes advantage of parallel threads).

>                total        used        free      shared  buff/cache   available
> Mem:          32083        1188       27928           1        2966       30440
> Swap:         16135          16       16119
>      (100.00/100%)
> 
> real	0m13.107s
> user	0m2.051s
> sys	0m37.556s
>                total        used        free      shared  buff/cache   available
> Mem:          32083        1196       27861           1        3024       30429
> Swap:         16135          16       16119
> pages in cache: 14533/8388608 (0.2%)  [filesize=33554432.0K, pagesize=4K]
> 
> In this case the file largely avoids being pulled into the page cache,
> and we do not evict useful stuff.
> 
> Notice that the test takes slightly longer to run.  This is expected
> because page cache eviction happens synchronously.  I expect the cost
> when doing sequential writes to be higher.  Linus outlined a technique
> to do this without the overhead, but unfortunately it is considerably
> more complex and dangerous than I am comfortable adding to the file
> plugin:
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
> http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
> 
> (See also scary warnings in the sync_file_range man page)

We can always add more knobs later if someone has a use case and 
benchmarks for them.  I think what you have here is fine.

> +
> +=item B<fadvise=normal>
> +
> +=item B<fadvise=random>
> +
> +=item B<fadvise=sequential>
> +
> +This optional flag hints to the kernel that you will access the file
> +normally, or in a random order, or sequentially.  The exact behaviour
> +depends on your operating system, but for Linux using C<normal> causes
> +the kernel to read-ahead, C<sequential> causes the kernel to
> +read-ahead twice as much as C<normal>, and C<random> turns off
> +read-ahead.

Is it worth a mention of L<posix_fadvise(3)> here, to let the user get 
some idea of what their operating system supports?
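
For readers who have not used posix_fadvise(3) directly, here is a rough
sketch of how the three mode names could map onto the POSIX advice
constants.  The helper names advice_from_name and apply_fadvise are made
up for illustration and are not taken from the patch; the plugin may
well structure this differently.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>

/* Hypothetical helper (not from the patch): translate a mode name into
 * a POSIX_FADV_* constant.  Returns 0 on success, -1 for unknown names.
 */
static int
advice_from_name (const char *name, int *advice)
{
  assert (name != NULL && advice != NULL);

  if (strcmp (name, "normal") == 0)
    *advice = POSIX_FADV_NORMAL;
  else if (strcmp (name, "random") == 0)
    *advice = POSIX_FADV_RANDOM;
  else if (strcmp (name, "sequential") == 0)
    *advice = POSIX_FADV_SEQUENTIAL;
  else
    return -1;
  return 0;
}

/* Hypothetical helper: apply the hint to the whole file.  A length of 0
 * means "through to the end of the file".  Note that posix_fadvise
 * returns an error number directly instead of setting errno.
 */
static int
apply_fadvise (int fd, const char *name)
{
  int advice;

  if (advice_from_name (name, &advice) == -1)
    return EINVAL;
  return posix_fadvise (fd, 0, 0, advice);
}
```

Since the call is only a hint, a caller could reasonably log a non-zero
return and carry on rather than treat it as fatal.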

> +=head2 Reducing evictions from the page cache
> +
> +If the file is very large and you know the client will only
> +read/write the file sequentially one time (eg for making a single copy
> +or backup) then this will stop other processes from being evicted from
> +the page cache:
> +
> + nbdkit file disk.img fadvise=sequential cache=none

It's also possible to avoid polluting the page cache by using O_DIRECT, 
but that comes with stricter requirements (aligned access through aligned 
buffers), so we may add it as another mode later on.  But in the 
meantime, cache=none is fairly nice while still avoiding O_DIRECT.
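
To make the alignment burden concrete, here is a small sketch of the
checks an O_DIRECT mode would probably need before issuing I/O.  The
function names are hypothetical, and the alignment value is an
assumption supplied by the caller: the real requirement depends on the
filesystem and the underlying device.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical check: with O_DIRECT the buffer address, the transfer
 * length and the file offset generally all have to be multiples of the
 * required alignment.  'align' must be a power of two.
 */
static bool
request_is_aligned (const void *buf, size_t count, off_t offset,
                    size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);

  return (uintptr_t) buf % align == 0
      && count % align == 0
      && (uint64_t) offset % align == 0;
}

/* Hypothetical helper: allocate a buffer suitable for aligned I/O.
 * Returns NULL on failure; release the buffer with free().
 */
static void *
alloc_aligned (size_t align, size_t size)
{
  void *p = NULL;

  if (posix_memalign (&p, align, size) != 0)
    return NULL;
  return p;
}
```

An NBD client can send requests at arbitrary offsets and lengths, so a
plugin using O_DIRECT would have to bounce unaligned requests through an
aligned buffer or advertise a minimum block size, which is the extra
complexity that cache=none avoids.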

> @@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
>   {
>     struct handle *h = handle;
>   
> +#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> +  uint32_t orig_count = count;
> +  uint64_t orig_offset = offset;
> +
> +  /* If cache=none we want to force pages we have just written to the
> +   * file to be flushed to disk so we can immediately evict them from
> +   * the page cache.
> +   */
> +  if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
> +#endif
> +
>     while (count > 0) {
>       ssize_t r = pwrite (h->fd, buf, count, offset);
>       if (r == -1) {
> @@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
>     if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
>       return -1;
>   
> +#ifdef HAVE_POSIX_FADVISE
> +  /* On Linux this will evict the pages we just wrote from the page cache. */
> +  if (cache_mode == cache_none)
> +    posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
> +#endif

So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed 
doesn't help?  You did point out that the use of FUA for flushing slows 
things down, but that's a fair price to pay to keep the cache clean.
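
For anyone following along outside the plugin, the write, flush and
evict sequence can be sketched standalone like this.  write_and_evict is
a made-up name, and fdatasync stands in for whatever file_flush does in
the plugin, which is an assumption on my part.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical standalone version of the cache=none write path:
 * write the data, flush it to disk, then advise the kernel that the
 * range is no longer needed.  Returns 0 on success, -1 with errno set.
 */
static int
write_and_evict (int fd, const void *buf, size_t count, off_t offset)
{
  const char *p = buf;
  size_t left = count;
  off_t pos = offset;
  int err;

  assert (buf != NULL || count == 0);

  /* pwrite may write less than requested, so loop until done. */
  while (left > 0) {
    ssize_t r = pwrite (fd, p, left, pos);
    if (r == -1) {
      if (errno == EINTR)
        continue;
      return -1;
    }
    p += r;
    left -= (size_t) r;
    pos += r;
  }

  /* Flush first so the pages are clean before asking for eviction. */
  if (fdatasync (fd) == -1)
    return -1;

  /* posix_fadvise returns the error number rather than setting errno. */
  err = posix_fadvise (fd, offset, (off_t) count, POSIX_FADV_DONTNEED);
  if (err != 0) {
    errno = err;
    return -1;
  }
  return 0;
}
```

The flush on every write is where the extra cost comes from, which
matches the expectation in the commit message that sequential writes
will be slower with cache=none.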

Patch looks good to me.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org