[Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
Eric Blake
eblake at redhat.com
Fri Aug 7 12:53:13 UTC 2020
On 8/7/20 6:31 AM, Richard W.M. Jones wrote:
> You can use these flags as described in the manual page to optimize
> access patterns, and to get better behaviour with the page cache in
> some scenarios.
And if you guess wrong, it is only a performance penalty, not a
correctness issue.
>
> For my testing I used the cachedel and cachestats utilities written by
> Julius Plenz (https://github.com/Feh/nocache). I started with a 32 GB
> file of random data on a machine with about 32 GB of RAM. At the
> beginning of the test I evicted the file from the page cache:
>
> $ cachedel /var/tmp/random
> $ cachestats /var/tmp/random
> pages in cache: 0/8388608 (0.0%) [filesize=33554432.0K, pagesize=4K]
>
> Performing a normal sequential copy of the file to /dev/null shows
> that the file is almost entirely pulled into page cache (thus evicting
> useful programs and data):
>
> $ free -m; time ./nbdkit file /var/tmp/random --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
>            total        used        free      shared  buff/cache   available
> Mem:       32083        1193       27816           1        3073       30435
> Swap:      16135          16       16119
> (100.00/100%)
>
> real 0m12.437s
> user 0m2.005s
> sys 0m31.339s
>            total        used        free      shared  buff/cache   available
> Mem:       32083        1190         313           1       30578       30433
> Swap:      16135          16       16119
> pages in cache: 7053276/8388608 (84.1%) [filesize=33554432.0K, pagesize=4K]
>
> Now we repeat the test using fadvise=sequential cache=none:
>
> $ cachedel /var/tmp/random
> $ cachestats /var/tmp/random
> pages in cache: 106/8388608 (0.0%) [filesize=33554432.0K, pagesize=4K]
>
> $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
Hmm - the -W actually means that qemu-img is performing semi-random
access (there is no guarantee that the 16 coroutines are serviced in
linear order of the file), even though we really are making only one
pass through the file in bulk. I don't know if fadvise=normal would be
any better; dropping -W but keeping -m 16 might also be an interesting
number to check (where qemu-img tries harder to do in-order access, but
still takes advantage of parallel threads).
>            total        used        free      shared  buff/cache   available
> Mem:       32083        1188       27928           1        2966       30440
> Swap:      16135          16       16119
> (100.00/100%)
>
> real 0m13.107s
> user 0m2.051s
> sys 0m37.556s
>            total        used        free      shared  buff/cache   available
> Mem:       32083        1196       27861           1        3024       30429
> Swap:      16135          16       16119
> pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
>
> In this case the file largely avoids being pulled into the page cache,
> and we do not evict useful stuff.
>
> Notice that the test takes slightly longer to run. This is expected
> because page cache eviction happens synchronously. I expect the cost
> when doing sequential writes to be higher. Linus outlined a technique
> to do this without the overhead, but unfortunately it is considerably
> more complex and dangerous than I am comfortable adding to the file
> plugin:
>
> http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
> http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
>
> (See also scary warnings in the sync_file_range man page)
We can always add more knobs later if someone has a use case and
benchmarks for them. I think what you have here is fine.
> +
> +=item B<fadvise=normal>
> +
> +=item B<fadvise=random>
> +
> +=item B<fadvise=sequential>
> +
> +This optional flag hints to the kernel that you will access the file
> +normally, or in a random order, or sequentially. The exact behaviour
> +depends on your operating system, but for Linux using C<normal> causes
> +the kernel to read-ahead, C<sequential> causes the kernel to
> +read-ahead twice as much as C<normal>, and C<random> turns off
> +read-ahead.
Is it worth a mention of L<posix_fadvise(3)> here, to let the user get
some idea of what their operating system supports?
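To make the mapping concrete, here is a minimal sketch of how the three
fadvise= values could translate into posix_fadvise calls. The helper
names advice_from_mode and apply_fadvise are hypothetical and are not
taken from the patch; only posix_fadvise and the POSIX_FADV_* constants
are real.

```c
#define _POSIX_C_SOURCE 200809L  /* for posix_fadvise() */
#include <errno.h>
#include <fcntl.h>
#include <string.h>

/* Map an fadvise=... value to a POSIX advice constant.
 * Returns -1 for an unrecognised string.  (Hypothetical helper.)
 */
int
advice_from_mode (const char *mode)
{
  if (strcmp (mode, "normal") == 0) return POSIX_FADV_NORMAL;
  if (strcmp (mode, "random") == 0) return POSIX_FADV_RANDOM;
  if (strcmp (mode, "sequential") == 0) return POSIX_FADV_SEQUENTIAL;
  return -1;
}

/* Apply the hint to the whole file: offset 0 with len 0 means
 * "from the start to the end of the file".  posix_fadvise() returns
 * an error number directly instead of setting errno, so that value
 * is passed through; 0 means success.  (Hypothetical helper.)
 */
int
apply_fadvise (int fd, const char *mode)
{
  int advice = advice_from_mode (mode);

  if (advice == -1)
    return EINVAL;
  return posix_fadvise (fd, 0, 0, advice);
}
```

Since the call is only a hint, a caller could reasonably log a non-zero
return and carry on rather than fail the open.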
> +=head2 Reducing evictions from the page cache
> +
> +If the file is very large and you know the client will only
> +read/write the file sequentially one time (eg for making a single copy
> +or backup) then this will stop other processes from being evicted from
> +the page cache:
> +
> + nbdkit file disk.img fadvise=sequential cache=none
It's also possible to avoid polluting the page cache by using O_DIRECT,
but that comes with stricter requirements (aligned access through
aligned buffers), so we may add it as another mode later on. But in the
meantime, cache=none is fairly nice while still avoiding O_DIRECT.
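To illustrate what those alignment requirements would mean in practice,
here is a rough sketch, not something the patch implements. The names
DIRECT_ALIGN, is_direct_aligned, alloc_direct_buffer and open_direct are
hypothetical, and the 4096-byte alignment is an assumption; the real
value depends on the filesystem and the underlying device.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1            /* exposes O_DIRECT on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment; the true requirement is filesystem/device specific. */
#define DIRECT_ALIGN 4096

/* With O_DIRECT both the file offset and the length generally have to
 * be multiples of the alignment.  Returns 1 if this request qualifies.
 */
int
is_direct_aligned (uint64_t offset, uint32_t count)
{
  return offset % DIRECT_ALIGN == 0 && count % DIRECT_ALIGN == 0;
}

/* The memory buffer has to be aligned as well.  Returns NULL and sets
 * errno on failure; the caller releases the buffer with free().
 */
void *
alloc_direct_buffer (size_t count)
{
  void *buf = NULL;
  int err = posix_memalign (&buf, DIRECT_ALIGN, count);

  if (err != 0) {
    errno = err;
    return NULL;
  }
  return buf;
}

/* Open for direct I/O where the platform offers O_DIRECT at all. */
int
open_direct (const char *path)
{
#ifdef O_DIRECT
  return open (path, O_RDWR | O_DIRECT);
#else
  (void) path;
  errno = ENOTSUP;
  return -1;
#endif
}
```

A plugin mode built on this would presumably need a bounce buffer for
any client request where is_direct_aligned returns 0, which is part of
why cache=none is the simpler option for now.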
> @@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> {
> struct handle *h = handle;
>
> +#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> + uint32_t orig_count = count;
> + uint64_t orig_offset = offset;
> +
> + /* If cache=none we want to force pages we have just written to the
> + * file to be flushed to disk so we can immediately evict them from
> + * the page cache.
> + */
> + if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
> +#endif
> +
> while (count > 0) {
> ssize_t r = pwrite (h->fd, buf, count, offset);
> if (r == -1) {
> @@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
> return -1;
>
> +#ifdef HAVE_POSIX_FADVISE
> + /* On Linux this will evict the pages we just wrote from the page cache. */
> + if (cache_mode == cache_none)
> + posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
> +#endif
So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed
doesn't help? You did point out that the use of FUA for flushing slows
things down, but that's a fair price to pay to keep the cache clean.
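For reference, the Linux posix_fadvise(2) man page notes that pages which
have not yet been written out are unaffected by POSIX_FADV_DONTNEED, and
suggests calling fsync or fdatasync first. The write, flush, evict
sequence can be sketched standalone as below. The function name
pwrite_and_evict is hypothetical, and fdatasync is used here as a
stand-in for whatever file_flush does in the plugin.

```c
#define _POSIX_C_SOURCE 200809L  /* pwrite(), fdatasync(), posix_fadvise() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write count bytes at offset, flush them to disk, then ask the kernel
 * to drop the now-clean pages from the page cache.
 * Returns 0 on success, -1 with errno set on write or flush failure.
 */
int
pwrite_and_evict (int fd, const void *buf, size_t count, off_t offset)
{
  const char *p = buf;
  size_t left = count;
  off_t pos = offset;

  /* pwrite() may write less than requested, so loop until done. */
  while (left > 0) {
    ssize_t r = pwrite (fd, p, left, pos);
    if (r == -1) {
      if (errno == EINTR)
        continue;
      return -1;
    }
    p += r;
    left -= (size_t) r;
    pos += r;
  }

  /* Dirty pages cannot simply be dropped, so flush them first. */
  if (fdatasync (fd) == -1)
    return -1;

  /* Only a hint: the return value is deliberately ignored. */
  (void) posix_fadvise (fd, offset, (off_t) count, POSIX_FADV_DONTNEED);
  return 0;
}
```

The flush on every write is where the extra cost comes from, which
matches the expectation that sequential writes pay more than reads.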
Patch looks good to me.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org