<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 7, 2020, 16:16 Richard W.M. Jones &lt;<a href="mailto:rjones@redhat.com">rjones@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Fri, Aug 07, 2020 at 07:53:13AM -0500, Eric Blake wrote:<br> > >$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random<br> > <br> > Hmm - the -W actually says that qemu-img is performing semi-random<br> > access (there is no guarantee that the 16 coroutines are serviced in<br> > linear order of the file), even though we really are making only one<br> > pass through the file in bulk. I don't know if fadvise=normal would<br> > be any better; dropping -W but keeping -m 16 might also be an<br> > interesting number to check (where qemu-img tries harder to do<br> > in-order access, but still take advantage of parallel threads).<br> > <br> > > total used free shared buff/cache available<br> > >Mem: 32083 1188 27928 1 2966 30440<br> > >Swap: 16135 16 16119<br> > > (100.00/100%)<br> > ><br> > >real 0m13.107s<br> > >user 0m2.051s<br> > >sys 0m37.556s<br> > > total used free shared buff/cache available<br> > >Mem: 32083 1196 27861 1 3024 30429<br> > >Swap: 16135 16 16119<br> > >pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]<br> <br> Without -W it's very similar:<br> <br> $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random <br> total used free shared buff/cache available<br> Mem: 32083 1184 26113 1 4785 30444<br> Swap: 16135 16 16119<br> (100.00/100%)<br> <br> real 0m13.308s<br> user 0m1.961s<br> sys 0m40.455s<br> total 
used free shared buff/cache available<br> Mem: 32083 1188 26049 1 4845 30438<br> Swap: 16135 16 16119<br> pages in cache: 14808/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]<br> <br> With -W and using fadvise=random is also about the same:<br> <br> $ free -m; time ./nbdkit file /var/tmp/random fadvise=random cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random <br> total used free shared buff/cache available<br> Mem: 32083 1187 26109 1 4785 30440<br> Swap: 16135 16 16119<br> (100.00/100%)<br> <br> real 0m13.030s<br> user 0m1.986s<br> sys 0m37.498s<br> total used free shared buff/cache available<br> Mem: 32083 1187 26053 1 4842 30440<br> Swap: 16135 16 16119<br> pages in cache: 14336/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]<br> <br> I'm going to guess that for this case readahead doesn't have much time<br> to get ahead of qemu.<br> <br> > >+=item B&lt;fadvise=normal&gt;<br> > >+<br> > >+=item B&lt;fadvise=random&gt;<br> > >+<br> > >+=item B&lt;fadvise=sequential&gt;<br> > >+<br> > >+This optional flag hints to the kernel that you will access the file<br> > >+normally, or in a random order, or sequentially. The exact behaviour<br> > >+depends on your operating system, but for Linux using C&lt;normal&gt; causes<br> > >+the kernel to read-ahead, C&lt;sequential&gt; causes the kernel to<br> > >+read-ahead twice as much as C&lt;normal&gt;, and C&lt;random&gt; turns off<br> > >+read-ahead.<br> > <br> > Is it worth a mention of L&lt;posix_fadvise(3)&gt; here, to let the user<br> > get some idea of what their operating system supports?<br> <br> Yes I had this at one point but I seem to have dropped it. 
Will<br> add it back, thanks.<br> <br> > >+=head2 Reducing evictions from the page cache<br> > >+<br> > >+If the file is very large and you known the client will only<br> > >+read/write the file sequentially one time (eg for making a single copy<br> > >+or backup) then this will stop other processes from being evicted from<br> > >+the page cache:<br> > >+<br> > >+ nbdkit file disk.img fadvise=sequential cache=none<br> > <br> > It's also possible to avoid polluting the page cache by using<br> > O_DIRECT, but that comes with harder guarantees (aligned access<br> > through aligned buffers), so we may add it as another mode later on.<br> > But in the meantime, cache=none is fairly nice while still avoiding<br> > O_DIRECT.<br> <br> I'm not sure if or even how we could ever do a robust O_DIRECT<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">We can let the plugin and filter deal with that. The simplest solution is to drop it on the user and require aligned requests.</div><div dir="auto"><br></div><div dir="auto">Maybe a filter can handle alignment?</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> implementation, but my idea was that it might be an alternate<br> implementation of cache=none. But if we thought we might use O_DIRECT<br> as a separate mode, then maybe we should rename cache=none.<br> cache=advise? cache=dontneed? 
I can't think of a good name!<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Yes, don't call it none if you use the cache.</div><div dir="auto"><br></div><div dir="auto">How about advise=?</div><div dir="auto"><br></div><div dir="auto">I would keep cache semantics similar to qemu.</div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> > >@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,<br> > > {<br> > > struct handle *h = handle;<br> > >+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)<br> > >+ uint32_t orig_count = count;<br> > >+ uint64_t orig_offset = offset;<br> > >+<br> > >+ /* If cache=none we want to force pages we have just written to the<br> > >+ * file to be flushed to disk so we can immediately evict them from<br> > >+ * the page cache.<br> > >+ */<br> > >+ if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;<br> > >+#endif<br> > >+<br> > > while (count > 0) {<br> > > ssize_t r = pwrite (h->fd, buf, count, offset);<br> > > if (r == -1) {<br> > >@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,<br> > > if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)<br> > > return -1;<br> > >+#ifdef HAVE_POSIX_FADVISE<br> > >+ /* On Linux this will evict the pages we just wrote from the page cache. */<br> > >+ if (cache_mode == cache_none)<br> > >+ posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);<br> > >+#endif<br> > <br> > So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed<br> > doesn't help? You did point out that the use of FUA for flushing<br> > slows things down, but that's a fair price to pay to keep the cache<br> > clean.<br> <br> On Linux POSIX_FADV_DONTNEED won't flush dirty buffers. 
I expect (but<br> didn't actually measure) that just after a medium sized write the<br> buffers would all be dirty so the posix_fadvise(DONTNEED) call would<br> do nothing at all. The advice online does seem to be that you must<br> flush before calling this. (Linus advocates a complex<br> double-buffering solution so that you can be reading into one buffer<br> while flushing the other, so you don't have the overhead of waiting<br> for the flush).<br> <br> I'm going to do a bit of benchmarking of the write side now.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">We already tried this with dd and the results were not good.</div><div dir="auto"><br></div><div dir="auto">Nir</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> Thanks,<br> <br> Rich.<br> <br> > Patch looks good to me.<br> > <br> > -- <br> > Eric Blake, Principal Software Engineer<br> > Red Hat, Inc. +1-919-301-3226<br> > Virtualization: <a href="http://qemu.org" rel="noreferrer noreferrer" target="_blank">qemu.org</a> | <a href="http://libvirt.org" rel="noreferrer noreferrer" target="_blank">libvirt.org</a><br> <br> -- <br> Richard Jones, Virtualization Group, Red Hat <a href="http://people.redhat.com/~rjones" rel="noreferrer noreferrer" target="_blank">http://people.redhat.com/~rjones</a><br> Read my programming and virtualization blog: <a href="http://rwmj.wordpress.com" rel="noreferrer noreferrer" target="_blank">http://rwmj.wordpress.com</a><br> libguestfs lets you edit virtual machines. Supports shell scripting,<br> bindings from many languages. <a href="http://libguestfs.org" rel="noreferrer noreferrer" target="_blank">http://libguestfs.org</a><br> <br> </blockquote></div></div></div>