[dm-devel] [PATCH v2] memcpy_flushcache: use cache flushing for larger lengths
Mikulas Patocka
mpatocka at redhat.com
Tue Mar 31 11:58:34 UTC 2020
On Tue, 31 Mar 2020, Elliott, Robert (Servers) wrote:
>
>
> > -----Original Message-----
> > From: Mikulas Patocka <mpatocka at redhat.com>
> > Sent: Monday, March 30, 2020 6:32 AM
> > To: Dan Williams <dan.j.williams at intel.com>; Vishal Verma
> > <vishal.l.verma at intel.com>; Dave Jiang <dave.jiang at intel.com>; Ira
> > Weiny <ira.weiny at intel.com>; Mike Snitzer <msnitzer at redhat.com>
> > Cc: linux-nvdimm at lists.01.org; dm-devel at redhat.com
> > Subject: [PATCH v2] memcpy_flushcache: use cache flushing for larger
> > lengths
> >
> > I tested dm-writecache performance on a machine with Optane nvdimm
> > and it turned out that for larger writes, cached stores + cache
> > flushing perform better than non-temporal stores. This is the
> > throughput of dm-writecache measured with this command:
> > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> >
> > block size   512       1024      2048      4096
> > movnti       496 MB/s  642 MB/s  725 MB/s  744 MB/s
> > clflushopt   373 MB/s  688 MB/s  1.1 GB/s  1.2 GB/s
> >
> > We can see that for smaller blocks, movnti performs better, but for
> > larger blocks, clflushopt has better performance.
>
> There are other interactions to consider... see threads from the last
> few years on the linux-nvdimm list.
dm-writecache is the only Linux driver that uses memcpy_flushcache on
persistent memory. There is also the btt driver; it uses the "do_io"
method to write to persistent memory, and I don't know where that method
comes from.
Anyway, if patching memcpy_flushcache conflicts with something else, we
should introduce memcpy_flushcache_to_pmem.
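If such a helper were added, the dispatch could look like the sketch
below. This is a minimal userspace illustration, not kernel code: the
stub copy functions stand in for the movnti-based path and the cached
copy + clwb/clflushopt path, and the 2048-byte cutoff is an illustrative
guess taken from the crossover in the table above, not a measured
kernel constant.

```c
#include <string.h>
#include <stddef.h>

/* Stubs standing in for the two real paths; in the kernel these would
 * be the existing non-temporal memcpy_flushcache and a cached copy
 * followed by clwb/clflushopt per cache line. */
static void nt_copy(void *d, const void *s, size_t n)
{
    memcpy(d, s, n);
}

static void cached_copy_flush(void *d, const void *s, size_t n)
{
    memcpy(d, s, n);
}

/* Illustrative cutoff only, based on the crossover in the dd numbers
 * above (movnti wins at 512-1024 bytes, clflushopt wins at 2048+). */
#define FLUSH_THRESHOLD 2048

void memcpy_flushcache_to_pmem(void *dst, const void *src, size_t len)
{
    if (len < FLUSH_THRESHOLD)
        nt_copy(dst, src, len);          /* small: non-temporal stores */
    else
        cached_copy_flush(dst, src, len); /* large: cached + flush */
}
```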
> For example, software generally expects that read()s take a long time and
> avoids re-reading from disk; the normal pattern is to hold the data in
> memory and read it from there. By using normal stores, CPU caches end up
> holding a bunch of persistent memory data that is probably not going to
> be read again any time soon, bumping out more useful data. In contrast,
> movnti avoids filling the CPU caches.
But if I write one cache line and flush it immediately, it consumes
just one associative entry in the cache.
> Another option is the AVX vmovntdq instruction (if available), the
> most recent of which does 64-byte (cache line) sized transfers to
> zmm registers. There's a hefty context switching overhead (e.g.,
> 304 clocks), and the CPU often runs AVX instructions at a slower
> clock frequency, so it's hard to judge when it's worthwhile.
The benchmark shows that 64-byte non-temporal avx512 vmovntdq is as good
as 8-, 16- or 32-byte writes:
                                        ram       nvdimm
sequential write-nt 4 bytes             4.1 GB/s  1.3 GB/s
sequential write-nt 8 bytes             4.1 GB/s  1.3 GB/s
sequential write-nt 16 bytes (sse)      4.1 GB/s  1.3 GB/s
sequential write-nt 32 bytes (avx)      4.2 GB/s  1.3 GB/s
sequential write-nt 64 bytes (avx512)   4.1 GB/s  1.3 GB/s
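The non-temporal sequential writes measured above presumably reduce to
a loop of stream stores followed by a fence; a minimal sketch of the
16-byte (sse) case, assuming a 16-byte-aligned destination and a length
that is a multiple of 16:

```c
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Sketch of the 16-byte (sse) non-temporal path: movntdq stream stores
 * bypass the cache hierarchy, and the trailing sfence orders them
 * before any subsequent stores. dst must be 16-byte aligned and len a
 * multiple of 16 in this simplified version. */
static void write_nt_sse(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
    _mm_sfence();
}
```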
With cached writes (where each cache line is immediately followed by clwb
or clflushopt), 8-, 16- or 32-byte writes perform better than non-temporal
stores, and avx512 performs worse:
sequential write 8 + clwb               5.1 GB/s  1.6 GB/s
sequential write 16 (sse) + clwb        5.1 GB/s  1.6 GB/s
sequential write 32 (avx) + clwb        4.4 GB/s  1.5 GB/s
sequential write 64 (avx512) + clwb     1.7 GB/s  0.6 GB/s
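The cached-store + flush pattern in this second table can be sketched in
userspace C. This is a hedged illustration, not the benchmark code
itself: the measurements above use clwb/clflushopt, while the sketch
below uses the baseline-SSE2 _mm_clflush (which also evicts the line,
unlike clwb) so it compiles on any x86-64.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/* Cached copy followed by a per-cache-line flush, so each written line
 * occupies at most one associative entry before it is pushed out to
 * persistent media. The sfence orders the flushes before later stores. */
static void copy_and_flush(void *dst, const void *src, size_t len)
{
    uintptr_t p, end;

    memcpy(dst, src, len);

    p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
    end = (uintptr_t)dst + len;
    for (; p < end; p += CACHELINE)
        _mm_clflush((void *)p);
    _mm_sfence();
}
```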
> In user space, glibc faces similar choices for its memcpy() functions;
> glibc memcpy() uses non-temporal stores for transfers > 75% of the
> L3 cache size divided by the number of cores. For example, with
> > glibc-2.216-16.fc27 (August 2017), on a Broadwell system with an
> > E5-2699 (36 cores, 45 MiB L3 cache), non-temporal stores are used
> > for memcpy()s over 36 MiB.
BTW, what does glibc do with reads? Does it flush them from the cache
after they are consumed?
AFAIK glibc doesn't support persistent memory - i.e. there is no function
that flushes data, and the user has to use inline assembly for that.
> It'd be nice if glibc, PMDK, and the kernel used the same algorithms.
Mikulas