[dm-devel] [PATCH 2/2] dm-writecache
Mikulas Patocka
mpatocka at redhat.com
Tue Feb 13 22:00:32 UTC 2018
On Fri, 8 Dec 2017, Dan Williams wrote:
> > > > when we write to
> > > > persistent memory using cached write instructions and use dax_flush
> > > > afterwards to flush cache for the affected range, the performance is about
> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> > > >
> > > > On the other hand, the movnti instruction can sustain performance of one
> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> > > > only thing that must be done is to flush the write-combining buffer with
> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
> > >
> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> > writes to the metadata area. Why not use movnti directly?
> >
>
> The driver performs so many 8-byte moves that the cost of the
> memcpy_flushcache() function call significantly eats into your
> performance?

I've measured it on a Skylake i7-6700, and the dm-writecache driver has 2%
lower throughput when it uses memcpy_flushcache() to update its metadata
instead of explicitly coded "movnti" instructions.

I've created this patch. It doesn't change the API in any way, but it
optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
driver mostly uses). With this patch, I can remove the explicit "asm"
statements from my driver. Would you consider committing this patch to the
kernel?

Mikulas

x86: optimize memcpy_flushcache

I use memcpy_flushcache in my persistent memory driver for metadata
updates and it turns out that the overhead of memcpy_flushcache causes 2%
performance degradation compared to "movnti" instruction explicitly coded
using inline assembler.

This patch recognizes memcpy_flushcache calls with constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.

Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>
---
arch/x86/include/asm/string_64.h | 20 +++++++++++++++++++-
arch/x86/lib/usercopy_64.c | 6 +++---
2 files changed, 22 insertions(+), 4 deletions(-)
Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h 2018-01-31 11:06:19.953577699 -0500
+++ linux-2.6/arch/x86/include/asm/string_64.h 2018-02-13 12:31:06.506810497 -0500
@@ -147,7 +147,25 @@ memcpy_mcsafe(void *dst, const void *src
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
#define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+			return;
+		case 8:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			return;
+		case 16:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+			return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
#endif
#endif /* __KERNEL__ */
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2018-01-31 11:06:19.988577678 -0500
+++ linux-2.6/arch/x86/lib/usercopy_64.c 2018-02-13 11:56:40.249154414 -0500
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, c
return rc;
}
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
{
unsigned long dest = (unsigned long) _dst;
unsigned long source = (unsigned long) _src;
@@ -196,14 +196,14 @@ void memcpy_flushcache(void *_dst, const
clean_cache_range((void *) dest, size);
}
}
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
size_t len)
{
char *from = kmap_atomic(page);
- memcpy_flushcache(to, from + offset, len);
+ __memcpy_flushcache(to, from + offset, len);
kunmap_atomic(from);
}
#endif
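
The dispatch idea in the patch above (an always-inline wrapper that handles a few compile-time-constant lengths itself and otherwise calls the out-of-line routine) can be tried outside the kernel with a sketch along these lines. The names copy_general() and copy_dispatch() are hypothetical, and the fast paths use ordinary fixed-size copies rather than movnti, so this shows only the __builtin_constant_p()/switch structure, not the cache-bypassing behaviour. It assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Out-of-line general path; plays the role of __memcpy_flushcache() here. */
static void __attribute__((noinline))
copy_general(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
}

/*
 * Inline wrapper: if the compiler can prove cnt is a constant 4, 8 or 16,
 * the switch collapses to one or two fixed-size moves at the call site.
 * Any other length, or a length unknown at compile time, takes the
 * general path.
 */
static inline __attribute__((always_inline)) void
copy_dispatch(void *dst, const void *src, size_t cnt)
{
	if (__builtin_constant_p(cnt)) {
		switch (cnt) {
		case 4: {
			uint32_t v;
			memcpy(&v, src, sizeof(v));
			memcpy(dst, &v, sizeof(v));
			return;
		}
		case 8: {
			uint64_t v;
			memcpy(&v, src, sizeof(v));
			memcpy(dst, &v, sizeof(v));
			return;
		}
		case 16: {
			uint64_t lo, hi;
			memcpy(&lo, src, sizeof(lo));
			memcpy(&hi, (const char *)src + 8, sizeof(hi));
			memcpy(dst, &lo, sizeof(lo));
			memcpy((char *)dst + 8, &hi, sizeof(hi));
			return;
		}
		}
	}
	copy_general(dst, src, cnt);
}
```

Because the wrapper keeps the original signature, existing callers need no changes, which is the same property the patch relies on. Without optimization the compiler may not treat cnt as constant, in which case every call simply goes through the general path and the copied data is the same.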