[dm-devel] [PATCH 2/2] dm-writecache

Tue Nov 28 04:01:43 UTC 2017

On Fri, 24 Nov 2017, Dan Williams wrote:

> On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka <mpatocka at redhat.com> wrote:
> > Hi
> >
> > Here I'm sending slighly update dm-writecache target.
> >
> 
> I would expect some theory of operation documentation in the changelog
> or in Documentation/ for a new driver.

It caches writes for a block device in persistent memory. It doesn't cache 
reads because reads are supposed to be cached in normal ram.

> > Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>
> >
> [..]
> > +/*
> > + * The clflushopt instruction is very slow on Broadwell, so we use non-temporal
> > + * stores instead.
> > + */
> 
> The clflushopt instruction's first instantiation in a cpu was after
> Broadwell. Also, the performance of clflushopt does not much matter
> when running on a system without ADR (asynchronous DRAM refresh) where
> flushing the cpu cache can just leave the writes stranded on their way
> to the DIMM. ADR is not available on general purpose servers until
> ACPI 6.x NFIT-defined persistent memory platforms.

There used to be pcommit instruction, but intel abandoned it - see 
https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

So, according to the document, flushing the cache should be enough for 
writes to reach persistent memory.

> > +#ifdef CONFIG_X86_64
> 
> In general something is broken if we end up with per-arch ifdefs like
> this in drivers to handle persistent memory. This should be using the
> pmem api or extensions of it, and we need to settle on a mechanism for
> upper-level drivers to ask if pmem is driving platform protected
> persistent memory.
> 
> > +#define NT_STORE(dest, src)    asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> > +#define COMMIT_FLUSHED()       wmb()
> > +#else
> > +#define NT_STORE(dest, src)    ACCESS_ONCE(dest) = (src)
> > +#define FLUSH_RANGE            dax_flush
> > +#define COMMIT_FLUSHED()       do { } while (0)
> 
> Is this just for test purposes? How does the user discover that they
> are running in a degraded mode as far as persistence guarantees? I
> think we should be falling back DM_WRITECACHE_ONLY_SSD mode if we're
> not on a pmem platform.

What degraded mode do you mean? According to that document, flushing cache 
should be enough. If it is not enough, what else do I need to do to flush 
cache?

The above code is not for test purposes. It is for performance purposes.

On dual-socket Broadwell server with persistent memory, when we write to 
persistent memory using cached write instructions and use dax_flush 
afterwards to flush cache for the affected range, the performance is about 
350MB/s. It is practically unusable - worse than low-end SSDs.

On the other hand, the movnti instruction can sustain performance of one 
8-byte write per clock cycle. We don't have to flush cache afterwards, the 
only thing that must be done is to flush the write-combining buffer with 
the sfence instruction. Movnti has much better throughput than dax_flush.

You argue that I should use the pmem api, but there is no pmem api 
providing the movnti intruction. I have no choice, but coding the 
instruction directly in assembler.

If you want to create API, take that "NT_STORE" definition and move it to 
some x86-specific include file.

Mikulas