[dm-devel] [PATCH 2/2] dm-writecache

Fri Dec 22 22:24:25 UTC 2017

On Fri, Dec 22, 2017 at 12:56 PM, Mikulas Patocka <mpatocka at redhat.com> wrote:
>
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> > writes to the metadata area. Why not use movnti directly?
>> >
>>
>> The driver performs so many 8-byte moves that the cost of the
>> memcpy_flushcache() function call significantly eats into your
>> performance?
>>
>> > - on some architctures, memcpy_flushcache is just an alias for memcpy, so
>> > there will still be some arch-specific ifdefs
>>
>> ...but those should all be hidden by arch code, not in drivers.
>>
>> > - there is no architecture-neutral way how to guarantee ordering between
>> > multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we
>> > don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh),
>> > memcpy_flushcache doesn't - I don't know what are the implications of this
>> > difference) on other architectures, wmb() is insufficient to guarantee
>> > ordering between multiple memcpy_flushcache calls.
>>
>> wmb() should always be sufficient to order stores on all architectures.
>
> No, it isn't. See this example:
>
> uint64_t var_a, var_b;
>
> void fn(void)
> {
>         uint64_t val = 3;
>         memcpy_flushcache(&var_a, &val, 8);
>         wmb();
>         val = 5;
>         memcpy_flushcache(&var_b, &val, 8);;
> }
>
> On x86-64, memcpy_flushcache is implemented using the movnti instruction
> (that writes the value to the write-combining buffer) and wmb() is
> compiled into the sfence instruction (that flushes the write-combining
> buffer) - so it's OK - it is guaranteed that the variable var_a is written
> to persistent memory before the variable var_b;
>
> However, on i686 (and most other architectures), memcpy_flushcache is just
> an alias for memcpy - see this code in include/linux/string.h:
> #ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
> static inline void memcpy_flushcache(void *dst, const void *src, size_t
> cnt)
> {
>         memcpy(dst, src, cnt);
> }
> #endif
>
> So, memcpy_flushcache writes the variable var_a to the cache, wmb()
> flushes the pipeline, but otherwise does nothing (because on i686 all
> writes to the cache are already ordered) and then memcpy_flushcache writes
> the variable var_b to the cache - however, both writes end up in cache and
> wmb() doesn't flush the cache. So wmb() doesn't provide any sort of
> guarantee that var_a is written to persistent memory before var_b. If the
> cache sector containing the variable var_b is more contended than the
> cache sector containg the variable var_a, the CPU will flush cache line
> containing var_b before var_a.

You're overlooking the fact that there is no persistent memory support
for i686. The pmem driver will warn you of this fact:

    dev_warn(dev, "unable to guarantee persistence of writes\n");

The reason persistent memory is not supported on i686 is due to the
fact that older cpus do not guarantee that non-temporal stores always
bypass the cache.

> We have the dax_flush() function that (unlike wmb) guarantees that the
> specified range is flushed from the cache and written to persistent
> memory. But dax_flush is very slow on x86-64.

dax_flush() does not make that guarantee on i686, and it's only slow
on architectures that were not built for persistent memory support
i.e. architectures prior to the arrival of the clwb and clflushopt
instructions.

> So, we have a situation where wmb() is fast, but it only guarantees
> ordering on x86-64 and dax_flush() that guarantees ordering on all
> architectures, but it is slow on x86-64.
>
> So, my driver uses wmb() on x86-64 and dax_flush() on all the others.

...and I'm saying you're hard coding for a cpu that was not built to
support persistent memory efficiently in the first place.

> You argue that the driver shouldn't use any per-architecture #ifdefs, but
> there is no other way how to do it - using dax_flush() on x86-64 kills
> performance and using wmb() on all architectures is unreliable because
> wmb() doesn't guarantee cache flushing at all.
>
>> > At least for Broadwell, the write-back memory type on persistent memory
>> > has so horrible performance that it not really usable.
>>
>> ...and my concern is that you're designing a pmem driver mechanism for
>> Broadwell that predates the clwb instruction for efficient pmem
>> access.
>
> That's what we have for testing. If I get access to some Skylake server, I
> can test if using write combining is faster than cache flushing.

I'm still missing where memcpy_flushcache is unsuitable for your use
case given that i686 does not support reliable peristent memory
access.