[dm-devel] [PATCH] dax: remove the pmem_dax_ops->flush abstraction

Dan Williams dan.j.williams at intel.com
Fri Sep 1 03:34:37 UTC 2017


On Thu, Aug 31, 2017 at 8:11 PM, Dan Williams <dan.j.williams at intel.com> wrote:
> On Thu, Aug 31, 2017 at 7:49 PM, Mikulas Patocka <mpatocka at redhat.com> wrote:
>>
>>
>> On Thu, 31 Aug 2017, Dan Williams wrote:
>>
>>> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka at redhat.com> wrote:
>>> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
>>> > buggy. A dm device may be composed of multiple underlying devices and all
>>> > of them need to be flushed. The patch just routes the flush request to the
>>> > first device and ignores the other devices.
>>> >
>>> > It could be fixed by adding more complex logic to the device mapper. But
>>> > there is only one implementation of the pmem_dax_ops->flush method -
>>> > pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
>>> > don't need the pmem_dax_ops->flush abstraction at all; we can call
>>> > arch_wb_cache_pmem() directly from dax_flush(), because dax_dev->ops->flush
>>> > can never reach anything other than arch_wb_cache_pmem().
>>>
>>> Unfortunately, this is not true; see the usage of DAXDEV_WRITE_CACHE. This
>>> is for platforms that arrange for cpu caches to be flushed on
>>> power-fail, like standalone storage appliances, where it would be a
>>> waste for the kernel to track and flush dirty cachelines for dax.
>>> Theoretically this could be done on a per-address-range basis (think
>>> battery backing a subset of the system memory). I think we need to fix
>>> the routing so that we flush the same dax_device that ->direct_access()
>>> was invoked on.
>>
>> With my patch, dax_flush checks DAXDEV_WRITE_CACHE and doesn't do anything
>> if it's not set - it just doesn't go through the device mapper.
>
> True, but there's no guarantee that arch_wb_cache_pmem() will be sufficient
> going forward, and the reason we routed this to the driver in the first
> place was to allow driver- / platform-specific flush implementations.
>
>>
>>> > It should also be pointed out that some uses of persistent memory need
>>> > to flush only a very small amount of data (such as a single cacheline),
>>> > and it would be overkill to go through the device mapper machinery for a
>>> > single flushed cache line.
>>>
>>> I think this is more an argument to not enable DAX on that
>>> device-mapper topology if operation routing impacts performance. DAX
>>> is meant to get the kernel out of the way most of the time.
>>
>> I don't understand what you mean. I am writing a device mapper target
>> that uses persistent memory as a cache. When flushing the metadata, it is
>> sometimes necessary to flush single cache lines. Ideally, I would use
>> the clwb instruction for this purpose, but clwb is hidden behind a
>> complicated pmem_dax_ops->flush abstraction (and that abstraction will get
>> even more complicated if implemented correctly). How do you think I should
>> write that target without DAX?
>
> I don't think you want DAX for this; I think you want direct control
> of a persistent-memory address range. For example, the brd driver
> supports DAX, but I wouldn't recommend using it as the basis for a
> directly managed cache.
>
> So I think we need to dive deeper into what an in-kernel interface for
> accessing a contiguous persistent memory range looks like. DAX was not
> designed with that use case in mind.

For a cache driver using pmem, a strawman / starting point would be
trying to do a ->direct_access() call for the full capacity of the
device and then:

1/ validate that it returns the full capacity (this precludes drivers
like brd and makes sure the dax range is contiguous)

2/ use the returned pfn to walk the iomem resource tree and validate
that you are talking to persistent memory and not some special
dax-device range.
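
To make that strawman concrete, here is a rough, untested sketch of those
two checks against the 4.13-era DAX API (pmem_cache_attach() is a
hypothetical helper invented for illustration, not part of any existing
driver):

#include <linux/dax.h>
#include <linux/ioport.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

/* Map the whole device via DAX and verify it is one contiguous pmem range */
static int pmem_cache_attach(struct dax_device *dax_dev, size_t size)
{
	long nr_pages = size / PAGE_SIZE;
	void *kaddr;
	pfn_t pfn;
	long avail;
	int id;

	id = dax_read_lock();
	/* 1/ ask for the full capacity in a single ->direct_access() call */
	avail = dax_direct_access(dax_dev, 0, nr_pages, &kaddr, &pfn);
	dax_read_unlock(id);
	if (avail < nr_pages)
		return -EINVAL; /* range not contiguous (e.g. brd) or too small */

	/* 2/ walk the iomem resource tree: the pfn range must be pmem */
	if (region_intersects(PFN_PHYS(pfn_t_to_pfn(pfn)),
			      nr_pages << PAGE_SHIFT, IORESOURCE_MEM,
			      IORES_DESC_PERSISTENT_MEMORY) != REGION_INTERSECTS)
		return -EINVAL; /* some special dax-device range, not pmem */

	/* kaddr now covers the whole device and can back the cache directly */
	return 0;
}

With the range validated this way, the cache target could flush individual
metadata cachelines itself (e.g. via arch_wb_cache_pmem()) instead of going
through dax_flush() and the device-mapper routing discussed above.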



