[dm-devel] Persistent memory interface

Doug Dumitru doug at easyco.com
Mon Jun 22 19:01:42 UTC 2015


... while you are at it, you should consider supporting other binary block
sizes.

Doug

On Mon, Jun 22, 2015 at 9:50 AM, Doug Dumitru <doug at easyco.com> wrote:

> I would like to comment on the BTT docs a bit.  There are some design
> points you might want to consider.
>
> First, real use cases will have no read/write collisions.  If you think of
> a file system, the case of reading a block that is being written, or writing
> a single block twice at once, just doesn't happen, because the resulting
> data would be non-deterministic.  The driver still needs to handle these
> cases, but optimizing for them is not all that logical.
>
> Let's start at the BTT table.  It would be useful if we could distinguish
> between a stable block and one that is getting updated now.  An option is
> to encode the TRIM/ERROR bits as four "states"  (stable, updating, trimmed,
> error) and just use the BTT entry as an index.  This could probably point
> directly to the FLOG entry.  The BTT table, at four bytes, has atomic
> updates without locks, so two threads can simultaneously update it to point
> to the FLOG table and then after the update, see if they won.  If they did
> not win, they can wait for the first update to complete.  The FLOG table
> could also have a parallel RAM based BTT2 table to store spinlocks or
> linked-lists to handle collisions.  Then again, a simple spin or spin/sleep
> is probably good enough.
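
A minimal C11 sketch of the four-state entry and the "update, then see if
you won" step described above. The bit layout (state in the top two bits,
index in the low 30), the names btt_pack, btt_try_claim and btt_commit, and
the use of compare-and-swap in place of a blind store followed by a re-read
are all assumptions made for illustration; none of it is taken from the
actual BTT code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: top 2 bits hold the state, low 30 bits an index. */
enum btt_state {
    BTT_STABLE   = 0,  /* index is a data block */
    BTT_UPDATING = 1,  /* index is a FLOG slot; a write is in flight */
    BTT_TRIMMED  = 2,
    BTT_ERROR    = 3
};

#define BTT_STATE_SHIFT 30
#define BTT_INDEX_MASK  ((UINT32_C(1) << BTT_STATE_SHIFT) - 1)

static inline uint32_t btt_pack(enum btt_state s, uint32_t index)
{
    return ((uint32_t)s << BTT_STATE_SHIFT) | (index & BTT_INDEX_MASK);
}

static inline enum btt_state btt_state_of(uint32_t e)
{
    return (enum btt_state)(e >> BTT_STATE_SHIFT);
}

static inline uint32_t btt_index_of(uint32_t e)
{
    return e & BTT_INDEX_MASK;
}

/*
 * Try to claim an entry for writing by pointing it at a FLOG slot.
 * Returns true if this caller won; false means another writer holds
 * the entry (or raced us), and the caller should wait and retry.
 */
static bool btt_try_claim(_Atomic uint32_t *entry, uint32_t flog_slot)
{
    uint32_t old = atomic_load(entry);

    if (btt_state_of(old) == BTT_UPDATING)
        return false;
    return atomic_compare_exchange_strong(entry, &old,
                                          btt_pack(BTT_UPDATING, flog_slot));
}

/* Publish the new data block and return the entry to the stable state. */
static void btt_commit(_Atomic uint32_t *entry, uint32_t new_block)
{
    atomic_store(entry, btt_pack(BTT_STABLE, new_block));
}
```

A loser of the race would simply spin or sleep until btt_state_of() no
longer reports BTT_UPDATING and then call btt_try_claim() again.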
>
> The same works for readers.  If you read a block, check the BTT table
> after you finish the read.  If it is the same, your read was good.  If it
> changed underneath you, or is pointing to a FLOG block, then you need to
> wait or re-read.  Again, the real-world frequency of collisions is very
> low.  This would let you eliminate the RTT table entirely.
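
The read-then-recheck idea can be sketched in C11 roughly as follows. The
entry layout (top two bits are a state, 1 meaning "updating"), the 512-byte
block size, and the function name read_block are assumptions for
illustration only. Trimmed and error entries are not handled here, and the
sketch assumes a freed block is not reused while a reader may still be
copying from it.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE     512u
#define STATE_SHIFT    30
#define STATE_UPDATING 1u
#define INDEX_MASK     ((UINT32_C(1) << STATE_SHIFT) - 1)

static int entry_is_updating(uint32_t e)
{
    return (e >> STATE_SHIFT) == STATE_UPDATING;
}

/*
 * Copy one block out of `media` into `dst`, retrying until the map
 * entry is identical before and after the copy.
 */
static void read_block(_Atomic uint32_t *entry, const uint8_t *media,
                       uint8_t *dst)
{
    for (;;) {
        uint32_t before = atomic_load(entry);

        if (entry_is_updating(before))
            continue;               /* a writer is in flight; spin */

        memcpy(dst, media + (size_t)(before & INDEX_MASK) * BLOCK_SIZE,
               BLOCK_SIZE);

        if (atomic_load(entry) == before)
            return;                 /* entry unchanged: the copy is good */
        /* entry changed underneath us: re-read */
    }
}
```

In the common, collision-free case this costs one extra load of the entry
per read and needs no per-reader tracking.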
>
> One final optimization would be to keep the BTT table both in standard RAM
> as well as in NV RAM.  If standard RAM is faster, then reads could look up
> blocks without touching the NV driver.  For 512G with 512-byte blocks, this
> is 1B blocks or 4G of RAM.  Then again, if the NV RAM is just as fast, this
> would not help.
> Perhaps an option.
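
The RAM cost of such a shadow copy is simple arithmetic: one four-byte
entry per block. A small C helper (the name btt_map_bytes is made up for
this example) makes it easy to check for different capacities and block
sizes.

```c
#include <stdint.h>

/* Bytes of RAM needed to shadow a map with one 4-byte entry per block. */
static uint64_t btt_map_bytes(uint64_t capacity_bytes, uint32_t block_size)
{
    return (capacity_bytes / block_size) * sizeof(uint32_t);
}
```

For 512 GiB of capacity, btt_map_bytes(512ULL << 30, 512) gives 4 GiB,
while the same capacity with 4096-byte blocks needs 512 MiB.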
>
> I have gotten into a lot of trouble optimizing for fio collisions when
> these collisions don't really impact real-workload performance.  The code
> has to be "correct" in the collision case, but it does not really need to
> be fast.
>
> Doug Dumitru
> EasyCo LLC
>
>
>
> On Fri, Jun 19, 2015 at 9:50 AM, Verma, Vishal L
> <vishal.l.verma at intel.com> wrote:
>
>> On Fri, 2015-06-19 at 12:33 -0400, Mikulas Patocka wrote:
>> > Hi
>> >
>> > I looked at the new persistent memory block device driver
>> > (drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
>> > interface between them is incorrect.
>> >
>> > If I want to use persistent memory in another driver, for a different
>> > purpose, how can I make sure that drivers/block/pmem.c doesn't attach
>> > to this piece of memory and export it? It seems not possible.
>> > drivers/block/pmem.c attaches to everything without regard to the fact
>> > that there may be other users of persistent memory.
>> >
>> > I think a correct solution would be to add a partition table at the
>> > beginning of the persistent memory area and this partition table would
>> > describe which parts belong to which programs - so that different
>> > programs could use persistent memory and not step over each other's
>> > data. Is there some effort to standardize the partition table ongoing?
>> >
>> >
>> > BTW, some journaling filesystems assume that a 512-byte sector is
>> > written atomically. drivers/block/pmem.c breaks this requirement.
>> > Persistent memory only guarantees 8-byte atomic writes.
>>
>> Hi Mikulas,
>>
>> I can answer this part - The idea is that file systems that need sector
>> atomicity will use the "Block Translation Table" (BTT) [1]. It would be
>> a stacked block device on top of a pmem device (or partition), and file
>> systems would be able to use it either for the entire space to get
>> atomicity for all blocks, or if they want to use DAX, make two
>> partitions, and enable the BTT only on one partition, and use it as the
>> logdev.
>>
>>         -Vishal
>>
>> [1]: https://lkml.org/lkml/2015/6/17/950
>>
>> >
>> > Mikulas
>>
>
>
>
> --
> Doug Dumitru
> EasyCo LLC
>



-- 
Doug Dumitru
EasyCo LLC