[dm-devel] [PATCH] reworked dm-switch target

Mikulas Patocka mpatocka at redhat.com
Wed Aug 22 01:02:35 UTC 2012



On Tue, 21 Aug 2012, Jim Ramsay wrote:

> On Mon, Aug 20, 2012 at 03:20:42PM -0400, Mikulas Patocka wrote:
> > On Fri, 17 Aug 2012, Jim_Ramsay at DELL.com wrote:
> > > 1) Uploading large page tables
> <snip>
> > > Assuming a fairly well-distributed layout of 1572864 pages where 50% of 
> > > the pages are different every other page, 20% are different every 2 pages, 
> > > 10% every 5 pages, 10% every 10 pages, and 10% every 20 pages, this would 
> > > leave us with a dmsetup message with argc=998768
> > > 
> > >   dmsetup message switch 0 set-table 0-0:1 1-1:0 2-2:2 3-3:1 4-4:0 5-5:2 6-6:0 7-8:1 9-15:2 16-16:1 ... (plus almost 1000000 more arguments...)
> > 
> > You don't have to use the dash, you can send:
> > dmsetup message switch 0 set-table 0:1 1:0 2:2 3:1 4:0 ... etc.
> > 
> > You don't have to send the whole table at once in one message. Using 
> > message with 998768 arguments is bad (it can trigger allocation failures 
> > in the kernel).
> > 
> > But you can split the initial table load into several messages, each 
> > having up to 4096 bytes, so that it fits into a single page.
> 
> Even removing the '-' for single-page sets, you're looking at having to
> send a minimum of 4 bytes per page (and as the page index grows, it takes
> many more bytes to represent each entry), which means that each 4096-byte
> run would hold maybe 1000 page table entries at most.
> 
> This would mean that to upload an entire page table for my example
> volume, we would have to run 'dmsetup message ...' almost 1000 times.
> 
> I'm sure we can come up with other syntactical shortcuts like those
> Alasdair came up with, but encoding into any ASCII format will always be
> less space-efficient than a pure binary transfer.

I converted the format to use hexadecimal numbers (they are faster to 
produce and faster to parse) and added an option to omit the page number 
(in which case the previous page plus one is used) - and it takes 0.05s 
to load a table with one million entries on a 2.3GHz Opteron.

The table is loaded with 67 dm message calls, each about 45000 bytes long 
(the number 45000 was experimentally found to be near the optimum).

So I don't think there are performance problems with this.

I'll send you the program that updates the table with messages.
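
A loader along these lines could look roughly like this (an illustrative 
sketch only, not the program mentioned above; the ":path" shorthand, the 
device name "switch" and the dummy page-to-path mapping are assumptions):

/* illustrative loader sketch - builds "set-table" messages in hex,
 * omits the page number when it is the previous page plus one, and
 * keeps each message under a ~45000-byte limit */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_LIMIT 45000

static char msg[MSG_LIMIT + 64];
static size_t msg_len;

static void flush(void)
{
	static char cmd[sizeof(msg) + 64];

	if (!msg_len)
		return;
	snprintf(cmd, sizeof(cmd), "dmsetup message switch 0 set-table%s", msg);
	if (system(cmd) != 0)
		exit(1);
	msg_len = 0;
	msg[0] = '\0';
}

int main(void)
{
	unsigned long npages = 1572864;		/* table size from the example above */
	unsigned long page, prev = 0;

	for (page = 0; page < npages; page++) {
		unsigned path = page % 3;	/* dummy page-to-path mapping */
		char entry[64];
		int n;

		/* format the worst case (explicit page number) first */
		n = snprintf(entry, sizeof(entry), " %lx:%x", page, path);
		if (msg_len + n >= MSG_LIMIT)
			flush();

		/* omit the page number when it follows the previous page;
		 * the first entry of each message stays explicit */
		if (msg_len && page == prev + 1)
			n = snprintf(entry, sizeof(entry), " :%x", path);

		memcpy(msg + msg_len, entry, n + 1);
		msg_len += n;
		prev = page;
	}
	flush();
	return 0;
}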

> > > Perhaps we can work with you on designing alternate non-netlink mechanism 
> > > to achieve the same goal... A sysfs file per DM device for userland 
> > > processes to do direct I/O with?  Base64-encoding larger chunks of the 
> > > binary page tables and passing those values through 'dmsetup message'?
> > 
> > As I said, you don't have to upload the whole table with one message ... 
> > or if you really need to update the whole table at once, explain why.
> 
> At the very least, we would need to update the whole page table in the
> following scenarios:
> 
>   1) When we first learn the geometry of the volume
> 
>   2) When the volume layout changes significantly (for example, if it was
>      previously represented by 2 devices and is then later moved onto 3
>      devices, or the underlying LUN is resized)
> 
>   3) When the protocol used to fetch the data can fetch segments of the
>      page table in a dense binary format, it is considerably more work
>      for a userland process to keep its own persistent copy of the
>      page table, compare a new version with the old version, calculate
>      the differences, and send only those differences.  It is much
>      simpler to have a binary conduit to upload the entire table at
>      once, provided it does not occur too frequently.

But you don't have to upload the table at once - you can update it 
incrementally with several dm messages.
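
For example (using the same illustrative syntax as above), two halves of 
the table from your example can be set with two independent calls:

  dmsetup message switch 0 set-table 0:1 :0 :2 ...
  dmsetup message switch 0 set-table c0000:1 :0 :2 ...

and entries that did not change need not be resent at all.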

> Furthermore, if a userland process already has an internal binary
> representation of a page map, what is the value in converting this into
> a complicated human-readable ASCII representation, then having the kernel
> do the reverse conversion when it receives the data?

The reason is simplicity - the dm message code is noticeably smaller than 
the netlink code. It is also less bug-prone because no structures are 
allocated or freed there.

> > > 2) vmalloc and TLB performance
> <snip>
> 
> > The original code uses a simple kmalloc to allocate the whole table.
> > 
> > The maximum size allocatable with kmalloc is 4MB.
> > 
> > The minimum vmalloc arena is 128MB (on x86) - so the switch from kmalloc 
> > to vmalloc makes it no worse.
> > 
> > > On SMP systems, the page table changes required by
> > > vmalloc() allocations can require expensive cross-processor interrupts on
> > > all CPUs.
> > 
> > vmalloc is used only once when the target is loaded, so performance is not 
> > an issue here.
> 
> The table would also have to be reallocated on LUN resize or if the data
> is moved across a different number of devices (provided the change alters
> the number of bits per page), such as going from a 2-device setup
> represented by 1 bit per page to a 3-device setup represented by 2 bits
> per page.
> 
> Granted these are not frequent operations, but we need to continue to
> properly handle these cases.
>
> We also need to keep the multiple device scenario in mind (perhaps 100s of
> targets in use or being created simultaneously).

For these operations (resizing the device or changing the number of 
underlying devices), you can load a new table, suspend the device and 
resume it. It will switch to the new table and destroy the old one.

You have to reload the table anyway when you change the device size, so 
there is no need to include code for resizing the table in the target 
driver.
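
For example (the switch constructor parameters are elided here; this is 
just the usual reload sequence):

  dmsetup reload switch --table "0 <new_size_in_sectors> switch <new parameters> ..."
  dmsetup suspend switch
  dmsetup resume switch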

> > > And, on all systems, use of space in the vmalloc() range
> > > increases pressure on the translation lookaside buffer (TLB), reducing the
> > > performance of the system."
> > > 
> > > The page table lookup is in the I/O path, so performance is an important 
> > > consideration.  Do you have any performance comparisons between our 
> > > existing 2-level lookup of kmalloc'd memory versus a single vmalloc'd 
> > 
> > There was just 1-level lookup in the original dm-switch patch. Did you add 
> > 2-level lookup recently?
> 
> In October 2011 I posted a 'v3' version of our driver that did this
> 2-stage lookup to the dm-devel list:
> 
> http://www.redhat.com/archives/dm-devel/2011-October/msg00109.html
> 
> The main consideration was to avoid single large kmalloc allocations,
> but to also support sparse allocations in the future.
> 
> > > memory lookup?  Multiple devices of similarly large table size may be in 
> > > use simultaneously, so this needs consideration as well.
> > > 
> > > Also, in the example above with 1572864 page table entries, assuming 2 
> > > bits per entry requires a table of 384KB.  Would this be a problem for the 
> > > vmalloc system, especially on 32-bit systems, if there are multiple 
> > > devices of similarly large size in use at the same time?
> > 
> > 384KB is not a problem, the whole vmalloc space has 128MB.
> 
> This means we could allow ~375 similarly-sized devices in the system,
> assuming no other kernel objects are consuming any vmalloc space.  This
> could be okay, provided our performance considerations are also
> addressed, but allowing sparse allocation may be a good enough reason
> to use a 2-level allocation scheme.
> 
> > > It can also be desirable to allow sparsely-populated page tables, when it 
> > > is known that large chunks are not needed or deemed (by external logic) 
> > > not important enough to consume kernel memory.  A 2-level kmalloc'd memory 
> > > scheme can save memory in sparsely-allocated situations.
> 
> This ability to do sparse allocations may be important depending on what
> else is going on in the kernel and using vmalloc space.

It may be possible to use a radix tree and do sparse allocations, but given 
the current usage (tables with a million entries, each entry taking a few 
bits), it doesn't seem to be a problem now.
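
If it ever does become a problem, a sparse layout could look roughly like 
this (an untested sketch only; the chunk size, the 2 bits per entry and 
the default path are arbitrary assumptions):

#include <linux/radix_tree.h>
#include <linux/slab.h>
#include <linux/bitops.h>

#define CHUNK_PAGES	4096	/* pages per allocated chunk (arbitrary) */
#define BITS_PER_PAGE	2	/* enough for up to 4 underlying devices */

struct switch_chunk {
	unsigned long map[CHUNK_PAGES * BITS_PER_PAGE / BITS_PER_LONG];
};

/* look up the path number for a page; unpopulated chunks default to path 0 */
static unsigned switch_get_path(struct radix_tree_root *root, unsigned long page)
{
	struct switch_chunk *c = radix_tree_lookup(root, page / CHUNK_PAGES);
	unsigned long bit = (page % CHUNK_PAGES) * BITS_PER_PAGE;

	if (!c)
		return 0;
	return (c->map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) &
		((1 << BITS_PER_PAGE) - 1);
}

/* set the path number for a page, allocating the chunk on first use */
static int switch_set_path(struct radix_tree_root *root, unsigned long page,
			   unsigned path)
{
	struct switch_chunk *c = radix_tree_lookup(root, page / CHUNK_PAGES);
	unsigned long bit = (page % CHUNK_PAGES) * BITS_PER_PAGE;
	int r;

	if (!c) {
		c = kzalloc(sizeof(*c), GFP_KERNEL);
		if (!c)
			return -ENOMEM;
		r = radix_tree_insert(root, page / CHUNK_PAGES, c);
		if (r) {
			kfree(c);
			return r;
		}
	}
	c->map[bit / BITS_PER_LONG] &=
		~(((unsigned long)((1 << BITS_PER_PAGE) - 1)) << (bit % BITS_PER_LONG));
	c->map[bit / BITS_PER_LONG] |= (unsigned long)path << (bit % BITS_PER_LONG);
	return 0;
}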

> Thanks for your comments, and I do hope to send our 'v4' driver code as
> well as a demonstration application with the netlink socket interface to
> this list in the very near future.
> 
> -- 
> Jim Ramsay

Mikulas



