[dm-devel] dm-userspace (no in-kernel cache version)

FUJITA Tomonori fujita.tomonori at lab.ntt.co.jp
Wed Sep 13 02:01:31 UTC 2006


From: Dan Smith <danms at us.ibm.com>
Subject: Re: [dm-devel] dm-userspace (no in-kernel cache version)
Date: Tue, 12 Sep 2006 14:50:02 -0700

> FT> As explained, this removes rmap (in-kernel cache) and use mmaped
> FT> buffer instead of read/write system calls for user/kernel
> FT> communication.
> 
> Ok, I got your code to work, and I have run some benchmarks.  I'll cut
> directly to the chase...
> 
> I used dbench with a single process, for 120 seconds on a dm-userspace
> device mapping directly to an LVM device.  I used my example.c and the
> example-rb.c provided with the ringbuffer version.  The results are:
> 
>   with cache, chardev:  251 MB/s
>   no cache, ringbuffer: 243 MB/s

Thanks. Looks very nice.


> I am very pleased with these results.  I assume that your code is not
> tuned for performance yet, which means we should be able to squeeze at
> least 8 MB/s more out to make it equal (or better).  Even still, the
> amount of code it saves is worth the hit, IMHO.

Yeah.


> I do have a couple of comments:
> 
> 1. You said that the ringbuffer saves the need for syscalls on each
>    batch read.  This is partially true, but you still use a write() to
>    signal completion so that the kernel will read the u->k ringbuffer.

Right. In practice, the user-space daemon needs some way to notify the
kernel of new events.


>    So, at best, the number of syscalls made is half of my
>    read()/write() method.  I think it's possible that another
>    signaling mechanism could be used, which would eliminate this call.

Yeah. There are other possible notification mechanisms; I just chose an
easy one.
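To make the idea concrete, here is a minimal user-space sketch of one way
to cut the write() doorbell down: only issue the syscall when the kernel
has flagged that it went to sleep, and otherwise just advance a shared
tail index in the mmap'ed area. The structure layout, field names, and
push_response() are purely illustrative, not the real dm-userspace ABI.

    #include <stdint.h>
    #include <unistd.h>

    /* Placeholder u->k message; the real dm-userspace structure differs. */
    struct dmu_msg {
            uint64_t id;
            uint64_t block;
            uint32_t flags;
    };

    /* Hypothetical header at the start of the mmap'ed u->k ring. */
    struct ring_header {
            volatile uint32_t head;           /* consumer (kernel) index      */
            volatile uint32_t tail;           /* producer (user) index        */
            volatile uint32_t kernel_waiting; /* set by kernel before it sleeps */
    };

    /* Push one response into the shared ring; fall back to a write()
     * doorbell only when the kernel is not already polling the ring. */
    static void push_response(int ctl_fd, struct ring_header *hdr,
                              struct dmu_msg *slots, unsigned int nr_slots,
                              const struct dmu_msg *msg)
    {
            uint32_t tail = hdr->tail;

            slots[tail % nr_slots] = *msg;

            /* Make the payload visible before the index update. */
            __sync_synchronize();
            hdr->tail = tail + 1;

            if (hdr->kernel_waiting)
                    write(ctl_fd, "", 1);
    }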


>    I do think eliminating the copying with the ringbuffer approach is
>    very nice; I like it a lot.
> 
> 2. I was unable to get your code to perform well with multiple threads
>    of dbench.  While my code sustains performance with 16 threads, the
>    non-cache/ringbuffer version slows to a crawl (~1MB/s with 16
>    procs).  I noticed that the request list grows to over 100,000
>    entries at times, which means that the response from userspace
>    requires searching that linearly, which may be the issue.

Right, we need to replace the request list with a hash table.
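
A rough kernel-side sketch of what that could look like, assuming a
dmu_request keyed by a 64-bit id; the structure, lock, and table size
below are illustrative guesses, not the existing dm-userspace code:

    #include <linux/hash.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    #define DMU_REQ_HASH_BITS   10
    #define DMU_REQ_HASH_SIZE   (1 << DMU_REQ_HASH_BITS)

    /* Illustrative request structure; the real one carries more state. */
    struct dmu_request {
            u64 id;
            struct hlist_node hash_node;
            /* ... bio, flags, etc. ... */
    };

    static struct hlist_head dmu_req_hash[DMU_REQ_HASH_SIZE];
    static DEFINE_SPINLOCK(dmu_req_lock);

    static void dmu_req_add(struct dmu_request *req)
    {
            unsigned long flags;

            spin_lock_irqsave(&dmu_req_lock, flags);
            hlist_add_head(&req->hash_node,
                           &dmu_req_hash[hash_long((unsigned long)req->id,
                                                   DMU_REQ_HASH_BITS)]);
            spin_unlock_irqrestore(&dmu_req_lock, flags);
    }

    /* Look up (and unlink) the request a user-space response refers to. */
    static struct dmu_request *dmu_req_find(u64 id)
    {
            struct dmu_request *req;
            struct hlist_node *n;
            struct hlist_head *head;
            unsigned long flags;

            spin_lock_irqsave(&dmu_req_lock, flags);
            head = &dmu_req_hash[hash_long((unsigned long)id,
                                           DMU_REQ_HASH_BITS)];
            hlist_for_each_entry(req, n, head, hash_node) {
                    if (req->id == id) {
                            hlist_del(&req->hash_node);
                            spin_unlock_irqrestore(&dmu_req_lock, flags);
                            return req;
                    }
            }
            spin_unlock_irqrestore(&dmu_req_lock, flags);
            return NULL;
    }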

Another possible improvement is simplifying dmu_ctl_write() by using
kernel threads. Right now the user-space daemon calls dmu_ctl_write()
and then does lots of work in kernel mode. On SMP boxes it would be
better for the user-space daemon to just notify the kernel of new
events, return to user space, and go back to receiving new events from
the kernel. I'd like to create kernel threads so that dmu_ctl_write()
just wakes them up and they call dmu_event_recv().
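
Something along these lines, perhaps; dmu_eventd, the pending flag, and
the approximated dmu_event_recv() signature are all just a sketch of the
idea, not working code from the driver:

    #include <linux/kthread.h>
    #include <linux/wait.h>
    #include <linux/fs.h>
    #include <linux/sched.h>

    /* Existing dm-userspace helper that drains the u->k ring; its real
     * signature may differ from this guess. */
    static void dmu_event_recv(void *dev);

    static DECLARE_WAIT_QUEUE_HEAD(dmu_event_wq);
    static atomic_t dmu_pending = ATOMIC_INIT(0);

    /* Worker thread: processes u->k events outside of dmu_ctl_write(). */
    static int dmu_eventd(void *data)
    {
            while (!kthread_should_stop()) {
                    wait_event_interruptible(dmu_event_wq,
                                             atomic_read(&dmu_pending) ||
                                             kthread_should_stop());
                    if (kthread_should_stop())
                            break;
                    atomic_set(&dmu_pending, 0);
                    dmu_event_recv(data);
            }
            return 0;
    }

    /* dmu_ctl_write() then shrinks to little more than a wakeup; the
     * actual events are read from the mmap'ed ring, not from buf. */
    static ssize_t dmu_ctl_write(struct file *file, const char __user *buf,
                                 size_t count, loff_t *ppos)
    {
            atomic_set(&dmu_pending, 1);
            wake_up(&dmu_event_wq);
            return count;
    }

    /* started from the constructor with something like:
     *     task = kthread_run(dmu_eventd, dev, "dmu_eventd");
     */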


> I am going to further study your changes, but I think in the end that
> I will incorporate most or all of them.  Some work will need to be
> done to incorporate support for some of the newer features (endio, for
> example), but I'll start looking into that.

Yep, I dropped some of the features out of laziness, though if endio
means that the kernel notifies user space of I/O completion, I think I
implemented it.

One possible feature is support for multiple destinations. If user
space can tell the kernel to write to multiple devices, we could
implement RAID-like daemons in user space.
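
Purely as an illustration of what the interface might need, the u->k
response could carry an array of destinations; nothing below is the
current dm-userspace message format:

    #include <stdint.h>

    #define DMU_MAX_DESTS 4

    /* One target device+offset pair for a remapped request. */
    struct dmu_dest {
            uint32_t dev_id;   /* index of a device registered by the daemon */
            uint64_t offset;   /* sector offset on that device               */
    };

    /* Hypothetical response that fans a write out to several devices,
     * which is what a user-space RAID-1 style daemon would need. */
    struct dmu_map_response {
            uint64_t id;                          /* matches the k->u request */
            uint32_t nr_dests;                    /* 1 for plain remapping    */
            struct dmu_dest dests[DMU_MAX_DESTS];
    };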



