[dm-devel] Newbie device mapper questions

Tue Jun 16 21:05:48 UTC 2015

Johannes,

I was not trying to scare you, just tell you the rough path.  I have been
exactly where you are.

On Tue, Jun 16, 2015 at 11:54 AM, Johannes Bauer <dfnsonfsduifb at gmx.de>
wrote:

> On 15.06.2015 21:52, Doug Dumitru wrote:
>
> >> Sounds pretty easy and I also got surprisingly far with my little kernel
> >> module. I've so far implemented ctr, dtr, map and status.
> >
> > Congratulations, you are actually a long way there.
>
> Thanks but I think I have the mountain still ahead -- still, I would
> really like to figure out the nitty-gritty.
>
> > You have to allocate a bio, populate it, allocate pages for buffer,
> > populate the bvec, and call make_request (or generic make request).  You
> > will get the completion from the bio on the bottom half of the interrupt
> > handler, so how much work you can do there is debatable.  You cannot
> > start a new IO from there, which you need to.  You will probably want to
> > start a helper thread and have the completion routine schedule itself
> > onto your thread.  Once you are back on your thread, you can do just
> > about anything.
> >
> > Because you need to do IO, you will not be able to do a simple bio
> > "bounce redirect".  You will need to do the IO yourself (i.e., call
> > another make request), but you can use the caller's bvec for this, so
> > there is no data copy required.  Once the request completes, you can
> > then finish the caller.
>
> Oh, wow. This sounds truly terrifying. Let's dive in!
>
> I tried to read your hints one word at a time. So here's the somewhat
> pseudocodish solution to my homework:
>
> struct bio *b = bio_alloc(GFP_NOIO, 1);
> b->bi_size = 8;
> bio_alloc_pages(b, GFP_NOIO);
> b->bi_sector = 1234;
> b->bi_bdev = lc->metadev->bdev;
> b->bi_rw = READ;
> b->bi_private = local_ctx;
> b->bi_end_io = read_complete_callback;
> generic_make_request(bi);
>

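bi_size is in bytes, not sectors, so one 4096-byte page is 4096, not 8.

The biovec count passed to bio_alloc() is in pages.

You will need to allocate local_ctx (it cannot be on the stack, because the
completion may run after the submitting function has returned).  You
probably need to allocate a structure in the .ctr routine that is your
"device context".  Each "operation" then gets its own allocation that
points back to the device context.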
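
To make the two points above concrete (sizes in bytes versus counts in
pages, and a heap-allocated per-operation context that points back to the
device context), here is a minimal userspace sketch.  It is not kernel
code: struct dev_ctx, struct op_ctx, pages_for_bytes(), op_ctx_new() and
op_ctx_free() are names made up for this example, malloc() stands in for
a kernel allocator, and a 4096-byte page is assumed.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Hypothetical device context: one per target, created in .ctr */
struct dev_ctx {
    const char *name;
    unsigned long ops_in_flight;
};

/* Hypothetical per-operation context: one per IO, never on the stack */
struct op_ctx {
    struct dev_ctx *dev;     /* points back to the device context */
    size_t bytes;            /* request size, in bytes */
    unsigned int nr_pages;   /* page-sized vectors needed for it */
};

/* Round a byte count up to a whole number of pages */
static unsigned int pages_for_bytes(size_t bytes)
{
    return (unsigned int)((bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Allocate an operation context and link it to its device */
static struct op_ctx *op_ctx_new(struct dev_ctx *dev, size_t bytes)
{
    struct op_ctx *op = malloc(sizeof(*op));

    if (!op)
        return NULL;
    op->dev = dev;
    op->bytes = bytes;
    op->nr_pages = pages_for_bytes(bytes);
    dev->ops_in_flight++;
    return op;
}

/* Release an operation context once its IO has completed */
static void op_ctx_free(struct op_ctx *op)
{
    op->dev->ops_in_flight--;
    free(op);
}
```

In a real target the pointer returned by something like op_ctx_new() is
what you would store in bi_private before submitting the bio.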

>
> static void read_complete_callback(struct bio *b, int error) {
>   // ???
>   printk(KERN_INFO "First read byte: %02x\n",
>      b->bi_io_vec[0]->bv_page[0]);
> }
>

Here you usually do:

Q_WORK *q = b->bi_private;
DEV *dev = q->dev;

to get your context back.
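
The same idea as a small userspace model, in case it helps to see the
whole round trip.  struct fake_bio and fake_complete() are stand-ins
invented for this sketch (a real struct bio has many more fields), and
the fields inside Q_WORK and DEV are assumptions; only the bi_private
and bi_end_io usage mirrors the discussion above.

```c
#include <stddef.h>

/* Hypothetical device and per-operation types for this sketch */
typedef struct dev_s {
    int completed;           /* completions seen so far */
    int errors;              /* completions that reported an error */
} DEV;

typedef struct q_work_s {
    DEV *dev;                /* back pointer to the device context */
    int error;               /* result of this operation */
    struct q_work_s *nxt;    /* link for a work queue */
} Q_WORK;

/* Stand-in for struct bio: only the two fields this sketch uses */
struct fake_bio {
    void *bi_private;
    void (*bi_end_io)(struct fake_bio *b, int error);
};

/* Completion callback: recover the contexts, record the result */
static void read_complete_callback(struct fake_bio *b, int error)
{
    Q_WORK *q = b->bi_private;
    DEV *dev = q->dev;

    q->error = error;
    dev->completed++;
    if (error)
        dev->errors++;
}

/* Stand-in for the block layer finishing the IO */
static void fake_complete(struct fake_bio *b, int error)
{
    b->bi_end_io(b, error);
}
```

Note that the callback only records the result.  Anything heavier,
including starting the next IO, belongs on the helper thread.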

>
> So I hope this is even remotely close to what I should end up with.
>
> This will alloc a new bio with, as I understand it, one page buffer in
> b->bi_io_vec. This buffer is then allocated with bio_alloc_pages to 8
> sectors in size (i.e. exactly one page of 4096 bytes). Then the read
> address, block device and read mode is set. I pass some kind of local
> context so I can do something meaningful in the callback and specify the
> callback function. Then I execute the request.
>
> As I understand, this executes asynchronously. So here comes the
> threading into play, right? Just pseudocode (because I can't judge how
> far I'm off here), but let's say this is map():
>
> void read_complete_callback() {
>     semaphore_inc(local_ctx);
> }
>
> void map() {
>    local_ctx->semaphore->value = 0;
>
>    // Issue read as above
>    generic_make_request(bi);
>
>    semaphore_dec(&local_ctx->semaphore);
>
>    // Now the concurrent async IO has finished and we interpret the data
>    [...]
> }
>

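It is more like this (really pseudo code):

int thread_helper(void *thread_param)
{
  DEV *dev = thread_param;
  Q_WORK *q;
  unsigned long flags;

  while ( 1 ) {
    if ( dev->shutdown_flg ) break;
    spin_lock_irqsave(&dev->workqueue_slock, flags);
    if ( dev->workqueue_head ) {
      /* pop the first item while holding the lock */
      q = dev->workqueue_head;
      dev->workqueue_head = q->nxt;
      if ( !dev->workqueue_head ) dev->workqueue_tail = NULL;
      spin_unlock_irqrestore(&dev->workqueue_slock, flags);
      ... process q work
      continue;
    }
    spin_unlock_irqrestore(&dev->workqueue_slock, flags);
    down(&dev->workqueue_sem);   /* sleep until more work is queued */
  }
  return 0;
}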

You have to start the thread and set up a semaphore and a spinlock.  A wait
queue is better, but semaphores do work.

When you want to schedule work in the background, you add your new "q" item
to the head/tail singly linked list under the spinlock and then up() the
semaphore to wake the thread.  A doubly linked list is fine and easier to
program, but overkill.
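
For reference, a head/tail singly linked list only needs two small
routines.  This is a plain C sketch with no locking; struct work_item,
struct work_queue, wq_push_tail() and wq_pop_head() are illustrative
names, and in a module both calls would be made with the spinlock held.

```c
#include <stddef.h>

/* Hypothetical work item; "nxt" links items in the queue */
struct work_item {
    int id;
    struct work_item *nxt;
};

/* Head/tail singly linked queue: add at the tail, remove at the head */
struct work_queue {
    struct work_item *head;
    struct work_item *tail;
};

/* Append an item; the caller is assumed to hold the queue lock */
static void wq_push_tail(struct work_queue *wq, struct work_item *q)
{
    q->nxt = NULL;
    if (wq->tail)
        wq->tail->nxt = q;
    else
        wq->head = q;        /* queue was empty */
    wq->tail = q;
}

/* Remove and return the oldest item, or NULL if the queue is empty */
static struct work_item *wq_pop_head(struct work_queue *wq)
{
    struct work_item *q = wq->head;

    if (!q)
        return NULL;
    wq->head = q->nxt;
    if (!wq->head)
        wq->tail = NULL;     /* last item removed */
    q->nxt = NULL;
    return q;
}
```

Adding at the tail and removing at the head keeps the work in
submission order.  The one detail that is easy to miss is clearing the
tail when the last item is removed.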

>
> Oh boy I really don't know if this is even remotely close. Any hints, as
> easy as they may seem to you guys, are really greatly appreciated. I've
> never worked with this stuff.
>

Start by creating a thread in module load and destroying it in module
unload.  You can use statics as the DEV.  You should use atomics as "thread
counters": when a thread starts, it increments the "running thread count",
and when it exits, it decrements the counter.  This way, the module unload
routine can set "dev->shutdown_flg", do a bunch of up(...) calls to wake
up the threads, and then wait for the threads to exit by watching the
counter.  Put a short sleep in that wait loop so that polling for the
exits does not hog the CPU.
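
Here is that start/stop handshake as a userspace sketch, assuming POSIX
threads and semaphores in place of kernel threads and up()/down().  The
names worker(), short_sleep() and start_and_stop_workers() are made up
for the example; the counter, flag and semaphore play the roles
described above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_WORKERS 8

static atomic_int running_threads;   /* the "running thread count" */
static atomic_int shutdown_flg;      /* set once by the unload path */
static sem_t workqueue_sem;          /* workers sleep on this */

/* Sleep about 1 ms so the wait loops do not spin */
static void short_sleep(void)
{
    struct timespec ts = { 0, 1000000 };
    nanosleep(&ts, NULL);
}

/* Worker: count in, sleep until told to stop, count out */
static void *worker(void *arg)
{
    (void)arg;
    atomic_fetch_add(&running_threads, 1);
    while (!atomic_load(&shutdown_flg))
        sem_wait(&workqueue_sem);    /* real code would process work here */
    atomic_fetch_sub(&running_threads, 1);
    return NULL;
}

/* Start n workers, shut them down, return the final running count */
static int start_and_stop_workers(int n)
{
    pthread_t tid[MAX_WORKERS];
    int i, started = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;
    atomic_store(&running_threads, 0);
    atomic_store(&shutdown_flg, 0);
    sem_init(&workqueue_sem, 0, 0);

    for (i = 0; i < n; i++)
        if (pthread_create(&tid[started], NULL, worker, NULL) == 0)
            started++;

    /* "load" is done once every worker has counted itself in */
    while (atomic_load(&running_threads) < started)
        short_sleep();

    /* "unload": set the flag, wake everyone, watch the counter */
    atomic_store(&shutdown_flg, 1);
    for (i = 0; i < started; i++)
        sem_post(&workqueue_sem);
    while (atomic_load(&running_threads) > 0)
        short_sleep();

    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    sem_destroy(&workqueue_sem);
    return atomic_load(&running_threads);
}
```

The sketch waits for every worker to count itself in before it starts
the shutdown.  Without that, the unload path could see a zero counter
while a thread is still on its way up.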

If you do it correctly, you can start a bunch of copies of the worker
thread.  If you are after a lot of bandwidth or IOPS, this might be
helpful.  Otherwise, you can probably get away with just one.  Having just
one helper is nice because you don't have to set as many locks to protect
yourself from yourself.

Once you have your first live thread, you can build a queue to give it work
to do.  Once you have work you can give it, you are off to the races.

>
> > If you cannot continue because devices are not present or the right
> > size, yes you should fail the ctr routine.
>
> Alright!
>
> > If you want to set up /proc or other monitoring stuff, you can use the
> > init routine, probably plus some statics, to set up "views" into your
> > module.  If you want to support multiple instances (and you should),
> > set up a /proc/{yourname} directory on the init and then populate it
> > with sub-directories every time you create a device.
>
> Okay, I'll try to do this (want to make statistics available via procfs
> later on), but one construction site at a time for me.
>
> >> - Can I determine the size the bio in map() will have already in ctr()
> >> somehow? Can I assume it will never change if it was once determined?
> >> The reason is that for my example I need to make sure the chunk size is
> >> a integer multiple of the bio size and I would only like to check this
> >> once (in ctr) and not every time (in map).
> >
> > Block size will not change.  The size of requests to you is limited by
> > the setup of ti->max_io_len.  If you don't set this with recent kernels,
> > you will only get 4K, which is not all that efficient.  This is actually
> > part of another big topic of "stacked limits", which someone could write
> > a book on (and I would read it).
>
> So if I would want to do a large I/O operation (say write one megabyte
> of data to a block device somewhere within my driver) I'd have to make
> lots of calls to generic_make_request?
>
> Thank you so much for your help,
> Best regards,
> Johannes
>

-- 
Doug Dumitru
EasyCo LLC