[Linux-cluster] io scheduler and gnbd

Markus Hochholdinger Markus at hochholdinger.net
Sat Oct 21 23:31:03 UTC 2006


hi,

On Saturday, 21 October 2006 at 02:46, Benjamin Marzinski wrote:
> On Fri, Oct 20, 2006 at 04:20:37PM +0200, Markus Hochholdinger wrote:
> > I've been successfully using gnbd as a single service for a long time.
> > Now I discovered a weird problem with the gnbd devices with kernel
> > 2.6.18. I built the gnbd.ko module from the cvs tree.
> > All works fine if you don't do too much on the gnbds. But if you stress
> > test the devices, the gnbds will hang, e.g. reads and writes hang. If you
> > restart the gnbd server, the client will continue to read and write until
> > the next hang.
> > So I first checked my gnbd servers and tried versions 1.01 through 1.03
> > and the latest cvs. But the problem is still there. From another gnbd
> > client I had no problem with any of these gnbd server versions (I was
> > impressed that you can mix these versions). Also, changing the kernel on
> > the gnbd server didn't help.
> > So I was stuck with the gnbd client on kernel 2.6.18. I have to use this
> > kernel because of the new hardware. So I experimented a little and found
> > out that changing the default io scheduler for the gnbd devices on the
> > client makes the hanging writes and reads resume. The default scheduler
> > was cfq, and with it I can easily reproduce this behavior. With the
> > deadline scheduler I can't.
> > So I read a little about io scheduling on Linux. My assumption is that a
> > gnbd device shouldn't need any io scheduling, because the network has no
> > seek latency like a hard disk. The gnbd server gets requests from more
> > than one gnbd client, so scheduling io on the client would just get mixed
> > up with the scheduling on the server. And the server also does its own io
> > scheduling when writing to the real disk.
> > So could I just use the noop scheduler, or have I missed something?
> > Does anyone on the list have more info about io scheduling and gnbd?
> If gnbd isn't working with the latest kernel, it's pretty definitely a bug.

I've spent the whole Saturday solving this problem, and I figured out that
it is not gnbd's fault.
It's the 3ware controller in the gnbd server!
I thought the gnbd server was fine because my other gnbd clients had no
problems, but I was wrong. The new gnbd client, which doesn't work
correctly/smoothly, is connected over a 1GBit/s network, in contrast to the
other clients with 100MBit/s.
The 3ware controller in the gnbd server performs very poorly (I can't tell
whether it is the driver or the hardware). I have JBOD configured on the
3ware controller and manage 3 hard disks with lvm, which I export over gnbd.
Even without gnbd I can't write more than 5MB/s to one of the JBOD disks once
I write more than the system can cache. (Because of the large amount of RAM
in this gnbd server and the stripe set of three disks, I didn't notice this
before.)
When I turn on WRITE CACHE in the 3ware controller (without a BBU) I get
75MB/s per JBOD disk!
And with that, gnbd works fine. (I think I'll have to replace the 3ware
controller or buy a BBU.)
So after having a fast gnbd server I couldn't reproduce the problem.
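
A quick way to check this without gnbd is a small synchronous write test; a
minimal Python sketch (the device path is only an example, and it overwrites
whatever is on the device), opening with O_SYNC so the page cache can't hide
the slow disk:

    import os, time

    DEV = "/dev/sdb"        # example path for one JBOD disk, adjust for your setup
    CHUNK = 1024 * 1024     # 1 MiB per write
    TOTAL = 256 * CHUNK     # write 256 MiB in total

    buf = b"\0" * CHUNK
    fd = os.open(DEV, os.O_WRONLY | os.O_SYNC)   # O_SYNC: every write hits the disk
    start = time.time()
    written = 0
    while written < TOTAL:
        written += os.write(fd, buf)
    os.close(fd)
    print("%.1f MB/s" % (written / (time.time() - start) / 1e6))

Writing with O_SYNC is why the 5MB/s only shows up once you write more than
the system can cache.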


> I'll take a look and see if I can't reproduce it.  As far as IO scheduling
> goes, if you can work around this with a different scheduler, that's great,
> but depending on your IO patterns, gnbd should see a benefit from
> reordering requests. It is more efficient to send a smaller number of larger
> requests to the server. There are a couple of reasons why, but the big one
> is that the gnbd server does not reorder requests itself.
> Currently the gnbd server receives a request, performs the request, returns
> the result, and then goes on to the next request. There is only one thread
> per client per device. Obviously, you need to have the IO complete to disk
> before you return the read result. But because gnbd is pretending to be a
> block device, when the server says that the data has been written out, the
> data must be actually on disk. This means that the request must be synced
> to disk before the server returns a write result and goes on to the next
> request. So the gnbd server always has its requests complete to disk
> before it gets a new one, so it cannot usefully reorder them.  It can
> reorder requests if they come in from different clients, but I don't think
> that this gets you much.
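
If I understand you right, the per-client loop on the server is roughly like
this (only a sketch of my understanding, not the real gnbd code; the
next_request and send_reply parameters stand in for the network protocol):

    import os

    def serve(next_request, send_reply, fd):
        # one of these loops per client per device
        while True:
            req = next_request()                  # blocks until the client sends a request
            if req["op"] == "write":
                os.pwrite(fd, req["data"], req["offset"])
                os.fsync(fd)                      # the data must really be on disk...
                send_reply({"ok": True})          # ...before the write result goes back
            else:
                data = os.pread(fd, req["length"], req["offset"])
                send_reply({"data": data})
            # only now is the next request fetched, so the server never has two
            # outstanding requests from one client that it could reorder

So the merging and reordering really has to happen on the client before the
requests go over the wire.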

Well, I have a setup like this:
gnbd server, 3 disks, lvm with stripes, logical volumes, and a lot of gnbd
clients.
So if one gnbd client reads/writes, it's almost certain another one is also
reading and writing. The final scheduling therefore happens in the block
device layer of the 3 disks; if data comes in from more than one client at a
time, it gets rescheduled there.
And I think that is where the problem was:
either the io scheduler on the gnbd server got overloaded, or the ordering
coming from the gnbd clients produced some kind of deadlock together with the
bad performance!?

Many thanks for your explanation of the io and scheduling behavior of the
gnbd server. So I will go back to using a scheduler for the gnbd devices.
I also think it would be wise to use the same io scheduler on the gnbd server
as on the client.
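
For the record, I switch the elevator per device through sysfs; a small
Python sketch (the device names are only examples and differ per machine):

    # same effect as:  echo deadline > /sys/block/<dev>/queue/scheduler
    SCHED = "deadline"
    DEVS = ["gnbd0"]          # on the server e.g. ["sdb", "sdc", "sdd"]

    for dev in DEVS:
        f = open("/sys/block/%s/queue/scheduler" % dev, "w")
        f.write(SCHED + "\n")
        f.close()

Run it once on the client for the gnbd devices and once on the server for the
backing disks, so both ends use the same scheduler.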

BTW: I've recently set up exactly the same combination of hard disks, gnbd
server and client as in this case, and there I get great performance over a
1GBit/s network (read/write about 80MB/s).


> Now that (I believe) you can do async IO to a device opened with the O_SYNC
> flag from userspace, the gnbd server could be rewritten much more
> effectively. Unfortunately, it probably won't happen anytime soon.
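
Just so I understand the idea: several requests in flight at once against a
device opened with O_SYNC, roughly like this (only a rough approximation with
ordinary threads, not real kernel async IO, and the device path is made up):

    import os
    from concurrent.futures import ThreadPoolExecutor

    def sync_write(fd, data, offset):
        # returns only when the data is on disk, because of O_SYNC
        return os.pwrite(fd, data, offset)

    fd = os.open("/dev/example", os.O_WRONLY | os.O_SYNC)   # made-up device path
    pool = ThreadPoolExecutor(max_workers=4)
    writes = [pool.submit(sync_write, fd, b"\0" * 4096, i * 4096) for i in range(16)]
    for w in writes:
        w.result()
    pool.shutdown()
    os.close(fd)

With something like that the server could keep more than one request
outstanding per client and still guarantee the data is on disk before it
replies.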

In my case, what I mean is that the async io on the gnbd server comes from
multiple clients writing to the same hard disks (physical volumes) because of
the lvm and the stripes.
So the io scheduler of the real hard disks has to manage requests that have
already been scheduled by the gnbd clients. Before this email I thought doing
the scheduling twice would be a waste of time, because the network doesn't
need seek time like a hard disk. But as you explained, the gnbd server and
client are optimized for fewer, larger requests, and therefore scheduling on
the client is a good thing.
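
The merging can even be watched on the client via the merge counters in
/proc/diskstats; a small Python sketch (the device name is only an example):

    DEV = "gnbd0"             # example device name

    for line in open("/proc/diskstats"):
        f = line.split()
        if f[2] == DEV:
            # fields: major minor name  reads reads_merged ...  writes writes_merged ...
            print("reads merged: %s   writes merged: %s" % (f[4], f[8]))

If those counters go up under load, the client-side elevator is combining
small requests into the fewer, larger ones you describe.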


> Thanks for the heads up, and if you wouldn't mind filing a bugzilla about
> this at bugzilla.redhat.com, that would be helpful.

Hm, I didn't really find a bug. Perhaps I found a bug in the Linux io
scheduler or in some hardware.
But as far as I can see, gnbd is rock solid. And if the hard disks can't keep
up with the data, I don't think gnbd can help here.

BTW: Now that I've tested all gnbd kernel modules and all gnbd client and
server versions against each other (1.01, 1.02, 1.03 and latest cvs) and it
worked, I must say you guys do great work :-) It never crashed (oh well, one
time I got kernel panics when I loaded the gnbd kernel module with all
debugging on, but the system kept running). I was really surprised.

BTW2: I was worried about which order I would have to upgrade my gnbd servers
and clients in. Now I know it doesn't really matter. Great :-)


-- 
greetings

eMHa