[linux-lvm] poor read performance on rbd+LVM, LVM overload

Sage Weil sage at inktank.com
Mon Oct 21 03:58:58 UTC 2013


On Sun, 20 Oct 2013, Ugis wrote:
> >> output follows:
> >> #pvs -o pe_start /dev/rbd1p1
> >>   1st PE
> >>     4.00m
> >> # cat /sys/block/rbd1/queue/minimum_io_size
> >> 4194304
> >> # cat /sys/block/rbd1/queue/optimal_io_size
> >> 4194304
> >
> > Well, the parameters are being set at least.  Mike, is it possible that
> > having minimum_io_size set to 4m is causing some read amplification
> > in LVM, translating a small read into a complete fetch of the PE (or
> > something along those lines)?
> >
> > Ugis, if your cluster is on the small side, it might be interesting to see
> > what requests the client is generating in the LVM and non-LVM cases by
> > setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
> > '--debug-ms 1') and then looking at the osd_op messages that appear in
> > /var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is
> > different.
> >
> Sage, the debug output follows. I am no pro at reading this, but it
> seems the read block sizes differ (or what is the number following the ~ sign)?

Yep, it's offset~length.  
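
To compare the two cases over a whole log rather than a few lines, the extents can be tallied by length. This is a minimal sketch, assuming each op is printed on one line as `[read offset~length]` as in the quoted output below; the names `read_extents` and `length_histogram` are made up for illustration, and the sample lines are trimmed copies of the quoted log.

```python
import re
from collections import Counter

# Matches the "[read offset~length]" part of an osd_op log line.
READ_EXTENT = re.compile(r"\[read\s+(\d+)~(\d+)\]")


def read_extents(lines):
    """Yield (offset, length) for each client read request in the log lines."""
    for line in lines:
        # Only look at requests; "osd_op_reply(" lines repeat the same extent.
        if "osd_op(" not in line:
            continue
        for match in READ_EXTENT.finditer(line):
            yield int(match.group(1)), int(match.group(2))


def length_histogram(lines):
    """Count how many read requests were seen for each request length."""
    return Counter(length for _, length in read_extents(lines))


# Trimmed sample lines (timestamps and addresses removed for brevity).
sample = [
    "osd_op(client.38069.1:176566435 rbd_data.3ad974b0dc51.0000000000007cef "
    "[read 4087808~4096] 4.5672f053 e6870) v4",
    "osd_op_reply(176566435 rbd_data.3ad974b0dc51.0000000000007cef "
    "[read 4087808~4096] ondisk = 0) v4",
    "osd_op(client.38069.1:176708855 rb.0.967b.238e1f29.000000000071 "
    "[read 2490368~131072] 4.c0d1e4cb e6870) v4",
]

for length, count in sorted(length_histogram(sample).items()):
    print(f"{length:>8} bytes: {count} request(s)")
```

On a real log, pass the open file object (for example one of /var/log/ceph/ceph-osd*.log) instead of `sample`.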

It looks like without LVM we're getting 128KB requests (which IIRC is 
typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
fuzzy here, but I seem to recall a property on the request_queue or device 
that affected this.  RBD is currently doing

	segment_size = rbd_obj_bytes(&rbd_dev->header);
	blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
	blk_queue_max_segment_size(q, segment_size);
	blk_queue_io_min(q, segment_size);
	blk_queue_io_opt(q, segment_size);

where segment_size is 4MB (so, much more than 128KB); maybe it has 
something to do with how many smaller ios get coalesced into larger requests?
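
For scale, the arithmetic behind those numbers can be worked out as below. This is only a sketch, assuming a 512-byte SECTOR_SIZE and the default 4MB object size; `requests_per_object` is a hypothetical helper, not anything in the rbd driver.

```python
SECTOR_SIZE = 512                   # assumed 512-byte sectors
segment_size = 4 * 1024 * 1024      # 4MB rbd object size

# The value handed to blk_queue_max_hw_sectors() in the snippet above.
max_hw_sectors = segment_size // SECTOR_SIZE


def requests_per_object(request_bytes, object_bytes=segment_size):
    """Number of sequential reads of request_bytes needed to cover one object."""
    return object_bytes // request_bytes


print("max_hw_sectors:", max_hw_sectors)
print("4KB reads per object:", requests_per_object(4 * 1024))
print("128KB reads per object:", requests_per_object(128 * 1024))
```

So reading one full object sequentially takes 32 times as many round trips at 4KB as at 128KB.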

In any case, something appears to be lost due to the pass through LVM, but 
I'm not very familiar with the block layer code at all...  :/
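
One way to see whether a limit differs between the two layers is to compare the queue attributes that sysfs exposes for the rbd device and for the device-mapper device stacked on it. The sketch below is an assumption-laden example: `dm-0` is a placeholder for whatever dm device backs the logical volume, and a difference it reports is only a hint, not proof of the cause.

```python
from pathlib import Path

# Standard block-queue attributes under /sys/block/<dev>/queue/.
ATTRS = (
    "max_sectors_kb",
    "max_hw_sectors_kb",
    "max_segment_size",
    "minimum_io_size",
    "optimal_io_size",
    "read_ahead_kb",
)


def queue_limits(dev, sysfs="/sys/block"):
    """Return {attribute: int or None} for one block device."""
    limits = {}
    for attr in ATTRS:
        path = Path(sysfs) / dev / "queue" / attr
        try:
            limits[attr] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable on this kernel/device.
            limits[attr] = None
    return limits


def compare(dev_a, dev_b, sysfs="/sys/block"):
    """Return {attribute: (value_a, value_b)} for attributes that differ."""
    a = queue_limits(dev_a, sysfs)
    b = queue_limits(dev_b, sysfs)
    return {attr: (a[attr], b[attr]) for attr in ATTRS if a[attr] != b[attr]}


if __name__ == "__main__":
    # "rbd1" is the device from this thread; "dm-0" is a hypothetical name.
    for attr, (rbd_val, dm_val) in compare("rbd1", "dm-0").items():
        print(f"{attr}: rbd1={rbd_val} dm-0={dm_val}")
```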

sage


> 
> OSD.2 read with LVM:
> 2013-10-20 16:59:05.307159 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566434
> rbd_data.3ad974b0dc51.0000000000007cef [read 4083712~4096] ondisk = 0)
> v4 -- ?+0 0xdc35c00 con 0xd9e4840
> 2013-10-20 16:59:05.307655 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5548 ====
> osd_op(client.38069.1:176566435 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4087808~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (1554835253 0 0)
> 0x12593d80 con 0xd9e4840
> 2013-10-20 16:59:05.307824 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566435
> rbd_data.3ad974b0dc51.0000000000007cef [read 4087808~4096] ondisk = 0)
> v4 -- ?+0 0xe24fc00 con 0xd9e4840
> 2013-10-20 16:59:05.308316 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5549 ====
> osd_op(client.38069.1:176566436 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4091904~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3467296840 0 0)
> 0xe28f6c0 con 0xd9e4840
> 2013-10-20 16:59:05.308499 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566436
> rbd_data.3ad974b0dc51.0000000000007cef [read 4091904~4096] ondisk = 0)
> v4 -- ?+0 0xdc35a00 con 0xd9e4840
> 2013-10-20 16:59:05.308985 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5550 ====
> osd_op(client.38069.1:176566437 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4096000~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3104591620 0 0)
> 0xe0b46c0 con 0xd9e4840
> 
> OSD.2 read without LVM
> 2013-10-20 17:03:13.730881 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708854
> rb.0.967b.238e1f29.000000000071 [read 2359296~131072] ondisk = 0) v4
> -- ?+0 0x1019d200 con 0xd9e4840
> 2013-10-20 17:03:13.731318 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18232 ====
> osd_op(client.38069.1:176708855 rb.0.967b.238e1f29.000000000071 [read
> 2490368~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (1987168552 0 0)
> 0x171a7480 con 0xd9e4840
> 2013-10-20 17:03:13.731664 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708855
> rb.0.967b.238e1f29.000000000071 [read 2490368~131072] ondisk = 0) v4
> -- ?+0 0x12b81200 con 0xd9e4840
> 2013-10-20 17:03:13.733112 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18233 ====
> osd_op(client.38069.1:176708856 rb.0.967b.238e1f29.000000000071 [read
> 2621440~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (527551382 0 0)
> 0x12593d80 con 0xd9e4840
> 2013-10-20 17:03:13.733393 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708856
> rb.0.967b.238e1f29.000000000071 [read 2621440~131072] ondisk = 0) v4
> -- ?+0 0xeba9000 con 0xd9e4840
> 2013-10-20 17:03:13.733741 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18234 ====
> osd_op(client.38069.1:176708857 rb.0.967b.238e1f29.000000000071 [read
> 2752512~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (178955972 0 0)
> 0xe0b4d80 con 0xd9e4840
> 
> How should I proceed with tuning read performance on LVM? Is there some
> change needed in the code of ceph/LVM, or does my config need to be tuned?
> If what is shown in the logs means a 4k read block in the LVM case, then it
> seems I need to tell LVM (or does xfs on top of LVM dictate the read block
> size?) that the io block should rather be 4m?
> 
> Ugis