<div dir="ltr"><div>Hello, </div><div><br></div><div>I am using 4 drives to construct a RAID5 and build a thin </div><div>volume on it. To get better performance, I use '-Zn' option</div><div>in 'lvcreate' to make the thin pool assume all blocks are </div>
<div>already zeroed. The chunk size in RAID5 and thin-pool are </div><div>both 512KB and the stripe_cache_size=4096 on RAID5.</div><div><br></div><div>The following is the performance result I got when writes to</div><div>
a RAID5 device and a thin volume:</div><div><br></div><div>dd if=/dev/zero of=/dev/md5 bs=2M count=1000</div><div>1000+0 records in</div><div>1000+0 records out</div><div>2097152000 bytes (2.1 GB) copied, 6.02630 seconds, 348 MB/s</div>
<div><br></div><div>dd if=/dev/zero of=/dev/mapper/vg1-lv1 bs=2M count=1000</div><div>1000+0 records in</div><div>1000+0 records out</div><div>2097152000 bytes (2.1 GB) copied, 11.58648 seconds, 181 MB/s</div><div><br></div>
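
For reference, the I/O geometry implied by this setup (assuming the usual
RAID5 layout of three data chunks plus one parity chunk per stripe on the
4 drives; every number below is derived only from the parameters above)
can be summarised with a small stand-alone program:

/* Stand-alone sketch: derive the I/O geometry of the setup above.
 * Assumes a standard 4-drive RAID5 (3 data chunks + 1 parity chunk
 * per stripe); all figures follow from the parameters in this mail.
 */
#include <stdio.h>

int main(void)
{
	const unsigned chunk_kb   = 512;      /* RAID5 chunk and thin-pool block size */
	const unsigned bio_kb     = 4;        /* bio size observed with dd (one page) */
	const unsigned data_disks = 4 - 1;    /* 4 drives, 1 chunk of parity per stripe */

	printf("bios per thin block / RAID chunk : %u\n", chunk_kb / bio_kb);              /* 128  */
	printf("full-stripe write size           : %u KB\n", chunk_kb * data_disks);       /* 1536 */
	printf("bios per full stripe             : %u\n", chunk_kb * data_disks / bio_kb); /* 384  */
	return 0;
}

A stripe that is not filled by consecutive writes generally has to be
completed by read-modify-write inside md, which is why the bio ordering
discussed below matters.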

To find out what caused the performance to drop so much, I added some
traces to the code and got an interesting result. First, the bio size
with the dd command is 4KB, so every 128 bios fill up one thin
block / RAID chunk (512KB / 4KB = 128) in my setup. Since I have set
'pool->pf.zero_new_blocks' = false, it seems that when a new block is
provisioned for a bio, that bio is put back on the tail of the
'pool->deferred_bios' list rather than being issued immediately. This
re-arranges the incoming bio sequence.

For example, the bi_sector values of the incoming PAGE_SIZE bios are:
bi_sector : [0, 8, 16, 24, ... 1024]
After each of them gets mapped, the order in which they are issued to
the lower layer becomes non-sequential:
bi_sector : [8, 16, 24, ... 136, 144, 152, ... 1016] + [0, 128, 256, 384, 512, 640, 768, 896, 1024]

As you can see, the bios which triggered provision_block() got
re-arranged and separated from the other, consecutive ones. If the
lower-layer device cannot merge them back, this may cause some
read-modify-writes or seek-latency overhead.
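
To make this re-ordering easy to reproduce outside the kernel, here is a
small user-space model (not the actual dm-thin code) of the behaviour
described above: a bio whose block is already mapped is issued at once,
while a bio that triggers provisioning is re-queued at the tail of a
deferred list. The block size is scaled down to 128 sectors so that the
output matches the illustration above; with my real 512KB (1024-sector)
blocks only one bio per 1024 sectors gets deferred, but the separation
effect is the same.

/*
 * User-space model of the re-ordering described above (not kernel code).
 * A bio whose target block is already provisioned is issued immediately;
 * a bio that hits an unprovisioned block is pushed to the tail of a
 * deferred list and only issued after the rest of the batch.
 */
#include <stdio.h>
#include <stdbool.h>

#define SECTORS_PER_BIO    8    /* 4KB bio = 8 x 512-byte sectors           */
#define SECTORS_PER_BLOCK  128  /* scaled-down thin block, for illustration */
#define NR_BIOS            129  /* bi_sector 0 .. 1024                      */

int main(void)
{
	bool provisioned[NR_BIOS * SECTORS_PER_BIO / SECTORS_PER_BLOCK + 1] = { false };
	unsigned deferred[NR_BIOS], nr_deferred = 0;

	printf("issued immediately :");
	for (unsigned i = 0; i < NR_BIOS; i++) {
		unsigned bi_sector = i * SECTORS_PER_BIO;
		unsigned block = bi_sector / SECTORS_PER_BLOCK;

		if (!provisioned[block]) {
			/* new block: provision it and re-queue the bio at the tail */
			provisioned[block] = true;
			deferred[nr_deferred++] = bi_sector;
		} else {
			/* mapping already known: issue right away */
			printf(" %u", bi_sector);
		}
	}

	printf("\nissued afterwards  :");
	for (unsigned i = 0; i < nr_deferred; i++)
		printf(" %u", deferred[i]);
	printf("\n");

	return 0;
}

It prints the same two sequences as in the illustration above, which is
exactly the separation that the lower layer then has to merge back.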

According to this observation, I made a rough patch against kernel 3.6
to maintain the sequential order of the bios when
'pool->pf.zero_new_blocks' = false:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index b0a5ed9..76cda40 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1321,6 +1321,7 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 {
 	struct pool *pool = tc->pool;
 	struct new_mapping *m = get_next_mapping(pool);
+	int r;
 
 	INIT_LIST_HEAD(&m->list);
 	m->quiesced = 1;
@@ -1337,9 +1338,20 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 	 * zeroing pre-existing data, we can issue the bio immediately.
 	 * Otherwise we use kcopyd to zero the data first.
 	 */
-	if (!pool->pf.zero_new_blocks)
-		process_prepared_mapping(m);
-
+	if (!pool->pf.zero_new_blocks) {
+		r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block, 0);
+		if (r) {
+			DMERR("schedule_zero() failed");
+			cell_error(m->cell);
+		}
+		else {
+			inc_all_io_entry(pool, bio);
+			cell_defer_except(tc, cell);
+			remap_and_issue(tc, bio, data_block);
+		}
+		list_del(&m->list);
+		mempool_free(m, tc->pool->mapping_pool);
+	}
 	else if (io_overwrites_block(pool, bio)) {
 		struct endio_hook *h = dm_get_mapinfo(bio)->ptr;
 		h->overwrite_mapping = m;

And the performance also got better:
dd if=/dev/zero of=/dev/mapper/vg1-lv1 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 6.16819 seconds, 340 MB/s

Since my thin pool is set up with pf->zero_new_blocks = false, I think
it is OK to issue a bio immediately, rather than putting it back on
pool->deferred_bios, once its mapping is known; that way the sequential
order is maintained. However, I wonder whether this rough patch misses
some cases, so any suggestions would be helpful.