[dm-devel] dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")

Dmitry Fomichev Dmitry.Fomichev at wdc.com
Wed Nov 6 23:00:59 UTC 2019


Hi Zhang,

I just posted the patch that fixes this issue. Could you please try it and let
me know how this patch works for you? In my testing, I don't see any excessive
TURs issued with this patch in place. It takes around 12 minutes to run
mkfs.ext4 on a freshly created dm-zoned device on top of a 14TB SCSI drive.
The same test on top of a 14TB SATA drive takes around 10 minutes. These are
direct attached drives on a physical server.
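
The gist of the fix is to keep the per-bio liveness check cheap and to move the
check_events() call into a separate helper that is only invoked from error handling.
A simplified sketch of that split (illustrative only; the actual patch may differ in
helper and flag names):

/* Cheap check: safe to call from the bio mapping and reclaim paths. */
bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
{
	if (dmz_dev->flags & DMZ_BDEV_DYING)
		return true;

	/* Only inspects the queue state; no command is sent to the drive. */
	if (blk_queue_dying(bdev_get_queue(dmz_dev->bdev))) {
		dmz_dev_warn(dmz_dev, "Backing device queue dying");
		dmz_dev->flags |= DMZ_BDEV_DYING;
	}

	return dmz_dev->flags & DMZ_BDEV_DYING;
}

/*
 * Expensive check: disk->fops->check_events() can issue a blocking
 * TEST UNIT READY to a SCSI drive, so this should only run when an
 * I/O error suggests the backing device may have gone away.
 */
bool dmz_check_bdev(struct dmz_dev *dmz_dev)
{
	struct gendisk *disk = dmz_dev->bdev->bd_disk;

	if (dmz_bdev_is_dying(dmz_dev))
		return false;

	if (disk->fops->check_events &&
	    disk->fops->check_events(disk, 0) & DISK_EVENT_MEDIA_CHANGE) {
		dmz_dev_warn(dmz_dev, "Backing device offline");
		dmz_dev->flags |= DMZ_BDEV_DYING;
	}

	return !(dmz_dev->flags & DMZ_BDEV_DYING);
}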

I didn't test this patch on the 4.19 kernel. If you have any findings about how
it behaves there, do let me know.

Regards,
Dmitry

On Thu, 2019-10-31 at 16:20 +0800, zhangxiaoxu (A) wrote:
> Hi Dmitry, thanks for your reply.
> 
> I also tested it with the mainline kernel, and it also takes more than 1 hour.
> My machine has 64 CPU cores and the disk is SATA.
> 
> When running mkfs.ext4, I found 'scsi_test_unit_ready' being called more than 1000 times
> per second by different kworkers. Each 'scsi_test_unit_ready' call takes more than 200us,
> and the interval between calls is less than 20us.
> So, I think your guess is right.
> 
> But there is another question: why does the 4.19 branch take more than 10 hours?
> I will keep working on it, and if I find any information I will reply to you.
> 
> Thanks.
> 
> my script:
> 	dmzadm --format /dev/sdi
> 	echo "0 21485322240 zoned /dev/sdi" | dmsetup create dmz-sdi
> 	date; mkfs.ext4 /dev/mapper/dmz-sdi; date
> 
> mainline:
> 	[root@localhost ~]# uname -a
> 	Linux localhost 5.4.0-rc5 #1 SMP Thu Oct 31 11:41:20 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
> 
> 	Thu Oct 31 13:58:55 CST 2019
> 	mke2fs 1.43.6 (29-Aug-2017)
> 	Discarding device blocks: done
> 	Creating filesystem with 2684354560 4k blocks and 335544320 inodes
> 	Filesystem UUID: e0d8e01e-efa8-47fd-a019-b184e66f65b0
> 	Superblock backups stored on blocks:
> 		32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 		4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 		102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 		2560000000
> 
> 	Allocating group tables: done
> 	Writing inode tables: done
> 	Creating journal (262144 blocks): done
> 	Writing superblocks and filesystem accounting information: done
> 
> 	Thu Oct 31 15:01:01 CST 2019
> 
> After deleting the 'check_events' call on mainline:
> 	[root@localhost ~]# uname -a
> 	Linux localhost 5.4.0-rc5+ #2 SMP Thu Oct 31 15:07:36 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
> 	Thu Oct 31 15:19:56 CST 2019
> 	mke2fs 1.43.6 (29-Aug-2017)
> 	Discarding device blocks: done
> 	Creating filesystem with 2684354560 4k blocks and 335544320 inodes
> 	Filesystem UUID: 735198e8-9df0-49fc-aaa8-23b0869dfa05
> 	Superblock backups stored on blocks:
> 		32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 		4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 		102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 		2560000000
> 
> 	Allocating group tables: done
> 	Writing inode tables: done
> 	Creating journal (262144 blocks): done
> 	Writing superblocks and filesystem accounting information: done
> 
> 	Thu Oct 31 15:30:51 CST 2019
> 
> On 2019/10/27 10:56, Dmitry Fomichev wrote:
> > Zhang,
> > 
> > I just did some testing of this scenario with a recent kernel that includes this patch.
> > 
> > The log below is a run in QEMU with 8 CPUs and it took 18.5 minutes to create the FS on a
> > 14TB ATA drive. Doing the same thing on bare metal with 32 CPUs takes 10.5 minutes in my
> > environment. However, when doing the same test with a SAS drive, the run takes 43 minutes.
> > This is not quite the degradation you are observing, but still a big performance hit.
> > 
> > Is the disk that you are using SAS or SATA?
> > 
> > My current guess is that the sd driver may generate some TEST UNIT READY commands to check
> > whether the drive is really online as a part of check_events() processing. For ATA drives, this is
> > nearly a NOP since all TURs are completed internally in libata. But in the SCSI case, these
> > blocking TURs are issued to the drive and may well degrade performance.
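> >
> > To show where the cost comes from: for a SCSI disk, check_events() lands in the sd
> > driver's event handler, which sends a TUR. A heavily simplified sketch of that path
> > (not the verbatim drivers/scsi/sd.c code):
> >
> > static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
> > {
> > 	struct scsi_disk *sdkp = scsi_disk(disk);
> > 	struct scsi_device *sdp = sdkp->device;
> > 	struct scsi_sense_hdr sshdr = { 0, };
> >
> > 	/*
> > 	 * Blocking TEST UNIT READY. libata completes this internally for
> > 	 * SATA drives, but for SAS drives the command goes to the device,
> > 	 * so doing this for every bio gets expensive very quickly.
> > 	 */
> > 	if (scsi_block_when_processing_errors(sdp))
> > 		scsi_test_unit_ready(sdp, SD_TIMEOUT, SD_MAX_RETRIES, &sshdr);
> >
> > 	return sdp->changed ? DISK_EVENT_MEDIA_CHANGE : 0;
> > }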
> > 
> > The check_events() call has been added to dmz_bdev_is_dying() because simply calling
> > blk_queue_dying() doesn't cover the situation where the drive gets offlined in the SCSI layer.
> > It might be possible to only call check_events() once before every reclaim run and to avoid
> > calling it in the I/O mapping path. If this works, the overhead would likely be acceptable.
> > I am going to take a look into this.
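> >
> > A rough sketch of that idea (purely illustrative, not a tested patch; dmz_check_bdev()
> > here is a hypothetical helper wrapping the check_events() probe): keep only the cheap
> > queue check in the bio mapping path and run the expensive probe once per reclaim cycle.
> >
> > static int dmz_map(struct dm_target *ti, struct bio *bio)
> > {
> > 	struct dmz_target *dmz = ti->private;
> >
> > 	/* Cheap: flag plus blk_queue_dying() only, no SCSI command issued. */
> > 	if (dmz_bdev_is_dying(dmz->dev))
> > 		return DM_MAPIO_KILL;
> >
> > 	/* ... normal bio mapping continues here ... */
> > 	return DM_MAPIO_SUBMITTED;
> > }
> >
> > static void dmz_reclaim_work(struct work_struct *work)
> > {
> > 	struct dmz_reclaim *zrc = container_of(work, struct dmz_reclaim, work.work);
> >
> > 	/* Expensive probe (may issue a TUR), at most once per reclaim run. */
> > 	if (!dmz_check_bdev(zrc->dev))
> > 		return;
> >
> > 	dmz_do_reclaim(zrc);
> > 	dmz_schedule_reclaim(zrc);
> > }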
> > 
> > Regards,
> > Dmitry
> > 
> > [root@xxx dmz]# uname -a
> > Linux xxx 5.4.0-rc1-DMZ+ #1 SMP Fri Oct 11 11:23:13 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
> > [root@xxx dmz]# lsscsi
> > [0:0:0:0]    disk    QEMU     QEMU HARDDISK    2.5+  /dev/sda
> > [1:0:0:0]    zbc     ATA      HGST HSH721415AL T240  /dev/sdb
> > [root@xxx dmz]# ./setup-dmz test /dev/sdb
> > [root@xxx dmz]# cat /proc/kallsyms | grep dmz_bdev_is_dying
> > (standard input):90782:ffffffffc070a401 t dmz_bdev_is_dying.cold	[dm_zoned]
> > (standard input):90849:ffffffffc0706e10 t dmz_bdev_is_dying	[dm_zoned]
> > [root@xxx dmz]# time mkfs.ext4 /dev/mapper/test
> > mke2fs 1.44.6 (5-Mar-2019)
> > Discarding device blocks: done
> > Creating filesystem with 3660840960 4k blocks and 457605120 inodes
> > Filesystem UUID: 4536bacd-cfb5-41b2-b0bf-c2513e6e3360
> > Superblock backups stored on blocks:
> > 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > 	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> > 	2560000000
> > 
> > Allocating group tables: done
> > Writing inode tables: done
> > Creating journal (262144 blocks): done
> > Writing superblocks and filesystem accounting information: done
> > 
> > 
> > real	18m30.867s
> > user	0m0.172s
> > sys	0m11.198s
> > 
> > 
> > On Sat, 2019-10-26 at 09:56 +0800, zhangxiaoxu (A) wrote:
> > > Hi all, when I run 'mkfs.ext4' on a dm-zoned device based on a 10TB SMR disk,
> > > it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
> > > properly handle backing device failure").
> > > 
> > > After deleting the 'check_events' call in 'dmz_bdev_is_dying', it
> > > takes less than 12 minutes.
> > > 
> > > I tested this based on the 4.19 branch.
> > > Do we really need to do the 'check_events' check in the mapping path, reclaim, and metadata I/O?
> > > 
> > > Thanks.
> > > 



