[dm-devel] dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")

zhangxiaoxu (A) zhangxiaoxu5@huawei.com
Thu Oct 31 08:20:55 UTC 2019


Hi Dmitry, thanks for your reply.

I also tested this with the mainline kernel, and it also takes more than 1 hour.
My machine has 64 CPU cores and the disk is SATA.

While running mkfs.ext4, I found that 'scsi_test_unit_ready' is called more than
1000 times per second from different kworkers.
Each 'scsi_test_unit_ready' call takes more than 200us, and the interval between
consecutive calls is less than 20us.
So, I think your guess is right.
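
For reference, the check_events() call that this commit adds to dmz_bdev_is_dying() is roughly of the following shape (a paraphrased sketch, not the exact upstream code); since dmz_bdev_is_dying() runs on the I/O mapping path, every bio can end up calling disk->fops->check_events():

	bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
	{
		struct gendisk *disk = dmz_dev->bdev->bd_disk;

		if (dmz_dev->flags & DMZ_BDEV_DYING)
			return true;

		if (blk_queue_dying(bdev_get_queue(dmz_dev->bdev))) {
			dmz_dev->flags |= DMZ_BDEV_DYING;
		} else if (disk->fops->check_events &&
			   (disk->fops->check_events(disk, 0) &
			    DISK_EVENT_MEDIA_CHANGE)) {
			/*
			 * For sd (SCSI disks) this can issue a blocking
			 * TEST UNIT READY; for ATA disks libata completes
			 * TURs internally, so it is nearly free there.
			 */
			dmz_dev->flags |= DMZ_BDEV_DYING;
		}

		return dmz_dev->flags & DMZ_BDEV_DYING;
	}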

But there is another question: why does the 4.19 branch take more than 10 hours?
I will keep working on it; if I find any more information, I will reply to you.

Thanks.

My script:
	dmzadm --format /dev/sdi
	echo "0 21485322240 zoned /dev/sdi" | dmsetup create dmz-sdi
	date; mkfs.ext4 /dev/mapper/dmz-sdi; date

mainline:
	[root@localhost ~]# uname -a
	Linux localhost 5.4.0-rc5 #1 SMP Thu Oct 31 11:41:20 CST 2019 aarch64 aarch64 aarch64 GNU/Linux

	Thu Oct 31 13:58:55 CST 2019
	mke2fs 1.43.6 (29-Aug-2017)
	Discarding device blocks: done
	Creating filesystem with 2684354560 4k blocks and 335544320 inodes
	Filesystem UUID: e0d8e01e-efa8-47fd-a019-b184e66f65b0
	Superblock backups stored on blocks:
		32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
		4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
		102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
		2560000000

	Allocating group tables: done
	Writing inode tables: done
	Creating journal (262144 blocks): done
	Writing superblocks and filesystem accounting information: done

	Thu Oct 31 15:01:01 CST 2019

After deleting the 'check_events' call on mainline:
	[root@localhost ~]# uname -a
	Linux localhost 5.4.0-rc5+ #2 SMP Thu Oct 31 15:07:36 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
	Thu Oct 31 15:19:56 CST 2019
	mke2fs 1.43.6 (29-Aug-2017)
	Discarding device blocks: done
	Creating filesystem with 2684354560 4k blocks and 335544320 inodes
	Filesystem UUID: 735198e8-9df0-49fc-aaa8-23b0869dfa05
	Superblock backups stored on blocks:
		32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
		4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
		102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
		2560000000

	Allocating group tables: done
	Writing inode tables: done
	Creating journal (262144 blocks): done
	Writing superblocks and filesystem accounting information: done

	Thu Oct 31 15:30:51 CST 2019
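
Regarding your idea of calling check_events() only once before every reclaim run and avoiding it in the I/O mapping path: a minimal sketch of that direction (not a tested patch; dmz_check_bdev() is just a name made up here for illustration) could look like this:

	/*
	 * Hypothetical helper: query the backing device state once, e.g. at
	 * the start of each reclaim run, and cache the result in the flags.
	 */
	static void dmz_check_bdev(struct dmz_dev *dmz_dev)
	{
		struct gendisk *disk = dmz_dev->bdev->bd_disk;

		if (disk->fops->check_events &&
		    (disk->fops->check_events(disk, 0) & DISK_EVENT_MEDIA_CHANGE))
			dmz_dev->flags |= DMZ_BDEV_DYING;
	}

	/*
	 * The I/O mapping path then only tests the cached flag plus the cheap
	 * queue state, so no TEST UNIT READY is issued per bio.
	 */
	bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
	{
		if (blk_queue_dying(bdev_get_queue(dmz_dev->bdev)))
			dmz_dev->flags |= DMZ_BDEV_DYING;

		return dmz_dev->flags & DMZ_BDEV_DYING;
	}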

On 2019/10/27 10:56, Dmitry Fomichev wrote:
> Zhang,
> 
> I just did some testing of this scenario with a recent kernel that includes this patch.
> 
> The log below is a run in QEMU with 8 CPUs and it took 18.5 minutes to create the FS on a
> 14TB ATA drive. Doing the same thing on bare metal with 32 CPUs takes 10.5 minutes in my
> environment. However, when doing the same test with a SAS drive, the run takes 43 minutes.
> This is not quite the degradation you are observing, but still a big performance hit.
> 
> Is the disk that you are using SAS or SATA?
> 
> My current guess is that sd driver may generate some TEST UNIT READY commands to check if
> the drive is really online as a part of check_events() processing. For ATA drives, this is
> nearly a NOP since all TURs are completed internally in libata. But, in SCSI case, these
> blocking TURs are issued to the drive and certainly may degrade performance.
> 
> The check_events() call has been added to dmz_bdev_is_dying() because simply calling
> blk_queue_dying() doesn't cover the situation where the drive gets offlined in the SCSI layer.
> It might be possible to only call check_events() once before every reclaim run and to avoid
> calling it in I/O mapping path. If this works, the overhead would likely be acceptable.
> I am going to take a look into this.
> 
> Regards,
> Dmitry
> 
> [root@xxx dmz]# uname -a
> Linux xxx 5.4.0-rc1-DMZ+ #1 SMP Fri Oct 11 11:23:13 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
> [root@xxx dmz]# lsscsi
> [0:0:0:0]    disk    QEMU     QEMU HARDDISK    2.5+  /dev/sda
> [1:0:0:0]    zbc     ATA      HGST HSH721415AL T240  /dev/sdb
> [root@xxx dmz]# ./setup-dmz test /dev/sdb
> [root@xxx dmz]# cat /proc/kallsyms | grep dmz_bdev_is_dying
> (standard input):90782:ffffffffc070a401 t dmz_bdev_is_dying.cold	[dm_zoned]
> (standard input):90849:ffffffffc0706e10 t dmz_bdev_is_dying	[dm_zoned]
> [root@xxx dmz]# time mkfs.ext4 /dev/mapper/test
> mke2fs 1.44.6 (5-Mar-2019)
> Discarding device blocks: done
> Creating filesystem with 3660840960 4k blocks and 457605120 inodes
> Filesystem UUID: 4536bacd-cfb5-41b2-b0bf-c2513e6e3360
> Superblock backups stored on blocks:
> 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 	2560000000
> 
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (262144 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> 
> real	18m30.867s
> user	0m0.172s
> sys	0m11.198s
> 
> 
> On Sat, 2019-10-26 at 09:56 +0800, zhangxiaoxu (A) wrote:
>> Hi all, when I run 'mkfs.ext4' on a dmz device backed by a 10T SMR disk,
>> it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
>> properly handle backing device failure").
>>
>> After deleting the 'check_events' call in 'dmz_bdev_is_dying', it takes
>> less than 12 minutes.
>>
>> I tested this on the 4.19 branch.
>> Must we call 'check_events' in the mapping path, reclaim, or metadata I/O?
>>
>> Thanks.
>>



