[dm-devel] [v4.13-rc BUG] system lockup when running big buffered write(4M) to IB SRP via mpath

Ming Lei ming.lei at redhat.com
Wed Aug 23 11:35:27 UTC 2017


On Wed, Aug 09, 2017 at 05:10:01PM +0000, Bart Van Assche wrote:
> On Wed, 2017-08-09 at 12:43 -0400, Laurence Oberman wrote:
> > Your latest patch on stock upstream without Ming's latest patches is 
> > behaving for me.
> > 
> > As already mentioned, the requeue -11 and clone failure messages are 
> > gone and I am not actually seeing any soft lockups or hard lockups.
> > 
> > When Ming gets back I will work with him on his patch set and the lockups.
> > 
> > Running 10 parallel writes which easily trips into soft lockups on 
> > Ming's kernel (even with your patch) has been stable here on 4.13-RC3 
> > with your patch.
> > 
> > I will leave it running for a while now but the patch is good.
> > 
> > If it survives 4 hours I will add a Tested-by to your latest patch.
> 
> Hello Laurence,
> 
> I'm working on an additional patch that should reduce unnecessary requeuing
> even further. I will let you know when it's ready.
> 
> Additionally, please trim e-mails when replying such that e-mails do not get
> too long.

Soft lockups can still be observed easily with patch d4acf3650c7c
("block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet
time") applied, but no hard lockups.

With the patchset 'blk-mq-sched: improve SCSI-MQ performance' applied, a
hard lockup can be observed, preceded by the following failure log:

	[  269.277653] device-mapper: multipath: blk_get_request() returned -11 - requeuing
	[  269.321244] device-mapper: multipath: blk_get_request() returned -11 - requeuing
	...
	[  273.421688] scsi host2: SRP abort called
	[  273.444577] scsi host2: Sending SRP abort for tag 0x6007e
	[  273.673871] scsi host2: Null scmnd for RSP w/tag 0x0000000006007e received on ch 6 / QP 0x30
	...
	[  274.372110] device-mapper: multipath: blk_get_request() returned -11 - requeuing
	[  278.658671] scsi host2: SRP abort called
	[  278.690630] scsi host2: SRP abort called
	[  278.717634] scsi host2: SRP abort called
	[  278.745629] scsi host2: SRP abort called
	[  279.083227] multipath_clone_and_map: 1092 callbacks suppressed
	....
	[  296.210503] scsi host2: SRP reset_device called
	....
	[  303.784287] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10

The tricky thing is that the hard lockup and the soft lockup share the
same stack trace.

Another question: I don't understand why the request is allocated with
GFP_ATOMIC in multipath_clone_and_map(); it looks like that shouldn't be
necessary.


--
Ming



