[dm-devel] Question about iscsi session block

Wed Feb 16 01:28:12 UTC 2022

On Wed, Feb 16, 2022 at 12:31 AM Mike Christie
<michael.christie at oracle.com> wrote:
>
> On 2/15/22 9:49 AM, Zhengyuan Liu wrote:
> > Hi, all
> >
> > We have an online server which uses multipath + iSCSI to attach storage
> > from a storage server. There are two NICs on the server, each carrying
> > about 20 iSCSI sessions, and each session includes about 50 iSCSI
> > devices (yes, there are about 2*20*50 = 2000 iSCSI block devices on the
> > server in total). The problem is: once a NIC fails, it takes too long
> > (nearly 80s) for multipath to switch to the other, healthy NIC link,
> > because it first needs to block all iSCSI devices on the faulted NIC.
> > The call stack is shown below:
> >
> >     void iscsi_block_session(struct iscsi_cls_session *session)
> >     {
> >         queue_work(iscsi_eh_timer_workq, &session->block_work);
> >     }
> >
> >     __iscsi_block_session() -> scsi_target_block() -> target_block() ->
> >     device_block() -> scsi_internal_device_block() -> scsi_stop_queue() ->
> >     blk_mq_quiesce_queue() -> synchronize_rcu()
> >
> > All sessions and devices are processed sequentially, and we have traced
> > each synchronize_rcu() call to about 80ms, so the total cost is about
> > 80s (80ms * 20 * 50). That is so long that the application can't
> > tolerate it, and service may be interrupted.
> >
> > So my question is: can we optimize this procedure to reduce the time it
> > takes to block all iSCSI devices? I'm not sure whether it is a good idea
> > to increase max_active of the iscsi_eh_timer_workq workqueue to improve
> > concurrency.
>
> We need a patch so that the unblock call waits for, cancels, or flushes the
> block call; otherwise the two could be running in parallel.
>
> I'll send a patchset later today so you can test it.

I'll be glad to test it once you push the patchset.

Thank you, Mike.