[dm-devel] dm-multipath - IO queue dispatch based on FPIN Congestion/Latency notifications.

Erwin van Londen erwin at erwinvanlonden.net
Thu Apr 1 02:48:36 UTC 2021


Hello Muneendra

On Wed, 2021-03-31 at 16:18 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> Below are my replies.
> 
> 
> > If there was any discussion, I haven't been involved :-)
> 
> > I haven't looked into FPIN much so far. I'm rather sceptical about its
> > usefulness for dm-multipath. Being a property of FC-2, FPIN works at
> > least 2 layers below dm-multipath. dm-multipath is agnostic to protocol
> > and transport properties by design. User space multipathd can cross
> > these layers and tune dm-multipath based on lower-level properties, but
> > such actions have rather large latencies.
> 
> > As you know, dm-multipath has 3 switches for routing IO via different
> > paths:
> 
> > 1) priority groups
> > 2) path status (good / failed)
> > 3) path selector algorithm
> 
> > 1) and 2) are controlled by user space, and have high latency.
> 
> > The current "marginal" concept in multipathd watches paths for
> > repeated
> failures, and configures the kernel to avoid using paths that are
> considered marginal, using methods 1) and 2). This is a very-high-
> latency
> algorithm that >changes state on the time scale of minutes.
> > There is no concept for "delaying" or "pausing" IO on paths on
> > short time
> scale.
> 
> > The only low-latency mechanism is 3). But it's block level; no existing
> > selector looks at transport-level properties.
> 
> > That said, I can quite well imagine a feedback mechanism based on
> > throttling or delays applied in the FC drivers. For example, if a remote
> > port was throttled by the driver in response to FPIN messages, its
> > bandwidth would decrease, and a path selector like "service-time" would
> > automatically assign less IO to such paths. This wouldn't need any
> > changes in dm-multipath or multipath-tools, it would work entirely on
> > the FC level.
> 
> [Muneendra] Agreed.
I think the only way the FC drivers can respond to this is by delaying
the R_RDY primitives, resulting in fewer credits being available for the
remote side to use. That only works at the link layer and not fabric
wide. It cannot change link speed at all, as that would bounce a port
and result in all sorts of state changes. That being said, this is
already the existing behaviour and not really tied to FPINs. The goal of
the FPIN method was to provide a more proactive mechanism and inform the
OS layer of fabric issues so it could act upon them by adjusting the IO
profile.
> 
> > Talking about improving the current "marginal" algorithm in multipathd,
> > and knowing that it's slow, FPIN might provide additional data that
> > would be good to have. Currently, multipathd only has 2 inputs:
> > "good<->bad" state transitions based either on kernel I/O errors or
> > path checker results, and failure statistics from multipathd's internal
> > "io_err_stat" thread, which only reads sector 0. This could obviously be
> > improved, but there may actually be lower-hanging fruit than evaluating
> > FPIN notifications (for example, I've pondered utilizing the kernel's
> > blktrace functionality to detect unusually long IO latencies or
> > bandwidth drops).
> 
> > Talking about FPIN, is it planned to notify user space about such
> > fabric events, and if yes, how?
> 
> [Muneendra] Yes. FC drivers, when receiving FC FPIN ELS's, call a SCSI
> transport routine with the FPIN payload. The transport pushes this up as
> an "event" via netlink. An app bound to the local address used by the
> SCSI transport can receive the event and parse it.
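
For anyone who wants to play with this from userspace, receiving those
events looks roughly like the sketch below. This is untested and based on
my reading of the scsi_netlink uapi headers (NETLINK_SCSITRANSPORT,
SCSI_NL_GRP_FC_EVENTS, struct fc_nl_event), so please verify the constants
and struct layout before relying on it; for an FPIN the raw ELS payload
follows the event header.

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <scsi/scsi_netlink.h>     /* SCSI_NL_GRP_FC_EVENTS */
#include <scsi/scsi_netlink_fc.h>  /* struct fc_nl_event */

int main(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = SCSI_NL_GRP_FC_EVENTS,  /* FC transport broadcast group */
	};
	char buf[8192];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SCSITRANSPORT);

	if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("netlink");
		return 1;
	}

	for (;;) {
		int len = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nlh;

		if (len <= 0)
			break;
		for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
		     nlh = NLMSG_NEXT(nlh, len)) {
			struct fc_nl_event *ev = NLMSG_DATA(nlh);

			/* event_code tells link events and FPINs apart; for an
			 * FPIN the ELS payload (event_datalen bytes) follows
			 * and is parsed by its FC-LS descriptor tags. */
			printf("host%u: event_code 0x%x, datalen %u\n",
			       ev->host_no, ev->event_code, ev->event_datalen);
		}
	}
	close(fd);
	return 0;
}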
> 
> Benjamin has added a marginal path group (multipath marginal pathgroups)
> to dm-multipath:
> https://patchwork.kernel.org/project/dm-devel/cover/1564763622-31752-1-git-send-email-bmarzins at redhat.com/
> 
> One of the intentions of Benjamin's patch (support for marginal paths)
> is to support the FPIN events we receive from the fabric.
> On receiving an FPIN-LI, our intention was to place all the affected
> paths into the marginal path group.
I think this should all be done in kernel space, as we're talking sub-
millisecond timings here when it comes to FPINs and the expected reaction
time. I may be wrong, but I'll leave that up to you.
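
For reference, the marginal path handling that already exists in
multipathd is driven by settings along these lines in multipath.conf (the
values are only an example and the option names are from memory, so
please check the multipath.conf man page):

defaults {
        # sample window and error-rate threshold for the io_err_stat checks
        marginal_path_err_sample_time        120
        marginal_path_err_rate_threshold     10
        # how long a path stays marginal before being rechecked
        marginal_path_err_recheck_gap_time   300
        # two failures within this many seconds trigger the marginal check
        marginal_path_double_failed_time     60
}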
> 
> Below are the 4 types of descriptors returned in an FPIN:
> 
> • Link Integrity (LI): some error on a link that affected frames; this
> is the main one for a "flaky path".
> 
> • Delivery Notification (DN): something explicitly knew about a dropped
> frame and is reporting it. Usually, something like a CRC error means you
> can't trust the frame header, so it's an LI error. But if you do have a
> valid frame and drop it anyway, such as on a fabric edge timer (don't
> queue it for more than 250-600 ms), then it becomes a DN type. Could be
> a flaky path, but not necessarily.
> 
> • Congestion (CN): the fabric is saying it's congested sending to "your"
> port. Meaning if a host receives it, the fabric is saying it has more
> frames for the host than the host is pulling in, so it's backing up the
> fabric. What should happen is that the load generated by the host should
> be lowered, but it's across all targets, and not all targets are
> necessarily in the MPIO path list.
> 
> • Peer Congestion (PCN): this goes along with CN in that the fabric is
> now telling the other devices in the zone sending traffic to that
> congested port that the other port is backing up. The idea is that these
> peers send less load to the congested port. Shouldn't really tie to
> MPIO. Some of the current thinking is that targets could see this and
> reduce their transmission rate to a host down to the link speed of the
> host.
> 
> On receiving the congestion notifications, our intention is to slow down
> the workload from the host gradually until it stops receiving the
> congestion notifications.
> We still need to validate how we can achieve this decrease in workload
> with the help of dm-multipath.
Would it be possible to piggyback on the service-time path selector for
this where it pertains to latency?
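E.g. with nothing more than the existing selector knob in multipath.conf
(just a sketch, nothing FPIN-specific here), the selector would already
shift IO away from paths whose service time degrades once the driver
throttles them:

defaults {
        # dispatch IO to the path with the lowest estimated service time
        path_selector       "service-time 0"
}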

Another thing is that at some stage the IO queueing decision needs to
take the various FPIN descriptors into account. A remote delivery
notification due to slow-drain behaviour is very different from ISL
congestion or any physical issue.
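
Roughly what I have in mind, as a coarse sketch only; the descriptor tag
values are my recollection of FC-LS-5 / fc_els.h, so treat them as
placeholders rather than the real definitions:

#include <stdint.h>

enum fpin_dtag {                        /* FPIN descriptor tags (verify!) */
	DTAG_LINK_INTEGRITY   = 0x00020001,
	DTAG_DELIVERY         = 0x00020002,
	DTAG_PEER_CONGESTION  = 0x00020003,
	DTAG_CONGESTION       = 0x00020004,
};

enum mp_action {
	MP_NONE,                /* log only */
	MP_MARK_MARGINAL,       /* move affected paths to the marginal group */
	MP_THROTTLE_HOST,       /* host-wide: lower queue depth / IO rate */
};

/* Decide what the multipath layer should do for one FPIN descriptor. */
static enum mp_action fpin_action(uint32_t dtag)
{
	switch (dtag) {
	case DTAG_LINK_INTEGRITY:
		/* flaky link: only the paths through the reporting port */
		return MP_MARK_MARGINAL;
	case DTAG_DELIVERY:
		/* valid frames dropped (e.g. edge timeout): may or may not
		 * be a path problem, but worth demoting */
		return MP_MARK_MARGINAL;
	case DTAG_CONGESTION:
		/* fabric says *we* are the slow drain: reduce load across
		 * all targets, not just the ones in this map */
		return MP_THROTTLE_HOST;
	case DTAG_PEER_CONGESTION:
		/* a peer is congested: mostly a target-side concern */
		return MP_NONE;
	default:
		return MP_NONE;
	}
}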
> 
> As Hannes mentioned in his earlier mail, our primary goal is that the
> admin should first be _alerted_, by having FPINs show up in the message
> log, that his fabric is not performing well.
> 
This is a bit of a reactive approach that should be a secondary
objective. Having been in storage/FC support for 20 years, I know that
most admins are not really responsive to this, and taking action based
on event log entries takes a very, very long time. From an operations
perspective, any sort of manual action should be avoided as much as
possible.
> 
> Regards,
> Muneendra.
> 

