[dm-devel] dm-multipath - IO queue dispatch based on FPIN Congestion/Latency notifications.

Martin Wilck martin.wilck at suse.com
Wed Mar 31 09:57:29 UTC 2021


On Wed, 2021-03-31 at 09:25 +0200, Hannes Reinecke wrote:
> Hi Erwin,
> 
> On 3/31/21 2:22 AM, Erwin van Londen wrote:
> > Hello Muneendra, Benjamin,
> > 
> > The FPIN options that have been developed cover a whole plethora
> > of events and do not merely flag paths as being in a marginal
> > state. The MPIO layer could utilise the various triggers, such as
> > congestion and latency, rather than using only a marginal state as
> > the decisive point. If a path is somewhat congested, the amount of
> > IO dispersed over that path could simply be reduced by a flexible
> > margin, depending on how often FPINs are received and which ones.
> > If, for instance, an FPIN is received indicating that an upstream
> > port is throwing physical errors, you may exclude that port
> > entirely from IO queueing. If it is a latency-related problem
> > where credit shortages come into play, you may just need to queue
> > very small IOs to it; the SCSI CDB will tell the size of the IO.
> > Congestion notifications may be used to add an artificial delay,
> > reducing the workload on these paths and scheduling the IO on
> > others.
> > 
> As correctly noted, FPINs come with a variety of options.
> And I'm not certain we can handle everything correctly; a degraded
> path is simple, but for congestion there is only _so_ much we can do.
> The typical cause of congestion is, say, a 32G host port talking to a
> 16G (or even 8G) target port _and_ a 32G target port.
> 
> So the host cannot 'tune down' its link to 8G; doing so would impact
> performance on the 32G target port.
> (And we would suffer reverse congestion whenever that target port
> sends frames.)
> 
> And throttling things on the SCSI layer only helps _so_ much, as the
> real congestion is due to the speed with which the frames are
> sequenced onto the wire. That is not something we can control from
> the OS.
> 
> From another POV this is arguably a fabric mis-design; so it _could_
> be alleviated by separating out the ports with lower speeds into
> their own zone (or even onto a separate SAN); that would trivially
> make the congestion go away.
> 
> But for that the admin first has to be _alerted_, and this really is
> my primary goal: having FPINs show up in the message log, to alert
> the admin that his fabric is not performing well.
> 
> A second step will be to massage FPINs into DM multipath, and have
> them influence the path priority or path status. But it is currently
> under discussion how this could best be integrated.

If there was any discussion, I haven't been involved :-) 

I haven't looked into FPIN much so far. I'm rather sceptical about its
usefulness for dm-multipath. Being a property of FC-2, FPIN works at
least two layers below dm-multipath, which is agnostic to protocol and
transport properties by design. User space multipathd can cross these
layers and tune dm-multipath based on lower-level properties, but such
actions have rather large latencies.

As you know, dm-multipath has 3 switches for routing IO via different
paths:

 1) priority groups,
 2) path status (good / failed),
 3) the path selector algorithm.

1) and 2) are controlled by user space, and have high latency.
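
For illustration, this is roughly how user space exercises 1) and 2)
via the device-mapper message interface (the map name and path devt
below are made up):

   # fail / reinstate an individual path (switch 2)
   dmsetup message mpatha 0 "fail_path 8:32"
   dmsetup message mpatha 0 "reinstate_path 8:32"
   # activate a different priority group (switch 1)
   dmsetup message mpatha 0 "switch_group 2"

multipathd does the equivalent through libdevmapper rather than by
shelling out, but the round trip through user space is why these
switches are slow.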

The current "marginal" concept in multipathd watches paths for repeated
failures, and configures the kernel to avoid using paths that are
considered marginal, using methods 1) and 2). This is a very-high-
latency algorithm that changes state on the time scale of minutes.
There is no concept for "delaying" or "pausing" IO on paths on short
time scale.
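
For reference, this detection is tuned with the marginal_path_*
settings in multipath.conf; roughly, two failures within
marginal_path_double_failed_time seconds start io_err_stat sampling
for marginal_path_err_sample_time seconds, and the path is treated as
marginal if the observed error rate exceeds
marginal_path_err_rate_threshold (the values below are purely
illustrative):

   defaults {
           marginal_path_double_failed_time     10
           marginal_path_err_sample_time       120
           marginal_path_err_rate_threshold     10
           marginal_path_err_recheck_gap_time  300
   }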

The only low-latency mechanism is 3). But it works at the block level;
no existing selector looks at transport-level properties.
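
To illustrate: the in-kernel selector interface looks roughly like the
following (simplified from dm-path-selector.h, from memory, so details
may differ); the hooks only ever see a path and a byte count, nothing
transport-specific:

   /* simplified sketch of struct path_selector_type */
   struct path_selector_type {
           const char *name;
           /* choose a path for the next IO, knowing only its size */
           struct dm_path *(*select_path)(struct path_selector *ps,
                                          size_t nr_bytes);
           /* per-IO accounting, again purely in bytes */
           int (*start_io)(struct path_selector *ps,
                           struct dm_path *path, size_t nr_bytes);
           int (*end_io)(struct path_selector *ps,
                         struct dm_path *path, size_t nr_bytes);
           /* ... create/destroy/add_path/fail_path/reinstate_path ... */
   };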

That said, I can quite well imagine a feedback mechanism based on
throttling or delays applied in the FC drivers. For example, if a
remote port were throttled by the driver in response to FPIN messages,
its bandwidth would decrease, and a path selector like "service-time"
would automatically assign less IO to such paths. This wouldn't need
any changes in dm-multipath or multipath-tools; it would work entirely
at the FC level.
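
That's because (simplifying the logic of dm-service-time) the selector
estimates each path's service time from the bytes currently in flight
plus the incoming IO, weighted by a relative throughput value, and
picks the minimum; a throttled path keeps more bytes in flight and
automatically receives less new IO. A sketch of the comparison:

   /* sketch: would path a serve nr_bytes sooner than path b? */
   struct st_path {
           size_t in_flight_bytes;        /* outstanding IO on the path */
           unsigned relative_throughput;  /* configured weight */
   };

   static int st_better(const struct st_path *a, const struct st_path *b,
                        size_t nr_bytes)
   {
           /* cross-multiply instead of dividing by the throughput */
           unsigned long long ta = (unsigned long long)
                   (a->in_flight_bytes + nr_bytes) * b->relative_throughput;
           unsigned long long tb = (unsigned long long)
                   (b->in_flight_bytes + nr_bytes) * a->relative_throughput;
           return ta < tb;
   }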

As for improving the current "marginal" algorithm in multipathd, which
we know is slow, FPIN might provide additional data that would be good
to have. Currently, multipathd has only two inputs: "good<->bad" state
transitions, based either on kernel I/O errors or on path checker
results, and failure statistics from multipathd's internal
"io_err_stat" thread, which only reads sector 0. This could obviously
be improved, but there may actually be lower-hanging fruit than
evaluating FPIN notifications (for example, I've pondered utilizing the
kernel's blktrace functionality to detect unusually long IO latencies
or bandwidth drops).
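
To give an idea of the latter (device name made up): a blktrace
capture of a path device can be post-processed with btt to obtain
per-IO latency statistics, which multipathd could evaluate per path:

   blktrace -d /dev/sdc -w 30 -o mp   # capture 30s of block IO events
   blkparse -i mp -d mp.bin           # merge per-CPU traces into one file
   btt -i mp.bin                      # latency statistics (Q2C, D2C, ...)

But as I said, that's just an idea I've been pondering.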

Talking about FPIN: is it planned to notify user space about such
fabric events, and if so, how?

Thanks,
Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer