Hello Muneendra,

On Wed, 2021-03-31 at 16:18 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> Below are my replies.
>
> > If there was any discussion, I haven't been involved :-)
> >
> > I haven't looked into FPIN much so far. I'm rather sceptical of its
> > usefulness for dm-multipath. Being a property of FC-2, FPIN works at
> > least two layers below dm-multipath. dm-multipath is agnostic of
> > protocol and transport properties by design. User-space multipathd
> > can cross these layers and tune dm-multipath based on lower-level
> > properties, but such actions have rather large latencies.
> >
> > As you know, dm-multipath has 3 switches for routing IO via
> > different paths:
> >
> > 1) priority groups,
> > 2) path status (good / failed),
> > 3) path selector algorithm.
> >
> > 1) and 2) are controlled by user space, and have high latency.
> >
> > The current "marginal" concept in multipathd watches paths for
> > repeated failures, and configures the kernel to avoid using paths
> > that are considered marginal, using methods 1) and 2). This is a
> > very-high-latency algorithm that changes state on the time scale of
> > minutes. There is no concept for "delaying" or "pausing" IO on paths
> > on a short time scale.
> >
> > The only low-latency mechanism is 3). But it's block level; no
> > existing selector looks at transport-level properties.
> >
> > That said, I can quite well imagine a feedback mechanism based on
> > throttling or delays applied in the FC drivers. For example, if a
> > remote port was throttled by the driver in response to FPIN
> > messages, its bandwidth would decrease, and a path selector like
> > "service-time" would automatically assign less IO to such paths.
> > This wouldn't need any changes in dm-multipath or multipath-tools;
> > it would work entirely on the FC level.
>
> [Muneendra] Agreed.
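
As an aside, for anyone following the thread: the three mechanisms Martin
lists, and the marginal path detection he mentions, are all driven from
multipath.conf today. A rough illustration only - the values below are
examples rather than recommendations, and the option names should be
checked against your multipath-tools version:

# Illustrative multipath.conf fragment (example values only)
defaults {
        # 3) path selector: "service-time 0" already steers IO away from
        #    paths with higher estimated in-flight service time, so
        #    driver-level throttling would be picked up indirectly
        path_selector "service-time 0"

        # 1)/2) are what the current marginal path detection drives:
        # roughly, a path that fails twice within 60s gets IO error
        # accounting over a 120s sample window; if its error rate exceeds
        # 10/1000 it is treated as marginal and rechecked after 300s
        marginal_path_double_failed_time        60
        marginal_path_err_sample_time           120
        marginal_path_err_rate_threshold        10
        marginal_path_err_recheck_gap_time      300
}

None of this reacts on FPIN time scales, of course - it is the
minutes-scale machinery Martin describes - but it is presumably the knob
set that any FPIN-driven grouping would plug into.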

I think the only way the FC drivers can act on that throttling idea is by
delaying the R_RDY primitives, resulting in fewer credits being available
for the remote side to use. That only works at the link layer, not
fabric-wide. It cannot change link speed at all, as that would bounce a
port and result in all sorts of state changes. That being said, this is
already the existing behaviour and not really tied to FPINs. The goal of
the FPIN method was to provide a more proactive mechanism and inform the
OS layer of fabric issues, so it could act on them by adjusting the IO
profile.

> > Talking about improving the current "marginal" algorithm in
> > multipathd, and knowing that it's slow, FPIN might provide
> > additional data that would be good to have. Currently, multipathd
> > only has two inputs: "good<->bad" state transitions, based either on
> > kernel I/O errors or path checker results, and failure statistics
> > from multipathd's internal "io_err_stat" thread, which only reads
> > sector 0. This could obviously be improved, but there may actually
> > be lower-hanging fruit than evaluating FPIN notifications (for
> > example, I've pondered utilizing the kernel's blktrace functionality
> > to detect unusually long IO latencies or bandwidth drops).
> >
> > Talking about FPIN, is it planned to notify user space about such
> > fabric events, and if yes, how?
>
> [Muneendra] Yes. FC drivers, when receiving FC FPIN ELSes, call a SCSI
> transport routine with the FPIN payload. The transport pushes this out
> as an "event" via netlink. An app bound to the local address used by
> the SCSI transport can receive the event and parse it.
>
> Benjamin has added a marginal path group (multipath marginal
> pathgroups) in dm-multipath:
> https://patchwork.kernel.org/project/dm-devel/cover/1564763622-31752-1-git-send-email-bmarzins@redhat.com/
>
> One of the intentions of Benjamin's patch (support for marginal paths)
> is to support the FPIN events we receive from the fabric.
> On receiving an FPIN-LI, our intention is to place all the affected
> paths into the marginal path group.

I think this should all be done in kernel space, as we're talking
sub-millisecond timings here when it comes to FPINs and the expected
reaction time. I may be wrong, but I'll leave that up to you.
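
For what it's worth, receiving those events in user space is cheap to
prototype. Below is a rough, untested sketch of such a listener; it
assumes the NETLINK_SCSITRANSPORT protocol and the fc_nl_event layout
from the kernel's uapi headers, and the FPIN event code is mirrored from
the kernel-internal scsi_transport_fc.h, so verify the value against
your tree:

/* fpin_listen.c - rough sketch only; build with: cc fpin_listen.c */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <scsi/scsi_netlink.h>
#include <scsi/scsi_netlink_fc.h>

/* Mirrors FCH_EVT_LINK_FPIN from the kernel-internal scsi_transport_fc.h;
 * it is not exported to user space, so double-check the value. */
#define FCH_EVT_LINK_FPIN 0x501

int main(void)
{
        struct sockaddr_nl local = {
                .nl_family = AF_NETLINK,
                .nl_groups = ~0U,  /* all SCSI transport multicast groups */
        };
        char buf[8192];
        int fd, len;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SCSITRANSPORT);
        if (fd < 0 || bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
                perror("netlink");
                return 1;
        }

        while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
                struct nlmsghdr *nlh;

                for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
                     nlh = NLMSG_NEXT(nlh, len)) {
                        struct fc_nl_event *ev = NLMSG_DATA(nlh);

                        if (nlh->nlmsg_type != SCSI_TRANSPORT_MSG)
                                continue;
                        if (ev->event_code == FCH_EVT_LINK_FPIN)
                                printf("FPIN on host%u, %u bytes of ELS payload\n",
                                       ev->host_no, ev->event_datalen);
                }
        }
        close(fd);
        return 0;
}

Something along these lines is presumably what multipathd itself would
grow if the FPIN-LI handling ends up in user space.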

> Below are the 4 types of descriptors returned in an FPIN:
>
> • Link Integrity (LI): some error on a link that affected frames. This
>   is the main one for a "flaky path".
> • Delivery Notification (DN): something explicitly knew about a dropped
>   frame and is reporting it. Usually, something like a CRC error means
>   you can't trust the frame header, so it's an LI error. But if you do
>   have a valid frame and drop it, for example on a fabric edge timer
>   (don't queue it for more than 250-600 ms), then it becomes a DN type.
>   Could be a flaky path, but not necessarily.
> • Congestion (CN): the fabric is saying it's congested sending to
>   "your" port. Meaning, if a host receives it, the fabric is saying it
>   has more frames for the host than the host is pulling in, so it's
>   backing up the fabric. What should happen is that the host lowers its
>   load - but that is across all targets, and not all targets are
>   necessarily in the mpio path list.
> • Peer Congestion (PCN): this goes along with CN, in that the fabric is
>   now telling the other devices in the zone sending traffic to that
>   congested port that the port is backing up. The idea is that these
>   peers send less load to the congested port. This shouldn't really tie
>   into mpio. Some of the current thinking is that targets could see
>   this and reduce their transmission rate to a host down to the link
>   speed of the host.
>
> On receiving congestion notifications, our intention is to gradually
> slow down the workload from the host until it stops receiving them.
> We still need to work out how this decrease in workload can be
> achieved with the help of dm-multipath.

Would it be possible to piggyback on the service-time path selector for
this, where it pertains to latency?

Another thing is that at some stage the IO queueing decision needs to
take into account the different FPIN descriptors. A remote delivery
notification due to slow-drain behaviour is very different from ISL
congestion or any physical issue. (A rough sketch of telling the
descriptor types apart is at the end of this mail.)

> As Hannes mentioned in his earlier mail, our primary goal is that the
> admin should first be _alerted_, by having FPINs show up in the
> message log, so that the admin knows his fabric is not performing
> well.

This is a bit of a reactive approach that should be a secondary
objective. Having been in storage/FC support for 20 years, I know that
most admins are not really responsive to this, and taking action based
on event entries takes a very, very long time. From an operations
perspective, any sort of manual action should be avoided as much as
possible.

> Regards,
> Muneendra.
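
One more note on the descriptor types, to make the distinction concrete
(this is the sketch referred to above): they arrive as TLV descriptors
inside the FPIN ELS payload, so whoever consumes the event - the kernel,
the transport class or multipathd - can tell LI, DN, CN and PCN apart
before deciding how to react. A rough, untested walk of that descriptor
list, assuming the struct fc_els_fpin / struct fc_tlv_desc definitions
and the ELS_DTAG_* tags from the kernel's uapi scsi/fc/fc_els.h;
classify_fpin() is just a hypothetical helper name:

#include <stdio.h>
#include <arpa/inet.h>        /* ntohl() */
#include <scsi/fc/fc_els.h>   /* fc_els_fpin, fc_tlv_desc, ELS_DTAG_* */

/* Hypothetical helper: walk the descriptor list of a received FPIN ELS
 * payload and report which notification types it carries. Sketch only;
 * real code needs stricter bounds checking. */
static void classify_fpin(const void *payload, size_t payload_len)
{
        const struct fc_els_fpin *fpin = payload;
        const char *desc = (const char *)fpin->fpin_desc;
        const char *end = desc + ntohl(fpin->desc_len);

        if ((size_t)(end - (const char *)payload) > payload_len)
                return;                 /* truncated payload */

        while (desc + sizeof(struct fc_tlv_desc) <= end) {
                const struct fc_tlv_desc *tlv = (const void *)desc;

                switch (ntohl(tlv->desc_tag)) {
                case ELS_DTAG_LNK_INTEGRITY:  /* LI: flaky-path candidate */
                        printf("FPIN-LI: candidate for marginal path handling\n");
                        break;
                case ELS_DTAG_DELIVERY:       /* DN: frame explicitly dropped */
                        printf("FPIN-DN: delivery notification\n");
                        break;
                case ELS_DTAG_CONGESTION:     /* CN: our own port backs up the fabric */
                        printf("FPIN-CN: reduce load across all targets\n");
                        break;
                case ELS_DTAG_PEER_CONGEST:   /* PCN: a peer port is congested */
                        printf("FPIN-PCN: peer congestion\n");
                        break;
                default:
                        printf("unknown descriptor tag 0x%08x\n",
                               ntohl(tlv->desc_tag));
                }
                /* desc_len excludes the tag and length fields themselves */
                desc += sizeof(struct fc_tlv_desc) + ntohl(tlv->desc_len);
        }
}

Whether that classification then lives in the driver, the transport class
or multipathd is exactly the open question above.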