[Cluster-devel] [PATCHv3 dlm/next 7/8] fs: dlm: add reliable connection if reconnect

Mon Apr 12 15:35:30 UTC 2021

Hi,

On Fri, Apr 9, 2021 at 5:11 PM Guillaume Nault <gnault at redhat.com> wrote:
>
> On Mon, Apr 05, 2021 at 04:29:10PM -0400, Alexander Ahring Oder Aring wrote:
> > Hi,
> >
> > On Mon, Apr 5, 2021 at 1:33 PM Alexander Ahring Oder Aring
> > <aahringo at redhat.com> wrote:
> > >
> > > Hi,
> > >
> > > On Sat, Apr 3, 2021 at 11:34 AM Alexander Ahring Oder Aring
> > > <aahringo at redhat.com> wrote:
> > > >
> > > ...
> > > >
> > > > > It seems to me that the only time DLM might need to retransmit data, is
> > > > > when recovering from a connection failure. So why can't we just resend
> > > > > unacknowledged data at reconnection time? That'd probably simplify the
> > > > > code a lot (no need to maintain a retransmission timeout on TX, no need
> > > > > to handle sequence numbers that are in the future on RX).
> > > > >
> > > >
> > > > I can try to remove the timer, timeout and do the above approach to
> > > > retransmit at reconnect. Then I test it again and I will report back
> > > > to see if it works or why we have other problems.
> > > >
> > >
> > > I have an implementation of this running and so far I don't see any problems.
> >
> > There is a problem but it's related to the behaviour how reconnections
> > are triggered. The whole communication can be stuck because the send()
> > triggers a reconnection if not connected anymore. Before, the timer
> > was triggering some send() and this was triggering a reconnection in a
> > periodic way. Therefore we never had any stuck situation where nobody
> > was sending anything anymore. It's a rare case but I am currently
> > running into it. However I think I need to change how the
> > reconnections are triggered with some "forever periodic try" which
> > should solve this issue.
>
> Would it be sufficient to detect socket errors to avoid this problem?
> For example by letting lowcomms_error_report() do the reconnection when
> necessary?

I have something like that as a follow-up patch; it also changes the
lowcomms workqueue handling and removes the "othercon" race paradigm.
Sometimes two connections get established because of a race, which I
also mentioned in midcomms in the context of version detection. In
short, every node wants to connect immediately after the cluster
manager reports membership to the kernel dlm implementation. However,
I have ignored the reconnection problem for now, as it occurs rarely,
if ever, but I think it's still there.
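
To make the idea from this thread concrete (resend unacknowledged data at
reconnect, plus a periodic reconnect attempt so nothing gets stuck when
nobody calls send()), here is a rough sketch. It is plain Python, not
kernel C, and it is not the dlm code: ReliableSender, FakeTransport,
periodic_tick() and on_ack() are all hypothetical names made up for
illustration, and the cumulative-ack scheme is an assumption.

```python
from collections import deque


class FakeTransport:
    """Hypothetical stand-in for a socket; 'up' simulates peer reachability."""

    def __init__(self):
        self.up = False
        self.wire = []  # everything that actually reached the peer

    def connect(self):
        if not self.up:
            raise ConnectionError("peer unreachable")

    def send(self, seq, payload):
        if not self.up:
            raise ConnectionError("connection lost")
        self.wire.append((seq, payload))


class ReliableSender:
    """Toy model: keep unacked messages, resend them after a reconnect."""

    def __init__(self, transport):
        self.transport = transport
        self.next_seq = 0
        self.unacked = deque()  # (seq, payload) in send order
        self.connected = False

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        # Always queue first, so the message survives a connection failure.
        self.unacked.append((seq, payload))
        if self.connected:
            self._transmit(seq, payload)
        return seq

    def _transmit(self, seq, payload):
        try:
            self.transport.send(seq, payload)
        except ConnectionError:
            # Only mark the failure here; reconnecting is the tick's job.
            self.connected = False

    def on_ack(self, acked_seq):
        # Assumed cumulative ack: drop everything up to acked_seq.
        while self.unacked and self.unacked[0][0] <= acked_seq:
            self.unacked.popleft()

    def periodic_tick(self):
        # Runs periodically, independent of send(), so a lost connection
        # is retried even when no new message is ever queued.
        if self.connected:
            return
        try:
            self.transport.connect()
        except ConnectionError:
            return  # try again on the next tick
        self.connected = True
        # No retransmission timer: resend everything still unacked, once.
        for seq, payload in list(self.unacked):
            self._transmit(seq, payload)
            if not self.connected:
                break
```

The receiver may see a sequence number twice after a reconnect, so in
this model it has to drop anything it already delivered. What it never
has to handle is a sequence number from the future, since the resend
always starts at the oldest unacked message.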

Thanks for your review.

- Alex