[Cluster-devel] [PATCHv3 dlm/next 7/8] fs: dlm: add reliable connection if reconnect

Alexander Ahring Oder Aring aahringo at redhat.com
Mon Apr 12 15:21:16 UTC 2021


Hi,

On Fri, Apr 9, 2021 at 4:32 PM Guillaume Nault <gnault at redhat.com> wrote:
...
> >
> > Yes, there is one thing missing. I detected a synchronization issue
> > because the node membership of a lockspace is handled by a different
> > protocol in user space. This protocol declares the actual peer of the
> > DLM protocol. In short, there are two protocols here: one handles the
> > membership of peers and the other handles the actual operations. There
> > is a missing synchronization between operations and membership, which
> > I solved with a DLM_FIN message.
>
> Thanks, I'll re-read the parts about DLM_FIN with this explanation in
> mind.
>

okay. Thank you.

> > > I feel that this patch goes very far into re-implementing TCP-like
> > > features. For example, why is fast-retransmit even needed? DLM seems to
> > > always use a transport protocol that guarantees reliable and in-order
> > > delivery. Therefore, duplicate DLM acknowledgements can't happen on an
> > > established connection. Even when reconnecting, it'd be the sender's
> > > responsibility to resend the last unacknowledged data first. How could
> > > this lead to holes in the DLM sequence numbers on the receiver's side?
> > >
> >
> > I agree it's a TCP re-implementation on the application layer.
> >
> > Fast retransmit is not strictly needed, just as it is not strictly
> > needed in TCP. However, it resolves drops faster; in DLM and GFS2 you
> > would otherwise see some filesystem syscalls stuck until the drop is
> > resolved.
>
> I was not talking about fast retransmits in general. My question was:
> how could a DLM fast retransmit even be triggered? This can only happen
> when some DLM messages get lost or are received out of order. But the
> transport layer guarantees lossless and in order delivery. Even in case
> of a reconnection, I can't see how the peer could get holes in its
> receive queue. What am I missing?
>

It can be triggered when DLM issues further operations, and thus new
messages, after a reconnect and a drop. The receiver will then send
back multiple ack frames carrying the same sequence number (the
sequence number it still expects), which indicates the drop to the
sender. Once the sender has counted the same ack sequence number
multiple times, a retransmission is triggered, just like in [0].
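
Roughly, the sender side of that logic looks like this (a simplified
sketch of the idea, not the actual midcomms code; the threshold value
and the retransmit_from() helper are made up for illustration):

#include <stdint.h>

#define DUP_ACK_THRESHOLD 3	/* assumed value, for illustration only */

/* hypothetical helper: resend all queued messages starting at seq */
void retransmit_from(uint32_t seq);

struct sender_state {
	uint32_t last_ack;	/* last ack sequence number seen */
	unsigned int dup_acks;	/* how often that same ack was repeated */
};

static void handle_ack(struct sender_state *s, uint32_t ack_seq)
{
	if (ack_seq == s->last_ack) {
		/* the receiver keeps asking for the same sequence
		 * number, so the message carrying it was likely lost */
		if (++s->dup_acks >= DUP_ACK_THRESHOLD) {
			retransmit_from(ack_seq);
			s->dup_acks = 0;
		}
	} else {
		/* new data was acked, reset the duplicate counter */
		s->last_ack = ack_seq;
		s->dup_acks = 0;
	}
}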

> > > It seems to me that the only time DLM might need to retransmit data, is
> > > when recovering from a connection failure. So why can't we just resend
> > > unacknowledged data at reconnection time? That'd probably simplify the
> > > code a lot (no need to maintain a retransmission timeout on TX, no need
> > > to handle sequence numbers that are in the future on RX).
> > >
> >
> > I can try to remove the timer and timeout and use the above approach
> > of retransmitting at reconnect. Then I will test it again and report
> > back whether it works or what other problems we run into.
>
> I see you've posted a new version of the patch series. I'll look at it
> soon.
>

ok.

> > > Also, couldn't we set the DLM sequence numbers in
> > > dlm_midcomms_commit_buffer_3_2() rather than using a callback function
> > > in dlm_lowcomms_new_buffer()?
> > >
> >
> > That has to do with the internal buffer allocator. The order is
> > defined in dlm_lowcomms_new_buffer() and there is an internal lock
> > which needs to be held there. I can't do this without somehow making
> > the lock accessible in dlm_midcomms_commit_buffer_3_2(). The code is
> > also optimized so that the lock isn't held during kmalloc().
>
> I missed that. Thanks for the explanation.
>

ok, no problem. It took some time for me to figure that out as well.
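
For reference, the ordering constraint looks roughly like this (a
simplified sketch of the pattern, not the actual lowcomms code; the
names here are made up for illustration):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct msg_buf {
	struct list_head list;
	u32 seq;
	/* payload follows */
};

static LIST_HEAD(send_queue);
static DEFINE_SPINLOCK(send_lock);
static u32 next_seq;

/* passed as cb to new_buffer(); runs with send_lock held, so the
 * assigned sequence numbers follow the same order as the buffers on
 * send_queue */
static void set_seq_cb(struct msg_buf *buf)
{
	buf->seq = next_seq++;
}

static struct msg_buf *new_buffer(size_t len, void (*cb)(struct msg_buf *))
{
	/* allocate outside the lock, as an optimization */
	struct msg_buf *buf = kmalloc(sizeof(*buf) + len, GFP_NOFS);

	if (!buf)
		return NULL;

	spin_lock(&send_lock);
	list_add_tail(&buf->list, &send_queue);
	cb(buf);	/* sequence number assigned under the lock */
	spin_unlock(&send_lock);

	return buf;
}

The later commit step happens outside that critical section, which is
why the sequence number can't simply be set in
dlm_midcomms_commit_buffer_3_2() without exposing the lock there.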

> > > Finally, I wonder if we could avoid adding DLM sequence numbers
> > > entirely. Have you considered using the TCP_REPAIR infrastructure to
> > > retrieve the send queue of the failed socket and resend that data once
> > > the new socket is connected?
> > >
> >
> > Yes, I looked into TCP_REPAIR at first and I agree it can be used to
> > solve this problem. However, for TCP_REPAIR to become part of a more
> > generic solution, some additional handling would be needed, e.g.
> > additional socket options to let the application layer save state
> > before errors are delivered. I am also concerned about how it would
> > work with haproxy, etc. It also has some advantages, such as restoring
> > the window size. I was thinking about submitting a proposal to the
> > next netdevconf after doing some experiments with it.
>
> Well, I was thinking of using TCP_REPAIR only to retrieve the send
> queue. I didn't mean to avoid the handshake like in the CRIU use case.
> A TCP connection break is a sign that something went wrong between the
> peers, so a full reconnection is in order IMHO.
>

I am somewhat worried about the data format in those "queues" and
whether we can even figure out the message boundaries. I say "queues"
because, as far as I remember, TCP has two of them, a write queue and
a retransmit queue. I am not sure whether the queued data includes TCP
headers or not; that is something to figure out. Otherwise we would
need to parse TCP here?
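
To make that concrete, this is roughly how I would expect the send
queue to be read out with TCP_REPAIR (an untested sketch, needs
CAP_NET_ADMIN; whether the returned bytes line up with DLM message
boundaries is exactly the open question above):

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR		19
#define TCP_REPAIR_QUEUE	20
#endif
#ifndef TCP_SEND_QUEUE
#define TCP_SEND_QUEUE		2
#endif

/* dump whatever is still queued for transmission on a (broken) socket */
static ssize_t dump_send_queue(int fd, void *buf, size_t len)
{
	int on = 1, queue = TCP_SEND_QUEUE;

	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
		return -errno;
	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue,
		       sizeof(queue)) < 0)
		return -errno;

	/* in repair mode recv() reads from the selected queue instead of
	 * the network; MSG_PEEK leaves the queued data in place */
	return recv(fd, buf, len, MSG_PEEK | MSG_DONTWAIT);
}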

> > We decided to solve this problem at the application layer first, and
> > then look into how to make a more generic solution. For my part, I am
> > not sure how cifs deals with reconnects, or whether it just allows drops.
>
> Indeed, it'd be interesting to know how other protocols handle this
> case.
>

Yes, this protocol is very picky about any drop. Since we are talking
about lock operations over the network, drops can be fatal. Pablo's
review mentions a lot of "security" related concerns and suggests
disconnecting peers when they occur. We work here in the cluster
world, which operates a little differently: what we can do is send an
event to the cluster manager to fence nodes. However, I am not aware
that we currently trigger such handling when something runs outside
the bounds of queues, sequence numbers, etc. We need to trust the
network anyway, because deadlocks can easily be triggered from
outside, I guess. Trusting the network does not mean letting things
run wild; for that case we could send events to fence the specific
node.

- Alex

[0] https://www.isi.edu/nsnam/DIRECTED_RESEARCH/DR_WANIDA/DR/images/fast-retransmit-2.gif



