[Cluster-devel] [PATCHv2 dlm/next 0/8] fs: dlm: introduce dlm re-transmission layer

Alexander Ahring Oder Aring aahringo at redhat.com
Mon Mar 22 22:54:47 UTC 2021


Hi,

On Tue, Mar 16, 2021 at 2:37 PM Paolo Abeni <pabeni at redhat.com> wrote:
>
> Hello,
>
> On Fri, 2021-03-12 at 09:52 -0500, Alexander Ahring Oder Aring wrote:
> > On Thu, Mar 11, 2021 at 4:09 AM Paolo Abeni <pabeni at redhat.com> wrote:
> > > - DEFAULT_BUFFER_SIZE == LOWCOMMS_MAX_TX_BUFFER_LEN in current net-
> > > next, so looks like a change below is actually a no op ?!?
> >
> > LOWCOMMS_MAX_TX_BUFFER_LEN is updated in patch 7/8:
> >
> > #define LOWCOMMS_MAX_TX_BUFFER_LEN     (DEFAULT_BUFFER_SIZE -
> > DLM_MIDCOMMS_OPT_LEN)
> >
> > whereas DEFAULT_BUFFER_SIZE is the maximum possible dlm message size
> > at the socket layer level. This is mainly limited because we can only
> > allocate a single page buffer for the dlm application layer, and the
> > application layer cannot deal with any additional logic to handle
> > more. It is also an architecture thing, because it needs to be the
> > minimal page size across all architectures to stay compatible between
> > them. I introduced a check which should report problems if an
> > architecture doesn't support it:
> >
> > BUILD_BUG_ON(PAGE_SIZE < DEFAULT_BUFFER_SIZE);
> >
> > LOWCOMMS_MAX_TX_BUFFER_LEN is the length after the encapsulation
> > header, i.e. the maximum size the dlm application layer can put into
> > one message, at application layer level.
> >
> > The names are misleading; it's on my list to update them. Any
> > recommendations (MAX_DLM_SOCKET_BUFFER_SIZE/MAX_DLM_APP_BUFFER_SIZE)?
>
> I'm almost literally the last person to ask this kind of question, as
> I always pick bad names... I would opt for:
>
> DLM_MAX_SOCKET_BUFSIZE
> DLM_MAX_APP_BUFSIZE
>

ok. I will prepare patches to change the naming later.
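
Something like this as a first sketch (just to illustrate the rename;
the 4096 is only a placeholder for whatever DEFAULT_BUFFER_SIZE
currently is):

  /* maximum possible dlm message size at the socket layer; must fit
   * into one page on every supported architecture
   */
  #define DLM_MAX_SOCKET_BUFSIZE 4096

  /* maximum payload the dlm application layer can put into one
   * message, after the midcomms encapsulation header
   */
  #define DLM_MAX_APP_BUFSIZE (DLM_MAX_SOCKET_BUFSIZE - DLM_MIDCOMMS_OPT_LEN)

  BUILD_BUG_ON(PAGE_SIZE < DLM_MAX_SOCKET_BUFSIZE);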

> > > - Could you please add more info WRT the reference to unaligned memory
> > > access in the code comments? Which field[s] is[are] subject to that?
> > >
> >
> > none of these fields. Some of the DLM messages have a variable sized
> > payload at the tail. It's a string value, and its length needs to be
> > 8 byte aligned. The kernel violates this, but there is also a
> > userspace API where the user defines this string value. One user is
> > lvm, which uses a unique identifier, the UUID of the volume, as the
> > string value; a UUID string is 36 chars. This ends in a message
> > boundary issue which will/can end in unaligned memory access at the
> > sending side (because of some internal buffer handling) and at the
> > receiving side. I tried to fix it by padding this string with zero
> > bytes, but the problem is more complicated because, as David
> > Teigland pointed out, we still need backwards compatibility. I would
> > like to schedule a fix for the issue for the next major version bump
> > of dlm.
>
> Uhm... It looks like the unaligned access was partially handled with
> memmove/put_unaligned, but that code was removed with
> commit df9e06b800ed025411ce9ab348299b3ef258cf8b - because the removed
> code did not cover all the use-cases.
>

That patch actually solved the unaligned memory access for all cases
at the receiving side. However, we still have problems at the sending
side (explanation below); that's why I reverted it. The memmove() does
fix the message boundary issue and ensures that a dlm message pointer
starts at an 8 byte aligned address, which is currently not
guaranteed.
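
Roughly, the reverted receive-side code did something like this
(simplified sketch with made-up names, not the exact code from the
commit):

  /* after parsing all complete dlm messages in the receive buffer,
   * move any leftover partial message back to the buffer start; the
   * buffer itself is 8 byte aligned, so the next message header is
   * guaranteed to start at an aligned address again
   */
  if (remaining)
          memmove(rx_buf, rx_buf + parsed, remaining);
  rx_offset = remaining;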

> I guess a complete solution based on get_unaligned/put_unaligned is too
> invasive?
>

the application layer doesn't use get_unaligned/put_unaligned, and
yes, changing it is too invasive. Remember that we have full control
of our buffer allocation and operate on the socket API, which uses the
buffers we allocate ourselves (no skb's). For this reason
get_unaligned/put_unaligned isn't really required, as we try to keep
the natural alignment of structures. We always know where our buffers
come from (I hope).
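
To illustrate what an invasive conversion would mean (hypothetical
example, not actual dlm code):

  struct dlm_header *hd = (struct dlm_header *)ptr;
  uint16_t length;

  /* today: relies on ptr being naturally aligned */
  length = le16_to_cpu(hd->h_length);

  /* converted: safe for any ptr, but every single field access in
   * the application layer would need the same treatment
   */
  length = get_unaligned_le16(&hd->h_length);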

> I'm not sure I follow correctly when you say the issue should be
> solved on the send side... What if a sender is buggy/uses an old
> implementation? Would that cause unaligned access?
>

The problem here is that dlm has an internal buffer allocator [0]. It
works like this:

1. A page buffer is allocated and a pointer to the start of the page
is returned, so a dlm message can be placed there.
2. As long as the page has not been handed over for transmission yet,
the _next_ available buffer on the same page is returned to the
application layer to fill in another message. This pointer is directly
at the end address of the last allocated buffer on the same page.
3. When a buffer is committed, it is scheduled for transmission, and
step 1 starts again with a new page buffer.

It's an optimization: processing one dlm message can trigger multiple
new messages, so we try to put a bunch of them into one page.
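
In (heavily simplified) pseudo-C the allocator behaviour is roughly
(names are made up, the real code is in lowcomms.c [0]):

  void *dlm_buf_get(struct writequeue *wq, int len)
  {
          struct writequeue_entry *e = last_entry(wq);

          /* step 2: page still open and enough room left: hand out
           * the space directly behind the last message; note that no
           * alignment is enforced here
           */
          if (e && !e->committed && e->end + len <= PAGE_SIZE)
                  return e->page_addr + e->end;

          /* steps 1/3: otherwise start over with a fresh page */
          e = new_wq_entry(wq);
          return e->page_addr;
  }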

Why do we run into unaligned memory access now? It's step 2 that
causes problems when we have message boundary issues, i.e. our dlm
message length is not divisible by 8, so the next message on the same
page starts at an unaligned address. If an architecture cannot deal
with unaligned memory access, we can already run into issues at the
sending side, because the application layer directly dereferences
structure fields at the returned buffer address.

We cannot control "how" a message with a message boundary issue is
processed at the sending and receiving side (the TCP stack decides
what to put into one segment), and this problem has existed for a long
time. If we received/processed only one message with boundary issues,
everything would be fine. We need to fix the sending side here so that
such messages are never sent; of course, the network is trusted
here... We assume in general that the network is trusted, because I
think you can trigger deadlock states simply by sending specific dlm
messages around, not just malformed ones.

I simply think nobody is using dlm on an architecture which cannot
deal with that unaligned memory access. I also told you something
wrong in the last mail: the kernel does _not_ send dlm messages with a
message boundary issue. It's some user space users, e.g. lvm, which
add a UUID string as the resource name. This string is stored in the
dlm message tail payload and is 36 chars long (not divisible by 8).
Such messages are very rare, but the problem still exists, and the
user space dlm requests can occur at any time.

> If a message is unaligned, will it stay unaligned on re-transmissions?

It depends on whether case 2 from the description above occurs and
the previously allocated message had a boundary issue.

> Otherwise, if the unaligned messages are really a corner case, given
> that this implements app-layer retransmission, what about simply
> detecting the unaligned message and dropping it? It sounds a bit
> scary to me that a remote peer would be able to trigger an oops (on
> some arches) with a badly crafted packet.
>

No, we can't do that (see below). I detected this issue by inserting
a WARN_ON() and had to remove it because this behaviour is part of the
protocol for now. I really think nobody is using it on an arch where
this causes big trouble, e.g. a kernel oops.

> It looks like the amount of memory the 'receive_queue' and 'send_queue'
> can use is still unbounded ?!? I suggest to drop at least a comment
> about why that should not be an issue (if that is really not a problem
> ;)

We add this reliability feature on reconnects because dropping a
message in DLM will/could end in a deadlock. If DLM drops a message,
it's considered to be in a broken state. However, DLM is protected
against the case where a node/peer gets fenced by a cluster manager,
e.g. somebody pressing the reset button. But reset button/fencing !=
reconnect. This feature helps us not to drop any DLM message just
because e.g. a TCP reset was received for whatever reason.

I will add a comment: we cannot drop any message in this protocol
(that's why midcomms exists), so the queues must be unbounded.
However, if the queue sizes grow too high we could fence the specific
node (because weird things are going on)... this is maybe something to
improve in the future.
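
Something like this, as a rough idea (the limit and the fence helper
are invented names, they don't exist):

  /* instead of dropping messages when a queue grows without bound,
   * fence the misbehaving node
   */
  if (atomic_read(&node->send_queue_cnt) > DLM_QUEUE_WARN_LIMIT) {
          log_print("node %d send queue overrun, fencing", node->nodeid);
          dlm_midcomms_fence_node(node);
  }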

>
> I had also an additional minor comment on patch 7/8, see there for the
> details...

ok.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git/tree/fs/dlm/lowcomms.c?h=next#n1370



