[Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

Tue May 30 17:40:13 UTC 2023

On Tue, May 30, 2023 at 4:08 PM Alexander Aring <aahringo at redhat.com> wrote:
> Hi,
>
> On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher <agruenba at redhat.com> wrote:
> >
> > On Tue, May 30, 2023 at 12:19 AM Alexander Aring <aahringo at redhat.com> wrote:
> > > Hi,
> > >
> > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > <agruenba at redhat.com> wrote:
> > > >
> > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring <aahringo at redhat.com> wrote:
> > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > this is not enough in case of using classic posix locks with threads and
> > > > > open filedescriptor posix locks.
> > > > >
> > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > send_list to recv_mlist) and write the result immediately back.
> > > > >
> > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > operations as the lock request is the same. This is also being made in
> > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > still can't find the exact lock request for a specific result if the
> > > > > lock request is the same, but if this is the case we don't care the
> > > > > order how the identical lock requests get their result back to grant the
> > > > > lock.
> > > >
> > > > When the recv_list contains multiple indistinguishable requests, this
> > > > can only be because they originated from multiple threads of the same
> > > > process. In that case, I agree that it doesn't matter which of those
> > > > requests we "complete" in dev_write() as long as we only complete one
> > > > request. We do need to compare the additional request fields in
> > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > We need to compare all of the fields that identify a request (optype,
> > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > "right" request (or in case there is more than one identical request,
> > > > a "suitable" request).
> > > >
> > >
> > > In my "definition" why this works is as you said the "identical
> > > request". There is a more deeper definition of "when is a request
> > > identical" and in my opinion it is here as: "A request A is identical
> > > to request B when they get granted under the same 'time'" which is all
> > > the fields you mentioned.
> > >
> > > Even with cancellation (F_SETLKW only) it does not matter which
> > > "identical request" you cancel because the kernel and user
> > > (dlm_controld) makes no relation between a lock request instance. You
> > > need to have at least the same amount of "results" coming back from
> > > user space as the amount you are waiting for a result for the same
> > > "identical request".
> >
> > That's not incorrect per se, but cancellations create an additional
> > difficulty: they can either succeed or fail. To indicate that a
> > cancellation has succeeded, a new type of message can be introduced
> > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > belong to a locking request that is being cancelled. When cancelling a
> > locking request fails, the kernel will see a "locking request granted"
> > message though, and when multiple identical locking requests are
> > queued and only some of them have been cancelled, it won't be obvious
> > which locking request a "locking request granted" message should be
> > assigned to anymore. You really don't want to mix things up in that
> > case.
> >
> > This complication makes it a whole lot more difficult to reason about
> > the correctness of the code. All that complexity is avoidable by
> > sticking with a fixed mapping of requests and replies (i.e., a unique
> > request identifier).
> >
> > To put it differently, you can shoot yourself in the foot and still
> > hop along on the other leg, but it may not be the best of all possible
> > ideas.
> >
>
> It makes things more complicated, I agree and the reason why this
> works now is because there are a lot of "dependencies". I would love
> to have an unique identifier to make it possible that we can follow an
> instance handle of the original lock request.
>
> * an unique identifier which also works with the async lock request of
> lockd case.

What's the lockd case you're referring to here, and why is it relevant
for the problem at hand?

> > > > The above patch description doesn't match the code anymore, and the
> > > > code doesn't fully revert the recv_list splitting of the previous
> > > > version.
> > > >
> > >
> > > This isn't a revert. Is it a new patch version, I did drop the
> > > recv_setlkw_list here, dropping in means of removing the
> > > recv_setlkw_list and handling everything in the recv_list. Although
> > > there might be a performance impact by splitting the requests in two
> > > lists as we don't need to jump over all F_SETLKW requests.
> > >
> > > > > [0] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> > > > > [1] https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> > > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> > > > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
> > > > >
> > > > > Cc: stable at vger.kernel.org
> > > > > Signed-off-by: Alexander Aring <aahringo at redhat.com>
> > > > > ---
> > > > > change since v2:
> > > > >  - don't split recv_list into recv_setlkw_list
> > > > >
> > > > >  fs/dlm/plock.c | 43 ++++++++++++++++++++++++++++++-------------
> > > > >  1 file changed, 30 insertions(+), 13 deletions(-)
> > > > >
> > > > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > > > index 31bc601ee3d8..53d17dbbb716 100644
> > > > > --- a/fs/dlm/plock.c
> > > > > +++ b/fs/dlm/plock.c
> > > > > @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user *u, size_t count,
> > > > >                 if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> > > > >                         list_del(&op->list);
> > > > >                 else
> > > > > -                       list_move(&op->list, &recv_list);
> > > > > +                       list_move_tail(&op->list, &recv_list);
> > > >
> > > > ^ This should be obsolete, but it won't hurt, either.
> > > >
> > >
> > > No it is necessary, I tested it and looked deeper into the reason.
> > > dlm_controld handles the lock requests in an ordered way over a
> > > select() mechanism, but it will not always write a result back when
> > > it's read the request out. This is the case for F_SETLKW but also for
> > > all other plock op requests, such as F_GETLK. Instead of writing the
> > > result back it will send it to corosync and the corosync select()
> > > mechanism will write the result back. Corosync will keep the order to
> > > write the result back. Due the fact that it's going through corosync
> > > multiple non F_SETLKW can be queued up in recv_list and need to be
> > > appended on the tail to later find the first entry which is non
> > > F_SETLKW to find the result.
> > >
> > > This ordered lock request read and write the result back (for non
> > > F_SETLKW ops) is not part of UAPI of dlm plock and dlm_controld did it
> > > always this way.
> >
> > This sounds pretty confused. Let's look at
> >
>
> As I said, yes it is a lot of specific handling of user space why this
> is actually working.
>
> > > > >                 memcpy(&info, &op->info, sizeof(info));
> > > > >         }
> > > > >         spin_unlock(&ops_lock);
> > > > > @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
> > > > >                 return -EINVAL;
> > > > >
> > > > >         spin_lock(&ops_lock);
> > > > > -       list_for_each_entry(iter, &recv_list, list) {
> > > > > -               if (iter->info.fsid == info.fsid &&
> > > > > -                   iter->info.number == info.number &&
> > > > > -                   iter->info.owner == info.owner) {
> > > > > -                       list_del_init(&iter->list);
> > > > > -                       memcpy(&iter->info, &info, sizeof(info));
> > > > > -                       if (iter->data)
> > > > > -                               do_callback = 1;
> > > > > -                       else
> > > > > -                               iter->done = 1;
> > > > > -                       op = iter;
> > > > > -                       break;
> > > > > +       if (info.wait) {
> > > >
> > .> > We should be able to use the same list_for_each_entry() loop for
> > > > F_SETLKW requests (which have info.wait set) as for all other requests
> > > > as far as I can see.
> > > >
> > >
> > > We can't match non F_SETLKW operations on all fields because F_GETLK
> > > will change some fields when it's handled in user space. This is the
> > > whole reason why the ordered handling is done here.
> >
> > I know that F_GETLK uses the l_type field to indicate the outcome of
> > the operation. But that happens in dlm_posix_get() when processing the
> > reply from dlm_controld; it doesn't affect info.optype or any other
> > fields in struct dlm_plock_info. So we actually can compare all of the
> > key fields in struct dlm_plock_info.
> >
>
> F_GETLK also uses start, end, etc. to tell the caller about the region
> which is in conflict. The region which is in conflict is returned into
> the result info struct. See [0] [1].
> Is this more clear now that other fields are affected?

Ah, that sucks.

Thanks,
Andreas