[Cluster-devel] [RFC v6.5-rc2 2/3] fs: lockd: fix race in async lock request handling

Thu Aug 10 20:37:47 UTC 2023

Hi,

On Fri, Jul 21, 2023 at 11:45 AM Jeff Layton <jlayton at kernel.org> wrote:
>
> On Thu, 2023-07-20 at 08:58 -0400, Alexander Aring wrote:
> > This patch fixes a race in async lock request handling between adding
> > the relevant struct nlm_block to nlm_blocked list after the request was
> > sent by vfs_lock_file() and nlmsvc_grant_deferred() does a lookup of the
> > nlm_block in the nlm_blocked list. It could be that the async request is
> > completed before the nlm_block was added to the list. This would end
> > in a -ENOENT and a kernel log message of "lockd: grant for unknown
> > block".
> >
> > To solve this issue we add the nlm_block before the vfs_lock_file() call
> > to be sure it has been added when a possible nlmsvc_grant_deferred() is
> > called. If the vfs_lock_file() results in an case when it wouldn't be
> > added to nlm_blocked list, the nlm_block struct will be removed from
> > this list again.
> >
> > Signed-off-by: Alexander Aring <aahringo at redhat.com>
> > ---
> >  fs/lockd/svclock.c          | 80 +++++++++++++++++++++++++++----------
> >  include/linux/lockd/lockd.h |  1 +
> >  2 files changed, 60 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> > index 28abec5c451d..62ef27a69a9e 100644
> > --- a/fs/lockd/svclock.c
> > +++ b/fs/lockd/svclock.c
> > @@ -297,6 +297,8 @@ static void nlmsvc_free_block(struct kref *kref)
> >
> >       dprintk("lockd: freeing block %p...\n", block);
> >
> > +     WARN_ON_ONCE(block->b_flags & B_PENDING_CALLBACK);
> > +
> >       /* Remove block from file's list of blocks */
> >       list_del_init(&block->b_flist);
> >       mutex_unlock(&file->f_mutex);
> > @@ -543,6 +545,12 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
> >               goto out;
> >       }
> >
> > +     if (block->b_flags & B_PENDING_CALLBACK)
> > +             goto pending_request;
> > +
> > +     /* Append to list of blocked */
> > +     nlmsvc_insert_block(block, NLM_NEVER);
> > +
> >       if (!wait)
> >               lock->fl.fl_flags &= ~FL_SLEEP;
> >       mode = lock_to_openmode(&lock->fl);
> > @@ -552,9 +560,13 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
> >       dprintk("lockd: vfs_lock_file returned %d\n", error);
> >       switch (error) {
> >               case 0:
> > +                     nlmsvc_remove_block(block);
> >                       ret = nlm_granted;
> >                       goto out;
> >               case -EAGAIN:
> > +                     if (!wait)
> > +                             nlmsvc_remove_block(block);
> > +pending_request:
> >                       /*
> >                        * If this is a blocking request for an
> >                        * already pending lock request then we need
> > @@ -565,6 +577,8 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
> >                       ret = async_block ? nlm_lck_blocked : nlm_lck_denied;
> >                       goto out;
> >               case FILE_LOCK_DEFERRED:
> > +                     block->b_flags |= B_PENDING_CALLBACK;
> > +
> >                       if (wait)
> >                               break;
> >                       /* Filesystem lock operation is in progress
> > @@ -572,17 +586,16 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
> >                       ret = nlmsvc_defer_lock_rqst(rqstp, block);
>
> When the above function is called, it's going to end up reinserting the
> block into the list. I think you probably also need to remove the call
> to nlmsvc_insert_block from nlmsvc_defer_lock_rqst since it could have
> been granted before that occurs.
>

it cannot be granted during this time because the f_mutex is held. We
insert it in the first place to have a way to get the block lookup
working when a lm_grant() is really fast. Then lm_grant() will lookup
the lock and have a way to get f_mutex to hold it. Then lm_grant()
will only run when nobody is in this critical area (on a per nlm_file
basis).

There is a difference in the call between NLM_NEVER and NLM_TIMEOUT in
nlmsvc_defer_lock_rqst(), when nlmsvc_defer_lock_rqst() it will just
update the timeout value. I am not sure about the consequences when it
does a nlmsvc_insert_block() with NLM_NEVER instead of NLM_TIMEOUT.
But as I said it should not be possible to grant the block when
f_mutex is held.

> >                       goto out;
> >               case -EDEADLK:
> > +                     nlmsvc_remove_block(block);
> >                       ret = nlm_deadlock;
> >                       goto out;
> >               default:                        /* includes ENOLCK */
> > +                     nlmsvc_remove_block(block);
> >                       ret = nlm_lck_denied_nolocks;
> >                       goto out;
> >       }
> >
> >       ret = nlm_lck_blocked;
> > -
> > -     /* Append to list of blocked */
> > -     nlmsvc_insert_block(block, NLM_NEVER);
> >  out:
> >       mutex_unlock(&file->f_mutex);
> >       nlmsvc_release_block(block);
> > @@ -739,34 +752,59 @@ nlmsvc_update_deferred_block(struct nlm_block *block, int result)
> >               block->b_flags |= B_TIMED_OUT;
> >  }
> >
> > +static int __nlmsvc_grant_deferred(struct nlm_block *block,
> > +                                struct file_lock *fl,
> > +                                int result)
> > +{
> > +     int rc = 0;
> > +
> > +     dprintk("lockd: nlmsvc_notify_blocked block %p flags %d\n",
> > +                                     block, block->b_flags);
> > +     if (block->b_flags & B_QUEUED) {
> > +             if (block->b_flags & B_TIMED_OUT) {
> > +                     rc = -ENOLCK;
> > +                     goto out;
> > +             }
> > +             nlmsvc_update_deferred_block(block, result);
> > +     } else if (result == 0)
> > +             block->b_granted = 1;
> > +
> > +     nlmsvc_insert_block_locked(block, 0);
> > +     svc_wake_up(block->b_daemon);
> > +out:
> > +     return rc;
> > +}
> > +
> >  static int nlmsvc_grant_deferred(struct file_lock *fl, int result)
> >  {
> > -     struct nlm_block *block;
> > -     int rc = -ENOENT;
> > +     struct nlm_block *block = NULL;
> > +     int rc;
> >
> >       spin_lock(&nlm_blocked_lock);
> >       list_for_each_entry(block, &nlm_blocked, b_list) {
> >               if (nlm_compare_locks(&block->b_call->a_args.lock.fl, fl)) {
> > -                     dprintk("lockd: nlmsvc_notify_blocked block %p flags %d\n",
> > -                                                     block, block->b_flags);
> > -                     if (block->b_flags & B_QUEUED) {
> > -                             if (block->b_flags & B_TIMED_OUT) {
> > -                                     rc = -ENOLCK;
> > -                                     break;
> > -                             }
> > -                             nlmsvc_update_deferred_block(block, result);
> > -                     } else if (result == 0)
> > -                             block->b_granted = 1;
> > -
> > -                     nlmsvc_insert_block_locked(block, 0);
> > -                     svc_wake_up(block->b_daemon);
> > -                     rc = 0;
> > +                     kref_get(&block->b_count);
> >                       break;
> >               }
> >       }
> >       spin_unlock(&nlm_blocked_lock);
> > -     if (rc == -ENOENT)
> > -             printk(KERN_WARNING "lockd: grant for unknown block\n");
> > +
> > +     if (!block) {
> > +             pr_warn("lockd: grant for unknown pending block\n");
> > +             return -ENOENT;
> > +     }
> > +
> > +     /* don't interfere with nlmsvc_lock() */
> > +     mutex_lock(&block->b_file->f_mutex);
>
>
> This is called from lm_grant, and Documentation/filesystems/locking.rst
> says that lm_grant is not allowed to block. The only caller though is
> dlm_plock_callback, and I don't see anything that would prevent
> blocking.
>
> Do we need to fix the documentation there?
>

You are right and I think it should not call any sleepable API.
However DLM is the only one upstream user and I have no other idea how
to make the current situation better.

We should update the documentation but be open to make it
non-sleepable in future?

>
> > +     block->b_flags &= ~B_PENDING_CALLBACK;
> > +
>
> You're adding this new flag when the lock is deferred and then clearing
> it when the lock is granted. What about when the lock request is
> cancelled (e.g. by signal)? It seems like you also need to clear it then
> too, correct?
>

correct. I add code to clear it when the block is getting removed from
nlm_blocked in nlmsvc_remove_block().

> > +     spin_lock(&nlm_blocked_lock);
> > +     WARN_ON_ONCE(list_empty(&block->b_list));
> > +     rc = __nlmsvc_grant_deferred(block, fl, result);
> > +     spin_unlock(&nlm_blocked_lock);
> > +     mutex_unlock(&block->b_file->f_mutex);
> > +
> > +     nlmsvc_release_block(block);
> >       return rc;
> >  }
> >
> > diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> > index f42594a9efe0..a977be8bcc2c 100644
> > --- a/include/linux/lockd/lockd.h
> > +++ b/include/linux/lockd/lockd.h
> > @@ -189,6 +189,7 @@ struct nlm_block {
> >  #define B_QUEUED             1       /* lock queued */
> >  #define B_GOT_CALLBACK               2       /* got lock or conflicting lock */
> >  #define B_TIMED_OUT          4       /* filesystem too slow to respond */
> > +#define B_PENDING_CALLBACK   8       /* pending callback for lock request */
> >  };
> >
> >  /*
>
> --
> Jeff Layton <jlayton at kernel.org>
>