[dm-devel] [PATCH] multipath-tools: prevent a kernel upgrade leading to all paths lost

Benjamin Marzinski bmarzins at redhat.com
Sat Apr 30 02:31:08 UTC 2022


On Mon, Apr 18, 2022 at 02:54:15PM -0700, Brian Bunker wrote:
> With the kernel commit 268940b80fa4096397fd0a28e6ad807e64120215, the
> handling of ALUA state transitioning state changed. It used to be that
> paths which returned with ALUA transitioning state would not result in
> those paths being failed. With this patch, that state returned will
> result in the path being failed. The result of this will result in all
> paths down events when a configuration has the no_path_retry set to 0.
> Before this kernel change that same configuration would not reach all
> paths down.
> 
> If the kernel is not changed back to the previous behavior, the
> multipath-tools have to protect against this condition or
> configurations which were working correctly prior to the kernel change
> will lead to an unexpected failure.

Unless I'm missing something, this patch isn't going to be helpful. It
doesn't configure the multipath device to queue IOs before the failure.
When the kernel runs out of paths, it will see that the mutltipath
device has no paths and isn't set to queue, and it will fail the IOs.
After the paths come back up again, multipath will enable queueing in
update_queue_mode_add_path().  But it's too late by then. The IOs will
have already been failed.

Another issue is that the patch is doing scsi specific work in code
meant for general block devices. Finally, It's seems pointless to set
the device to queue when all the paths go away and the turn off queueing
when the paths come back. no_path_retry doesn't do anything when there
are paths, so if it's necessary to set when the paths are gone, it won't
hurt anything to be set when the paths are present. I understand that
you only want to do this for devices in the ALUA transtitioning state,
but the only why to make this actually help is if multipathd enables
queueing before the device ever gets into that state. 

The only really safe thing to do would be to set no_path_retry to
something other than 0 or NO_PATH_RETRY_QUEUE. This can already be done
in the configuration files. But I agree that failing in IOs because the
device is transitioning seems wrong in the first place, and fixing this
in the kernel makes sense.
 
> See the kernel discussions here:
> https://marc.info/?l=linux-scsi&m=162636826305740&w=2
> https://marc.info/?l=linux-scsi&m=164987222322261&w=2
> 
> Signed-off-by: brian at purestorage.com
> Reviewed-by: sconnor at purestorage.com
> ___
> diff --git a/libmultipath/structs.h b/libmultipath/structs.h
> index d94f93a0..0af7e598 100644
> --- a/libmultipath/structs.h
> +++ b/libmultipath/structs.h
> @@ -370,6 +370,7 @@ struct multipath {
>         int failback_tick;
> 
>         int no_path_retry; /* number of retries after all paths are down */
> +       int no_path_retry_configured; /* value in config */
>         int retry_tick;    /* remaining times for retries */
>         int disable_queueing;
>         int minio;
> diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
> index 6c23df86..3db4e6f7 100644
> --- a/libmultipath/structs_vec.c
> +++ b/libmultipath/structs_vec.c
> @@ -780,8 +780,15 @@ void update_queue_mode_add_path(struct multipath *mpp)
>  {
>         int active = count_active_paths(mpp);
> 
> -       if (active > 0)
> +       if (active > 0) {
>                 leave_recovery_mode(mpp);
> +               if (mpp->no_path_retry != mpp->no_path_retry_configured) {
> +                       condlog(2, "%s: return no path retry to %d
> from %d", mpp->alias,
> +                       mpp->no_path_retry_configured, mpp->no_path_retry);
> +                       mpp->no_path_retry = mpp->no_path_retry_configured;
> +               }
> +       }
> +
>         condlog(2, "%s: remaining active paths: %d", mpp->alias, active);
>  }
> 
> diff --git a/multipathd/main.c b/multipathd/main.c
> index f2c0b280..92841d13 100644
> --- a/multipathd/main.c
> +++ b/multipathd/main.c
> @@ -524,7 +524,7 @@ int update_multipath (struct vectors *vecs, char
> *mapname, int reset)
>         struct multipath *mpp;
>         struct pathgroup  *pgp;
>         struct path *pp;
> -       int i, j;
> +       int i, j, tpg;
> 
>         mpp = find_mp_by_alias(vecs->mpvec, mapname);
> 
> @@ -553,6 +553,15 @@ int update_multipath (struct vectors *vecs, char
> *mapname, int reset)
>                                 checkint = conf->checkint;
>                                 put_multipath_config(conf);
>                                 condlog(2, "%s: mark as failed", pp->dev);
> +                               tpg = get_target_port_group(pp, DEF_TIMEOUT);
> +                               if ((tpg >= 0) &&
> +                                   (mpp->no_path_retry == 0) &&
> +                                   (get_asymmetric_access_state(pp,
> tpg, DEF_TIMEOUT) == AAS_TRANSITIONING)) {
> +                                       mpp->no_path_retry_configured
> = mpp->no_path_retry;
> +                                       mpp->no_path_retry = (60 / checkint);
> +                                       condlog(2, "%s: changed %s no
> path retry to %d", pp->dev, mpp->alias,
> +                                               (60 / checkint));
> +                               }
>                                 mpp->stat_path_failures++;
>                                 pp->state = PATH_DOWN;
>                                 if (oldstate == PATH_UP ||
> ___
> 
> -- 
> Brian Bunker
> PURE Storage, Inc.
> brian at purestorage.com
> 
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel


More information about the dm-devel mailing list