[Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
Steven Whitehouse
swhiteho at redhat.com
Wed May 1 00:08:39 UTC 2019
Hi,
On 01/05/2019 00:03, Bob Peterson wrote:
> This patch addresses various problems with gfs2/dlm recovery.
>
> For example, suppose a node with a bunch of gfs2 mounts suddenly
> reboots due to kernel panic, and dlm determines it should perform
> recovery. DLM does so from a pseudo-state machine calling various
> callbacks into lock_dlm to perform a sequence of steps. It uses
> generation numbers and recover bits in dlm "control" lock lvbs.
>
> Now suppose another node tries to recover the failed node's
> journal, but in so doing, encounters an IO error or withdraws
> due to unforeseen circumstances, such as an hba driver failure.
> In these cases, the recovery would eventually bail out, but it
> would still update its generation number in the lvb. The other
> nodes would all see the newer generation number and think they
> don't need to do recovery because the generation number is newer
> than the last one they saw, and therefore someone else has already
> taken care of it.
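The failure mode described above can be sketched in user space. This is a deliberately simplified toy model, not the lock_dlm code: `toy_node`, `toy_try_recover`, and the single shared `lvb_gen` counter are hypothetical names standing in for the per-journal generation values kept in the dlm "control" lock lvb.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of one node taking part in recovery. */
struct toy_node {
	uint32_t seen_gen;	/* last generation this node observed */
	bool withdrawn;		/* node hit an I/O error or withdrew */
};

/*
 * Try to recover a journal for recovery generation start_gen.
 * *lvb_gen stands in for the generation published to all nodes.
 * Returns true only if the journal was actually replayed.
 *
 * With skip_publish_on_failure == false, a failed attempt still
 * advances the shared generation (the problem described above).
 * With it set to true, a withdrawn node leaves the generation alone.
 */
static bool toy_try_recover(struct toy_node *n, uint32_t *lvb_gen,
			    uint32_t start_gen, bool skip_publish_on_failure)
{
	if (*lvb_gen >= start_gen) {
		/* Looks as if another node already handled this one. */
		n->seen_gen = *lvb_gen;
		return false;
	}
	if (n->withdrawn) {
		if (!skip_publish_on_failure)
			*lvb_gen = start_gen;	/* misleading "done" signal */
		return false;
	}
	*lvb_gen = start_gen;
	n->seen_gen = start_gen;
	return true;
}
```

In this model, if a withdrawn node runs first without the skip, a healthy node that runs afterwards sees the newer generation and never replays the journal. With the skip, the healthy node still does the replay.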
>
> If the file system has an io error or is withdrawn, it cannot
> safely replay any journals (its own or others') but someone else
> still needs to do it. Therefore we don't want it messing with
> the journal recovery generation numbers: the local generation
> numbers eventually get put into the lvb generation numbers to be
> seen by all nodes.
>
> This patch adds checks to many of the callbacks used by dlm
> in its recovery state machine so that the functions are ignored
> and skipped if an io error has occurred or if the file system
> was withdrawn.
>
> Signed-off-by: Bob Peterson <rpeterso at redhat.com>
These should probably propagate the error back to the caller of the
recovery request. We do have a proper notification system for failed
recovery via uevents.
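As a rough user-space illustration of that idea (an assumption-laden sketch, not the gfs2 implementation): instead of returning without telling anyone, the recovery path reports a failed status to its caller and fills in a failure notification. Every name here (`toy_status`, `toy_uevent`, `toy_recover_journal`) is hypothetical, and the `JID=` / `RECOVERY=` strings are only an assumed shape for such a notification.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch only. */
enum toy_status { TOY_RECOVERY_FAILED = 0, TOY_RECOVERY_DONE = 1 };

/* Hypothetical stand-in for a uevent: two environment-style strings. */
struct toy_uevent {
	char jid[20];
	char status[24];
};

/*
 * Attempt recovery of journal jid. If the file system is withdrawn,
 * report failure to the caller and record a "Failed" notification
 * rather than skipping the request without any result.
 */
static enum toy_status toy_recover_journal(bool withdrawn, unsigned int jid,
					   struct toy_uevent *ev)
{
	enum toy_status st = withdrawn ? TOY_RECOVERY_FAILED
				       : TOY_RECOVERY_DONE;

	snprintf(ev->jid, sizeof(ev->jid), "JID=%u", jid);
	snprintf(ev->status, sizeof(ev->status), "RECOVERY=%s",
		 st == TOY_RECOVERY_DONE ? "Done" : "Failed");
	return st;
}
```

The point of the sketch is only that the requester gets an explicit failure result and can hand the journal to another node, rather than inferring anything from generation numbers.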
Steve.
> ---
> fs/gfs2/lock_dlm.c | 18 ++++++++++++++++++
> fs/gfs2/util.c | 15 +++++++--------
> 2 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index 31df26ed7854..9329f86ffcbe 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
> struct gfs2_sbd *sdp = arg;
> struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> + if (gfs2_withdrawn(sdp)) {
> + fs_err(sdp, "recover_prep ignored due to withdraw.\n");
> + return;
> + }
> spin_lock(&ls->ls_recover_spin);
> ls->ls_recover_block = ls->ls_recover_start;
> set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
> @@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
> struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> int jid = slot->slot - 1;
>
> + if (gfs2_withdrawn(sdp)) {
> + fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
> + jid);
> + return;
> + }
> spin_lock(&ls->ls_recover_spin);
> if (ls->ls_recover_size < jid + 1) {
> fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
> @@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
> struct gfs2_sbd *sdp = arg;
> struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> + if (gfs2_withdrawn(sdp)) {
> + fs_err(sdp, "recover_done ignored due to withdraw.\n");
> + return;
> + }
> /* ensure the ls jid arrays are large enough */
> set_recover_size(sdp, slots, num_slots);
>
> @@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> {
> struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> + if (gfs2_withdrawn(sdp)) {
> + fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
> + jid);
> + return;
> + }
> if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
> return;
>
> diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
> index 0a814ccac41d..7eaea6dfe1cf 100644
> --- a/fs/gfs2/util.c
> +++ b/fs/gfs2/util.c
> @@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct buffer_head *bh,
> const char *function, char *file, unsigned int line,
> bool withdraw)
> {
> - if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
> - fs_err(sdp,
> - "fatal: I/O error\n"
> - " block = %llu\n"
> - " function = %s, file = %s, line = %u\n",
> - (unsigned long long)bh->b_blocknr,
> - function, file, line);
> + if (gfs2_withdrawn(sdp))
> + return;
> +
> + fs_err(sdp, "fatal: I/O error\n"
> + " block = %llu\n"
> + " function = %s, file = %s, line = %u\n",
> + (unsigned long long)bh->b_blocknr, function, file, line);
> if (withdraw)
> gfs2_lm_withdraw(sdp, NULL);
> }
> -