[Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn

Steven Whitehouse swhiteho at redhat.com
Wed May 1 00:08:39 UTC 2019


Hi,

On 01/05/2019 00:03, Bob Peterson wrote:
> This patch addresses various problems with gfs2/dlm recovery.
>
> For example, suppose a node with a bunch of gfs2 mounts suddenly
> reboots due to kernel panic, and dlm determines it should perform
> recovery. DLM does so from a pseudo-state machine calling various
> callbacks into lock_dlm to perform a sequence of steps. It uses
> generation numbers and recover bits in dlm "control" lock lvbs.
>
> Now suppose another node tries to recover the failed node's
> journal, but in so doing, encounters an I/O error or withdraws
> due to unforeseen circumstances, such as an HBA driver failure.
> In these cases, the recovery would eventually bail out, but it
> would still update its generation number in the lvb. The other
> nodes would all see the newer generation number and think they
> don't need to do recovery because the generation number is newer
> than the last one they saw, and therefore someone else has already
> taken care of it.
>
> If the file system has an I/O error or is withdrawn, it cannot
> safely replay any journals (its own or others'), but someone else
> still needs to do it. Therefore we don't want it messing with
> the journal recovery generation numbers: the local generation
> numbers eventually get put into the lvb generation numbers to be
> seen by all nodes.
>
> This patch adds checks to many of the callbacks used by dlm
> in its recovery state machine so that the functions are ignored
> and skipped if an I/O error has occurred or if the file system
> was withdrawn.
>
> Signed-off-by: Bob Peterson <rpeterso at redhat.com>

These should probably propagate the error back to the caller of the
recovery request rather than silently ignoring it. We already have a
proper notification system for failed recovery via uevents.
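
As a rough illustration of that idea, here is a userspace sketch, not
kernel code. The names model_sbd, model_recovery_result and the
MODEL_RD_* values are hypothetical, and the event string format is only
illustrative. It assumes a withdrawn node should report the recovery as
failed to its caller while leaving its generation number untouched, so
another node can still pick up the journal:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; none of these names exist in fs/gfs2. */
enum model_result { MODEL_RD_GAVEUP = 0, MODEL_RD_SUCCESS = 1 };

struct model_sbd {
	bool withdrawn;          /* models gfs2_withdrawn(sdp) */
	unsigned int local_gen;  /* models the local recovery generation */
	char last_event[64];     /* models the text a uevent would carry */
};

/*
 * Report the outcome of recovering journal `jid`.
 *
 * A withdrawn node never advances its generation, but it still reports
 * a failure so that the caller can hand the journal to another node.
 * Returns 0 on success and -1 on failure.
 */
static int model_recovery_result(struct model_sbd *s, unsigned int jid,
				 enum model_result res)
{
	/* A withdrawn file system cannot have replayed anything safely. */
	if (s->withdrawn)
		res = MODEL_RD_GAVEUP;

	/* Always notify, whether recovery worked or not. */
	snprintf(s->last_event, sizeof(s->last_event), "JID=%u RECOVERY=%s",
		 jid, res == MODEL_RD_SUCCESS ? "Done" : "Failed");

	if (res != MODEL_RD_SUCCESS)
		return -1;

	/* Only a successful replay may bump the generation. */
	s->local_gen++;
	return 0;
}
```

The point of the sketch is only the ordering: the failure is reported
first, and the generation is bumped only on the success path.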

Steve.

> ---
>   fs/gfs2/lock_dlm.c | 18 ++++++++++++++++++
>   fs/gfs2/util.c     | 15 +++++++--------
>   2 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index 31df26ed7854..9329f86ffcbe 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
>   	struct gfs2_sbd *sdp = arg;
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_prep ignored due to withdraw.\n");
> +		return;
> +	}
>   	spin_lock(&ls->ls_recover_spin);
>   	ls->ls_recover_block = ls->ls_recover_start;
>   	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
> @@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   	int jid = slot->slot - 1;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
> +		       jid);
> +		return;
> +	}
>   	spin_lock(&ls->ls_recover_spin);
>   	if (ls->ls_recover_size < jid + 1) {
>   		fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
> @@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
>   	struct gfs2_sbd *sdp = arg;
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recover_done ignored due to withdraw.\n");
> +		return;
> +	}
>   	/* ensure the ls jid arrays are large enough */
>   	set_recover_size(sdp, slots, num_slots);
>   
> @@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
>   {
>   	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>   
> +	if (gfs2_withdrawn(sdp)) {
> +		fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
> +		       jid);
> +		return;
> +	}
>   	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
>   		return;
>   
> diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
> index 0a814ccac41d..7eaea6dfe1cf 100644
> --- a/fs/gfs2/util.c
> +++ b/fs/gfs2/util.c
> @@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct buffer_head *bh,
>   			const char *function, char *file, unsigned int line,
>   			bool withdraw)
>   {
> -	if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
> -		fs_err(sdp,
> -		       "fatal: I/O error\n"
> -		       "  block = %llu\n"
> -		       "  function = %s, file = %s, line = %u\n",
> -		       (unsigned long long)bh->b_blocknr,
> -		       function, file, line);
> +	if (gfs2_withdrawn(sdp))
> +		return;
> +
> +	fs_err(sdp, "fatal: I/O error\n"
> +	       "  block = %llu\n"
> +	       "  function = %s, file = %s, line = %u\n",
> +	       (unsigned long long)bh->b_blocknr, function, file, line);
>   	if (withdraw)
>   		gfs2_lm_withdraw(sdp, NULL);
>   }
> -



