[dm-devel] dm-crypt: Fix deadlock when algo returns -EBUSY

Tue Apr 7 15:55:01 UTC 2015

On Fri, Apr 03 2015 at 12:09pm -0400,
Ben Collins <ben.c at servergy.com> wrote:

> I suspect this doesn't show up for most anyone because software
> algorithms typically don't have a sense of being too busy. However,
> when working with the Freescale CAAM driver, it would, under heavy
> load, return -EBUSY on occasion, which would leave dm-crypt
> deadlocked and quite unhappy.
> 
> After checking the logic in some other drivers, the scheme for
> crypt_convert() and its callback, kcryptd_async_done(), were not
> correctly laid out to handle -EBUSY properly (nor -EINPROGRESS for
> that matter).
> 
> This patch corrects that and makes things pretty happy. Whereas I
> could cause a lockup within 1-2 minutes, I've now been able to run
> the reproduction test with CAAM drivers fully installed for over
> 24 hours. I've regression tested it against software algorithms
> on PPC32 and x86_64, and things seem perfectly happy there as well.
> 
> I've tested this all the way back to 3.13, and a few iterations
> in-between. Please consider pushing this to various stable trees.
> 
> Cc: dm-devel at redhat.com
> Cc: Alasdair Kergon <agk at redhat.com>
> Cc: Mike Snitzer <snitzer at redhat.com>
> Signed-off-by: Ben Collins <ben.c at servergy.com>
> ---
>  drivers/md/dm-crypt.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 713a962..3891f26 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -925,11 +925,10 @@ static int crypt_convert(struct crypt_config *cc,
>  
>  		switch (r) {
>  		/* async */
> +		case -EINPROGRESS:
>  		case -EBUSY:
>  			wait_for_completion(&ctx->restart);
>  			reinit_completion(&ctx->restart);
> -			/* fall through*/
> -		case -EINPROGRESS:
>  			ctx->req = NULL;
>  			ctx->cc_sector++;
>  			continue;
> @@ -1346,10 +1345,8 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>  	struct dm_crypt_io *io = container_of(ctx, struct dm_crypt_io, ctx);
>  	struct crypt_config *cc = io->cc;
>  
> -	if (error == -EINPROGRESS) {
> -		complete(&ctx->restart);
> +	if (error == -EINPROGRESS)
>  		return;
> -	}
>  
>  	if (!error && cc->iv_gen_ops && cc->iv_gen_ops->post)
>  		error = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
> @@ -1359,13 +1356,17 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>  
>  	crypt_free_req(cc, req_of_dmreq(cc, dmreq), io->base_bio);
>  
> -	if (!atomic_dec_and_test(&ctx->cc_pending))
> +	if (!atomic_dec_and_test(&ctx->cc_pending)) {
> +		complete(&ctx->restart);
>  		return;
> +	}
>  
>  	if (bio_data_dir(io->base_bio) == READ)
>  		kcryptd_crypt_read_done(io);
>  	else
>  		kcryptd_crypt_write_io_submit(io, 1);
> +
> +	complete(&ctx->restart);
>  }
>  
>  static void kcryptd_crypt(struct work_struct *work)

It looks like you're _always_ using the completion regardless of whether
crypt_convert() will be waiting (e.g. even if error is 0).

I can see this "working" but it seems less than ideal.  Would it be
better to record the need to use the completion in ctx and then
conditionally call complete()?
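
Roughly along these lines, as a single-threaded userspace sketch only.
The need_restart field and the mark_waiting()/async_done() helpers are
made-up names for illustration, not existing dm-crypt code, and struct
completion here is a trivial stand-in for the kernel one. Real code
would also have to close the window between the cipher returning
-EBUSY and the flag being set:

```c
#include <stdbool.h>

/* Stand-in for the kernel's struct completion: it only counts how
 * many times complete() has been called. */
struct completion {
	unsigned int done;
};

void complete(struct completion *c)
{
	c->done++;
}

/* Hypothetical, trimmed-down convert context. 'need_restart' is an
 * assumed field that records whether the submitter intends to wait. */
struct convert_ctx {
	struct completion restart;
	bool need_restart;
};

/* Submitter side: call this when the cipher returned -EBUSY, before
 * waiting on ctx->restart. */
void mark_waiting(struct convert_ctx *ctx)
{
	ctx->need_restart = true;
}

/* Callback side: signal the completion only if the submitter recorded
 * that it will wait. Returns true if complete() was called. */
bool async_done(struct convert_ctx *ctx)
{
	if (!ctx->need_restart)
		return false;

	ctx->need_restart = false;
	complete(&ctx->restart);
	return true;
}
```

With something like that, a request that finishes without
crypt_convert() ever waiting leaves the completion untouched, so a
later wait cannot return early on a stale complete().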

Mike