[dm-devel] How do you force-close a dm device after a disk failure?

Thu Sep 17 14:04:13 UTC 2015

On Mon, Sep 14, 2015 at 07:45:52PM +1000, Adam Nielsen wrote:
> > Whole dm table with all deps needs to be known.
> 
> $ dmsetup table
> backup: 0 11720531968 crypt aes-xts-plain64
>   0000000000000000000000000000000000000000000000000000000000000000 0
>   9:10 4096
> 
> $ dmsetup status
> backup: 0 11720531968 crypt
> 
> $ dmsetup ls --tree
> backup (253:0)
>  └─ (9:10)
> 
> $ dmsetup info -f
> Name:              backup
> State:             ACTIVE (DEFERRED REMOVE)
> Read Ahead:        4096
> Tables present:    LIVE
> Open count:        1
> Event number:      0
> Major, minor:      253, 0
> Number of targets: 1
> UUID: CRYPT-LUKS1-d0b3d38e421545908537dc50f59fb217-backup
> 
> All I'm using it for is to encrypt an mdadm-style RAID array composed
> of two external disks, connected temporarily via USB to do a full
> system backup with rsync.
> 
> > > I'm not sure how to do this, could you please elaborate?  I thought
> > > "dmsetup remove --force" would do this but as that doesn't work
> > 
> > Really, the state of the whole table needs to be known.
> > 
> > >> Also note - dmsetup remove supports --deferred removal (see man
> > >> page).
> > >
> > > Oh I didn't notice that.  It doesn't seem to have much of an effect
> > > though:
> > 
> > Sure it will not fix your problem - it's like lazy umount...
> 
> So replacing the table with the 'error' target won't release the
> underlying device, even though that device is not used by the new
> target?
> 
> > What is not clear to me is - what is your expectation here?
> > Obviously your system is far more broken - so placing 'error' target
> > for your backup device will not fix it.
> > 
> > You should probably also attach the relevant portion of 'dmesg' - it
> > will surely show what is going wrong with your system.
> 
> What happened was in the middle of the backup, there was some USB
> interruption and the disks dropped out, so the writes started failing.
> The kernel logs were full of write errors to various sector numbers.  I
> think you would have the same result if you set things up with a USB
> stick and then unplugged it during a data transfer.
> 
> The devices are connected like this:
> 
>   dm device "backup"
>    |
>    +-- mdadm device /dev/md10
>         |
>         +-- USB/SATA disk A (/dev/sdd)
>         |
>         +-- USB/SATA disk B (/dev/sde)

mdadm /dev/md10 --fail /dev/sdd --remove /dev/sdd
mdadm /dev/md10 --fail /dev/sde --remove /dev/sde
(or maybe combine in one command line, if that is supposed to work)

That should kick both disks out of the MD array,
should make md10 fail all pending (and new) requests,
and should even get the stuck dm suspend unstuck.

No?
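
As a rough sketch of the combined form (not tested on this setup):
mdadm's manage mode accepts several --fail/--remove pairs in one
invocation, and also takes the keyword "detached" in place of a device
name, which may help if the /dev/sdd and /dev/sde nodes are already gone
after the unplug. The run, kick_members and kick_detached helpers below
are made up for illustration, and nothing is executed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: prints the mdadm command lines unless RUN=1 is set.
# Array and member names are the ones from this thread; adjust as needed.

# Execute the command when RUN=1, otherwise just print it (dry run).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Fail and remove every named member in a single mdadm invocation.
# Usage: kick_members <md-device> <member>...
kick_members() {
    md=$1
    shift
    args=""
    for dev in "$@"; do
        args="$args --fail $dev --remove $dev"
    done
    # Word splitting of $args is intentional; device paths have no spaces.
    run mdadm "$md" $args
}

# Fail and remove all members that are no longer connected to the system,
# using mdadm's "detached" keyword instead of device names.
# Usage: kick_detached <md-device>
kick_detached() {
    run mdadm "$1" --fail detached --remove detached
}

# Dry-run examples for the array discussed in this thread.
kick_members /dev/md10 /dev/sdd /dev/sde
kick_detached /dev/md10
```

With RUN unset this only prints the two mdadm command lines, so they can
be checked first and then run as root with RUN=1.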

Cheers,

	Lars Ellenberg

> The problem is that I can't just reconnect the disks and rerun the
> backup.  mdadm refuses to stop the RAID array as it is in use by
> the dm device, and it thinks the array is active despite the disks being
> unplugged and in a drawer.  If I reconnect the disks they appear as
> different devices (sdf and sdg) but I still can't start the "new" array
> from these new disk devices, as it tells me the disks are already part
> of an active array.
> 
> So the only way I can have another go at running this backup is to
> close down /dev/md10, and it seems the only way I can do that is to
> tell dm to release that device.  It doesn't matter if the dm device
> "backup" is unusable, I will just create "backup2" to use for the
> second attempt.
> 
> But until I can figure out how to get dm to release the underlying
> device, I'm stuck!
> 
> > i.e. you cannot expect 'remove --force' to work when your machine
> > starts to show kernel errors.
> 
> There were no kernel crashes, just errors related to USB transfers.  I
> would assume this is not much different to how a real failed disk might
> behave, so I figure it is a situation that should be encountered
> relatively often!
> 
> Thanks again,
> Adam.