[dm-devel] How do you force-close a dm device after a disk failure?

Mon Sep 21 17:50:57 UTC 2015

On 21.9.2015 at 13:39, Lars Ellenberg wrote:
> On Sat, Sep 19, 2015 at 07:47:52PM +1000, Adam Nielsen wrote:
>>> Was this the 'ONLY' dmsetup in your listing (i.e. you reproduced case
>>> again)?
>>
>> This was the original instance of the problem.  Today I have rebooted
>> and reproduced the problem on a fresh kernel.
>>
>>> I mean - your existing reported situation was already hopeless and
>>> needed reboot - as if  flushing suspend holds some mutexes - no other
>>> suspend call can fix it ->  you usually have just  1 chance to fix it
>>> in right way, if you go wrong way reboot is unavoidable.
>>
>> That sounds like a very unforgiving buggy kernel, if you only have one
>> chance to fix the problem ;-)
>>
>> Here is my attempt on the fresh kernel.  I received some write errors
>> in dmesg, so tried to umount the dm device to confirm I had reproduced
>> the problem, and when umount failed to exit I tried this:
>>
>>    $ dmsetup reload backup --table "0 11720531968 error"
>>    $ dmsetup suspend --noflush --nolockfs backup
>
> You need to *resume* to activate the new table.
>
>> These two worked fine now.  "dmsetup suspend" was locking up before,
>> this time it worked.
>>
>>    $ umount /mnt/backup
>>    umount: /mnt/backup: not mounted
>>
>> The dm instance is no longer mounted.
>>
>>    $ mdadm --manage --stop /dev/md10
>>    mdadm: Cannot get exclusive access to /dev/md10:Perhaps a running
>>      process, mounted filesystem or active volume group?
>
> Also, as mentioned before, why don't you
> mdadm /dev/md10 --fail /dev/sdd --remove /dev/sdd
> mdadm /dev/md10 --fail /dev/sde --remove /dev/sde
> (for whatever sdX members it currently has;
> or maybe combine in one command line, if that is supposed to work)
>
> Should kick out the disks from the MD,
> should make md10 fail all pending (and new) requests,
> should even get the stuck dm suspend going again
> (the implicit "flush" one, not the --noflush one,
> as that did not get stuck anyways).
>
>> I can't restart the underlying RAID array though, as the dm instance is
>> still holding onto the devices.
>>
>>    $ dmsetup remove --force backup
>>    device-mapper: remove ioctl on backup failed: Device or resource busy
>>    Command failed
>
> You need to *resume* the new (error) table.
> Or the previous table is only suspended, but still holds references.
>
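
To make the order of operations Lars describes explicit, here is a minimal
dry-run sketch. It only prints the dmsetup commands, it does not run them.
The helper name dm_error_swap_cmds is hypothetical; the device name and
sector count are taken from the commands quoted above.

```shell
# Hypothetical dry-run helper: print the commands that swap in an "error" table.
# $1: dm device name, $2: device size in 512-byte sectors.
dm_error_swap_cmds() {
    name=$1
    sectors=$2
    # Load the replacement table into the inactive slot.
    echo "dmsetup reload $name --table \"0 $sectors error\""
    # Suspend without flushing outstanding I/O or freezing the filesystem.
    echo "dmsetup suspend --noflush --nolockfs $name"
    # Resume is the step that makes the reloaded table the live one.
    echo "dmsetup resume $name"
    # Only after that can the device be removed.
    echo "dmsetup remove $name"
}

dm_error_swap_cmds backup 11720531968
```

Whether each step succeeds on a real system still depends on the condition
described next.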

There is a condition which may prevent replacing the dm table.

If the dm target has a bio operation in progress and the underlying device is
not responding (not acknowledging bio completion), you can't suspend a target
with such a bio in progress.

It's not trivial to improve this.

So if you happen to 'deadlock' in this state - there is currently no other
help than rebooting the machine if you want to get rid of such a 'frozen' device.
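
One rough way to see whether a device still has I/O outstanding is the
"inflight" file in sysfs, which holds two counters (reads and writes in
flight). The sketch below just adds them up. The helper name dm_inflight and
the dm-0 device name in the usage line are assumptions for illustration, and
a non-zero value by itself does not prove the device is stuck.

```shell
# Hypothetical helper: print the total in-flight I/O count from a sysfs
# "inflight" file, which contains two numbers: reads and writes in flight.
# $1: path to the file, e.g. /sys/block/dm-0/inflight (dm-0 is an assumed name)
dm_inflight() {
    # read splits on whitespace, so the column padding in the file is ignored
    read -r reads writes < "$1" || return 1
    echo $(( $reads + $writes ))
}

# Usage (uncomment on a real system; dm-0 is only an example):
# dm_inflight /sys/block/dm-0/inflight
```

If the number stays above zero and never drops while the underlying disk is
gone, that would be consistent with the bio-in-progress state described above.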

On the other hand - from what was said - 'dropping' a USB disk out of the system
should not cause such a state.

So more details from the logs are probably needed to learn more about this.

Zdenek