[linux-lvm] Add udev-md-raid-safe-timeouts.rules

Tue Apr 17 11:15:25 UTC 2018

On 2018-04-16 11:02, Wol's lists wrote:
> On 16/04/18 12:43, Austin S. Hemmelgarn wrote:
>> On 2018-04-15 21:04, Chris Murphy wrote:
>>> I just ran into this:
>>> https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec 
>>>
>>>
>>> This solution is inadequate, can it be made more generic? This isn't
>>> an md specific problem, it affects Btrfs and LVM as well. And in fact
>>> raid0, and even non-raid setups.
>>>
>>> There is no good reason to prevent deep recovery, which is what
>>> happens with the default command timer of 30 seconds, with this class
>>> of drive. Basically that value is going to cause data loss for the
>>> single device and also raid0 case, where the reset happens before deep
>>> recovery has a chance. And even if deep recovery fails to return user
>>> data, what we need to see is the proper error message: read error UNC,
>>> rather than a link reset message which just obfuscates the problem.
>>
>> This has been discussed at least once here before (probably more 
>> times, hard to be sure since it usually comes up as a side discussion 
>> in an only marginally related thread). 
> 
> Sorry, but where is "here"? This message is cross-posted to about three 
> lists at least ...
Oops, didn't see the extra lists listed.  In this case, discussed 
previously on the BTRFS ML.
> 
>> Last I knew, the consensus here was
>> that it needs to be changed upstream in the kernel, not by adding a 
>> udev rule because while the value is technically system policy, the 
>> default policy is brain-dead for anything but the original disks it 
>> was intended for (30 seconds works perfectly fine for actual SCSI 
>> devices because they behave sanely in the face of media errors, but 
>> it's horribly inadequate for ATA devices).
>>
>> To re-iterate what I've said before on the subject:
>>
> imho (and it's probably going to be a pain to implement :-) there should 
> be a soft time-out and a hard time-out. The soft time-out should trigger 
> "drive is taking too long to respond" messages that end up in a log - so 
> that people who actually care can keep track of this sort of thing. 
> The hard timeout should be the current set-up, where the kernel just 
> gives up.
Agreed, although as pointed out by Roger in his reply to this, it kind 
of already works this way in some cases.
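
To make the soft/hard split concrete, here is a rough userspace sketch in 
Python. It is only an analogy, not how the kernel's command timer is 
implemented: the function name run_with_timeouts and its parameters are 
made up for illustration. The soft limit only logs a warning and keeps 
waiting; the hard limit is where the caller gives up.

```python
import concurrent.futures
import logging
import time

log = logging.getLogger("two_stage_timeout")


def run_with_timeouts(op, soft_s, hard_s):
    """Run op(); log a warning after soft_s seconds, give up after hard_s.

    Hypothetical helper for illustration only. Raises
    concurrent.futures.TimeoutError if op() has not finished by hard_s.
    """
    if not 0 < soft_s <= hard_s:
        raise ValueError("need 0 < soft_s <= hard_s")

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(op)
    try:
        try:
            # Stage 1: wait up to the soft limit.
            return future.result(timeout=soft_s)
        except concurrent.futures.TimeoutError:
            # Soft timeout: complain in the log, but keep waiting.
            log.warning("operation is taking longer than %.1fs to respond",
                        soft_s)
        # Stage 2: wait for the rest of the hard limit; a timeout here
        # propagates to the caller, which is the "give up" case.
        return future.result(timeout=hard_s - soft_s)
    finally:
        # Do not block on a stuck operation when leaving.
        pool.shutdown(wait=False)
```

Note that in this sketch the worker thread is not cancelled on the hard 
timeout; the caller just stops waiting for it. A real implementation would 
also have to decide what to do with the outstanding command.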