[lvm-devel] [PATCH 2 of 4] Handle transient secondary mirror leg failures

Takahiro Yasui tyasui at redhat.com
Fri Dec 18 20:21:33 UTC 2009


On 12/18/09 13:49, malahal at us.ibm.com wrote:
> Takahiro Yasui [tyasui at redhat.com] wrote:
>> On 12/18/09 12:10, Jonathan Brassow wrote:
>>> 2) If you don't get a new table loaded, it will behave as a
>>> suspend/resume only.  Recent code changes in dm-raid1.c are causing
>>> 'log_failure' and 'leg_failure' to not be reset in those cases.  IOW,
>>> all these steps could be for nothing.  :(
>>
>> I would like to know how effective the retry is. As Jon explained
>> above, the recent upstream kernel blocks all write I/Os on NOSYNC regions.
>> This means that those write I/Os are kept blocked for a long time.
>> For example, the mirror retry interval in your patch #4 is 30 seconds, so
>> the application or filesystem will have to wait for 30 seconds (330 seconds
>> if the retry count is 10). Can your application wait for more than 5 minutes?
>>
>> This behaviour will not be solved even if the kernel is fixed so that
>> log_failure and leg_failure are reset. The blocked write I/Os will
>> be re-queued in the kernel when suspend/resume is done, but they
>> will be put in the hold queue again if the device failure is not
>> transient but permanent.
>>
>> I would like to know the use case of this patch set.
> 
> I have tested with RHEL5.4 kernel that doesn't block on device failure.

I see. Your retry approach looks good for the RHEL5.4 kernel and the
2.6.32 kernel, since write I/Os aren't blocked there even after an error
is detected on a mirror leg.

> IMHO, suspend/resume is doing two things if the kernel code blocks on
> failure -- 1) letting the kernel module unblock 2) start resync. We
> should separate those two actions. Maybe, we should do something here???
> Have an ioctl to start resync or have a message to unblock....

I'm sorry, I don't get your point about separating unblock and resync.
Currently "unblock" is done by "suspend" and resync is done by "resume".
We need to reset log_failure/leg_failure, but is there anything else?
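
To make the question concrete, here is a toy model of the behaviour I have
in mind. It is only a sketch and not the dm-raid1.c code: apart from the
flag names log_failure and leg_failure, every name (ToyMirror, leg_ok,
hold, done, suspend_resume) is made up for illustration, and it assumes a
kernel that has been fixed to reset the flags on suspend/resume.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyMirror:
    """Toy model only; not the dm-raid1.c implementation."""
    leg_ok: bool = True          # hypothetical: current health of the secondary leg
    leg_failure: bool = False    # sticky flag, named after the kernel's flag
    log_failure: bool = False    # sticky flag, named after the kernel's flag
    hold: deque = field(default_factory=deque)  # writes blocked after an error
    done: list = field(default_factory=list)    # writes that completed

    def write(self, io):
        if not self.leg_ok:
            self.leg_failure = True
        if self.leg_failure or self.log_failure:
            self.hold.append(io)   # stays blocked until the next suspend/resume
        else:
            self.done.append(io)

    def suspend_resume(self):
        # Assumption: the fixed kernel resets both flags even when no new
        # table is loaded.
        self.leg_failure = self.log_failure = False
        pending, self.hold = list(self.hold), deque()
        for io in pending:         # held writes are re-queued on resume
            self.write(io)
```

With leg_ok left False (a permanent failure), a held write goes straight
back to the hold queue after suspend_resume(); it completes only if leg_ok
has become True again (a transient failure) before the next cycle.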

> Also, blocking the kernel module introduced a regression -- the
> mirror will block forever on a transient failure!

You are right. Refreshing the blocked I/Os is necessary when dmeventd
notices an error event and the lvconvert command finds nothing to repair.
In that case dmeventd should trigger an action to refresh the mirror device.
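
Roughly, the decision could look like the sketch below. This is not
dmeventd code: choose_action() and refresh_command() are hypothetical
helpers, and using "lvchange --refresh" as the refresh action is my
assumption, not something the patch set does today. The sketch only
builds the command line and does not run it.

```python
def choose_action(error_event: bool, repair_needed: bool) -> str:
    """Pick what the event handler should do (hypothetical helper)."""
    if not error_event:
        return "none"
    # A transient failure leaves nothing to repair, but the held writes
    # still need a suspend/resume cycle to be re-queued.
    return "repair" if repair_needed else "refresh"


def refresh_command(vg: str, lv: str) -> list:
    """Command line for refreshing one mirror LV (assumed refresh action)."""
    return ["lvchange", "--refresh", f"{vg}/{lv}"]
```

For example, choose_action(True, False) returns "refresh", and
refresh_command("vg0", "mirror0") gives the arguments for
"lvchange --refresh vg0/mirror0" (vg0 and mirror0 are placeholder names).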

Thanks,
Taka



