[dm-devel] [RFC] dm-raid1 fault tolerance patches

Jonathan Brassow jbrassow at redhat.com
Thu Nov 2 18:44:23 UTC 2006


Some of these patches are incomplete, but I'm hoping for some comments
on them before finishing them to ensure I'm headed in the right
direction.  (Patches apply to 2.6.18.1)

I'm providing all the patches in my set (even though some are already on
their way upstream) so that later patches apply cleanly.  I have not
included any read-balancing work, and these patches are not dependent on
that being there.

These 4 patches have already been submitted to the list, and are waiting
for inclusion upstream.
 dm-multipath-add_path_order_fix.patch
 dm-raid1-log_function_enhancement_and_name_change.patch
 dm-raid1-reset_sync_search_on_resume.patch
 dm-raid1-proper_suspend_fix.patch

This patch adds the ability to print that the log device has failed on
the status line.  This is useful to user-space when responding to
failures.
dm-raid1-log_fault_detection.patch

This next patch fixes something which may be controversial, which is why
it is not part of the previous patch.  When the log is resumed, it can
return an error if it failed to read the log device.  However, the
mirror can do nothing about it because the target resume function returns
'void'.  Since the mirror will proceed regardless of a log resume
failure, we have the log assume all regions are out-of-sync - just as
you'd expect from a mirror with no persistent log.
dm-raid1-log_fault_detection_part2.patch
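
As a rough illustration of the fallback described above (not the code
in the patch), here is a minimal user-space sketch.  The struct, the
function names, the fixed region count and the NULL-means-dead-device
convention are all made up for the example; only the behavior (resume
returns void, so a failed log read degrades to "everything is
out-of-sync") comes from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_REGIONS 8	/* hypothetical size, for illustration only */

/* Hypothetical in-memory view of a dirty log; not the kernel's struct. */
struct sketch_log {
	bool in_sync[NR_REGIONS];	/* true: region is believed in-sync */
	bool log_dev_failed;		/* remembered so status can report it */
};

/*
 * Stand-in for reading the on-disk log.  Returns 0 on success and a
 * negative value on failure; 'disk' is NULL to simulate a dead log device.
 */
static int sketch_read_log(struct sketch_log *log, const bool *disk)
{
	if (!disk)
		return -1;
	memcpy(log->in_sync, disk, sizeof(log->in_sync));
	return 0;
}

/*
 * Resume cannot report failure to its caller (it returns void), so on a
 * read error fall back to "nothing is in sync", which forces a full
 * resync, the same as a mirror with no persistent log.
 */
static void sketch_log_resume(struct sketch_log *log, const bool *disk)
{
	assert(log != NULL);
	if (sketch_read_log(log, disk) < 0) {
		memset(log->in_sync, 0, sizeof(log->in_sync));
		log->log_dev_failed = true;
	}
}
```

Keeping a 'log_dev_failed' flag around is one way the status line
could later report the failure to user-space.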

This patch is already in 2.6.19-rc4-mm1.
dm-raid1-status_line_fix.patch

This patch adds new options to the mirror mapping/constructor table.
Specifically, it adds the ability to specify the 'handle_errors'
feature.  Other features can be specified here in the future as well
(like 'async').  Note that this is a departure from having
'block_on_error' as a log argument, as was previously the case.  I
believe this feature is better kept in the mirroring code vs. the
logging code.
dm-raid1-features_addition_to_table.patch
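
To make the idea concrete, here is a small user-space sketch of
parsing a feature list from the tail of a table line.  It assumes the
features are given as "<#features> <feature>..." (for example
"1 handle_errors"); that layout, the flag bit and the function name
are assumptions for illustration, so check the patch itself for the
real format.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_HANDLE_ERRORS 0x1u	/* hypothetical flag bit */

/*
 * Parse "<#features> <feature>..." from argv.  Returns 0 and fills
 * *features on success, -1 on a malformed count or an unknown feature.
 * An empty tail (argc == 0) means "no features".
 */
static int sketch_parse_features(int argc, char **argv, unsigned *features)
{
	unsigned found = 0;
	char *end;
	long n;
	int i;

	assert(features != NULL);
	*features = 0;
	if (argc == 0)
		return 0;

	/* The count must be a plain number that matches what follows. */
	n = strtol(argv[0], &end, 10);
	if (end == argv[0] || *end != '\0' || n < 0 || n != argc - 1)
		return -1;

	for (i = 1; i <= n; i++) {
		if (strcmp(argv[i], "handle_errors") == 0)
			found |= SKETCH_HANDLE_ERRORS;
		else
			return -1;	/* unknown feature: reject the table */
	}

	*features = found;
	return 0;
}
```

A later feature such as 'async' would just be another strcmp() branch
and flag bit, without touching the log arguments.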

This patch was originally developed by NEC.  It ensures that if the
resync of a region fails, we don't mark the region clean; marking it
clean could allow reads from unsynced regions.  I've changed some things
to preserve backwards compatibility.  We still want correct behavior,
but if we are ignoring failures as before, we must still mark the
region's sync as finished.  (This is necessary for pvmove to complete
properly when moving off of a faulty device.)
dm-raid1-handle_resync_failures.patch
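
The decision described above can be sketched roughly as follows.  This
is one reading of the description, not the patch's code: the struct
and function are hypothetical, and "sync finished" is modeled as a
separate flag from "in-sync".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-region bookkeeping, for illustration only. */
struct sketch_region {
	bool in_sync;		/* the log's view of the region */
	bool recovery_done;	/* resync pass for this region has ended */
};

/*
 * Finish the resync of one region.
 *  - success:                 region becomes in-sync.
 *  - failure, handle_errors:  region stays out-of-sync, so reads are
 *                             not directed at stale data.
 *  - failure, legacy mode:    failure is ignored as before and the
 *                             region is treated as synced, so that
 *                             operations like pvmove can complete.
 * In every case the recovery pass itself is marked finished.
 */
static void sketch_complete_resync(struct sketch_region *r, bool success,
				   bool handle_errors)
{
	assert(r != NULL);
	r->recovery_done = true;
	r->in_sync = success || !handle_errors;
}
```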

This patch adds handling for the case where a mirror device dies on
write.  The log must reflect that the region is not properly in-sync.
We also add the ability to print the status of mirror devices - allowing
user-space to take appropriate action.  Again, it is important to
maintain backwards compatibility for those who didn't specify
'handle_errors' as a feature to the mirror.  Note that this patch is not
complete until the ability to requeue requests to device-mapper core is
added (more detail in patch header).
dm-raid1-write_fault_tolerance.patch
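
A minimal sketch of the bookkeeping described above, under stated
assumptions: a single region, made-up struct and function names, and a
one-character-per-device status encoding ('A' alive, 'D' dead) that is
purely hypothetical here.  The requeue path mentioned above is not
modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_MIRRORS 8	/* hypothetical limit */

struct sketch_mirror_set {
	size_t nr_mirrors;
	bool dev_failed[SKETCH_MAX_MIRRORS];
	bool region_in_sync;	/* a single region, for brevity */
	bool handle_errors;	/* feature flag from the table */
};

/* Record the result of a write to one mirror device. */
static void sketch_write_done(struct sketch_mirror_set *ms, size_t dev,
			      bool ok)
{
	assert(ms != NULL && dev < ms->nr_mirrors);
	if (ok || !ms->handle_errors)
		return;		/* legacy mode: failure is ignored */
	ms->dev_failed[dev] = true;
	ms->region_in_sync = false;	/* log must not claim in-sync */
}

/* Fill buf with one character per device: 'A' alive, 'D' dead. */
static void sketch_status(const struct sketch_mirror_set *ms, char *buf,
			  size_t len)
{
	size_t i;

	assert(ms != NULL && buf != NULL && len > ms->nr_mirrors);
	for (i = 0; i < ms->nr_mirrors; i++)
		buf[i] = ms->dev_failed[i] ? 'D' : 'A';
	buf[i] = '\0';
}
```

User-space could then parse such a status string to decide which
device to replace.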

This patch adds the ability to handle read failures.  That is, to choose
another device if the region is in-sync.
dm-raid1-read_fault_tolerance.patch
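
The selection rule above could look something like this.  It is a
user-space sketch with hypothetical names; how the real patch picks
among several healthy devices is not specified here, so the sketch
simply takes the first one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state, for illustration only. */
struct sketch_mirror {
	bool failed;
};

/*
 * After a read from device 'failed_idx' has failed, pick another
 * device to retry on.  Returns the index of the first other healthy
 * device, or -1 if the region is not in-sync (other copies may be
 * stale) or no healthy device remains.
 */
static int sketch_choose_read_mirror(const struct sketch_mirror *m,
				     size_t nr, size_t failed_idx,
				     bool region_in_sync)
{
	size_t i;

	assert(m != NULL && failed_idx < nr);
	if (!region_in_sync)
		return -1;
	for (i = 0; i < nr; i++)
		if (i != failed_idx && !m[i].failed)
			return (int)i;
	return -1;
}
```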

The following patches are incomplete, but should provide useful insight
for those interested.  (BTW, if anyone is really good with netlink, I'd
love the help.)
dm-raid1-add_cluster_ability.patch
dm-raid1-version-bump.patch
dm-raid1-cluster_logging.patch

More comments can be found in the patch headers.  All comments are
welcome.
 brassow

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.6.18.x-mirror_patches-11022006.tgz
Type: application/x-compressed-tar
Size: 20579 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20061102/6d92ec96/attachment.bin>

