[lvm-devel] [PATCH 1 of 5] LVM2: RAID design doc

Jonathan Brassow jbrassow at redhat.com
Tue Jul 5 20:41:23 UTC 2011


I've updated some sections - specifically the section on allocation.
 brassow

LVM2 RAID documentation file.

Index: LVM2/doc/lvm2-raid.txt
===================================================================
--- /dev/null
+++ LVM2/doc/lvm2-raid.txt
@@ -0,0 +1,333 @@
+=======================
+= LVM RAID Design Doc =
+=======================
+
+#############################
+# Chapter 1: User-Interface #
+#############################
+
+***************** CREATING A RAID DEVICE ******************
+
+01: lvcreate --type <RAID type> \
+02:	     [--region_size <size>] \
+03:	     [-i/--stripes <#>] [-I,--stripesize <size>] \
+04:	     [-m/--mirrors <#>] \
+05:	     [--[min|max]_recovery_rate <kB/sec/disk>] \
+06:	     [--stripe_cache <size>] \
+07:	     [--write_mostly <devices>] \
+08:	     [--max_write_behind <size>] \
+09:	     [[no]sync] \
+10:	     <Other normal args, like: -L 5G -n lv vg> \
+11:	     [devices]
+
+Line 01:
+I don't intend for there to be shorthand options for specifying the
+segment type.  The available RAID types are:
+	"raid0"  - Stripe [NOT IMPLEMENTED]
+	"raid1"  - should replace DM Mirroring
+	"raid10" - striped mirrors, [NOT IMPLEMENTED]
+	"raid4"  - RAID4
+	"raid5"  - Same as "raid5_ls" (Same default as MD)
+	"raid5_la" - RAID5 Rotating parity 0 with data continuation
+	"raid5_ra" - RAID5 Rotating parity N with data continuation
+	"raid5_ls" - RAID5 Rotating parity 0 with data restart
+	"raid5_rs" - RAID5 Rotating parity N with data restart
+	"raid6"    - Same as "raid6_zr"
+	"raid6_zr" - RAID6 Rotating parity 0 with data restart
+	"raid6_nr" - RAID6 Rotating parity N with data restart
+	"raid6_nc" - RAID6 Rotating parity N with data continuation
+The exception to 'no shorthand options' will be where the RAID implementations
+can displace traditional targets.  This is the case with 'mirror' and 'raid1'.
+In these cases, a switch will exist in lvm.conf allowing the user to specify
+which implementation they want.  When this is in place, the segment type is
+inferred from the argument, '-m' for example.
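+
+For illustration only, such a switch might look like the following in lvm.conf
+(the setting name and section are not final):
+
+	global {
+		# Use the MD-backed "raid1" segment type when '-m' is given
+		mirror_segtype_default = "raid1"
+	}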
+
+Line 02:
+Region size is relevant for all RAID types.  It defines the granularity at
+which the bitmap will track the active areas of the disk.  The default is
+currently 4MiB.  I see no reason to change this unless it is a problem for MD
+performance.
+MD does impose a restriction of 2^21 regions for a given device, however.  This
+means two things: 1) we should never need a metadata area larger than
+8kiB+sizeof(superblock)+bitmap_offset (IOW, pretty small) and 2) the region
+size will have to be upwardly revised if the device is larger than 8TiB
+(assuming defaults).
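+
+To make the 8TiB figure concrete: 2^21 regions * 4MiB/region = 8TiB, so any
+larger device needs a proportionally larger region size.  A minimal sketch of
+that check (the helper name and doubling policy are assumptions for
+illustration, not existing LVM2 code):
+
+	#include <stdint.h>
+
+	#define MAX_MD_REGIONS	(UINT64_C(1) << 21)	/* MD limit: 2^21 regions */
+	#define DEFAULT_REGION	UINT64_C(8192)		/* 4MiB in 512B sectors   */
+
+	/* Smallest power-of-two region size (in sectors) that keeps the
+	 * region count for 'dev_size' (in sectors) within the MD limit. */
+	static uint64_t raid_min_region_size(uint64_t dev_size)
+	{
+		uint64_t region = DEFAULT_REGION;
+
+		while (dev_size / region > MAX_MD_REGIONS)
+			region *= 2;
+
+		return region;
+	}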
+
+Line 03/04:
+The '-m/--mirrors' option is only relevant to RAID1 and will be used just like
+it is today for DM mirroring.  For all other RAID types, -i/--stripes and
+-I/--stripesize are relevant.  The former will specify the number of data
+devices that will be used for striping.  For example, if the user specifies
+'--type raid0 -i 3', then 3 devices are needed.  If the user specifies
+'--type raid6 -i 3', then 5 devices are needed.  The -I/--stripesize may be
+confusing to MD users, as they use the term "chunksize".  I think they will
+adapt without issue and I don't wish to create a conflict with the term
+"chunksize" that we use for snapshots.
+
+Line 05/06/07/08:
+I'm still not clear on how to specify these options.  Some are easier than
+others.  '--write-mostly' is particularly hard because it involves specifying
+which devices shall be 'write-mostly' and thus, also have 'max-write-behind'
+applied to them.  I also welcome suggestions on exactly how the options should
+appear (--stripe_cache, --stripe-cache, or --stripecache).  It has been
+suggested that a '--read-mostly'/'--read-favored' or similar option could be
+introduced as a way to specify a primary disk vs. specifying all the non-primary
+disks via '--write-mostly'.  I like this idea, but haven't come up with a good
+name yet.  Thus, these will remain unimplemented until future specification.
+
+Line 09/10/11:
+These are familiar.
+
+Further creation related ideas:
+Today, you can specify '--type mirror' without needing an '-m/--mirrors'
+argument.  The number of devices defaults to two (and the log defaults to
+'disk').  A similar thing should happen with the RAID types.  All of them
+should default to having two data devices unless otherwise specified.  This
+would mean a total of 2 devices for RAID 0/1, 3 devices for RAID 4/5,
+and 4 devices for RAID 6/10.  There should be two ways to change this:
+    1) By specifying -i/--stripes or -m/--mirrors
+    2) By simply specifying [devices] (line 11).
+If the user is specifying the devices to use at creation time, that should
+be enough information, along with the RAID type, to infer the total number
+of devices.
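+
+As a (hypothetical) illustration of those defaults and of inferring the count
+from [devices]:
+
+	lvcreate --type raid6 -L 5G -n lv vg			# defaults: 2 data + 2 parity = 4 devices
+	lvcreate --type raid5 -L 5G -n lv vg /dev/sd[bcde]1	# 4 PVs given: 3 stripes + 1 parity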
+
+Another "nice to have" would be automatic VG creation, but it's understandable
+if this won't happen.  There could be a default VG name that is modeled after
+the LV name chosen.  Alternatively, a system default VG could be extended when
+a create is performed and the LV could be added to that.  When devices are
+specified, there are really only three conditions:
+	   1) the devices are associated with no VG
+	   2) the devices are associated with different VGs
+	   3) the devices are associated with one VG
+We can do as described above in the case of #1, #2 is an error, and #3 can
+just proceed with the VG name revealed by the devices.  This could simplify
+all the PV, VG, and LV steps of creating a RAID device down to just one very
+short command:
+
+$> lvcreate --type raid1 /dev/sd[bc]1
+
+Note that even the size can be inferred from the above command.  Simply use
+all the space on the devices provided!
+
+If we chose to have a system default VG instead of modeling the name after
+the LV, then creating a snapshot of this RAID volume would again be simple:
+
+$> lvcreate -s <VG/LV> /dev/sdd1
+
+
+***************** CONVERTING A RAID DEVICE ******************
+
+01: lvconvert [--type <RAID type>] \
+02:	      [-R/--regionsize <size>] \
+03:	      [-i/--stripes <#>] [-I,--stripesize <size>] \
+04:	      [-m/--mirrors <#>] \
+05:	      [--splitmirrors <#>] \
+06:	      [--replace <sub_lv|device>] \
+07:	      [--[min|max]_recovery_rate <kB/sec/disk>] \
+08:	      [--stripe_cache <size>] \
+09:	      [--write_mostly <devices>] \
+10:	      [--max_write_behind <size>] \
+11:	      vg/lv
+12:	      [devices]
+
+lvconvert should work exactly as it does now when dealing with mirrors -
+even if (when) we switch to MD RAID1.  Of course, there are no plans to
+make the presence of the metadata area configurable (e.g. --corelog).
+It will be simple enough to detect whether the LV being up/down-converted
+uses new or old-style mirroring.
+
+If we choose to use MD RAID0 as well, it will be possible to change the
+number of stripes and the stripesize.  It is therefore conceivable to see
+something like, 'lvconvert -i +1 vg/lv'.
+
+Line 01:
+It is possible to change the RAID type of an LV - even if that LV is already
+a RAID device of a different type.  For example, you could change from
+RAID4 to RAID5 or RAID5 to RAID6.
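+
+For example (a hypothetical invocation, following the syntax above):
+
+	lvconvert --type raid6 vg/lv
+
+would convert an existing RAID5 LV to RAID6, with the extra parity device
+allocated from the VG.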
+
+Line 02/03/04/05:
+These are familiar options - all of which would now be available as options
+for change.  (However, it'd be nice if we didn't have regionsize in there.
+It's simple on the kernel side, but is just an extra - often unnecessary -
+parameter to many functions in the LVM codebase.)
+
+Line 06:
+This option allows the user to specify a sub_lv (e.g. a mirror image) or
+a particular device for replacement.  The device (or all the devices in
+the sub_lv) will be removed and replaced with different devices from the
+VG.
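+
+A hypothetical invocation following the syntax above, where the image residing
+on /dev/sdb1 is rebuilt on /dev/sdg1 (with no trailing device list, the
+replacement would come from any suitable free space in the VG):
+
+	lvconvert --replace /dev/sdb1 vg/lv /dev/sdg1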
+
+Line 07/08/09/10:
+It should be possible to alter these parameters of a RAID device.  As with
+lvcreate, however, I'm not entirely certain how to best define some of these.
+We don't need all the capabilities at once though, so it isn't a pressing
+issue.
+
+Line 11:
+The LV to operate on.
+
+Line 12:
+Devices that are to be used to satisfy the conversion request.  If the
+operation removes devices or splits a mirror, then the devices specified
+form the list of candidates for removal.  If the operation adds or replaces
+devices, then the devices specified form the list of candidates for allocation.
+
+
+
+###############################################
+# Chapter 2: LVM RAID internal representation #
+###############################################
+
+The internal representation is somewhat like mirroring, but with alterations
+for the different metadata components.  LVM mirroring has a single log LV,
+but RAID will have a metadata LV for each data device.  Because of this, I've
+added a second areas list, 'meta_areas', to 'struct lv_segment'.  There is
+exactly a one-to-one relationship between 'areas' and 'meta_areas'.  The
+'areas' array
+still holds the data sub-lv's (similar to mirroring), while the 'meta_areas'
+array holds the metadata sub-lv's (akin to the mirroring log device).
+
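+An abridged sketch of the relevant 'struct lv_segment' members (only the
+fields discussed here are shown; the real structure contains many more):
+
+	#include <stdint.h>
+
+	struct lv_segment_area;				/* declared elsewhere */
+
+	struct lv_segment {
+		/* ... */
+		uint32_t area_count;			/* number of data areas (sub-lv's)  */
+		struct lv_segment_area *areas;		/* data sub-lv's, e.g. foo_rimage_N */
+		struct lv_segment_area *meta_areas;	/* metadata sub-lv's, foo_rmeta_N,  */
+							/* one-to-one with 'areas'          */
+		/* ... */
+	};
+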
+The sub_lvs will be named '%s_rimage_%d' instead of '%s_mimage_%d' as it is
+for mirroring, and '%s_rmeta_%d' instead of '%s_mlog'.  Thus, you can imagine
+an LV named 'foo' with the following layout:
+foo
+[foo's lv_segment]
+|
+|-> foo_rimage_0 (areas[0])
+|   [foo_rimage_0's lv_segment]
+|-> foo_rimage_1 (areas[1])
+|   [foo_rimage_1's lv_segment]
+|
+|-> foo_rmeta_0 (meta_areas[0])
+|   [foo_rmeta_0's lv_segment]
+|-> foo_rmeta_1 (meta_areas[1])
+|   [foo_rmeta_1's lv_segment]
+
+LVM Meta-data format
+--------------------
+The RAID format will need to be able to store parameters that are unique to
+RAID and unique to specific RAID sub-devices.  It will be modeled after that
+of mirroring.
+
+Here is an example of the mirroring layout:
+lv {
+	id = "agL1vP-1B8Z-5vnB-41cS-lhBJ-Gcvz-dh3L3H"
+	status = ["READ", "WRITE", "VISIBLE"]
+	flags = []
+	segment_count = 1
+
+	segment1 {
+		start_extent = 0
+		extent_count = 125      # 500 Megabytes
+
+		type = "mirror"
+		mirror_count = 2
+		mirror_log = "lv_mlog"
+		region_size = 1024
+
+		mirrors = [
+			"lv_mimage_0", 0,
+			"lv_mimage_1", 0
+		]
+	}
+}
+
+The real trick is dealing with the metadata devices.  Mirroring has an entry,
+'mirror_log', in the top-level segment.  This won't work for RAID because there
+is a one-to-one mapping between the data devices and the metadata devices.  The
+mirror devices are laid out in sub-device/le pairs.  The 'le' parameter is
+redundant since it will always be zero.  So for RAID, I have simply put the
+metadata and data devices in pairs without the 'le' parameter.
+
+RAID metadata:
+lv {
+	id = "EnpqAM-5PEg-i9wB-5amn-P116-1T8k-nS3GfD"
+	status = ["READ", "WRITE", "VISIBLE"]
+	flags = []
+	segment_count = 1
+
+	segment1 {
+		start_extent = 0
+		extent_count = 125      # 500 Megabytes
+
+		type = "raid1"
+		device_count = 2
+		region_size = 1024
+
+		raids = [
+			"lv_rmeta_0", "lv_rimage_0",
+			"lv_rmeta_1", "lv_rimage_1",
+		]
+	}
+}
+
+The metadata also must be capable of representing the various tunables.  We
+already have a good example for one from mirroring, region_size.
+'max_write_behind', 'stripe_cache', and '[min|max]_recovery_rate' could also
+be handled in this way.  However, 'write_mostly' cannot, because it is a
+characteristic associated with the sub_lvs, not the array as a whole.  In
+these cases, the status field of the sub-lv's themselves will hold these
+flags - the meaning being only useful in the larger context.
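+
+A hypothetical example of how this could look in the metadata (the keyword and
+flag names here are illustrative, not final):
+
+	segment1 {
+		...
+		type = "raid1"
+		device_count = 2
+		region_size = 1024
+		max_write_behind = 256
+
+		raids = [
+			"lv_rmeta_0", "lv_rimage_0",
+			"lv_rmeta_1", "lv_rimage_1",
+		]
+	}
+
+	# and on the write-mostly sub-lv itself:
+	lv_rimage_1 {
+		status = ["READ", "WRITE", "WRITE_MOSTLY"]
+		...
+	}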
+
+Another important thing to mention is that we are running out of space for
+'lv->status' additions.  We are /very/ close to having our 64 slots filled.
+So, for RAID, RAID_IMAGE, and RAID_META, I am reusing some VG-only flag slots
+for these LV status values.  I'm not sure if there are further slots for
+"WRITE_MOSTLY" or other such values that would go in lv->status.  I may need to
+reuse more flags, or expand the way flags are handled.
+
+New Segment Type(s)
+-------------------
+I've created a new file 'lib/raid/raid.c' that will handle the various
+RAID types.  While there will be a unique segment type for each RAID variant,
+they will all share a common backend - the same segtype_handler functions and
+segtype->flags = SEG_RAID.
+
+I'm also adding a new field to 'struct segment_type', parity_devs.  For every
+segment_type except RAID4/5/6, this will be 0.  This field facilitates
+allocation and size calculations.  For example, the lvcreate for RAID5 would
+look something like:
+~> lvcreate --type raid5 -L 30G -i 3 -n my_raid5 my_vg
+or
+~> lvcreate --type raid5 -n my_raid5 my_vg /dev/sd[bcdef]1
+
+In the former case, the stripe count (3) and device size are computed, and
+then 'segtype->parity_devs' extra devices are allocated of the same size.  In
+the latter case, the number of PVs is determined and 'segtype->parity_devs' is
+subtracted off to determine the number of stripes.
+
+This should also work in the case of RAID10, and doing things in this manner
+should not affect the way size is calculated via the area_multiple.
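+
+A minimal sketch of how parity_devs feeds the device-count math in the two
+cases above (the struct and helper names are stand-ins for illustration, not
+the real LVM2 declarations):
+
+	#include <stdint.h>
+
+	struct segtype_example {
+		const char *name;
+		uint32_t parity_devs;	/* 0 except: 1 for raid4/5, 2 for raid6 */
+	};
+
+	/* -i/--stripes given: total number of devices to allocate */
+	static uint32_t total_devs(const struct segtype_example *s, uint32_t stripes)
+	{
+		return stripes + s->parity_devs;
+	}
+
+	/* explicit PV list given: infer the stripe count */
+	static uint32_t stripes_from_pvs(const struct segtype_example *s, uint32_t pv_count)
+	{
+		return pv_count - s->parity_devs;
+	}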
+
+Allocation
+----------
+When a RAID device is created, metadata LVs must be created along with the
+data LVs that will ultimately compose the top-level RAID array.  For the
+foreseeable future, each metadata LV must reside on the same device as its
+data LV (or at least on one of the devices that compose it).  We use this
+property to simplify the allocation process.  Rather than allocating for the
+data LVs and then asking for a small chunk of space on the same device (or
+the other way around), we simply ask for the aggregate size of the data LV
+plus the metadata LV.  Once we have the space allocated, we divide it between
+the metadata and data LVs.  This also greatly simplifies the process of
+finding parallel space for all the data LVs that will compose the RAID array.
+When a RAID device is resized, we will not need to take the metadata LV into
+account, because it will already be present.
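+
+As a (hypothetical) concrete illustration: for a 2-way raid1 of 125 extents
+with a 1-extent metadata LV per image, the allocator would ask for two
+parallel areas of 126 extents each; the first extent of each area then
+becomes lv_rmeta_N and the remaining 125 extents become lv_rimage_N.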
+
+Apart from the metadata areas, the other unique characteristic of RAID
+devices is the parity device count.  The number of parity devices does not
+affect the calculation of size-per-device.  The 'area_multiple' means nothing
+here.  The parity devices will simply be the same size as all the other
+devices and will also require a metadata LV (i.e. they are treated no
+differently than the other devices).
+
+Therefore, to allocate space for RAID devices, we need to know two things:
+1) how many parity devices are required and 2) whether an allocated area needs
+to be split out for the metadata LVs after finding the space to fill the
+request.  We simply add these two fields to the 'alloc_handle' data structure,
+'parity_count' and 'alloc_and_split_meta'.  These two fields are set in
+'_alloc_init': 'segtype->parity_devs' holds the number of parity devices and
+can be copied directly to 'ah->parity_count', and 'alloc_and_split_meta' is
+set when a RAID segtype is detected and 'metadata_area_count' has been
+specified.  With these two variables set, we can calculate how many allocated
+areas we need.  Also, the routines that find the actual space stop not when
+they have found ah->area_count areas, but when they have found
+(ah->area_count + ah->parity_count).
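+
+A minimal sketch of the two fields and the stopping condition described above
+(the struct and function names are illustrative stand-ins, not the real
+'alloc_handle' declaration):
+
+	#include <stdint.h>
+
+	struct alloc_handle_example {
+		uint32_t area_count;		/* data areas requested             */
+		uint32_t parity_count;		/* copied from segtype->parity_devs */
+		int alloc_and_split_meta;	/* RAID + metadata_area_count:      */
+						/* allocate data+meta together,     */
+						/* then split the metadata LV out   */
+	};
+
+	/* The area-finding routines stop only once data + parity are satisfied. */
+	static int enough_areas(const struct alloc_handle_example *ah, uint32_t found)
+	{
+		return found >= (ah->area_count + ah->parity_count);
+	}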
+




