[linux-lvm] Power loss consistency for RAID

Mon Mar 18 12:38:39 UTC 2019

On Sun, 17 Mar 2019, Zheng Lv wrote:

> I'm recently considering using software RAID instead of hardware controllers 
> for my home server.
>
> AFAIK, a write operation on a RAID array is not atomic across disks. I'm 
> concerned about what happens to RAID1/5/6/10 LVs after power loss.
>
> Is manual recovery required, or is it automatically checked and repaired on 
> LV activation?
>
> Also I'm curious about how such recovery works internally.

I use md raid1 and raid10.  I recommend that instead of the LVM RAID,
which is newer.  Create your RAID volumes with md, and add them as PVs:

   PV         VG      Fmt  Attr PSize   PFree
   /dev/md1   vg_span lvm2 a--u 214.81g      0
   /dev/md2   vg_span lvm2 a--u 214.81g  26.72g
   /dev/md3   vg_span lvm2 a--u 249.00g 148.00g
   /dev/md4   vg_span lvm2 a--u 252.47g 242.47g

Note that you do not need matching drives, as you do with hardware RAID.
You can add disks and mix and match partitions of the same size on drives
of differing sizes.  LVM RAID does this automatically; with md you have to
manually assign partitions to block devices.  There are very few (large)
partitions to assign, so it is a pleasant human sized exercise.
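
A minimal sketch of that manual assignment, in Python.  It only builds
and prints the commands; nothing is run.  The helper name, /dev/md5 and
the partition names are made up for illustration.  The options used
(mdadm --create, --level, --raid-devices, then pvcreate and vgextend)
are the usual ones, but check the man pages before running anything.

```python
import shlex


def md_pv_commands(md_dev, level, partitions, vg):
    """Return the shell commands to build one md array and add it to a VG.

    Nothing is executed here; the caller decides what to do with the strings.
    """
    if len(partitions) < 2:
        raise ValueError("a mirrored array needs at least two partitions")
    create = ["mdadm", "--create", md_dev,
              f"--level={level}",
              f"--raid-devices={len(partitions)}", *partitions]
    return [
        shlex.join(create),
        shlex.join(["pvcreate", md_dev]),
        shlex.join(["vgextend", vg, md_dev]),
    ]


# Hypothetical example: three same-sized partitions on three different drives.
for cmd in md_pv_commands("/dev/md5", 10,
                          ["/dev/sda3", "/dev/sdb3", "/dev/sdc2"], "vg_span"):
    print(cmd)
```

Printing first and pasting by hand keeps a human in the loop, which
suits a step that is done a handful of times per machine.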

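While striping and mirroring schemes like raid0, raid1, and raid10 are
actually faster with software RAID, I avoid RAID schemes with
read-modify-write (RMW) cycles like raid5 - you really need the hardware
(battery-backed write cache) for those, since a power loss in the middle
of an RMW cycle can leave data and parity inconsistent.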

I use raid1 when the filesystem needs to be readable without the md 
driver - as with /boot.  Raid10 provides striping as well as mirroring,
with however many drives you have (I usually have 3 or 4).

Here is a brief overview of MD recovery and diagnostics.  Someone else
will have to fill in the mechanics of LVM raid.

Md keeps a version (an event count) in the superblock of each device in a
logical md drive.  If the counts differ at assembly, as they can after a
power loss, it marks the older leg as failed and replaced (and begins to
sync it).  In newer superblock formats, it also keeps a write-intent
bitmap so that it can sync only possibly modified areas.
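
To make the idea concrete, here is a toy model in Python.  It is not
how the md driver is written; the Leg class, the assemble function and
the chunk lists are all invented for illustration.  It only shows the
two ideas above: the leg with the highest count wins, and a bitmap
limits how much has to be copied.

```python
from dataclasses import dataclass, field


@dataclass
class Leg:
    name: str
    events: int                      # stand-in for the superblock version
    chunks: list = field(default_factory=list)


def assemble(legs, dirty_chunks=None):
    """Pick the freshest leg and resync the stale ones from it.

    dirty_chunks models a bitmap: the set of chunk indexes that may have
    been written while a leg was out of date.  None means "no bitmap",
    so every chunk is copied.
    Returns {stale leg name: number of chunks copied}.
    """
    newest = max(legs, key=lambda leg: leg.events)
    copied = {}
    for leg in legs:
        if leg.events == newest.events:
            continue                 # already current
        if dirty_chunks is None:
            todo = list(range(len(newest.chunks)))
        else:
            todo = sorted(dirty_chunks)
        for i in todo:
            leg.chunks[i] = newest.chunks[i]
        leg.events = newest.events
        copied[leg.name] = len(todo)
    return copied


# Hypothetical mirror: sdb1 missed the last write to chunk 2.
a = Leg("sda1", events=42, chunks=["A", "B", "C2", "D"])
b = Leg("sdb1", events=41, chunks=["A", "B", "C1", "D"])
print(assemble([a, b], dirty_chunks={2}))
print(b.chunks == a.chunks)
```

Without the bitmap argument the same call copies all four chunks, which
is the full resync you get on the older superblock formats.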

Once a week (configurable), a scrub script compares the legs (on most
distros; the script is raid-check on Red Hat style systems and checkarray
on Debian style ones).  If it encounters a read error on either drive, it
immediately syncs that block from the good drive.  This reassigns the
sector on modern drives.  (On ancient drives, a write error on resync
marks the drive as failed.)  If for some reason (there are legitimate ones
involving write optimizations for SWAP volumes and such) the two legs do
not match, it keeps a count (mismatch_cnt); a "repair" pass, as opposed
to a "check" pass, also arbitrarily copies one leg to the other.
(IMO it should also log the block offset so that I can occasionally check
that the mismatch occurred in an expected volume.)
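
For the diagnostics side, a small sketch of reading that count.  The
kernel exposes it per array in sysfs as /sys/block/mdX/md/mismatch_cnt,
next to sync_action (idle, check, repair, resync, ...).  The function
names and the sysfs_root parameter are my own; the parameter exists
only so the code can be pointed at a test directory.

```python
from pathlib import Path


def md_scrub_status(md_name, sysfs_root="/sys/block"):
    """Return (sync_action, mismatch_cnt) for one md array, e.g. "md1"."""
    md_dir = Path(sysfs_root) / md_name / "md"
    action = (md_dir / "sync_action").read_text().strip()
    mismatches = int((md_dir / "mismatch_cnt").read_text().strip())
    return action, mismatches


def all_md_status(sysfs_root="/sys/block"):
    """Return {array name: (sync_action, mismatch_cnt)} for arrays found."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}                    # not Linux, or sysfs not mounted
    status = {}
    for dev in sorted(root.glob("md*")):
        md_dir = dev / "md"
        # Arrays without redundancy (raid0, linear) may lack these files.
        if (md_dir / "sync_action").is_file() and (md_dir / "mismatch_cnt").is_file():
            status[dev.name] = md_scrub_status(dev.name, sysfs_root)
    return status


print(all_md_status())
```

A nonzero count right after a check on a volume holding swap is the
expected case mentioned above; a nonzero count anywhere else is worth a
closer look.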

-- 
 	      Stuart D. Gathman <stuart at gathman.org>
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.