
[Linux-cluster] Announce: Higher level cluster raid

Hi all,

A number of participants on this list have speculated about the 
possibility of running higher level raid on a cluster, to provide lower 
redundancy overhead and greater throughput than simple cluster 
mirroring.  This 
is a natural idea, and I am pleased to be able to say that it is now a 
reality here on my workbench.

This project started life as a cluster mirror back before Jon decided to 
try extending the existing dm-raid1 driver to work on a cluster.  It 
got mothballed at that time, but luckily I had already promised a 
cluster raid paper for LCA next month.  With that deadline looming, I 
unmothballed it and went to work.  To add spice, I thought I would try 
to extend it to a higher raid level.

Well, raid 5 is really hard to implement under the best of conditions, 
but on a cluster it gets really crazy due to caching issues[1].  Raid 2 
and 3 are generally considered impractical to implement in software 
because of the necessary per-byte parity handling.  Fortunately, there 
is a variant I investigated a couple of years ago and published a paper 
on, that I call raid 3.5.  This approach restricts the number of disks 
in an array to 2**k+1 and divides each filesystem block into a binary 
number of fragments which are striped across the array.  You can think 
of this as sub-block striping, as opposed to raid 4 and 5 which use 
multi-block striping.  The big advantage of raid 3.5 over raid 4 or 5 
is that each parity block does not tie together multiple data blocks, 
so caching can be avoided.  Each data block can be written directly to 
or read directly from the array, along with its associated parity.
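
To make the layout concrete, here is a minimal sketch of the sub-block 
striping idea in C.  This is purely illustrative, not the ddraid code: 
the function and buffer names, and the 4K block size, are my 
assumptions.

    #include <string.h>

    enum { BLOCK_SIZE = 4096 };

    /* Split one 4K block into 2**k fragments, one per data member,
     * and give the last member the XOR parity of the fragments.
     * members = 2**k + 1; frags[] holds one buffer per member. */
    void stripe_block(const unsigned char *block, int members,
                      unsigned char *frags[])
    {
            int frag_size = BLOCK_SIZE / (members - 1);
            int i, j;

            memset(frags[members - 1], 0, frag_size);
            for (i = 0; i < members - 1; i++) {
                    memcpy(frags[i], block + i * frag_size, frag_size);
                    for (j = 0; j < frag_size; j++)
                            frags[members - 1][j] ^= frags[i][j];
            }
    }

Note that writing a block means writing every fragment, parity 
included, in one shot; no old data ever needs to be read back first.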

Here is a nice picture of how raid 3 works:

    [raid 3 diagram link missing]

Raid 3.5 is much like that, except for the way data blocks are striped 
across disks, which is designed to make it easy to implement in 
software.
On a single node, the main attraction of raid 3.5 is that it is much 
easier to implement than raid 4 or 5.  However, on a cluster raid 3.5 
really comes into its own: decoupling the parity blocks from unrelated 
data blocks is a fundamental advantage.  Per-block synchronization, 
which would otherwise be complex and costly, becomes unnecessary.  More 
importantly, there is no need to read other blocks of a parity set 
before updating, which would create a horrible new level of data 
caching and cluster synchronization issues.

The main theoretical disadvantage of raid 3.5 versus raid 5 is lower 
transaction throughput, because each filesystem transaction involves 
all disks of the array, whereas raid 5 can sometimes seek in parallel 
for parallel filesystem transactions.  This problem disappears on a 
cluster because you can fill each cluster data node with many disks 
and stripe them together, yielding the same transaction parallelization 
characteristics as raid 4 or 5.

Another disadvantage of raid 3.5 is the limited number of supported 
array configurations.  Each 4K filesystem block can only be divided 
into eight pieces before hitting the 512 byte Linux sector addressing 
limit.  This means that for practical purposes, raid 3.5 can only have 
3, 5 or 9 members[2].  Well, that is not a big problem when you 
consider that, until now, you could not do cluster level data 
redundancy at all, other than mirroring.
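
For anyone checking the arithmetic behind that limit (4K filesystem 
blocks, 512 byte sectors):

    members   data disks   fragment size = 4096 / data disks
       3          2          2048 bytes
       5          4          1024 bytes
       9          8           512 bytes   (sector limit reached)
      17         16           256 bytes   (below 512, not possible)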

Now to put this in perspective.  The availability of higher level 
cluster raid gives us a new type of cluster, a "distributed data" 
cluster, as opposed to what GFS currently offers, which I call a 
"shared data" cluster.  With a distributed data cluster, you can unplug 
one of the data nodes and the cluster will continue running without any 
data loss and with negligible loss of performance.

Ironically, many naive observers believe that this is exactly the way a 
cluster filesystem is supposed to work, and we must disappoint them by 
explaining that you actually need to share a disk.  Well, now we have 
the means to cater to that common fantasy and avoid the disappointment.

Note that the number of data nodes in a distributed data cluster is 
independent from the total number of nodes.  You could, for example, 
have a 100 node cluster with 5 data nodes.  This is superior to running 
the cluster off a mirrored disk because storage redundancy is reduced 
from 50% to 20% and maximum read throughput is increased from 2X to 4X.  
Unlike a mirror, raid 3.5 increases maximum write throughput as well.
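
To spell out those numbers: with 5 members, one of which holds parity, 
the storage overhead is 1/5 = 20% versus 1/2 = 50% for a mirror, and a 
full-stripe read comes off the 4 data members in parallel, hence 4X, 
versus at most 2X for a two-way mirror.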

It turned out that raid 3.5 was quite easy to graft on top of my cluster 
mirror prototype, which only took a couple of days.  Except for one 
thing: handling multi-page bio transfers is really tedious.  Going from 
single page bio transfers to multipage transfers has kept me busy for 
the better part of a week, and I am still putting the finishing touches 
on it.  But the thing is, now a single bio can handle a massive amount 
of contiguous filesystem data.  Even better, the parity calculations 
for all the eensy weensy little block fragments line up nicely and 
coalesce into contiguous runs inside each multipage bio.  Though you 
would think raid 3.5 makes a real hash of transfer contiguity, actually 
it does not.  In essence, I finesse the problem onto the scatter gather 
hardware[3].

I have tested up to 1 megabyte contiguous transfers, which works very 
well[4].  I have not yet had time to test throughput in a realistic 
environment, so I cannot report performance numbers today, but 
hopefully soon.

Code will be in cvs on Wednesday, I think.  This is a new project, 
tentatively called "ddraid" where dd means "distributed data".  A 
project page will appear in the fullness of time.  And of course that 
LCA paper, which I will present next month in Canberra.

The current code is production quality in a number of interesting 
respects, particularly the main read/write path.  I have implemented 
read-with-verify on the read path, where each read transaction checks 
the parity fragments to ensure nothing has gone wrong in hardware or 
software.  This works nicely, and I wonder why raid implementations do 
not typically provide such an option.
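
The verify step amounts to something like the following sketch (again 
illustrative, with made-up names, not the actual ddraid read path):

    /* Recompute the XOR of the data fragments just read and compare
     * it against the parity fragment read alongside them. */
    int verify_parity(unsigned char *const frags[], int members,
                      int frag_size)
    {
            int i, j;

            for (j = 0; j < frag_size; j++) {
                    unsigned char check = 0;

                    for (i = 0; i < members - 1; i++)
                            check ^= frags[i][j];
                    if (check != frags[members - 1][j])
                            return -1;      /* corruption detected */
            }
            return 0;
    }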

I built the ddraid implementation using the same template (and code) as 
the cluster snapshot, which saved an immense amount of time and 
hopefully will save maintenance time as well.  Anybody who is interested 
in how that works can find reasonably up to date documentation here:


Like the cluster snapshot, ddraid has a userspace synchronization 
server.  And like any cluster server (userspace or not) this raises 
difficult memory deadlock issues.  Hopefully, after LCA I can sit down 
and solve these issues definitively for these and other cluster 
servers.
Due to time pressure, I must leave some work hanging until after LCA:

  * Degraded read (reconstruct from parity)
  * Degraded write
  * Parity reconstruction (currently can only reconstruct raid 1)
  * Some aspects of failover

OK, thanks a lot for reading this far, time to get back to work.  I will 
provide further details in a couple of days.

[1] To update one block of a raid 4 or 5 parity set, at least two parity 
set members must first be read (new parity = old parity XOR old data 
XOR new data) and other nodes must be prevented from modifying these 
before the update is completed.

[2] Two member raid 3.5 is equivalent to raid 1.

[3] It will be interesting to see how well the scatter gather hardware 
holds up under the load of tens of thousands of tiny transfer fragments 
per second.

[4] The only real limit to transfer size is the size of a region in the 
dirty log, which persistently records which parts of the data array 
need to be reconstructed if the system fails at any given instant.  
Typical dirty log region size runs from a few KB to a few MB.


