
[Linux-cluster] Announce: Higher level cluster raid

Hi all,

A number of participants on this list have speculated about the 
possibility of running higher level raid on a cluster, to provide lower 
redundancy overhead and greater throughput than simple cluster 
mirroring.  This 
is a natural idea, and I am pleased to be able to say that it is now a 
reality here on my workbench.

This project started life as a cluster mirror back before Jon decided to 
try extending the existing dm-raid1 driver to work on a cluster.  It 
got mothballed at that time, but luckily I had already promised a 
cluster raid paper for LCA next month.  With that deadline looming, I 
unmothballed it and went to work.  To add spice, I thought I would try 
to extend it to a higher raid level.

Well, raid 5 is really hard to implement under the best of conditions, 
but on a cluster it gets really crazy due to caching issues[1].  Raid 2 
and 3 are generally considered impractical to implement in software 
because of the necessary per-byte parity handling.  Fortunately, there 
is a variant I investigated a couple of years ago and published a paper 
on, that I call raid 3.5.  This approach restricts the number of disks 
in an array to 2**k+1 and divides each filesystem block into a binary 
number of fragments which are striped across the array.  You can think 
of this as sub-block striping, as opposed to raid 4 and 5 which use 
multi-block striping.  The big advantage of raid 3.5 over raid 4 or 5 
is that each parity block does not tie together multiple data blocks, 
so caching can be avoided.  Each data block can be written directly to 
or read directly from the array, along with its associated parity.
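
To make the layout concrete, here is a minimal sketch of the sub-block 
striping idea in C.  This is purely illustrative, not the ddraid code: 
the function and buffer names, and the 4K block size, are my 
assumptions.

    #include <string.h>

    enum { BLOCK_SIZE = 4096 };

    /* Split one 4K block into 2**k fragments, one per data member,
     * and give the last member the XOR parity of the fragments.
     * members = 2**k + 1; frags[] holds one buffer per member. */
    void stripe_block(const unsigned char *block, int members,
                      unsigned char *frags[])
    {
            int frag_size = BLOCK_SIZE / (members - 1);
            int i, j;

            memset(frags[members - 1], 0, frag_size);
            for (i = 0; i < members - 1; i++) {
                    memcpy(frags[i], block + i * frag_size, frag_size);
                    for (j = 0; j < frag_size; j++)
                            frags[members - 1][j] ^= frags[i][j];
            }
    }

Note that writing a block means writing every fragment, parity 
included, in one shot; no old data ever needs to be read back first.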

Here is a nice picture of how raid 3 works:

    [raid 3 diagram link missing]

Raid 3.5 is much like that, except for the way data blocks are striped 
across disks, which is designed to make it easy to implement in 
software.
On a single node, the main attraction of raid 3.5 is that it is much 
easier to implement than raid 4 or 5.  However, on a cluster raid 3.5 
really comes into its own: decoupling the parity blocks from unrelated 
data blocks is a fundamental advantage.  Per-block synchronization, 
which would otherwise be complex and costly, becomes unnecessary.  More 
importantly, there is no need to read other blocks of a parity set 
before updating, which would create a horrible new level of data 
caching and cluster synchronization issues.

The main theoretical disadvantage of raid 3.5 versus raid 5 is lower 
transaction throughput, because each filesystem transaction involves 
all disks of the array, whereas raid 5 can sometimes seek in parallel 
for parallel filesystem transactions.  This problem disappears on a 
cluster because you can fill each cluster data node with many disks 
and stripe them together, yielding the same transaction parallelization 
characteristics as raid 4 or 5.

Another disadvantage of raid 3.5 is the limited number of supported 
array configurations.  Each 4K filesystem block can only be divided 
into eight pieces before hitting the 512 byte Linux sector addressing 
limit.  This means that for practical purposes, raid 3.5 can only have 
3, 5 or 9 members[2].  Well, that is not a big problem when you 
consider that, until now, you could not do cluster level data 
redundancy at all, other than mirroring.
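
For anyone checking the arithmetic behind that limit (4K filesystem 
blocks, 512 byte sectors):

    members   data disks   fragment size = 4096 / data disks
       3          2          2048 bytes
       5          4          1024 bytes
       9          8           512 bytes   (sector limit reached)
      17         16           256 bytes   (below 512, not possible)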

Now to put this in perspective.  The availability of higher level 
cluster raid gives us a new type of cluster, a "distributed data" 
cluster, as opposed to what GFS currently offers, which I call a 
"shared data" cluster.  With a distributed data cluster, you can unplug 
one of the data nodes and the cluster will continue running without any 
data loss and with negligible loss of performance.

Ironically, many naive observers believe that this is exactly the way a 
cluster filesystem is supposed to work, and we must disappoint them by 
explaining that you actually need to share a disk.  Well, now we have 
the means to cater to that common fantasy and avoid the disappointment.

Note that the number of data nodes in a distributed data cluster is 
independent from the total number of nodes.  You could, for example, 
have a 100 node cluster with 5 data nodes.  This is superior to running 
the cluster off a mirrored disk because storage redundancy is reduced 
from 50% to 20% and maximum read throughput is increased from 2X to 4X.  
Unlike a mirror, raid 3.5 increases maximum write throughput as well.
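
To spell out those numbers: with 5 members, one of which holds parity, 
the storage overhead is 1/5 = 20% versus 1/2 = 50% for a mirror, and a 
full-stripe read comes off the 4 data members in parallel, hence 4X, 
versus at most 2X for a two-way mirror.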

It turned out that raid 3.5 was quite easy to graft on top of my cluster 
mirror prototype, which only took a couple of days.  Except for one 
thing: handling multi-page bio transfers is really tedious.  Going from 
single page bio transfers to multipage transfers has kept me busy for 
the better part of a week, and I am still putting the finishing touches 
on it.  But the thing is, now a single bio can handle a massive amount 
of contiguous filesystem data.  Even better, the parity calculations 
for all the eensy weensy little block fragments line up nicely and 
coalesce into contiguous runs inside each multipage bio.  Though you 
would think raid 3.5 makes a real hash of transfer contiguity, actually 
it does not.  In essence, I finesse the problem onto the scatter gather 
hardware[3].

I have tested up to 1 megabyte contiguous transfers, which works very 
well[4].  I have not yet had time to test throughput in a realistic 
environment, so I cannot report performance numbers today, but 
hopefully soon.

Code will be in cvs on Wednesday, I think.  This is a new project, 
tentatively called "ddraid" where dd means "distributed data".  A 
project page will appear in the fullness of time.  And of course that 
LCA paper, which I will present next month in Canberra.

The current code is production quality in a number of interesting 
respects, particularly the main read/write path.  I have implemented 
read-with-verify on the read path, where each read transaction checks 
the parity fragments to ensure nothing has gone wrong in hardware or 
software.  This works nicely, and I wonder why raid implementations do 
not typically provide such an option.
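
The verify step amounts to something like the following sketch (again 
illustrative, with made-up names, not the actual ddraid read path):

    /* Recompute the XOR of the data fragments just read and compare
     * it against the parity fragment read alongside them. */
    int verify_parity(unsigned char *const frags[], int members,
                      int frag_size)
    {
            int i, j;

            for (j = 0; j < frag_size; j++) {
                    unsigned char check = 0;

                    for (i = 0; i < members - 1; i++)
                            check ^= frags[i][j];
                    if (check != frags[members - 1][j])
                            return -1;      /* corruption detected */
            }
            return 0;
    }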

I built the ddraid implementation using the same template (and code) as 
the cluster snapshot, which saved an immense amount of time and 
hopefully will save maintenance time as well.  Anybody who is interested 
in how that works can find reasonably up to date documentation here:


Like the cluster snapshot, ddraid has a userspace synchronization 
server.  And like any cluster server (userspace or not) this raises 
difficult memory deadlock issues.  Hopefully, after LCA I can sit down 
and solve these issues definitively for these and other cluster 
servers.
Due to time pressure, I must leave some work hanging until after LCA:

  * Degraded read (reconstruct from parity)
  * Degraded write
  * Parity reconstruction (currently can only reconstruct raid 1)
  * Some aspects of failover

OK, thanks a lot for reading this far, time to get back to work.  I will 
provide further details in a couple of days.

[1] To update one block of a raid 4 or 5 parity set, at least two parity 
set members must first be read (new parity = old parity XOR old data 
XOR new data) and other nodes must be prevented from modifying these 
before the update is completed.

[2] Two member raid 3.5 is equivalent to raid 1.

[3] It will be interesting to see how well the scatter gather hardware 
holds up under the load of tens of thousands of tiny transfer fragments 
per second.

[4] The only real limit to transfer size is the size of a region in the 
dirty log, which persistently records which parts of the data array 
need to be reconstructed if the system fails at any given instant.  
Typical dirty log region size runs from a few KB to a few MB.


