[Linux-cluster] Announce: Higher level cluster raid

Daniel Phillips phillips at redhat.com
Sun Mar 27 00:40:54 UTC 2005


Hi all,

OK, it took quite a bit longer to get basic throughput numbers than I 
had hoped, because I ran into a nasty performance bug and thought I'd 
better do something about it first.  This is solved, and performance
is now looking pretty good.

I don't have a suitable high-performance cluster at hand, so I am 
working with a SCSI array on a local machine.  However, the full 
cluster synchronization stack is used, including a cluster-aware 
persistent dirty log (it remembers which regions had in-flight writes, 
in order to limit post-crash resyncing).  Though I have five disks 
available, I used one for the persistent dirty log, so I am restricted 
to an order-one ddraid array of three disks at the moment.  One of the 
disks is a dedicated parity disk, so the best I could hope for is a 
factor-of-two throughput increase versus a single raw disk.

For larger transfer sizes, throughput does in fact double compared to 
raw IO to one of the SCSI disks.  For small transfer sizes, the 
overhead of parity calculations, dirty logging and bio cloning becomes 
relatively large versus raw disk IO, so breakeven occurs at about a 
64K transfer size.  Below that, a single raw disk is faster; above it, 
the ddraid array is faster.  With 1 MB transfers, the ddraid array is 
nearly twice as fast.
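
For anyone who wants to reproduce this kind of comparison, the 
measurement is just timed sequential IO at a given transfer size, run 
against the array device and against a raw disk.  A rough harness 
along these lines does the job (the device path and sizes are 
placeholders, and this is a simplified stand-in rather than the actual 
test setup):

    /* Rough throughput probe: sequential O_DIRECT writes of a given
     * block size, reporting MB/s.  Path and sizes are placeholders. */

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            const char *dev = argc > 1 ? argv[1] : "/dev/mapper/ddraid0";
            size_t bs = argc > 2 ? (size_t)atol(argv[2]) : 64 * 1024;
            size_t total = 256UL * 1024 * 1024;  /* bytes to write */
            size_t done = 0;
            struct timeval t0, t1;
            void *buf;
            int fd;

            fd = open(dev, O_WRONLY | O_DIRECT);
            if (fd < 0) {
                    perror(dev);
                    return 1;
            }
            if (posix_memalign(&buf, 4096, bs)) {
                    perror("posix_memalign");
                    return 1;
            }
            memset(buf, 0, bs);

            gettimeofday(&t0, NULL);
            while (done < total) {
                    if (write(fd, buf, bs) != (ssize_t)bs) {
                            perror("write");
                            return 1;
                    }
                    done += bs;
            }
            fsync(fd);
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) +
                          (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("%zu byte transfers: %.1f MB/s\n",
                   bs, done / secs / 1e6);
            return 0;
    }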

I tried various combinations of disabling the parity calculations and 
dirty logging to see how the write overheads break down:

   No persistent dirty log, no parity calculations:
      Tie at 8K
      Almost twice as fast at 32K and above

   Parity calculations, no persistent dirty log:
      Tie between 8K and 16K

   Persistent dirty log, no parity calculations:
      Tie at 16K

   Parity calculations and persistent dirty log:
      Tie at 64K

We see from this that dirty logging is the biggest overhead, which is 
no surprise.  After that, the overheads of parity calculation and 
basic bookkeeping seem about the same.  The parity calculations can 
easily be optimized; the bio bookkeeping overhead will be a little 
harder.  There are probably a few tricks remaining to reduce the dirty 
log overhead.
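
To make the "easily optimized" claim concrete: parity here is just the 
XOR of the corresponding data chunks, and doing the XOR a machine word 
at a time instead of byte by byte is the obvious first step.  An 
illustrative sketch (chunk size and names are made up, not the actual 
ddraid code):

    /* XOR parity across the data chunks of one stripe, one word at a
     * time.  For the three-disk array above there are two data chunks
     * per parity chunk.  Sizes and names are illustrative only. */

    #include <stddef.h>

    #define CHUNK_BYTES 4096
    #define CHUNK_WORDS (CHUNK_BYTES / sizeof(unsigned long))

    static void compute_parity(unsigned long *parity,
                               unsigned long *const data[], int ndata)
    {
            size_t w;
            int d;

            for (w = 0; w < CHUNK_WORDS; w++) {
                    unsigned long x = data[0][w];

                    for (d = 1; d < ndata; d++)
                            x ^= data[d][w];
                    parity[w] = x;
            }
    }

Checking parity on read, mentioned in the notes below, is the same XOR 
plus a comparison against the stored parity chunk.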

The main point, though, is that even before a lot of optimization, 
performance looks good enough for production use.  Some loads will be a 
little worse, some loads will perform dramatically better.  Over time, 
I suspect various optimizations will reduce the per-transfer overhead 
considerably, so that the array is always faster.

Whether or not the array is faster, it certainly is more redundant than 
a raw disk.  This was my primary objective, and increased performance 
is just a nice fringe benefit.  Of course, with no other cluster raid 
implementation to compete with, this is by default the fastest one :-)

Performance notes:

  - Scatter/gather worries turned out to be unfounded.  The SCSI
    controller handles the thousands of sg regions I throw at it per
    second without measurable overhead.  Just in case, I investigated
    the overhead of copying the IO into linear chunks via memcpy,
    which turned out to be small but noticeable.  GNBD and iSCSI do
    network scatter/gather, which I haven't tested yet, but I would be
    surprised if there is a problem.

  - Read transfers have no dirty log overhead.

  - For reliability, I check parity on read.  Later this will be an
    option, so read transfers don't necessarily have parity overhead
    either.

  - I think I may be able to increase read throughput to N times single
    disk throughput, instead of N-1.

  - I haven't determined where the bookkeeping overhead comes from,
    but I suspect most of it can be eliminated.

  - The nasty performance bug: releasing a dirty region immediately
    after all writes to it complete is a bad idea that really hammers
    the performance of back-to-back writes.  There is now a timer
    delay on release for each region, which cures this (see the
    sketch after these notes).
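
The shape of that fix, reduced to an illustrative sketch (the 
structure, hold-off value and names are made up, not the actual code): 
the last write completion on a region only timestamps it, and a 
periodic sweep clears regions that have stayed idle past the hold-off, 
so a stream of back-to-back writes never has to re-dirty and 
re-persist the same region.

    /* Delayed release of dirty regions: completing the last write on
     * a region only timestamps it; a periodic sweep clears regions
     * that have stayed idle past a hold-off.  Names and the hold-off
     * value are illustrative, not the actual ddraid code. */

    #include <time.h>

    #define HOLDOFF_SECS 1          /* assumed delay before clearing */

    struct region {
            int pending;            /* writes still in flight */
            int dirty;              /* bit persisted in the dirty log */
            time_t idle_since;      /* when the last write completed */
    };

    /* Called when a write to the region completes. */
    static void write_done(struct region *r)
    {
            if (--r->pending == 0)
                    r->idle_since = time(NULL);  /* do not clear yet */
    }

    /* Called periodically (from a timer in the real code). */
    static void sweep_region(struct region *r)
    {
            if (r->dirty && r->pending == 0 &&
                time(NULL) - r->idle_since >= HOLDOFF_SECS) {
                    r->dirty = 0;
                    /* the real code would update the persistent log */
            }
    }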

This code is also capable of running an n-way cluster mirror, but that 
feature is broken at the moment.  I'll restore it next week and we can 
check mirror performance as well.

I need to add degraded-mode IO and parity reconstruction before this 
is actually useful, which I must put off until after LCA.  Code 
cleanup is in progress and should land in CVS Monday or Tuesday.  More 
benchmark numbers and hopefully some pretty charts are on the way.

Regards,

Daniel



