Opteron Vs. Athlon X2

Mark Hahn hahn at physics.mcmaster.ca
Sat Dec 10 06:09:34 UTC 2005


> > In the hardware RAID case:  100MB read from ram -> CPU
> > copies it to the I/O space of the controller ->

actually, the CPU PIO's a command packet to the controller 
which gives the command, size and pointer(s) to the data.
the CPU never touches the actual data.  IDE, net and scsi 
controllers are all broadly along these lines.
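
schematically, the command packet is just something like this - the field
names are made up and every controller defines its own layout, but the
point is that it carries addresses, not payload:

    /* purely illustrative command descriptor -- real IDE/SCSI/net
       controllers each have their own layout, but the shape is roughly
       this: an opcode plus pointers to the data sitting in host ram */
    #include <stdint.h>

    struct io_command {
        uint8_t  opcode;        /* READ or WRITE */
        uint64_t lba;           /* starting sector on the device */
        uint32_t nbytes;        /* total transfer size */
        struct {
            uint64_t addr;      /* physical address of a chunk of host ram */
            uint32_t len;
        } sg[16];               /* scatter/gather list: pointers, never data */
    };

the controller then DMAs the payload to/from those addresses on its own.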

> > controller calculated raid-5 checksums -> 125 MB is written
> > to the disks.
> 
> No, the CPU is virtually not involved other than to
> command/queue -- _no_ programmed I/O (PIO), only Direct
> Memory Access (DMA).  100MB is read from RAM and written
> directly as 100MB to memory mapped I/O (which is the block

the driver just says "hey, write this blob of 128K at offset X".

> > Software RAID:
> > 100MB read from ram -> cpu copies and checksums 125 MB to
> > the controller -> controller writes 125 MB to the disks.
> 
> For mirroring, it's straight-forward (at least still DMA,
> just a redundant write):  
> 
> 100MB is read from RAM and written to two different memory
> mapped I/O (which is the block device) by the PCI-X or PCIe
> DMA controller.

just two "hey..." commands.  the source data (in host ram)
is not replicated.
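
with the kind of made-up descriptor sketched above, a mirrored write is
just this (schematic, hypothetical submit()):

    /* raid-1 write, schematically: two commands, one buffer.  the 128K
       payload is never copied -- both disks DMA from the same host pages. */
    #include <stdint.h>

    struct io_command { uint8_t opcode; uint64_t lba, buf_phys; uint32_t nbytes; };
    enum { CMD_WRITE = 1 };
    static void submit(int disk, const struct io_command *cmd)
    { (void)disk; (void)cmd; /* would queue the command to that disk's HBA */ }

    void mirror_write(uint64_t buf_phys, uint64_t lba)
    {
        struct io_command cmd = { CMD_WRITE, lba, buf_phys, 128 * 1024 };
        submit(0, &cmd);        /* "hey, write this blob at offset X" */
        submit(1, &cmd);        /* same blob, second disk */
    }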

> For RAID-5, it's a little more interesting, it's PIO:  

no, not really.

> The CPU reads in 100MB from RAM and calculates XOR, writing

well, the default would be 4x64K (and would consume <100 us
assuming no cache hits).

> the XOR parity calculated to memory -- e.g., 25MB for a 5
> disk  RAID-5.  Then the 125MB is read from RAM and written
> directly as 125MB to memory mapped I/O by the PCI-X or PCIe
> DMA controller.

again, probably smaller pieces, each with a separate command packet
to the controller(s).  don't forget the reads necessary for 
sub-stripe or non-stripe-aligned writes.
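
(the parity math for that read-modify-write is just new parity = old
parity xor old data xor new data -- a minimal sketch:)

    /* raid-5 partial-stripe update: read the old data and old parity,
       fold the change into the parity, write both back.  the extra
       reads are why sub-stripe writes hurt. */
    #include <stddef.h>
    #include <stdint.h>

    void rmw_parity(uint8_t *parity, const uint8_t *old_data,
                    const uint8_t *new_data, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }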

> ANAL NOTE:  The software RAID-5 would commit a fraction of
> the data as a fraction of the XOR is calculated and stored in
> memory, and wouldn't wait until all XORs have been calculated
> and stored in memory.  But still, the XOR operation is
> programmed I/O, requiring that parity not be committed until
> it has been calculated by the CPU.

which is a trivial concern, since on any commodity system, the host
is faster and has more available memory bandwidth than the IO system
can manage in the first place.  we're talking <100 us for the 5-disk,
64K chunk MD.
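
to put a number on it: 4 chunks read plus 1 written is 5 x 64K = 320KB of
memory traffic per stripe, roughly 50 us at 6 GB/s.  the kernel is nothing
more than this (simplified -- MD's real xor routines are unrolled and use
SSE/prefetch, but the traffic is the same):

    /* parity for one 64K-chunk stripe of a 5-disk raid-5:
       4 x 64K loaded, 64K stored -> ~320KB of memory traffic,
       i.e. ~50 us at 6 GB/s of sustained bandwidth. */
    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK (64 * 1024)

    void xor_parity(uint64_t *p, const uint64_t *d0, const uint64_t *d1,
                    const uint64_t *d2, const uint64_t *d3)
    {
        for (size_t i = 0; i < CHUNK / sizeof(uint64_t); i++)
            p[i] = d0[i] ^ d1[i] ^ d2[i] ^ d3[i];
    }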

> For mirror, you push 2x over the interconnect, but at least
> it's still 100% DMA (no CPU overhead).
> 
> For RAID-5, you only push 1/(N-1) over the interconnect
> (e.g., 1/(5-1) = 1/4th = 25% for a 5-disc RAID-5), but you

I think that was garbled - for the 5-disk example, SW raid pushes 125MB
over the bus where HW raid pushes 100MB, so HW saves the 25MB of parity:
an extra 1/(n-1) = 25% on top of the data, or 20% of the total.  either
way, bus bandwidth is not a scarce commodity any more.

> push the _entire_ amount of data through the CPU for that
> extra write.

for the xor.  which takes negligible time, even assuming no cache hits.

> > I just checked my fileserver, it can RAID-5 checksum at
> > 7.5GB/sec.  So yes one cpu would be slightly more busy,
> > just a few %.
> 
> I'm sorry, but the 3-issue ALU of the Opteron can_not_ do 7.5
> billion LOAD (the slowest part), FETCH-DECODE-XOR (the most
> simplistic part) and STOR operations per second!

for a normal dual-opteron server, sure it can.  r5 parity is actually
easier than Stream's triad, which a single opteron can run at >4.5 GB/s.
stream's reporting ignores write-allocate traffic, but MD does its 
writes through the cache, therefore can hit 6 GB/s.
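
(for reference, stream's triad is the loop below -- two loads, a
multiply-add and a store per element -- so the parity loop asks no more of
the memory system:)

    /* STREAM triad: a[i] = b[i] + s*c[i].  like the xor loop, it is
       limited by memory bandwidth, not by the ALUs. */
    #include <stddef.h>

    void triad(double *a, const double *b, const double *c, double s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }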

> It's much faster and far more efficient to do XOR with a
> dedicated ASIC or ASIC peripheral on a superscalar I/O
> processor that is in-line and far closer to the actual
> storage channels.

it _could_be_ much faster.  just as there's no question that GPUs
_can_ do 3d graphics transforms faster than the host.  whether it makes 
sense is a very different question, mainly one of cost.  why should I spend 
20% extra on my fileservers just to let host cycles go to waste?  especially 
since I could spend that money on more capacity, ram, slightly more 
"enterpris-y" disks, etc.

> When I started deploying some of my first ServerWorks
> ServerSet III chipset mainboards about 5-6 years ago for
> P3/Xeon, I saw significant gains with 3Ware cards as well as
> StrongARM-based SCSI RAID cards at RAID-10 over software RAID

you bet.  the p3 was in the .2-.8 GB/s range, and disks of that era
were around 30 MB/s apiece.  now the host's memory bandwidth is 12x 
higher, but disks are only about 2x.

> And video cards have dedicated Graphics Processor Unit (GPU)
> processors that manipulate data far better than any vector
> processing on a CPU.

GPUs make sense if you're using your GPU to its limit a lot.
certainly if you're a gamer, this is true.  does a high-end GPU make a lot 
of sense for a generic desktop?  no, actually, it doesn't - transparent windows
are not a great justification for a $500 GPU.

> LOAD-FETCH/DECODE/XOR-STOR.  Why do you think Intel is
> putting its XScale logic in forthcoming bridges?

so they have something to point at to justify higher prices.
but the main point is that the price will be only a little bit higher,
since transistors are practically free these days (bridges are 
probably pad-limited anyway).  nowhere near your 20% increase to system price.

> Intel learned long ago that processing with local memory
> closer to the end-device is going to be far higher performing
> because of no redundant copying/processing, reduced latency,
> etc...

if it can be had for near-zero marginal cost, sure.

> > If you want cheap I'd switch to software RAID.  I've seen
> > pci-e 2 channel controllers for $60 or so.  Or just get a new
> > motherboard; getting 8 ports on the motherboard is fairly easy
> > on a $100-$150 motherboard.
> 
> PCIe x1?  No thanx, not for a server.

oh yeah right: "if it's a server, it needs to be gold plated."

but for a 2ch controller, 250+250 MB/s really is plenty of bandwidth.

> BTW, with regards to the 333MHz, no offense, but you're what
> us semiconductor design engineers call a "MHz whore."

I'm a "price/measured performance whore".  that's why I like MD
and dumb SATA controllers.

> Maybe it's because I've spent several years of my career
> designing memory and bus controllers at the layout level, but
> there is a _huge_ difference between a CPU and a
> microcontroller with ASICs designed specifically for
> something.  In the case of the IOP33x superscalar ARM
> XScales, they are very much designed to efficiently put a
> data stream to many disks.

efficient?  that's the beauty of commoditized hardware: you get 
6 GB/s per opteron whether you need it or not.  it's certainly
tidier to have the controller perform the XOR, but since the host's
CPU is faster and will otherwise sit idle, well...

> Excuse me?  MD has changed several times between 2.2, 2.4 and
> yet again with 2.6.  LVM2 is a major problem, with massive

MD has had excellent on-disk compatibility (afair, only two 
superblock versions).  LVM is irrelevant.

> No offense, but if I had a dime for every time I saw someone
> on the MD or LVM support lists say "this should work" and
> then they had to come back and say, "yeah, you'll have to
> re-create that" I'd be a very, very rich man.

hmm, imagine seeing support traffic on a list devoted to support issues.

> I've avoided the 9500S and the 9550SX is too new for me to
> consider.  But for high-performance RAID-10, the 7000/8000
> are just absolutely dreamy -- and have been for almost 5
> years.

interesting - 3ware people say that until the 95xx's, their designs 
were crippled by lack of onboard memory bandwidth.



