Opteron Vs. Athlon X2

Bryan J. Smith b.j.smith at ieee.org
Sat Dec 10 12:46:18 UTC 2005


On Sat, 2005-12-10 at 01:09 -0500, Mark Hahn wrote:
> actually, the CPU PIO's a command packet to the controller 
> which gives the command, size and pointer(s) to the data.
> the CPU never touches the actual data.  IDE, net and scsi 
> controllers are all broadly along these lines.

Yes, that's DMA.  Yes, the CPU sends a PIO command.  I didn't realize
you were getting so anal on this.

I meant that the CPU is _not_ doing PIO for the data transfer.  When you
do IDE/EIDE, as well as the ATA PIO modes 0-4, you _are_ doing PIO for
the data transfer.  Only when you are doing ATA DMA is it not.

When you use software RAID-3/4/5/6 (anything with an XOR), it very much
_is_ doing PIO because _every_ single byte goes through the CPU.
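
To make that concrete, here's a minimal sketch of my own (not the
actual Linux MD code; the real thing works a machine word or SIMD
register at a time, but the memory traffic is the same idea):

    /* Illustration only: XOR parity for one stripe of an n-disk RAID-5. */
    #include <stddef.h>

    void raid5_parity(unsigned char *parity,
                      unsigned char * const data[],  /* the n-1 data chunks */
                      size_t ndata, size_t chunk_bytes)
    {
            size_t i, d;

            for (i = 0; i < chunk_bytes; i++) {
                    unsigned char p = 0;
                    for (d = 0; d < ndata; d++)
                            p ^= data[d][i];  /* load every single data byte */
                    parity[i] = p;            /* store every parity byte */
            }
    }

Every data byte is a load into the CPU and every parity byte is a store
back out.  That's the traffic I'm lumping in with PIO.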

> the driver just says "hey, write this blob of 128K at offset X".

Yes, when it does the DMA transfer, I don't disagree.

> just two "hey..." commands.  the source data (in host ram)
> is not replicated.

But *2x* the load on your interconnect.  Can you at least agree with me
on that one?
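
Rough back-of-envelope, my own numbers, for the best case of a
full-stripe write on a 5-disk array with 64K chunks, caching aside:

    Hardware RAID, 256K of user data:
      data DMAed host RAM -> controller              = 256K
      (parity computed on the controller)

    Software RAID, same write:
      CPU loads 4 x 64K of data for the XOR          = 256K
      CPU stores 1 x 64K of parity                   =  64K
      data + parity DMAed out to the controller(s)   = 320K

The same user data crosses the host memory system at least twice, plus
the parity on top.  That's my "2x."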

> no, not really.

Not really?  100% of the data is slammed from memory into the CPU --
every single byte is worked on by the CPU _before_ the parity can be
calculated.  That is Programmed I/O; I don't know how it could be any
more PIO than that!

That's why I said consider software RAID XORs to be PIO!  ;->

> well, the default would be 4x64k (and would consume <100 us
> assuming no cache hits.)

In any case, you're _not_ slamming 6.4GBps through that CPU for
XORs.  ;->  If you think so, you obviously don't know the first thing
about how general-purpose, superscalar microprocessors work.

> again, probably smaller pieces, each with a separate command packet
> to the controller(s).  don't forget the reads necessary for 
> sub-stripe or non-stripe-aligned writes.

He he, I was trying to give you the "best case scenario."  But yes, if
you have to read all the way back through I/O from disk, it gets 10+
times worse.  ;->
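
For anyone following along, the standard back-of-envelope for a small,
sub-stripe RAID-5 write (the classic read-modify-write case) is:

    read old data block, read old parity block         = 2 reads
    new parity = old parity XOR old data XOR new data
    write new data block, write new parity block       = 2 writes

One logical write turns into two reads and two writes plus the XORs,
and with software RAID every one of those blocks passes back through
the host CPU and memory.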

Thanx for being honest and forthcoming in that regard.  ;->

> which is a trivial concern, since on any commodity system, the host
> is faster and has more available memory bandwidth than the IO system
> can manage in the first place.  we're talking <100 us for the 5-disk,
> 64K chunk MD.

Once again, you're _not_ going to be able to even remotely slam 6.4GBps
for XORs through that CPU.  ;->

> I think that was garbled - using HW raid saves 1/(n-1) or 20%
> of the bus bandwidth.  which is not a scarce commodity any more.

You just totally ignored my entire discussion on what you have to push
through the CPU!  A stream of data through the ALU of the CPU -- a
general-purpose part, _not_ an ASIC peripheral designed to do that one
thing and do it well.  That ASIC is exactly what you _would_ have on a
hardware RAID controller.

Even at its "measly" 300-400MHz, that ASIC would do the job a heck of a
lot more efficiently, without tying up the system.

That's why Intel is putting its IOP-ASIC XScale processors into the
southbridge of its future server designs.  Because there's a lot to be
gained by off-loading very simple, specialized operations from the
general-purpose microprocessor, which handles them far less efficiently.

> for the xor.  which takes negligable time, even assuming no cache hits.

The XOR -- FETCH/DECODE/EXECUTE -- yes, to a point.
The LOAD/STORE?  I'm sorry, I think not.

Microprocessors are not designed to work on I/O efficiently.
Microcontrollers with specific peripheral ASICs are.

They can easily best a general microprocessor 10:1 or better, MHz for
MHz.  That's why Intel has several lines of XScale processors, not just
one "general" one.

> for a normal dual-opteron server, sure it can.  r5 parity is actually
> easier than Stream's triad, which a single opteron can run at >4.5 GB/s.
> stream's reporting ignores write-allocate traffic, but MD does its 
> writes through the cache, therefore can hit 6 GB/s.

Assuming it's all in the L1 cache, _maybe_.  Even at 2.6GHz, with a
3-issue ALU, and assuming the XOR operations can (effectively) be
completed once per clock cycle as SIMD operations, that's wishful
thinking.
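
For what it's worth, my own back-of-envelope, assuming one 16-byte SSE
XOR per clock with the operands already sitting in L1:

    2.6GHz x 16 bytes/cycle  ~=  41GBps of raw XOR operand throughput

So no, the ALU itself is not the problem.  The problem is that real
stripe data is not sitting in L1: every byte XORed has to be loaded
over, and every parity byte stored back over, the same 6.4GBps *peak*
memory interface that also carries the DMA of that very data down to
the controller(s), plus whatever else the server is doing.  That's the
LOAD/STORE side I was talking about.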

> it _could_be_ much faster.  just as there's no question that GPU's
> _can_ do 3d graphics transforms faster than the host.  whether it makes 
> sense is a very different question, mainly cost.  why should I spend 
> 20% extra on my fileservers to leave cycles waste?

But what if you're not wasting cycles?
Not wasting precious interconnect?

That's the main point I've been making -- what if you're slamming so
much higher layer network traffic for services that your CPU and
interconnect are already very busy?

I don't disagree with you if you have a web server or something.  But on
a database or file server, no, I'm putting in an IOP (RAID-5) or custom
ASIC (RAID-10).

> especially since I could spend that money on more capacity, ram,
> slightly more "enterpris-y" disks, etc.

Nope, I see a balance for my database and file server applications.

> you bet.  the p3 was in .2-.8 GB/s range, and disks of that era
> were around 30 MB/s apiece.  now the host's memory bandwidth is 12x 
> higher, but disks are only about 2x.

All the more reason to commit data to disk ASAP, or at least get it
into the controller's battery-backed DRAM or capacitor-backed SRAM, in
case of a system failure.

> GPUs make sense if you're using your GPU to its limit a lot.
> certainly if you're a gamer, this is true.  does a high-end GPU make a lot 
> of sense for a generic desktop?  no, actually, it doesn't - transparent windows
> is not a great justification for a $500 GPU.

But we're not talking about transparent windows, we're talking about 3D,
even if simple 3D!  That's what you have when you start doing RAID-5.

> so they have something to point at to justify higher prices.

Sigh, that's argumentative and I'm not that dumb.

> but the main point is that the price will be only a very little bit higher,
> since transistors are practically free these days (bridges are 
> probably pad-limited anyway).  no where near your 20% increase to system price.

And that's a good thing: it brings the cost down by making it a commodity.
But that's also a sign that it's useful in the first place.  ;->

> if it can be had for near-zero marginal cost, sure.

In some applications where your CPU and interconnect aren't fully used,
I agree, software RAID is fine from a performance standpoint.

But when 20% gains you a 30+% improvement in database and file server
performance -- not just a "disk I/O" benchmark, but how quickly I can
serve 100+ clients -- I'm going to spend the dough!

> oh yeah right: "if it's a server, it's needs to be gold plated."
> but for a 2ch controller, 250+250 really is plenty of bandwidth.

At 250+250, you're slowing down much of your interconnect just to burst
the data through.  Remember, just because you have a 6.4GBps system
interconnect does _not_ mean it will take less than 5% of it to send
250MBps down it.  ;->

Or are you not familiar with how HT-to-PCIe bridges work?  ;->

> I'm a "price/measured performance whore".  that's why I like MD
> and dumb SATA controllers.

And if you're doing software RAID, I 100% agree!  You don't want to put
in a hardware RAID controller where the ASIC or IOP is "in the way."

> efficient?  that's the beauty of commoditized hardware: you get 
> 6 GB/s per opteron whether you need it or not.

You get 6.4GBps *IDEAL* burst.  You do *NOT* slam 6.4GBps of XORs
through the CPU.  Nor does a burst to a 250MBps downstream I/O bridge
take less than 5% of your interconnect's time.  ;->

C'mon, I assumed you were smarter than that.  ;->

> it's certainly
> tidier to have the controller perform the XOR, but since the host's
> CPU is faster and will otherwise sit idle, well...

But what if your CPU and interconnect aren't?
Again, I'm talking about heavy data manipulation here.

> MD has had excellent on-disk compatibility (afaikr, but only two 
> superblock versions).  LVM is irrelevant.

So you just use MD on "raw," legacy BIOS/DOS partitions?
That's just a recipe for disaster when someone removes a disk.

> hmm, imagine seeing support traffic on a list devoted to support issues.

I've been on the MD lists a long, long time.

> interesting - 3ware people say that until the 95xx's, their designs 
> were crippled by lack of onboard memory bandwidth.

No, it had _nothing_ to do with memory bandwidth.
It had to do with memory _size_!

Now you're just showing your ignorance.
That's the problem with second-hand hearsay.
I'm used to it with software RAID advocates.  ;->

3Ware 7000/8000 series cards have 1-4MB of SRAM, _not_ DRAM.
Do you know the first thing about how SRAM differs from DRAM?

It's the same reason why your layer-2 Ethernet switch can resolve MAC
addresses at wire-speeds, _unlike_ a PC that does bridging.

Same deal with 3Ware's Escalade 7000/8000.  That's how it can replicate
and resolve striping/mirroring at full volume-set speed.  That's why
they are called "storage switches."

But it sucks at XOR buffering, because it has only 1-4MB of SRAM.
That's why the 9500S and, now, the 9550SX add 128+MB of DRAM.
Just like other "buffering" RAID cards.

At this point, I'm not responding any further.  It's clear that you
don't want to see my points, even though I see many of yours.  And
you're making arguments that are either not founded on accurate
information or just "GHz/GBps whoring" everything.


-- 
Bryan J. Smith   mailto:b.j.smith at ieee.org
http://thebs413.blogspot.com
------------------------------------------
Some things (or athletes) money can't buy.
For everything else there's "ManningCard."




