Opteron Vs. Athlon X2
Bryan J. Smith
b.j.smith at ieee.org
Sat Dec 10 12:46:18 UTC 2005
On Sat, 2005-12-10 at 01:09 -0500, Mark Hahn wrote:
> actually, the CPU PIO's a command packet to the controller
> which gives the command, size and pointer(s) to the data.
> the CPU never touches the actual data. IDE, net and scsi
> controllers are all broadly along these lines.
Yes, that's DMA. Yes, the CPU sends a PIO command. I didn't realize
you were getting so anal on this.
I meant that the CPU is _not_ doing PIO for the data transfer. With
legacy IDE/EIDE, as well as the ATA PIO modes 0-4, you _are_ doing PIO
for the data transfer. Only when you're doing ATA DMA is the CPU out of
the data path.
When you use software RAID-3/4/5/6 (anything with an XOR), it very much
_is_ doing PIO because _every_ single byte goes through the CPU.
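To make that concrete, here's a minimal sketch (illustrative only, not
the actual md kernel code) of what software RAID-5 parity generation has
to do -- note that every byte of every data chunk gets loaded into the CPU:

```python
# Illustrative sketch of software RAID-5 parity generation -- NOT the
# actual md kernel code. The point: the CPU must load every byte of
# every data chunk before it can compute the parity chunk.
def xor_parity(chunks):
    """XOR all chunks together; the result is the RAID-5 parity chunk."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):  # every single byte goes through the CPU
            parity[i] ^= byte
    return bytes(parity)

# 4 data chunks -> 1 parity chunk, as on a 5-disk RAID-5 stripe
data = [bytes([d] * 8) for d in (0x0F, 0xF0, 0xAA, 0x55)]
parity = xor_parity(data)

# XOR is its own inverse: the parity plus any 3 chunks recovers the 4th
assert xor_parity(data[1:] + [parity]) == data[0]
```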
> the driver just says "hey, write this blob of 128K at offset X".
Yes, when it does the DMA transfer, I don't disagree.
> just two "hey..." commands. the source data (in host ram)
> is not replicated.
But it's *2x* the load on your interconnect. Can you at least grant me
that one?
> no, not really.
Not really? 100% of the data is slammed from memory into the CPU --
every single byte is worked on by the CPU _before_ the parity can be
calculated. That is programmed I/O; I don't know how much more PIO it
could get!
That's why I said consider software RAID XORs to be PIO! ;->
> well, the default would be 4x64k (and would consume <100 us
> assuming no cache hits.)
In any case, you're _not_ slamming 6.4GBps through that CPU for
XORs. ;-> If you think so, you obviously don't know the first thing
about how general-purpose, superscalar microprocessors work.
> again, probably smaller pieces, each with a separate command packet
> to the controller(s). don't forget the reads necessary for
> sub-stripe or non-stripe-aligned writes.
He he, I was trying to give you the "best case scenario." But yes, if
you have to read all the way back through I/O from disk, it gets 10+
times worse. ;->
Thanks for being honest and forthcoming in that regard. ;->
> which is a trivial concern, since on any commodity system, the host
> is faster and has more available memory bandwidth than the IO system
> can manage in the first place. we're talking <100 us for the 5-disk,
> 64K chunk MD.
Once again, you're _not_ going to be able to even remotely slam 6.4GBps
for XORs through that CPU. ;->
> I think that was garbled - using HW raid saves 1/(n-1) or 20%
> of the bus bandwidth. which is not a scarce commodity any more.
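For reference, here's how that 1/(n-1) figure works out (a quick sketch
with disk counts I've assumed for illustration): with n disks in RAID-5,
each stripe carries n-1 chunks of user data plus 1 parity chunk, and
it's that parity chunk the host bus doesn't have to carry when a
hardware controller generates it.

```python
# Extra bus traffic from shipping the parity chunk yourself (software
# RAID-5), as a fraction of the user payload. Disk counts below are
# assumed for illustration.
def parity_overhead(n_disks):
    """With n disks, each stripe has n-1 data chunks and 1 parity chunk."""
    return 1 / (n_disks - 1)

print(parity_overhead(5))   # 5-disk array: 25% extra bus traffic
print(parity_overhead(6))   # 6-disk array: 20% extra -- the quoted figure
```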
You just totally ignored my entire discussion of what you have to push
through the CPU! A stream of data through the CPU's ALU -- something the
CPU was _not_ designed for, unlike an ASIC peripheral that does that one
thing and does it well. Something you _would_ have on a hardware RAID
controller.
Even at its "measly" 300-400MHz, such an ASIC would do the job a heck of
a lot more efficiently, without tying up the system.
That's why Intel is putting its IOP-ASIC XScale processors into the
southbridge of its future server designs. Because there's a lot to be
gained by off-loading very simple, specialized operations away from the
general-purpose microprocessor, which handles them far less efficiently.
> for the xor. which takes negligible time, even assuming no cache hits.
The XOR -- FETCH/DECODE/EXECUTE -- yes, to a point.
The LOAD/STORE? I'm sorry, I think not.
Microprocessors are not designed to work on I/O efficiently.
Microcontrollers with specific peripheral ASICs are.
They can easily best a general microprocessor 10:1 or better, MHz for
MHz. That's why Intel has several lines of XScale processors, not just
one "general" one.
> for a normal dual-opteron server, sure it can. r5 parity is actually
> easier than Stream's triad, which a single opteron can run at >4.5 GB/s.
> stream's reporting ignores write-allocate traffic, but MD does its
> writes through the cache, therefore can hit 6 GB/s.
Assuming it's all in the L1 cache, _maybe_. Even at 2.6GHz with a
3-issue ALU, and even assuming the XOR operations can (effectively)
complete once per clock cycle as SIMD operations, that's wishful
thinking.
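Back-of-the-envelope, the memory traffic is worse than the raw payload
suggests. Here's my own illustrative tally (assumptions mine: 5 disks,
64K chunks, a full-stripe write, no cache reuse):

```python
# Rough interconnect traffic for one software RAID-5 full-stripe write.
# Assumptions (mine, for illustration): 5 disks, 64 KiB chunks, no cache hits.
data_disks = 4                        # 5-disk RAID-5: 4 data + 1 parity
chunk = 64 * 1024                     # bytes per chunk

payload = data_disks * chunk          # user data actually written: 256 KiB
cpu_loads = data_disks * chunk        # CPU reads every data chunk for the XOR
cpu_stores = chunk                    # CPU writes the computed parity chunk
dma_out = (data_disks + 1) * chunk    # controller DMAs data + parity out of RAM

total = cpu_loads + cpu_stores + dma_out
print(total / payload)                # ~2.5x the payload crosses the memory bus
```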
> it _could_be_ much faster. just as there's no question that GPU's
> _can_ do 3d graphics transforms faster than the host. whether it makes
> sense is a very different question, mainly cost. why should I spend
> 20% extra on my fileservers to leave cycles waste?
But what if you're not wasting cycles?
Not wasting precious interconnect?
That's the main point I've been making -- what if you're slamming so
much higher layer network traffic for services that your CPU and
interconnect are already very busy?
I don't disagree with you if you have a web server or something. But on
a database or file server, no, I'm putting in an IOP (RAID-5) or custom
ASIC (RAID-10).
> especially since I could spend that money on more capacity, ram,
> slightly more "enterpris-y" disks, etc.
Nope, I see a balance for my database and file server applications.
> you bet. the p3 was in .2-.8 GB/s range, and disks of that era
> were around 30 MB/s apiece. now the host's memory bandwidth is 12x
> higher, but disks are only about 2x.
All the more reason to commit to the controller ASAP -- getting the data
into its battery-backed DRAM, or capacitor-backed SRAM, in case of a
system failure.
> GPUs make sense if you're using your GPU to its limit a lot.
> certainly if you're a gamer, this is true. does a high-end GPU make a lot
> of sense for a generic desktop? no, actually, it doesn't - transparent windows
> is not a great justification for a $500 GPU.
But we're not talking about transparent windows, we're talking about 3D,
even if simple 3D! That's what you have when you start doing RAID-5.
> so they have something to point at to justify higher prices.
Sigh, that's argumentative and I'm not that dumb.
> but the main point is that the price will be only a very little bit higher,
> since transistors are practically free these days (bridges are
> probably pad-limited anyway). nowhere near your 20% increase to system price.
And that's a good thing, it brings the cost down by making it commodity.
But that's also a sign that it's useful in the first place. ;->
> if it can be had for near-zero marginal cost, sure.
In some applications where your CPU and interconnect aren't fully used,
I agree, software RAID is fine from a performance standpoint.
But when a 20% price premium gains you a 30+% improvement in database
and file server performance -- not just a "disk I/O" benchmark, but how
quickly I can serve 100+ clients -- I'm going to spend the dough!
> oh yeah right: "if it's a server, it's needs to be gold plated."
> but for a 2ch controller, 250+250 really is plenty of bandwidth.
At 250+250, you're slowing down much of your interconnect just to burst
the data through. Remember, just because you have a 6.4GBps system
interconnect does _not_ mean that sending 250MBps downstream will take
less than 5% of it. ;->
Or are you not familiar with how HT-to-PCIe bridges work? ;->
> I'm a "price/measured performance whore". that's why I like MD
> and dumb SATA controllers.
And if you're doing software RAID, I 100% agree! You don't want to put
in a hardware RAID controller where the ASIC or IOP is "in the way."
> efficient? that's the beauty of commoditized hardware: you get
> 6 GB/s per opteron whether you need it or not.
You get 6.4GBps *IDEAL* burst. You do *NOT* slam 6.4GBps of XORs
through the CPU. Nor does a burst to a 250MBps downstream I/O bridge
take less than 5% of your interconnect's time. ;->
C'mon, I assumed you were smarter than that. ;->
> it's certainly
> tidier to have the controller perform the XOR, but since the host's
> CPU is faster and will otherwise sit idle, well...
But what if your CPU and interconnect aren't?
Again, I'm talking about heavy data manipulation here.
> MD has had excellent on-disk compatibility (AFAIR, only two
> superblock versions). LVM is irrelevant.
So you just use MD on "raw," legacy BIOS/DOS partitions?
That's just a recipe for disaster when someone removes a disk.
> hmm, imagine seeing support traffic on a list devoted to support issues.
I've been on the MD lists a long, long time.
> interesting - 3ware people say that until the 95xx's, their designs
> were crippled by lack of onboard memory bandwidth.
No, it had _nothing_ to do with memory bandwidth.
It had to do with memory _size_!
Now you're just showing your ignorance.
That's the problem with second-hand hearsay.
I'm used to it with software RAID advocates. ;->
3Ware 7000/8000 series cards have 1-4MB of SRAM, _not_ DRAM.
Do you know the first thing about how SRAM differs from DRAM?
It's the same reason why your layer-2 Ethernet switch can resolve MAC
addresses at wire-speeds, _unlike_ a PC that does bridging.
Same deal with 3Ware's Escalade 7000/8000. It's how it can replicate
and resolve striping/mirroring at full volume set speed. That's why
they are called "storage switches."
But it sucks at XOR buffering, because it's only 1-4MB of SRAM.
That's why the 9500S and, now, the 9550SX adds 128+MB of DRAM.
Just like other "buffering" RAID cards.
At this point, I'm not responding any further. It's clear that you
don't want to see my points, even though I see many of yours. And
you're making arguments that aren't founded on accurate information, or
just "GHz/GBps whoring" everything.
--
Bryan J. Smith mailto:b.j.smith at ieee.org
http://thebs413.blogspot.com
------------------------------------------
Some things (or athletes) money can't buy.
For everything else there's "ManningCard."