Opteron Vs. Athlon X2
Mark Hahn
hahn at physics.mcmaster.ca
Sat Dec 10 22:59:37 UTC 2005
> > actually, the CPU PIO's a command packet to the controller
> > which gives the command, size and pointer(s) to the data.
> > the CPU never touches the actual data. IDE, net and scsi
> > controllers are all broadly along these lines.
>
> Yes, that's DMA. Yes, the CPU sends a PIO command. I didn't realize
> you were getting so anal on this.
it's only because you're incredibly sloppy, and some hapless reader
might be confused by your stuff. for instance, "sending a PIO command"
makes it sound like the CPU is telling an IO device to PIO something.
that's not the case. the CPU PIOs a small packet containing a command
(a DMA command, by any normal terminology - telling the controller where
to directly access main memory).
> I meant that the CPU is _not_ doing PIO for the data transfer. When you
> do IDE/EIDE, as well as the ATA PIO modes 0-5, you _are_ doing PIO for
> the data transfer. Only when you are doing ATA DMA is it not.
"only"? PIO went out 10+ years ago.
> When you use software RAID-3/4/5/6 (anything with an XOR), it very much
> _is_ doing PIO because _every_ single byte goes through the CPU.
<shrug> you can abuse colloquial usage this way if you want, but you're
speaking your own language. when the CPU computes XOR on a block,
it's not called PIO, since, at the very least, it's not IO. I suppose
you'd also call memcpy a PIO operation.
> > just two "hey..." commands. the source data (in host ram)
> > is not replicated.
>
> But *2x* the load on your interconnect. Can you see me on that one?
sure, but so what? the interconnect is not the bottleneck.
> > no, not really.
>
> Not really? 100% of the data is slammed from memory into the CPU --
> every single byte is worked on by the CPU _before_ the parity can be
> calculated. That is Programmed I/O, I don't know how much more it could
> be PIO than that!
it's a block memory operation, no different from memset or memcpy. and,
(HERE'S THE POINT), it's as fast, namely saturating memory at ~6 GB/s.
> That's why I said consider software RAID XORs to be PIO! ;->
OK, you have your own language.
> > well, the default would be 4x64k (and would consume <100 us
> > assuming no cache hits.)
>
> In any case, you're _not_ slamming 6.4GBps through that CPU for
> XORs. ;-> If you think so, you obviously don't know the first thing
> how general purpose, superscalar microprocessors work.
don't be a jerk. if you don't think a basic off-the-shelf opteron can
xor a block at around 6 GB/s, prove me wrong. for an easier task,
show how stream can do c[i] = a[i] + k * b[i] at 6 GB/s and you
think c[i] = a[i] ^ b[i] is somehow harder or dramatically slower.
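to make that comparison concrete, here's a minimal sketch of the two
kernels (function names are mine, not stream's): both move exactly the
same three streams through memory, and xor is the cheaper ALU op, so
for memory-bound block sizes the xor kernel can't be the slower one.

```c
#include <stddef.h>
#include <stdint.h>

/* stream-style triad: c[i] = a[i] + k * b[i]
   two streaming reads plus one streaming write per element. */
static void triad(double *c, const double *a, const double *b,
                  double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + k * b[i];
}

/* raid-style xor: c[i] = a[i] ^ b[i]
   identical memory traffic pattern, and an integer xor is cheaper
   than a floating-point multiply-add. */
static void block_xor(uint64_t *c, const uint64_t *a, const uint64_t *b,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] ^ b[i];
}
```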
> Thanx for being honest and forthcoming in that regard. ;->
you only show yourself as a jerk when you pretend that your partner in
dialog is being deceptive.
> Once again, you're _not_ going to be able to even remotely slam 6.4GBps
> for XORs through that CPU. ;->
I said 6.0, actually, but you are clearly wrong. or not so much wrong
as just unaware that a commodity opteron system (and even some
intel-based systems) is dramatically better than your old PIII.
> > I think that was garbled - using HW raid saves 1/(n-1) or 20%
> > of the bus bandwidth. which is not a scarce commodity any more.
>
> You just totally ignored my entire discussion on what you have to push
> through the CPU! A stream of data through the ALU of the CPU --
> something _not_ designed with an ASIC peripheral that does that one
> thing and does it well! Something that you _would_ have on a hardware
> RAID controller.
I addressed it precisely. a commodity processor can do the parity
calculation at ~6 GB/s, therefore it's a nonissue. similarly, the extra
bandwidth consumed by SW raid is also a nonissue. this prevalence of
nonissues is why SW raid is so very attractive on fileservers where
the CPU would sit idle if you offload all the work to a HW raid card.
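for concreteness, the whole "stream of data through the ALU" amounts to
a kernel along these lines (an illustrative sketch, not the actual md
code; names are mine): one streaming read per data block, one streaming
write of the parity. it's the same class of operation as memcpy, and it
runs at memory speed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* raid-5 style parity over nblocks data blocks of len 64-bit words:
   parity[i] = d[0][i] ^ d[1][i] ^ ... ^ d[nblocks-1][i].
   a lost block is recovered the same way: xor the parity with the
   surviving blocks. */
static void raid5_parity(uint64_t *parity, uint64_t *const *blocks,
                         size_t nblocks, size_t len)
{
    /* seed the parity with the first block, then fold in the rest */
    memcpy(parity, blocks[0], len * sizeof(uint64_t));
    for (size_t b = 1; b < nblocks; b++)
        for (size_t i = 0; i < len; i++)
            parity[i] ^= blocks[b][i];
}
```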
> Even at it's "measly" 300-400MHz would do a heck of a lot more
> efficiently, without tying up the system.
it's a frigging fileserver. one that has 6-12 GB/s memory bandwidth.
> > for the xor. which takes negligible time, even assuming no cache hits.
>
> The XOR -- FETCH/DECODE/EXECUTE -- yes, to a point.
> The LOAD/STORE? I'm sorry, I think not.
why can't you see that it's common for machines to have 6 GB/s available
these days?
> They can easily best a general microprocessor 10:1 or better, MHz for
> MHz. That's why Intel has several lines of XScale processors, not just
> one "general" one.
if the specialized board adds 20% to the system cost, but winds up slower,
what's the point?
> > for a normal dual-opteron server, sure it can. r5 parity is actually
> > easier than Stream's triad, which a single opteron can run at >4.5 GB/s.
> > stream's reporting ignores write-allocate traffic, but MD does its
> > writes through the cache, therefore can hit 6 GB/s.
>
> Assuming it's in the L1 cache, _maybe_. At 2.6GHz with a 3-issue ALU
> and that the XOR operations can be (effectively) completed once per
> clock cycle in a SIMD operation, that's wishful thinking.
google for the stream benchmark, check my numbers. it doesn't even
take a 2.6 GHz cpu to drive 6 GB/s through today's memory systems.
More information about the amd64-list
mailing list