Opteron Vs. Athlon X2

Mark Hahn hahn at physics.mcmaster.ca
Sat Dec 10 22:59:37 UTC 2005


> > actually, the CPU PIO's a command packet to the controller 
> > which gives the command, size and pointer(s) to the data.
> > the CPU never touches the actual data.  IDE, net and scsi 
> > controllers are all broadly along these lines.
> 
> Yes, that's DMA.  Yes, the CPU sends a PIO command.  I didn't realize
> you were getting so anal on this.

it's only because you're incredibly sloppy, and some hapless reader
might be confused by your stuff.  for instance, "sending a PIO command"
makes it sound like the CPU is telling an IO device to PIO something.
that's not the case.  the CPU PIOs a small packet containing a command
(a DMA command, by any normal terminology - telling the controller where 
to directly access main memory.)
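
to make that concrete, here's roughly the kind of packet I mean.  the
layout below is made up for illustration - every controller family
(IDE/BMDMA, SCSI HBAs, NICs) defines its own - but they all carry
about this much information:

#include <stdint.h>

/* hypothetical sketch of a DMA command descriptor.  the CPU PIOs a
 * few words like these into the controller's registers; the
 * controller then moves the payload itself, so the CPU never
 * touches the actual data. */
struct dma_descriptor {
    uint32_t opcode;     /* what to do, e.g. READ or WRITE          */
    uint32_t length;     /* how many bytes to transfer              */
    uint64_t phys_addr;  /* where in main memory to get/put them    */
};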

> I meant that the CPU is _not_ doing PIO for the data transfer.  When you
> do IDE/EIDE, as well as the ATA PIO modes 0-5, you _are_ doing PIO for
> the data transfer.  Only when you are doing ATA DMA is it not.

"only"?  PIO went out 10+ years ago.

> When you use software RAID-3/4/5/6 (anything with an XOR), it very much
> _is_ doing PIO because _every_ single byte goes through the CPU.

<shrug> you can abuse colloquial usage this way if you want, but you're 
speaking your own language.  when the CPU computes XOR on a block,
it's not called PIO, since, at the very least, it's not IO.  I suppose
you'd also call memcpy a PIO operation.

> > just two "hey..." commands.  the source data (in host ram)
> > is not replicated.
> 
> But *2x* the load on your interconnect.  Can you see me on that one?

sure, but so what?  the interconnect is not the bottleneck.

> > no, not really.
> 
> Not really?  100% of the data is slammed from memory into the CPU --
> every single byte is worked on by the CPU _before_ the parity can be
> calculated.  That is Programmed I/O, I don't know how much more it could
> be PIO than that!

it's a block memory operation, no different from memset or memcpy.  and
(HERE'S THE POINT) it's just as fast, saturating memory at ~6 GB/s.
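
here's a minimal sketch of what the parity pass amounts to (the name
is mine, and the kernel's md layer uses hand-tuned variants of this
loop, but the access pattern is the point):

#include <stddef.h>
#include <stdint.h>

/* XOR one source block into the parity block.  this is a plain
 * streaming memory operation, the same access pattern as memcpy. */
static void xor_block(uint64_t *parity, const uint64_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes / sizeof(uint64_t); i++)
        parity[i] ^= src[i];
}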

> That's why I said consider software RAID XORs to be PIO!  ;->

OK, you have your own language.

> > well, the default would be 4x64k (and would consume <100 us
> > assuming no cache hits.)
> 
> In any case, you're _not_ slamming 6.4GBps through that CPU for
> XORs.  ;->  If you think so, you obviously don't know the first thing
> about how general purpose, superscalar microprocessors work.

don't be a jerk.  if you don't think a basic OTC opteron can xor
a block at around 6 GB/s, prove me wrong.  for an easier task,
note that stream can do c[i] = a[i] + k * b[i] at 6 GB/s, then
explain why you think c[i] = a[i] ^ b[i] is somehow harder or
dramatically slower.
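
to spell out the comparison, the two inner loops look like this in
plain C (ignoring the unrolling/SIMD a real implementation would use;
both stream three arrays through memory, and neither is ALU-bound):

#include <stddef.h>

/* STREAM's triad kernel: a multiply-add per element */
void stream_triad(double *c, const double *a, const double *b,
                  double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + k * b[i];
}

/* the parity case: a single XOR per element */
void xor_kernel(unsigned long *c, const unsigned long *a,
                const unsigned long *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] ^ b[i];
}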

> Thanx for being honest and forthcoming in that regard.  ;->

you only show yourself as a jerk when you pretend that your partner in 
dialog is being deceptive.

> Once again, you're _not_ going to be able to even remotely slam 6.4GBps
> for XORs through that CPU.  ;->

I said 6.0, actually, but you are clearly wrong.  or not so much wrong
as just not having noticed that commodity opteron systems (and even some
intel-based systems) are dramatically better than your old PIII.

> > I think that was garbled - using HW raid saves 1/(n-1) or 20%
> > of the bus bandwidth.  which is not a scarce commodity any more.
> 
> You just totally ignored my entire discussion on what you have to push
> through the CPU!  A stream of data through the ALU of the CPU --
> something _not_ designed with an ASIC peripheral that does that one
> thing and does it well!  Something that you _would_ have on a hardware
> RAID controller.

I addressed it precisely.  a commodity processor can do the parity
calculation at ~6 GB/s, therefore it's a nonissue.  similarly, the extra
bandwidth consumed by SW raid is also a nonissue.  this prevalence of 
nonissues is why SW raid is so very attractive on fileservers where 
the CPU would sit idle if you offload all the work to a HW raid card.
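
for the record, the 1/(n-1) arithmetic, with n=6 drives picked as an
example to match the 20% figure:

#include <stdio.h>

/* on a full-stripe write to an n-drive RAID-5, HW raid moves only
 * the n-1 data chunks over the bus; SW raid also writes the parity
 * chunk it computed, hence 1/(n-1) extra traffic. */
int main(void)
{
    int n = 6;                        /* drives in the array (example) */
    double extra = 1.0 / (n - 1);     /* parity chunk / data chunks    */
    printf("extra bus traffic: %.0f%%\n", extra * 100);  /* -> 20% */
    return 0;
}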

> Even at its "measly" 300-400MHz it would do a heck of a lot more
> efficiently, without tying up the system.

it's a frigging fileserver.  one that has 6-12 GB/s memory bandwidth.

> > for the xor.  which takes negligible time, even assuming no cache hits.
> 
> The XOR -- FETCH/DECODE/EXECUTE -- yes, to a point.
> The LOAD/STORE?  I'm sorry, I think not.

why can't you see that it's common for machines to have 6 GB/s available
these days?

> They can easily best a general microprocessor 10:1 or better, MHz for
> MHz.  That's why Intel has several lines of XScale processors, not just
> one "general" one.

if the specialized board adds 20% to the system cost, but winds up slower,
what's the point?  

> > for a normal dual-opteron server, sure it can.  r5 parity is actually
> > easier than Stream's triad, which a single opteron can run at >4.5 GB/s.
> > stream's reporting ignores write-allocate traffic, but MD does its 
> > writes through the cache, therefore can hit 6 GB/s.
> 
> Assuming it's in the L1 cache, _maybe_.  At 2.6GHz with a 3-issue ALU
> and that the XOR operations can be (effectively) completed once per
> clock cycle in a SIMD operation, that's wishful thinking.

google for the stream benchmark, check my numbers.  it doesn't even
take a 2.6 GHz cpu to drive 6 GB/s through today's memory systems.
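
and to show the write-allocate accounting I mentioned above (my
reading of it, assuming 8-byte elements; the 4.5 GB/s is the triad
number quoted earlier):

#include <stdio.h>

/* STREAM credits triad with 24 bytes of traffic per element (read a,
 * read b, write c), but a cached store to c[] first reads the line
 * in (write-allocate), so the hardware really moves 32. */
int main(void)
{
    double reported = 4500e6;   /* triad rate STREAM reports, B/s    */
    double counted  = 24;       /* bytes/element STREAM counts       */
    double actual   = 32;       /* + 8 for the write-allocate read   */
    printf("real traffic: %.1f GB/s\n",
           reported * actual / counted / 1e9);   /* -> ~6 GB/s */
    return 0;
}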



