Opteron Vs. Athlon X2

Bryan J. Smith b.j.smith at ieee.org
Fri Dec 9 08:08:23 UTC 2005


Bill Broadley <bill at cse.ucdavis.edu> wrote:
> After all why would intel design for an 8x slot if they
> only needed a 4x?

Don't know.  Commonality?  Most storage controllers are x8.

BTW, there are two (2) PCI-X busses bridged in the IOP332
(and IOP333) to the PCIe x8 channels.  The specs very much do
say 133MHz PCI-X.  ;->

> In either case with only 16 disks even 1.0 GB/sec isn't
> going to be the limit.

Agreed.

> In the hardware RAID case:  100MB read from ram -> CPU
> copies it to the I/O space of the controller ->
> controller calculated raid-5 checksums -> 125 MB is written
> to the disks.

No, the CPU is virtually not involved other than to
command/queue -- _no_ programmed I/O (PIO), only Direct
Memory Access (DMA).  100MB is read from RAM and written
directly as 100MB to memory mapped I/O (which is the block
device) by the PCI-X or PCIe DMA controller.  The on-board
controller handles any mirroring/parity.

> Software RAID:
> 100MB read from ram -> cpu copies and checksums 125 MB to
> the controller -> controller writes 125 MB to the disks.

For mirroring, it's straightforward (at least it's still DMA,
just a redundant write):

100MB is read from RAM and written to two different memory
mapped I/O regions (the block devices) by the PCI-X or PCIe
DMA controller.

For RAID-5, it's a little more interesting -- it's PIO:  

The CPU reads the 100MB from RAM and calculates the XOR,
writing the calculated parity back to memory -- e.g., 25MB
for a 5-disk RAID-5.  Then the 125MB is read from RAM and
written directly as 125MB to memory mapped I/O by the PCI-X
or PCIe DMA controller.

ANAL NOTE:  The software RAID-5 actually commits the data a
fraction at a time, as each piece of the XOR is calculated
and stored in memory; it doesn't wait until all the XORs have
been calculated.  But the XOR operation is still programmed
I/O, so no parity can be committed until the CPU has
calculated it.
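
To make that PIO step concrete, here is a minimal sketch (my
own illustration in plain C, not actual MD code) of the XOR
pass the CPU has to make for one stripe of a 5-disk RAID-5 --
four data chunks loaded in, one parity chunk stored out,
every byte of it passing through the CPU:

  #include <stddef.h>
  #include <stdint.h>

  /* One stripe of an N-disk RAID-5: N-1 data chunks in, 1 parity
     chunk out.  Every byte is LOADed into the CPU, XORed, and
     STORed back to RAM -- the part the DMA engines never do. */
  static void xor_parity(uint8_t *parity, uint8_t *const data[],
                         size_t ndata, size_t chunk_len)
  {
      for (size_t i = 0; i < chunk_len; i++) {
          uint8_t p = 0;
          for (size_t d = 0; d < ndata; d++)
              p ^= data[d][i];      /* LOAD + XOR */
          parity[i] = p;            /* STOR */
      }
  }

For the 100MB example above, the CPU chews through all 100MB
in that loop to produce the extra 25MB of parity; only then
can the full 125MB be handed off to the DMA controller.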

> So yes your 4GB/sec I/O bus sees another 25MB/sec so it's
> 1.25% more busy.  Is that a big deal?
> Say it was a mirror, it's 5% more busy.  So?  

Re-read the above analysis.

For mirror, you push 2x over the interconnect, but at least
it's still 100% DMA (no CPU overhead).

For RAID-5, you only push 1/(N-1) _more_ over the
interconnect (e.g., 1/(5-1) = 1/4th = 25% extra for a 5-disc
RAID-5), but you push the _entire_ amount of data through the
CPU to generate that extra write.
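
To put numbers on it, here's a quick back-of-the-envelope in
C -- nothing more than the two cases above restated with the
example figures from this thread (100MB write, 5-disc
RAID-5):

  #include <stdio.h>

  int main(void)
  {
      const double data_mb = 100.0;  /* example write from this thread */
      const int    ndisks  = 5;      /* the 5-disc RAID-5 example      */

      /* Mirror: the same data crosses the interconnect twice, all DMA. */
      printf("mirror: %.0f MB over the interconnect (2x)\n",
             2.0 * data_mb);

      /* RAID-5: only 1/(N-1) extra crosses the interconnect as parity,
         but the full data set is first pushed through the CPU for XOR. */
      double parity_mb = data_mb / (ndisks - 1);
      printf("raid5:  %.0f MB over the interconnect (+%.0f%%), "
             "%.0f MB through the CPU\n",
             data_mb + parity_mb, 100.0 * parity_mb / data_mb, data_mb);
      return 0;
  }

Which is exactly the trade-off:  the mirror costs
interconnect bandwidth, the RAID-5 costs CPU.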

> I just checked my fileserver, it can RAID-5 checksum at
> 7.5GB/sec.  So yes one cpu would be slightly more busy,
> just a few %.

I'm sorry, but the 3-issue ALU of the Opteron can _not_ do 7.5
billion LOAD (the slowest part), FETCH-DECODE-XOR (the most
simplistic part) and STOR operations per second!

It's much faster and far more efficient to do XOR with a
dedicated ASIC or ASIC peripheral on a superscalar I/O
processor that sits in-line and far closer to the actual
storage channels.
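
If anyone wants to put their own number on that, here's a
crude, single-threaded microbenchmark sketch (my own, plain
C, no SSE/SIMD, buffer size picked arbitrarily) that times
exactly that LOAD-XOR-STOR loop on a host CPU:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  /* Crude LOAD-XOR-STOR throughput test: XOR one buffer into another
     and report MB/s.  Keep the buffers well beyond cache size so the
     memory path, not the ALUs, is what gets measured. */
  int main(void)
  {
      const size_t len = 64 * 1024 * 1024;        /* 64MB each */
      uint64_t *src = malloc(len), *dst = malloc(len);
      if (!src || !dst)
          return 1;
      memset(src, 0xA5, len);
      memset(dst, 0x5A, len);

      clock_t t0 = clock();
      for (int pass = 0; pass < 8; pass++)
          for (size_t i = 0; i < len / sizeof(uint64_t); i++)
              dst[i] ^= src[i];                   /* LOAD, XOR, STOR */
      double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

      printf("%.0f MB/s XORed\n",
             8.0 * len / (1024.0 * 1024.0) / secs);
      free(src);
      free(dst);
      return 0;
  }

Whatever number falls out, the point stands:  every one of
those bytes went LOAD-XOR-STOR through the host's ALUs and
memory bus, which is exactly the work an in-line XOR ASIC
does without touching either.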

> I never noticed, I've done many comparisons between
> hardware and software RAID.  Even a small fraction of an 
> opteron or p4 is a substantially larger resource then the
> controllers which even today are only 32 bit and a few 100
> MHz.

When I started deploying some of my first ServerWorks
ServerSet III chipset mainboards about 5-6 years ago for
P3/Xeon, I saw significant gains with 3Ware cards as well as
StrongARM-based SCSI RAID cards at RAID-10 over software RAID
to ATA or SCSI channels.

> Video cards deal with higher bandwidths then disks

Video is an even better example of why CPU processing sucks
for specialized operations!

And video cards have a dedicated Graphics Processing Unit
(GPU) that manipulates data far better than any vector
processing on a CPU.

Same deal with storage I/O when it comes to ASICs and/or I/O
Processors (IOPs) -- they are far better suited to handling
data streams and real-time XORs than CPUs with their
traditional LOAD-FETCH/DECODE/XOR-STOR cycle.  Why do you
think Intel is putting its XScale logic in forthcoming
bridges?

Intel learned long ago that processing with local memory
closer to the end-device is going to be far higher performing
because of no redundant copying/processing, reduced latency,
etc...

> Strange, they don't mentioned anything like that on the
> 9550sx page.

Why would the engineering specifications for AMCC's PowerPC
products be on the 9550SX product page?!?!?!

That's like Areca putting the engineering specifications of
Intel's XScale IOP products on their ARC 1100/1200 product
pages!!!  You go to developer.intel.com, _not_ areca.com!

Go look up AMCC's PowerPC product lines -- that's what the
GamePC.COM reviewers did!  Their bus arbitrator and RAID
engine microcontrollers have PCI-X and PCIe support.

Not strange at all!  In fact, most of the time, end-user
product vendors hide things, and it takes me a while to track
down the information and get to the actual engineering
specification sheets of the original microelectronics vendor.
In the case of Areca, that's Intel.  In AMCC's case, they
_are_ the lead developer of the PowerPC 400 series now (IBM
remains as the foundry), so it's themselves -- but still a
different set of pages (because they are _engineering_-level,
_not_ product-level).

I'm sure AMCC wanted to get the 9550SX products out first in
PCI-X before they tackled PCIe.  Just like Areca's 1100
series line came out well in advance of their 1200 series.

And 3Ware/AMCC has a history of _never_ announcing products
in advance.

> Indeed, I'm puzzled by suns move to pci-x 2.0.

I'm sure their customers want it -- at least a number of them
do.  But Sun _does_ use the nForce Pro 2200/2050 chips on
some of their newer models.

> It's great for consumers, for relatively cheaply you can
> get TWO 8GB/sec I/O slots on a motherboard.

I agree, the trace counts are drastically reduced, which is
the big cost with PCI-X.

> If you want cheap I'd switch to software RAID.  I've seen
> pci-e 2 channel controllers for $60 or so.  Or just get a new
> motherboard getting 8 ports on the motherboard is fairly easy
> on a $100-$150 motherboard.

PCIe x1?  No thanx, not for a server.

> My experience is exactly the opposite, I've been shocked
> how many hardware RAIDs couldn't manage a sustained
> 50MB/sec for writes.

That's because they use old i960 IOP30x controllers that
weren't even viable 5 years ago.  Many Tier-1 PC OEMs _still_
ship controllers with those -- quite sad.

> Nice.  It allows creating/destroying RAIDs, shows
> temperatures, allows you to blink a drives activity light to
> find it.  Supports at least RAID 0,1,5, and 6.  I didn't really
> check for 10.  Both JBOD and passthru work well ;-).
> Nothing, just connect a network cable to the RAID card.
Now that's sweet!  Card-based web administration!  Very nice.

> Heh, all what 333 MHz of 32 bit power roaring inside my
> RAID card?

You're _already_ using the XScale -- even though you're using
it in JBOD mode doing software RAID, _all_ data is going
through it!  You're better off using standard SATA channels
on a Broadcom, HighPoint or some other card that does _not_
have intelligence in between the bus and the SATA channels.

Why?  Did you think that if you use JBOD, the XScale isn't
used?  Don't you know how these cards work?!?!

BTW, with regards to the 333MHz, no offense, but you're what
we semiconductor design engineers call a "MHz whore."

Maybe it's because I've spent several years of my career
designing memory and bus controllers at the layout level, but
there is a _huge_ difference between a CPU and a
microcontroller with ASICs designed specifically for
something.  In the case of the IOP33x superscalar ARM
XScales, they are very much designed to efficiently put a
data stream to many disks.

> So that I get an email if there is any change.  Sure I
> could install the Areca tools, use hardware RAID, setup some
> areca specific documentation, disaster recovery plan,
> monitoring, and management.  Then forever tie
> that data and those disks to a specific type of controller.

Sigh, now you're just argumentative.  I've heard this
argument over and over, and yet, it has never stuck.

E.g., several vendors' drivers work with the stock kernel
facilities and support many user-space utilities (e.g.,
smartd).  The facilities for e-mail notification can be used
to trap kernel messages with regards to various cards.  Many
vendors are conscious of Linux kernel and user-space
integration -- 3Ware, Broadcom and LSI Logic have always
been.

You can choose the vendor utilities, or you can stick with
the Linux utilities you know.  Every time I bring up 3Ware's
smartd support, or the fact that the kernel messages match
what the utilities catch, etc. -- I just get _ignored_.  Why?
Because 3Ware offers another option, and that's bad--*BAD*
they tell me!  ;->

It's like slamming nVidia for their 3D drivers, when they
actively support open source 2D drivers more than just about
any other vendor.  Don't knock nVidia for offering additional
functionality that is not open source, when they offer more
open source than most anyone else in the video card space.

> At the time for 16 ports the areca seemed easiest (single
> card), I'm certainly watching for cheaper solutions.  

There's something you continue to miss here:  on a _real_
hardware RAID card, the ASIC/microcontroller is "in the way"
of the SATA channels.  You would be better off going with a
Broadcom, HighPoint or other solution -- I know HighPoint now
has an 8-channel SATA card for PCIe.

> I've benchmark many 3ware hardware RAIDs that were slower
> than software raid on the same hardware.  I've not benchmarked
> the new 9550sx though.

The 9500S had some nasty performance bugs in the 9.2 firmware
if you had more than 1 array.  If you had some JBODs on the
same card, I'm sure you saw it.

> I don't see any reason why the XScale should slow anything
> down, all it has to do is copy data from PCIe to the sata
> interface. 

But you're not sending data directly from the bus to the SATA
channels.

You're sending data to be buffered in the DRAM on the card,
then the XScale puts it to the channels -- kinda like
"Store'n Forward."  In essence, you're getting the _same_
performance as if the XScale was enabled for hardware RAID!

But now you're putting the overhead on your CPU-memory
interconnect, instead of on the XScale and its DRAM, which
it's built for.  This is what I'm talking about -- you're
using hardware RAID how it is _not_ efficient.

It's not really "slowing things down" -- but you're basically
removing any reason for having the XScale.  The XScale is
going to handle the RAID operations in real-time; that's what
the ASIC peripherals around its core were designed for.

Your CPU was designed to be generic -- it excels at many
different things in moderation.  It sucks at doing one thing
a lot -- and something as simple as XOR is especially
wasteful, given the whole LOAD-FETCH/DECODE/XOR-STOR cycle.

> Certainly the rate it can do that should be higher than the
> rate it can do RAID-5 calculations.

Umm ... NOT!  That's the problem:  you don't understand how
these IOPs work!  They are *NOT* like traditional processors!
They are superscalar microcontrollers with specialized ASIC
peripherals that work on data in-line with its transfer.

Remember that old RC5 crack contest and the EFF's entry that
cracked it in just over a day?  They built a specialized
system with specialized, microcontroller-based ASICs that not
only crunched the keys faster, but was designed to feed them
in more efficiently too.

> Even 3ware's hotswap isn't perfect, I've seen disks get 
> confused enough to hang the controller.

Because you have it in JBOD mode!  That's really an OS issue,
something 3Ware hides from the OS when you use its hardware
RAID facilities.

This is the stuff you software RAID advocates just don't get.

You complain about 3Ware's hot-swap, but that's because
you're using it in a way it was NOT designed for!  And the
problem with the hot-swap is that the OS can see the
"individual disk" -- something that only happens when you use
JBOD.  It's a problem with the OS, _not_ the 3Ware card!  ;->

> Not that 3ware doesn't do it better than most of the
> non-RAID sata drivers controllers.

Again, this is where I want to take a baseball bat and knock
all you software RAID advocates over because you don't know
the first thing about how hot-swap works on a 3Ware card.

The only reason 3Ware is better in JBOD is because the 3Ware
SATA controller buffers commands/reads/writes, whereas "dumb
block" SATA controllers can't.  So while the "dumb block"
SATA controller says "um, I can't read/write" immediately
back to the kernel, the 3Ware controller is going, "yeah,
I've got that command queued, hold on" while the disk isn't
available.

But that doesn't mean the Linux kernel won't go "where the
fsck did my disk go?" if you don't swap it very quickly when
you have it in JBOD mode.  3Ware can do _nothing_ about
that -- that's a 100% kernel issue!  The 3Ware card basically
just "keeps the kernel busy" queuing commands as long as it
can.  Hot-plug is supposed to address this, but it only goes
so far.

3Ware _only_ offers hot-swap _when_ you use non-JBOD modes!

> Software RAID on the other hand seems pretty bulletproof,
> it's widely tested, and very robust.  I regularly
> have 400-500 day uptimes on busy production servers.

Excuse me?  MD has changed several times between 2.2 and 2.4,
and yet again with 2.6.  LVM2 is a major problem, with
massive race conditions.  I've also had to spend a _lot_ of
_manual_ time dealing with MD not finding the appropriate
disks for a volume set.  I have to heavily document things
for _each_ system so we're not spending 15+ minutes trying to
get a volume remounted/rebuilt -- especially on legacy
BIOS/DOS disk labels.  And LVM/LVM2 disk labels are not
always safe (especially not LVM2).

Stuff I _never_ worry about with hardware RAID and true
hot-swap.

> I consider selling an expensive RAID card that is
> completely broken and writes corrupt data to the drive bad.
> I trusted them and was betrayed.  Certainly hindsight is 20/20,
> shouldn't have done that.

No offense, but if I had a dime for every time I saw someone
on the MD or LVM support lists say "this should work" and
then have to come back and say, "yeah, you'll have to
re-create that," I'd be a very, very rich man.

> Caution is warranted, more caution still (IMO) is to use
> software RAID.

Again, I've had so much toasted data thanx to software RAID
that I refuse to touch it.  Every time I try it, it's
something new.  Now mdadm has helped _drastically_ when it
comes to managing MD, but it's still far from perfect.

> Hopefully 3ware can read/write a block without errors, and
> the rest I trust to software RAID.  I also find the mdadm
> functionality quite desirable when compared to most hardware
> raid interfaces.

I disagree.  mdadm is much, much better than the old mdtools,
but it's still not as nice as some of the hardware RAID
products I trust.

> Even if I found the exact functionality equivalent, it would
> still be specific to that controller.

Not an issue when you _only_ use controllers from 1-2 vendors
that have excellent support histories.  3Ware has been superb
on upward compatibility.  And their 7000/8000 series were
pretty flawless in my book.

I've avoided the 9500S and the 9550SX is too new for me to
consider.  But for high-performance RAID-10, the 7000/8000
are just absolutely dreamy -- and have been for almost 5
years.


-- 
Bryan J. Smith                | Sent from Yahoo Mail
mailto:b.j.smith at ieee.org     |  (please excuse any
http://thebs413.blogspot.com/ |   missing headers)



