Fedora SMP dual core, dual AMD 64 processor system

Bryan J. Smith b.j.smith at ieee.org
Fri Sep 23 15:08:45 UTC 2005


Bill Broadley <bill at cse.ucdavis.edu> wrote:
> Queue where?  Linux?  RAID driver?  RAID hardware?

RAID hardware.  In a true, intelligent RAID card, queuing is
done via the on-board uC/ASIC controller.  In reality, true,
intelligent RAID cards have rather "dumb" block drivers
(other than management/reporting features) since the
"intelligence" is on-card.

> I don't follow this line of reasoning.

Instead of relying on the kernel to schedule I/O, the
hardware itself schedules I/O.  The kernel merely passes on
requests, and doesn't get caught up with all the overhead,
which is the responsibility of the I/O processor on the RAID
card.

In fact, this is definitely an area where FRAID (fake RAID)
hardware is at its absolute worst.  An OS knows how to queue
better for itself than a FRAID (software driver).  But at the
same time, an intelligent RAID controller is closer to the
hardware, so it can schedule I/O far better and more optimally
than the OS logically can.

Understand that I use OS software RAID (MD/LVM) for RAID-0; I
love it.  But when it comes to RAID-1 (and 1e/10) and RAID-5, I
then rely on the hardware.

> In any case name a workload, post numbers, and I'll
> replicate so we can compare.

I promise I will.  I just took a job about 5 weeks ago that
is permanent, and I'm doing more engineering again than IT. 
I should have done some benchmarks months ago, but I'm
typically a consultant that designs and brings in a solution
and my benchmarking is probably too application-specific.

> My main point is that even for the ideal bandwidth case (a
> large sequential read or write) that software RAID does
> not cause any bottlenecks, everything involved is mostly
> idle (memory bus, cpu, hypertransport, and I/O bus).

How can you measure the I/O bus in Linux?  You can't.  You can
only measure the I/O the CPU is servicing, which is not the
actual bus utilization.

I don't dispute that the Opteron can handle the PIO required
for today's advanced storage I/O done in software.  I just
said the transfer load, especially when using the Opterons as
I/O Processors doing programmed I/O, takes away from other
transfer operations they _might_ be doing if they were
servicing user applications.

PC processors and interconnects will always be grossly
inefficient compared to dedicated I/O Processors and
interconnects.  I think Intel has (and this is one of the few
times I agree with Intel) the right idea in putting the I/O
Processor in the I/O controller, although the most ideal
solution is to put it on the card itself.

Especially during a failed drive, when you are constantly
reading in disk data over the much slower PCI-X interconnect.
At those times, you really could use 2-4GBps of local
interconnect handling that -- instead of pushing all the way
up through the I/O to memory to CPU, just to get the data.

In fact, this is one area where the Acera really _tanks_
compared to the 3Ware cards.

> In either case (hardware or software) I'd expect multiple
> sequential or random streams to have lower throughput in
> both cases leaving even more of the I/O, cpu, and related
> idle.

CPU processing idle is one thing.  XOR operations don't even
dent a CPU's processing capability.

The problem is that CPUs are designed for computation, not
pushing data around.  Their interconnects are designed to
have a good balance between processing and data movement.  If
you're jamming your CPU with LOAD/MOV operations just for
storage, then you're turning it into an I/O processor --
something it's not designed for, and it ends up doing
Programmed I/O.

I/O Processors are designed for less processing, more data
movement -- including simplistic, virtually "in-line" data
movement operations like XORs and compares using ASICs and
other peripherals outside the core.  They use far, far fewer
clock cycles -- typically 1:1 to their external bus, without
the traditional fetch-decode-execute-etc...

> My point is that any linux/MD/software RAID in the world
> uses the same tools, interfaces, drivers.

And in case you haven't followed it: as these become more
"standardized" in the Linux world, 3Ware has been adding
support for them.

> So tuning various parameters, recovery, monitoring, and
> migration is the same.

The approach, yes.  But for the hardware, it varies.  So you
don't get away from having to tune.  But instead of tuning on
individual disks, you now tune on the card itself.

One thing I've learned to trust explicitly is 3Ware's ability
to handle even the most problematic ATA drives.  In software
RAID, you often have a trifecta clusterfsck between 1) the
ATA drive vendor's Integrated Drive Electronics (IDE), 2) the
ATA channel vendor's registers/bus control and 3) the OS
driver that supposedly gets the two to talk correctly.  With
3Ware -- both the uC/ASIC firmware and the ATA channel
registers/bus control are 3Ware's -- which just leaves 3Ware
to deal with the IDE of the ATA drive itself.  I've yet to
have ATA bus timeouts, resets, etc... in 6+ years of 3Ware
devices.

Now some would argue SCSI, and I would agree, SCSI is less
headache.  And the new crop of Serial Attached SCSI (SAS)
solutions are very capable.  In fact, many SAS controllers
are coming with hardware RAID-0, 1, 1e or 10 for "free." 
They also do SATA for "free" too.

> A RAID volume can be migrated across machines without
> problem, worst case (which I've not seen) you'd
> have to run a different kernel.

And I understand this argument, but I've yet to see it match the 6+
year history of 3Ware upgradability.  As long as the firmware
is the same or newer, you're set.

Other vendors have various records.  Adaptec has a _poor_
one, and they _destroyed_ DPT's products when they took them over (not
that DPT offered anything good, they were all old i960
designs).  LSI Logic has varied, with any StrongARM or newer
(now XScale) having great records as well.

BTW, except for RAID-5, there is DM manager support for 3Ware
volumes on regular ATA channels.  And there were early
solutions as well.

> Nothing that can't be done onsite.  I've done lots of
> late 2.2, 2.4, and 2.6 migrations and upgrades without
> issue.  Although I suspect my 2.2 setup was using
> backported MD drivers (which redhat did).  The 2.0 -> 2.2
> migration is a bit further back than I'd trust my memory.

Well, I've been using 3Ware since 1999.

> If you don't have a spare hardware raid card, recovery is
> very tough.

Again, except for RAID-5, not so.  Most 3Ware volumes are
readable by kernel 2.4+ MD and newer 2.6 DM code.

I tend to stick with RAID-10 for performance.

> Even if you do getting that card working on a new machine
> can be fairly difficult.

???  Please explain  ???
I've plopped in 3Ware cards without issue.
The only issue I had with older cards was 3.3V vs. 5V, but
the newer 7000+ don't have that issue; they are universal
(PCI 5V, PCI64 3.3/5V, PCI-X 3.3V).

> I.e. finding a kernel+initrd that will load the hardware
> RAID driver before you can mount the RAID.

You obviously haven't used 3Ware.  ;->

GPL driver in stock kernel since 2.2.15 (yes, 2.2).
Same 3w-xxxx driver is used for _all_ products until the
latest 9000 series (3w-9xxx) which adds DRAM.

The core logic of the 3Ware AccelerATA through Escalade 8000
is all the same -- ASIC+SRAM design.  There was a slight
redesign for RAID-5 in the 7000+ series.

> I like that I can take 4-8 drives in a RAID volume and plug
> them into external or internal arrays on various
> architectures (alpha, itanium, opteron, and IA32) and just
> have it work without tracking down which RAID controller is
> in which.

3Ware doesn't have Alpha or Itanium support, no.  But the
MD/DM drivers can read 3Ware RAID-0, 1 and 10 volumes.

> BTW, I agree 3ware cards are reliable, functional, and
> work well.  In hardware RAID or software RAID mode.

Actually, in software RAID mode, with the exception of RAID-0, it
kinda defeats the purpose.

In fact, the #1 complaint I've seen on 3Ware cards is when
people use them with software RAID for the hot-swap
capability, leaving the disks in JBOD mode.  It was _only_
recently that the kernel added hotplug capability, so
you should _never_ use 3Ware cards with JBOD (instead of
RAID) and attempt hot-swap if you are doing software RAID.

3Ware gets a "bad rap" for people who do _not_ understand the
limitations of hot-swap in all but the latest kernels.  The
3Ware design "hides" the raw disks from the OS so it _can_
provide hot-swap _regardless_ of kernel capability, but
_only_ when you provide it a redundant volume managed by the
3Ware card itself.

> Sure, and each RAID controller sends different messages.

Again, 3Ware is following a lot of the messages being
standardized in newer LVM/MD development, including
SMART messages.
 
> So you need to very carefully filter for each controller
> and each message they could send.  

I see a trend here.  You're talking "in general."  I'm saying
I _agree_ with you on "most" hardware RAID vendors.  But for
companies like 3Ware and select LSI Logic (SA/XScale
solutions), I strongly _disagree_.

> mdadm, /proc/mdstat, diff, SMTP, and cron are all you need
> to manage, watch, and receive status reports on any linux
> MD raid on the planet.

And the 3Ware /proc interface provides a superset of
capabilities, with 3Ware adding GPL code to many of these
projects to interface into them.
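
Just to be concrete about that monitoring approach, here's a
minimal sketch of the "/proc/mdstat + diff + SMTP + cron" loop
you describe (the paths and recipient address are illustrative
only, and 3Ware's own /proc files can be watched the same way):

#!/usr/bin/env python
# Minimal sketch of the "watch /proc/mdstat, mail on change" loop.
# Paths and addresses are illustrative only; run it from cron.
import os
import smtplib
from email.mime.text import MIMEText

MDSTAT = "/proc/mdstat"              # kernel MD status
SNAPSHOT = "/var/tmp/mdstat.last"    # hypothetical location of the last snapshot
MAILTO = "root@localhost"            # hypothetical recipient

def read(path):
    try:
        with open(path) as f:
            return f.read()
    except IOError:
        return ""

current = read(MDSTAT)
previous = read(SNAPSHOT)

if current != previous:
    # Degraded array, rebuild started/finished, etc. -- mail the new state.
    msg = MIMEText(current)
    msg["Subject"] = "RAID status change on %s" % os.uname()[1]
    msg["From"] = MAILTO
    msg["To"] = MAILTO
    s = smtplib.SMTP("localhost")
    s.sendmail(MAILTO, [MAILTO], msg.as_string())
    s.quit()
    with open(SNAPSHOT, "w") as f:
        f.write(current)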

> Sure 3ware has the functionality, if you jump through the
> customized hoops to get it.

By "jump through" I strongly _disagree_.  It's _cake_ to
setup.  The lack of standard approaches in Linux MD/LVM
management until just recently is part of the reason I don't
like it.  But as many things are being standardized (such as
mdadm), 3Ware is moving to support them.

3Ware thinks of Linux _first_, unlike most other vendors.

> Or if you say.. want to manage the RAID?

That's where the newer mdadm developments underway come in.

> 5 or 10 years ago I'd agree.  More recently I've seeing an
> increasing number of people concluding that software RAID
> is faster in most cases.

Software RAID-0, yes.  And given a _poor_ RAID-5 solution
(even pre-9000 series 3Ware products), I'd agree that an
Opteron doing software RAID gave you more throughput.

But for RAID-10, I'll stick with my 3Ware.  And for RAID-5, I
like the new 9000 series -- _especially_ during a failed
disk/rebuild.  That's when you're really killing your disk
with software RAID-5.

> I'm certainly open to data points to support either
> conclusion.  Er, no.  If it was filecache it would be
> much faster, 16GB is plenty large to mostly flush the
> cache of a 4GB of ram machine.

I just don't see how your write speed is 2x the read.  It
doesn't make sense.

> Er, so you have a 6.4 GB/sec interface to memory (actually
> 2), 8.0 GB/sec hypertransport, and 4GB/sec pci-e.  Which
> one is the bottleneck for this 250MB/sec stream?

The problem is that you're going from memory to CPU, then
back to memory, before you even commit to I/O.  You're _not_
getting anywhere near 8GBps from the memory to CPU, because the
CPU is engaged in traditional LOAD/MOV operations (even if
the XOR takes only a few cycles), which cost dozens upon
dozens of cycles in the entire fetch-decode-execute-etc...
cycle.

A well-designed hardware RAID card does these in-line with
the data-write with an ASIC XOR.  0 wait state, non-blocking
I/O.  The system merely commits from memory directly to
storage controller, and that's it.

Again, it's like using a PC as a layer-3 switch versus a
device with a layer-3 switch fabric.  The PC is going to
incur massive overhead to do what a switch fabric does
non-blocking (sub-10ms).

> Mine is 8x, and why not count both ways?  PCI-x is 1GB/sec
> total (read or write).  PCI-e 8x is 2GB/sec read and 2GB/sec
> write.  All communications use both sides (the request and
> the answer), even reads cause disk writes (updating file
> timestamps), writes cause reads (to calculate the new
> checksum).

They do _not_ happen simultaneously _unless_ you have a
hardware RAID card.  The PC operation is buffered.

> 250 MB/sec streams + overhead for checksums leaves all
> involved busses mostly idle.

Not when you are failed/rebuilding in RAID-5.  First you have
to read from the storage to memory, then memory to CPU for
PIO storage operations, then back to memory and finally back
to storage.  Those operations do not happen simultaneously.

Even when just doing normal writes, it's buffered I/O, as the
data stream is jammed, waiting on the CPU to go through the
traditional fetch-decode-execute-etc... operation just to do
an XOR (the actual instruction is not the bottleneck).

> Er, more like, read 7 chunks of data, calculate 8th block
> of checksum data then setup a DMA to write all 8 blocks. 
> MD is just as capable of setting up a DMA as the RAID card.

Not true!  A CPU is _not_ an I/O processor with XOR ASICs
designed to calculate in-line.

Again, 3Ware calls its solution a "Storage Switch" for a
reason.  It's the same reason you don't use a PC as a network
switch.
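
To put what that quote describes in concrete terms, here's a
minimal sketch of the software parity step (the chunk size and
drive count are illustrative, and this is not the actual MD
code): every byte of the stripe has to be pulled through the
CPU before the DMA can even be set up.

# Minimal sketch of a software RAID-5 full-stripe write on 8 drives
# (7 data chunks + 1 parity chunk).  NOT the Linux MD implementation;
# it just illustrates that the CPU must touch every byte to build parity.

CHUNK = 64 * 1024        # illustrative chunk size (64 KiB)
DATA_DISKS = 7           # 7 data chunks + 1 parity chunk per stripe

def parity(chunks):
    """XOR all data chunks together to produce the parity chunk."""
    p = bytearray(CHUNK)
    for chunk in chunks:                # 7 passes over the data...
        for i, b in enumerate(chunk):   # ...every byte goes through the CPU
            p[i] ^= b
    return bytes(p)

# One full stripe of (synthetic) data:
stripe = [bytes([d]) * CHUNK for d in range(DATA_DISKS)]
p = parity(stripe)

# Only now can the 8 chunks (7 data + 1 parity) be handed off for DMA.
# A hardware RAID card computes p in-line in its XOR ASIC instead.
print("parity chunk ready: %d bytes" % len(p))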

> Er, and what exactly else should the fileserver be doing
> besides, er serving files?  Serving out GigE?  That's only
> another 100MB/sec.

Not for some of us.  ;->

> The hypertransport, CPU, and PCI-e are still mostly idle.

So you say.  Unfortunately, you can't track this in the Linux
kernel.  But you can in the Solaris kernel.

> Don't forget the read overhead of software RAID is ZERO.

Agreed.  That's why 99% of software RAID benchmarks only show
read performance.  In that case -- other than heavy I/O
queuing -- software RAID typically _wins_.  No argument from
me there.

> The write overhead (depending on write size) can be as
> little as something like 1/8th.

Not true.  You have to push _every_single_byte_ up through
the CPU interconnect and run a SIMD instruction which does
LOAD/MOV in a traditional CPU design.  A "storage switch"
does XORs directly in the datapath.

> So a 100MB write could be as little as an extra 12MB.

A 100MB RAID-5 write pushes 100MB through the CPU's
interconnect, _period_.  It might only generate an extra 12MB
overall in the write, but you can _not_ avoid pushing that
data through.

That path is _not_ traversed in hardware RAID.  The hardware
storage controller takes the 100MB, and a well-designed card
does non-blocking I/O with XOR calculations on-the-fly in
virtually real-time.
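
To put rough numbers on that (illustrative arithmetic only,
assuming full-stripe writes and the 1/8th parity ratio from
your own example):

# Illustrative traffic tally for a 100MB RAID-5 write, using the thread's
# own "extra 1/8th" parity ratio.  Real traffic depends on stripe geometry
# and whether read-modify-write is needed.
data_mb = 100.0
parity_mb = data_mb / 8            # ~12.5MB, the "extra 12MB"

# Software RAID-5: every data byte crosses the CPU interconnect to be
# XORed, and the parity comes back down, before the DMA to the controller.
sw_cpu_up   = data_mb              # 100MB pulled up to the CPU
sw_cpu_down = parity_mb            # 12.5MB of parity pushed back to memory
sw_to_ctrl  = data_mb + parity_mb  # 112.5MB committed to the controller

# Hardware RAID-5: the host commits 100MB straight to the card; the card's
# XOR ASIC generates the parity on its own local interconnect.
hw_to_ctrl = data_mb

print("software: %.1fMB up + %.1fMB down through the CPU, %.1fMB to controller"
      % (sw_cpu_up, sw_cpu_down, sw_to_ctrl))
print("hardware: nothing through the CPU, %.1fMB to controller" % hw_to_ctrl)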

> On these 4-8GB/sec busses an extra 12.5% is not a big deal.

You obviously don't understand the dataflow.  You are pushing
100% to the CPU, then an additional 12.5% out.

> You seem to claim there are all these studies supporting
> hardware RAIDs performance superiority.  Maybe you could
> share some.

Not hardware RAID in general, just 3Ware and select LSI Logic
solutions.  I'll send you some links on 3Ware when I have
time.

I apologize but I'm spending 15 hours/day working right now
(not including my 2+ hours travel time a day) just supporting
Katrina and planning Rita recovery efforts.  I work for the
company that provides emergency communications where none are
available.

> I'll start:
> http://www.chemistry.wustl.edu/~gelb/castle_raid.html
> All below are the 8GB filesize numbers on a 1GB ram machine
> (I.e. not affected by the file cache much.)

Note the date:  2004 Mar!  The 9000 series just came out.

What firmware was used for the 9000 series with how many
volumes?  There were well-known issues with early 9000 series
firmware and multiple volumes.

BTW, I also hope you noted the RAID-10 performance.  But even
then, it's pretty crappy.

I have a dual-P3 at home with an old 3Ware Escalade 7800 that
breaks 100MBps writes with RAID-10 at 8GiB bonnie tests.  I
don't see how their newer system could be slower.  I assume
they are running an early 9000 series firmware.

> So Hardware RAID manages 20MB/sec write 50MB/sec read using
> the 3ware 8500.
> Software raid is 52-76MB/sec write and 120-229MB/sec reads.

And what firmware was used?

Also, I noted this statement:  
  "This suggests that using two 4-drive hardware RAID cards
   and striping them via software might be competitive with
   the all-software solution above, but it would depend very
   much on the performance of the RAID cards."

> You have measured this, or it's just a theory you have? 
> Have you quantified it?

Yes.  At the very high-end, 4x GbE and 2x 8506 cards spread
over 4 PCI-X channels using RAID-0 across the 8506 volumes. 
I was serving over 500MBps via NFS consistently to over 25
clients simultaneously.

> Having significantly higher I/O bandwidth to draw from
> leaves many advantages.

You keep missing the fact that you're stuffing it into the
CPU, which can't work as fast as an ASIC.

> With a CPU that can do the xor calculations on the order
> of 7GB/sec most parts of the system

???  I'd _really_ like to know how you came up with that
number!  I don't see someone being able to stuff 7GBps
through a CPU with even a SIMD operation.

> (besides the disks)
> are mostly idle even when sustaining these 140-250MB/sec
> data rates.

Yes, that would suggest the bottleneck is the CPU doing PIO.

> Which interconnect are you talking about exactly?  Each
> opteron has 3 8GB/sec hypertransports.  With today's
> kernels they are mostly idle.

You can _not_ read this with the Linux kernel.
The Linux kernel _only_ shows you the time the CPU is doing
I/O, not the actual I/O throughput/latency/usage.

> Shared memory happens over one link, but the newer kernels
> keep most memory traffic local to the CPU the process is
> running on (in most cases).
> The other hypertransport links are mostly idle a 250MB/sec
> RAID doesn't have much effect.  Sure the 6.4 GB/sec memory
> busses can be very heavily used by many usage patterns, but
> 250MB/sec isn't going to impact those very much.

Understand that when you use _any_ time in an operation, you
deduct that _time_.  I.e., if an operation consumes 50% of the
time on a 250MBps bus, you do _not_ deduct 250MBps from the
next bus; you deduct 50% of the time, and therefore 50% of the
available bandwidth, from that next bus.

So if it took 50% of your I/O bus's time to read the data into
memory, then the memory-to-CPU operations do _not_ happen at
the theoretical maximum of the memory throughput; they only
have the remaining 50% of the time.

At this point, I think I'd have to draw some timing and state
diagrams to explain this.  You seem to be missing it.
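
Short of the diagrams, here is the arithmetic in a nutshell
(all numbers are made up; the point is only that non-overlapping
stages steal wall-clock time from each other):

# Illustrative arithmetic only -- all numbers are made up.  When the stages
# of a PIO-style pipeline do not overlap, time spent on one stage is
# wall-clock time that is unavailable to the stages that follow.
interval = 1.0                 # one second of wall-clock time

io_bus_peak  = 250.0           # MB/s, example I/O bus
mem_bus_peak = 6400.0          # MB/s, example memory bus

data_read_mb = 125.0           # data pulled in over the example I/O bus
io_busy_time = data_read_mb / io_bus_peak   # 0.5s of the interval spent reading
time_left = interval - io_busy_time

# The memory->CPU->memory leg can only run in the time that remains, so its
# effective ceiling over the whole interval is halved: you deduct time (and
# therefore bandwidth), not "250MB/s", from the next bus.
effective_mem_ceiling = mem_bus_peak * time_left / interval

print("memory bus peak:   %6.0f MB/s" % mem_bus_peak)
print("effective ceiling: %6.0f MB/s (only %.0f%% of the interval left)"
      % (effective_mem_ceiling, time_left / interval * 100))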



-- 
Bryan J. Smith                | Sent from Yahoo Mail
mailto:b.j.smith at ieee.org     |  (please excuse any
http://thebs413.blogspot.com/ |   missing headers)



