Fedora SMP dual core, dual AMD 64 processor system

Mark Hahn hahn at physics.mcmaster.ca
Sat Sep 24 20:38:17 UTC 2005


> Instead of relying on the kernel to schedule I/O, the
> hardware itself schedules I/O.  The kernel merely passes on
> requests, and doesn't get caught up with all the overhead,

of course, the "overhead" in scheduling a disk is rather trivial.

> better for itself than a FRAID (software driver).  But at the
> same time, an intelligent RAID controller is closer to the
> hardware so it can schedule it far better/more optimal than
> the OS can logically too.

this is not very true.  for instance, a serious server today
will have many GB of pages cached from disk.  that's significantly
larger than the cache on any HW raid controller I've seen (typically
in the ~256M range).

it's also not true that the HW controller has much more knowledge
of the disk hardware.  both the host and the HW controller have to 
guess about where the actual head is, and have to guess about 
how tracks are laid out.  but this is not very hard: ignoring 
remapped sectors, seek distance is monotonic with block distance.
that means that no one except the disk itself can really know that 
two blocks are on the same cylinder, but if two block addresses 
are "close", you can guess that they are.  simply establishing 
monotonicity is the crux of disk scheduling.
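
to be concrete, here's a toy sketch of that idea in C - not the
kernel's actual elevator code, just the "sort by block address"
essence:

/* toy sketch: order pending requests by block address, since seek
 * distance is monotonic with block distance.  one elevator sweep,
 * low to high.  not the kernel's real code. */
#include <stdio.h>
#include <stdlib.h>

struct io_req {
    unsigned long long sector;   /* starting block address */
    unsigned int nblocks;        /* request length */
};

static int by_sector(const void *a, const void *b)
{
    const struct io_req *x = a, *y = b;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

int main(void)
{
    struct io_req q[] = { {900000, 8}, {100, 64}, {450000, 8}, {128, 8} };
    size_t n = sizeof q / sizeof q[0];
    size_t i;

    qsort(q, n, sizeof q[0], by_sector);   /* one sweep, low to high */

    for (i = 0; i < n; i++)
        printf("dispatch sector %llu (%u blocks)\n",
               q[i].sector, q[i].nblocks);
    return 0;
}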

however, the other part of disk scheduling is "meta-request" info,
such as which process a request belongs to, whether it's synchronous
or merely readahead/writebehind, etc.  here's where the host has a
real advantage - it knows about more requests, and knows more about them.
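
for illustration only (these are not the kernel's real bio/request
structs), here's the kind of per-request metadata I mean, and one way
a host-side scheduler could use it:

/* simplified illustration of metadata the host scheduler can see but
 * a HW raid controller never does */
struct host_req {
    unsigned long long sector;
    unsigned int nblocks;
    int pid;             /* which process issued it */
    int is_sync;         /* a caller is blocked waiting on it */
    int is_readahead;    /* speculative; cheap to defer or drop */
};

/* e.g. prefer synchronous demand reads over speculative readahead,
 * then fall back to elevator order: */
int more_urgent(const struct host_req *a, const struct host_req *b)
{
    if (a->is_sync != b->is_sync)
        return a->is_sync;
    if (a->is_readahead != b->is_readahead)
        return !a->is_readahead;
    return a->sector < b->sector;
}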

> > My main point is that even for the ideal bandwidth case (a
> > large sequential read or write) that software RAID does
> > not cause any bottlenecks, everything involved is mostly
> > idle (memory bus, cpu, hypertransport, and I/O bus).
> 
> How can you measure I/O bus in Linux?  You can't.  You can
> only measure the I/O the CPU is servicing, which is not
> actual.

non-sequitur.  Bill rightly points out that ~300 MB/s of IO,
which is pretty decent, does not come close to saturating a modern
platform.  this is true by inspection.

> I don't dispute that the Opteron can handle the PIO required
> for today's advanced storage I/O done in software.  I just
> said the transfer load, especially using the Opterons as I/O
> Processors doing programmed I/O, takes away from other
> transfer operations it _might_ be doing if its servicing user
> capabilities.

sure, but so what?  so SW raid will need to transfer a few extra 
chunks over some of the 8GB/s HT channels, among some of the 
26 GB/s of memory bandwidth available.  why do you think that 
a few hundred MB/s out of many GB/s is going to make a difference?

> Especially during a failed drive, when you are constantly
> reading in disk data over the much slower PCI-X interconnect.
>  At those times, you really could use a 2-4GBps of local
> interconnect handling that -- instead of pushing all the way
> up through the I/O to memory to CPU, just to get the data.

how strange!  what on earth do you think you can do with disks
at 4 GB/s?  or are you worried about streaming reads from ~50
disks at once?

> The problem is that CPUs are designed for computation, not
> pushing data around.

what a strange idea!  easily most of what most computers do is just
dumb pushing around of data.  there's very little computation in 
most web/db serving, for instance, very little in any desktop app.

> have a good balance between processing and data movement.  If
> you're jamming your CPU with LOAD/MOV operations just for
> storage, then you're turning it into an I/O processor --

your whole critique seems to be aesthetic - that the noble CPU 
should not be doing lowly xors - never mind that the CPU has dedicated
prefetch engines to help with exactly this sort of streaming work, and
can keep multiple 128-bit xors in flight at once.
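
for the record, here's roughly what those "lowly xors" look like - a
minimal SSE2 sketch, not the kernel's actual raid5 xor routine (MD
benchmarks several candidate routines at boot and picks the fastest):

/* parity ^= data over len bytes.  assumes 16-byte aligned buffers
 * and len a multiple of 32; unrolled x2 so two 128-bit xors are in
 * flight per iteration. */
#include <emmintrin.h>
#include <stddef.h>

void xor_block(unsigned char *p, const unsigned char *d, size_t len)
{
    size_t i;
    for (i = 0; i < len; i += 32) {
        __m128i x0 = _mm_load_si128((const __m128i *)(p + i));
        __m128i x1 = _mm_load_si128((const __m128i *)(p + i + 16));
        __m128i y0 = _mm_load_si128((const __m128i *)(d + i));
        __m128i y1 = _mm_load_si128((const __m128i *)(d + i + 16));
        _mm_store_si128((__m128i *)(p + i),      _mm_xor_si128(x0, y0));
        _mm_store_si128((__m128i *)(p + i + 16), _mm_xor_si128(x1, y1));
    }
}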

> something it's not designed for, and it ends up doing
> Programmed I/O.

that's just plain weird.  I haven't had a computer that did PIO
for probably a decade.  if you're just saying that SW raid>1 is 
like PIO in that the CPU touches the data, well, OK, but what's so 
bad about that?  the data rates are basically trivial, and does the 
server actually have something better to do with its cycles?

> I just don't see how your write speed is 2x the read.  It
> doesn't make sense.

you misread the columns.

> > Er, and what exactly else should the fileserver be doing
> > besides, er serving files?  Serving out GigE?  That's only
> > another 100MB/sec.
> 
> Not for some of us.  ;->

hmm.  the fastest IO clusters I know of are Lustre+HSI
(Quadrics or IB, usually).  servers typically manage about 300 MB/s
each.

> > So a 100MB write could be as little as an extra 12MB.
> 
> A 100MB RAID-5 write pushes 100MB through the CPU's
> interconnect, _period_.  It might only generate an extra 12MB

but who gives a damn?  100 MB is roughly 2 disk-seconds of work,
but only about .025 cpu-seconds.  in other words, 8 disks will take 
about .25 seconds to transfer 100MB (ignoring seeks), 
but the CPU will take about a tenth of that time to process it.

> That path is _not_ traversed in hardware RAID.  The hardware

duh.  everyone knows that HW raid avoids passing the raw blocks
through the host cpu.  really, *everyone*.  trust me.

> > On these 4-8GB/sec busses an extra 12.5 % is not a big
> deal.
> 
> You obviously don't understand the dataflow.  You are pushing
> 100% to the CPU, then an additional 12.5% out.

100% of the cpu for a small fraction of the time.

> > You have measured this, or it's just a theory you have? 
> > Have you quantified it?
> 
> Yes.  At the very high-end, 4x GbE and 2x 8506 cards spread
> over 4 PCI-X channels using RAID-0 across the 8506 volumes. 
> I was serving over 500MBps via NFS consistently to over 25
> clients simultaneously.

those numbers are pretty odd - just getting 500 MB/s over 4x Gb
is pretty unusual.  or do you mean that the 25 clients saw 
an aggregate 500 MB/s (which would be explained by client-side 
caching)?

> > Having significantly higher I/O bandwidth to draw from
> > leaves many advantages.
> 
> You keep missing the fact that you're stuffing it into the
> CPU, which can't work as fast as an ASIC.

and you're missing the fact that disks are slow, therefore disk
IO is slow, and only amounts to a small fraction of a modest CPU's 
capability.

> > With a CPU that can do the xor calculations on the order
> > of 7GB/sec most parts of the system
> 
> ???  I'd _really_ like to know how you came up with that
> number!  I don't see someone being able to stuff 7GBps
> through a CPU with even a SIMD operation.

xor is not computationally harder than copying data,
and yes, just look at STREAM+OpenMP to see 7GB/s on a system
(and the system is the right target here, not a single cpu).
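
if you want to check it on your own box, a rough STREAM-style xor
bandwidth test looks something like this (numbers are obviously
machine-dependent, and the file name is just an example):

/* build: gcc -O2 -fopenmp xorbw.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (32 * 1024 * 1024)   /* 32M longs = 256MB per array */

int main(void)
{
    unsigned long *a = malloc(N * sizeof *a);
    unsigned long *b = malloc(N * sizeof *b);
    long i;
    double t;

    if (!a || !b)
        return 1;
    for (i = 0; i < N; i++) { a[i] = i; b[i] = ~i; }

    t = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] ^= b[i];
    t = omp_get_wtime() - t;

    /* 24 bytes move per element: read a, read b, write a */
    printf("xor: %.2f GB/s  (check %lu)\n",
           3.0 * N * sizeof *a / t / 1e9, a[N - 1]);
    return 0;
}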

> Understand that when you use _any_ time in an operation, you
> deduct that _time_.  I.e., if you use 50% of a 250MBps bus,
> you do _not_ deduct 250MBps from the next bus, but you deduct
> 50% of the available bandwidth from that next bus.

but your numbers are wrong.  MD uses ~10% of many-GB/s buses.
we're not talking about dual-P3s with 64-bit/33MHz PCI any more.

> So if it took you 50% of your I/O bus to read the data into
> memory, then the memory to CPU operation do _not_ happen at

but it doesn't.  the 8 GB/s HT channel is *not* 50% committed
to a measly 250 MB/s.

> At this point, I think I'd have to draw some timing and state
> diagrams to explain this.  You seem to be missing it.

you underestimate your partners in dialog.  not really a good thing.



