parallel I/O on shared-memory multi-CPU machines

Chris Worley worleys at
Wed Apr 4 16:33:42 UTC 2007

Other cluster file systems include Terrascale (now RapidScale),
Panasas, IBRIX, PVFS-2, and IBM's proprietary GPFS.

Of these (and including previously mentioned Lustre) Panasas and GPFS
are most mature.  There are also distributed file systems like GFS and
Polyserve... but these don't scale that well.  NFS-V4 is supposed to
have some distribute-ability, but it will probably only work reliably
in proprietary Netapp devices, and if NFS-V3 is any indicator (and
Linux's NFS is under the same management as for V3), it will take many
years to stabilize.  The "rule of thumb" is: I/O performance,
reliability, and price; pick any two.

In using a cluster file system, the best you can expect per node is
your interconnect speed, which is, worst case, 100MB/s over GigE...
much better than a single local disk (~40MB/s for SATA)... but, to get
this number to scale over a lot of nodes has a hefty price tag.

Local disks always scale best, if you can use them.

Your problem sounds like it's write-intensive.  For read-intensive
problems, multicast can be used to sync with main file server quickly
and cheaply.

As stated before, if you can use local disk drives, they scale
linerarly between nodes (but not as number of threads increase on the
same node), and you can create a local software RAID cheaply (use
Linux MD, not a HW RAID controller for best speed) and get best local

Also, many MPI-1 implementations have ROMIO extensions, part of MPI-2
spec, which can hide the underlying implementation and be
programatically intuitive, but this is not always fastest: most (not
all) cluster file systems perform best when each thread/process writes
to files in different directories.

Write speeds on cluster file systems will be better than read speeds,
as all the caching layers between the app and the actual disk provide
better performance, unless you need synchronous operation (like a
journaled DBMS needs, or where messages are passed via files, as with
distributed matlab).

On 4/3/07, Andreas Dilger <adilger at> wrote:
> On Apr 03, 2007  15:53 -0700, Larry McVoy wrote:
> > On Tue, Apr 03, 2007 at 03:50:27PM -0700, David Schwartz wrote:
> > > > I write scientific number-crunching codes which deal with large
> > > > input and output files (Gb, tens of Gb, as much as hundreds of
> > > > Gygabytes on occasions). Now there are these multi-core setups
> > > > like Intel Core 2 Duo becoming available at a low cost. Disk I/O
> > > > is one of the biggest bottlenecks. I would very much like to put
> > > > to use the multiple processors to read or write in parallel
> > > > (probably using OMP directives in a C program).
> > >
> > > How do you think more processors is going to help? Disk I/O doesn't take
> > > much processor. It's hard to imagine a realistic disk I/O application that
> > > was somehow limited by available CPU.
> >
> > Indeed.  S/he is mistaking CPUs for DMA engines.  The way you make this go
> > fast is a lot of disk controllers running parallel.  A single CPU can
> > handle a boatload of interrupts.
> Rather, I think the issue is that for an HPC application which needs to
> do frequent checkpoint and data dumps can be slowed down dramatically
> by lack of parallel IO.  Adding more CPUs on a single node generally
> speeds up the computation part, but can slow down the IO part.
> That said, the most common IO model for such applications is to have
> a single input and/or output file for each CPU in the computation.
> That avoids a lot of IO serialization within the kernel and is efficient
> as long as the amount of IO per CPU is relatively large (as it sounds
> like in the initial posting.
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at

More information about the Ext3-users mailing list