[Cluster-devel] [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

Thu Jul 31 03:25:36 UTC 2014

On Mon, Jul 28, 2014 at 08:22:22AM -0400, Abhijith Das wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david at fromorbit.com>
> > To: "Zach Brown" <zab at redhat.com>
> > Cc: "Abhijith Das" <adas at redhat.com>, linux-kernel at vger.kernel.org, "linux-fsdevel" <linux-fsdevel at vger.kernel.org>,
> > "cluster-devel" <cluster-devel at redhat.com>
> > Sent: Friday, July 25, 2014 7:38:59 PM
> > Subject: Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls
> > 
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> > > On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> > > > Hi all,
> > > > 
> > > > The topic of a readdirplus-like syscall had come up for discussion at
> > > > last year's LSF/MM collab summit. I wrote a couple of syscalls with
> > > > their GFS2 implementations to get at a directory's entries as well as
> > > > stat() info on the individual inodes. I'm presenting these patches and
> > > > some early test results on a single-node GFS2 filesystem.
> > > > 
> > > > 1. dirreadahead() - This patchset is very simple compared to the
> > > > xgetdents() system call below and scales very well for large
> > > > directories in GFS2. dirreadahead() is designed to be called prior to
> > > > getdents+stat operations.
> > > 
> > > Hmm.  Have you tried plumbing these read-ahead calls in under the normal
> > > getdents() syscalls?
> > 
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> > 
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory
> > blocks themselves. As such, why does this need to be done in the
> > kernel?  This can all be done in userspace, and even hidden within
> > the readdir() or ftw()/nftw() implementations themselves so it's OS,
> > kernel and filesystem independent......
> > 
> 
> I don't see how the sorting of the inode reads in disk block order can be
> accomplished in userland without knowing the fs-specific topology.

I didn't say anything about doing "disk block ordering" in
userspace. Disk block ordering can be done by the IO scheduler, and
that's simple enough to arrange by multithreading and dispatching a
few tens of stat() calls at once....

> From my
> observations, I've seen that the performance gain is greatest when we can
> order the reads such that seek times are minimized on rotational media.

Yup, which is done by ensuring that we drive deep IO queues rather
than issuing a single IO at a time and waiting for completion before
issuing the next one. This can easily be done from userspace.
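
A minimal sketch of what driving a deep queue from userspace could look
like, assuming a thread pool that keeps a few tens of lstat() calls in
flight so the IO scheduler has something to order. This is not code from
the patches or from this thread; the function name stat_all and the
default depth of 32 are illustrative assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def stat_all(dirpath, depth=32):
    """Return {name: os.stat_result or None} for every entry in dirpath.

    depth is the number of lstat() calls kept in flight at once
    (an assumed default, not a tuned value).
    """
    # Read the directory entries first (getdents() under the hood on Linux).
    names = os.listdir(dirpath)

    def lstat_one(name):
        try:
            return name, os.lstat(os.path.join(dirpath, name))
        except OSError:
            # The entry vanished or is unreadable; report it as None.
            return name, None

    # Each worker blocks in lstat(), so up to `depth` inode reads are
    # outstanding and the IO scheduler can sort and merge them.
    with ThreadPoolExecutor(max_workers=depth) as pool:
        return dict(pool.map(lstat_one, names))
```

Calling stat_all("/some/dir") before a serial getdents+stat pass would
leave the inodes cached, which is roughly the effect the dirreadahead()
call is after.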

> I have not tested my patches against SSDs, but my guess would be that the
> performance impact would be minimal, if any.

Depends. If the overhead of executing readahead is higher than the
time spent waiting for IO completion, then it will reduce performance.
i.e. the faster the underlying storage, the less CPU time we want to
spend on IO. Readahead generally increases CPU time per object that
needs to be retrieved from disk, and so on high-IOPS devices there's a
really good chance we don't want readahead like this at all.

i.e. this is yet another reason directory traversal readahead should
be driven from userspace so the policy can be easily controlled by
the application and/or user....
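
To illustrate the policy point, here is a hedged sketch in which the
readahead depth is a knob owned by the application or user, and a depth
of 0 or 1 falls back to a plain serial loop for fast storage. The
environment variable DIRWALK_DEPTH and both function names are
hypothetical, chosen only for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def readahead_depth(default=32):
    """Read the policy knob from DIRWALK_DEPTH (hypothetical variable)."""
    try:
        return max(0, int(os.environ.get("DIRWALK_DEPTH", default)))
    except ValueError:
        # Unparseable setting: fall back to the caller's default.
        return default


def walk_with_stat(dirpath, depth=None):
    """Return a list of (name, os.stat_result) for entries in dirpath."""
    if depth is None:
        depth = readahead_depth()
    names = os.listdir(dirpath)

    def lstat_one(name):
        return name, os.lstat(os.path.join(dirpath, name))

    if depth <= 1:
        # Readahead disabled: one lstat() at a time, no thread overhead.
        return [lstat_one(n) for n in names]

    # Readahead enabled: keep up to `depth` lstat() calls in flight.
    with ThreadPoolExecutor(max_workers=depth) as pool:
        return list(pool.map(lstat_one, names))
```

Under these assumptions a user on an SSD could set DIRWALK_DEPTH=0,
while one on rotational media could raise it, without any kernel or
filesystem change.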

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com