[Linux-cluster] GFS create file performance

Wed Mar 17 15:45:09 UTC 2010

Hi,

On Wed, 2010-03-17 at 10:47 -0400, Jeff Sturm wrote:
> We are using GFS to store session files for our web application.  I've
> spent some time exploring GFS performance and tuning the software for
> optimal latency on system calls—we control the software, and the core
> libraries are written in C.  So I've been following related
> discussions of e.g. stat() performance with a great deal of interest.
>
> I hit a wall reducing latency of new file creation.  Average create
> times are around 10ms and fluctuate from about 1ms up to 100ms or so.
> Here's an example:
>
> open("/tb2/session/localhost/1800/ac18c/379/905bbc40.ts", O_WRONLY|
> O_CREAT|O_EXCL, 0660) = 4 <0.015415>
>
> The parent directory of this file (379) was created on this node.  Our
> session storage ensures that no two nodes will attempt to create files
> in the same directory. I'm also limiting the number of directories we
> have to create so there is about a 50:1 ratio of files to directories
> (mkdir performance on GFS is generally awful).
> 
mkdir and open(O_CREAT) are pretty similar in terms of code paths.

> Here's a breakdown of the most common system calls made from my test
> harness:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  91.26    0.046362         228       203           open
>   7.87    0.003998        1999         2           mkdir
>   0.59    0.000298           0       600         2 stat
>   0.19    0.000098           0       200           write
>   0.09    0.000045           0       302           read
>   0.00    0.000000           0       202           close
>
> Note this report doesn't show wall-clock time (I obtained it with
> strace -c).  Roughly half the calls to open() are creating files, the
> rest open existing files.
>
> My questions:
>
> -     What exactly happens during open()?  I'm guessing that at least
> the journal is flushed to disk.  Timings for open() are long and
> highly variable compared to other filesystems (e.g. ext3).  The strace
> utility is limited to showing system calls from user space—it'd be
> interesting to see what I/O takes place in kernel space, but I don't
> have any way to do that (do I?).  Am I network bound or I/O bound
> here?  The latency looks suspiciously like disk seek times to me.
> 
Let's take these one at a time... :-) Firstly, during open, there are two
paths depending on whether a file is being created or not. If not then
the time taken is likely to be a lot shorter since there is less to do.
In the non-create case, open takes a shared glock which implies not only
the dlm lock request, but also a disk read in order to read in the inode
itself. This is true for both gfs and gfs2 and it's a bit of a pain that
the only reason that gfs2 requires this is to make an O_LARGEFILE test
against the size of the inode.

The create case can (potentially) cause a lot of other I/O to occur.
Adding a directory entry most of the time only takes a short period of
time, due to there already being space available in the directory.
Potentially, if the directory has become full in some sense and needs to
be expanded, there can be I/O to allocate a directory leaf block and/or
hash table blocks and/or indirect blocks.

This is in addition to the block for the inode itself, and if selinux or
acls are in use, additional blocks may be allocated to contain their
xattrs as well.

The blocks are allocated from resource groups so a suitable resource
group must be found and locked in order to allow allocation. That
requires reading in the rgrp header (or more than one if the fs is
nearly full and the rgrp has not got enough blocks). GFS and GFS2 use
slightly different algorithms for selecting a suitable resource group,
but the same principle of it being a bitmap with summary information
applies.
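
The "bitmap with summary information" idea can be illustrated with a
toy allocator in C. This is only a sketch of the general principle: the
struct layout, the RG_BLOCKS size and the function names are all
assumptions made for illustration, and it does not reflect the actual
GFS or GFS2 on-disk format or selection algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RG_BLOCKS 64  /* hypothetical number of blocks per resource group */

/* Toy resource group: a free-block count (the summary) plus a bitmap. */
struct rgrp {
    uint32_t free;                   /* summary: blocks still free */
    uint8_t  bitmap[RG_BLOCKS / 8];  /* one bit per block, 1 = in use */
};

static void rgrp_init(struct rgrp *rg)
{
    memset(rg->bitmap, 0, sizeof rg->bitmap);
    rg->free = RG_BLOCKS;
}

/* Allocate one block from rg; returns its index, or -1 if rg is full. */
static int rgrp_alloc(struct rgrp *rg)
{
    if (rg->free == 0)
        return -1;  /* the summary avoids scanning a full bitmap */

    for (int i = 0; i < RG_BLOCKS; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(rg->bitmap[i / 8] & mask)) {
            rg->bitmap[i / 8] |= mask;
            rg->free--;
            return i;
        }
    }
    assert(!"summary and bitmap disagree");
    return -1;
}

/*
 * Try each resource group in turn until one has space. In a real
 * cluster filesystem each group visited would cost a lock and a
 * header read, which is why a nearly full fs is slower to allocate from.
 * Returns 0 on success and fills *rg_out / *blk_out, or -1 if all full.
 */
static int fs_alloc(struct rgrp *rgs, size_t n, size_t *rg_out, int *blk_out)
{
    for (size_t r = 0; r < n; r++) {
        int b = rgrp_alloc(&rgs[r]);
        if (b >= 0) {
            *rg_out = r;
            *blk_out = b;
            return 0;
        }
    }
    return -1;
}
```

The point of the summary field is that a full group can be rejected
without scanning its bitmap. The loop in fs_alloc() corresponds to the
"more than one" rgrp case mentioned above.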

> -     Is there a strategy I can use to return more quickly from open()
> on a GFS filesystem?
> 
At the current time, probably not.

> -     Before I spend time migrating to GFS2, is there any reason to
> believe GFS2 would perform significantly better here?
> 
At the moment, I suspect that there wouldn't be a huge difference in
this area, but we are intending to do some work to speed it up in GFS2.
That's one reason that I started working on the xattr code a little while
back since that is one of the things which was blocking a more efficient
create open.

Steve.