[Linux-cluster] Directories with >100K files

Wed Jan 21 13:29:15 UTC 2009

Quoting nick at javacat.f2s.com:

> Hi,
>
> Quoting Steven Whitehouse <swhiteho at redhat.com>:
>
> > Hi,
> >
> > On Tue, 2009-01-20 at 22:32 -0500, Jeff Sturm wrote:
> > > > -----Original Message-----
> > > > From: linux-cluster-bounces at redhat.com
> > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of
> > > > nick at javacat.f2s.com
> > > > Sent: Tuesday, January 20, 2009 5:19 AM
> > > > To: linux-cluster at redhat.com
> > > > Subject: [Linux-cluster] Directories with >100K files
> > > >
> > > > We have a GFS filesystem mounted over iSCSI. When doing an
> > > > 'ls' on directories with several thousand files it takes
> > > > around 10 minutes to get a response back -
> > >
> > > You don't say how many nodes you have, or anything about your
> > > networking.
> > >
> > > Some general pointers:
> > >
> > > - A plain "ls" is probably much faster than any variant that fetches inode
> > > metadata, e.g. "ls -l".  The latter performs a stat() on each
> > > individual file which in turn triggers locking activity of some sort.
> > > This is known to be slow on GFS1.  (I've heard reports that GFS2 is/will
> > > be better.)
> > >
> > The latest gfs1 is also much better. It is a tricky thing to do
> > efficiently, and not doing the stats is a good plan.
> >
> > > - You want a fast, reliable low-latency network for your cluster.  Intel
> > > GigE cards and a fast switch are a good bet.
> > >
> > > - Unless your application needs access times or quota support, mounting
> > > with "noquota,noatime" is a good idea.  Maybe also "nodiratime".
> > >
> > > > Can anyone recommend any GFS tunables to help us out here ?
> > >
> > > You could try bumping demote_secs up from its default of 5 minutes.
> > > That'll cause locks to be held longer so they may not need to be
> > > reacquired so often.  It won't help with the initial directory listing,
> > > but should help on subsequent invocations.
> > >
> > > In your case, with "ls" taking 8 minutes to run, some locks initially
> > > acquired during execution of the command have already been demoted once
> > > complete.
> > >
> > Also the question to ask is how many nodes are accessing this
> > filesystem? If more than one node is accessing the same directory and at
> > least one of those does a write (i.e. inode create/delete) within the
> > demote_secs time, then the demote_secs time will not make much
> > difference since the locks will be pushed out by the other node's access
> > anyway.
>
> We have 4 nodes in our test env and 5 in our prod env.
> The directory structure is as follows:
>
> [root at finapp4 ~]# cd /apps/prod/prodcomn/admin/
> [root at finapp4 admin]# ls
> inbound  install  log  out  outbound  scripts  trace
> [root at finapp4 admin]# ls log/ out/
> log/:
> PROD_finapp1  PROD_finapp2  PROD_finapp3  PROD_finapp4  PROD_finapp5  WFSC_oracleprod
>
> out/:
> o14679499.out  o14798714.out  PROD_finapp2  PROD_finapp4  WFSC_oracleprod
> o14698655.out  PROD_finapp1   PROD_finapp3  PROD_finapp5
>
> The WFSC_oracleprod dirs in both the log/ and the out/ directories each contain over 120,000 small files.
> This WFSC_oracleprod dir will be accessed by all cluster members for both read and write operations.
> If it helps to make it any clearer, these servers are clustered Oracle Applications servers running concurrent managers.
>
>
> > > > Should we set statfs_fast to 1 ?
> > >
> > > Probably good to set this, regardless.
> > >
> > > > What about glock_purge ?
> > >
> > > Glock_purge helps limit CPU time consumed by gfs_scand when a large
> > > number of unused glocks are present.  See
> > > http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4
> > > .  This may make your system run better but I'm not sure it's going to
> > > help with listing your giant directories.
> > >
> > Better to disable this altogether unless there is a very good reason to
> > use it. It generally has the effect of pushing things out of cache early
> > so is to be avoided.
> >
> > > > Here is the fstab entry for the GFS filesystem:
> > > > /dev/vggfs/lvol00       /apps                   gfs
> > > > _netdev         1 2
> > >
> > > Try "noatime,noquota" here.
>
> We also have the Oracle DB server accessing the GFS /apps directory from one of the Oracle Application servers via NFS, which I reckon is not helping
> performance. I'm trying to get the DBAs to give me a list of directories to export instead of exporting the whole /apps partition.
>
>
> In testing, I can set statfs_fast to 1 and it makes no difference at all to an ls of any of the WFSC_oracleprod directories.
>
> I am making tuning changes 1 at a time and seeing what happens ...

OK here's a record of what I've done and the associated ls response times -

1. run ls with no tuning:  5m 42s
2. set statfs_fast 1:      5m 43s
                           6m 13s
3. set statfs_slots 128:   6m 10s
                           5m 36s
                           9m 31s
4. noatime,nodiratime,noquota: 6m 12s
                               6m 36s
5. set glock_purge 50:     7m 0s
                           9m 12s
6. set demote_secs 600:    5m 06s
                           5m 47s
7. set directio on all files and inherit_directio on parent directory: 3m 44s
                                                                       4m 24s
                                                                       3m 44s
                                                                       4m 03s
                                                                       4m 47s
                                                                       5m 18s
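
To make the steps easier to compare, the timings above can be averaged per step. This is only a minimal sketch in Python: the labels, the RUNS dictionary and the helper names are made up for illustration, and only the timings themselves come from the list above. With two or three runs per step the spread is large, so the averages are a rough guide at best.

```python
# Timings copied from the list above; the labels are illustrative only.
RUNS = {
    "no tuning": ["5m 42s"],
    "statfs_fast 1": ["5m 43s", "6m 13s"],
    "statfs_slots 128": ["6m 10s", "5m 36s", "9m 31s"],
    "noatime,nodiratime,noquota": ["6m 12s", "6m 36s"],
    "glock_purge 50": ["7m 0s", "9m 12s"],
    "demote_secs 600": ["5m 06s", "5m 47s"],
    "directio + inherit_directio": [
        "3m 44s", "4m 24s", "3m 44s", "4m 03s", "4m 47s", "5m 18s",
    ],
}


def to_seconds(stamp):
    """Convert a string such as '5m 42s' to a number of seconds."""
    minutes, seconds = stamp.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def mean_seconds(stamps):
    """Average a list of 'Xm Ys' strings, in seconds."""
    values = [to_seconds(s) for s in stamps]
    return sum(values) / len(values)


def fmt(seconds):
    """Format seconds back into the 'Xm YYs' style used above."""
    whole = round(seconds)
    return f"{whole // 60}m {whole % 60:02d}s"


for label, stamps in RUNS.items():
    print(f"{label:30s} {len(stamps)} run(s), mean {fmt(mean_seconds(stamps))}")
```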

So changing these values has made no difference.
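
Since the tunables did not move the numbers, it may be worth measuring how much of the time goes to reading the directory entries and how much to the stat() on each file, following the point earlier in the thread that a plain "ls" avoids the per-file stat(). The following is a minimal sketch, not a GFS-specific tool: the function name time_listing is made up for illustration, and it uses only standard Python calls (os.listdir and os.lstat).

```python
import os
import time


def time_listing(path):
    """Return (names_only_seconds, with_stat_seconds, entry_count) for a directory.

    The first pass only reads directory entries, like a plain unsorted ls.
    The second pass calls lstat() on every entry, like ls -l.
    """
    start = time.perf_counter()
    names = os.listdir(path)  # directory read only, no per-file stat
    names_only = time.perf_counter() - start

    start = time.perf_counter()
    for name in names:
        os.lstat(os.path.join(path, name))  # one stat per file
    with_stat = time.perf_counter() - start

    return names_only, with_stat, len(names)


# Example call (the path is the one from this thread; adjust as needed):
#   names_only, with_stat, count = time_listing(
#       "/apps/prod/prodcomn/admin/log/WFSC_oracleprod")
#   print(f"{count} entries: readdir {names_only:.1f}s, stat {with_stat:.1f}s")
```

If the second figure dominates, the cost is in the per-file stat() and its locking rather than in reading the directory itself. Note that a plain "ls" may still stat each file if it is aliased to use colour or classification options, so it is worth checking the alias as well.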

What is the way forward now ? I've got users complaining left right and centre. Should I ditch GFS and use NFS ?

Cheers
Nick.