[Linux-cluster] Ext3/ext4 in a clustered environement

Alan Brown ajb2 at mssl.ucl.ac.uk
Wed Nov 9 14:57:45 UTC 2011


Nicolas Ross wrote:

> Get me right, there are millions of files, but no more than a few 
> hundred per directory. They are spread out, split on the database id, 
> two characters at a time. So a file named 1234567.jpg would end up in 
> a directory 12/34/5/, or something similar.

OK - the way you wrote it, it sounded like a flat directory layout.

We see appreciable knee points in GFS directory performance at 512, 4096 
and 16384 files/directory, with progressively worse deterioration 
between each pair of knees. (It's a 2^n-type problem.)
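
For illustration, that kind of split can be done with a couple of lines 
of shell; the 2/2/rest widths and the target path here are assumptions, 
not necessarily your actual scheme:

   # Hypothetical sketch: split a database id into subdirectories so
   # no single directory grows past the knee points above.
   id=1234567
   dir="${id:0:2}/${id:2:2}/${id:4}"   # 1234567 -> 12/34/567
   mkdir -p "/srv/images/$dir"
   mv "$id.jpg" "/srv/images/$dir/"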

> Yes, it is GFS-specific. Our backup server is on ext3 and rsyncing can 
> be done in a couple of hours, without eating CPU at all (only memory), 
> and without bringing the server to its knees.

Have you tuned dentry/inode hashes? Have you got enough memory?
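
For the record, a minimal sketch of what I mean by tuning - the values 
are illustrative starting points, not recommendations:

   # Keep dentries/inodes cached more aggressively (default is 100):
   sysctl -w vm.vfs_cache_pressure=50
   # The dentry/inode hash table sizes can only be set at boot time,
   # e.g. on the kernel command line:
   #   dhash_entries=1048576 ihash_entries=1048576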

Bear in mind that rsync has to (at least) stat() every single file it 
looks at, which causes multicast locking traffic between the nodes if 
the FS is mounted on multiple machines - and even when mounted on a 
single node, it's slow.

If you can remount the FS with localflock then you'll see performance 
akin to your ext3 results, but even on a single-node mount, with 
appropriate network/memory tuning, you can at least double the rsync 
speed over a vanilla configuration when a few million files are involved.
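
Something along these lines (the mountpoint is illustrative; localflock 
makes GFS handle flock/fcntl locks locally instead of cluster-wide, so 
only do this when you're certain a single node has the FS mounted):

   # DANGEROUS on a multi-node mount - local locking only:
   mount -o remount,localflock /gfs/data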

>> We've experienced numerous cases where the filesystem hangs after a
>> service migration due to a node (or service) failover. These hangs all
>> seem to be related to quota or NFS issues, so this may not be an issue
>> in your environment.
> 
> While we do not use NFS on top of the 3 most important directories, it 
> will be used on some of those volumes...

NFS (v2, v3) is old, crufty, and not cluster/multitask-aware(*); it 
doesn't play nicely with anything else accessing the disk and seems to 
be the root cause of most of our stability problems.

I can't speak to pNFS (NFSv4) stability, as it requires bind mounts, 
which aren't supported in a failover environment - it seems to work on 
individual nodes, but I've never managed to get it working properly on a 
cluster.
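
For context, the bind mounts in question are the ones that assemble the 
NFSv4 pseudo-root, something like this (paths are illustrative):

   # /export is the NFSv4 pseudo-root (exported with fsid=0);
   # real data is bound into it rather than exported directly:
   mount --bind /gfs/data /export/data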

(*) BEWARE: if you have multiple services with NFS exports in them, the 
exportfs commands can race against each other and scribble over the 
export list in an unpredictable manner. We fixed this with flocking in 
nfsclient.sh, but Red Hat haven't rolled the fix into their distribution 
yet.
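
The fix amounts to serialising the exportfs calls; a minimal sketch 
(the lock file path and export arguments are illustrative, not the 
actual nfsclient.sh patch):

   # Take an exclusive lock so concurrent services can't interleave
   # their exportfs updates:
   (
       flock -x 200
       exportfs -o rw,sync 192.168.1.0/24:/gfs/data
   ) 200>/var/lock/exportfs.lock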

>> => would be failed and need to be manually restarted. What would be the
>> => consequence if the filesystem happens to be mounted on 2 nodes ?
>>
>> Most likely, filesystem corruption.
> 
> Other responses led me to believe that if I let the cluster manage the 
> filesystem, and never mount it myself, it's much less likely to happen.

Correct... but human factors, combined with other possibilities (such as 
a failure to unmount), mean the chance is still significantly higher 
than zero - too high for my liking on any important FS.
