[Linux-cluster] GFS, Locking, Read-Only, and high processor loads

Tue May 13 21:08:38 UTC 2008

rick ochoa wrote:

> I work for a company that is migrating to a SAN, implementing GFS as the 
> filesystem. We currently rsync our data from a master server to 5 
> front-end webservers running Apache and PHP. The rsyncs take an 
> extraordinarily long time as our content (currently >2.5 million small 
> files) grows, and does not scale very well as we add more front-end 
> machines. Our thinking was to put content generated on two inward facing 
> editorial machines on the SAN as read/write, and our web front-ends as 
> read-only. All temporary files and logging would write to local disk. 
> The goal of our initial work was to create this content filesystem, 
> mount the disks, eliminate the rsyncs, and free up our rsync server for 
> use as a slave database server.

You may have options that don't require a SAN. If you're happy to continue 
with DAS (i.e. having separate disks in each machine costs no more than a 
SAN for the number of machines you foresee using in the near future), 
you may do well with DRBD instead of a SAN.

> We used the Luci to configure a node and fencing on a new front-end, and 
> formatted and configured our disk with it. Our deploy plan was to set 
> this machine up, put it behind the load-balancer, and have it operate 
> under normal load for a few days to "burn it in." Once complete, we 
> would begin to migrate the other four front-ends over to the SAN, 
> mounted RO after a reinstall of the OS.
>
> This procedure worked without too much issue until we hit the fourth 
> machine in the cluster, where the cpu load went terrifyingly high and we 
> got many "D" state httpd processes. Googling "uninterruptible sleep GFS 
> php" I found references from 2006 about file locking with php and its 
> use of flock() at the start of a session. The disks were remounted as 
> "spectator" in an attempt to limit disk I/O on journals. This seemed to 
> help, but as it was the end of the day seems a false positive. The next 
> day, CPU load was again incredibly high, and after much flailing about 
> we went back to local ext3 disks to buy us some time.

If you have lots of I/O on lots of files in few directories, you may be 
out of luck. A lot of the overhead of GFS (or any similar FS) is 
unavoidable: the locking between the nodes has to be synchronised 
for every file open.

Mounting with noatime,nodiratime,noquota may help a bit, but with 
frequent access to lots of small files you will never get anywhere 
near local disk performance.
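
To check whether a node actually has those options active, something 
like the following might do. It is only a sketch: the function name is 
made up for illustration, it assumes the usual /proc/mounts layout 
(device, mount point, fs type, comma-separated options), and it assumes 
the kernel reports the options under exactly the names given above, 
which may not hold for every filesystem or kernel version.

```python
# Hypothetical helper for illustration; not part of GFS or any cluster tooling.
WANTED = ("noatime", "nodiratime", "noquota")


def missing_mount_options(mounts_text, mount_point, wanted=WANTED):
    """Return the wanted options that are absent from mount_point's entry.

    mounts_text is the content of /proc/mounts (or text in the same format).
    Returns None if mount_point does not appear in it at all.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed layout: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            active = set(fields[3].split(","))
            return [opt for opt in wanted if opt not in active]
    return None
```

On a node you would pass it the contents of /proc/mounts and the mount 
point of the shared filesystem. An empty list means all three options 
were reported.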

There are, however, other options. If DAS is an option for you (and it 
sounds like it is), look into GlusterFS. Its performance isn't great 
per se (may well be worse than GFS) if you use it the intended way, but 
you can use it as a file replication system: point your web directory 
directly at the file store. If you do this, you must be 100% sure that 
NOTHING you do to those files will involve any kind of writing, or 
things can get unpredictable and files can get corrupted. 
This means you'll get local disk performance with the advantage of not 
having to rsync the data. As long as all nodes are connected, the file 
changes on the master server will get sent out to the replicas. If you 
need to reboot a node, you'll need to ensure that it's consistent, which 
is done by forcing a resync by firing off a find to read the first byte 
of every file on the mount point. This will force the node to check that 
its files are up to date against other nodes. Note that this will cause 
increased load on all the other nodes while it completes, so use with care.
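
The "read the first byte of every file" pass can be done with find, or 
with a few lines of script. Below is a rough Python equivalent. The 
function name is hypothetical, and whether a one-byte read is enough to 
trigger the consistency check depends on your GlusterFS version and 
configuration, so treat it as a sketch rather than a recipe.

```python
import os


def touch_first_bytes(mount_point):
    """Read one byte from every file under mount_point.

    Returns (read_ok, failed): how many files were opened and read, and
    how many could not be (permissions, broken symlinks, files that
    vanished during the walk).
    """
    read_ok = 0
    failed = 0
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    f.read(1)  # one byte is the point; the data is discarded
                read_ok += 1
            except OSError:
                failed += 1
    return read_ok, failed
```

Run it against the mount point, not against the underlying file store, 
since the point is to make the node compare its copies with the others. 
With millions of small files this will take a while and, as noted, it 
loads the other nodes for the duration.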

Gordan