[Linux-cluster] GFS, Locking, Read-Only, and high processor loads

rick ochoa rick.ochoa at gmail.com
Tue May 13 20:47:55 UTC 2008


Hi,

I'm setting up a GFS implementation and was wondering what kind of
tuning parameters I can set for read-only and read-write mounts.

I work for a company that is migrating to a SAN, implementing GFS as
the filesystem. We currently rsync our data from a master server to 5
front-end webservers running Apache and PHP. The rsyncs take an
extraordinarily long time as our content (currently >2.5 million small
files) grows, and the approach does not scale well as we add more
front-end machines. Our thinking was to mount the content filesystem
on the SAN read/write on the two inward-facing editorial machines that
generate the content, and read-only on the web front-ends. All
temporary files and logging would go to local disk. The goal of our
initial work was to create this content filesystem, mount the disks,
eliminate the rsyncs, and free up our rsync server for use as a slave
database server.
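
For context, the current push from the master is roughly this,
repeated for each of the five front-ends (the paths and hostname here
are illustrative, not our real ones):

      rsync -a --delete /web/prod/www/ web01:/web/prod/www/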

We used Luci to configure a node and fencing on a new front-end,
and formatted and configured our disk with it. Our deploy plan was to  
set this machine up, put it behind the load-balancer, and have it  
operate under normal load for a few days to "burn it in." Once  
complete, we would begin to migrate the other four front-ends over to  
the SAN, mounted RO after a reinstall of the OS.

This procedure worked without much issue until we hit the fourth
machine in the cluster, where the CPU load went terrifyingly high and
we got many "D" state httpd processes. Googling "uninterruptible sleep
GFS php" turned up references from 2006 about file locking with PHP
and its use of flock() at the start of a session. We remounted the
disks as "spectator" in an attempt to limit disk I/O on the journals.
This seemed to help, but since it was the end of the day it may have
been a false positive: the next day, CPU load was again incredibly
high, and after much flailing about we went back to local ext3 disks
to buy ourselves some time.
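
For reference, the spectator mounts were done roughly like this (the
device name is illustrative):

      umount /web/prod/www
      mount -t gfs -o spectator,noatime /dev/mapper/vg_san-www /web/prod/www

(As I understand it, a spectator mount is read-only and doesn't claim
a journal, which is why we hoped it would cut the journal I/O.)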

I'm reading through this list, which is very informative. I'm  
attempting to tune our GFS mounts a bit, watching the output of  
gfs_tool counters on the filesystems, and looking for any anomalies.  
Here's a more detailed description of our setup:

Our hardware configuration consists of a NexSAN SATABoy populated with
eight 750GB disks (RAID 5, ~4.7TB), and a Brocade Silkworm 3800 for
data and fencing. We purchased QLogic single-port 4Gb HBAs for our
servers. (More info available on request.)

The RAID has 4 partitions, 2 of which are not mounted:

	local - (not mounted) 500GB, extents 4.0MB, block size 4KB, attributes -wi-ao,
		dlm lock protocol - mount /usr/local_san (rw)
		this is a copy of /usr/local, which can be synced to all hosts
	code -  (not mounted) 500GB, extents 4.0MB, block size 4KB, attributes -wi-ao,
		dlm lock protocol - mount /web/code (rw)
		this is a copy of /huffpo/web/prod, without the www content and tmp trees
	tmp -   500GB, extents 4.0MB, block size 4KB, attributes -wi-a-,
		dlm lock protocol - mount /web/prod/tmp (rw)
		this is the temporary directory for front-end web code
	www -   2TB, extents 4.0MB, block size 4KB, attributes -wi-ao,
		dlm lock protocol - mount /web/prod/www (ro)
		read-only content directory on 4 hosts (/etc/fstab options at the time were just "ro"),
		read/write on 1 host

	we have ~2 more TB available, currently not in use

After reading the list a bit, I've come up with the following tunings  
for read-only:

      gfs_tool settune /web/prod/www/content glock_purge 80
      gfs_tool settune /web/prod/www/content quota_account 0
      gfs_tool settune /web/prod/www/content demote_secs 60
      gfs_tool settune /web/prod/www/content scand_secs 30

      /etc/fstab has spectator,noatime,num_glockd=32 as mount options
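
Spelled out, the fstab entry on a read-only host looks like this (the
device name is illustrative):

      /dev/mapper/vg_san-www  /web/prod/www  gfs  spectator,noatime,num_glockd=32  0 0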

And the read/write host has:

      gfs_tool settune /web/prod/www/content statfs_fast 1

      /etc/fstab has num_glockd=32,noatime as mount options
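
Again spelled out (device name illustrative):

      /dev/mapper/vg_san-www  /web/prod/www  gfs  num_glockd=32,noatime  0 0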


Watching gfs_tool counters /web/prod/www/content, the read/write host
running rsync usually has under 80k locks, and the one (and only)
read-only host under 10k locks, where previously the number of locks
on all hosts was around 80k.
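
I've been keeping an eye on this with something along these lines:

      watch -n 30 'gfs_tool counters /web/prod/www/content | grep -i locks'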

Can I be a bit more aggressive with locks on read-only filesystems
with the current tunings enabled? I'm not sure what purpose the locks
serve on a read-only filesystem in this instance.

Is there a better configuration for heavy reads on a GFS filesystem  
that is read only? vmstat -d gives me for this filesystem:
disk- ------------reads------------ ------------writes----------- -----IO------
[...]
sdc   411192  82490 3998862 7402555    607    645   10016    3837       0    695

My big fear is that although the systems currently seem to be running
without too much incident, as I add nodes back into the cluster the
number of locks and the system load will again run high. As we
transition from rsync to writing directly onto the SAN, the number of
locks on the rw hosts should go down, since the expensive directory
scans rsync performs will go away.

Are there other optimizations I could use to lower the lock counts?




