[Linux-cluster] GFS, Locking, Read-Only, and high processor loads
gordan at bobich.net
Tue May 13 21:08:38 UTC 2008
rick ochoa wrote:
> I work for a company that is migrating to a SAN, implementing GFS as the
> filesystem. We currently rsync our data from a master server to 5
> front-end webservers running Apache and PHP. The rsyncs take an
> extraordinarily long time as our content (currently >2.5 million small
> files) grows, and does not scale very well as we add more front-end
> machines. Our thinking was to put content generated on two inward facing
> editorial machines on the SAN as read/write, and our web front-ends as
> read-only. All temporary files and logging would write to local disk.
> The goal of our initial work was to create this content filesystem,
> mount the disks, eliminate the rsyncs, and free up our rsync server for
> use as a slave database server.
You may have options that don't require SAN. If you're happy to continue
with DAS (i.e. the cost of SAN doesn't exceed the cost of having
separate disks in each machine with the number of machines you foresee
using in the near future), you may do well with DRBD instead of a SAN.
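To give a rough idea of what that would look like, here is a hedged sketch of a two-node drbd.conf (hostnames, disks and addresses are all placeholders, not taken from your setup) replicating one block device between a pair of machines:

```
# Hypothetical /etc/drbd.conf fragment - node names, backing disks
# and IP addresses below are illustrative only.
resource r0 {
  protocol C;              # synchronous replication
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   192.168.1.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   192.168.1.2:7788;
    meta-disk internal;
  }
}
```

You would then put GFS (or another cluster FS) on top of /dev/drbd0 much as you would on a SAN LUN.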
> We used the Luci to configure a node and fencing on a new front-end, and
> formatted and configured our disk with it. Our deploy plan was to set
> this machine up, put it behind the load-balancer, and have it operate
> under normal load for a few days to "burn it in." Once complete, we
> would begin to migrate the other four front-ends over to the SAN,
> mounted RO after a reinstall of the OS.
> This procedure worked without too much issue until we hit the fourth
> machine in the cluster, where the cpu load went terrifyingly high and we
> got many "D" state httpd processes. Googling "uninterruptible sleep GFS
> php" I found references from 2006 about file locking with php and its
> use of flock() at the start of a session. The disks were remounted as
> "spectator" in an attempt to limit disk I/O on journals. This seemed to
> help, but as it was the end of the day this may have been a false
> positive. The next day, CPU load was again incredibly high, and after
> much flailing about we went back to local ext3 disks to buy us some time.
If you have lots of I/O on lots of files in few directories, you may be
out of luck. A lot of the overhead of GFS (or any similar FS) is
unavoidable - the locking between the nodes has to be synchronised
for every file open.
Mounting with noatime,nodiratime,noquota may help a bit, but with
frequent access to lots of small files you will never get performance
anywhere near that of a local disk.
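For example, a hedged /etc/fstab entry with those options (the device path and mount point are placeholders, not your actual configuration):

```
/dev/vg0/gfs01  /gfs  gfs  noatime,nodiratime,noquota  0 0
```

The same options can be applied to an already-mounted filesystem with "mount -o remount,noatime,nodiratime,noquota /gfs".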
There are, however, other options. If DAS is an option for you (and it
sounds like it is), look into GlusterFS. Its performance isn't great
per se (it may well be worse than GFS) if you use it the intended way,
but you can use it as a file replication system by pointing your web
directory directly at the backing file store. If you do this, you must
be 100% sure that NOTHING you do to those files will involve any kind
of writing, or things can get unpredictable and files can get corrupted.
This means you'll get local disk performance with the advantage of not
having to rsync the data. As long as all nodes are connected, the file
changes on the master server will get sent out to the replicas. If you
need to reboot a node, you'll need to ensure that it's consistent, which
is done by forcing a resync by firing off a find to read the first byte
of every file on the mount point. This will force the node to check that
its files are up to date against other nodes. Note that this will cause
increased load on all the other nodes while it completes, so use with care.
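A minimal sketch of such a resync trigger, assuming the GlusterFS client mount is at /mnt/gluster (a placeholder path):

```shell
#!/bin/sh
# Walk the (hypothetical) GlusterFS mount point and read the first
# byte of every file. The read itself is what prompts GlusterFS to
# check each file against the other replicas; the data is discarded.
MOUNT=${1:-/mnt/gluster}
find "$MOUNT" -type f -exec head -c 1 {} \; > /dev/null
```

Run it after the rebooted node has remounted, ideally during a quiet period, since every other node has to answer the consistency checks.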