[Linux-cluster] Starter Cluster / GFS

Gordan Bobic gordan at bobich.net
Wed Nov 10 14:12:51 UTC 2010


Nicolas Ross wrote:

>> That depends largely on how big your operations are. I cannot remember 
>> what the defaults are, but they are reasonable. In general, big 
>> journals can help if you do big I/O operations. In practice, block 
>> group sizes can be more important for performance (bigger can help on 
>> very large file systems or big files).
> 
> The volume will be composed of 7 1TB disks in RAID5, so 6 TB.

Be careful with that arrangement. You are right up against the ragged 
edge in terms of data safety.

1TB disks are consumer-grade SATA disks with non-recoverable read error 
rates of about 10^-14 per bit read. That is roughly one non-recoverable 
error per 11TB read.

Now consider what happens when one of your disks fails. You have to read 
6TB to reconstruct the failed disk. With an error rate of 1 in 11TB, the 
chance of another unrecoverable error occurring in 6TB of reads is about 
53%. So the chances are that during this operation you are going to hit 
another error, and the chances are that your RAID layer will kick that 
disk out as faulty too - at which point you will find yourself with 2 
failed disks in a RAID5 array and in need of a day or two of downtime to 
copy what you can to a fresh array and hope for the best.
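
For anyone who wants to check the arithmetic, here is a back-of-envelope 
sketch (it only uses the assumptions above - 10^-14 per bit and a 6TB 
rebuild read - and the exact figure moves around depending on whether you 
count expected errors or a probability, but it is in the same ballpark 
and far too high to bet an array on either way):

    # Back-of-envelope odds of hitting an unrecoverable read error (URE)
    # while reading the whole 6TB array back during a RAID5 rebuild.
    ure_per_bit = 1e-14                    # consumer SATA spec, per bit read
    bits_to_read = 6 * 10**12 * 8          # 6TB rebuild read, in bits
    expected_ures = bits_to_read * ure_per_bit
    print(expected_ures)                   # ~0.48 expected UREs
    # Treating each bit independently, probability of at least one URE:
    p_at_least_one = 1 - (1 - ure_per_bit) ** bits_to_read
    print(p_at_least_one)                  # ~0.38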

RAID5 is ill-suited to arrays over about 5TB. Using enterprise-grade 
disks will gain you an improved error rate (10^-15), which makes it good 
enough - if you also have regular backups. But enterprise-grade disks 
are much smaller and much more expensive.

Not to mention that your performance on small writes (smaller than the 
stripe width) will be appalling with RAID5, due to the read-modify-write 
cycle required to update the parity, which will reduce your effective 
performance to that of a single disk.
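
If it is not obvious why, here is a toy sketch of what a sub-stripe write 
costs on RAID5 (illustration only, not how md actually implements it):

    # Toy RAID5 read-modify-write: to overwrite one data block the array
    # must read the old data block and the old parity block, recompute
    # the parity, and then write both the new data and the new parity.
    def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
        new_parity = bytes(p ^ od ^ nd
                           for p, od, nd in zip(old_parity, old_data, new_data))
        # 2 reads + 2 writes for what the filesystem sees as one small write.
        return new_data, new_parity

Every small write turns into two reads and two writes on the same 
spindles, which is why the effective throughput drops to roughly 
single-disk levels.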

> It will 
> host many, many small files, and some bigger files. But the files that 
> change the most often will most likely be smaller than the block size.

That sounds like a scenario from hell for RAID5 (or RAID6).

> The 
> gfs will not be used for io-intensive tasks, that's where the standalone 
> volumes come into play. It'll be used to access many files, often. 
> Specifically, apache will run from it, with document root, session store, 
> etc on the gfs.

Performance-wise, GFS should be OK for that if you are running with 
noatime and the operations are all reads. If you end up with write 
contention without partitioning the access to directory subtrees on a 
per-server basis, the performance will fall off a cliff pretty quickly.
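
As an aside, the noatime bit is just a mount option. Something along 
these lines in /etc/fstab is what I mean (the device path and mount 
point are placeholders - adjust the fs type for GFS vs GFS2):

    /dev/vg_cluster/lv_web  /var/www  gfs2  defaults,noatime,nodiratime  0 0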

Gordan



