[Linux-cluster] Starter Cluster / GFS
rossnick-lists at cybercat.ca
Wed Nov 10 16:21:55 UTC 2010
>> The volume will be composed of 7 x 1TB disks in RAID5, so 6TB usable.
> Be careful with that arrangement. You are right up against the ragged edge
> in terms of data safety.
> 1TB disks are consumer-grade SATA disks with non-recoverable read error
> rates of about 10^-14 per bit. That is roughly one non-recoverable error
> per 11TB read.
> Now consider what happens when one of your disks fails. You have to read
> 6TB to reconstruct the failed disk. With an error rate of 1 in 11TB, the
> chances of another error occurring in 6TB of reads are about 53%. So the
> odds are that during the rebuild you will hit another read failure, and
> the chances are that your RAID layer will kick that disk out as faulty -
> at which point you will find yourself with 2 failed disks in a RAID5
> array and in need of a day or two of downtime to copy your data to a
> fresh array and hope for the best.
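(As a sanity check on those numbers: the 53% looks like the linear ratio
6TB/11TB. Treating bit errors as independent, the compound probability
comes out nearer 38% for consumer disks - still far too high for comfort.
A quick Python sketch, my own helper, illustrative only:

    def p_ure(tb_read, per_bit_rate):
        """Chance of at least one unrecoverable read error while
        reading tb_read terabytes (decimal TB), assuming independent
        bit errors at the given per-bit rate."""
        bits = tb_read * 1e12 * 8
        return 1 - (1 - per_bit_rate) ** bits

    print(p_ure(6, 1e-14))   # consumer SATA rebuild: ~0.38
    print(p_ure(6, 1e-15))   # enterprise disks:      ~0.047

The same function with the enterprise 10^-15 rate shows why it is
considered "good enough" below: the rebuild risk drops to about 5%.)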
> RAID5 is ill-suited to arrays over 5TB. Using enterprise-grade disks will
> gain you an improved error rate (10^-15 per bit), which makes it good
> enough - if you also have regular backups. But enterprise-grade disks are
> much smaller and much more expensive.
> Not to mention that your performance on small writes (smaller than the
> stripe width) will be appalling with RAID5, due to the read-modify-write
> cycle required to update the parity, which will reduce your effective
> performance to that of a single disk.
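(For what it's worth, a minimal Python illustration of that
read-modify-write parity cost - a sketch of the XOR arithmetic, not how
any real RAID implementation is coded:

    def rmw_parity(old_data, new_data, old_parity):
        """Updating one chunk in place: new parity is
        old_parity XOR old_data XOR new_data, so a sub-stripe
        write costs 2 reads (old data, old parity) plus
        2 writes (new data, new parity)."""
        return bytes(p ^ o ^ n
                     for p, o, n in zip(old_parity, old_data, new_data))

    # Even a 4KB write into a wide stripe pays the full cycle:
    new_parity = rmw_parity(b"\x00" * 4096, b"\x5a" * 4096, b"\xff" * 4096)

A full-stripe write avoids the reads, which is why small scattered
writes are the worst case.)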
The enclosure I will use (and already have) is an Active Storage ActiveRAID
in a 16 x 1TB configuration (http://www.getactivestorage.com/activeraid.php).
The drives are Hitachi model HDE721010SLA33. From what I could find, the
error rate is 1 in 10^15 bits.
We do have good backups planned. One of the nodes will have a local copy of
the critical data (about 1TB) on internally-attached disks. All of the rest
of the data will be rsync-ed off-site to a secondary, identical setup.
>> It will host many, many small files, and some bigger files. But the
>> files that change the most often will most likely be smaller than the
>> block size.
> That sounds like a scenario from hell for RAID5 (or RAID6).
What do you suggest to achieve a size in the range of 6-7TB, maybe more?
>> The GFS will not be used for I/O-intensive tasks; that's where the
>> standalone volumes come into play. It'll be used to access many files,
>> often. Specifically, Apache will run from it, with the document root,
>> session store, etc. on the GFS.
> Performance-wise, GFS should be OK for that if you are running with
> noatime and the operations are all reads. If you end up with write
> contention without partitioning the access to directory subtrees on a
> per-server basis, the performance will fall off a cliff pretty quickly.
Can you explain a little bit more? I'm not sure I fully understand the
partitioning into directories.
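My current guess at what you mean - each node writing only under its own
subtree, so directory locks are rarely shared between nodes - sketched
below in Python with made-up paths:

    import os, socket

    GFS_ROOT = "/mnt/gfs"                         # hypothetical mount point
    node_dir = os.path.join(GFS_ROOT, "sessions",
                            socket.gethostname()) # e.g. .../sessions/web1
    os.makedirs(node_dir, exist_ok=True)

    # Each Apache instance writes its session files only under its
    # own node_dir; the read-mostly document root stays shared.

Is that the idea, or am I off base?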