[Linux-cluster] GFS or Web Server Performance issues?

gordan at bobich.net
Wed Nov 28 17:07:56 UTC 2007


On Wed, 28 Nov 2007, isplist at logicore.net wrote:

> Come on folks, you're making me feel like I should give up or something  :)
>
> From Gordan;
>
>> I think a part of the problem is perception.
>
> Perception can only be what marketing says it will do.

And marketing rarely (if ever) reflects what things will really do.

> I can't say I have once
> seen anything that says it won't scale performance wise by virtue of what it
> is, a cluster. I looked at SSI and other types of clusters, this seemed to be
> the key for my LAMP based services.

It depends on where the bottleneck in your system is. It stands to reason 
that if the physical performance of the disk (e.g. a SAN appliance) is 
fixed, then piling more machines onto it, while also having to keep their 
locks in sync, cannot possibly make performance magically increase. Anybody 
who tells you otherwise is either trying to sell you something, or doesn't 
understand what they are talking about.

>> leads to _LOWER_ performance on I/O bound processes. If it's CPU bound,
>
> Sure, there is a performance cost from each node but I would guess it's an
> acceptable cost so long as I can work out the I/O side of things. I'm guessing
> a lot of folks have come up with all sorts of good ways of handling this
> otherwise, no one would be using these tools.

It improves _some_ types of performance, not all. It also depends on how 
the higher levels handle things. If you have partitioned the data so that 
each node handles a subset of it, then you will get improved performance. 
If all the data is in one place, then the chances are that clustering will 
add overhead, and thus a slowdown if the system is I/O bound in the 
first place.

>> then sure, it'll help. But on I/O it'll likely do you harm. It's more
>> about redundancy and graceful degradation than performance. There's no way
>> of getting away from the fact that a cluster has to do more work than a
>> single node, just because it has to keep itself in sync.
>
> When I started learning about the RH cluster suite and GFS, it was because the
> hype was that I could build a highly scalable, highly available environment
> where I could share data in a way I had not been able to before.

It is scalable right up to the point where you are I/O (and lock) bound. 
If you are serving only static web pages, then your performance will 
likely degrade. If you are running lots of CPU-intensive CGI processes, then 
the performance will most likely increase.

For example, you (surely?) wouldn't expect to dump a MySQL DB on a GFS 
file system on a cluster, get external locking going, and expect the 
read/write performance to increase, would you??

SSI and GFS are technologies that have their place, but they are not the 
right tool for every job. For example, I have an SSI root file system, but 
I have /var/lib/mysql mounted off local disks, with round-robin 
replication set up on the nodes, so each is a master and a slave. I 
wouldn't dream of expecting similar performance if it was running off GFS 
with external locking.
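
Roughly, each node's my.cnf carries its own server-id and staggered 
auto-increment settings, and each node's slave thread points at the next 
node round the ring. The sketch below is illustrative only (the hostnames, 
repl user and password are made up), not my actual config:

    # /etc/my.cnf on node1 (node2 uses server-id = 2 and offset 2)
    [mysqld]
    datadir                  = /var/lib/mysql   # local disk, not GFS
    server-id                = 1
    log-bin                  = mysql-bin
    auto_increment_increment = 2   # one slot per node, avoids key collisions
    auto_increment_offset    = 1

    -- on node1, point the slave thread at node2 (and vice versa), using
    -- the file/position reported by SHOW MASTER STATUS on node2:
    CHANGE MASTER TO MASTER_HOST='node2', MASTER_USER='repl',
        MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=98;
    START SLAVE;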

>> The only way clustering will give you scaleable performance benefit is
>> with partitioned (as opposed to shared) data. Shared data clustering is
>> about convenience and redundancy, not about performance.
>
> I agree but this is a very general statement. In my case, I have a LAMP
> application which benefits more from having shared GFS space. I might move to
> purely distributed at some point but for now, I'd prefer to find out what I
> can do with what I've built so far.

For static data, you could set up an unshared cache space on local 
storage, and set up squid in accelerator mode (basically an outbound rather 
than inbound cache). This will mean that most of your access hits for 
static data never reach Apache or the GFS storage.
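
A minimal squid.conf sketch along those lines (assuming squid 2.6-style 
accelerator syntax, Apache moved to port 8080, and a made-up site name):

    # squid answers on port 80 as a reverse proxy (accelerator)
    http_port 80 accel defaultsite=www.example.com
    # forward cache misses to the local Apache on port 8080
    cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=apache
    # keep the cache itself on local, non-GFS disk
    cache_dir ufs /var/spool/squid 1024 16 256
    acl our_site dstdomain www.example.com
    http_access allow our_site
    cache_peer_access apache allow our_site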

Gordan



