[Linux-cluster] DB Clustering Question

Fri Apr 15 14:19:42 UTC 2005

Matt wrote:

> What I really want to do is parallel processing with mySQL or
> Postgresql.  If I can't do that, then simply having multiple SQL servers
> share the same DB files is the next option.  Can anyone push me in the
> right direction?

   I'm new to the list as well, but having just gone through the process
of evaluating exactly this kind of problem, I have a few cents I can
throw in.

   I think the very short answer to your question is that databases and
multiple servers don't mix well at all, as a general rule, and if you
need full transactional SQL support in a cluster, you're likely looking
at a commerical solution.

   The fundamental problem is that transactional databases, of which
SQL databases are a subset, need to ensure that all transactions occur
atomically, and to do this, they need very robust, very fast locking
subsystems.  For example, before updating a row in a table, a
database process needs to be sure that it acquires a lock on that data
first, so that other database processes handling other client requests
don't read partially altered data.

   Now locking is hard enough to do when you have just one machine
(either single CPU or multiple CPUs), but can be done quite effectively
and efficiently through the use of in-memory mutexes and other such
devices.  Oracle, for example, takes out a big chunk of shared memory,
which all processes use to coordinate locking.

   Doing this in a cluster of machines is much, much more difficult.
It's compounded by the problem that one or more machines could fail,
or the network could fail in various ways, and the DB software must
ensure that under no conditions does the data become corrupted.
(See all the work involved in the GFS DLM, for example, involving
handling "split brain" conditions and the such like.)

   Oracle RAC (Real Application Cluster) provides this functionality
at considerable expense, for instance, by requiring that you have
a high-speed interconnection network between your machines, and
then by providing its own internal lock manager and cluster monitor
and so forth.  Essentialy, many of the components of GFS are
provided inside Oracle RAC, for its own purposes, but are unavailable
to outside processes.  You can also run Oracle RAC on Linux, in
various ways:

http://www.redhat.com/software/rha/gfs/
http://www.oracle.com/technology/tech/linux/index.html
http://www.veritas.com/van/articles/7655.jsp

If I understand the RedHat option correctly, Oracle relies on GFS
to manage the shared storage in the cluster, but still uses its own
lock manager, cluster monitor, etc., for its own internal cache
management and transaction handling.  However, I haven't read the
installation white paper, so I'm not sure about that.  (Note to
RedHat folks: trying to register on the Web site leads to an
access denied error for the /info/ page.)

   Open source SQL databases like PostgreSQL and MySQL just don't
have this kind of feature, so far as I can determine.  MySQL provides a
cluster mechanism over regular TCP, but as far as I could tell from the
documentation, this works by keeping the entire database in RAM on each
cluster node:

http://dev.mysql.com/doc/mysql/en/multi-hardware-software-network.html

   PostgreSQL can be run in a cluster by emulating a single operating
system underneath it, using high-speed interconnections and special
kernel modifications:

http://www.linuxlabs.com/clusgres.html

   I don't know much about Ingres myself, but I didn't see anything
about clustering for that, either.

   It's perhaps worth noting that PostgreSQL and Oracle face special
complexities regarding data consistency and locking because they
provide MVCC (Multi-Version Concurrency Control), which means that
each database client sees a "snapshot" of the entire database
as it was when they began their transaction.  As long as their
transaction remains active, the database retains previous versions
of all data modified by all other active transactions, so that
the snapshot remains accurate to a past point in time.  Only once
the transaction has closed can the database clean up old versions of
data.  This is subtly different from just providing row-level locking
in a table; if one transaction is slowing reading through all the
rows of a table while another one performs updates of selected rows,
the old versions of the updated rows are kept around until the
reader's transaction closes, in case they are needed to provide an
accurate view of what the data in the table looked like when the
reader's transaction began.  So that's all just to say that the
business of locking and shuffling data around is especially complex
for such databases, and doing it in a cluster even more so.

   What you are able to do with the available options depends partly
on your requirements, obviously.  If you don't mind having multiple
read-only copies of your database files, and allowing them to be
somewhat out of date, there are various ways you could replicate
your data files from a master read-write node to multiple read-only
nodes.  You'd want to ensure that the copying process performed
the necessary interactions with the master database to ensure that
it never copied partially complete data files; performing a hot
backup and then replicating those files to the read-only nodes would
work.

   Another related option if you don't mind having read-only and
slightly out-of-date copies is to use memcached:

http://www.danga.com/memcached/

This functions as a data cache between your client programs and
the database, and spreads the data around to multiple machines.
But obviously write requests need to go to the master database,
and then be replicated to the caches, and there's a period of time
when you might not read up-to-date data from the cache.  But this
may be OK for your application.

   If you need true full transactional SQL support spread across a
cluster, I believe you'll have to look at Oracle or another commerical
solution like the ClusGres one I referenced above.  I'd love to
stand correctly, though, if anyone knows more about this.

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B