[Linux-cluster] BerkeleyDB locking problems with GFS 6.0?
Treece, Britt
Britt.Treece at savvis.net
Mon Jul 17 14:25:39 UTC 2006
Darren,
You will definitely want to upgrade the lock network's hub to at least
a 100Mbit switch, and if you have the hardware you should seriously
consider adding dedicated lock servers.
Your load problems are being caused by lock-traffic bottlenecks in your
setup.
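If you do split the lock servers out, the gulm server list lives in the
CCS config. A rough sketch of what the cluster.ccs stanza looks like
under GFS 6.0 (the hostnames here are made up, and you'll want an odd
number of servers so gulm can keep quorum):

```
cluster {
    name = "webcluster"
    lock_gulm {
        servers = ["locksrv1", "locksrv2", "locksrv3"]
    }
}
```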
Britt
-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Darren Jacobs
Sent: Monday, July 17, 2006 2:57 AM
To: linux clustering
Subject: [Linux-cluster] BerkeleyDB locking problems with GFS 6.0?
We have a web cluster running on three dual 3GHz-processor servers
(RHEL 3) attached to a SATA SAN, with a single LUN shared among them
using GFS 6.0. We're running lock_gulmd on each of these servers:
they're locking servers as well as Apache servers. The locking network
that connects the servers is just a 10Mbit hub. Network traffic is
distributed by a hardware load balancer, not RH Cluster.
We suffered a meltdown while doing a trial run of Movable Type
(blogging software) on the cluster. We were using a BerkeleyDB backend
database housed on the shared LUN. The software was installed on all
three servers.
Once we fired up Movable Type, we noticed that the load average on each
of the three servers climbed a bit above normal. One box in particular
got up to a load average of 8 while the other two were around 2.
Everything still moved along OK, but we could see the load on the
loaded box inching up. We noted what appeared to be some hung CGI
processes associated with Movable Type. They resisted kill commands
and couldn't be killed with 'kill -9'.
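(For reference: processes that shrug off 'kill -9' are almost always in
uninterruptible sleep, the "D" state, blocked inside the kernel on I/O;
signals aren't delivered until the I/O completes. A generic way to spot
them on any Linux box, not specific to our setup, is:)

```shell
# List processes in uninterruptible sleep ("D" state), plus the kernel
# function each one is blocked in (the wchan column).
ps -eo pid,stat,wchan:30,cmd | awk 'NR == 1 || $2 ~ /^D/'
```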
So we decided to remove the heavily loaded box from the cluster. The
second we ran the command, the other two boxes' load averages shot to
100. Shortly thereafter they locked up. The boxes locked up so fast
we couldn't pull any diagnostic data before they crashed.
I've seen behavior like this before when servers submit multiple I/O
requests to a SAN and, for some reason, the requests don't return in a
timely manner. The outstanding I/Os make the load average climb into
the stratosphere. I'm thinking something like that happened here, but
because the servers tanked so quickly I couldn't find out for certain.
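(That theory fits how Linux computes the figure: the load average
counts tasks that are runnable or in uninterruptible I/O wait, so
stuck SAN I/Os inflate it even while the CPUs sit idle. A quick sanity
check, generic to any Linux system:)

```shell
# Current 1/5/15-minute load averages.
cat /proc/loadavg
# Number of D-state (uninterruptible I/O wait) tasks right now; on
# Linux each one is counted into the load average even though it is
# consuming no CPU at all.
ps -eo stat= | grep -c '^D' || true
```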
We've mulled over the possibilities as to what the heck happened. Did
concurrent access from three servers to a BerkeleyDB database on a
GFS partition blow us up? Should we have had a 100Mbit switch on the
locking network instead of the 10Mbit hub? Separate locking servers?
Any suggestions?
Regards,
Darren Jacobs
University of Toronto
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster