[Linux-cluster] problems with gfs locking

Thu Apr 7 12:02:32 UTC 2005

Hello everybody!

Could someone please help me with a serious problem?

We have two HP DL380 G4 servers connected to an HP MSA500 (Modular Smart
Array, currently with 3 disks installed as a single RAID-5 volume of
274G). The servers run Red Hat Enterprise Linux, and the data storage is
formatted with GFS. For two months the 2-node system worked fine, but two
weeks ago we started experiencing problems with system load. The symptoms
are as follows:

1. The server on which httpd is running becomes unstable because the
number of simultaneously running processes keeps growing: uptime shows
load averages of 10, 20, ..., 120, 160 within a few minutes, and top
hangs once the number is big enough. If I run ps to look at the httpd
processes, all of them are in state D (uninterruptible sleep), so Apache
starts MaxClients processes and none of them ever finishes. I can't kill
any of them, and they are most likely blocked by GFS: there are two
gulm_Cb_Handler processes, each taking about 100% of CPU.

2. Apache server-status shows that almost every process hangs in
state W (sending reply). MySQL shows that a lot of connections are open
(every script opens a connection in the auto-prepend file), but they are
sleeping. The Apache document_root points to the GFS RAID volume, so
every HTTP request causes the filesystem to read or write files (user
activity was about 8 GB in 10000 files over the last month, twice as
much as in the previous month, when the system seemed stable). The
filesystem is now 15% full (about 40 GB of 274 GB), and the biggest
folder contains over 30000 files. Maybe this is the cause of the
problems, a case of quantity turning into (low) quality.

3. Another thing that has caused the filesystem to lock up is cvs, which
walks through all of those thousands of files. But this cannot be
reproduced reliably: cvs hung only a few times while updating (in fact,
checking) some folders (sometimes not very big ones).

4. The traffic graph (from MRTG) shows that when GFS goes down there are
suspicious spikes of activity on the network interface that links the
GFS nodes, rising to 4 Mbit/s (while the average throughput is about
100 kbit/s) in both directions. We suspect that our problems started
when we replaced the link between the two nodes, a plain patch cord,
with a Cisco Catalyst switch (which may provide only 10 Mbit/s
throughput). Can a slow network be the cause of our troubles? And
another question: do the journals get synchronized, or is there any
other activity between the two nodes, while data is being read from GFS
on one of them?
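
In case it helps anyone reproduce the observation from symptom 1, here
is a minimal sketch of how the processes stuck in state D could be
listed by reading /proc/<pid>/stat directly (useful when ps or top
themselves hang). It is only an illustration, not the exact commands
used on our servers; the function names parse_state and
d_state_processes are made up for this example.

```python
import os


def parse_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    state = tail.split()[0]  # first field after the command name
    return comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                comm, state = parse_state(f.read())
        except (OSError, IndexError):
            continue  # process exited or stat was unreadable
        if state == "D":
            blocked.append((int(entry), comm))
    return sorted(blocked)


# Example use on a Linux host:
#   for pid, comm in d_state_processes():
#       print(pid, comm)
```

Running this periodically and logging the count next to the MRTG
timestamps would show whether the number of D-state httpd processes
grows at the same moments as the traffic spikes between the nodes.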

Thanks for any qualified answers.

--
Sergey