[Linux-cluster] scsi_watchdog
isplist at logicore.net
Wed Nov 21 18:45:16 UTC 2007
> What exactly do you mean by slow?
A general description: the web servers have not been responding to requests
efficiently since my latest yum update.
The storage is Xyratex fibre channel sectioned into RAID5 partitions.
The HBAs are QLogic's older 2200s.
The OS is RHEL4.
The setup is 5 nodes for testing: 3 web servers sharing GFS storage for their
web pages, 1 image server to offload the web servers, and 1 admin server for
design and administration.
When I first connect any node to the storage, there is a long delay of 20 or
more seconds before df returns the storage information. This happens on each
node when first connected, and again later if there has been no activity (no
HTTP connections to the web server).
It is almost as if it takes a few moments to take inventory of the storage's
current statistics/configuration.
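To put a number on that delay, it may help to time the df call itself each time it stalls. A minimal sketch (the mount-point argument is a stand-in; pass your actual GFS mount point, it falls back to the current directory for a dry run):

```shell
# Time how long df takes against the mount; if the stall is in the
# filesystem rather than the terminal, this shows it directly.
MOUNT=${1:-.}          # placeholder: substitute the real GFS mount point
start=$(date +%s)
df -h "$MOUNT"
end=$(date +%s)
echo "df took $((end - start)) seconds"
```

Running this right after connecting a node, and again after a quiet period, would confirm whether the 20-second delay is repeatable and tied to inactivity.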
hdparm -tT gives results that seem very low for this type of setup:
/dev/VolGroup01/sql:
Timing cached reads: 604 MB in 2.01 seconds = 300.10 MB/sec
Timing buffered disk reads: 60 MB in 3.06 seconds = 19.63 MB/sec
However, I have not gotten around to fine-tuning anything on the storage yet
either. I just installed bonnie++, so I need to read up on how to use it.
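For the bonnie++ run, something like the following is a common starting point. This is only a sketch: /mnt/gfs is a placeholder for the real GFS mount point, and -s should be roughly twice the node's RAM (1 GB for these 512 MB nodes) so the page cache cannot satisfy the reads:

```shell
# Sketch of a bonnie++ run against the shared storage.
# -d: directory on the filesystem to test (placeholder path here)
# -s: total file size, ~2x RAM to defeat caching
# -u: user to run as (bonnie++ refuses to run as root without it)
if command -v bonnie++ >/dev/null 2>&1 && [ -d /mnt/gfs ]; then
    bonnie++ -d /mnt/gfs -s 1g -u root
else
    echo "bonnie++ not available or mount point missing; skipping"
fi
```

Comparing the sequential block read/write figures against a local disk on the same node would show whether the slowness is in the FC path or the filesystem.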
Since the update, the web nodes have pretty high loads even when idle.
They idle around 0.20-0.50, then constantly spike to around 1.00-2.50. When I
check with top, I don't see anything unusual.
Here is an average cut:
top - 12:25:12 up 2 days, 12:22, 1 user, load average: 1.10, 0.84, 0.74
Tasks: 87 total, 1 running, 86 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 0.3% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 515568k total, 351548k used, 164020k free, 41356k buffers
Swap: 786232k total, 0k used, 786232k free, 129592k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30977 root 16 0 2868 956 764 R 0.3 0.2 0:05.06 top
1 root 16 0 3444 548 468 S 0.0 0.1 0:06.56 init
2 root 34 19 0 0 0 S 0.0 0.0 0:00.03 ksoftirqd/0
3 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 events/0
4 root 5 -10 0 0 0 S 0.0 0.0 0:00.03 khelper
5 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
Here is a higher-load cut:
top - 12:43:24 up 2 days, 12:40, 1 user, load average: 2.15, 0.98, 0.74
Tasks: 87 total, 1 running, 86 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 515568k total, 352124k used, 163444k free, 41356k buffers
Swap: 786232k total, 0k used, 786232k free, 130060k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 16 0 3444 548 468 S 0.0 0.1 0:06.56 init
2 root 34 19 0 0 0 S 0.0 0.0 0:00.03 ksoftirqd/0
3 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 events/0
4 root 5 -10 0 0 0 S 0.0 0.0 0:00.03 khelper
5 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
6 root 25 0 0 0 0 S 0.0 0.0 0:00.00 khubd
35 root 15 0 0 0 0 S 0.0 0.0 0:00.00 kapmd
38 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
39 root 15 0 0 0 0 S 0.0 0.0 0:01.84 pdflush
40 root 25 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
41 root 14 -10 0 0 0 S 0.0 0.0 0:00.00 aio/0
Note that when the load goes up, it happens on all three servers at the same
time, seconds apart at most.
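Since the spikes hit all three nodes within seconds of each other, logging timestamped load averages on each node and comparing the logs afterwards may narrow down the trigger. A sketch (the log path and sampling interval are arbitrary; the loop is bounded to 3 samples here, but in practice it would run under nohup until a spike is captured):

```shell
# Append a timestamped 1-minute load average to a log every sample.
# Run the same loop on all three web nodes, then diff the spike times.
LOG=/tmp/loadavg.log       # arbitrary path
for i in 1 2 3; do         # bounded for illustration; loop longer in practice
    printf '%s %s\n' "$(date '+%F %T')" "$(cut -d' ' -f1 /proc/loadavg)" >> "$LOG"
    sleep 1
done
```

If the spikes line up exactly across nodes, that points at something shared, such as GFS lock traffic or a common cron job, rather than per-node HTTP load.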
> Can you tell if any processes are hogging CPU or anything?
> Can you do a bonnie++ against your disks and see if the IO
> is slower than normal for some reason?
If there is anything else I can provide to help solve this problem, I'll be
more than happy to.
Mike