[Linux-cluster] scsi_watchdog

isplist at logicore.net isplist at logicore.net
Wed Nov 21 18:45:16 UTC 2007


> What exactly do you mean by slow?

A general description: the web servers have not been responding to requests
very efficiently since my latest yum update.

The storage is Xyratex fibre channel sectioned into RAID5 partitions. 
The HBAs are QLogic's older 2200s.
The OS is RHEL4. 
The setup is 5 nodes for testing, 3 web servers sharing GFS storage for their 
web pages, 1 image server to offload the web servers, 1 admin server for 
design and administration.

When I first connect any node to the storage, there is a long delay of about 
20 or more seconds before df returns the storage information. This happens on 
each node when first connected, and again later if there has been no activity 
(http connections to the web server).
It is almost as if it takes a few moments to take inventory of the storage's 
current statistics/configuration.
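The delay is easy to quantify by timing a cold df against an immediate repeat. 
A minimal sketch (MNT is a placeholder; substitute the GFS mount point — it is 
set to / here only so the commands run anywhere):

```shell
# MNT is a placeholder for the GFS mount point (/ used here only so
# the commands run on any machine).
MNT=/
time df -h "$MNT"   # cold run: includes any first-access delay
time df -h "$MNT"   # immediate repeat: should return almost instantly
```

If the second run is fast and the first is not, the time is going into the 
first access rather than into df itself.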

hdparm -tT gives results that seem very low for this type of setup:

/dev/VolGroup01/sql:
 Timing cached reads:   604 MB in  2.01 seconds = 300.10 MB/sec
 Timing buffered disk reads:   60 MB in  3.06 seconds =  19.63 MB/sec
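Since testing the LVM volume measures the whole device-mapper stack, it may be 
worth comparing against the raw fibre channel device underneath it. A hedged 
sketch (DEV is a placeholder for the actual FC device node; guarded so it is a 
no-op where the tool or device is absent):

```shell
# DEV is a placeholder for the raw FC device behind the LVM volume;
# comparing its throughput with the LV's can localise whether the
# slowdown is in the HBA/array or in the LVM/GFS layers above it.
DEV=/dev/sda
if command -v hdparm >/dev/null && [ -b "$DEV" ]; then
    hdparm -tT "$DEV" || echo "hdparm failed (needs root?)"
else
    echo "hdparm or $DEV not available here"
fi
```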

However, I have not gotten around to fine-tuning anything on the storage yet 
either. I just installed bonnie++, so I need to read up on how to use it.
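From what I've read so far, a typical starting invocation would be something 
like the following (a sketch, assuming the GFS mount is at /mnt/gfs and 512 MB 
RAM per node: -d is the test directory, -s the test file size, which should be 
roughly twice RAM, and -u a non-root user, since bonnie++ refuses to run as 
root without one):

```shell
# Assumptions: GFS mounted at /mnt/gfs, 512 MB RAM per node (so a
# 1024 MB test file is ~2x RAM). Guarded so it is a no-op if
# bonnie++ is not installed.
if command -v bonnie++ >/dev/null; then
    bonnie++ -d /mnt/gfs -s 1024 -u nobody
else
    echo "bonnie++ not installed"
fi
```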

Since the update, the web nodes show fairly high loads even when running idle. 
They idle around 0.20-0.50, then constantly spike to between 1.00 and 2.50. 
When I check with top, I don't see anything unusual:

Here is an average snapshot:

top - 12:25:12 up 2 days, 12:22,  1 user,  load average: 1.10, 0.84, 0.74
Tasks:  87 total,   1 running,  86 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.3% sy,  0.0% ni, 99.7% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:    515568k total,   351548k used,   164020k free,    41356k buffers
Swap:   786232k total,        0k used,   786232k free,   129592k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30977 root      16   0  2868  956  764 R  0.3  0.2   0:05.06 top
    1 root      16   0  3444  548  468 S  0.0  0.1   0:06.56 init
    2 root      34  19     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/0
    3 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 events/0
    4 root       5 -10     0    0    0 S  0.0  0.0   0:00.03 khelper
    5 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 kblockd/0

Here is a snapshot under higher load:

top - 12:43:24 up 2 days, 12:40,  1 user,  load average: 2.15, 0.98, 0.74
Tasks:  87 total,   1 running,  86 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:    515568k total,   352124k used,   163444k free,    41356k buffers
Swap:   786232k total,        0k used,   786232k free,   130060k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      16   0  3444  548  468 S  0.0  0.1   0:06.56 init
    2 root      34  19     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/0
    3 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 events/0
    4 root       5 -10     0    0    0 S  0.0  0.0   0:00.03 khelper
    5 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
    6 root      25   0     0    0    0 S  0.0  0.0   0:00.00 khubd
   35 root      15   0     0    0    0 S  0.0  0.0   0:00.00 kapmd
   38 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
   39 root      15   0     0    0    0 S  0.0  0.0   0:01.84 pdflush
   40 root      25   0     0    0    0 S  0.0  0.0   0:00.00 kswapd0
   41 root      14 -10     0    0    0 S  0.0  0.0   0:00.00 aio/0

Note that when the load goes up, it happens on all three servers at the same 
time. Seconds apart at most.

> Can you tell if any processes are hogging CPU or anything?
> Can you do a bonnie++ against your disks and see if the IO
> is slower than normal for some reason?

If there is anything else I can provide to help solve this problem, I'll be 
more than happy to.

Mike





