Determining what is causing load when the server is idle.

Ray Van Dolson rayvd at digitalpath.net
Fri Apr 7 16:41:15 UTC 2006


On Fri, Apr 07, 2006 at 12:17:30PM +0200, George Magklaras wrote:
> Seeing init in S state in 'top' like this:
>   1 root      16   0  1972  556  480 S  0.0  0.0   0:00.53 init
> 
> is not so extraordinary if you just invoke 'top'. If it stayed in R or 
> another state continuously, that would be alarming.

init stays in 'S' mode for the duration of top.

> 
> >Another symptom that comes along with this weird non-0.00 load issue is 
> >that
> >user I/O seems to "glitch" every now and then.  Almost like the hard drives
> >are spinning up after being put to sleep... however, APM is disabled in my
> >kernel since I am running in SMP mode.
> 
> I think that #might# be the key symptom. What exactly do you mean by the 
> 'glitch'? Does I/O pause for an interval long enough to notice (several 
> seconds) and then continue, or does it abort completely (I/O errors)? It 
> could be that some kind of background reconstruction or syncing is 
> happening due to driver or hardware issues.

Yes.  This is exactly the behavior I'm experiencing.  Everything just pauses,
and within 2-5 seconds control returns.
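One way I could try to catch these pauses in the act is a crude latency probe:
time a series of small synchronous writes and watch for outliers. A sketch
(the /tmp path, 4k write size, and iteration count are arbitrary choices):

```shell
# Crude I/O latency probe: time small synchronous writes; a "glitch"
# should show up as one iteration taking several seconds instead of ~0.
for i in $(seq 1 30); do
  start=$(date +%s)
  dd if=/dev/zero of=/tmp/ioprobe.$$ bs=4k count=1 oflag=sync 2>/dev/null
  echo "write $i: $(( $(date +%s) - start ))s"
  sleep 1
done
rm -f /tmp/ioprobe.$$
```

Pointing this at a file on each md device in turn would also show whether both
arrays stall or only one.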

> dmesg | grep -i md
> 
> should give you any hiccups related to the RAID config. Also doing a 
> 'vmstat 3',

Nothing interesting really in the dmesg output, but vmstat shows a lot of
interrupts:

On DL140G2 w/ SATA software RAID1:

[root@localhost oracle]# vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0  89340  46236 1828624    0    0     1    29  185    17  0  0 98  1
 0  0      0  89340  46236 1828624    0    0     0    11 1014    17  0  0 100  1
 0  0      0  89276  46236 1828624    0    0     0    28 1017    25  0  0 99  1
 0  0      0  89276  46236 1828624    0    0     0    11 1014    19  0  0 100  1
 0  0      0  89276  46236 1828624    0    0     0    21 1016    24  0  0 99  1
 0  0      0  89276  46236 1828624    0    0     0    11 1014    19  0  0 99  1
 0  0      0  89276  46236 1828624    0    0     0    20 1016    24  0  0 99  1
 0  0      0  89276  46236 1828624    0    0     0    11 1013    19  0  0 99  1

On DL140G1 w/ IDE software RAID1 (this box is actually in production, so it is
"busier" than the box above):

[root@billmax root]# vmstat 3
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  24604  18508 127772 507604    0    0     1     0    1     0  0  0  0  1
 0  0  24604  18508 127772 507604    0    0     0     0  113    24  0  0 100  0
 0  0  24604  18508 127772 507604    0    0     0    16  115    29  0  0 100  0
 0  0  24604  18508 127772 507604    0    0     0   111  164    52  0  0 96  3
 0  0  24604  18508 127772 507604    0    0     0     0  121    33  0  0 100  0
 0  0  24604  18508 127772 507608    0    0     1     7  116    48  0  0 100  0
 0  0  24604  18508 127772 507620    0    0     3     0  113    38  0  0 100  0
 0  0  24604  18508 127772 507620    0    0     0    51  131    26  0  0 100  0
 0  0  24604  18508 127772 507620    0    0     0     0  113    34  0  0 100  0
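To compare the two boxes numerically rather than eyeballing the columns, the
'in' column can be averaged over a run. A sketch, assuming the field layout
shown above ('in' is field 11):

```shell
# Average interrupts/sec over 10 one-second vmstat samples.
# NR > 3 skips the two header lines and the first (since-boot) sample.
vmstat 1 10 | awk 'NR > 3 { sum += $11; n++ } END { printf "avg in/s: %.0f\n", sum / n }'
```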


> /proc/interrupts, the output of 'lsmod', and your SoftRAID config files 
> would help, as well as your kernel version.
> 

Kernel is 2.6.9-22.ELsmp.

[root@localhost oracle]# cat /proc/interrupts
           CPU0       CPU1       
  0:   33071575   33118497    IO-APIC-edge  timer
  1:         28         58    IO-APIC-edge  i8042
  8:          0          1    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:      79946      81927    IO-APIC-edge  libata
 15:      81059      80767    IO-APIC-edge  libata
169:    1037048        132   IO-APIC-level  uhci_hcd, eth0
177:          0          0   IO-APIC-level  uhci_hcd
185:          0          0   IO-APIC-level  ehci_hcd
NMI:          0          0 
LOC:   66192663   66192736 
ERR:          0
MIS:          0

RAID configuration -- it doesn't appear that /etc/raidtab gets generated any
longer.  Here is /etc/mdadm.conf:

DEVICE partitions
MAILADDR root
ARRAY /dev/md0 super-minor=0
ARRAY /dev/md1 super-minor=1

Some output from dmesg:

md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14
ata2: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x1478 irq 15
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md:  adding sdb3 ...
md: sdb1 has different UUID to sdb3
md:  adding sda3 ...
md: sda1 has different UUID to sdb3
md: created md0
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md0 active with 2 out of 2 mirrors
md: considering sdb1 ...
md:  adding sdb1 ...
md:  adding sda1 ...
md: created md1
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md1 active with 2 out of 2 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT3 FS on md0, internal journal
EXT3 FS on md1, internal journal

[root@localhost oracle]# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
      
md0 : active raid1 sdb3[1] sda3[0]
      76991424 blocks [2/2] [UU]
      
unused devices: <none>
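The [2/2] [UU] fields above confirm both mirrors are up. Since background
reconstruction was raised as a possibility, a quick scripted check for a
degraded or resyncing array might look like this (a sketch: it just looks for
the '_' that marks a missing mirror, or a resync/recovery progress line):

```shell
# A degraded raid1 member shows as '_' in the status brackets (e.g. [U_]),
# and an in-progress rebuild adds a "resync" or "recovery" line.
if grep -E 'resync|recovery|\[[U_]*_[U_]*\]' /proc/mdstat >/dev/null; then
  echo "md: degraded or resyncing"
else
  echo "md: all arrays clean"
fi
```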

Here are also some sar statistics:

[root@localhost oracle]# sar
12:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
08:00:01 AM       all      0.00      0.00      0.01      0.92     99.06
08:10:01 AM       all      0.15      0.00      0.02      0.98     98.85
08:20:01 AM       all      0.01      0.00      0.01      0.95     99.03
08:30:01 AM       all      0.01      0.00      0.01      0.95     99.03
08:40:01 AM       all      0.00      0.00      0.01      1.05     98.94
08:50:01 AM       all      0.01      0.00      0.01      0.95     99.03
09:00:01 AM       all      0.02      0.00      0.02      0.95     99.00
09:10:01 AM       all      0.16      0.00      0.03      0.98     98.83
09:20:01 AM       all      0.01      0.00      0.01      0.96     99.02
Average:          all      0.04      0.01      0.03      1.01     98.91

iowait seems noticeably higher than on my DL140G1.

[root@localhost oracle]# sar -B
Linux 2.6.9-22.ELsmp (localhost.localdomain)    04/07/2006

12:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s
12:10:01 AM      0.07     19.70     45.95      0.00
12:20:01 AM      0.00     17.59     10.47      0.00
12:30:01 AM      0.00     17.10      9.02      0.00
12:40:01 AM      0.00     21.03     15.56      0.00
12:50:01 AM      0.00     17.34     15.80      0.00
01:00:01 AM      0.00     17.20      8.97      0.00
01:10:01 AM      0.00     19.50     45.04      0.00
01:20:01 AM      0.00     17.49      9.28      0.00
01:30:01 AM      0.00     17.22      8.94      0.00
01:40:01 AM      0.00     20.27     15.61      0.00
01:50:01 AM      0.00     17.08      9.10      0.00

Not sure if the number of page faults there is unusual or not.

The most unusual thing seems to be the number of interrupts occurring.  I
can't seem to call sar -I with an IRQ value of 0, but a watch -n 1 "cat
/proc/interrupts" shows about 1000 interrupts per second to the
IO-APIC-edge timer on the DL140G2 system.

On the DL140G1 system, I am only seeing about 100 interrupts per second to
the IO-APIC-edge timer.
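Rather than eyeballing watch output, the timer rate can be measured directly
by diffing two snapshots of /proc/interrupts. A sketch, assuming the two-CPU
layout shown earlier (the awk sums the per-CPU counts on the IRQ 0 line):

```shell
# Sum IRQ 0 (timer) counts across both CPUs, wait 5s, sample again,
# and report the per-second rate.
t1=$(awk '$1 == "0:" { print $2 + $3 }' /proc/interrupts)
sleep 5
t2=$(awk '$1 == "0:" { print $2 + $3 }' /proc/interrupts)
echo "timer interrupts/sec: $(( (t2 - t1) / 5 ))"
```

If I understand correctly, roughly 1000/sec may simply be the 2.6 kernel's
HZ=1000 timer tick, versus 100/sec on HZ=100 kernels, so the difference
between the two boxes might just reflect timer frequency rather than a
problem.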

Anyway, I am going to keep playing around with sar to see if anything else
stands out.  Any suggestions?

Ray




More information about the redhat-list mailing list