Determining what is causing load when the server is idle
Ray Van Dolson
rayvd at digitalpath.net
Fri Apr 7 16:41:15 UTC 2006
On Fri, Apr 07, 2006 at 12:17:30PM +0200, George Magklaras wrote:
> Seeing init in S mode in 'top', like this:
> 1 root 16 0 1972 556 480 S 0.0 0.0 0:00.53 init
>
> is not so extraordinary if you just invoke 'top'. If it were in R or
> another process state continuously, that would be alarming.
init stays in 'S' mode for the duration of top.
>
> >Another symptom that comes along with this weird non-0.00 load issue is
> >that
> >user I/O seems to "glitch" every now and then. Almost like the hard drives
> >are spinning up after being put to sleep... however, APM is disabled in my
> >kernel since I am running in SMP mode.
>
> I think that #might# be the key symptom. What exactly do you mean by
> the 'glitch'? Does I/O pause for several seconds, to the point where
> you notice it, and then continue, or does it abort completely (I/O
> errors)? It could be that some kind of background reconstruction or
> syncing is happening due to driver or hardware issues.
Yes, this is exactly the behavior I'm experiencing. Everything just pauses;
then, within 2-5 seconds, control returns.
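In case it helps anyone reproduce this, here is a rough stall logger I could run; the test file path, iteration count, and 2-second threshold are arbitrary choices on my part, not anything from the boxes above:

```shell
# Rough stall logger: time a small synchronous write each second and
# note any pass that takes unusually long.  /tmp/stalltest, the 3-pass
# limit, and the 2-second threshold are all arbitrary illustrations.
i=0
while [ $i -lt 3 ]; do
    start=$(date +%s)
    dd if=/dev/zero of=/tmp/stalltest bs=4k count=1 conv=fsync 2>/dev/null
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge 2 ]; then
        echo "$(date): write stalled for ${elapsed}s"
    fi
    rm -f /tmp/stalltest
    sleep 1
    i=$((i + 1))
done
echo "done after $i passes"
```

Left running for a while, any line it prints should line up with the pauses I'm seeing interactively.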
> dmesg | grep -i md
>
> should show you any hiccups related to the RAID config. Also try doing a
> 'vmstat 3'
Nothing really interesting in the dmesg output, but vmstat shows a lot of
interrupts:
On DL140G2 w/ SATA software RAID1:
[root at localhost oracle]# vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 89340 46236 1828624 0 0 1 29 185 17 0 0 98 1
0 0 0 89340 46236 1828624 0 0 0 11 1014 17 0 0 100 1
0 0 0 89276 46236 1828624 0 0 0 28 1017 25 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 100 1
0 0 0 89276 46236 1828624 0 0 0 21 1016 24 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 20 1016 24 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1013 19 0 0 99 1
On DL140G1 w/ IDE software RAID1 (this box is actually in production, so it
is "busier" than the box above):
[root at billmax root]# vmstat 3
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 24604 18508 127772 507604 0 0 1 0 1 0 0 0 0 1
0 0 24604 18508 127772 507604 0 0 0 0 113 24 0 0 100 0
0 0 24604 18508 127772 507604 0 0 0 16 115 29 0 0 100 0
0 0 24604 18508 127772 507604 0 0 0 111 164 52 0 0 96 3
0 0 24604 18508 127772 507604 0 0 0 0 121 33 0 0 100 0
0 0 24604 18508 127772 507608 0 0 1 7 116 48 0 0 100 0
0 0 24604 18508 127772 507620 0 0 3 0 113 38 0 0 100 0
0 0 24604 18508 127772 507620 0 0 0 51 131 26 0 0 100 0
0 0 24604 18508 127772 507620 0 0 0 0 113 34 0 0 100 0
> /proc/interrupts, the output of 'lsmod' and your SoftRAID config files
> would help, as well as your kernel version.
>
Kernel is 2.6.9-22.ELsmp.
[root at localhost oracle]# cat /proc/interrupts
CPU0 CPU1
0: 33071575 33118497 IO-APIC-edge timer
1: 28 58 IO-APIC-edge i8042
8: 0 1 IO-APIC-edge rtc
9: 0 0 IO-APIC-level acpi
14: 79946 81927 IO-APIC-edge libata
15: 81059 80767 IO-APIC-edge libata
169: 1037048 132 IO-APIC-level uhci_hcd, eth0
177: 0 0 IO-APIC-level uhci_hcd
185: 0 0 IO-APIC-level ehci_hcd
NMI: 0 0
LOC: 66192663 66192736
ERR: 0
MIS: 0
RAID configuration -- it doesn't appear that /etc/raidtab gets generated any
longer. Here is /etc/mdadm.conf:
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 super-minor=0
ARRAY /dev/md1 super-minor=1
Some output from dmesg:
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14
ata2: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x1478 irq 15
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md: adding sdb3 ...
md: sdb1 has different UUID to sdb3
md: adding sda3 ...
md: sda1 has different UUID to sdb3
md: created md0
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md0 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md1
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md1 active with 2 out of 2 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT3 FS on md0, internal journal
EXT3 FS on md1, internal journal
[root at localhost oracle]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[0]
76991424 blocks [2/2] [UU]
unused devices: <none>
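Just to rule out the background-resync theory, here is a quick check of a /proc/mdstat snapshot for degraded mirrors (an "_" in the [UU] status) or an in-progress resync; the sample text is the mdstat output above, and on the live box I would use the real file instead:

```shell
# Scan an mdstat snapshot for degraded arrays or a running resync,
# either of which could explain background I/O.  The sample string is
# the /proc/mdstat output above; live, use: mdstat=$(cat /proc/mdstat)
mdstat='md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[0]
      76991424 blocks [2/2] [UU]'
degraded=$(printf '%s\n' "$mdstat" | grep -c '\[.*_.*\]' || true)
resync=$(printf '%s\n' "$mdstat" | grep -c 'resync' || true)
echo "degraded arrays: $degraded, resyncing: $resync"
```

For my arrays above it reports zero on both counts, which matches the clean [UU] status.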
Here are also some sar statistics:
[root at localhost oracle]# sar
12:00:01 AM CPU %user %nice %system %iowait %idle
08:00:01 AM all 0.00 0.00 0.01 0.92 99.06
08:10:01 AM all 0.15 0.00 0.02 0.98 98.85
08:20:01 AM all 0.01 0.00 0.01 0.95 99.03
08:30:01 AM all 0.01 0.00 0.01 0.95 99.03
08:40:01 AM all 0.00 0.00 0.01 1.05 98.94
08:50:01 AM all 0.01 0.00 0.01 0.95 99.03
09:00:01 AM all 0.02 0.00 0.02 0.95 99.00
09:10:01 AM all 0.16 0.00 0.03 0.98 98.83
09:20:01 AM all 0.01 0.00 0.01 0.96 99.02
Average: all 0.04 0.01 0.03 1.01 98.91
iowait seems noticeably higher than on my DL140G1.
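For a single number to compare between the two boxes, I averaged the %iowait column from the sar lines above (the awk field positions assume that exact sar layout, with AM/PM in the second field):

```shell
# Average the %iowait column (7th field) over the sar samples above.
sar_out='08:00:01 AM all 0.00 0.00 0.01 0.92 99.06
08:10:01 AM all 0.15 0.00 0.02 0.98 98.85
08:20:01 AM all 0.01 0.00 0.01 0.95 99.03
08:30:01 AM all 0.01 0.00 0.01 0.95 99.03
08:40:01 AM all 0.00 0.00 0.01 1.05 98.94
08:50:01 AM all 0.01 0.00 0.01 0.95 99.03
09:00:01 AM all 0.02 0.00 0.02 0.95 99.00
09:10:01 AM all 0.16 0.00 0.03 0.98 98.83
09:20:01 AM all 0.01 0.00 0.01 0.96 99.02'
avg=$(printf '%s\n' "$sar_out" | awk '$3 == "all" { s += $7; n++ } END { printf "%.2f", s / n }')
echo "average %iowait: $avg"
```

That comes out just under 1% for the DL140G2, which still seems high for a box doing nothing.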
[root at localhost oracle]# sar -B
Linux 2.6.9-22.ELsmp (localhost.localdomain) 04/07/2006
12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s
12:10:01 AM 0.07 19.70 45.95 0.00
12:20:01 AM 0.00 17.59 10.47 0.00
12:30:01 AM 0.00 17.10 9.02 0.00
12:40:01 AM 0.00 21.03 15.56 0.00
12:50:01 AM 0.00 17.34 15.80 0.00
01:00:01 AM 0.00 17.20 8.97 0.00
01:10:01 AM 0.00 19.50 45.04 0.00
01:20:01 AM 0.00 17.49 9.28 0.00
01:30:01 AM 0.00 17.22 8.94 0.00
01:40:01 AM 0.00 20.27 15.61 0.00
01:50:01 AM 0.00 17.08 9.10 0.00
Not sure if the number of page faults there is unusual or not.
The most unusual thing seems to be the interrupt rate. I can't seem to call
sar -I with an IRQ value of 0, but watch -n 1 "cat /proc/interrupts" shows
about 1000 interrupts per second to the IO-APIC-edge timer on the DL140G2
system.
On the DL140G1 system, I am only seeing about 100 interrupts per second to
the IO-APIC-edge timer.
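To put a number on it without sar -I, one can sum the per-CPU counters on the IRQ 0 line of /proc/interrupts and diff two reads taken a second apart. The first sample line below is the real one from above; the second is invented to illustrate two CPUs each ticking at 1000 Hz. Note that a 2.6 kernel with a 1000 Hz timer tick generates roughly 1000 timer interrupts per second per CPU by design, so this rate may just reflect the newer kernel's HZ setting rather than a problem.

```shell
# Sum the per-CPU counters on the IRQ 0 (timer) line and diff two
# snapshots taken one second apart.  snap2's counts are invented; on a
# live box take two reads of:  grep '^ *0:' /proc/interrupts
snap1='  0:   33071575   33118497    IO-APIC-edge  timer'
snap2='  0:   33072575   33119497    IO-APIC-edge  timer'
total() { printf '%s\n' "$1" | awk '{ s = 0; for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i; print s }'; }
rate=$(( $(total "$snap2") - $(total "$snap1") ))
echo "timer interrupts/sec: $rate"
```

With the invented deltas this prints 2000/sec total across the two CPUs, which is what a dual-CPU 1000 Hz box would show.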
Anyway, I am going to keep playing around with sar and see if anything else
stands out. Any suggestions?
Ray
More information about the redhat-list mailing list