Determining what is causing load when server is idle.
George Magklaras
georgios at ulrik.uio.no
Thu Apr 20 09:28:50 UTC 2006
Hi Ray,
Sorry, I lost the thread in my browser. I have looked at the figures,
and no, I do not see an excessive number of page faults. The vmstat
output you posted for a very quiet system does, however, point to an
unusually high number of interrupts.
If you still have this issue, try booting a uniprocessor kernel, or
boot with the 'noapic' option, and see whether the symptom persists and
whether the system boots properly. I also note that the kernel you
mentioned does not appear to be the latest. I would up2date the system
and see if the problem goes away; then, on the up-to-date kernel
(2.6.9-34.ELsmp as of this writing), boot with the noapic option and
see if you notice a difference.
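To quantify the timer interrupt rate before and after trying noapic, one quick check is to sample the IRQ 0 counter in /proc/interrupts twice, one second apart. A sketch, assuming a "0:" timer line as in the /proc/interrupts output quoted below:

```shell
#!/bin/sh
# Estimate timer interrupts/sec on CPU0 by differencing the IRQ 0
# counter (second field of the "0:" line) over a one-second interval.
t1=$(awk '$1 == "0:" { print $2 }' /proc/interrupts)
sleep 1
t2=$(awk '$1 == "0:" { print $2 }' /proc/interrupts)
echo "timer interrupts/sec on CPU0: $((t2 - t1))"
```

On a 2.6 kernel with a 1000 Hz tick this should print a figure on the order of the tick rate (split across CPUs if the interrupt is being balanced); a large change after booting with noapic would point at the APIC setup.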
Best Regards,
GM
Ray Van Dolson wrote:
> On Fri, Apr 07, 2006 at 12:17:30PM +0200, George Magklaras wrote:
>
>>Seeing init in the S (sleeping) state in 'top', like this:
>> 1 root 16 0 1972 556 480 S 0.0 0.0 0:00.53 init
>>
>>is not at all extraordinary when you invoke 'top'. If it sat in R or
>>another state continuously, that would be alarming.
>
>
> init stays in 'S' mode for the duration of top.
>
>
>>>Another symptom that comes along with this weird non-0.00 load issue is
>>>that
>>>user I/O seems to "glitch" every now and then. Almost like the hard drives
>>>are spinning up after being put to sleep... however, APM is disabled in my
>>>kernel since I am running in SMP mode.
>>
>>I think that #might# be the key symptom. What exactly do you mean by
>>the 'glitch'? Does I/O pause for an interval, long enough that you
>>notice it for several seconds, and then continue, or does it abort
>>completely (I/O errors)? There could be some kind of background
>>reconstruction or resyncing happening due to driver or hardware issues.
>
>
> Yes, this is exactly the behavior I'm experiencing. Everything just
> pauses, and then within 2-5 seconds control returns.
>
>
>>dmesg | grep -i md
>>
>>should show any hiccups related to the RAID config. Also, the output
>>of 'vmstat 3',
>
>
> Nothing really interesting in the dmesg output, but vmstat shows a lot
> of interrupts:
>
> On DL140G2 w/ SATA software RAID1:
>
> [root at localhost oracle]# vmstat 3
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 89340 46236 1828624 0 0 1 29 185 17 0 0 98 1
> 0 0 0 89340 46236 1828624 0 0 0 11 1014 17 0 0 100 1
> 0 0 0 89276 46236 1828624 0 0 0 28 1017 25 0 0 99 1
> 0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 100 1
> 0 0 0 89276 46236 1828624 0 0 0 21 1016 24 0 0 99 1
> 0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 99 1
> 0 0 0 89276 46236 1828624 0 0 0 20 1016 24 0 0 99 1
> 0 0 0 89276 46236 1828624 0 0 0 11 1013 19 0 0 99 1
>
> On DL140G1 w/ IDE software RAID1 (this box is actually in production, so it
> is "busier" than the box above):
>
> [root at billmax root]# vmstat 3
> procs memory swap io system cpu
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 24604 18508 127772 507604 0 0 1 0 1 0 0 0 0 1
> 0 0 24604 18508 127772 507604 0 0 0 0 113 24 0 0 100 0
> 0 0 24604 18508 127772 507604 0 0 0 16 115 29 0 0 100 0
> 0 0 24604 18508 127772 507604 0 0 0 111 164 52 0 0 96 3
> 0 0 24604 18508 127772 507604 0 0 0 0 121 33 0 0 100 0
> 0 0 24604 18508 127772 507608 0 0 1 7 116 48 0 0 100 0
> 0 0 24604 18508 127772 507620 0 0 3 0 113 38 0 0 100 0
> 0 0 24604 18508 127772 507620 0 0 0 51 131 26 0 0 100 0
> 0 0 24604 18508 127772 507620 0 0 0 0 113 34 0 0 100 0
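The difference in the 'in' column above (roughly 1014/s on the DL140G2 versus 115/s here) can be averaged directly from a short vmstat sample. A sketch, assuming 'in' is the 11th column as in this output format:

```shell
# Average the "in" (interrupts/sec) column over five one-second vmstat
# samples, skipping the two header lines.
vmstat 1 5 | awk 'NR > 2 { sum += $11; n++ } END { printf "avg interrupts/sec: %.0f\n", sum / n }'
```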
>
>
>
>>/proc/interrupts, the output of 'lsmod', and your SoftRAID config
>>files would help, as well as your kernel version.
>>
>
>
> Kernel is 2.6.9-22.ELsmp.
>
> [root at localhost oracle]# cat /proc/interrupts
> CPU0 CPU1
> 0: 33071575 33118497 IO-APIC-edge timer
> 1: 28 58 IO-APIC-edge i8042
> 8: 0 1 IO-APIC-edge rtc
> 9: 0 0 IO-APIC-level acpi
> 14: 79946 81927 IO-APIC-edge libata
> 15: 81059 80767 IO-APIC-edge libata
> 169: 1037048 132 IO-APIC-level uhci_hcd, eth0
> 177: 0 0 IO-APIC-level uhci_hcd
> 185: 0 0 IO-APIC-level ehci_hcd
> NMI: 0 0
> LOC: 66192663 66192736
> ERR: 0
> MIS: 0
>
> RAID configuration -- it doesn't appear that /etc/raidtab is generated any
> longer. Here is /etc/mdadm.conf:
>
> DEVICE partitions
> MAILADDR root
> ARRAY /dev/md0 super-minor=0
> ARRAY /dev/md1 super-minor=1
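As an aside: since the raidtools-era /etc/raidtab is no longer used, ARRAY lines like the ones above can be regenerated from the running arrays with mdadm itself:

```shell
# Emit one ARRAY line per running md device, suitable for mdadm.conf
# (run as root; review the output before appending it to the config).
mdadm --detail --scan
```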
>
> Some output from dmesg:
>
> md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
> ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14
> ata2: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x1478 irq 15
> md: raid1 personality registered as nr 3
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: considering sdb3 ...
> md: adding sdb3 ...
> md: sdb1 has different UUID to sdb3
> md: adding sda3 ...
> md: sda1 has different UUID to sdb3
> md: created md0
> md: bind<sda3>
> md: bind<sdb3>
> md: running: <sdb3><sda3>
> raid1: raid set md0 active with 2 out of 2 mirrors
> md: considering sdb1 ...
> md: adding sdb1 ...
> md: adding sda1 ...
> md: created md1
> md: bind<sda1>
> md: bind<sdb1>
> md: running: <sdb1><sda1>
> raid1: raid set md1 active with 2 out of 2 mirrors
> md: ... autorun DONE.
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: ... autorun DONE.
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: ... autorun DONE.
> EXT3 FS on md0, internal journal
> EXT3 FS on md1, internal journal
>
> [root at localhost oracle]# cat /proc/mdstat
> Personalities : [raid1]
> md1 : active raid1 sdb1[1] sda1[0]
> 104320 blocks [2/2] [UU]
>
> md0 : active raid1 sdb3[1] sda3[0]
> 76991424 blocks [2/2] [UU]
>
> unused devices: <none>
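The [2/2] [UU] fields above mean both mirror halves are active; a failed member would show as an underscore in the bitmap (e.g. [2/1] [U_]). A quick scripted check for that, as a sketch:

```shell
# Print any /proc/mdstat status line whose member bitmap contains "_",
# i.e. an array running with a failed or missing mirror half.
awk '/\[[U_]+\]/ && /_/ { print "degraded:", $0 }' /proc/mdstat
```

No output means every listed array has all of its members.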
>
> Here are also some sar statistics:
>
> [root at localhost oracle]# sar
> 12:00:01 AM CPU %user %nice %system %iowait %idle
> 08:00:01 AM all 0.00 0.00 0.01 0.92 99.06
> 08:10:01 AM all 0.15 0.00 0.02 0.98 98.85
> 08:20:01 AM all 0.01 0.00 0.01 0.95 99.03
> 08:30:01 AM all 0.01 0.00 0.01 0.95 99.03
> 08:40:01 AM all 0.00 0.00 0.01 1.05 98.94
> 08:50:01 AM all 0.01 0.00 0.01 0.95 99.03
> 09:00:01 AM all 0.02 0.00 0.02 0.95 99.00
> 09:10:01 AM all 0.16 0.00 0.03 0.98 98.83
> 09:20:01 AM all 0.01 0.00 0.01 0.96 99.02
> Average: all 0.04 0.01 0.03 1.01 98.91
>
> iowait seems noticeably higher than on my DL140G1.
>
> [root at localhost oracle]# sar -B
> Linux 2.6.9-22.ELsmp (localhost.localdomain) 04/07/2006
>
> 12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s
> 12:10:01 AM 0.07 19.70 45.95 0.00
> 12:20:01 AM 0.00 17.59 10.47 0.00
> 12:30:01 AM 0.00 17.10 9.02 0.00
> 12:40:01 AM 0.00 21.03 15.56 0.00
> 12:50:01 AM 0.00 17.34 15.80 0.00
> 01:00:01 AM 0.00 17.20 8.97 0.00
> 01:10:01 AM 0.00 19.50 45.04 0.00
> 01:20:01 AM 0.00 17.49 9.28 0.00
> 01:30:01 AM 0.00 17.22 8.94 0.00
> 01:40:01 AM 0.00 20.27 15.61 0.00
> 01:50:01 AM 0.00 17.08 9.10 0.00
>
> Not sure if the number of page faults there is unusual or not.
>
> The most unusual thing seems to be the number of interrupts going on. I
> can't seem to call sar -I with an IRQ value of 0, but a watch -n 1 "cat
> /proc/interrupts" seems to show about 1000 interrupts per second to the
> IO-APIC-edge timer on the DL140G2 system.
>
> On the DL140G1 system, I am only seeing about 100 interrupts per second to
> the IO-APIC-edge timer.
>
> Anyway, I am going to keep playing around with sar and see if anything else
> stands out. Any suggestions?
>
> Ray
>