Determining what is causing load when server is idle.

George Magklaras georgios at ulrik.uio.no
Thu Apr 20 09:28:50 UTC 2006


Hi Ray,

Sorry, I lost the thread in my browser. I have looked at the figures and, 
no, I do not see an excessive number of page faults. The vmstat output you 
posted for a very quiet system does, however, point to an unusually high 
number of interrupts.
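
By the way, further down you mention that you cannot get sar to report on 
IRQ 0. Assuming the sysstat shipped with RHEL4 behaves like the versions I 
have here (worth checking against your man page), you can sample the 
interrupt rate live rather than reading it out of the sa history files:

sar -I SUM 3 10    # total interrupts/sec, ten 3-second samples
sar -I 0 3 10      # IRQ 0 (the timer) on its own

For per-IRQ figures to show up in the daily history files, sadc has to be 
collecting interrupt statistics (its -I option), if I remember correctly.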

If you still have this issue, try booting a uniprocessor kernel, or boot 
with the 'noapic' option, and see whether the symptom persists. I also note 
that the kernel you mentioned does not appear to be the latest. I would 
up2date the system and see if the problem goes away; then, on the up-to-date 
kernel (2.6.9-34.ELsmp as we speak), boot with the noapic option and see if 
you notice a difference.
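
To be concrete, something along these lines should do it on a stock RHEL4 
box. The grub entry below is only a placeholder -- keep whatever title, 
root= device and partition numbers your /boot/grub/grub.conf already has 
and just append 'noapic' to the kernel line:

up2date -u          # bring the box current; check that kernel packages are
                    # not in pkgSkipList in /etc/sysconfig/rhn/up2date
up2date kernel      # optionally pull the uniprocessor kernel as well

title Red Hat Enterprise Linux (2.6.9-34.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/md0 noapic
        initrd /initrd-2.6.9-34.ELsmp.img

Reboot into that entry and compare the 'in' column of vmstat and the timer 
line of /proc/interrupts against what you posted below.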


Best Regards,
GM
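
P.S. You are right that /etc/raidtab is not generated any more; 
/etc/mdadm.conf has taken its place. If you want more explicit ARRAY lines 
in it than the super-minor ones (UUID, level, member count), mdadm can 
print them for you while the arrays are running:

mdadm --detail --scan
# or, to append them to the config (prune any duplicate ARRAY lines by hand):
mdadm --detail --scan >> /etc/mdadm.conf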


Ray Van Dolson wrote:
> On Fri, Apr 07, 2006 at 12:17:30PM +0200, George Magklaras wrote:
> 
>>Seeing init in S mode in 'top' like that:
>>  1 root      16   0  1972  556  480 S  0.0  0.0   0:00.53 init
>>
>>is not so extraordinary if you just invoke 'top'. If it stayed in R or 
>>another state continuously, that would be alarming.
> 
> 
> init stays in 'S' mode for the duration of top.
> 
> 
>>>Another symptom that comes along with this weird non-0.00 load issue is 
>>>that
>>>user I/O seems to "glitch" every now and then.  Almost like the hard drives
>>>are spinning up after being put to sleep... however, APM is disabled in my
>>>kernel since I am running in SMP mode.
>>
>>I think that #might# be the key symptom. What exactly do you mean by 
>>'glitch'? Does I/O pause for several seconds (long enough that you notice 
>>it) and then continue, or does it abort completely (I/O errors)? It could 
>>be that some kind of background reconstruction or syncing is happening 
>>due to driver or hardware issues.
> 
> 
> Yes.  This is exactly the behavior I'm experiencing.  Everything just pauses
> then within 2-5 seconds control returns.
> 
> 
>>dmesg | grep -i md
>>
>>should give you any hiccups related to the RAID config. Also doing a 
>>'vmstat 3'
> 
> 
> Nothing interesting really in the dmesg output, but vmstat shows a lot of
> interrupts:
> 
> On DL140G2 w/ SATA software RAID1:
> 
> [root@localhost oracle]# vmstat 3
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  0      0  89340  46236 1828624    0    0     1    29  185    17  0  0 98  1
>  0  0      0  89340  46236 1828624    0    0     0    11 1014    17  0  0 100  1
>  0  0      0  89276  46236 1828624    0    0     0    28 1017    25  0  0 99  1
>  0  0      0  89276  46236 1828624    0    0     0    11 1014    19  0  0 100  1
>  0  0      0  89276  46236 1828624    0    0     0    21 1016    24  0  0 99  1
>  0  0      0  89276  46236 1828624    0    0     0    11 1014    19  0  0 99  1
>  0  0      0  89276  46236 1828624    0    0     0    20 1016    24  0  0 99  1
>  0  0      0  89276  46236 1828624    0    0     0    11 1013    19  0  0 99  1
> 
> On DL140G1 w/ IDE software RAID1 (this box is actually in production so is
> "busier" than the box above)
> 
> [root@billmax root]# vmstat 3
> procs                      memory      swap          io     system         cpu
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  0  24604  18508 127772 507604    0    0     1     0    1     0  0  0  0  1
>  0  0  24604  18508 127772 507604    0    0     0     0  113    24  0  0 100  0
>  0  0  24604  18508 127772 507604    0    0     0    16  115    29  0  0 100  0
>  0  0  24604  18508 127772 507604    0    0     0   111  164    52  0  0 96  3
>  0  0  24604  18508 127772 507604    0    0     0     0  121    33  0  0 100  0
>  0  0  24604  18508 127772 507608    0    0     1     7  116    48  0  0 100  0
>  0  0  24604  18508 127772 507620    0    0     3     0  113    38  0  0 100  0
>  0  0  24604  18508 127772 507620    0    0     0    51  131    26  0  0 100  0
>  0  0  24604  18508 127772 507620    0    0     0     0  113    34  0  0 100  0
> 
> 
> 
>>/proc/interrupts, the output of 'lsmod' and your SoftRAID config files 
>>would help, as well as your kernel version.
>>
> 
> 
> Kernel is 2.6.9-22.ELsmp.
> 
> [root@localhost oracle]# cat /proc/interrupts
>            CPU0       CPU1       
>   0:   33071575   33118497    IO-APIC-edge  timer
>   1:         28         58    IO-APIC-edge  i8042
>   8:          0          1    IO-APIC-edge  rtc
>   9:          0          0   IO-APIC-level  acpi
>  14:      79946      81927    IO-APIC-edge  libata
>  15:      81059      80767    IO-APIC-edge  libata
> 169:    1037048        132   IO-APIC-level  uhci_hcd, eth0
> 177:          0          0   IO-APIC-level  uhci_hcd
> 185:          0          0   IO-APIC-level  ehci_hcd
> NMI:          0          0 
> LOC:   66192663   66192736 
> ERR:          0
> MIS:          0
> 
> RAID configuration -- it doesn't appear that /etc/raidtab gets generated any
> longer.  Here is /etc/mdadm.conf:
> 
> DEVICE partitions
> MAILADDR root
> ARRAY /dev/md0 super-minor=0
> ARRAY /dev/md1 super-minor=1
> 
> Some output from dmesg:
> 
> md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
> ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14
> ata2: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x1478 irq 15
> md: raid1 personality registered as nr 3
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: considering sdb3 ...
> md:  adding sdb3 ...
> md: sdb1 has different UUID to sdb3
> md:  adding sda3 ...
> md: sda1 has different UUID to sdb3
> md: created md0
> md: bind<sda3>
> md: bind<sdb3>
> md: running: <sdb3><sda3>
> raid1: raid set md0 active with 2 out of 2 mirrors
> md: considering sdb1 ...
> md:  adding sdb1 ...
> md:  adding sda1 ...
> md: created md1
> md: bind<sda1>
> md: bind<sdb1>
> md: running: <sdb1><sda1>
> raid1: raid set md1 active with 2 out of 2 mirrors
> md: ... autorun DONE.
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: ... autorun DONE.
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: ... autorun DONE.
> EXT3 FS on md0, internal journal
> EXT3 FS on md1, internal journal
> 
> [root@localhost oracle]# cat /proc/mdstat
> Personalities : [raid1] 
> md1 : active raid1 sdb1[1] sda1[0]
>       104320 blocks [2/2] [UU]
>       
> md0 : active raid1 sdb3[1] sda3[0]
>       76991424 blocks [2/2] [UU]
>       
> unused devices: <none>
> 
> Here are also some sar statistics:
> 
> [root@localhost oracle]# sar
> 12:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
> 08:00:01 AM       all      0.00      0.00      0.01      0.92     99.06
> 08:10:01 AM       all      0.15      0.00      0.02      0.98     98.85
> 08:20:01 AM       all      0.01      0.00      0.01      0.95     99.03
> 08:30:01 AM       all      0.01      0.00      0.01      0.95     99.03
> 08:40:01 AM       all      0.00      0.00      0.01      1.05     98.94
> 08:50:01 AM       all      0.01      0.00      0.01      0.95     99.03
> 09:00:01 AM       all      0.02      0.00      0.02      0.95     99.00
> 09:10:01 AM       all      0.16      0.00      0.03      0.98     98.83
> 09:20:01 AM       all      0.01      0.00      0.01      0.96     99.02
> Average:          all      0.04      0.01      0.03      1.01     98.91
> 
> iowait seems noticeably higher than on my DL140G1.
> 
> [root@localhost oracle]# sar -B
> Linux 2.6.9-22.ELsmp (localhost.localdomain)    04/07/2006
> 
> 12:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s
> 12:10:01 AM      0.07     19.70     45.95      0.00
> 12:20:01 AM      0.00     17.59     10.47      0.00
> 12:30:01 AM      0.00     17.10      9.02      0.00
> 12:40:01 AM      0.00     21.03     15.56      0.00
> 12:50:01 AM      0.00     17.34     15.80      0.00
> 01:00:01 AM      0.00     17.20      8.97      0.00
> 01:10:01 AM      0.00     19.50     45.04      0.00
> 01:20:01 AM      0.00     17.49      9.28      0.00
> 01:30:01 AM      0.00     17.22      8.94      0.00
> 01:40:01 AM      0.00     20.27     15.61      0.00
> 01:50:01 AM      0.00     17.08      9.10      0.00
> 
> Not sure if the number of page faults there is unusual or not.
> 
> The most unusual thing seems to be the number of interrupts going on.  I
> can't seem to call sar -I with an IRQ value of 0, but a watch -n 1 "cat
> /proc/interrupts" seems to show about 1000 interrupts per second to the
> IO-APIC-edge timer on the DL140G2 system.
> 
> On the DL140G1 system, I am only seeing about 100 interrupts per second to
> the IO-APIC-edge timer.
> 
> Anyways, I am going to keep playing around with sar and see if anything else
> stands out.  Any suggestions?
> 
> Ray
> 



