[rhelv6-list] Odd load average spikes

Chris Adams linux at cmadams.net
Wed Dec 18 15:41:44 UTC 2013


I have a system that is a NAT/LVS front-end for a bunch of servers
(mail, web, etc.).  I noticed in my monitoring that, about every 100
minutes, the load average spikes to around 3-4 (not a steady number; it
fluctuates between maybe 2.5 and 4) for around 10 minutes.
Then it drops back to near 0 fairly quickly.
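
To nail down whether it really is every 100 minutes, and what shape the
spike has, I'm going to log /proc/loadavg directly; something like this
(just a rough sketch, the interval and log path are arbitrary):

    # append timestamped 1/5/15-minute load averages every 10 seconds
    while true; do
        echo "$(date '+%F %T') $(cat /proc/loadavg)"
        sleep 10
    done >> /var/tmp/loadavg.log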

The weird thing is that nothing unusual appears to be happening on the
server during this time.  I was logged in when it happened just now, and
top, ps, vmstat, iostat, etc. showed absolutely nothing unusual, except
for the load average spikes.  There was no unusual traffic, no problems
with the load balancing, no CPU spike (still around 97% idle), no I/O
load, etc.
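
Next time, rather than watching top by hand, I'll probably just log the
raw counters the load average is built from; a rough sketch (output file
is arbitrary):

    # procs_running / procs_blocked in /proc/stat are instantaneous counts of
    # runnable tasks and tasks in uninterruptible (D) sleep, kernel threads included
    while true; do
        echo "$(date '+%T') $(grep -E '^procs_(running|blocked)' /proc/stat | tr '\n' ' ')"
        sleep 1
    done >> /var/tmp/procs.log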

I'm pretty stumped.  It doesn't appear to be causing any problem, but it
shouldn't do that, and I don't like my servers doing things I don't
understand.  This is a new setup (just in service for a week or so now)
on a Dell PowerEdge R300.

The server is running keepalived, dnsmasq (for internal hostname
mappings only), and fail2ban (although SSH is actually limited in
iptables so this is probably redundant).  It does have SNMP enabled, and
keepalived is running with SNMP turned on (although I haven't got
anything monitoring that yet).  There aren't any cron jobs running
around the times of the spikes.  It is also running SELinux in enforcing
mode.

I know load average is a relatively poor indicator of actual system
load; AFAIK Linux calculates it as an exponentially damped moving
average of the number of processes that are running or runnable
(state R in ps/top) plus those in uninterruptible sleep (state D,
usually waiting on I/O).  How would the load average jump to 4 when
something like "ps axo state,pid,comm | grep -v '^S'" shows only the
"ps" command itself?
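
My best guess so far is short bursts of tasks in R or D state that a
one-shot ps just misses, so I'm going to leave something like this
running across one of the spike windows (again just a sketch; the
filename is arbitrary):

    # once a second, record any task that is runnable (R) or in
    # uninterruptible sleep (D); both states count toward the load average
    while true; do
        ps axo state,pid,comm | awk -v t="$(date '+%T')" '$1 ~ /^[RD]/ {print t, $0}'
        sleep 1
    done >> /var/tmp/rd-tasks.log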

Any suggestions or ideas on how to track this down?  Anybody seen
something like this before?
-- 
Chris Adams <linux at cmadams.net>
