[Linux-cluster] Load peaks - caused by the cluster?

Mon Aug 13 09:07:04 UTC 2007

Hi!

Remember the fs.sh status checks mayhem I reported a while ago? Now,
there was the ghost-like load flux, but the system getting stuck wasn't
(only) because of the excess number of execs - it was, plain and simple,
memory starvation. *sigh*

Anyway, now that I (or, to be exact, my servers) have enough memory, I
noticed that the problem with the inexplicable load flux hasn't gone
anywhere. With a more-or-less regular 11-hour interval, there is a
four-hour long peak in the load, shaped like an elf's pointy hat. (In an
otherwise idle system, the height of the peak is about 6.0. If there is
load caused by something "real", the peak is on top of the other load -
it looks as if it just linearly adds up.) I'm seriously beginning to
consider the possibility that there are elves in my kernel, since I can't
see the peaks anywhere other than in the load: CPU usage, number of
processes, IP/TCP/UDP traffic, IO load, paging activity - nothing
reflects the load peaks. I had a look at the process accounting
statistics during a peak and outside one, but couldn't see any
difference.
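
One thing that might be worth sampling during a peak: on Linux the load
average counts tasks in uninterruptible sleep (state D) as well as runnable
ones (state R), and D-state tasks do not show up as CPU usage. The sketch
below is only an illustration, not something taken from the cluster tools.
It reads /proc/loadavg and tallies the state letter of every process under
/proc; the function names are made up for this example. It takes a single
snapshot, so short-lived tasks can easily be missed between samples.

```python
import os
from collections import Counter


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_state(stat_text):
    """Return the one-letter process state from /proc/<pid>/stat text.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state letter (R = runnable, D = uninterruptible)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            # The process exited between listdir() and open(); skip it.
            continue
    return counts


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    states = count_states()
    print("load=%.2f/%.2f/%.2f R=%d D=%d"
          % (*load, states["R"], states["D"]))
```

Run every few seconds (from a loop or cron) through a peak and through a
quiet period, the R and D counts could then be compared against the load
figures.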

One suggestion my colleague had was that the peaks might be caused by
the cluster somehow changing the 'lead' - somewhere inside the kernel,
at such a low level that it can't be noticed anywhere but in the load.
He suggested this because there is a phase difference between the peaks.
It didn't sound very credible to me, but I'll ask anyway: could there be
something like that going on?

On the other hand, on the one node in the cluster that doesn't have
rgmanager running (it's in the cluster so that there wouldn't be an even
number of nodes), I'm not seeing these elves. And I have another
cluster that had the elf-hats before I added an exit 0 into their fs.sh
scripts. But they don't have the elf-hats anymore. The difference
between these two clusters is that the cluster with elves has a lot more
active cluster services than the one without. That is, the cluster with
elves has a lot more, say, ip.sh execs than the one without. I wonder if
these, when over a certain limit, could have an effect on the load
similar to the one the excess fs.sh execs had?

Next, I think I'm going to put an exit 0 into the status checks of ip.sh
(and see if the elves go away). Then I'm going to start wondering if the
cluster'd notice our server room falling apart... ;)

Any suggestions? At this point, I'm no longer even certain whether
the problem lies within the cluster. On the other hand, since I see no
difference at the process level between peak and non-peak times, the
difference must (as far as I understand) be inside the kernel. So it
can't be my application. So it must be the cluster, mustn't it?

Thanks.

--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>