Defining load thresholds for Nagios

Tue Nov 4 11:49:40 UTC 2008

Hello list,

I have been reading and thinking about proper thresholds for
the check_load plugin in Nagios.

My current understanding of load in Linux:

The load average over 1, 5, and 15 minutes in Linux is the average number of
processes in the running, runnable, and uninterruptible sleep states
(according to the load entry in Wikipedia).
According to the same Wikipedia page, processes in the uninterruptible state
are usually waiting for I/O, so both CPU-bound and I/O-bound processes
can contribute to the load average.
So if we have a server with many I/O-bound processes, the
CPU utilization can be low while the load average is high.
The number of cores or CPUs also determines the impact of the load.
A load of 8 can therefore mean that all cores in a 2 x 4-core server are
utilized, provided the load is CPU-bound.
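
To illustrate the per-core reasoning, here is a minimal sketch in Python that
divides the load averages by the core count. The function name per_core_load
is my own, and the sketch assumes a Unix-like system where os.getloadavg() is
available.

```python
import os


def per_core_load(load_avg, cores):
    """Return the load average divided by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to 1 core.
    cores = os.cpu_count() or 1
    # os.getloadavg() returns the 1, 5, and 15 minute averages (Unix only).
    one, five, fifteen = os.getloadavg()
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        print(f"{label}: load {value:.2f}, "
              f"per core {per_core_load(value, cores):.2f}")
```

With this, a load of 8 on the 2 x 4-core server gives a per-core load of 1.0,
while the same load on a dual-core machine gives 4.0.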

To determine where to set the warning and critical thresholds, the impact the
load has on the services running must also be taken into account. For
instance, on a system running large batch jobs, a high load can be less
of a problem than on a system running a web server, where users want a
quick response.

So if you had a server where you had little knowledge of the services,
how would you pick the 1, 5, and 15 minute warning thresholds and the
1, 5, and 15 minute critical thresholds?
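
To make the question concrete, this is the kind of per-core scaling I have in
mind, sketched in Python. The per-core multipliers are placeholders I made up
for illustration, not recommendations, and the helper name check_load_args is
hypothetical. It only builds the -w and -c arguments that check_load takes as
comma-separated 1, 5, and 15 minute values.

```python
def check_load_args(cores,
                    warn_per_core=(1.5, 1.25, 1.0),
                    crit_per_core=(3.0, 2.5, 2.0)):
    """Build check_load -w/-c arguments by scaling per-core factors.

    The default factors are illustrative placeholders only; each tuple
    holds the 1, 5, and 15 minute values in that order.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")

    def scale(factors):
        # Multiply each per-core factor by the core count.
        return ",".join(f"{factor * cores:.2f}" for factor in factors)

    return f"-w {scale(warn_per_core)} -c {scale(crit_per_core)}"


if __name__ == "__main__":
    # Example: the 2 x 4-core server mentioned above.
    print("check_load " + check_load_args(8))
```

For the 8-core server this would give
"-w 12.00,10.00,8.00 -c 24.00,20.00,16.00", but I do not know whether
multipliers like these are sensible, which is really what I am asking.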

Thanks,

Erling