head node has an extremely high load average.

Wed Jun 26 19:27:54 UTC 2013

I have a computer cluster running Rocks 5.2, CentOS 6.

The head node is overloaded.  It has 2 CPUs.

top - 14:27:49 up 1 day,  6:11,  6 users,  load average: 13.65, 14.12, 13.92
Tasks: 168 total,   3 running, 163 sleeping,   0 stopped,   2 zombie
Cpu(s):  1.2%us,  1.9%sy,  0.0%ni,  0.0%id, 91.7%wa,  1.0%hi,  4.1%si,  0.0%st
Mem:   2053088k total,  2001464k used,    51624k free,    74476k buffers
Swap:  1020116k total,      388k used,  1019728k free,  1638076k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2515 nobody    15   0  218m 3176 1048 S  2.3  0.2   8:46.23 gmetad
 2967 root      15   0     0    0    0 S  2.0  0.0   0:20.31 nfsd
 2970 root      15   0     0    0    0 R  1.0  0.0   0:20.60 nfsd
 3110 nobody    15   0  198m  20m 3360 S  0.3  1.0   4:22.71 gmond
29788 mad       15   0 90736 2336 1084 S  0.3  0.1   0:02.91 sshd
    1 root      15   0 10372  684  572 S  0.0  0.0   0:00.51 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0

I have had everyone log off the head node.  Four jobs are running on the
compute nodes, but I believe they are non-parallel jobs, which should cause
no traffic on the head node.  The load average on each compute node is
less than 8, and each compute node has 8 CPUs.

How can I find the problem?  I have seen the number of zombies go as high
as 2 on the head node, but most of the time there are 0.

I did reboot the head node, but the problem comes back fairly quickly.
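
The top output above shows 91.7%wa with almost no user or system CPU
time, so the load is probably coming from tasks blocked on I/O rather than
from computation: on Linux the load average counts tasks in
uninterruptible sleep (state D) as well as runnable ones.  Below is a
minimal sketch, assuming a Linux /proc filesystem and a Python interpreter
on the head node, that lists such tasks.  The names parse_stat and
blocked_tasks are made up for illustration, and the scan covers only
top-level entries in /proc, not individual threads.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the text of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm) for tasks currently in state D (uninterruptible sleep)."""
    found = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited during the scan, or the file is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    # One snapshot only; running it several times gives a better picture.
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the same names (for example nfsd) keep showing up across several
runs, that would point at the disk or NFS exports on the head node rather
than at CPU load.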