head node has an extremely high load average.

Thu Jun 27 12:43:05 UTC 2013

When you run an NFS server and have a lot of clients simultaneously
writing to the exported volumes, you should expect a high load average.
In fact, on a fully loaded NFS server I'd expect to see a load average
roughly equal to the number of nfsd threads you have configured in
/etc/sysconfig/nfs, plus whatever CPU load your system normally has.  Why
is this?  On Linux, the load average counts not only the tasks in the
CPU's run queue but also the tasks in uninterruptible sleep, which usually
means they are blocked on I/O.  When you're serving NFS, each nfsd thread
is often waiting on I/O or sending data, so each busy thread adds one to
the load average.

This is very common and expected behavior.  You'll often see the same thing
on web servers when there are a lot of httpds waiting for I/O (for example,
if they're getting a denial of service attack that leaves half-open
connections).  The actual amount of CPU activity on these systems isn't
that high; in most cases, the CPU is just sitting around waiting for the
remote end or local disk to return data.  The top output quoted below shows
exactly that: 91.7%wa, with only about 3% of the time spent in user and
system code.
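
To put a number on the waiting, something along these lines computes the
iowait share between two samples of the aggregate "cpu" line in
/proc/stat, which is the same figure top reports as %wa.  This is only a
sketch: the function names are hypothetical, and it assumes the usual
field order of user, nice, system, idle, iowait, irq, softirq, steal.

```python
#!/usr/bin/env python3
"""Estimate the share of CPU time spent in iowait from /proc/stat."""
import time


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line as integers."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Return the iowait percentage between two counter samples.

    Assumed field order: user, nice, system, idle, iowait, irq, softirq,
    steal.  Guest fields after those are skipped because guest time is
    already included in user time.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample():
    """Read the current aggregate counters from /proc/stat."""
    with open("/proc/stat") as handle:
        return parse_cpu_line(handle.readline())


if __name__ == "__main__":
    first = sample()
    time.sleep(5)  # sampling interval in seconds, chosen arbitrarily
    second = sample()
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

A high iowait figure together with low user and system time points at the
disks or the network rather than at the CPUs.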

On Wed, Jun 26, 2013 at 3:59 PM, Doll, Margaret Ann <margaret_doll at brown.edu> wrote:

> The users' home directories are nfs'd to the compute nodes.
>
> On Wed, Jun 26, 2013 at 3:35 PM, Jonathan Billings <jsbillin at umich.edu> wrote:
>
> > Hello,
> >
> > Is your head node an NFS server, and are the jobs writing to the NFS
> > share?
> >
> >
> > On Wed, Jun 26, 2013 at 3:27 PM, Doll, Margaret Ann <margaret_doll at brown.edu> wrote:
> >
> > > I have a computer cluster Running rocks 5.2,  Centos 6.
> > >
> > > The head node is over loaded.  There are 2 CPUs on the head node.
> > >
> > > top - 14:27:49 up 1 day,  6:11,  6 users,  load average: 13.65, 14.12, 13.92
> > > Tasks: 168 total,   3 running, 163 sleeping,   0 stopped,   2 zombie
> > > Cpu(s):  1.2%us,  1.9%sy,  0.0%ni,  0.0%id, 91.7%wa,  1.0%hi,  4.1%si,  0.0%st
> > > Mem:   2053088k total,  2001464k used,    51624k free,    74476k buffers
> > > Swap:  1020116k total,      388k used,  1019728k free,  1638076k cached
> > >
> > >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > >  2515 nobody    15   0  218m 3176 1048 S  2.3  0.2   8:46.23 gmetad
> > >  2967 root      15   0     0    0    0 S  2.0  0.0   0:20.31 nfsd
> > >  2970 root      15   0     0    0    0 R  1.0  0.0   0:20.60 nfsd
> > >  3110 nobody    15   0  198m  20m 3360 S  0.3  1.0   4:22.71 gmond
> > > 29788 mad       15   0 90736 2336 1084 S  0.3  0.1   0:02.91 sshd
> > >     1 root      15   0 10372  684  572 S  0.0  0.0   0:00.51 init
> > >     2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
> > >     3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
> > >     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
> > >
> > > I have everyone logged off of the head node.  Four jobs are running on
> > > the compute nodes, but I believe they are non-parallel jobs which
> > > causes no traffic on the head node.   The load_avg on each of the
> > > compute nodes is less than 8.  Each compute node has 8 CPUs.
> > >
> > > How can I find the problem?   I have seen the zombies go as high as 2
> > > on the head node; most of the time there are 0 zombies.
> > >
> > > I did reboot the head node, but the problem comes back fairly quickly.
> > > --
> > > redhat-list mailing list
> > > unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
> > > https://www.redhat.com/mailman/listinfo/redhat-list
> > >
> >
> >
> >
> > --
> > Jonathan Billings <jsbillin at umich.edu>
> > College of Engineering - CAEN - Unix and Linux Support

-- 
Jonathan Billings <jsbillin at umich.edu>
College of Engineering - CAEN - Unix and Linux Support