head node has an extremely high load average.

Wed Jun 26 21:56:17 UTC 2013

NFS may be hanging. Have you tried using autofs?

Yixin

On Wed, Jun 26, 2013 at 2:59 PM, Doll, Margaret Ann <margaret_doll at brown.edu> wrote:

> The users' home directories are NFS-exported to the compute nodes.
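To confirm how the home directories are mounted on a compute node, and with which options (for example `hard` or `soft`, `rsize`, `wsize`), the NFS entries in `/proc/mounts` can be listed; `grep nfs /proc/mounts` shows the same lines. The Python below is a minimal sketch assuming a Linux client. The function names are hypothetical and not part of Rocks or CentOS.

```python
import os

def parse_nfs_mounts(text):
    """Return (device, mountpoint, fstype, options) for every NFS entry
    in text formatted like /proc/mounts."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in ("nfs", "nfs4"):
            mounts.append((device, mountpoint, fstype, options.split(",")))
    return mounts

def local_nfs_mounts(path="/proc/mounts"):
    """Read the mount table of the running system (Linux only)."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return parse_nfs_mounts(f.read())

if __name__ == "__main__":
    for device, mountpoint, fstype, options in local_nfs_mounts():
        print(f"{mountpoint} <- {device} ({fstype}) {','.join(options)}")
```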
>
> On Wed, Jun 26, 2013 at 3:35 PM, Jonathan Billings <jsbillin at umich.edu> wrote:
>
> > Hello,
> >
> > Is your head node an NFS server, and are the jobs writing to the NFS
> > share?
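On the head node, the NFS server's own counters in `/proc/net/rpc/nfsd` can help answer this: the `io` line holds the total bytes read and written by nfsd, and the first field of the `th` line is the number of nfsd threads. `nfsstat -s` reports server statistics from the same counters. The Python below is a minimal sketch that samples the file twice and reports read and write rates; it assumes a Linux NFS server with nfsd running and that every field after a line's label is numeric. The function names are hypothetical.

```python
import os
import time

NFSD_STATS = "/proc/net/rpc/nfsd"  # present on a Linux host running the NFS server

def parse_nfsd(text):
    """Map each line's label ('io', 'th', 'proc3', ...) to its numbers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            stats[parts[0]] = [float(x) for x in parts[1:]]
    return stats

def io_rates(before, after, seconds):
    """Bytes per second read and written by nfsd between two snapshots,
    taken from the 'io' line (total bytes read, total bytes written)."""
    read = (after["io"][0] - before["io"][0]) / seconds
    written = (after["io"][1] - before["io"][1]) / seconds
    return read, written

def sample(interval=5.0, path=NFSD_STATS):
    """Read the counters twice, `interval` seconds apart."""
    if not os.path.exists(path):
        raise SystemExit(f"{path} not found: is this host an NFS server?")
    with open(path) as f:
        first = parse_nfsd(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_nfsd(f.read())
    print("nfsd threads:", int(second["th"][0]))
    return io_rates(first, second, interval)

if __name__ == "__main__":
    read, written = sample()
    print(f"nfsd read {read / 1e6:.2f} MB/s, wrote {written / 1e6:.2f} MB/s")
```

A sustained write rate while the four jobs run would suggest they are writing to the NFS share after all.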
> >
> >
> > On Wed, Jun 26, 2013 at 3:27 PM, Doll, Margaret Ann <margaret_doll at brown.edu> wrote:
> >
> > > I have a computer cluster running Rocks 5.2, CentOS 6.
> > >
> > > The head node is overloaded. There are 2 CPUs on the head node.
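Dividing the load average by the CPU count makes the figures comparable across machines: 13.65 on the 2-CPU head node is about 6.8 per CPU, whereas a load below 8 on an 8-CPU compute node is below 1.0 per CPU. On Linux the load average counts both tasks waiting for a CPU and tasks in uninterruptible sleep (usually waiting on disk or NFS I/O), so a high value does not by itself mean the CPUs are busy. A minimal Python sketch reading `/proc/loadavg`, with hypothetical function names:

```python
import os

def parse_loadavg(text):
    """Parse /proc/loadavg: three load averages (1, 5, 15 min),
    then running/total scheduling entities, then the last PID."""
    fields = text.split()
    loads = tuple(float(x) for x in fields[:3])
    running, total = (int(x) for x in fields[3].split("/"))
    return loads, running, total

def per_cpu(loads, ncpus):
    """Load averages divided by the number of CPUs."""
    return tuple(load / ncpus for load in loads)

if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        loads, running, total = parse_loadavg(f.read())
    ncpus = os.cpu_count() or 1
    print("per-CPU load (1, 5, 15 min):",
          ", ".join(f"{x:.2f}" for x in per_cpu(loads, ncpus)))
```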
> > >
> > > top - 14:27:49 up 1 day,  6:11,  6 users,  load average: 13.65, 14.12, 13.92
> > > Tasks: 168 total,   3 running, 163 sleeping,   0 stopped,   2 zombie
> > > Cpu(s):  1.2%us,  1.9%sy,  0.0%ni,  0.0%id, 91.7%wa,  1.0%hi,  4.1%si,  0.0%st
> > > Mem:   2053088k total,  2001464k used,    51624k free,    74476k buffers
> > > Swap:  1020116k total,      388k used,  1019728k free,  1638076k cached
> > >
> > >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > >  2515 nobody    15   0  218m 3176 1048 S  2.3  0.2   8:46.23 gmetad
> > >  2967 root      15   0     0    0    0 S  2.0  0.0   0:20.31 nfsd
> > >  2970 root      15   0     0    0    0 R  1.0  0.0   0:20.60 nfsd
> > >  3110 nobody    15   0  198m  20m 3360 S  0.3  1.0   4:22.71 gmond
> > > 29788 mad       15   0 90736 2336 1084 S  0.3  0.1   0:02.91 sshd
> > >     1 root      15   0 10372  684  572 S  0.0  0.0   0:00.51 init
> > >     2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
> > >     3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
> > >     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
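The telling figures in this output are `0.0%id` and `91.7%wa`: the CPUs are spending almost all their time waiting on I/O, which suggests the load comes from blocked tasks rather than computation. The same iowait share can be computed from two samples of the aggregate `cpu` line in `/proc/stat`. A minimal Python sketch, assuming a Linux kernel whose `cpu` line begins with user, nice, system, idle, iowait, irq, softirq and steal (in clock ticks); the function names are hypothetical:

```python
import time

def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()

def cpu_times(line):
    """Parse user, nice, system, idle, iowait, irq, softirq, steal from an
    aggregate 'cpu' line. Later fields (guest, guest_nice) are skipped
    because guest time is already counted in user time."""
    return [int(x) for x in line.split()[1:9]]

def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two 'cpu' lines."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)
    second = read_cpu_line()
    print(f"iowait over 5 s: {iowait_percent(first, second):.1f}%")
```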
> > >
> > > I have logged everyone off the head node. Four jobs are running on the
> > > compute nodes, but I believe they are non-parallel jobs, which cause no
> > > traffic on the head node. The load average on each of the compute nodes
> > > is less than 8. Each compute node has 8 CPUs.
> > >
> > > How can I find the cause? I have seen the number of zombies go as high
> > > as 2 on the head node; most of the time there are none.
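Because the Linux load average counts processes in uninterruptible sleep (state `D`), listing them shows what the load actually consists of; given the 91.7% iowait, they may well be `nfsd` threads or other processes blocked on disk I/O, though that is an assumption to check. Zombies (state `Z`) have already exited and do not add to the load average. `ps -eo pid,stat,comm` gives the same information; the Python below is a minimal sketch that reads `/proc/<pid>/stat` directly, with hypothetical function names.

```python
import os

def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat.
    The command name sits in parentheses and may itself contain spaces
    or ')', so split at the first ' (' and the last ')'."""
    pid, rest = text.split(" (", 1)
    comm, after = rest.rsplit(")", 1)
    state = after.split()[0]
    return int(pid), comm, state

def processes_in_states(states=("D", "Z"), proc="/proc"):
    """List (pid, command, state) for every process whose state is in
    `states`: D = uninterruptible sleep (usually I/O), Z = zombie."""
    found = []
    if not os.path.isdir(proc):
        return found
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                info = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if info[2] in states:
            found.append(info)
    return found

if __name__ == "__main__":
    for pid, comm, state in processes_in_states():
        print(pid, state, comm)
```

Running it a few times while the load is high shows whether the same processes stay blocked.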
> > >
> > > I did reboot the head node, but the problem comes back fairly quickly.
> > > --
> > > redhat-list mailing list
> > > unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
> > > https://www.redhat.com/mailman/listinfo/redhat-list
> > >
> >
> >
> >
> > --
> > Jonathan Billings <jsbillin at umich.edu>
> > College of Engineering - CAEN - Unix and Linux Support
> >
>