Slowness on a head node

Margaret Doll margaret_doll at brown.edu
Fri Mar 27 14:20:13 UTC 2009


For the last week and a half we have been experiencing a slowdown on our
cluster's head node. We are running Rocks 3 on a Red Hat OS with kernel
2.6.18-53.1.14.el5. The last downtime was 174 days ago, and we have run
successfully with 160 of 176 compute nodes running queued jobs. 16 of
the cores are reserved for interactive jobs.

It appears that no job is running on the head node. Over half the
memory is in use, but there is still plenty left. I know that large
sftp transfers can slow the system down, but my users say that their
transfers are finished.

Where else should I look for the problem? There are no pending queue
jobs at the moment, but there were pending jobs during the last week
and a half.

top

Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  2.6%sy,  0.0%ni, 54.5%id, 39.9%wa,  0.0%hi,  3.1%si,  0.0%st
Mem:   2054132k total,  1278532k used,   775600k free,   214924k buffers
Swap:  1020116k total,   799336k used,   220780k free,   801988k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2874 root      15   0     0    0    0 D    2  0.0  40:26.07 nfsd
  2871 root      15   0     0    0    0 S    1  0.0  40:07.30 nfsd
  2873 root      15   0     0    0    0 S    1  0.0  40:22.56 nfsd
  1697 root      10  -5     0    0    0 D    1  0.0  31:54.46 kjournald
  2872 root      15   0     0    0    0 D    1  0.0  40:50.30 nfsd
  2837 root      15   0 12712 1080  788 R    0  0.1   0:00.31 top
     1 root      15   0 10312   80   48 S    0  0.0   0:39.35 init
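In the top snapshot several nfsd threads and kjournald are in state D (uninterruptible sleep, typically waiting on disk or NFS I/O) while the CPUs report 39.9% iowait. As a rough sketch, assuming a Linux /proc filesystem, the following Python lists the processes currently in state D by reading /proc/<pid>/stat. The function names are illustrative only and are not part of any existing tool.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so we split around the LAST ')' instead of on whitespace.
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 2]  # the state letter follows ") "
    return pid, comm, state


def blocked_processes(proc="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip /proc/self, /proc/meminfo, ...
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__" and os.path.isdir("/proc"):
    # One snapshot; rerun it a few times to see which processes stay in D.
    for pid, comm in blocked_processes():
        print(pid, comm)
```

A single snapshot says little on its own. A process that shows up in state D on every run is more telling than one that appears only occasionally.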

qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q                             0.78      0    160    160      0      0
chemistry                         0.98      0    128    128      0      0
group1                            1.01     64      0     64      0      0
group3                            0.00      0     32     32      0      0
group3-24hr                       0.00      0     32     32      0      0
group3-2hr                        0.00      0     32     32      0      0
mem16.q                           0.94      0     16     16      0      0
mem4.q                            0.91      0      8     32     24      0
mem8.q                            1.01      0     64     80     16      0
group2                            0.94     61      3     64      0      0

  finger
Login     Name            Tty      Idle  Login Time   Office      Office Phone
acct2                      pts/5   18:37  Mar 26 14:05
acct1                      pts/2   12:57  Mar 25 12:14
acct4                      pts/4          Mar 23 09:36
acct5                      pts/8     10d  Mar  9 13:28
acct3                      pts/1   19:03  Mar 26 13:41

  ps -ef | grep sftp-server
root      2880  9542  0 08:47 pts/4    00:00:00 grep sftp-server
acct1     20700 20699  0 Mar25 ?        00:00:00 csh -c /usr/libexec/openssh/sftp-server
acct1     20829 20700  0 Mar25 ?        00:00:02 /usr/libexec/openssh/sftp-server
acct2     26707 26706  0 Mar26 ?        00:00:00 /usr/libexec/openssh/sftp-server
acct3     31337 31336  0 Mar26 ?        00:00:00 csh -c /usr/libexec/openssh/sftp-server
acct3     31466 31337  0 Mar26 ?        00:00:00 /usr/libexec/openssh/sftp-server
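The first line of that listing is the grep command matching itself. Each remaining /usr/libexec/openssh/sftp-server process belongs to an SFTP session that is still open, started Mar25 or Mar26. The csh -c lines are its parent shell wrappers. An open session may or may not still be transferring data. A small Python sketch that picks those sessions out of `ps -ef` text without the self-match follows. The helper name is hypothetical.

```python
import os


def sftp_sessions(ps_ef_text):
    """Return (user, pid, stime) for each sftp-server process in `ps -ef` output.

    Only lines whose command's executable is sftp-server are kept, which drops
    both the `grep sftp-server` line and the `csh -c ...` wrapper shells.
    """
    sessions = []
    for line in ps_ef_text.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header or blank line
        user, pid, _ppid, _c, stime, _tty, _time, cmd = fields
        if os.path.basename(cmd.split()[0]) == "sftp-server":
            sessions.append((user, int(pid), stime))
    return sessions
```

For live data, the text could come from `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout` (Python 3.7+). On the command line, `pgrep -l sftp-server` also avoids matching the search itself.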

iostat
Linux 2.6.18-53.1.14.el5 (system.edu)        03/27/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            4.83    0.00    0.41    2.34    0.00   92.42

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              37.34       309.74       582.06 4681624648 8797649022
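Run without an interval argument, iostat's first report is averaged over the time since boot. The figures above agree with the 174-day uptime: 4681624648 blocks read / 309.74 blocks/s is about 15.1 million seconds, or roughly 175 days. So the 2.34% iowait shown here is a long-term average, not a view of the last week and a half. `iostat 5` prints per-interval reports after the first one. The Python sketch below performs the same per-interval calculation. It assumes the whole-disk line layout of /proc/diskstats, where sectors read are field 6 and sectors written are field 10, counting from 1. The function names are illustrative.

```python
import time


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written) from /proc/diskstats."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # older 2.6 kernels print shorter lines for partitions
        stats[f[2]] = (int(f[5]), int(f[9]))
    return stats


def block_rates(before, after, seconds):
    """Per-second read/write rates in 512-byte blocks, like iostat's Blk_read/s."""
    return {
        dev: ((after[dev][0] - r0) / seconds, (after[dev][1] - w0) / seconds)
        for dev, (r0, w0) in before.items()
        if dev in after
    }


def sample(path="/proc/diskstats", interval=5.0):
    """Take two snapshots about `interval` seconds apart and return the rates."""
    start = time.monotonic()
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return block_rates(first, second, time.monotonic() - start)
```

On the head node, `sample()["sda"]` would give the recent read and write rates for the disk shown above, rather than the since-boot averages.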


mpstat -P ALL
Linux 2.6.18-53.1.14.el5 (system.edu)        03/27/2009

08:48:37 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:48:37 AM  all    4.83    0.00    0.31    2.34    0.02    0.09    0.00   92.42    267.91
08:48:37 AM    0    4.26    0.00    0.27    1.15    0.00    0.02    0.00   94.30    150.40
08:48:37 AM    1    5.39    0.00    0.34    3.54    0.03    0.16    0.00   90.54    117.51
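Without an interval, mpstat likewise averages over the whole uptime. It shows 2.34% iowait while top reported 39.9% wa, so the two figures cover very different time windows. `mpstat -P ALL 5` reports per interval instead. The calculation can be sketched in Python by differencing the aggregate cpu line of /proc/stat. This sketch assumes the usual counter order (user, nice, system, idle, iowait, irq, softirq, steal) and ignores any later guest columns. The function names are hypothetical.

```python
import time


def parse_cpu_line(text):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percent of CPU time spent in iowait between two /proc/stat samples.

    Only the first eight counters (user .. steal) are summed; guest time,
    where present, is already included in user/nice.
    """
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait


def sample_iowait(path="/proc/stat", interval=5.0):
    """iowait percentage over roughly the next `interval` seconds."""
    with open(path) as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.read())
    return iowait_percent(first, second)
```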

  free
             total       used       free     shared    buffers     cached
Mem:       2054132    1274000     780132          0     215796     802872
-/+ buffers/cache:     255332    1798800
Swap:      1020116     799268     220848
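Most of the "used" memory in this output is buffers and page cache. Subtracting them, as the -/+ buffers/cache line does, leaves 1274000 - 215796 - 802872 = 255332 KB held by processes, about 12% of RAM. At the same time, 799268 of 1020116 KB of swap (about 78%) is in use. The small Python helper below reproduces that arithmetic, using values in KB as free prints them by default. The function name is hypothetical.

```python
def free_summary(total, used, buffers, cached, swap_total, swap_used):
    """Reproduce free's '-/+ buffers/cache' arithmetic (all values in KB)."""
    app_used = used - buffers - cached  # memory held by processes
    return {
        "app_used_kb": app_used,
        "app_used_pct": 100.0 * app_used / total,
        "swap_used_pct": 100.0 * swap_used / swap_total,
    }


# Figures from the `free` output above.
summary = free_summary(2054132, 1274000, 215796, 802872, 1020116, 799268)
print(summary["app_used_kb"])  # 255332
print(f"{summary['app_used_pct']:.1f}% of RAM, {summary['swap_used_pct']:.1f}% of swap")
# 12.4% of RAM, 78.4% of swap
```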





