Looking for job which is causing a large work load - solved

Margaret Doll Margaret_Doll at brown.edu
Tue Feb 16 16:21:43 UTC 2010


I used "ps -el" and found the zombie process.
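
For anyone finding this in the archive, here is a minimal sketch of that search, assuming the procps "ps -el" layout in which the header row names the columns (S for state, PID, PPID, CMD). The function names find_by_state and scan_live are made up for illustration and are not part of any tool. The PPID is worth printing because a zombie only goes away once its parent reaps it or exits. Passing states=("Z", "D") also lists tasks in uninterruptible sleep, which Linux counts toward the load average; I did not check whether that applied on the machine below.

```python
import subprocess


def find_by_state(ps_output, states=("Z",)):
    """Return (pid, ppid, state, command) for rows of `ps -el` text
    whose state column is one of `states`."""
    lines = ps_output.strip().splitlines()
    if not lines:
        return []
    # Locate columns by name from the header instead of hard-coding positions.
    header = lines[0].split()
    s_col = header.index("S")
    pid_col = header.index("PID")
    ppid_col = header.index("PPID")
    cmd_col = header.index("CMD")
    matches = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= cmd_col:
            continue  # skip rows that do not have every column
        if fields[s_col] in states:
            # The command may contain spaces, e.g. "name <defunct>".
            command = " ".join(fields[cmd_col:])
            matches.append(
                (int(fields[pid_col]), int(fields[ppid_col]), fields[s_col], command)
            )
    return matches


def scan_live(states=("Z",)):
    """Run `ps -el` on this machine and filter its output (assumes ps is on PATH)."""
    result = subprocess.run(
        ["ps", "-el"], capture_output=True, text=True, check=True
    )
    return find_by_state(result.stdout, states)
```

Calling scan_live() on the affected host returns one tuple per zombie; an empty list means ps reported none.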

On Feb 16, 2010, at 10:54 AM, Margaret Doll wrote:

> We have an eight-processor Red Hat system running kernel
> 2.6.18-128.1.6.el5xen.
>
> We noticed the other day that sendmail was just queuing jobs and not  
> sending them.
> mqueue, however, is empty.
>
> That led us to look at the load average as a possible reason for
> the failure of sendmail.
> The QueueLA on sendmail is set to "8" as it should be.
>
> w and top show that we have a high load average and most of the  
> memory on the system
> is being used.  However, no job shows up in top using a lot of memory.
>
> top - 10:50:52 up 232 days, 15:18, 20 users,  load average: 619.06, 619.04, 618.98
> Tasks: 3419 total,   1 running, 3417 sleeping,   0 stopped,   1 zombie
> Cpu(s):  0.3%us,  0.9%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  16099528k total, 16063880k used,    35648k free,   487200k buffers
> Swap:  6127608k total,   105920k used,  6021688k free, 12683800k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 11917 user1     16   0 13424 3624  784 S  3.8  0.0   0:04.16 top
> 11922 root      16   0 13360 3624  776 R  3.8  0.0   0:00.39 top
>  8187 user1     16   0 13356 3620  780 S  3.5  0.0  44:48.71 top
> 11895 user1     16   0 13452 3648  780 R  3.5  0.0   0:11.35 top
>     1 root      15   0 10348  632  540 S  0.0  0.0   0:01.75 init
>     2 root      RT  -5     0    0    0 S  0.0  0.0   0:07.51 migration/0
>     3 root      34  19     0    0    0 S  0.0  0.0   0:24.56 ksoftirqd/0
>     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>     5 root      RT  -5     0    0    0 S  0.0  0.0   0:03.77 migration/1
>     6 root      34  19     0    0    0 S  0.0  0.0   0:04.96 ksoftirqd/1
>
> This machine is running long jobs from time to time and is hosting  
> large databases, so we don't want to reboot it.
>
> How can we find the "job" that is using all the memory and bringing  
> the work load up to such a high level?  Is it the zombie that is  
> reported in top?
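
On the memory half of the question: in this version of top, the "used" figure on the Mem line includes buffers and page cache, which the kernel can give back when programs need the space. The small calculation below takes the three numbers from the top header quoted above and subtracts buffers and cache to estimate what processes themselves hold. The helper name app_memory_kb is made up for illustration, and this is a rough estimate rather than an exact accounting.

```python
def app_memory_kb(used_kb, buffers_kb, cached_kb):
    """Estimate memory held by processes: 'used' minus buffers and page cache."""
    return used_kb - buffers_kb - cached_kb


# Figures copied from the Mem: and Swap: lines of the top header above.
total_kb = 16099528
used_kb = 16063880
buffers_kb = 487200
cached_kb = 12683800

app_kb = app_memory_kb(used_kb, buffers_kb, cached_kb)
print(app_kb, "kB held by processes")
print(round(100 * app_kb / total_kb), "% of total")
```

That works out to 2892880 kB, about 18% of RAM, which would explain why no single job stands out in top's memory column.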
>
>
> Thanks
>
> w
> 10:57:27 up 232 days, 15:25, 18 users,  load average: 619.19, 619.28, 619.13
> USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
> user1    pts/2    lfps             15Jan10  4days  0.10s  0.10s -tcsh
> user1    pts/3    lfps             Thu16   17:45m 44:55  44:54  top
> user1    pts/4    lfps             15Jan10 25days  0.10s  0.10s -tcsh
> user2    pts/5    gc166-mm.geo.bro Thu16    4days  0.02s  0.01s sshd: user2 [priv]
> crism    pts/8    molybdenum       Fri13    3days  1:27   1:27  /usr/local/itt/idl70/bin/bin.linux.x8
> root     pts/9    :0.0             23Oct09 116days  0.00s  0.00s ssh -l user1 moly
> wjuser1  pts/10   porter2.geo.brow Mon10    6:01   0.11s  0.11s -tcsh
> user2    pts/12   gc166-mm.geo.bro Fri14    0.00s  0.07s  0.00s sshd: user2 [priv]
> root     :0       -                23Oct09 ?xdm?   2:24m  0.03s /usr/bin/gnome-session
> user1    pts/16   lfps             Mon14    3:47  10.30s 10.24s top
> user1    pts/14   quahog2.geo.brow Mon15    8:22  17.54s 17.48s top
> root     pts/15   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
> user1    pts/17   quahog2.geo.brow Mon14   18:19m  0.11s  0.11s -tcsh
> root     pts/23   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
> root     pts/24   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
> user1    pts/28   lfps             15Jan10  4:08   0.12s  0.12s -tcsh
> user1    pts/30   lfps             15Jan10  6:01   0.39s  0.00s sshd: user1 [priv]
> root     pts/7    :0.0             23Oct09 116days  5.78s  0.00s -bin/tcsh
>
>



