Looking for job which is causing a large work load

Tue Feb 16 18:56:49 UTC 2010

Thanks.  I will look at atop for my systems.

On Feb 16, 2010, at 1:03 PM, Alan A wrote:

> Install atop - the best tool for tracking runaway processes, user abuse,
> network utilization, etc.
>
> On Tue, Feb 16, 2010 at 10:18 AM, Stainforth, Matthew (SD/DS) <
> Matthew.Stainforth at gnb.ca> wrote:
>
>> Memory doesn't appear to be a problem.  Run "free" and look at the amount
>> of free memory on the "-/+ buffers/cache" line.
>>
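The arithmetic behind that line can be sketched roughly as follows. Buffers
and page cache are reclaimable, so the memory actually held by applications
is approximately used minus buffers minus cached. This assumes the
/proc/meminfo field names MemTotal, MemFree, Buffers and Cached; the function
name app_used_kb is only illustrative.

```shell
# Print memory held by applications in kB, excluding buffers and page cache.
# Reads /proc/meminfo-style text on stdin (illustrative helper, not a
# standard tool).
app_used_kb() {
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END { print total - free - buffers - cached }
    '
}
```

On a live system this could be run as "app_used_kb < /proc/meminfo". Fed the
figures from the top output in the original message, it gives 2892880 kB in
use by applications out of 16099528 kB, which is consistent with memory not
being the problem.
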
>> Top is reporting 3419 processes total, and the load average suggests 600+
>> in a runnable or uninterruptible state.  What does "ps auwwx" tell you?
>>
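One way to act on that question: the Linux load average counts tasks in
uninterruptible sleep (state D) as well as runnable ones (state R). With the
CPUs almost idle, a load near 619 more likely points at tasks stuck waiting
on I/O or a lock than at busy ones. A minimal sketch for tallying states,
assuming procps "ps" and awk are available; the function names are made up
for illustration.

```shell
# Tally the first letter of each process state (R, S, D, Z, ...).
# Expects one state string per line, e.g. from: ps -eo stat=
count_states() {
    awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Keep only lines whose first field starts with D or R, i.e. the tasks
# that count toward the load average.
# Expects input such as: ps -eo stat,pid,user,wchan,args
load_contributors() {
    awk '$1 ~ /^[DR]/'
}
```

"ps -eo stat= | count_states" gives a total per state, and
"ps -eo stat,pid,user,wchan,args | load_contributors" lists the candidates,
with the wchan column hinting at what each one is waiting on.
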
>> -----Original Message-----
>> From: redhat-list-bounces at redhat.com [mailto:
>> redhat-list-bounces at redhat.com] On Behalf Of Margaret Doll
>> Sent: Tuesday, February 16, 2010 11:54 AM
>> To: General Red Hat Linux discussion list
>> Subject: Looking for job which is causing a large work load
>>
>> We have an eight-processor system running Red Hat with kernel
>> 2.6.18-128.1.6.el5xen.
>>
>> We noticed the other day that sendmail was just queuing jobs and not
>> sending them.  mqueue, however, is empty.
>>
>> That led us to look at the load average as a possible reason for the
>> failure of sendmail.  The QueueLA on sendmail is set to "8", as it should be.
>>
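To illustrate why the load average matters here: QueueLA is, roughly, the
load average above which sendmail only queues messages instead of attempting
delivery right away, so a 1-minute load near 619 against a limit of 8 would
keep it queuing. A small sketch of that comparison, assuming the
/proc/loadavg format (1-minute average in the first field). The function name
and output strings are hypothetical, and this is not sendmail's own logic.

```shell
# Compare the 1-minute load average with a QueueLA-style limit.
# Reads /proc/loadavg-style text on stdin; the limit defaults to 8.
queue_mode() {
    awk -v limit="${1:-8}" \
        '{ print (($1 + 0 > limit + 0) ? "queue-only" : "deliver") }'
}
```

On a live system this could be run as "queue_mode 8 < /proc/loadavg".
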
>> w and top show that we have a high load average and that most of the memory
>> on the system is being used.  However, no job shows up in top using a lot of
>> memory.
>>
>> top - 10:50:52 up 232 days, 15:18, 20 users,  load average: 619.06, 619.04, 618.98
>> Tasks: 3419 total,   1 running, 3417 sleeping,   0 stopped,   1 zombie
>> Cpu(s):  0.3%us,  0.9%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Mem:  16099528k total, 16063880k used,    35648k free,   487200k buffers
>> Swap:  6127608k total,   105920k used,  6021688k free, 12683800k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 11917 user1     16   0 13424 3624  784 S  3.8  0.0   0:04.16 top
>> 11922 root      16   0 13360 3624  776 R  3.8  0.0   0:00.39 top
>>  8187 user1     16   0 13356 3620  780 S  3.5  0.0  44:48.71 top
>> 11895 user1     16   0 13452 3648  780 R  3.5  0.0   0:11.35 top
>>     1 root      15   0 10348  632  540 S  0.0  0.0   0:01.75 init
>>     2 root      RT  -5     0    0    0 S  0.0  0.0   0:07.51 migration/0
>>     3 root      34  19     0    0    0 S  0.0  0.0   0:24.56 ksoftirqd/0
>>     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>>     5 root      RT  -5     0    0    0 S  0.0  0.0   0:03.77 migration/1
>>     6 root      34  19     0    0    0 S  0.0  0.0   0:04.96 ksoftirqd/1
>>
>> This machine is running long jobs from time to time and is hosting
>> large databases, so we don't want to reboot it.
>>
>> How can we find the "job" that is using all the memory and bringing
>> the work load up to such a high level?  Is it the zombie that is
>> reported in top?
>>
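On the zombie question: a zombie (state Z) has already exited and holds only
a process-table slot until its parent reaps it, so it uses no memory or CPU
and does not count toward the load average. A short sketch for finding
zombies and their parents, assuming procps "ps" and awk; list_zombies is an
illustrative name.

```shell
# Print the lines whose state field starts with Z (zombie).
# Expects input such as: ps -eo stat,pid,ppid,comm
list_zombies() {
    awk '$1 ~ /^Z/'
}
```

"ps -eo stat,pid,ppid,comm | list_zombies" shows each zombie along with the
ppid of the parent that has not yet reaped it.
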
>>
>> Thanks
>>
>> w
>> 10:57:27 up 232 days, 15:25, 18 users,  load average: 619.19, 619.28, 619.13
>> USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
>> user1    pts/2    lfps             15Jan10  4days  0.10s  0.10s -tcsh
>> user1    pts/3    lfps             Thu16   17:45m 44:55  44:54  top
>> user1    pts/4    lfps             15Jan10 25days  0.10s  0.10s -tcsh
>> user2    pts/5    gc166-mm.geo.bro Thu16    4days  0.02s  0.01s sshd: user2 [priv]
>> crism    pts/8    molybdenum       Fri13    3days  1:27   1:27  /usr/local/itt/idl70/bin/bin.linux.x8
>> root     pts/9    :0.0             23Oct09 116days  0.00s  0.00s ssh -l user1 moly
>> wjuser1  pts/10   porter2.geo.brow Mon10    6:01   0.11s  0.11s -tcsh
>> user2    pts/12   gc166-mm.geo.bro Fri14    0.00s  0.07s  0.00s sshd: user2 [priv]
>> root     :0       -                23Oct09 ?xdm?   2:24m  0.03s /usr/bin/gnome-session
>> user1    pts/16   lfps             Mon14    3:47  10.30s 10.24s top
>> user1    pts/14   quahog2.geo.brow Mon15    8:22  17.54s 17.48s top
>> root     pts/15   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
>> user1    pts/17   quahog2.geo.brow Mon14   18:19m  0.11s  0.11s -tcsh
>> root     pts/23   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
>> root     pts/24   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
>> user1    pts/28   lfps             15Jan10  4:08   0.12s  0.12s -tcsh
>> user1    pts/30   lfps             15Jan10  6:01   0.39s  0.00s sshd: user1 [priv]
>> root     pts/7    :0.0             23Oct09 116days  5.78s  0.00s -bin/tcsh
>>
>>
>> --
>> redhat-list mailing list
>> unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
>> https://www.redhat.com/mailman/listinfo/redhat-list
>>
>>
>
>
>
> -- 
> Alan A.