Looking for job which is causing a large work load

Tue Feb 16 18:03:33 UTC 2010

Install atop - the best tool for tracking runaway processes, user abuse,
network utilization, etc.

On Tue, Feb 16, 2010 at 10:18 AM, Stainforth, Matthew (SD/DS) <
Matthew.Stainforth at gnb.ca> wrote:

> Memory doesn't appear to be a problem.  Run "free" and look at the amount
> of free memory on the "-/+ buffers/cache" line.
>
> Top is reporting 3419 processes total, and the load average suggests 600+
> of them are runnable or in uninterruptible sleep.  What does "ps auwwx"
> tell you?
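
To make the memory point concrete, here is the arithmetic behind the
buffers/cache line of "free", using the figures from the top output quoted
further down. This is only a sketch: the variable names are mine, and the
numbers are copied from the Mem: and Swap: lines of that output.

```shell
# Figures in kB, copied from the quoted top output (Mem: and Swap: lines).
used=16063880
free=35648
buffers=487200
cached=12683800

# Memory held by applications once buffers and page cache are excluded.
app_used=$((used - buffers - cached))
# Memory that is free or held as buffers/cache, most of which the kernel
# can normally reclaim when applications ask for it.
app_free=$((free + buffers + cached))

echo "used minus buffers/cache: ${app_used}k"
echo "free plus buffers/cache:  ${app_free}k"
```

That works out to 2892880k in use by processes and 13206648k free or in
buffers/cache, which is consistent with no job in top showing large memory
use.
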
>
> -----Original Message-----
> From: redhat-list-bounces at redhat.com [mailto:
> redhat-list-bounces at redhat.com] On Behalf Of Margaret Doll
> Sent: Tuesday, February 16, 2010 11:54 AM
> To: General Red Hat Linux discussion list
> Subject: Looking for job which is causing a large work load
>
> We have an eight-processor system running Red Hat with kernel
> 2.6.18-128.1.6.el5xen.
>
> We noticed the other day that sendmail was just queuing jobs and not
> sending them.  mqueue, however, is empty.
>
> That led us to look at the load average as a possible reason for the
> failure of sendmail.  The QueueLA on sendmail is set to "8", as it
> should be.
>
> w and top show that we have a high load average and that most of the
> memory on the system is being used.  However, no job shows up in top
> using a lot of memory.
>
> top - 10:50:52 up 232 days, 15:18, 20 users,  load average: 619.06, 619.04, 618.98
> Tasks: 3419 total,   1 running, 3417 sleeping,   0 stopped,   1 zombie
> Cpu(s):  0.3%us,  0.9%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  16099528k total, 16063880k used,    35648k free,   487200k buffers
> Swap:  6127608k total,   105920k used,  6021688k free, 12683800k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 11917 user1     16   0 13424 3624  784 S  3.8  0.0   0:04.16 top
> 11922 root      16   0 13360 3624  776 R  3.8  0.0   0:00.39 top
>  8187 user1     16   0 13356 3620  780 S  3.5  0.0  44:48.71 top
> 11895 user1     16   0 13452 3648  780 R  3.5  0.0   0:11.35 top
>     1 root      15   0 10348  632  540 S  0.0  0.0   0:01.75 init
>     2 root      RT  -5     0    0    0 S  0.0  0.0   0:07.51 migration/0
>     3 root      34  19     0    0    0 S  0.0  0.0   0:24.56 ksoftirqd/0
>     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>     5 root      RT  -5     0    0    0 S  0.0  0.0   0:03.77 migration/1
>     6 root      34  19     0    0    0 S  0.0  0.0   0:04.96 ksoftirqd/1
>
> This machine is running long jobs from time to time and is hosting
> large databases, so we don't want to reboot it.
>
> How can we find the "job" that is using all the memory and bringing
> the work load up to such a high level?  Is it the zombie that is
> reported in top?
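
A zombie only occupies a slot in the process table, so it is unlikely to be
the cause. On Linux the load average counts tasks that are runnable (R) plus
tasks in uninterruptible sleep (D), and with the CPUs 98.8% idle the second
group is the more likely suspect. Below is a rough sketch for checking that.
The function name tally_states is mine, and the ps format options assume the
procps version of ps that ships with Red Hat.

```shell
# Tally tasks by the first letter of their state (R, S, D, Z, ...).
# Reads one state per line on stdin, e.g. from: ps -eo state=
tally_states() {
    awk 'NF { n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Run against the live system only if ps is available.
if command -v ps >/dev/null 2>&1; then
    # How many tasks are in each state right now.
    ps -eo state= | tally_states
    # Header line plus every task in uninterruptible sleep, with the
    # kernel function it is waiting in (wchan).
    ps -eo state,pid,user,wchan:32,args | awk 'NR == 1 || $1 ~ /^D/'
fi
```

If the D count is close to the load average, the wchan and args columns of
the second command should point at what those tasks are stuck on.
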
>
>
> Thanks
>
> w
>  10:57:27 up 232 days, 15:25, 18 users,  load average: 619.19, 619.28, 619.13
> USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
> user1    pts/2    lfps             15Jan10  4days  0.10s  0.10s -tcsh
> user1    pts/3    lfps             Thu16   17:45m 44:55  44:54  top
> user1    pts/4    lfps             15Jan10 25days  0.10s  0.10s -tcsh
> user2    pts/5    gc166-mm.geo.bro Thu16    4days  0.02s  0.01s sshd: user2 [priv]
> crism    pts/8    molybdenum       Fri13    3days  1:27   1:27  /usr/local/itt/idl70/bin/bin.linux.x8
> root     pts/9    :0.0             23Oct09 116days  0.00s  0.00s ssh -l user1 moly
> wjuser1  pts/10   porter2.geo.brow Mon10    6:01   0.11s  0.11s -tcsh
> user2    pts/12   gc166-mm.geo.bro Fri14    0.00s  0.07s  0.00s sshd: user2 [priv]
> root     :0       -                23Oct09 ?xdm?   2:24m  0.03s /usr/bin/gnome-session
> user1    pts/16   lfps             Mon14    3:47  10.30s 10.24s top
> user1    pts/14   quahog2.geo.brow Mon15    8:22  17.54s 17.48s top
> root     pts/15   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
> user1    pts/17   quahog2.geo.brow Mon14   18:19m  0.11s  0.11s -tcsh
> root     pts/23   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
> root     pts/24   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
> user1    pts/28   lfps             15Jan10  4:08   0.12s  0.12s -tcsh
> user1    pts/30   lfps             15Jan10  6:01   0.39s  0.00s sshd: user1 [priv]
> root     pts/7    :0.0             23Oct09 116days  5.78s  0.00s -bin/tcsh
>
>
> --
> redhat-list mailing list
> unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
> https://www.redhat.com/mailman/listinfo/redhat-list
>

-- 
Alan A.