RHELv4 and v5 - So slow as to be unusable.

Wed Oct 6 19:22:37 UTC 2010

The machines are both Pentium 4s, around 6 years old, with 2 GB of memory
and IDE disks.  Runlevel is 5 (full GUI).  No obvious errors are showing up
in /var/log/messages or in dmesg.  I have three machines in another room
that are the same model, upgraded the same way, that seem to be happy.
Those machines are only used remotely and not generally from their
consoles.

"top" says that nothing is going on although the load average is 3+.
"sar" also says that nothing is going on.

Yesterday I turned off the BIOS power management (so no disks spin down and
no monitors turn off and such).  I also changed the /etc/ntp.conf to more
closely match the latest ".rpmnew" version from Red Hat.  It references our
company's internal ntp servers but otherwise it is out-of-the-box from Red
Hat.  And I thought that maybe I had found something in doing that.  The
machines ran fine all the rest of the day.

Then at about 7:18pm last night both machines essentially stopped working.
I had "sar" running on the one machine dumping data every 30 seconds.
According to the sar output, at about 7:01pm last night that machine
essentially stopped having any work to do.  Disk activity went to near zero,
the machine went to 99.99% idle, there was network activity every once in a
while, the occasional hint of I/O activity but nothing else.

The user on that machine was logged in remotely and they said that at about
7:15pm last night the connection suddenly got so slow that they couldn't
work any more.  The other machine was not actively in use at the time
although the user was logged in and the screen locked.

This morning coming in, both machines still thought it was about 7:18pm.
The "date" was the day before at 7:18pm (give or take depending on which
machine) and the /usr/sbin/hwclock was correct about the actual time.
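
To put a number on how far the system clock has fallen behind the hardware
clock, something like the following could be run from cron.  It is only a
sketch: it assumes the kernel exposes /proc/driver/rtc with "rtc_date" and
"rtc_time" lines, and that the hardware clock is kept in UTC (if it is kept
in local time the result will be off by the timezone offset).  The function
names are hypothetical.

```python
import calendar
import time

def parse_proc_rtc(text):
    """Return the RTC time as seconds since the epoch, assuming RTC is UTC."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "rtc_time<TAB>: 19:22:37"; split on the first colon.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    stamp = fields["rtc_date"] + " " + fields["rtc_time"]
    return calendar.timegm(time.strptime(stamp, "%Y-%m-%d %H:%M:%S"))

def clock_lag_seconds(rtc_text, system_now=None):
    """Seconds by which the system clock trails the RTC (negative if ahead)."""
    if system_now is None:
        system_now = time.time()
    return parse_proc_rtc(rtc_text) - system_now
```

Feeding it the contents of /proc/driver/rtc every few minutes and logging
the result would show when the system clock stops advancing.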

As a separate problem, last evening I discovered that NFS mount points being
exported from all of the RHELv4 machines can be mounted by Solaris v6, v7,
v8, and v9 machines, but Solaris v10 machines, both SPARC and x86 based, are
unable to mount the RHELv4 mount points.  And there are no errors in
/var/log/messages.  HP, SGI, and AIX machines can mount those points, but
not Solaris 10.

I may have "fixed" the RHELv5 version of the problem.  I had noticed that
netstat was reporting around 2200 TIME_WAIT sockets, nearly all for the NIS
or DNS servers.  I find that by setting the sysctl tcp_tw_reuse flag to 1
(default is 0 on RHELv3, RHELv4, and RHELv5) the number of sockets in
TIME_WAIT drops to what I see on other machines *and* the RHELv5 machine no
longer develops its version of the slowdown problem.
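
For anyone who wants to check their own machines, here is a rough sketch of
counting TIME_WAIT sockets per remote address without netstat.  It assumes
the usual /proc/net/tcp layout (remote address in the third column, state in
the fourth, state code 06 meaning TIME_WAIT) and a little-endian x86
machine.  It only looks at IPv4, and the function names are my own.

```python
TIME_WAIT = "06"  # hex state code used in /proc/net/tcp

def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets per remote IPv4 address in /proc/net/tcp text."""
    counts = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 4 or cols[3] != TIME_WAIT:
            continue
        hex_ip = cols[2].split(":")[0]
        # The address is a 32-bit value printed in host byte order; on
        # little-endian x86 the octets therefore appear reversed.
        octets = [str(int(hex_ip[i:i + 2], 16)) for i in (6, 4, 2, 0)]
        ip = ".".join(octets)
        counts[ip] = counts.get(ip, 0) + 1
    return counts

def read_tw_reuse(path="/proc/sys/net/ipv4/tcp_tw_reuse"):
    """Return the current tcp_tw_reuse setting (0 or 1) as an integer."""
    with open(path) as fh:
        return int(fh.read().strip())
```

Passing the contents of /proc/net/tcp to count_time_wait() gives a dict of
remote address to socket count, which makes it easy to see whether the NIS
or DNS servers account for most of them.  The flag itself is the sysctl key
net.ipv4.tcp_tw_reuse.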

If I can ever get one of the problem RHELv4 machines to run a netstat while
the slowdown is going on, I'll have to see if something similar helps
there.  It is hard to get their "attention" during a slowdown.  It can be
done but have coffee handy.

   Gary

>  ------------------------------------------------------------ 
> From: "Marti, Robert" <RJM002 at shsu.edu>
> 
> What kind of disks are you using?  I'd tend to look at IO issues with
> that kind of description.
> ------------------------------------------------------------ 
> From: m.roth at 5-cent.us
> 
> What runlevel? Any clues in /var/log/messages? Or dmesg?
>  ------------------------------------------------------------ 
> From: "Mr. Paul M. Whitney" <paul.whitney at me.com>
> 
> Do you have any other software installed? anti-virus? other third-party
> software?  It could be a memory leak.
>  ------------------------------------------------------------ 
> From: Kenneth Kirchner <ken at kirchners.com>
> 
> I would recommend installing some kind of performance monitoring software
> like Nagios or just SNMPd and Cacti.  These will track the performance of
> your machine and let you see memory, cpu, disk I/O, processes, etc over
> time to help identify what is going on.  These aren't system intensive and
> give you much better visibility when problems like this do occur. There
> are many benefits.
>  ------------------------------------------------------------ 
> From: James Jones <jrjones at alaska.edu>
> 
> I agree with Ken, you need to figure out what is eating your lunch, as
> they say.  You have several programs such as top, and system monitor that
> can give you some high level look into what is going on. Also, how much
> memory is on the machines and what type of cpus are installed?
> 
> If you can provide some info either from top or system monitor that may
> help in providing some additional assistance also.
>  ------------------------------------------------------------ 
> From: "Geofrey Rainey" <Geofrey.Rainey at tvnz.co.nz>
> 
> I don't know if anyone has said this yet, but have you installed the
> sysstat package and used the "sar" utility? Perhaps you've got I/O
> issues which sar will reveal.
>  ------------------------------------------------------------ 
> On Oct 4, 2010, at 10:58 AM, Gary E Barnes wrote:
> 
> > The past week I upgraded our RHELv3 machines to v4.  Previously we had
> > several v3's, one v4, and one v5.  The v5 has never worked.  Now the
> > new v4's are acting up.
> > 
> > Boot the machine, things are fine.  Wait overnight and the machine may
> > take ten minutes to unlock the screen, may take several 10's of seconds
> > to do an ls, and generally simply isn't usable.
> > 
> > The v4's if you reboot them seem to be fine for the day.
> > The v5 if you reboot it is fine for maybe 15 minutes.
> > 
> > The v4's, there will be a load average of 3 to 4, but top says nothing
> > whatsoever (other than top and the xterm) is running.
> > The v5, there will be a load average of 0.1 or less and top again says
> > nothing is running.
> > 
> > SELinux is turned off.  Firewall is turned off.  I've even tried
> > turning off every service that isn't vital to being able to simply boot
> > the machines.
> > 
> > Any ideas?
> > 
> >        Gary