
Unknown cause of server crash or overload



Good afternoon,

Our remotely hosted web server goes down about once per month and must be
manually rebooted. Because the machine is remote, I haven't been able to check
the console to determine whether the system is overloaded or has crashed.

The server is a 1RU 2 GHz Pentium (1 GB RAM, 40 GB disk) running RH 7.3 with all
the latest up2date releases (except for custom-compiled Apache/mod_perl/PHP).

So, the first part of my question is: how can I tell whether the server has
crashed or is just unresponsive (overloaded)?

The server is in a data centre behind a NAT firewall which drops ping packets
from outside. I have asked internal staff to ping the server during an outage,
and they get zero packets back. Is there some other check I can do to test
whether the server has crashed, or is there some sort of logging I can set up
that will show the problem after the server is restarted?
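
To make the logging idea concrete, here is a minimal sketch of the kind of
"heartbeat" I have in mind. It is only an illustration, not an existing tool:
the function names, the log path and the interval are all placeholders I made
up. It appends a timestamp and the load averages to a file and forces each
line to disk, so that after a reboot the last line should show roughly when the
machine stopped and what the load was just before.

```python
import os
import time

def heartbeat_line(now=None):
    """Build one log line: local timestamp plus 1/5/15-minute load averages."""
    if now is None:
        now = time.time()
    load1, load5, load15 = os.getloadavg()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load=%.2f/%.2f/%.2f" % (stamp, load1, load5, load15)

def write_heartbeat(path):
    """Append one line and fsync it so it has a chance of surviving a hard reset."""
    line = heartbeat_line()
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, (line + "\n").encode("ascii"))
        os.fsync(fd)
    finally:
        os.close(fd)
    return line

def run(path="/var/log/heartbeat.log", interval=10, count=None):
    """Write a heartbeat every `interval` seconds (path and interval are placeholders).

    With count=None this loops until the process is killed or the machine dies.
    """
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        written += 1
        time.sleep(interval)
```

My assumption is that the pattern of the last few lines would help: if they
stop abruptly at normal load, that points towards a crash or hang, whereas
load climbing steeply (or gaps between lines growing) before the end points
towards overload.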

I am currently doing a fair amount of logging with MRTG and I don't see any
gradual growth before the 'crash'. Maybe there is a sudden load increase which
prevents MRTG from running its checks. All graphs are always zeroed during the
time of the 'crash'. Some of the targets for MRTG:

- user/system cpu
- load
- free/used memory
- number of processes
- number of open files (lsof)
- disk usage for all partitions (except swap)
- network throughput
- plus other hardware checks (cpu temp, fan speed, etc)

None of the above show any noticeable change before a 'crash'.

And the second part of my question: what are some of the culprits I should be
looking for to determine what is either crashing the server or causing it to be
unresponsive?

My instinct tells me the problem is too much load. On one or two occasions, I
was unable to use any existing ssh sessions (or create new ones), but I was able
to get a few HTTP responses from some of the Apache processes before the server
ground to a complete halt.
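
Since ping is blocked from outside, the only remote check I can think of is a
TCP probe of the ports that are forwarded (ssh and http). Below is a rough
sketch of what I mean. The function name and the result labels are my own
invention, and I am assuming the firewall passes these connections straight
through to the server.

```python
import socket

def probe(host, port, timeout=5.0, expect_banner=False):
    """Classify a TCP connection attempt (labels are hypothetical, for illustration).

    'refused'         - something answered with a reset; nothing is listening
    'no-answer'       - connect timed out or failed some other way
    'open'            - TCP handshake completed (banner not checked)
    'banner'          - handshake completed and the service sent data
    'open-but-silent' - handshake completed but no data arrived in time
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    except OSError:  # includes timeouts and unreachable hosts
        return "no-answer"
    try:
        if not expect_banner:
            return "open"
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)  # sshd normally sends its version string first
        except OSError:
            return "open-but-silent"
        return "banner" if data else "open-but-silent"
    finally:
        sock.close()
```

For example, probe("server.example.com", 22, expect_banner=True) (the hostname
is a placeholder). My understanding is that the kernel completes the TCP
handshake on a listening port even when the daemon itself is starved, so
"open-but-silent" on port 22 would suggest the kernel is alive but userland is
overloaded, while "no-answer" on every port would look more like a real crash
(or a network fault).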

Any feedback or suggestions would be greatly appreciated. I have been trying to
solve this for about 4 months now, and I don't know what else to try.

Thanks,
Charlie
-- 
   Charlie Garrison    garrison zeta org au
   PO Box 141, Windsor, NSW 2756, Australia 




