[K12OSN] Server hang and locked out of other servers
Eric Harrison
eharrison at mail.mesd.k12.or.us
Tue Dec 7 05:25:54 UTC 2004
On Tue, 7 Dec 2004 dahopkins at comcast.net wrote:
> I don't know where to even look for the answer, but my primary server (handles LDAP authentication, and home directories) is hanging after several hours of operation. A warm reboot (reset button) brings everything back each time, but where do I even start to look for the problem? Which log file?
> A symptom is that top and ps both hang (never return), so I want to find out what is consuming the CPU, but can't. Perhaps cat'ing some file in /proc? aaaarrrrgggghhhh!!!!! a bad day/week.
Here's the general order of things I'd check in such a case:
Before rebooting:
1) Run "dmesg", which shows the kernel's message buffer. This is
particularly useful for finding SCSI/IDE problems (a long list of
SCSI resets or DMA errors).
After the server has been rebooted (even if these commands work when
the server is in a hosed state, you might as well get it back into
production ASAP...)
2) Run "last | less" to get a feel for when the server went down, in
case someone rebooted the server and didn't tell you when.
3) Check /var/log/messages. A useful trick here is to open it with less
("less /var/log/messages"), go to the end of the file (press the ">"
key, aka shift-.), then search backward for where the boot messages
start (hitting the "?" key, typing "klogd", then hitting Enter will
usually get you in the neighborhood). After a crash, the juiciest
bits are often located just before the reboot.
4) Check /var/log/secure, which often contains useful info on DoS
attacks, authentication meltdowns, etc.
5) Run "ls -lart /var/log/" to see if there are any other log files
that have been modified since the crash. Take a look at them to
see if they are interesting.
6) Run "sar -A | less" and look for anything that spiked around the
time the server crashed. High I/O, or high CPU, or high network
usage, etc, can often give you at least a general idea of what
is going on.
If the server can't find the "sar" command, install the "sysstat"
package: "up2date -i sysstat" or "yum install sysstat"
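The less trick in step 3 can also be done non-interactively. What follows is only a rough sketch: "before_last_boot" is a made-up helper name, and it assumes (as above) that a line mentioning "klogd" marks the start of the boot messages on your system. It prints the lines that come just before the most recent such marker.

```shell
# Print the COUNT lines that precede the last line matching MARKER in LOGFILE.
# before_last_boot is a hypothetical helper name, not a standard tool.
before_last_boot() {
    logfile=$1
    marker=${2:-klogd}   # assumed boot marker; adjust for your syslog setup
    count=${3:-50}       # how many lines of pre-boot context to show

    # Line number of the last line that mentions the marker.
    last=$(grep -n -- "$marker" "$logfile" | tail -n 1 | cut -d: -f1)
    if [ -z "$last" ]; then
        echo "no '$marker' line found in $logfile" >&2
        return 1
    fi

    start=$((last - count))
    [ "$start" -lt 1 ] && start=1
    end=$((last - 1))
    # Marker is the first line of the file: nothing precedes it.
    [ "$end" -lt 1 ] && return 0

    sed -n "${start},${end}p" "$logfile"
}
```

For example, "before_last_boot /var/log/messages klogd 100 | less" should show roughly the last 100 lines logged before the most recent boot, which is where the juicy bits tend to be.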
> Also, when this system is down, I can't even log onto the other K12LTSP servers as root since they claim that they can't authenticate. I have nsswitch.conf set with
>
> passwd: files ldap
> shadow: files ldap
> group: files ldap
>
> on all the systems which I thought meant use local files first, but this doesn't seem to be working. (It used to though).
I've seen this on RH9 & FC1. There is a change you can make to
/etc/pam.d/system-auth to keep this from happening, but I can't seem to
find it right now. I'll keep looking, it's in my mail spool somewhere ;-)
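In the meantime, it may be worth double-checking that each server really has the lookup order you expect. This is just a sketch, not the PAM fix: "nss_sources" is a made-up helper name, and it assumes the usual "database: source source" layout with whitespace after the colon.

```shell
# Print the lookup sources configured for one database in an nsswitch.conf file.
# nss_sources is a hypothetical helper name; the file defaults to /etc/nsswitch.conf.
nss_sources() {
    db=$1
    conf=${2:-/etc/nsswitch.conf}
    # Strip comments, find the line for this database, drop the "db:" prefix.
    sed 's/#.*//' "$conf" |
        awk -v db="$db:" '$1 == db { $1 = ""; sub(/^ +/, ""); print }'
}
```

Running "nss_sources passwd" on a server set up as described should print "files ldap". If it does, and "grep '^root:' /etc/passwd" shows a local root entry, then the nsswitch side looks the way you intended and the problem is more likely elsewhere.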
> I will be coming in to the school tomorrow at around 6:30 am to check the systems before I head off for work, and then will be back tomorrow night around 7:30 pm (after work, 10 hour day). Any suggestions welcome.
>
> Sincerely thankful for any help, desperate.
Good luck!
-Eric