
Re: [K12OSN] Server hang and locked out of other servers



On Tue, 7 Dec 2004 dahopkins comcast net wrote:

I don't know where to even look for the answer, but my primary server (handles LDAP authentication, and home directories) is hanging after several hours of operation.  A warm reboot (reset button) brings everything back each time, but where do I even start to look for the problem?  Which log file?
A symptom is that top and ps both hang (never return), so I want to find out what is consuming the CPU, but can't.  Perhaps cat'ing some file in /proc? aaaarrrrgggghhhh!!!!! a bad day/week.

Here's the general order of things I'd check in such a case:


Before rebooting:

1) Run "dmesg" to dump the kernel's message buffer. This is
   particularly useful for spotting SCSI/IDE problems (look for a
   long run of SCSI resets or DMA errors).
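
If the buffer is long, a filter can pull out the likely suspects. This
is only a rough sketch: the function name "filter_disk_errors" and the
pattern list are my own assumptions, not anything standard, so adjust
the patterns to whatever your controller actually logs.

```shell
# Hypothetical helper: keep only kernel messages that look like disk trouble.
# The pattern list is an assumption; extend it for your hardware.
filter_disk_errors() {
    grep -i -E 'reset|dma|i/o error|timeout'
}

# Typical use on a live system (reads stdin, so any saved log works too):
#   dmesg | filter_disk_errors | tail -n 50
```

If dmesg itself hangs, this won't help, but it's worth a try before you
hit the reset button.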

After the server has been rebooted (even if these commands work when
the server is in a hosed state, you might as well get it back into
production ASAP...)

2) Run "last | less" to get a feel for when the server went down, in
   case someone rebooted it without telling you at what time.

3) Check /var/log/messages. A useful trick here is to open it with less
   ("less /var/log/messages"), go to the end of the file (press the ">"
   key, aka shift-.), then search backward for where the boot messages
   start (hitting the "?" key, typing "klogd", then hitting Enter will
   usually get you in the neighborhood). After a crash, the juiciest
   bits are often located just before the reboot.

4) Check /var/log/secure, which often contains useful info on DoS
   attacks, authentication meltdowns, etc.

5) Run "ls -lart /var/log/" to see if there are any other log files
   that have been modified since the crash. Take a look at them to
   see if they are interesting.

6) run "sar -A | less" and look for anything that spiked around the
   time the server crashed. High I/O, or high CPU, or high network
   usage, etc, can often give you at least a general idea of what
   is going on.

   If the server can't find the "sar" command, install the "sysstat"
   package: "up2date -i sysstat" or "yum install sysstat"
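
The "look just before the reboot" trick from step 3 can also be done
without paging around in less. Here is a minimal sketch, assuming that
"klogd" shows up at the start of each boot's messages the way it
usually does on these systems. The function name
"lines_before_last_boot" and its default of 40 lines of context are
made up for illustration.

```shell
# Hypothetical helper: print the lines that precede the most recent
# "klogd" marker in a syslog-style file.
# Usage: lines_before_last_boot FILE [CONTEXT_LINES]
lines_before_last_boot() {
    logfile=$1
    context=${2:-40}
    # Line number of the last line mentioning klogd (assumed boot marker)
    boot_line=$(grep -n 'klogd' "$logfile" | tail -n 1 | cut -d: -f1)
    # No marker found: nothing to show
    [ -n "$boot_line" ] || return 1
    start=$((boot_line - context))
    [ "$start" -lt 1 ] && start=1
    end=$((boot_line - 1))
    # Marker is the very first line: there is nothing before it
    [ "$end" -lt 1 ] && return 0
    sed -n "${start},${end}p" "$logfile"
}

# Typical use after the server is back up:
#   lines_before_last_boot /var/log/messages 60 | less
```

If klogd was restarted by hand since the crash, the last marker won't be
the boot you care about, so compare the timestamps against what "last"
told you.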


Also, when this system is down, I can't even log onto the other K12LTSP servers as root since they claim that they can't authenticate. I have nsswitch.conf set with

passwd:     files ldap
shadow:     files ldap
group:      files ldap

on all the systems which I thought meant use local files first, but this doesn't seem to be working. (It used to though).

I've seen this on RH9 & FC1. There is a change you can make to /etc/pam.d/system-auth to keep this from happening, but I can't seem to
find it right now. I'll keep looking, it's in my mail spool somewhere ;-)
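
In the meantime, it may be worth confirming what each K12LTSP server
really has configured, in case one of them drifted. A small sketch for
that follows; the function name "nss_sources" is hypothetical. It only
reports the order listed in nsswitch.conf and says nothing about what
PAM does with it.

```shell
# Hypothetical helper: print the lookup sources configured for one
# database in an nsswitch.conf-style file.
# Usage: nss_sources DATABASE [FILE]
nss_sources() {
    db=$1
    file=${2:-/etc/nsswitch.conf}
    # Match "DATABASE:" at the start of a line, strip the key and the
    # whitespace after it, and print the first match only
    sed -n "s/^${db}:[[:space:]]*//p" "$file" | head -n 1
}

# Typical use:
#   nss_sources passwd
#   nss_sources shadow
```

With the settings quoted above, "nss_sources passwd" should print
"files ldap". Running "getent passwd root" while the LDAP server is up
and again while it is down is another quick way to see whether the
local entry is being found.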


I will be coming in to the school tomorrow at around 6:30 am to check the systems before I head off for work, and then will be back tomorrow night around 7:30 pm (after work, 10 hour day). Any suggestions welcome.

Sincerely thankful for any help, desperate.

Good luck!


-Eric

