[K12OSN] Server hang and locked out of other servers

Eric Harrison eharrison at mail.mesd.k12.or.us
Tue Dec 7 05:25:54 UTC 2004


On Tue, 7 Dec 2004 dahopkins at comcast.net wrote:

> I don't know where to even look for the answer, but my primary server (handles LDAP authentication, and home directories) is hanging after several hours of operation.  A warm reboot (reset button) brings everything back each time, but where do I even start to look for the problem?  Which log file?
> A symptom is that top and ps both hang (never return), so I want to find out what is consuming the CPU, but can't.  Perhaps cat'ing some file in /proc? aaaarrrrgggghhhh!!!!! a bad day/week.

Here's the general order of things I'd check in such a case:

Before rebooting:

1) Run "dmesg", which shows you the kernel's message buffer. This is
    particularly useful for finding SCSI/IDE problems (a long list of
    SCSI resets or DMA errors).
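
If the buffer is long, a filter can narrow it down. The following is only
a rough sketch: "disk_errors" is a hypothetical helper name, and the
pattern list is a guess at what SCSI/IDE trouble usually looks like, not
an exhaustive set.

```shell
# Hypothetical helper: keep only kernel messages that look like disk or
# controller trouble. The pattern list is an assumption; extend it for
# your own hardware.
disk_errors() {
    grep -iE 'scsi|dma|reset|i/o error|timeout'
}

# Usage on a live system (not run here):
#   dmesg | disk_errors | tail -n 50
```

Lines that do not match are simply dropped, so an empty result only means
none of these patterns appeared, not that the hardware is healthy.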

After the server has been rebooted (even if these commands work when
the server is in a hosed state, you might as well get it back into
production ASAP...)

2) Run "last | less" to get a feel for when the server went down, in
    case someone rebooted the server and didn't tell you when.

3) Check /var/log/messages. A useful trick here is to open it with less
    ("less /var/log/messages"), go to the end of the file (press the ">"
    key, aka shift-.), then search backward for where the boot messages
    start (hitting the "?" key, typing "klogd", then hitting Enter will
    usually get you in the neighborhood). After a crash, the juiciest
    bits are often located just before the reboot.

4) Check /var/log/secure, which often contains useful info on DoS
    attacks, authentication meltdowns, etc.

5) Run "ls -lart /var/log/" to see if there are any other log files
    that have been modified since the crash. Take a look at them to
    see if they are interesting.

6) Run "sar -A | less" and look for anything that spiked around the
    time the server crashed. High I/O, high CPU, or high network
    usage can often give you at least a general idea of what
    is going on.

    If the server can't find the "sar" command, install the "sysstat"
    package: "up2date -i sysstat" or "yum install sysstat"
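
The "less" trick from step 3 can also be scripted. This is a minimal
sketch, assuming the last line containing "klogd" marks the most recent
boot; "before_last_boot" is a hypothetical helper name, and the default
of 40 lines is arbitrary.

```shell
# Hypothetical helper: print the lines of a syslog-style file that come
# just before the last boot marker.
# Assumption: the last line containing "klogd" marks the most recent boot.
#   $1 = log file, $2 = how many lines to show (default 40)
before_last_boot() {
    logfile=$1
    count=${2:-40}
    # Line number of the last "klogd" entry in the file.
    boot_line=$(grep -n 'klogd' "$logfile" | tail -n 1 | cut -d: -f1)
    # Give up if no boot marker was found.
    [ -n "$boot_line" ] || return 1
    # Everything above the marker, trimmed to the last $count lines.
    head -n "$((boot_line - 1))" "$logfile" | tail -n "$count"
}

# Usage after the reboot (not run here):
#   last | head -n 20
#   before_last_boot /var/log/messages 60
#   ls -lart /var/log/ | tail -n 15
#   sar -A | less
```

If the log was rotated between the crash and the reboot, the interesting
lines may be in the previous file instead, so it is worth checking that
one as well.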


> Also, when this system is down, I can't even log onto the other K12LTSP servers as root since they claim that they can't authenticate.  I have nsswitch.conf set with
>
> passwd:     files ldap
> shadow:     files ldap
> group:      files ldap
>
> on all the systems which I thought meant use local files first, but this doesn't seem to be working. (It used to though).

I've seen this on RH9 & FC1. There is a change you can make to 
/etc/pam.d/system-auth to keep this from happening, but I can't seem to
find it right now. I'll keep looking, it's in my mail spool somewhere ;-)

> I will be coming in to the school tomorrow at around 6:30 am to check the systems before I head off for work, and then will be back tomorrow night around 7:30 pm (after work, 10 hour day).   Any suggestions welcome.
>
> Sincerely thankful for any help, desperate.

Good luck!

-Eric



