[K12OSN] Server Help! (a little desperate)
pvdw at criticalcontrol.com
Fri Oct 1 18:16:39 UTC 2004
Shawn Powers wrote:
> Things have been going great this year, our entire district is using
> thin clients. Here's a very brief breakdown of how things are running:
> 1 Server handles DNS, TFTP, DHCP, NIS
> 1 Server handles NFS (/home), SMB
> 1 Server handles LTSP (running 4.0.1, but the TFTP and DHCP are farmed
> out to the other server)
> For some reason, I've had 2 major "glitches" this year.
> Last week, eth0 (where clients connect) just quit responding. The
> server appeared fine, but 10.10.10.10 was not pingable. After a brief
> panic, I just ran ifdown eth0, and ifup eth0 -- and I've had no
> problems until today. They started right after I left for lunch, of
> Today, the LTSP server quit responding altogether. When going to the
> console, I couldn't even get THAT to come up. I power cycled the
> machine, and everything has come up just peachy -- BUT I'm very
> worried now.
> I'm getting some "I told you so's" from the staff, who accused me that
> putting all my eggs in one basket was a bad idea, and with linux you
> get what you pay for, etc, etc, etc...
> My question? Where do I start looking for some problems? I've read
> just about every bit of text in /var/log -- and nothing looks fishy.
> At 13:00, messages just stopped being written to /var/log/messages.
> There were no odd entries before it stopped.
> Are there other logs I should be checking? Perhaps after school
> today, I'll take the server down and run memtest... Especially during
> this first year, I need close to 100% uptime, and I've had bad luck so
That's one of those things... all is fine and suddenly... BOOM!
I have seen this (too) many times before. A couple of tips:
Run a badblocks check on the hard disk(s).
Change your logging. If it is the disk/controller that is going bad, the
system can't write to its own logs to report the failure...
You can send your syslogs to another server (the NFS one, for example).
Use a resource monitoring tool to keep an eye on processor/memory usage.
What were the network load and the processor/memory (incl. swap) usage
in the seconds before the system went down?
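
If no monitoring tool is at hand, something along these lines could
record the numbers for you. It is only a rough sketch: it assumes a
Linux /proc filesystem and a Python interpreter on the server, and the
log path and the 10 second interval are placeholders I made up. Point
the log at a disk other than the suspect one (the NFS mount, say), or
the last samples may never get written.

```python
import time


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    return tuple(float(x) for x in text.split()[:3])


def parse_meminfo(text):
    """Turn /proc/meminfo text into a dict of field name -> integer value."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines or anything without a number after the colon.
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def format_sample(load, mem, now):
    """Build one log line: timestamp, load averages, free memory, swap in use."""
    swap_used = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load=%.2f/%.2f/%.2f memfree_kb=%d swapused_kb=%d" % (
        stamp, load[0], load[1], load[2], mem.get("MemFree", 0), swap_used)


def monitor(logfile="/mnt/nfs/ltsp-resource.log", interval=10):
    """Append one sample every `interval` seconds (both values are placeholders)."""
    while True:
        with open("/proc/loadavg") as f:
            load = parse_loadavg(f.read())
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        # Reopen the file for every sample so each line is flushed to disk.
        with open(logfile, "a") as out:
            out.write(format_sample(load, mem, time.time()) + "\n")
        time.sleep(interval)
```

Call monitor() from a script started at boot. After the next hang, the
tail of that file shows whether load or swap was climbing beforehand.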
As said, I have seen this before; in most cases it was either the
hard disk or the controller that (slowly) went bad.
In one case it was the NIC: bad NICs can start 'sending' random bits
over the wire.
If nobody is connected to the LTSP server, is the NIC still 'sending'
stuff, or is it all quiet?
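
One way to answer that question is to compare the interface counters in
/proc/net/dev over a short window while no clients are connected. The
sketch below assumes the usual Linux layout of that file (two header
lines, then 16 counters per interface); the function names and the
10 second window are my own choices, and eth0 is taken from your mail.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev text into {interface: {counter name: value}}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        nums = [int(x) for x in fields[:16]]
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_bytes": nums[0], "rx_packets": nums[1], "rx_errs": nums[2],
            "tx_bytes": nums[8], "tx_packets": nums[9], "tx_errs": nums[10],
        }
    return stats


def delta(before, after, iface):
    """Return how much each counter of `iface` grew between two snapshots."""
    return {key: after[iface][key] - before[iface][key] for key in after[iface]}


def sample(iface="eth0", seconds=10):
    """Take two snapshots `seconds` apart and return the counter differences."""
    with open("/proc/net/dev") as f:
        before = parse_net_dev(f.read())
    time.sleep(seconds)
    with open("/proc/net/dev") as f:
        after = parse_net_dev(f.read())
    return delta(before, after, iface)
```

Run print(sample()) with every client switched off. A handful of
packets is normal background traffic, but steadily growing tx_bytes or
error counts on an idle network would point at the card.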
Any technology distinguishable from
foodoo-magic is insufficiently advanced.
Peter Van den Wildenbergh
CriticalControl Solutions Inc.
Bow Valley Square II
205 - 5th avenue SW
Calgary, AB T2P 2V7