Need help with Reboot cause
Peter J. Stieber
developer at toyon.com
Tue Apr 7 17:41:43 UTC 2009
PS = Pete Stieber
PS>> I have a dual opteron system that has been acting as
PS>> the worldly node for a small cluster of computers
PS>> since September, 2004. The machine is running the
PS>> latest x86_64 Fedora 10 kernel that I recently loaded
PS>> (April 2). The machine reboots without warning. I
PS>> can't find the cause in log files (maybe I'm not
PS>> looking in the correct log).
PS>>
PS>> I'm currently running memtest. If all of the tests
PS>> pass, could the community suggest other diagnostic
PS>> tasks or information I could post to help diagnose the
PS>> problem?
m> Have you tried going back to the previous kernel?
The machine is still running memtest (no errors so far), but I already
removed the prior kernel. I did notice reboots with the prior kernel.
BTW my current kernel is 2.6.27.21-170.2.56.fc10.x86_64.
Reboots indicated by information in /var/log/messages...
Sunday March 29 4:08
Tuesday March 31 7:02
Thursday April 2 18:27 Intentional reboot due to new kernel
Friday April 3 1:36
Sunday April 5 1:37
Sunday April 5 2:48
Sunday April 5 9:43
Sunday April 5 13:20 as I was typing this email
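A list like the one above could also be pulled out of the log automatically. This is only a minimal sketch: it assumes every boot writes a "kernel: Linux version" line to /var/log/messages and that lines start with the classic 15-character syslog time stamp. The marker string and the helper name find_boots are illustrative assumptions, not something confirmed on this machine.

```python
import sys
from datetime import datetime

# Assumption: each boot logs a line containing this text.
BOOT_MARKER = "kernel: Linux version"


def find_boots(lines, year):
    """Return a datetime for every log line that contains BOOT_MARKER."""
    boots = []
    for line in lines:
        if BOOT_MARKER in line:
            # Classic syslog lines begin with a stamp such as "Apr  5 13:20:02".
            # The stamp carries no year, so the caller has to supply one.
            stamp = line[:15]
            boots.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return boots


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_boots.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        boots = find_boots(log, year=2009)
    for boot in boots:
        print("boot at", boot)
    # The gap between consecutive boots is an upper bound on the uptime.
    for earlier, later in zip(boots, boots[1:]):
        print(later, "came", later - earlier, "after the previous boot")
```

The gaps between boots may help show whether the reboots line up with the nightly build or happen at arbitrary times.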
m> Did you check dmesg and /var/log/messages?
Yes. I can see reboots, but not the cause.
m> Does it boot normally and then just fail at some random
m> interval or is it consistently failing at the same point?
I have had top running during a few of the reboots. I have forced a
couple of them by starting my nightly build process. The linker/loader
has been running during some of the reboots...
top - 13:19:53 up 3:36, 6 users, load average: 1.27, 2.70, 2.32
Tasks: 138 total, 6 running, 132 sleeping, 0 stopped, 0 zombie
Cpu(s): 40.8%us, 13.8%sy, 0.0%ni, 42.5%id, 2.7%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 2060232k total, 1683996k used, 376236k free, 164484k buffers
Swap: 2031608k total, 56k used, 2031552k free, 1230796k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8878 pstieber 20 0 34552 25m 1096 R 7.6 1.3 0:00.23 ld
8884 pstieber 20 0 48284 27m 1080 R 5.0 1.4 0:00.15 ld
7 root 15 -5 0 0 0 S 0.3 0.0 0:00.17 ksoftirqd/1
22427 pstieber 20 0 14880 1208 872 R 0.3 0.1 0:03.49 top
1 root 20 0 4096 876 616 S 0.0 0.0 0:00.71 init
Another instance
top - 06:55:13 up 17:34, 2 users, load average: 2.83, 2.59, 1.86
Tasks: 127 total, 2 running, 125 sleeping, 0 stopped, 0 zombie
Cpu(s): 45.1%us, 4.7%sy, 0.0%ni, 49.8%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2060232k total, 1763404k used, 296828k free, 177052k buffers
Swap: 2031608k total, 56k used, 2031552k free, 1271964k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5757 pstieber 20 0 79788 69m 1080 R 12.3 3.5 0:00.37 ld
1 root 20 0 4096 876 616 S 0.0 0.0 0:00.68 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd
I'm not sure this is always the case.
m> Other things you may consider:
m> CPU type?
Motherboard: Tyan Thunder K8W (S2885ANRF)
CPUs: Dual Opteron 244 (1.8 GHz) processors
Memory: 2 GB (4 x 512 MB Crucial CT6472Y40B DDR PC3200)
m> temperature?
Is there a command to monitor this while running the OS?
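One possible way to watch temperatures from the running OS, offered only as a sketch: it assumes the kernel exposes readings through the hwmon sysfs interface, which in turn requires a sensor driver for this board to be loaded (for K8 Opterons that might be k8temp, but that is an assumption). The function name read_temperatures is hypothetical.

```python
import glob
import os


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {sensor file path: degrees Celsius} for each readable temp*_input file."""
    readings = {}
    # Depending on the kernel version, the files sit directly under hwmonN
    # or under hwmonN/device, so both layouts are checked.
    patterns = [
        os.path.join(hwmon_root, "hwmon*", "temp*_input"),
        os.path.join(hwmon_root, "hwmon*", "device", "temp*_input"),
    ]
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as sensor:
                    # The hwmon interface reports millidegrees Celsius.
                    readings[path] = int(sensor.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # Sensor unreadable or not numeric; skip it.
    return readings


if __name__ == "__main__":
    for path, celsius in read_temperatures().items():
        print(f"{path}: {celsius:.1f} C")
```

Running this in a loop alongside the nightly build would show whether the readings climb just before a reboot. If nothing is printed, no hwmon sensor is exposed and the assumption above does not hold.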
m> potential hard drive issue?
I have 3 SATA drives running. It has been a long time since I have done this;
how does one manually run a disk check?
m> any new hardware attached or installed recently?
No
m> Notice any power surges or brownouts?
The machine is on a UPS that deals with this.
m> any other nodes having issues?
No, and they are not on UPSs. They also do not have as large a workload.
The machine in question is used for nightly builds and regression tests.
I use distcc with the compute nodes to perform the builds.
The machine also runs samba to provide a network share to Windows users
and provides authentication using Windows domain accounts.
m> Recent power surge zapped a board, DSL modem,
m> and the surge protector.
I doubt this is the problem.
Memtest made it through the first pass of all tests successfully.
Thanks for the suggestions, especially considering my vague information.
Pete