Need help with Reboot cause

Peter J. Stieber developer at toyon.com
Tue Apr 7 17:41:43 UTC 2009


PS = Pete Stieber
PS>> I have a dual opteron system that has been acting as
PS>> the worldly node for a small cluster of computers
PS>> since September, 2004.  The machine is running the
PS>> latest x86_64 Fedora 10 kernel that I recently loaded
PS>> (April 2).  The machine reboots without warning.  I
PS>> can't find the cause in log files (maybe I'm not
PS>> looking in the correct log).
PS>>
PS>> I'm currently running memtest.  If all of the tests
PS>> pass, could the community suggest other diagnostic
PS>> tasks or information I could post to help diagnose the
PS>> problem?

m> Have you tried going back to the previous kernel?

The machine is still running memtest (no errors so far), but I already 
removed the prior kernel.  I did notice reboots with the prior kernel. 
BTW my current kernel is 2.6.27.21-170.2.56.fc10.x86_64.

Reboots indicated by information in /var/log/messages...

Sunday    March 29   4:08
Tuesday   March 31   7:02
Thursday  April  2  18:27 Intentional reboot due to new kernel
Friday    April  3   1:36
Sunday    April  5   1:37
Sunday    April  5   2:48
Sunday    April  5   9:43
Sunday    April  5  13:20 as I was typing this email
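
To put the list above in terms of uptime, here is a minimal Python sketch that computes the time between consecutive reboots. The timestamps are copied from the list; the year 2009, the naive local-time handling, and the names REBOOTS and uptimes are assumptions made for illustration.

```python
from datetime import datetime

# (timestamp, intentional?) pairs copied from the list above.
# The year 2009 is assumed from the date of this message.
REBOOTS = [
    ("2009-03-29 04:08", False),
    ("2009-03-31 07:02", False),
    ("2009-04-02 18:27", True),   # intentional reboot for the new kernel
    ("2009-04-03 01:36", False),
    ("2009-04-05 01:37", False),
    ("2009-04-05 02:48", False),
    ("2009-04-05 09:43", False),
    ("2009-04-05 13:20", False),
]


def uptimes(reboots):
    """Return (reboot time, uptime before it, intentional flag) for
    every reboot after the first one in the list."""
    times = [datetime.strptime(stamp, "%Y-%m-%d %H:%M") for stamp, _ in reboots]
    return [
        (times[i], times[i] - times[i - 1], reboots[i][1])
        for i in range(1, len(times))
    ]


if __name__ == "__main__":
    for when, up, intentional in uptimes(REBOOTS):
        note = " (intentional)" if intentional else ""
        print(f"{when:%a %b %d %H:%M}  up {up}{note}")
```

The uptimes range from a little over an hour to more than two days, so nothing here points to a fixed timer. The last interval (3 hours 37 minutes) agrees with the "up 3:36" reported by the first top snapshot below.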

m> Did you check dmesg and /var/log/messages?

Yes.  I can see reboots, but not the cause.

m> Does it boot normally and then just fail at some random
m> interval or is it consistently failing at the same point?

I have had top running during a few of the reboots.  I have forced a 
couple of them by starting my nightly build process.  The linker/loader 
has been running during some of the reboots...

top - 13:19:53 up  3:36,  6 users,  load average: 1.27, 2.70, 2.32
Tasks: 138 total,   6 running, 132 sleeping,   0 stopped,   0 zombie
Cpu(s): 40.8%us, 13.8%sy,  0.0%ni, 42.5%id,  2.7%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   2060232k total,  1683996k used,   376236k free,   164484k buffers
Swap:  2031608k total,       56k used,  2031552k free,  1230796k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  8878 pstieber  20   0 34552  25m 1096 R  7.6  1.3   0:00.23 ld
  8884 pstieber  20   0 48284  27m 1080 R  5.0  1.4   0:00.15 ld
     7 root      15  -5     0    0    0 S  0.3  0.0   0:00.17 ksoftirqd/1
22427 pstieber  20   0 14880 1208  872 R  0.3  0.1   0:03.49 top
     1 root      20   0  4096  876  616 S  0.0  0.0   0:00.71 init

Another instance

top - 06:55:13 up 17:34,  2 users,  load average: 2.83, 2.59, 1.86
Tasks: 127 total,   2 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 45.1%us,  4.7%sy,  0.0%ni, 49.8%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2060232k total,  1763404k used,   296828k free,   177052k buffers
Swap:  2031608k total,       56k used,  2031552k free,  1271964k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  5757 pstieber  20   0 79788  69m 1080 R 12.3  3.5   0:00.37 ld
     1 root      20   0  4096  876  616 S  0.0  0.0   0:00.68 init
     2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd

I'm not sure this is always the case.
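
As a rough reading of the memory figures in the two top snapshots above, the sketch below subtracts buffers and page cache from "used" to estimate what processes were actually holding. It assumes that top's "used" figure includes buffers and cache, which is how I understand the procps output of this era; the function and variable names are made up for illustration.

```python
# Figures (in KiB) copied from the Mem: and Swap: lines of the two snapshots.
SNAPSHOTS = {
    "13:19:53": {"total": 2060232, "used": 1683996,
                 "buffers": 164484, "cached": 1230796},
    "06:55:13": {"total": 2060232, "used": 1763404,
                 "buffers": 177052, "cached": 1271964},
}


def non_cache_used_kib(used, buffers, cached):
    """Estimate memory held by processes, assuming 'used' counts
    buffers and page cache (an assumption about top's accounting)."""
    return used - buffers - cached


def summarize(snapshots):
    """Map each snapshot label to (process KiB, percent of total RAM)."""
    summary = {}
    for label, mem in snapshots.items():
        kib = non_cache_used_kib(mem["used"], mem["buffers"], mem["cached"])
        summary[label] = (kib, 100.0 * kib / mem["total"])
    return summary


if __name__ == "__main__":
    for label, (kib, pct) in summarize(SNAPSHOTS).items():
        print(f"{label}: about {kib} KiB in use by processes ({pct:.1f}% of RAM)")
```

Under that assumption both snapshots show roughly 15% of RAM in use by processes and only 56k of swap, so running out of memory does not look like the trigger.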

m> Other things you may consider:
m> CPU type?

Motherboard: Tyan Thunder K8W (S2885ANRF)
CPUs: Dual Opteron 244 (1.8 GHz) processors
Memory: 2 GB   4-512MB  CT6472Y40B  DDR PC3200 from Crucial

m> temperature?

Is there a command to monitor this while running the OS?
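
The usual tool for this is the sensors command from lm_sensors. As an alternative, here is a minimal sketch that polls the kernel's hwmon sysfs files directly. It assumes a sensor driver is loaded and exposes temp*_input files (in millidegrees Celsius) under /sys/class/hwmon; whether this board and kernel do so has not been verified, and the function name read_temps is hypothetical.

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Return {path: degrees Celsius} for every readable temp*_input
    file under root. Values in sysfs are millidegrees Celsius."""
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        # Some older kernels place the files one level down, under device/.
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    temps = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as fh:
                    temps[path] = int(fh.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor unreadable or not numeric; skip it
    return temps


if __name__ == "__main__":
    readings = read_temps()
    if not readings:
        print("no hwmon temperature sensors found")
    for path, celsius in readings.items():
        print(f"{path}: {celsius:.1f} C")
```

Running this in a loop from cron or a shell while the nightly build is active would leave a temperature trail leading up to a reboot.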

m> potential hard drive issue?

I have 3 SATA drives running.  It has been a long time since I last did
this; how does one manually do a disk check?

m> any new hardware attached or installed recently?

No

m> Notice any power surges or brownouts?

The machine is on a UPS that deals with this.

m> any other nodes having issues?

No, and they are not on UPSs.  They also do not have as large a workload.

The machine in question is used for nightly builds and regression tests.
I use distcc with the compute nodes to perform the builds.

The machine also runs samba to provide a network share to Windows users 
and provides authentication using Windows domain accounts.

m> Recent power surge zapped a board, DSL modem,
m> and the surge protector.

I doubt this is the problem.

Memtest made it through the first pass of all tests successfully.

Thanks for the suggestions, especially considering my vague information.

Pete



