Instability issue with FC5 on dual-core system.

Hans Kristian Rosbach hk at isphuset.no
Thu Jun 29 16:39:34 UTC 2006


The system runs the latest FC5 x86-64 kernel (2.6.17-1.2139_FC5).
It hangs hard or reboots without warning.

It also logs sporadic errors to syslog.

Console messages:
Message from syslogd at m1 at Thu Jun 29 00:56:55 2006 ...
m1 kernel: Oops: 0000 [1] SMP
Message from syslogd at m1 at Thu Jun 29 00:56:55 2006 ...
m1 kernel: CR2: 0000000000000020


Dmesg shows this oops:
Unable to handle kernel NULL pointer dereference at 0000000000000020
RIP:
<ffffffff8022021c>{copy_process+3132}
PGD 56690067 PUD 54fbe067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/hdd/removable
CPU 0
Modules linked in: ipv6 autofs4 dm_mirror dm_mod video button battery acpi_memhotplug ac lp parport_pc parport sg i2c_nforce2 forcedeth floppy i2c_core raid1 ext3 jbd sata_nv libata sd_mod scsi_mod
Pid: 27939, comm: get-errors.sh Not tainted 2.6.17-1.2139_FC5 #1
RIP: 0010:[<ffffffff8022021c>] <ffffffff8022021c>{copy_process+3132}
RSP: 0018:ffff810056239d78  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff810056bfa700 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff81005642c648 RDI: ffff81007ac52420
RBP: ffff81007ac52420 R08: ffff81005593a000 R09: 00000000000559e1
R10: 0000000000000000 R11: 0000000000000001 R12: ffff81007dccf0c0
R13: ffff810037e6e400 R14: ffff81005642c590 R15: ffff81007b5be080
FS:  00002aaaaaab2d50(0000) GS:ffffffff8069c000(0000) knlGS:00000000f7e45b60
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000058a1e000 CR4: 00000000000006e0
Process get-errors.sh (pid: 27939, threadinfo ffff810056238000, task ffff8100639980c0)
Stack: 00000000ffffffff 00002aaaaaab2de0 0000000000000000 ffff810056239f58
       00007fffbe0c1fd0 0000000001200011 0000000000000000 ffff81007dccf0c0
       ffff810037e6e400 ffff81007ac522c8
Call Trace: <ffffffff8024ade2>{sprintf+81} <ffffffff8026a183>{_spin_unlock_irq+9}
       <ffffffff802331d7>{do_fork+208} <ffffffff802697db>{__mutex_lock_slowpath+868}
       <ffffffff80254017>{do_pipe+610} <ffffffff80269469>{__mutex_unlock_slowpath+522}
       <ffffffff8021454a>{generic_file_llseek+127} <ffffffff80262d8e>{system_call+126}
       <ffffffff8026309b>{ptregscall_common+103}

Code: 48 8b 40 20 f0 ff 43 28 f6 45 29 08 74 07 f0 ff 88 34 03 00
RIP <ffffffff8022021c>{copy_process+3132} RSP <ffff810056239d78>
CR2: 0000000000000020
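
A note for anyone reading the trace: the first four bytes of the Code: line,
48 8b 40 20, decode to "mov 0x20(%rax),%rax". RAX is 0 in the register dump,
so the load hits address 0x20, which matches CR2. The rest of the bytes can
be disassembled by writing them to a raw file first. Below is a rough sketch
of that step. The function name oops_code_to_bin is made up for this
example, and the objdump invocation assumes binutils is installed. The last
instruction will probably not decode cleanly, since the dump stops after 20
bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of any kernel tooling):
# write the hex bytes from an oops "Code:" line to a raw binary file.
# $1 = output file, remaining arguments = hex bytes without a 0x prefix.
oops_code_to_bin() {
    out=$1
    shift
    : > "$out"
    for byte in "$@"; do
        # Convert each hex byte to a three-digit octal escape,
        # because \ooo is the portable printf escape form.
        printf "\\$(printf '%03o' "0x$byte")" >> "$out"
    done
}

codefile=$(mktemp)
oops_code_to_bin "$codefile" \
    48 8b 40 20 f0 ff 43 28 f6 45 29 08 74 07 f0 ff 88 34 03 00
echo "wrote $(wc -c < "$codefile" | tr -d ' ') bytes to $codefile"
```

The resulting file can then be inspected with:
objdump -D -b binary -m i386:x86-64 <file>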


get-errors.sh runs smartctl on both existing and nonexisting disks.
It seems to have hung on hdd.
hdd: CDU5211, ATAPI CD/DVD-ROM drive
hdd: ATAPI 52X CD-ROM drive, 120kB Cache, UDMA(33)


Relevant part of get-errors.sh:
-----
for device in "hda" "hdb" "hdc" "hdd" "hde" "hdf"
do
        exists=0
        smartctl -i /dev/$device | grep -i support | grep -ci enabled > $tempfile1
        exists=`cat $tempfile1`

        # Continue only if SMART support is enabled
        if [ "$exists" != "0" ]; then
                smartctl -H /dev/$device > $tempfile2
                smartctl -c /dev/$device >> $tempfile2
                errtmp=`grep -ci PASSED $tempfile2`

                # Log a failure if the health check did not report PASSED
                if [ "$errtmp" == "0" ]; then
                        echo "SMART: /dev/$device failed smart tests" >> $logfile
                        # Arithmetic expansion; plain $smarterr+1 only appends the text "+1"
                        smarterr=$((smarterr+1))
                fi
                fi
        fi
done
-----
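
Since the loop above also probes device names that do not exist, and the
CD-ROM on hdd, one possible way to narrow it down is to check sysfs before
calling smartctl. This is only a sketch: the helper name is_fixed_disk and
its optional sysfs-root argument are made up for this example, and it
assumes the kernel exposes /sys/block/<device>/removable for each device it
knows about (the same file the oops lists as the last sysfs file). It would
not fix the kernel bug. It only keeps the script away from removable drives
and missing devices.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the device looks like a fixed disk.
# $1 = device name (e.g. hda), $2 = sysfs root (defaults to /sys).
is_fixed_disk() {
    dev=$1
    sysroot=${2:-/sys}
    flag="$sysroot/block/$dev/removable"
    # No readable sysfs entry: the kernel does not know this device
    [ -r "$flag" ] || return 1
    # A value of 0 means fixed media; anything else is treated as removable
    [ "$(cat "$flag")" = "0" ]
}

for device in hda hdb hdc hdd hde hdf
do
    if is_fixed_disk "$device"; then
        echo "would query /dev/$device with smartctl"
    else
        echo "skipping /dev/$device"
    fi
done
```

In the original script, the smartctl calls would go where the first echo is.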


The system has rebooted twice in 3 days. This time it did not reboot,
but some processes have hung. Running ps ax was a bad idea, as that
hung too.

The crash utility just fails with the following error:
crash: cannot resolve "cpu_pda"

For more info, just ask. I might reboot it, but I doubt it will be
long between crashes.

-HK



