Second Block on Partition overwritten with 0xFF
Tomas Pospisek ML
tpo2 at sourcepole.ch
Mon Sep 3 15:01:49 UTC 2007
We're running a small population of lightly embedded machines with the
following setup:
System: +- standard intel box
FS: ext3 (defaults,errors=remount-ro,noatime)
HD: TRANSCEND, ATA DISK drive, Compact Flash (CF), 2000880 sectors (1024
MB) w/2KiB Cache, CHS=1985/16/63
Driver: Standard IDE Driver
ICH4: chipset revision 2
ICH4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:pio,
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio,
kernel: 220.127.116.11 #1 PREEMPT Sat Mar 11 00:56:41 CET 2006 i686 GNU/Linux
ext3 was chosen in the hope of making the system more resilient to power
failures. The systems run on a UPS, but unfortunately some operators
will just pull the power plug (although they're instructed not to).
What we have now experienced multiple times is that the systems run just
fine, with absolutely no complaints in dmesg/kern.log, until one is rebooted
(shutdown -r now). At that point, *very rarely*, GRUB will no longer be
able to read the boot filesystem (Error 17).
I've checked the on-disk data and have discovered that 0x200-0x1c00 is
overwritten with 0xff, then a single 0x0f and after that 0x00 until
That is, the second to the sixteenth on-disk blocks have been overwritten:
000001e0 53 59 53 4d 53 44 4f 53 20 20 20 53 59 53 7f 01 |SYSMSDOS
000001f0 00 41 bb 00 07 60 66 6a 00 e9 3b ff 00 00 00 00
00000200 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
00001c00 ff 0f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00001c10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00002080 ed 41 00 00 00 04 00 00 1e 39 a0 46 a6 6a dd 45 |íA.......9
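To locate such a fill pattern without paging through a hex dump, something along these lines could be used. This is only a minimal sketch: it assumes 512-byte sectors, the helper names find_fill_runs and sector_of are made up for illustration, and the image below is a synthetic stand-in that mimics the dump above, not real data from the card.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors


def find_fill_runs(data, fill=0xFF, min_len=SECTOR_SIZE):
    """Return (start, end) byte offsets (end inclusive) of every run of
    `fill` bytes that is at least `min_len` bytes long."""
    runs = []
    start = None
    for offset, byte in enumerate(data):
        if byte == fill:
            if start is None:
                start = offset
        else:
            if start is not None and offset - start >= min_len:
                runs.append((start, offset - 1))
            start = None
    # A run may extend to the very end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - 1))
    return runs


def sector_of(offset, sector_size=SECTOR_SIZE):
    """1-based number of the sector that contains the given byte offset."""
    return offset // sector_size + 1


# Synthetic stand-in for the damaged area (hypothetical content):
# arbitrary bytes in the first sector, 0xff from 0x200 through 0x1c00,
# a single 0x0f at 0x1c01, zeros afterwards.
image = bytearray(0x2100)
image[0x000:0x200] = b"\x42" * 0x200
image[0x200:0x1c01] = b"\xff" * (0x1c01 - 0x200)
image[0x1c01] = 0x0F

for start, end in find_fill_runs(bytes(image)):
    print(f"0xff run: {start:#x}-{end:#x}, "
          f"sectors {sector_of(start)}-{sector_of(end)}")
```

On a real system the bytes would instead be read from the device or from an image of it, e.g. open("/dev/hdc", "rb").read(0x2100), run as root.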
Our project does no hardware-level operations. All access is through
regular file-operations only. Thus there's no way we're aware of that
our software would be changing blocks on-disk directly.
What's striking about the problem above is that the first affected block
starts _before_ the on-disk filesystem (0x200), which starts at 0x400.
My question is: does the ext3 driver _ever_ write outside of its own
space on disk, i.e. into 0x000-0x400? That is, can we exclude with
certainty that the ext3 driver is causing the problem?
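A quick way to tell whether the primary superblock at 0x400 is still intact is to look for the ext2/ext3 magic number. The sketch below assumes the standard on-disk layout (superblock 1024 bytes into the partition, 16-bit little-endian s_magic field 56 bytes into the superblock, value 0xEF53); the function name has_ext_magic and the two test images are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 0x400  # ext2/ext3 superblock starts 1024 bytes in
MAGIC_OFFSET = 0x38        # s_magic field, 56 bytes into the superblock
EXT_MAGIC = 0xEF53         # ext2/ext3 magic number


def has_ext_magic(image):
    """Return True if the bytes at 0x438 hold the ext2/ext3 magic number."""
    pos = SUPERBLOCK_OFFSET + MAGIC_OFFSET
    if len(image) < pos + 2:
        return False  # image too short to contain the field
    (magic,) = struct.unpack_from("<H", image, pos)
    return magic == EXT_MAGIC


# Synthetic examples (not real data): one image with the magic in place,
# one where the superblock area has been filled with 0xff.
intact = bytearray(0x800)
struct.pack_into("<H", intact, SUPERBLOCK_OFFSET + MAGIC_OFFSET, EXT_MAGIC)

damaged = bytearray(intact)
damaged[0x200:0x800] = b"\xff" * 0x600

print(has_ext_magic(bytes(intact)), has_ext_magic(bytes(damaged)))
```

This only checks the primary superblock; it says nothing about who overwrote it.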
What else could cause the problem then? We don't see any sign of a
problem before reboot, only after. Could the IDE driver be the problem?
Or is it the IDE CF card hardware?
I've done a dd if=/dev/hdc of=/dev/null and there was absolutely no trouble
visible (nothing in kern.log/dmesg), thus the card does not seem to be
broken on the physical level and doesn't have bad blocks that would fail
Does this ring a bell with anybody?