botched RAID, now e2fsck or what?

Lucian Șandor lucisandor at gmail.com
Thu Dec 10 20:30:19 UTC 2009


Thank you all for your kind replies.

One extra thought and question: would it help if I had a copy of some
large file that is also on the array? Could I search for a piece of
that file on the individual drives, or at least on the permuted arrays?
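
I imagine something like the sketch below (untested; the device names
are placeholders, the needle is kept smaller than the RAID chunk so it
should lie contiguously on one member disk, and everything is opened
read-only):

#!/usr/bin/env python
# Untested sketch: look for a chunk of a known file on each raw RAID
# member.  Device names are placeholders.  Keep the needle no larger
# than the RAID chunk size, so it lies contiguously on a single disk.
import sys

DEVICES = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]   # hypothetical
NEEDLE_SIZE = 16 * 1024
READ_SIZE = 4 * 1024 * 1024

# take the needle from well inside the known-good copy of the file,
# away from headers or zero-filled regions
with open(sys.argv[1], "rb") as f:
    f.seek(1024 * 1024)
    needle = f.read(NEEDLE_SIZE)

for dev in DEVICES:
    with open(dev, "rb") as disk:
        base, tail = 0, b""
        while True:
            block = disk.read(READ_SIZE)
            if not block:
                break
            hay = tail + block
            pos = hay.find(needle)
            if pos != -1:
                print("%s: found at byte %d" % (dev, base + pos))
                break
            # keep an overlap so a match on a read boundary isn't missed
            tail = hay[len(hay) - NEEDLE_SIZE + 1:]
            base += len(hay) - len(tail)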

I tried findsuper, but it keeps finding the same backup superblocks no
matter how I switch the order of the disks, except for the first one.
My impression is that the superblocks all fall on the same disk, but
that is only an impression, so I am running it thoroughly on a smaller
array to make sure. Another issue with this approach is that it takes
a lot of time: I have a 4.5 TB array with 720 permutations to try.
That sounds more like a job for a few years.
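
Instead of assembling every permutation and running findsuper on the
whole md device, maybe I should scan each component disk directly and
note which group numbers show up where. A minimal findsuper-like
sketch (untested; it assumes the classic superblock layout, with the
0xEF53 magic at byte 56 and the 16-bit group number at byte 90, and
some noise from journal copies and chance matches is to be expected):

#!/usr/bin/env python
# Untested findsuper-like sketch: scan one raw component disk (not the
# assembled md device) for ext3 superblock copies and print the byte
# offset and the group number stored in each copy.  Read-only.
import struct, sys

SB_STRIDE = 1024                 # superblocks sit on 1 kB boundaries
READ_SIZE = 4 * 1024 * 1024      # multiple of SB_STRIDE, so copies
                                 # never straddle two reads
base = 0
with open(sys.argv[1], "rb") as disk:    # e.g. /dev/sdb1
    while True:
        buf = disk.read(READ_SIZE)
        if len(buf) < SB_STRIDE:
            break
        for off in range(0, len(buf) - 92, SB_STRIDE):
            magic, = struct.unpack_from("<H", buf, off + 56)
            if magic == 0xEF53:
                group, = struct.unpack_from("<H", buf, off + 90)
                print("0x%09x: superblock copy, group %d"
                      % (base + off, group))
        base += len(buf)
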
2009/12/10  <tytso at mit.edu>:
> Something that may help is to use the findsuper program, in the
> e2fsprogs sources; it's not built by default, but you can build it by
> hand.
> The group number information should help you determine the order of the
> disks in the raid array.

Same issue if I use the used-inode count: the permutations yield the
same numbers over and over again. I think dumpe2fs -h doesn't go far
into the actual drive, but only reads the metadata at the beginning of
the device, and that falls on the same disk...
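
A way around that might be to read a backup superblock deep inside the
assembled array, so each permutation is tested on metadata that
actually spans the disks; dumpe2fs can be pointed there with
-o superblock=32768 -o blocksize=4096. The same check in a few lines
(untested sketch; the field offsets are the classic ext2/3 ones, and
since backup copies are rarely rewritten, the presence of valid magic
matters more than the counts themselves):

#!/usr/bin/env python
# Untested sketch: read a superblock at a given byte offset of the
# assembled array and print the used-inode count.  Field offsets:
# s_inodes_count at byte 0, s_free_inodes_count at byte 16,
# s_magic at byte 56.  Backup copies may hold stale counts.
import struct, sys

dev, offset = sys.argv[1], int(sys.argv[2])
# e.g. /dev/md2 and 134217728 (block 32768 on a 4 kB-block filesystem)

with open(dev, "rb") as disk:
    disk.seek(offset)
    sb = disk.read(1024)

if len(sb) < 1024 or struct.unpack_from("<H", sb, 56)[0] != 0xEF53:
    sys.exit("no superblock magic at this offset")
inodes, = struct.unpack_from("<I", sb, 0)
free_inodes, = struct.unpack_from("<I", sb, 16)
print("inode count %d, used %d" % (inodes, inodes - free_inodes))
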
2009/12/10 Christian Kujau <lists at nerdbynature.de>:
> On Wed, 9 Dec 2009 at 20:50, Lucian Șandor wrote:
>> Question 1: Is there a way to make dumpe2fs or another command
>> estimate the number of files in what appears to be an ext3 partition?
>
> I can only think of:
> $ dumpe2fs -h /dev/loop0 | egrep 'Inode count|Free inodes'
> The difference between both values should be the used inodes, i.e.
> files/directories on the filesystem.


2009/12/10 Andreas Dilger <adilger at sun.com>:
> On 2009-12-09, at 18:50, Lucian Șandor wrote:
>>
>> However, no combination seems useful. Sometimes I get:
>> "e2fsck: Bad magic number in super-block while trying to open /dev/md0"
>> Other times I get:
>> "Superblock has an invalid journal (inode 8)."
>> Other times I get:
>> "e2fsck: Illegal inode number while checking ext3 journal for /dev/md2."
> None of these appears in only one permutation, so none is indicative
> of the correctness of the permutation.
>
> You need to know a bit about your RAID layout and the structure of ext*.
>  One thing that is VERY important is whether your new MD config has the same
> chunk size as it did initially.  It will be impossible to recover your
> config if you don't have the same chunk size.
>
> Also, if you haven't disabled RAID resync then it may well be that changing
> the RAID layout has caused a resync that has permanently corrupted your
> data.

I have the chunk size for one of the arrays. I thought that mdadm
would automatically use the same values it used when it first created
the arrays, but guess what, it did not. Now I have another headache
for the other array.
The arrays were degraded at the time of the whole mess, and I have
always re-created them as degraded. I wonder how much longer I can
pull off this feat, after being so messy in the first place.
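
For the record, the loop I am running is roughly the sketch below
(untested here; MEMBERS, MD, CHUNK_KB and NDEV are placeholders for my
real values, and the probe only reads from the array):

#!/usr/bin/env python
# Untested sketch of the permutation loop: re-create the degraded
# array with an explicit --chunk for each disk order, probe it
# read-only, tear it down.  "missing" is tried in every slot since
# the failed disk's position is unknown too.
import itertools, struct, subprocess

MEMBERS = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]  # surviving disks
MD, CHUNK_KB, NDEV = "/dev/md2", 64, 4

def probe(dev, offset=32768 * 4096):
    # group 1's backup superblock on a 4 kB-block filesystem; deep
    # enough that the data is striped across several members
    with open(dev, "rb") as f:
        f.seek(offset)
        sb = f.read(1024)
    return len(sb) == 1024 and struct.unpack_from("<H", sb, 56)[0] == 0xEF53

for order in itertools.permutations(MEMBERS):
    for slot in range(NDEV):
        devs = list(order[:slot]) + ["missing"] + list(order[slot:])
        subprocess.call(["mdadm", "--stop", MD])
        rc = subprocess.call(["mdadm", "--create", MD, "--run",
                              "--assume-clean", "--level=5",
                              "--chunk=%d" % CHUNK_KB,
                              "--raid-devices=%d" % NDEV] + devs)
        if rc == 0 and probe(MD):
            print("candidate: %s" % " ".join(devs))
subprocess.call(["mdadm", "--stop", MD])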

> That said, I will assume the primary ext3 superblock will reside on the
> first disk in the RAID set, since it is located at an offset of 1kB from the
> start of the device.
>
> You should build and run the "findsuper" tool that is in the e2fsprogs
> source tree.  It will scan the raw disk devices and locate the ext3
> superblocks.  Each superblock contains the group number in which it is
> stored, so you can find the first RAID disk by looking for the one that has
> superblock 0 at offset 1kB from the start of the disk.
>
> There may be other copies of the superblock #0 stored in the journal file,
> but those should be ignored.
>
> The backup superblocks have a non-zero group number, and "findsuper" prints
> the offset at which that superblock should be located from the start of the
> LUN.  Depending on whether you have a non-power-of-two number of disks in
> your RAID set, you may find the superblock copies on different disks, and
> you can do some math to determine which order the disks should be in by
> computing the relative offset of the superblock within the RAID set.
>
>
> The other thing that can help order the disks (depending on the RAID
> chunksize and the total number of groups in the filesystem, proportional to
> the filesystem size) is the group descriptor table.  It is located
> immediately after the superblocks, and contains a very regular list of block
> numbers for the block and inode bitmaps, and the inode table in each group.
>
> Using "od -Ax -tx4" on a regular ext3 filesystem you can see the group
> descriptor table starting at offset 0x1000, and the block numbers basically
> just "count" up.  This may in fact be the easiest way to order the disks, if
> the group descriptor table is large enough to cover all of the disks:
>
> # od -Ax -tx4 /dev/hda1 | more
> :
> :
> 001000 0000012c 0000012d 0000012e 02430000
> 001010 000001f2 00000000 00000000 00000000
> 001020 0000812c 0000812d 0000812e 2e422b21
> 001030 0000000d 00000000 00000000 00000000
> 001040 00010000 00010001 00010002 27630074
> 001050 000000b8 00000000 00000000 00000000
> 001060 0001812c 0001812d 0001812e 27a70b8a
> 001070 00000231 00000000 00000000 00000000
> 001080 00020000 00020001 00020002 2cc10000
> 001090 00000008 00000000 00000000 00000000
> 0010a0 0002812c 0002812d 0002812e 25660134
> 0010b0 00000255 00000000 00000000 00000000
> 0010c0 00030000 00030001 00030002 17a50003
> 0010d0 000001c6 00000000 00000000 00000000
> 0010e0 0003812c 0003812d 0003812e 27a70000
> 0010f0 00000048 00000000 00000000 00000000
> 001100 00040000 00040001 00040002 2f8b0000
>
> See nearly regular incrementing sequence every 0x20 bytes:
>
> 0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000,
> 0003812c
>
>
> Each group descriptor block (4kB = 0x1000) covers 16GB of filesystem space,
> so  64 blocks per 1TB of filesystem size.  If your RAID chunk size is not
> too large, and the filesystem IS large, you will be able to fully order your
> disks in the RAID set.  You can also verify the RAID chunk size by
> determining how many blocks of consecutive group descriptors are present
> before there is a "jump" where the group descriptor blocks were written to
> other disks before returning to the current disk.  Remember that one of the
> disks in the set will also need to store parity, so there will be some
> number of "garbage" blocks before the proper data resumes.
>

This seems like a great idea. The 4.5 TB array is huge (it should have
a roughly 1100 kB descriptor table), and its group descriptor table
likely extends across all the disks. I can already see the pattern,
but the job requires programming, since it would be tedious to read
megabytes of data by eye over hundreds of permutations. I will try
coding it, but I hope somebody has already written it. Isn't there a
utility that will take a group descriptor table and verify its
integrity without modifying it?
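
In case no such utility exists, this is the kind of check I have in
mind (untested sketch; it assumes 4 kB filesystem blocks and 32768
blocks per group, matching the od dump above, and it never writes):

#!/usr/bin/env python
# Untested sketch: scan the start of a raw member disk for runs of
# plausible ext3 group descriptors and report each run.  Run
# boundaries should fall on the RAID chunk size, and the first group
# of each run says where the disk sits in the stripe order.
import struct, sys

BLOCKS_PER_GROUP = 32768
DESC = 32                       # bytes per ext2/3 group descriptor

def plausible(buf, off):
    bmap, imap, itab = struct.unpack_from("<III", buf, off)
    group = bmap // BLOCKS_PER_GROUP
    # block bitmap, inode bitmap and inode table are consecutive
    # blocks near the start of their group (cf. the od dump above)
    return (imap == bmap + 1 and itab == bmap + 2
            and 0 <= bmap - group * BLOCKS_PER_GROUP < 1024)

with open(sys.argv[1], "rb") as disk:   # e.g. /dev/sdb1
    buf = disk.read(8 * 1024 * 1024)    # the table lives up front

run_start = None
for off in range(0, len(buf) - DESC, DESC):
    if plausible(buf, off) and run_start is None:
        run_start = off
    elif not plausible(buf, off) and run_start is not None:
        bmap, = struct.unpack_from("<I", buf, run_start)
        print("descriptors 0x%06x-0x%06x, first group ~%d"
              % (run_start, off, bmap // BLOCKS_PER_GROUP))
        run_start = None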

>> I also ran dumpe2fs /dev/md2, but I don't know how to make it more
>> useful than it is now. Right now it finds superblocks in a series of
>> permutations, so again, it is not of much help.
>
> I would also make sure that you can get the correct ordering and MD chunk
> size before doing ANY kind of modification to the disks.  It would only take
> a single mistake (e.g. RAID parity rebuild while not in the right order) to
> totally corrupt the filesystem.
>
>> Question 1: Is there a way to make dumpe2fs or another command
>> estimate the number of files in what appears to be an ext3 partition?
>> (I would then go by the permutation which finds the largest number of
>> files.)
>> Question 2: if I were to strike it lucky and find the right combination,
>> would dumpe2fs give me a very, very long list of superblocks? Do the
>> superblocks extend far into the partition, or do they always stop
>> early (thus showing the same number each time my RAID starts with the
>> right drive)?
>>
>> Question 3: Is there any other tool that would search for files in the
>> remains of an ext3 partition, and, this way, validate or invalidate
>> the permutations I try?
>>
>> Thanks,
>> Lucian Sandor
>>
>>
>> 2009/12/9 Eric Sandeen <sandeen at redhat.com>:
>>>
>>> Lucian Șandor wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Somehow I managed to mess with a RAID array containing an ext3
>>>> partition.
>>>>
>>>> Parenthesis, if it matters: I disconnected physically a drive while
>>>> the array was online. Next thing, I lost the right order of the drives
>>>> in the array. While trying to re-create it, I overwrote the raid
>>>> superblocks. Luckily, the array was RAID5 degraded, so whenever I
>>>> re-created it, it didn't go into sync; thus, everything besides the
>>>> RAID superblocks is preserved (or so I think).
>>>>
>>>> Now, I am trying to re-create the array in the proper order. It takes
>>>> me countless attempts, through hundreds of permutations. I am doing it
>>>> programmatically, but I don't think I have the right tool.
>>>> Now, after creating the array and mounting it with
>>>> mount -t ext3 -n -r /dev/md2 /media/olddepot
>>>> I issue an:
>>>> e2fsck -n -f /media/olddepot
>>>> However, I cycled through all the permutations without apparent
>>>> success. I.e., in all combinations it just refused to check it, saying
>>>> something about "short read" and, of course, about invalid file
>>>> systems.
>>>
>>> As Christian pointed out, use the device not the mountpoint for the fsck
>>> arg:
>>>
>>> [tmp]$ mkdir dir
>>> [tmp]$ e2fsck -fn dir/
>>> e2fsck 1.41.4 (27-Jan-2009)
>>> e2fsck: Attempt to read block from filesystem resulted in short read
>>> while trying to open dir/
>>> Could this be a zero-length partition?
>>>
>>>
>>>  :)
>>>
>>> -Eric
>>>
>>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>



