botched RAID, now e2fsck or what?

Thu Dec 10 06:54:54 UTC 2009

On 2009-12-09, at 18:50, Lucian Șandor wrote:
> However, no combination seems useful. Sometimes I get:
> "e2fsck: Bad magic number in super-block while trying to open /dev/md0"
> Other times I get:
> "Superblock has an invalid journal (inode 8)."
> Other times I get:
> "e2fsck: Illegal inode number while checking ext3 journal for /dev/md2."
> None of these appears in only one permutation, so none is indicative
> of the correctness of the permutation.

You need to know a bit about your RAID layout and the structure of  
ext*.  One thing that is VERY important is whether your new MD config  
has the same chunk size as it did initially.  It will be impossible to  
recover your config if you don't have the same chunk size.

Also, if you haven't disabled RAID resync then it may well be that  
changing the RAID layout has caused a resync that has permanently  
corrupted your data.

That said, I will assume the primary ext3 superblock will reside on  
the first disk in the RAID set, since it is located at an offset of  
1kB from the start of the device.

You should build and run the "findsuper" tool that is in the e2fsprogs  
source tree.  It will scan the raw disk devices and locate the ext3  
superblocks.  Each superblock contains the group number in which it is  
stored, so you can find the first RAID disk by looking for the one  
that has superblock 0 at offset 1kB from the start of the disk.

There may be other copies of the superblock #0 stored in the journal  
file, but those should be ignored.
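
If you want a quick check before building findsuper, the test for "superblock 0 at offset 1kB" could be sketched roughly as below.  This is only an illustration, not what findsuper does internally: it looks at a single offset, using the ext2/ext3 on-disk field positions (s_magic at byte 56 and s_block_group_nr at byte 90 of the superblock).  The function name and the device paths in the usage note are hypothetical.

```python
import struct

EXT_MAGIC = 0xEF53   # s_magic value for ext2/ext3/ext4
SB_OFFSET = 1024     # primary superblock sits 1kB into the device

def probe_superblock(path, offset=SB_OFFSET):
    """Return the group number stored in an ext superblock at `offset`,
    or None if no superblock magic is found there.  Opens read-only."""
    with open(path, "rb") as f:
        f.seek(offset)
        sb = f.read(1024)
    if len(sb) < 92:
        return None
    (magic,) = struct.unpack_from("<H", sb, 56)   # s_magic, little-endian
    if magic != EXT_MAGIC:
        return None
    (group,) = struct.unpack_from("<H", sb, 90)   # s_block_group_nr
    return group
```

Calling probe_superblock() on each raw member device (for example "/dev/sdb1", "/dev/sdc1", which are placeholders for your own devices) should return 0 for the first disk of the set and None for the others, assuming the MD data starts at the beginning of each member device.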

The backup superblocks have a non-zero group number, and "findsuper"  
prints the offset at which that superblock should be located from the  
start of the LUN.  Depending on whether you have a non-power-of-two  
number of disks in your RAID set, you may find the superblock copies  
on different disks, and you can do some math to determine which order  
the disks should be in by computing the relative offset of the  
superblock within the RAID set.
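
The "math" mentioned above might look something like the following sketch.  It assumes the MD left-symmetric RAID5 layout (the mdadm default, as far as I know), that the data starts at offset 0 on each member device, and a chunk size and disk count that you supply; the function name is made up for illustration.

```python
def locate(offset, chunk_size, n_disks):
    """Map a byte offset within a RAID5 array to (disk index, byte offset
    on that disk), assuming the left-symmetric layout and no data offset."""
    data_disks = n_disks - 1
    chunk_nr, in_chunk = divmod(offset, chunk_size)
    stripe, data_idx = divmod(chunk_nr, data_disks)
    # Parity rotates backwards from the last disk, one step per stripe.
    parity_disk = data_disks - (stripe % n_disks)
    # Data chunks start on the disk after parity and wrap around.
    disk = (parity_disk + 1 + data_idx) % n_disks
    return disk, stripe * chunk_size + in_chunk
```

For a hypothetical 4-disk set with 64kB chunks, locate(1024, 65536, 4) gives disk 0 at offset 1024 for the primary superblock.  Feeding in the array offset that findsuper reports for a backup superblock, and comparing the result with the disk and offset where that copy was actually found, tells you whether a candidate ordering is consistent.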

The other thing that can help order the disks (depending on the RAID  
chunksize and the total number of groups in the filesystem,  
proportional to the filesystem size) is the group descriptor table.   
It is located immediately after the superblocks, and contains a very  
regular list of block numbers for the block and inode bitmaps, and the  
inode table in each group.

Using "od -Ax -tx4" on a regular ext3 filesystem you can see the group  
descriptor table starting at offset 0x1000, and the block numbers  
basically just "count" up.  This may in fact be the easiest way to  
order the disks, if the group descriptor table is large enough to  
cover all of the disks:

# od -Ax -tx4 /dev/hda1 | more
:
:
001000 0000012c 0000012d 0000012e 02430000
001010 000001f2 00000000 00000000 00000000
001020 0000812c 0000812d 0000812e 2e422b21
001030 0000000d 00000000 00000000 00000000
001040 00010000 00010001 00010002 27630074
001050 000000b8 00000000 00000000 00000000
001060 0001812c 0001812d 0001812e 27a70b8a
001070 00000231 00000000 00000000 00000000
001080 00020000 00020001 00020002 2cc10000
001090 00000008 00000000 00000000 00000000
0010a0 0002812c 0002812d 0002812e 25660134
0010b0 00000255 00000000 00000000 00000000
0010c0 00030000 00030001 00030002 17a50003
0010d0 000001c6 00000000 00000000 00000000
0010e0 0003812c 0003812d 0003812e 27a70000
0010f0 00000048 00000000 00000000 00000000
001100 00040000 00040001 00040002 2f8b0000

Note the nearly regular incrementing sequence every 0x20 bytes (the first
field of each 32-byte group descriptor, the block bitmap location):

0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000,  
0003812c
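
The same sequence can be pulled out of a raw 4kB block programmatically.  The following is a rough sketch, assuming 32-byte ext2/ext3 group descriptors whose first little-endian 32-bit field is the block bitmap location; the helper names and the threshold of 8 descriptors are arbitrary choices for illustration.

```python
import struct

DESC_SIZE = 32  # bytes per ext2/ext3 group descriptor

def block_bitmaps(raw):
    """Return the first 32-bit field (block bitmap location) of each
    32-byte descriptor found in `raw`."""
    return [struct.unpack_from("<I", raw, off)[0]
            for off in range(0, len(raw) - DESC_SIZE + 1, DESC_SIZE)]

def looks_like_gdt(raw, min_run=8):
    """True if the first `min_run` bitmap locations are strictly increasing,
    which is what a group descriptor block normally looks like."""
    vals = block_bitmaps(raw)[:min_run]
    return len(vals) == min_run and all(a < b for a, b in zip(vals, vals[1:]))
```

Reading 4kB blocks from each member device and running looks_like_gdt() on them gives a crude way to spot descriptor blocks; the values returned by block_bitmaps() then show which part of the table each disk holds.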

Each group descriptor block (4kB = 0x1000) covers 16GB of filesystem
space, so there are 64 such blocks per 1TB of filesystem size.  If your
RAID chunk size is not too large, and the filesystem IS large, you will be able
to fully order your disks in the RAID set.  You can also verify the  
RAID chunk size by determining how many blocks of consecutive group  
descriptors are present before there is a "jump" where the group  
descriptor blocks were written to other disks before returning to the  
current disk.  Remember that one of the disks in the set will also  
need to store parity, so there will be some number of "garbage" blocks  
before the proper data resumes.
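
To judge whether the group descriptor table is large enough to reach every disk, the arithmetic above can be written out as a small sketch.  It assumes 4kB blocks, 32-byte descriptors and the usual 8 * blocksize blocks per group; the function names are made up for illustration, and the chunk count is approximate because the table starts 4kB into the filesystem rather than on a chunk boundary.

```python
DESC_SIZE = 32  # bytes per ext2/ext3 group descriptor

def gdt_blocks(fs_bytes, block_size=4096):
    """Number of group descriptor blocks for a filesystem of fs_bytes."""
    blocks_per_group = 8 * block_size            # one bitmap block covers this many blocks
    group_bytes = blocks_per_group * block_size  # 128MB with 4kB blocks
    groups = -(-fs_bytes // group_bytes)         # ceiling division
    descs_per_block = block_size // DESC_SIZE    # 128 with 4kB blocks
    return -(-groups // descs_per_block)

def gdt_chunks(fs_bytes, chunk_bytes, block_size=4096):
    """Approximate number of RAID chunks spanned by the descriptor table."""
    gdt_bytes = gdt_blocks(fs_bytes, block_size) * block_size
    return -(-gdt_bytes // chunk_bytes)
```

For example, a 1TB filesystem has 64 descriptor blocks (256kB), which spans about 4 chunks at a 64kB chunk size.  If that number is smaller than the number of data disks, the table alone will not touch every disk.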

> I also ran dumpe2fs /dev/md2, but I don't know how to make it more
> useful than it is now. Right now it finds supernodes in a series of
> permutations, so again, it is not of much help.

I would also make sure that you can get the correct ordering and MD  
chunk size before doing ANY kind of modification to the disks.  It  
would only take a single mistake (e.g. RAID parity rebuild while not  
in the right order) to totally corrupt the filesystem.

> Question 1: Is there a way to make dumpe2fs or another command
> estimate the number of files in what appears to be an ext3 partition?
> (I would then go by the permutation which finds the largest number of
> files.)
> Question 2: if I were to strike lucky and find the right combination,
> would dumpe2fs give me a very-very long list of superblocks? Do the
> superblocks extend far into the partition, or do they always stop
> early (thus showing the same number each time my RAID starts with the
> right drive)?
>
> Question 3: Is there any other tool that would search for files in the
> remains of an ext3 partition, and, this way, validate or invalidate
> the permutations I try?
>
> Thanks,
> Lucian Sandor
>
>
> 2009/12/9 Eric Sandeen <sandeen at redhat.com>:
>> Lucian Șandor wrote:
>>> Hi all,
>>>
>>> Somehow I managed to mess with a RAID array containing an ext3 partition.
>>>
>>> Parenthesis, if it matters: I disconnected physically a drive while
>>> the array was online. Next thing, I lost the right order of the drives
>>> in the array. While trying to re-create it, I overwrote the raid
>>> superblocks. Luckily, the array was RAID5 degraded, so whenever I
>>> re-created it, it didn't go into sync; thus, everything besides the
>>> RAID superblocks is preserved (or so I think).
>>>
>>> Now, I am trying to re-create the array in the proper order. It takes
>>> me countless attempts, through hundreds of permutations. I am doing it
>>> programmatically, but I don't think I have the right tool.
>>> Now, after creating the array and mounting it with
>>> mount -t ext3 -n -r /dev/md2 /media/olddepot
>>> I issue an:
>>> e2fsck -n -f /media/olddepot
>>> However, I cycled through all the permutations without apparent
>>> success. I.e., in all combinations it just refused to check it, saying
>>> something about "short read" and, of course, about invalid file
>>> systems.
>>
>> As Christian pointed out, use the device not the mountpoint for the  
>> fsck arg:
>>
>> [tmp]$ mkdir dir
>> [tmp]$ e2fsck -fn dir/
>> e2fsck 1.41.4 (27-Jan-2009)
>> e2fsck: Attempt to read block from filesystem resulted in short read while trying to open dir/
>> Could this be a zero-length partition?
>>
>>
>>  :)
>>
>> -Eric
>>
>

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.