botched RAID, now e2fsck or what?

Fri Dec 11 19:33:01 UTC 2009

Hi,

Thanks for your idea. It worked great as a first step. One other
thing: immediately after the first table, there is a second one. Using
both tables, I was able to tell the parity position. For me, with 6
drives, the tables fell into an annoying pattern of complementation,
such that four of them always give 0000 0000 0000 and the other two
drives have identical chunks.
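The pattern above is what RAID5 parity should produce: the parity chunk
is the byte-wise XOR of the data chunks in the same stripe, so when four
data chunks are all zeros, the parity chunk is identical to the one
remaining data chunk. A minimal sketch of that relation follows; the
8-byte "chunks" are made-up illustration values, not data read from the
drives.

```python
from functools import reduce


def xor_chunks(chunks):
    """Return the byte-wise XOR of equally sized chunks."""
    size = len(chunks[0])
    if any(len(c) != size for c in chunks):
        raise ValueError("all chunks must have the same length")
    # zip(*chunks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Hypothetical stripe: four all-zero data chunks and one non-zero one.
zero = bytes(8)
data = bytes.fromhex("0000012c0000012d")

parity = xor_chunks([zero, zero, zero, zero, data])
print(parity == data)  # parity is a copy of the only non-zero chunk

# XOR over all six members of a consistent stripe gives zeros.
print(xor_chunks([zero] * 4 + [data, parity]) == zero)
```

The same function could be fed chunks read at one offset from each of
the six raw drives, but a non-zero result only says the stripe is
inconsistent, not which drive is the stale one.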

I am still no better off because I don't know how to assemble it.
Should I create it as 1 2 3 4 5 P, or maybe as P 1 2 3 4 5? But that
is something I might find out by trying a few combinations and looking
at how the beginning of /dev/md0 is assembled.
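Checking the beginning of /dev/md0 for each trial order can be
automated along the lines of the "od" inspection quoted below. This is
only a sketch under assumptions: 4 kB filesystem blocks (so the group
descriptor table starts at offset 0x1000), 32-byte ext3 group
descriptors whose first little-endian 32-bit word is the block bitmap
number, and a hypothetical helper name score_device. It counts how many
descriptors from the start of the table have strictly increasing block
bitmap numbers; a correct order should give a longer run.

```python
import struct

DESC_SIZE = 32  # assumed size of one ext2/ext3 group descriptor, in bytes


def block_bitmap_numbers(table):
    """Return the first little-endian 32-bit word of every 32-byte descriptor."""
    usable = len(table) - len(table) % DESC_SIZE
    return [struct.unpack_from("<I", table, off)[0]
            for off in range(0, usable, DESC_SIZE)]


def leading_increasing_run(values):
    """Count how many values from the start are strictly increasing."""
    if not values:
        return 0
    run = 1
    for prev, cur in zip(values, values[1:]):
        if cur <= prev:
            break
        run += 1
    return run


def score_device(path, offset=0x1000, length=4096):
    """Hypothetical helper: score one device by its leading descriptor run."""
    with open(path, "rb") as dev:
        dev.seek(offset)
        return leading_increasing_run(block_bitmap_numbers(dev.read(length)))


# Self-check using the first words from the od sample quoted below.
sample = [0x12c, 0x812c, 0x10000, 0x1812c, 0x20000, 0x2812c, 0x30000, 0x3812c]
table = b"".join(struct.pack("<I", n) + bytes(28) for n in sample)
print(leading_increasing_run(block_bitmap_numbers(table)))
```

Since the block bitmap locations are fixed when the filesystem is
created, this score can help with disk order but is not expected to
distinguish which drive is stale.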

One issue is that no matter how I mix them, I have an extra drive
that I need to keep out. (The array was degraded for a few days before
the drive mix-up, and the failing drive is still in the computer, now
mixed up with the others.) I can try assembling the array with any of
the six drives as missing, but I don't see a difference in the
beginning of /dev/md0, since that part was written back when the array
was still running, and I get the same errors from e2fsck (complaining
about journal invalidity). Findsuper finds the same superblocks, e2fsck
finds the same inodes :(

There should be a way of telling whether one of the 6 remaining
permutations makes a better combination. As I said, I even have files
that are also on the array. Any other thoughts?

Best,
Lucian Sandor

2009/12/10 Andreas Dilger <adilger at sun.com>:
> On 2009-12-10, at 13:30, Lucian Șandor wrote:
>>
>> 2009/12/10 Andreas Dilger <adilger at sun.com>:
>>>
>>> Using "od -Ax -tx4" on a regular ext3 filesystem you can see the group
>>> descriptor table starting at offset 0x1000, and the block numbers basically
>>> just "count" up.  This may in fact be the easiest way to order the disks, if
>>> the group descriptor table is large enough to cover all of the disks:
>>>
>>> # od -Ax -tx4 /dev/hda1 | more
>>> :
>>> :
>>> 001000 0000012c 0000012d 0000012e 02430000
>>> 001010 000001f2 00000000 00000000 00000000
>>> 001020 0000812c 0000812d 0000812e 2e422b21
>>> 001030 0000000d 00000000 00000000 00000000
>>> 001040 00010000 00010001 00010002 27630074
>>> 001050 000000b8 00000000 00000000 00000000
>>> 001060 0001812c 0001812d 0001812e 27a70b8a
>>> 001070 00000231 00000000 00000000 00000000
>>> 001080 00020000 00020001 00020002 2cc10000
>>> 001090 00000008 00000000 00000000 00000000
>>> 0010a0 0002812c 0002812d 0002812e 25660134
>>> 0010b0 00000255 00000000 00000000 00000000
>>> 0010c0 00030000 00030001 00030002 17a50003
>>> 0010d0 000001c6 00000000 00000000 00000000
>>> 0010e0 0003812c 0003812d 0003812e 27a70000
>>> 0010f0 00000048 00000000 00000000 00000000
>>> 001100 00040000 00040001 00040002 2f8b0000
>>>
>>> See nearly regular incrementing sequence every 0x20 bytes:
>>>
>>> 0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000,
>>> 0003812c
>>>
>>>
>>> Each group descriptor block (4kB = 0x1000) covers 16GB of filesystem
>>> space, so  64 blocks per 1TB of filesystem size.  If your RAID chunk size is
>>> not too large, and the filesystem IS large, you will be able to fully order
>>> your disks in the RAID set.  You can also verify the RAID chunk size by
>>> determining how many blocks of consecutive group descriptors are present
>>> before there is a "jump" where the group descriptor blocks were written to
>>> other disks before returning to the current disk.  Remember that one of the
>>> disks in the set will also need to store parity, so there will be some
>>> number of "garbage" blocks before the proper data resumes.
>>
>> This seems a great idea. The 4.5 TB array is huge (should have a 1100
>> kB table), and likely its group descriptor table extends on all
>> partitions. I already found the pattern, but the job requires
>> programming, since it would be troubling to read megs of data over the
>> hundreds of permutations. I will try coding it, but I hope that
>> somebody else wrote it before. Isn't there any utility that will take
>> a group descriptor table and verify its integrity without modifying
>> it?
>
> I think you are going about this incorrectly...  Run the "od" command on the
> raw component drives (e.g. /dev/sda, /dev/sdb, /dev/sdc, etc), not on the
> assembled MD RAID array (e.g. NOT /dev/md0).
>
> The data blocks on the raw devices will be correct, with every 1/N chunks of
> space being used for parity information (so will look like garbage).  That
> won't prevent you from seeing the data in the group descriptor table and
> allowing you to see the order in which the disks are supposed to be AND the
> chunk size.
>
> Since the group descriptor table is only a few kB from the start of the disk
> (I'm assuming you used whole-disk devices for the MD array, instead of DOS
> partitions) you can just use "od ... | less" and your eyes to see what is
> there.  No programming needed.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
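The sizing figures in the quoted messages (16GB per descriptor block,
64 blocks per TB, roughly 1100 kB for a 4.5 TB array) can be checked
with a few lines of arithmetic. The sketch assumes 4 kB blocks, 32-byte
descriptors, the usual default of 8 * block_size blocks per group (one
bitmap block holds one bit per block), and binary units; none of these
were confirmed for this particular array.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size in bytes
DESC_SIZE = 32     # assumed group descriptor size in bytes

descs_per_block = BLOCK_SIZE // DESC_SIZE            # 128 descriptors
blocks_per_group = 8 * BLOCK_SIZE                    # 32768 blocks (assumed default)
group_bytes = blocks_per_group * BLOCK_SIZE          # 128 MiB per group
bytes_per_gdt_block = descs_per_block * group_bytes  # 16 GiB per descriptor block
gdt_blocks_per_tib = 2**40 // bytes_per_gdt_block    # 64 descriptor blocks per TiB


def gdt_size_bytes(fs_bytes):
    """Size of the group descriptor table for a filesystem of fs_bytes."""
    gdt_blocks = -(-fs_bytes // bytes_per_gdt_block)  # ceiling division
    return gdt_blocks * BLOCK_SIZE


print(bytes_per_gdt_block // 2**30, "GiB per descriptor block")
print(gdt_blocks_per_tib, "descriptor blocks per TiB")
print(gdt_size_bytes(int(4.5 * 2**40)) // 1024, "KiB for a 4.5 TiB filesystem")
```

Under these assumptions a 4.5 TiB filesystem has a table of 288 blocks,
or 1152 KiB, in line with the estimate of about 1100 kB.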