ext3 + fs > 2Tbyte

Fri Nov 4 07:37:44 UTC 2005

On Nov 04, 2005  16:19 +1100, Vincent.McIntyre at csiro.au wrote:
> >Do you only use the parted "mkfs" or do you actually use the mke2fs 
> >from e2fsprogs? 
> The script does this
>   parted -s /dev/sdb1 print
>   parted -s /dev/sdb1 mklabel gpt
>   parted -s /dev/sdb1 print
>   parted -s /dev/sdb1 mkpart primary 0 10
>   parted -s /dev/sdb1 print
>   parted -s /dev/sdb1 mke2fs 1 ext2
>   parted -s /dev/sdb1 print

Hmm, I don't use parted often, but does it make sense to be making a GPT
disklabel on /dev/sdb1 instead of making it on /dev/sdb?

Note also that there is actually no need to make a partition at all if
you are just going to use the whole device for the filesystem.  This
is particularly interesting with some RAID hardware, since the partition
table adds a 512-byte offset to every single IO, which leaves requests
misaligned with the RAID stripe boundaries and can cause some noticeable
performance problems.

Just do "mke2fs -j /dev/sdb" and be happy.
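
To illustrate the alignment point, here is a minimal Python sketch.  The
64 KiB stripe unit is purely an assumed value for illustration (real arrays
vary), and chunks_touched is a hypothetical helper, not part of any tool.
It counts how many stripe units a single IO touches with and without a
512-byte shift:

```python
def chunks_touched(offset, length, chunk=64 * 1024):
    """Return how many stripe units (chunks) an IO of `length` bytes
    starting at device byte `offset` touches.
    The 64 KiB default chunk size is an assumption for illustration."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1

io_size = 64 * 1024  # one full stripe unit, as issued by the filesystem

# Filesystem directly on the device: the IO starts on a chunk boundary.
print(chunks_touched(0, io_size))      # 1
# Same IO shifted by a 512-byte offset: it now straddles two chunks.
print(chunks_touched(512, io_size))    # 2
```

Whether the extra chunk actually hurts depends on the controller, so treat
this as a picture of the mechanism rather than a measurement.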

> Yes. While you were typing,
>  * I made a teeny 10 Mbyte filesystem (using parted, as above)
>  * mounted
>  * umounted
>  * ran findsuper and od
>  * reboot
>  * ran parted /dev/sdb1 print
>    (repeated, using strace)
>  * ran an straced e2fsck /dev/sdb1
> and got the same error.
> 
> I couldn't quite believe this so I tried it again. Same result.

This sounds like parted isn't doing what you want, and ext3 is not
the source of the problem at all.

> So it is starting to look like the GPT disklabel is causing a problem.

I agree.

> ah, of course. I thought findsuper would respect the partition boundaries
> and stop at the "end" of the filesystem. It did that pre-reboot, e.g. my
> 10Mbyte test above

It DOES respect the partition boundaries, actually.  In fact, if you
point it at a partition (instead of the parent device) it should not
be POSSIBLE for it to read beyond the end of the partition, since the
kernel should prevent that.

>   starting at 0, with 512 byte increments
>        thisoff     block fs_blk_sz  blksz grp last_mount
>           1024         1     10223  1024    0 Thu Jan  1 10:00:00 1970
>        8389632      8193     10223  1024    1 Thu Jan  1 10:00:00 1970
> 
>       10468864: finished with errno 0
> 
> Post-reboot, I get this:
>   starting at 0, with 512 byte increments
>        thisoff     block fs_blk_sz  blksz grp last_mount
>          17920        17     10223  1024    0 Thu Jan  1 10:00:00 1970
>        8406528      8209     10223  1024    1 Thu Jan  1 10:00:00 1970
>      134235648    131089 511999995  4096    1 Thu Jan  1 10:00:00 1970
>      209733120    204817   1023983  1024   25 Thu Jan  1 10:00:00 1970
>      226510336    221201   1023983  1024   27 Thu Jan  1 10:00:00 1970

This would seem to indicate your partition table is being corrupted.

>   # /local/sbin/parted /dev/sdb print
>   Error: The primary GPT table is corrupt, but the backup appears ok, so
>   that will be used.
>   OK/Cancel? OK
>   Disk geometry for /dev/sdb: 0.000-2289288.000 megabytes
>   Disk label type: gpt
>   Minor    Start       End     Filesystem  Name                  Flags
>   1          0.017     10.000  ext2
>   Information: Don't forget to update /etc/fstab, if necessary.

I suspect this is part of the problem.  The GPT disk label is being
written into /dev/sdb1 (which isn't really valid) and upon reboot the
"backup" is being found at the end of the device and doesn't match
the existing partition table on /dev/sdb.
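
One way to see where GPT headers actually ended up is to look for the
"EFI PART" signature directly.  The following is only a sketch: it assumes
512-byte sectors, gpt_headers_present is a hypothetical helper name, and it
relies on the GPT layout of a primary header in LBA 1 and a backup header
in the last LBA.

```python
import os

GPT_SIGNATURE = b"EFI PART"   # first 8 bytes of a GPT header

def gpt_headers_present(path, sector_size=512):
    """Return (primary_found, backup_found) for the device or image at `path`.
    The primary GPT header lives in LBA 1, the backup in the last LBA.
    sector_size=512 is an assumption; adjust it for other devices."""
    with open(path, "rb") as dev:
        size = dev.seek(0, os.SEEK_END)      # total size in bytes
        last_lba = size // sector_size - 1
        found = []
        for lba in (1, last_lba):
            dev.seek(lba * sector_size)
            found.append(dev.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE)
    return tuple(found)
```

Running it (as root) against both /dev/sdb and /dev/sdb1 and comparing the
results should show which of the two the label was really written to.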

>   # strace -o strace.e2fsck.post-parted /local/sbin/e2fsck -n /dev/sdb1
>   e2fsck 1.38 (30-Jun-2005)
>   Couldn't find ext2 superblock, trying backup blocks...
>   /local/sbin/e2fsck: Bad magic number in super-block while trying to open
>   /dev/sdb1

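At this point, you are trying to access a filesystem that starts at some
offset from the start of the partition.  If you want to recover from this
(your real filesystem), what you should probably do is locate the expected
start of the filesystem using findsuper and then copy it onto your backup
device, skipping everything before that point:

dd if=/dev/orig of=/dev/backup bs=offset skip=1

Here "offset" stands for the byte offset of the filesystem start, and
/dev/orig and /dev/backup for your source and backup devices.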

With 4096-byte blocks and 32768 blocks per group, the backup superblocks
should have a byte offset of {1,3,5,7,9,25,...} * 32768 * 4096 from the
start of the filesystem, so subtracting this from the actual offsets found
will tell you where the filesystem is supposed to start.  Checking the
first few backup superblocks (those with group != 0) should make the
answer pretty clear.
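
As a worked example of that subtraction, here is a small Python sketch
applied to the offsets from your post-reboot findsuper output.  The helper
name filesystem_start is made up, and it assumes mke2fs defaults: 8 *
block_size blocks per group, and a first data block of 1 for 1024-byte
blocks (0 otherwise).

```python
def filesystem_start(found_offset, group, block_size, blocks_per_group=None):
    """Byte offset at which the filesystem must start, given the byte offset
    at which findsuper reported the backup superblock for `group`."""
    if blocks_per_group is None:
        # Assumed mke2fs default: one bitmap block's worth of bits per group.
        blocks_per_group = 8 * block_size
    # With 1024-byte blocks the first data block is block 1, otherwise block 0.
    first_data_block = 1 if block_size == 1024 else 0
    expected = (group * blocks_per_group + first_data_block) * block_size
    return found_offset - expected

# Group 1 backup of the 4096-byte-block filesystem (511999995 blocks):
print(filesystem_start(134235648, 1, 4096))    # 17920
# Group 1, 25 and 27 backups of the 1024-byte-block filesystems:
for off, grp in ((8406528, 1), (209733120, 25), (226510336, 27)):
    print(filesystem_start(off, grp, 1024))    # 16896 each time
```

Whether 17920 really is the start of the lost filesystem would still need
to be confirmed against further backup superblocks (groups 3, 5, 7, ...).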

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.