ext3 filesystem corruption - more info

Thu Apr 13 00:06:11 UTC 2006

I've seen similar errors when attempting to have a >2TB filesystem on a 
32-bit RHEL3 machine.  We have since implemented a 3.5TB filesystem on a 
64-bit RHEL4 machine.
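
For context, a 2TB ceiling is what you would expect if some layer of the storage stack counts 512-byte sectors in a 32-bit integer. Whether that is what actually bit the machine above is an assumption, not something established in this thread. A minimal sketch of the arithmetic (SECTOR_BYTES and max_device_bytes are names made up for illustration):

```python
# Assumption: device offsets are counted in 512-byte sectors held in a
# 32-bit integer. Neither detail is confirmed for the systems in this thread.
SECTOR_BYTES = 512
TIB = 2 ** 40

def max_device_bytes(counter_bits, signed=False):
    """Bytes addressable with a sector counter of the given width."""
    usable_bits = counter_bits - 1 if signed else counter_bits
    return (2 ** usable_bits) * SECTOR_BYTES

# An unsigned 32-bit sector count tops out at 2 TiB, a signed one at 1 TiB.
print(max_device_bytes(32) / TIB, "TiB (unsigned)")
print(max_device_bytes(32, signed=True) / TIB, "TiB (signed)")
```

A 64-bit sector counter removes this particular ceiling, which fits the 3.5TB filesystem working on the 64-bit machine, though other limits may still apply.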

It would help if you could answer the question Andreas Dilger posed:

"Does this imply you have a 6TB ext3 filesystem?"

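The numbers in the quoted kernel messages below can be checked mechanically while waiting for that answer. The following is only a sketch: the log lines are copied from the quoted messages with the timestamps dropped, the helper names are made up, and the units of want/limit are left as "whatever the kernel printed" since they are not stated in the thread.

```python
import re

# Lines copied from the quoted kernel messages (timestamps dropped).
LOG = """\
EXT3-fs error (device sd(8,49)): ext3_free_blocks: Freeing blocks not in datazone - block =    3443589120, count = 1
EXT3-fs error (device sd(8,49)): ext3_free_blocks: Freeing blocks not in datazone - block = 2113834232, count = 1
08:31: rw=0, want=1891463980, limit=1722264358
08:31: rw=0, want=1824250576, limit=1722264358
"""

WANT_LIMIT = re.compile(r"want=(\d+), limit=(\d+)")
BLOCK = re.compile(r"block =\s*(\d+)")

def beyond_device(log):
    """Return (want, limit) pairs where the request exceeds the device limit."""
    pairs = [(int(w), int(l)) for w, l in WANT_LIMIT.findall(log)]
    return [(w, l) for w, l in pairs if w > l]

def blocks_over_signed_32(log):
    """Return block numbers too large for a signed 32-bit integer."""
    return [int(b) for b in BLOCK.findall(log) if int(b) >= 2 ** 31]

for want, limit in beyond_device(LOG):
    print(f"want={want} exceeds limit={limit} by {want - limit}")
print("block numbers >= 2**31:", blocks_over_signed_32(LOG))
```

Both "beyond end of device" requests are past the reported limit, and one of the freed block numbers does not fit in a signed 32-bit value. That is consistent with, but does not prove, a size or overflow problem rather than random damage.
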
Damian

On Wed, 12 Apr 2006, Sev Binello wrote:

> 
> Hi -
> 
> In case this helps,
> we got the following messages from EXT3 before the filesystem went
> Does anyone recognize these?
> 
> //seems to mount okay
>    Mar 25 17:52:30 acnlin82 kernel: EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,33), internal journal
>    Mar 25 17:52:30 acnlin82 kernel: EXT3-fs: recovery complete.
>    Mar 26 00:04:01 acnlin82 kernel: EXT3-fs: mounted filesystem with ordered data mode.
> 
> //as soon as NFS clients start, we get a TON of errors like this
> Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: Freeing blocks not in datazone - block =    3443589120, count = 1
> Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: Freeing blocks not in datazone - block = 2113834232, count = 1
> Mar 26 00:07:22 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: bit already cleared for block 49125
> 
> //interspersed with some of these
> Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device
> Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1891463980, limit=1722264358
> Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device
> Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1824250576, limit=1722264358
> Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device
> 
> Then we had to reboot, and basically the filesystem is shot.
> 
> Thanks
> -Sev
> 
> Sev Binello wrote:
>       Hi -
>
>       We have had 3 rather major occurrences of ext3 filesystem corruption
>       lately, i.e. so bad we couldn't even mount, and fsck didn't help.
>
>       I am looking for pointers that could help us investigate the root
>       cause.
>
>       In general...
>       We are running RedHat WS 3 Update 6, 2.4.21-40.2.ELsmp or
>       2.4.21-37.ELsmp.
>
>       We have a small SAN system that looks like this:
>       3 NFS servers, each containing 2 QLogic HBAs connected to 2
>       QLogic switches, connected to an nstor (now xyratex) 6TB RAID system
>       containing 2 (active-active) controllers.
>
>       On the first 2 occasions one of the controllers was failed over.
>       On a 3rd occasion both SAN switches lost power, and the hosts and RAID
>       lost communication.
>
>       On all occasions the QLogic failover driver tried to start up on the
>       alternate HBA.
>
>       On the first 2 instances we sort of tried to blame the controller.
>       On the 3rd, that was harder to do since the RAID system and the hosts
>       stayed up but lost communication.
>
>       I can provide more detail if anyone has any info on how to proceed.
>
>       Thanks
>       -Sev
> 
> 
>
>  -- 
> 
> Sev Binello
> Brookhaven National Laboratory
> Upton, New York
> 631-344-5647
> sev at bnl.gov
> 
>

Damian Menscher
-- 
-=#| <menscher at uiuc.edu> www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-