From jemf at gabcmt.eb.mil.br  Sun Jan  1 21:54:40 2006
From: jemf at gabcmt.eb.mil.br (JEMF)
Date: Sun, 01 Jan 2006 19:54:40 -0200
Subject: Questions about partitioning and ext3
Message-ID: <dp9j34$rna$1@sea.gmane.org>

Hello all!

I have a 512 MB Kingston flash disk. When I try to create a 460 MB
(471,040 KB) partition, it is created with 460.6 MB (471,665 KB)
instead.

------------------------------------------------
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         951      471665   83  Linux
------------------------------------------------

Why? Geometry?

2nd Question:

When I format the partition with ext3, the df -k command returns:

------------------------------------------------
Filesystem           1K-blocks      Used  Available Use%  Mounted on
/dev/sdb1               456730       8239    424908   2%    /mnt
------------------------------------------------

I think the 8239 KB (8.05 MB) is used by the journal. But the total
number of blocks decreased after formatting (from 471665 to 456730). Why?

Thanks.



From daniel at rimspace.net  Mon Jan  2 00:11:37 2006
From: daniel at rimspace.net (Daniel Pittman)
Date: Mon, 02 Jan 2006 11:11:37 +1100
Subject: Questions about partitioning and ext3
References: <dp9j34$rna$1@sea.gmane.org>
Message-ID: <87lkxzo5dy.fsf@rimspace.net>

JEMF <jemf at gabcmt.eb.mil.br> writes:

> I have a 512 MB Kingston flash disk. When I try to create a 460 MB
> (471,040 KB) partition, it is created with 460.6 MB (471,665 KB)
> instead.
>
> ------------------------------------------------
>   Device Boot      Start         End      Blocks   Id  System
> /dev/sdb1               1         951      471665   83  Linux
> ------------------------------------------------
>
> Why? Geometry?

Yup; the flash device pretends to have a geometry so that it doesn't
confuse software that still lives in DOS land, and that caused it.

> 2nd Question:
>
> When I format the partition with ext3, the df -k command returns:
>
> ------------------------------------------------
> Filesystem           1K-blocks      Used  Available Use%  Mounted on
> /dev/sdb1               456730       8239    424908   2%    /mnt
> ------------------------------------------------
>
> I think the 8239 KB (8.05 MB) is used by the journal. But the total
> number of blocks decreased after formatting (from 471665 to 456730). Why?

The difference, of around 23,000 blocks, is five percent of the
available space on the filesystem.

The default reserved block count for root is five percent...

    Daniel



From jemf at gabcmt.eb.mil.br  Mon Jan  2 01:35:51 2006
From: jemf at gabcmt.eb.mil.br (JEMF)
Date: Sun, 01 Jan 2006 23:35:51 -0200
Subject: Questions about partitioning and ext3
In-Reply-To: <87lkxzo5dy.fsf@rimspace.net>
References: <dp9j34$rna$1@sea.gmane.org> <87lkxzo5dy.fsf@rimspace.net>
Message-ID: <dpa01o$pkd$1@sea.gmane.org>

Daniel Pittman wrote:
> JEMF <jemf at gabcmt.eb.mil.br> writes:
>>Why? Geometry?
> 
> Yup; the flash device pretends to have a geometry so that it doesn't
> confuse software that still lives in DOS land, and that caused it.

How does the system calculate this geometry?

>>I think the 8239 KB (8.05 MB) is used by the journal. But the total
>>number of blocks decreased after formatting (from 471665 to 456730). Why?
> 
> The difference, of around 23,000 blocks, is five percent of the
> available space on the filesystem.

No! The difference is 14935 blocks! In my previous message I was asking
about the difference between the sizes of the unformatted partition and
the formatted partition.

Can you help me again?

Thanks!



From daniel at rimspace.net  Mon Jan  2 02:43:32 2006
From: daniel at rimspace.net (Daniel Pittman)
Date: Mon, 02 Jan 2006 13:43:32 +1100
Subject: Questions about partitioning and ext3
References: <dp9j34$rna$1@sea.gmane.org> <87lkxzo5dy.fsf@rimspace.net>
	<dpa01o$pkd$1@sea.gmane.org>
Message-ID: <878xtznycr.fsf@rimspace.net>

JEMF <jemf at gabcmt.eb.mil.br> writes:
> Daniel Pittman wrote:
>> JEMF <jemf at gabcmt.eb.mil.br> writes:
>>>Why? Geometry?
>> Yup; the flash device pretends to have a geometry so that it doesn't
>> confuse software that still lives in DOS land, and that caused it.
>
> How does the system calculate this geometry?

Basically, magic.  Seriously, there are a bunch of heuristics, or it can
come from the DOS partition table, or from the BIOS, but it really is
pretty much just invented in (hopefully) the same way that DOS-ish
operating systems and the BIOS would invent it, so that they also work.
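
As a rough illustration of how such a pretend geometry leads to the odd
partition size: the fdisk output quoted in this thread shows the partition
ending on cylinder 951 with 471665 blocks of 1 KB. The sketch below assumes
a geometry of 16 heads and 62 sectors per track. That geometry is never
stated in the thread; it is simply a combination that reproduces the
reported number, together with the DOS convention of leaving the first
track free for the partition table.

```python
SECTOR_BYTES = 512  # classic sector size


def partition_kb(heads, sectors_per_track, first_cyl, last_cyl):
    """Size in 1 KB blocks of a partition made of whole cylinders.

    Follows the DOS convention that a partition starting on cylinder 1
    leaves the first track free for the partition table.
    """
    sectors_per_cyl = heads * sectors_per_track
    sectors = (last_cyl - first_cyl + 1) * sectors_per_cyl
    if first_cyl == 1:
        sectors -= sectors_per_track
    return sectors * SECTOR_BYTES // 1024


# Assumed geometry: NOT stated anywhere in the thread.
HEADS, SECTORS_PER_TRACK = 16, 62

cylinder_kb = HEADS * SECTORS_PER_TRACK * SECTOR_BYTES // 1024
size_kb = partition_kb(HEADS, SECTORS_PER_TRACK, 1, 951)

print(cylinder_kb)           # 496
print(size_kb)               # 471665
print(471040 / cylinder_kb)  # requested size in cylinders: not a whole number
```

Under that assumption one cylinder is 496 KB, so the requested 471,040 KB
cannot be hit exactly and the partition ends on a cylinder boundary instead.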

>>>I think the 8239 KB (8.05 MB) is used by the journal. But the total
>>>number of blocks decreased after formatting (from 471665 to 456730). Why?
>> The difference, of around 23,000 blocks, is five percent of the
>> available space on the filesystem.
>
> No! The difference is 14935 blocks! In my previous message I was asking
> about the difference between the sizes of the unformatted partition and
> the formatted partition.

You are right -- my mental math is broken this morning.  I approximated
five percent in my head, then managed to mis-subtract the two numbers
to make them match.  How embarrassing. :/

My expectation would be that the difference is caused by filesystem
meta-data, such as inode allocation tables, which consume some storage
space on the raw device, but are not available for file storage.

The ext3 file system uses tables of fixed size and location, so that space
is consumed regardless of the space used for files.

   Daniel
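
The figures from the fdisk and df output can be split up with a little
arithmetic. This is only a sketch using the numbers quoted in this thread.
Reading the first gap as metadata overhead follows the explanation above
rather than any measurement, and the 5% figure is the default root
reservation mentioned earlier.

```python
# All values in 1 KB blocks, copied from the fdisk and df -k output.
partition_kb = 471665   # fdisk "Blocks"
df_total_kb = 456730    # df "1K-blocks"
df_used_kb = 8239       # df "Used"
df_avail_kb = 424908    # df "Available"

# Space that never shows up in df at all (presumably filesystem tables).
overhead_kb = partition_kb - df_total_kb

# Space that df counts as neither used nor available to ordinary users.
reserved_kb = df_total_kb - df_used_kb - df_avail_kb

print(overhead_kb)   # 14935
print(reserved_kb)   # 23583
print(f"{reserved_kb / partition_kb:.2%} of the partition is reserved")
```

So the "around 23,000 blocks" do appear in the numbers, but as the gap
between total, used and available space, not as the gap between the
partition size and the filesystem size.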



From bunk at stusta.de  Mon Jan  2 16:09:39 2006
From: bunk at stusta.de (Adrian Bunk)
Date: Mon, 2 Jan 2006 17:09:39 +0100
Subject: 2.6.15-rc6 OOPS
In-Reply-To: <20051224200336.GF12561@kmv.ru>
References: <20051224200336.GF12561@kmv.ru>
Message-ID: <20060102160939.GG17398@stusta.de>

On Sat, Dec 24, 2005 at 11:03:36PM +0300, Andrey J. Melnikoff (TEMHOTA) wrote:

> Hello.


Hi Andrey,


> Please CC me, I'm not subscribed.
> 
> Kernel 2.6.15-rc6 OOPS:
> 
> kernel: general protection fault: 0000 [#1]
> kernel: SMP
> kernel: Modules linked in: ipt_REDIRECT ipt_LOG ipt_TOS ipt_TCPMSS ipt_tos
> ip_nat_ftp ipt_tcpmss iptable_nat ip_nat iptable_mangle iptable_filter
> ipt_multiport ipt_mac ipt_state ipt_limit ipt_conntrack ip_conntrack_ftp 
> ip_conntrack ip_tables af_packet ipv6 pcspkr floppy i2c_piix4 i2c_core 
> ohci_hcd usbcore aic7xxx scsi_transport_spi psmouse ide_disk ide_cd 
> cdrom genrtc
> kernel: CPU:    0
> kernel: EIP:    0060:[<c019d70f>]    Not tainted VLI
> kernel: EFLAGS: 00010286   (2.6.15-rc6)
> kernel: EIP is at ext3_find_entry+0x18f/0x3e0
> kernel: eax: ffffffff   ebx: 00010001   ecx: 00000002   edx: 00000000
> kernel: esi: 00000000   edi: ffffffff   ebp: 00000000   esp: f71b9d60
> kernel: ds: 007b   es: 007b   ss: 0068
> kernel: Process smbd (pid: 2999, threadinfo=f71b8000 task=f7aee530)
> kernel: Stack: 00000000 f71b9db8 00000000 00000027 000005b4 ffffffff f71a62e8 00000000
> kernel:        f71b9ea8 00001000 f71a636c 00000001 00000001 00010001 00000001 00000000
> kernel:        00000000 00000000 f7caf400 f71b9df0 f71503d4 ffffffff 00000000 f7159c68
> kernel: Call Trace:
> kernel:  [<c025eb29>] memcpy_toiovec+0x29/0x50
> kernel:  [<c019dbda>] ext3_lookup+0x3a/0xc0
> kernel:  [<c0167c8e>] real_lookup+0xae/0xd0
> kernel:  [<c0167f35>] do_lookup+0x85/0x90
> kernel:  [<c016872f>] __link_path_walk+0x7ef/0xdd0
> kernel:  [<c0168d5e>] link_path_walk+0x4e/0xd0
> kernel:  [<c016907f>] path_lookup+0x9f/0x170
> kernel:  [<c01693cf>] __user_walk+0x2f/0x60
> kernel:  [<c0163b5d>] vfs_stat+0x1d/0x60
> kernel:  [<c01641df>] sys_stat64+0xf/0x30
> kernel:  [<c0121271>] sys_gettimeofday+0x21/0x60
> kernel:  [<c0102e59>] syscall_call+0x7/0xb
> kernel: Code: 07 7e 88 89 f6 8d bc 27 00 00 00 00 8b 5c 24 34 8b 44 9c 5c 43 89
> 5c 24 34 85 c0 89 44 24 14 89 44 24 54 0f 84 b7 00 00 00 89 c7 <8b> 00 a8 04 75
> 07 8b 47 0c 85 c0 75 11 8b 44 24 14 e8 fb e1 fb


Is this Oops in any way reproducible?

If yes, does it occur in earlier kernels like 2.6.14.x?


> After the OOPS the system keeps working, but the smbd processes are in 'D' state:
> 
> kernel: smbd          D 00000000     0  3000   2871          3001  2872 (NOTLB)
> kernel: f71b9dbc 000005b4 000005b4 00000000 00000000 f71b9ea8 c1b70dc0 00000000
> kernel:        7fffffff c031d940 c1807400 00000000 998db100 003d099f c0300b20 f7aee530
> kernel:        f7aee658 f71a63e0 f71a63e8 00000292 f7aee530 c02ba525 00000001 f7aee530
> kernel: Call Trace:
> kernel:  [<c02ba525>] __down+0x75/0xe0
> kernel:  [<c0118d70>] default_wake_function+0x0/0x10
> kernel:  [<c0172804>] __d_lookup+0xa4/0x110
> kernel:  [<c02b8e8f>] __down_failed+0x7/0xc
> kernel:  [<c016bb42>] .text.lock.namei+0x8/0x1e6
> kernel:  [<c0167f35>] do_lookup+0x85/0x90
> kernel:  [<c016872f>] __link_path_walk+0x7ef/0xdd0
> kernel:  [<c0168d5e>] link_path_walk+0x4e/0xd0
> kernel:  [<c017ce14>] __mark_inode_dirty+0x104/0x1b0
> kernel:  [<c016907f>] path_lookup+0x9f/0x170
> kernel:  [<c01693cf>] __user_walk+0x2f/0x60
> kernel:  [<c0163b5d>] vfs_stat+0x1d/0x60
> kernel:  [<c017ce14>] __mark_inode_dirty+0x104/0x1b0
> kernel:  [<c0121a0f>] current_fs_time+0x5f/0x70
> kernel:  [<c01641df>] sys_stat64+0xf/0x30
> kernel:  [<c0174832>] update_atime+0x52/0x90
> kernel:  [<c016cdc5>] vfs_readdir+0x85/0x90
> kernel:  [<c0171891>] dput+0x71/0x1b0
> kernel:  [<c01763bb>] mntput_no_expire+0x1b/0x70
> kernel:  [<c0159d8c>] filp_close+0x3c/0x80
> kernel:  [<c0102e59>] syscall_call+0x7/0xb
> 
> kernel: smbd          D 00000000     0  3001   2871          3008  3000 (NOTLB)
> kernel: f71fbdbc 000005b4 000005b4 00000000 00000000 f71fbea8 c1b70e60 00000000
> kernel:        7fffffff c031d940 c1807400 00000000 0dfc9800 003d09ad c0300b20 f7aeea30
> kernel:        f7aeeb58 f71a63e0 f71a63e8 00000292 f7aeea30 c02ba525 00000001 f7aeea30
> kernel: Call Trace:
> kernel:  [<c02ba525>] __down+0x75/0xe0
> kernel:  [<c0118d70>] default_wake_function+0x0/0x10
> kernel:  [<c0172804>] __d_lookup+0xa4/0x110
> kernel:  [<c02b8e8f>] __down_failed+0x7/0xc
> kernel:  [<c016bb42>] .text.lock.namei+0x8/0x1e6
> kernel:  [<c0167f35>] do_lookup+0x85/0x90
> kernel:  [<c016872f>] __link_path_walk+0x7ef/0xdd0
> kernel:  [<c0168d5e>] link_path_walk+0x4e/0xd0
> kernel:  [<c017cd72>] __mark_inode_dirty+0x62/0x1b0
> kernel:  [<c016907f>] path_lookup+0x9f/0x170
> kernel:  [<c01693cf>] __user_walk+0x2f/0x60
> kernel:  [<c0163b5d>] vfs_stat+0x1d/0x60
> kernel:  [<c017cd72>] __mark_inode_dirty+0x62/0x1b0
> kernel:  [<c0121a0f>] current_fs_time+0x5f/0x70
> kernel:  [<c01641df>] sys_stat64+0xf/0x30
> kernel:  [<c0174832>] update_atime+0x52/0x90
> kernel:  [<c016cdc5>] vfs_readdir+0x85/0x90
> kernel:  [<c0171891>] dput+0x71/0x1b0
> kernel:  [<c01763bb>] mntput_no_expire+0x1b/0x70
> kernel:  [<c0159d8c>] filp_close+0x3c/0x80
> kernel:  [<c0102e59>] syscall_call+0x7/0xb
> 
> kernel: smbd          D 00000000     0  3008   2871          3015  3001 (NOTLB)
> kernel: f736bdbc 000005b4 000005b4 00000000 00000000 f736bea8 f79e4e00 00000000
> kernel:        7fffffff c031d940 c1807400 00000000 66f2b100 003d09bd c0300b20 f7b3b0b0
> kernel:        f7b3b1d8 f71a63e0 f71a63e8 00000292 f7b3b0b0 c02ba525 00000001 f7b3b0b0
> kernel: Call Trace:
> kernel:  [<c02ba525>] __down+0x75/0xe0
> kernel:  [<c0118d70>] default_wake_function+0x0/0x10
> kernel:  [<c0172804>] __d_lookup+0xa4/0x110
> kernel:  [<c02b8e8f>] __down_failed+0x7/0xc
> kernel:  [<c016bb42>] .text.lock.namei+0x8/0x1e6
> kernel:  [<c0167f35>] do_lookup+0x85/0x90
> kernel:  [<c016872f>] __link_path_walk+0x7ef/0xdd0
> kernel:  [<c0168d5e>] link_path_walk+0x4e/0xd0
> kernel:  [<c017cd72>] __mark_inode_dirty+0x62/0x1b0
> kernel:  [<c016907f>] path_lookup+0x9f/0x170
> kernel:  [<c01693cf>] __user_walk+0x2f/0x60
> kernel:  [<c0163b5d>] vfs_stat+0x1d/0x60
> kernel:  [<c017cd72>] __mark_inode_dirty+0x62/0x1b0
> kernel:  [<c0121a0f>] current_fs_time+0x5f/0x70
> kernel:  [<c01641df>] sys_stat64+0xf/0x30
> kernel:  [<c0174832>] update_atime+0x52/0x90
> kernel:  [<c016cdc5>] vfs_readdir+0x85/0x90
> kernel:  [<c0171891>] dput+0x71/0x1b0
> kernel:  [<c01763bb>] mntput_no_expire+0x1b/0x70
> kernel:  [<c0159d8c>] filp_close+0x3c/0x80
> kernel:  [<c0102e59>] syscall_call+0x7/0xb
> 
> kernel: smbd          D C01641BA     0  3015   2871          3036  3008 (NOTLB)
> kernel: f7273f30 bfe3253c 00000000 c01641ba 00000804 00000000 00000000 0048815f
> kernel:        000041c0 00000008 c1807400 00000000 66224500 003d09e6 c0300b20 f7b3bab0
> kernel:        f7b3bbd8 f71a63e0 f71a63e8 00000286 f7b3bab0 c02ba525 00000001 f7b3bab0
> kernel: Call Trace:
> kernel:  [<c01641ba>] cp_new_stat64+0xea/0x100
> kernel:  [<c02ba525>] __down+0x75/0xe0
> kernel:  [<c0118d70>] default_wake_function+0x0/0x10
> kernel:  [<c02b8e8f>] __down_failed+0x7/0xc
> kernel:  [<c016d070>] filldir64+0x0/0xf0
> kernel:  [<c016d23f>] .text.lock.readdir+0x8/0x29
> kernel:  [<c016d1d7>] sys_getdents64+0x77/0xd7
> kernel:  [<c016c36e>] do_fcntl+0x16e/0x1e0
> kernel:  [<c0102e59>] syscall_call+0x7/0xb
> 
> 
> Hardware: IBM eServer xSeries 330, 1 GB memory, ServeRaid 4Mx.
> 
> Config, other data - on request.
> 
> -- 
>  Best regards, TEMHOTA-RIPN aka MJA13-RIPE
>  System Administrator. mailto:temnota at kmv.ru
> 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



From jemf at gabcmt.eb.mil.br  Thu Jan  5 12:47:25 2006
From: jemf at gabcmt.eb.mil.br (JEMF)
Date: Thu, 05 Jan 2006 10:47:25 -0200
Subject: Questions about partitioning and ext3
In-Reply-To: <dp9j34$rna$1@sea.gmane.org>
References: <dp9j34$rna$1@sea.gmane.org>
Message-ID: <dpj4gu$djh$1@sea.gmane.org>

Does anybody have more ideas about this subject?

JEMF wrote:
> Hello all!
> 
> I have a 512 MB Kingston flash disk. When I try to create a 460 MB
> (471,040 KB) partition, it is created with 460.6 MB (471,665 KB)
> instead.
> 
> ------------------------------------------------
>   Device Boot      Start         End      Blocks   Id  System
> /dev/sdb1               1         951      471665   83  Linux
> ------------------------------------------------
> 
> Why? Geometry?
> 
> 2nd Question:
> 
> When I format the partition with ext3, the df -k command returns:
> 
> ------------------------------------------------
> Filesystem           1K-blocks      Used  Available Use%  Mounted on
> /dev/sdb1               456730       8239    424908   2%    /mnt
> ------------------------------------------------
> 
> I think the 8239 KB (8.05 MB) is used by the journal. But the total
> number of blocks decreased after formatting (from 471665 to 456730). Why?
> 
> Thanks.



From teeeelo at googlemail.com  Fri Jan  6 10:54:07 2006
From: teeeelo at googlemail.com (Thilo)
Date: Fri, 06 Jan 2006 11:54:07 +0100
Subject: Folder with a questionmark "?mnt1"
Message-ID: <dpli8c$ecn$1@sea.gmane.org>

Hey guys,
I had a folder ~/mnt1 on which an smbmount was mounted.

I was browsing the subnet with smbc as a normal user when my ssh
connection dropped because my remote WLAN disconnected. :-/

Now I cannot access mnt1, and neither can root. The folder listing takes
a long time and no longer shows mnt1.
Midnight Commander (mc) shows the folder in red with the name ?mnt1,
also after a long delay.
CPU usage is quite normal.

I tried chattr -V -R -i ./mnt1 and got:
Input/output error while reading the status of ./mnt1

Does anyone know what is happening?

Thanks



From teeeelo at googlemail.com  Fri Jan  6 17:02:33 2006
From: teeeelo at googlemail.com (Thilo)
Date: Fri, 06 Jan 2006 18:02:33 +0100
Subject: Folder with a questionmark "?mnt1"
In-Reply-To: <dpli8c$ecn$1@sea.gmane.org>
References: <dpli8c$ecn$1@sea.gmane.org>
Message-ID: <dpm7r7$rf4$1@sea.gmane.org>

Hey,
I fixed it. Sorry, my Samba caused that error.
Anyway, I have to check my filesystem because fsck.ext3 shows several
errors... that's why I asked my question here in this newsgroup...

But what did that question mark mean?
Have fun! :-)



From temnota at kmv.ru  Mon Jan  9 17:28:35 2006
From: temnota at kmv.ru (Andrey J. Melnikoff (TEMHOTA))
Date: Mon, 9 Jan 2006 20:28:35 +0300
Subject: 2.6.15-rc6 OOPS
In-Reply-To: <20060102160939.GG17398@stusta.de>
References: <20051224200336.GF12561@kmv.ru> <20060102160939.GG17398@stusta.de>
Message-ID: <20060109172835.GB2724@kmv.ru>

Hi Adrian Bunk!
On Mon, Jan 02, 2006 at 05:09:39PM +0100, Adrian Bunk wrote:

> On Sat, Dec 24, 2005 at 11:03:36PM +0300, Andrey J. Melnikoff (TEMHOTA) wrote:
> 
> > Please CC me, I'm not subscribed.
> > 
> > Kernel 2.6.15-rc6 OOPS:
> > 
> > kernel: general protection fault: 0000 [#1]
> > kernel: SMP
> > kernel: Modules linked in: ipt_REDIRECT ipt_LOG ipt_TOS ipt_TCPMSS ipt_tos
> > ip_nat_ftp ipt_tcpmss iptable_nat ip_nat iptable_mangle iptable_filter
> > ipt_multiport ipt_mac ipt_state ipt_limit ipt_conntrack ip_conntrack_ftp 
> > ip_conntrack ip_tables af_packet ipv6 pcspkr floppy i2c_piix4 i2c_core 
> > ohci_hcd usbcore aic7xxx scsi_transport_spi psmouse ide_disk ide_cd 
> > cdrom genrtc
> > kernel: CPU:    0
> > kernel: EIP:    0060:[<c019d70f>]    Not tainted VLI
> > kernel: EFLAGS: 00010286   (2.6.15-rc6)
> > kernel: EIP is at ext3_find_entry+0x18f/0x3e0
> > kernel: eax: ffffffff   ebx: 00010001   ecx: 00000002   edx: 00000000
> > kernel: esi: 00000000   edi: ffffffff   ebp: 00000000   esp: f71b9d60
> > kernel: ds: 007b   es: 007b   ss: 0068
> > kernel: Process smbd (pid: 2999, threadinfo=f71b8000 task=f7aee530)
> > kernel: Stack: 00000000 f71b9db8 00000000 00000027 000005b4 ffffffff f71a62e8 00000000
> > kernel:        f71b9ea8 00001000 f71a636c 00000001 00000001 00010001 00000001 00000000
> > kernel:        00000000 00000000 f7caf400 f71b9df0 f71503d4 ffffffff 00000000 f7159c68
> > kernel: Call Trace:
> > kernel:  [<c025eb29>] memcpy_toiovec+0x29/0x50
> > kernel:  [<c019dbda>] ext3_lookup+0x3a/0xc0
> > kernel:  [<c0167c8e>] real_lookup+0xae/0xd0
> > kernel:  [<c0167f35>] do_lookup+0x85/0x90
> > kernel:  [<c016872f>] __link_path_walk+0x7ef/0xdd0
> > kernel:  [<c0168d5e>] link_path_walk+0x4e/0xd0
> > kernel:  [<c016907f>] path_lookup+0x9f/0x170
> > kernel:  [<c01693cf>] __user_walk+0x2f/0x60
> > kernel:  [<c0163b5d>] vfs_stat+0x1d/0x60
> > kernel:  [<c01641df>] sys_stat64+0xf/0x30
> > kernel:  [<c0121271>] sys_gettimeofday+0x21/0x60
> > kernel:  [<c0102e59>] syscall_call+0x7/0xb
> > kernel: Code: 07 7e 88 89 f6 8d bc 27 00 00 00 00 8b 5c 24 34 8b 44 9c 5c 43 89
> > 5c 24 34 85 c0 89 44 24 14 89 44 24 54 0f 84 b7 00 00 00 89 c7 <8b> 00 a8 04 75
> > 07 8b 47 0c 85 c0 75 11 8b 44 24 14 e8 fb e1 fb
> 
> 
> Is this Oops in any way reproducible?
No. We replaced the server, and this kernel works on the new hardware. I
think this was a memory/hardware/overheating problem.
 
> If yes, does it occur in earlier kernels like 2.6.14.x?

Sorry for the noise and the long delay.

-- 
 Best regards, TEMHOTA-RIPN aka MJA13-RIPE
 System Administrator. mailto:temnota at kmv.ru



From cpwright at cpwright.com  Thu Jan 12 17:07:51 2006
From: cpwright at cpwright.com (Charles P. Wright)
Date: Thu, 12 Jan 2006 12:07:51 -0500
Subject: Extended Attribute Write Performance
Message-ID: <1137085671.30101.0.camel@localhost.localdomain>

Hello,

I'm writing an application that makes pretty extensive use of extended
attributes to store file attributes on Ext2.  I used a profiling tool
developed by my colleague Nikolai Joukov at SUNY Stony Brook to dig a
bit deeper into the performance of my application.

In the course of my benchmark, there are 54247 setxattr operations
during a 54-second run.  They use about 10.56 seconds of that time, which
seemed to be a rather outsized performance toll to me (~40k writes took
only 10% as long).

After looking at the profile, 27 of those writes end up taking 7.74
seconds.  That works out to roughly 286 ms per call; which seems a bit
high.

The workload is not memory constrained (the working set is 50MB + 5000
files).  Each file has one extended attribute block that contains two
attributes totaling 32 bytes.  The attributes are unique (random
actually), so there isn't any sharing.

Can someone provide me with some intuition as to why there are so many
writes that reach the disk, and why they take so long?  I would expect
that the operations shouldn't take much longer than a seek (on the order
of 10 ms, not 200+).

Charles
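
The figures above can be cross-checked with a few lines of arithmetic.
This is only a sketch: the function name and dictionary keys are made up
for illustration, and the inputs are the four numbers quoted in this
message.

```python
def summarize(total_ops, total_s, slow_ops, slow_s):
    """Split setxattr timings into the slow outliers and everything else."""
    fast_ops = total_ops - slow_ops
    fast_s = total_s - slow_s
    return {
        "slow_ms_per_call": slow_s / slow_ops * 1000,
        "fast_ms_per_call": fast_s / fast_ops * 1000,
        "slow_share_of_calls_pct": slow_ops / total_ops * 100,
        "slow_share_of_time_pct": slow_s / total_s * 100,
    }


stats = summarize(total_ops=54247, total_s=10.56, slow_ops=27, slow_s=7.74)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The 27 slow calls come out at roughly 287 ms each, while the remaining
calls average well under a tenth of a millisecond, so nearly three
quarters of the setxattr time sits in about 0.05% of the calls.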



From adilger at clusterfs.com  Thu Jan 12 19:52:03 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 12 Jan 2006 12:52:03 -0700
Subject: Extended Attribute Write Performance
In-Reply-To: <1137085671.30101.0.camel@localhost.localdomain>
References: <1137085671.30101.0.camel@localhost.localdomain>
Message-ID: <20060112195203.GD3682@schatzie.adilger.int>

On Jan 12, 2006  12:07 -0500, Charles P. Wright wrote:
> I'm writing an application that makes pretty extensive use of extended
> attributes to store file attributes on Ext2.  I used a profiling tool
> developed by my colleague Nikolai Joukov at SUNY Stony Brook to dig a
> bit deeper into the performance of my application.

Presumably you are using ext3 and not ext2, given posting to this list?

> In the course of my benchmark, there are 54247 setxattr operations
> during a 54-second run.  They use about 10.56 seconds of that time, which
> seemed to be a rather outsized performance toll to me (~40k writes took
> only 10% as long).
> 
> After looking at the profile, 27 of those writes end up taking 7.74
> seconds.  That works out to roughly 286 ms per call; which seems a bit
> high.
> 
> The workload is not memory constrained (the working set is 50MB + 5000
> files).  Each file has one extended attribute block that contains two
> attributes totaling 32 bytes.  The attributes are unique (random
> actually), so there isn't any sharing.
> 
> Can someone provide me with some intuition as to why there are so many
> writes that reach the disk, and why they take so long?  I would expect
> that the operations shouldn't take much longer than a seek (on the order
> of 10 ms, not 200+).

I suspect the reason is that the journal is getting full and jbd is
doing a full journal checkpoint because it has run out of space for
new transactions.  This is because each external EA block consumes
a lot of space (4kB) regardless of how small the EA is, and this can
eat up the journal quickly.  54247 * 4kB = 211MB, much larger than
the default 32MB (or maybe 128MB with newer e2fsprogs) journal size.

Solutions to your specific problem are to use large inodes and the
fast EA space ("mke2fs -j -I 256 ..." makes 256-byte inodes, 128 bytes
left for EAs) and/or to increase the journal size ("mke2fs -J size=400",
though even 400MB won't be enough for this test case).
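
The back-of-the-envelope estimate above can be written out as follows.
This is a sketch under the assumption that every setxattr call journals
exactly one 4kB EA block and nothing else. Real journal traffic also
includes other metadata blocks, so treat the result as a lower bound
rather than a measurement. The constant and function names are made up
for illustration.

```python
EA_BLOCK_KB = 4          # external EA block size quoted above
SETXATTR_CALLS = 54247   # from the benchmark description


def ea_traffic_mb(calls, block_kb=EA_BLOCK_KB):
    """Journal traffic from EA blocks alone, in MB (1 MB = 1024 kB)."""
    return calls * block_kb / 1024


traffic = ea_traffic_mb(SETXATTR_CALLS)
print(f"EA blocks alone: {traffic:.1f} MB")

# Journal sizes mentioned in this message.
for journal_mb in (32, 128, 400):
    print(f"journal {journal_mb} MB: filled {traffic / journal_mb:.2f} times")
```

Even on this lower bound a 32MB journal is filled several times over
during the run, which fits the suspicion that full checkpoints are behind
the latency spikes.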

We implemented the large inodes + fast EAs (included in 2.6.12+ kernels)
to avoid the need to do any seeking when reading/writing EAs, in addition
to the benefit of not writing so much data (mostly unused) to disk.
This showed a huge performance increase for Lustre metadata servers
(which use EAs on every file) and also with Samba4 testing.

We've run into similar problems recently with test loads that
generate a lot of dirty metadata.  The real solution is to fix the
jbd layer not to be so aggressive about flushing out the whole journal
when it runs out of space, as this introduces gigantic latencies.
It should instead clear out only enough space to allow the new
transaction to start, and then continue the checkpoint in the
background.  Not sure when we'll be able to work on that.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From adilger at clusterfs.com  Fri Jan 13 06:54:17 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 12 Jan 2006 23:54:17 -0700
Subject: Extended Attribute Write Performance
In-Reply-To: <1137117134.8569.7.camel@polarbear.fsl.cs.sunysb.edu>
References: <1137085671.30101.0.camel@localhost.localdomain>
	<20060112195203.GD3682@schatzie.adilger.int>
	<1137117134.8569.7.camel@polarbear.fsl.cs.sunysb.edu>
Message-ID: <20060113065417.GD6006@schatzie.adilger.int>

On Jan 12, 2006  20:52 -0500, Charles P. Wright wrote:
> On Thu, 2006-01-12 at 12:52 -0700, Andreas Dilger wrote: 
> > Presumably you are using ext3 and not ext2, given posting to this list?
>
> Actually this test case was on Ext2, not Ext3.  I did a quick search for
> an ext2-users list and didn't immediately see results, so I figured that
> as Ext2 and Ext3 have similar EA implementations, this list would be
> appropriate.

There is ext2-devel at lists.sourceforge.net, which is listed in the MAINTAINERS
file for ext2...  You are right that the same people read both lists.

> > Solutions to your specific problem are to use large inodes and the
> > fast EA space ("mke2fs -j -I 256 ..." makes 256-byte inodes, 128 bytes
> > left for EAs)
>
> Increasing the inode size to 256 bytes made a huge difference under
> Ext3.  The spikes that I mentioned for Ext2 also existed in Ext3, and
> were eliminated by this change.  My application's performance increased
> by about 40%, and the standard deviations dropped from around 20% to 4%.
> 
> However, for Ext2 it made very little difference.  I still have a
> handful of operations (.05%) that account for 73% of the time.  I know
> that Ext2 is optimized for shared attribute blocks (for the case of
> ACLs).  Is there something about having lots of unique attributes that
> results in poor performance?

There is no support for fast EAs in ext2 at this time, so larger inodes
would only slow things down there because you are writing more (useless)
data to disk.

I honestly have no ideas about ext2 performance, as I only ever use ext3.
I would suspect that some of these operations are slower because they are
"stuck" with doing some extra amount of work, like reading a bitmap from
disk, and the rest of the operations are going to cache.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From cwright at ic.sunysb.edu  Fri Jan 13 01:52:14 2006
From: cwright at ic.sunysb.edu (Charles P. Wright)
Date: Thu, 12 Jan 2006 20:52:14 -0500
Subject: Extended Attribute Write Performance
In-Reply-To: <20060112195203.GD3682@schatzie.adilger.int>
References: <1137085671.30101.0.camel@localhost.localdomain>
	<20060112195203.GD3682@schatzie.adilger.int>
Message-ID: <1137117134.8569.7.camel@polarbear.fsl.cs.sunysb.edu>

Andreas,

Thanks for your helpful reply.

On Thu, 2006-01-12 at 12:52 -0700, Andreas Dilger wrote: 
> On Jan 12, 2006  12:07 -0500, Charles P. Wright wrote:
> > I'm writing an application that makes pretty extensive use of extended
> > attributes to store file attributes on Ext2.  I used a profiling tool
> > developed by my colleague Nikolai Joukov at SUNY Stony Brook to dig a
> > bit deeper into the performance of my application.
> 
> Presumably you are using ext3 and not ext2, given posting to this list?
Actually this test case was on Ext2, not Ext3.  I did a quick search for
an ext2-users list and didn't immediately see results, so I figured that
as Ext2 and Ext3 have similar EA implementations, this list would be
appropriate.

> > In the course of my benchmark, there are 54247 setxattr operations
> > during a 54-second run.  They use about 10.56 seconds of that time, which
> > seemed to be a rather outsized performance toll to me (~40k writes took
> > only 10% as long).
> > 
> > After looking at the profile, 27 of those writes end up taking 7.74
> > seconds.  That works out to roughly 286 ms per call; which seems a bit
> > high.
> > 
> > The workload is not memory constrained (the working set is 50MB + 5000
> > files).  Each file has one extended attribute block that contains two
> > attributes totaling 32 bytes.  The attributes are unique (random
> > actually), so there isn't any sharing.
> > 
> > Can someone provide me with some intuition as to why there are so many
> > writes that reach the disk, and why they take so long?  I would expect
> > that the operations shouldn't take much longer than a seek (on the order
> > of 10 ms, not 200+).
> 
> I suspect the reason is that the journal is getting full and jbd is
> doing a full journal checkpoint because it has run out of space for
> new transactions.  This is because each external EA block consumes
> a lot of space (4kB) regardless of how small the EA is, and this can
> eat up the journal quickly.  54247 * 4kB = 211MB, much larger than
> the default 32MB (or maybe 128MB with newer e2fsprogs) journal size.
> 
> Solutions to your specific problem are to use large inodes and the
> fast EA space ("mke2fs -j -I 256 ..." makes 256-byte inodes, 128 bytes
> left for EAs) and/or to increase the journal size ("mke2fs -J size=400",
> though even 400MB won't be enough for this test case).
Increasing the inode size to 256 bytes made a huge difference under
Ext3.  The spikes that I mentioned for Ext2 also existed in Ext3, and
were eliminated by this change.  My application's performance increased
by about 40%, and the standard deviations dropped from around 20% to 4%.

However, for Ext2 it made very little difference.  I still have a
handful of operations (.05%) that account for 73% of the time.  I know
that Ext2 is optimized for shared attribute blocks (for the case of
ACLs).  Is there something about having lots of unique attributes that
results in poor performance?

> We implemented the large inodes + fast EAs (included in 2.6.12+ kernels)
> to avoid the need to do any seeking when reading/writing EAs, in addition
> to the benefit of not writing so much data (mostly unused) to disk.
> This showed a huge performance increase for Lustre metadata servers
> (which use EAs on every file) and also with Samba4 testing.
I can see why, especially on a journalled file system.

Thanks,
Charles



From agruen at suse.de  Sun Jan 15 03:12:46 2006
From: agruen at suse.de (Andreas Gruenbacher)
Date: Sun, 15 Jan 2006 04:12:46 +0100
Subject: Extended Attribute Write Performance
In-Reply-To: <1137117134.8569.7.camel@polarbear.fsl.cs.sunysb.edu>
References: <1137085671.30101.0.camel@localhost.localdomain>
	<20060112195203.GD3682@schatzie.adilger.int>
	<1137117134.8569.7.camel@polarbear.fsl.cs.sunysb.edu>
Message-ID: <200601150412.46782.agruen@suse.de>

On Friday 13 January 2006 02:52, Charles P. Wright wrote:
> Increasing the inode size to 256 bytes made a huge difference under
> Ext3.  The spikes that I mentioned for Ext2 also existed in Ext3, and
> were eliminated by this change.  My application's performance increased
> by about 40%, and the standard deviations dropped from around 20% to 4%.
>
> However, for Ext2 it made very little difference.  I still have a
> handful of operations (.05%) that account for 73% of the time.  I know
> that Ext2 is optimized for shared attribute blocks (for the case of
> ACLs).  Is there something about having lots of unique attributes that
> results in poor performance?

Without fast xattrs (i.e., bigger inodes), unique attributes can consume lots 
of memory, you will end up writing entire blocks for each xattr change, and 
you may also waste a considerable amount of disk space. You already noticed 
that ext3 fast xattrs are much faster for small attributes, no matter if they 
are unique or not. Ext2 does not have fast xattr support; it most likely 
never will. There, the extra space is just wasted and you'll see about the 
same performance no matter which inode size you choose.
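
To put a rough number on the wasted disk space for the workload discussed
in this thread: the sketch below assumes one 4kB EA block per file (the
block size quoted earlier in the thread), 5000 files, and 32 bytes of
attribute values per file. It ignores the headers and attribute names
stored alongside the values, so the real utilisation is somewhat higher.

```python
BLOCK_BYTES = 4096       # EA block size quoted earlier in the thread
FILES = 5000             # files in the benchmark working set
EA_BYTES_PER_FILE = 32   # attribute values per file

allocated = FILES * BLOCK_BYTES        # bytes set aside for EA blocks
payload = FILES * EA_BYTES_PER_FILE    # bytes of actual attribute values

print(allocated // 1024, "kB allocated")
print(payload // 1024, "kB of attribute values")
print(f"{payload / allocated:.2%} of the allocated space is payload")
```

Under these assumptions about 20 MB of EA blocks hold well under 1 MB of
attribute values.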

Regards,
Andreas



From satimis at yahoo.com  Tue Jan 17 06:31:54 2006
From: satimis at yahoo.com (Stephen Liu)
Date: Tue, 17 Jan 2006 14:31:54 +0800 (CST)
Subject: Mounting problem
Message-ID: <20060117063154.17930.qmail@web34715.mail.mud.yahoo.com>

Hi folks,

For some unknown reason I am encountering the following mounting problem:

# mount /mnt/hda8
mount: wrong fs type, bad option, bad superblock on /dev/hda8,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so


# dmesg | tail
via82cxxx: timeout while reading AC97 codec (0x9A0000)
via82cxxx: timeout while reading AC97 codec (0x9A0000)
via82cxxx: timeout while reading AC97 codec (0x9A0000)
via82cxxx: timeout while reading AC97 codec (0x9A0000)
via82cxxx: timeout while reading AC97 codec (0x9A0000)
via82cxxx: timeout while reading AC97 codec (0x9A0000)
via82cxxx: timeout while reading AC97 codec (0x9A0000)
EXT3-fs: hda8: couldn't mount because of unsupported optional features
(2000200).
EXT3-fs: hda8: couldn't mount because of unsupported optional features
(2000200)

Previously this partition could be mounted without problems.  I also tried
adding -t ext3, without success; I still get the same warning.


# e2fsck -f /dev/hda8
e2fsck 1.38 (30-Jun-2005)
e2fsck: Filesystem revision too high while trying to open dev/hda8
The filesystem revision is apparently too high for this version of
e2fsck.  (Or the filesystem superblock is corrupt)

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate
superblock:
    e2fsck -b 8193 <device>


# e2fsck -f 8193 /dev/hda8
Usage: e2fsck [-panyrcdfvstDFSV] [-b superblock] [-B blocksize]
                [-I inode_buffer_blocks] [-P process_inode_size]
                [-l|-L bad_blocks_file] [-C fd] [-j external_journal]
                [-E extended-options] device

Emergency help:
 -p                   Automatic repair (no questions)
 -n                   Make no changes to the filesystem
 -y                   Assume "yes" to all questions
 -c                   Check for bad blocks and add them to the badblock list
 -f                   Force checking even if filesystem is marked clean
 -v                   Be verbose
 -b superblock        Use alternative superblock
 -B blocksize         Force blocksize when looking for superblock
 -j external_journal  Set location of the external journal
 -l bad_blocks_file   Add to badblocks list
 -L bad_blocks_file   Set badblocks list


Please advise how to fix the problem.  TIA

B.R.
SL



From adilger at clusterfs.com  Tue Jan 17 07:11:46 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 17 Jan 2006 00:11:46 -0700
Subject: Mounting problem
In-Reply-To: <20060117063154.17930.qmail@web34715.mail.mud.yahoo.com>
References: <20060117063154.17930.qmail@web34715.mail.mud.yahoo.com>
Message-ID: <20060117071146.GB8009@schatzie.adilger.int>

On Jan 17, 2006  14:31 +0800, Stephen Liu wrote:
> # dmesg | tail
> via82cxxx: timeout while reading AC97 codec (0x9A0000)
> via82cxxx: timeout while reading AC97 codec (0x9A0000)
> via82cxxx: timeout while reading AC97 codec (0x9A0000)
> EXT3-fs: hda8: couldn't mount because of unsupported optional features
> (2000200).
> EXT3-fs: hda8: couldn't mount because of unsupported optional features
> (2000200)

It certainly looks like your disk is corrupted with "0x0200" data.  I'm
not sure where that would come from.  Please attach output from:

	dd if=/dev/hda8 bs=4k count=1 | gzip -9 > /tmp/hda8-sb.gz

> # e2fsck -f /dev/hda8
> e2fsck 1.38 (30-Jun-2005)
> e2fsck: Filesystem revision too high while trying to open dev/hda8
> The filesystem revision is apparently too high for this version of
> e2fsck.  (Or the filesystem superblock is corrupt)
> 
> The superblock could not be read or does not describe a correct ext2
> filesystem.  If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate
> superblock:
>     e2fsck -b 8193 <device>

I believe modern e2fsck's already try the backup superblocks automatically,
but I could be wrong.  In any case, the number after "-b" usually depends
on the size of the filesystem.  For smaller filesystems (< 512MB) it is
8193 or 24577 or 8192 * {3,5,7}^n + 1.  For larger filesystems it is
32768 or 98304 or 32768 * {3,5,7}^n by default.
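
As a rough illustration of the rule above (a sketch, not e2fsprogs code):
assuming the sparse_super layout, backups live in group 1 and in groups
whose number is a power of 3, 5 or 7, and each backup sits at the first
block of its group.  The function name and its parameters are made up for
this example, and the blocks-per-group default is an assumption.

```python
def backup_superblocks(total_blocks, block_size):
    """Candidate backup superblock locations to try with e2fsck -b.

    Assumptions: sparse_super layout (backups in group 1 and in groups
    numbered 3^n, 5^n, 7^n), blocks per group = 8 * block_size, and a
    first data block of 1 for 1 KiB blocks, 0 otherwise.
    """
    blocks_per_group = 8 * block_size
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division: number of block groups in the filesystem.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)

    groups = {1}
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base

    return [g * blocks_per_group + first_data_block
            for g in sorted(groups) if g < group_count]


if __name__ == "__main__":
    # Hypothetical 100000-block filesystem with 1 KiB blocks.
    print(backup_superblocks(100000, 1024))
    # Hypothetical 1000000-block filesystem with 4 KiB blocks.
    print(backup_superblocks(1000000, 4096)[:4])
```

For a filesystem with 4 KiB blocks the first candidates come out as 32768
and 98304, and for 1 KiB blocks as 8193 and 24577.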

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From satimis at yahoo.com  Tue Jan 17 10:59:11 2006
From: satimis at yahoo.com (Stephen Liu)
Date: Tue, 17 Jan 2006 18:59:11 +0800 (CST)
Subject: Mounting problem
In-Reply-To: <20060117071146.GB8009@schatzie.adilger.int>
Message-ID: <20060117105911.48558.qmail@web34705.mail.mud.yahoo.com>

Hi Andreas,

Tks for your advice.

- snip -
> I'm
> not sure where that would come from.  Please attach output from:
> 
> 	dd if=/dev/hda8 bs=4k count=1 | gzip -9 > /tmp/hda8-sb.gz

# dd if=/dev/hda8 bs=4k count=1 | gzip -9 > /tmp/hda8-sb.gz
1+0 records in
1+0 records out
4096 bytes transferred in 0.016240 seconds (252216 bytes/sec)

- snip -

> For smaller filesystems (< 512MB) it is
> 8193 or 24577 or 8192 * {3,5,7}^n + 1.  For larger filesystems
> it is
> 32768 or 98304 or 32768 * {3,5,7}^n by default.

This partition is about 5~6G, if I recall correctly.  Which number
should I use for the test?

Besides, do I need to back up this partition before the test?  Partition
/dev/hda7 has about 5.6G of space available.
# mount /mnt/hda7
# df -hT /mnt/hda7
Filesystem    Type    Size  Used Avail Use% Mounted on
/UNIONFS/dev/hda7
              ext3    5.6G   33M  5.3G   1% /mnt/hda7

Would running
# dd if=/dev/hda8 of=/dev/hda7

back up its contents to /dev/hda7?  Please advise.  TIA

Remark: there are several unimportant working files on /dev/hda7.
Overwriting them is no problem.  The data on /dev/hda8 is about 900MB.

B.R.
SL





From satimis at yahoo.com  Wed Jan 18 02:11:15 2006
From: satimis at yahoo.com (Stephen Liu)
Date: Wed, 18 Jan 2006 10:11:15 +0800 (CST)
Subject: Mounting problem
In-Reply-To: <20060117071146.GB8009@schatzie.adilger.int>
Message-ID: <20060118021115.93572.qmail@web34702.mail.mud.yahoo.com>

Hi Damian and Andreas,

Tks for your advice.  The file is attached to this posting.

B.R.
SL
--- Damian Menscher <menscher at uiuc.edu> wrote:

> On Tue, 17 Jan 2006, Stephen Liu wrote:
> 
> >> I'm
> >> not sure where that would come from.  Please attach output from:
> >>
> >> 	dd if=/dev/hda8 bs=4k count=1 | gzip -9 > /tmp/hda8-sb.gz
> >
> > # dd if=/dev/hda8 bs=4k count=1 | gzip -9 > /tmp/hda8-sb.gz
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes transferred in 0.016240 seconds (252216 bytes/sec)
> 
> I think Andreas meant for you to attach the /tmp/hda8-sb.gz file that
> command created.  He can then analyze the file to see what went
> wrong.
> 
> Damian Menscher
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hda8-sb.gz
Type: application/x-gunzip
Size: 202 bytes
Desc: 3092802863-hda8-sb.gz
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20060118/9da49547/attachment.bin>

From evoltech at 2inches.com  Thu Jan 19 00:16:13 2006
From: evoltech at 2inches.com (Dennis Williams)
Date: Wed, 18 Jan 2006 16:16:13 -0800 (PST)
Subject: ext3 fs errors 3T fs
Message-ID: <20060118160515.M49352@periphery.2inches.com>

Hello,
I looked through the archives a bit and could not find anything relevant,
if you know otherwise please point me in the right direction.

I have a ~3T ext3 filesystem on linux software raid that had been behaving
correctly for some time.  Not too long ago it gave the following error after
trying to mount it:

mount: wrong fs type, bad option, bad superblock on /dev/md0,
       or too many mounted file systems

After a long fsck, which I had to run manually, I noticed the following in
/var/log/messages after trying to mount again:

Jan 19 09:13:11 terrorbytes kernel: EXT3-fs error (device md0):
ext3_check_descriptors: Block bitmap for group 3584 not in group (block
0)!
Jan 19 09:13:11 terrorbytes kernel: EXT3-fs: group descriptors corrupted !

When trying to repair it again with e2fsck I get this error:

e2fsck 1.34 (25-Jul-2003)
Group descriptors look bad... trying backup blocks...
e2fsck: Invalid argument while checking ext3 journal for /dev/md0

Some more information on the system:
os flavor: Suse 9.1
kernel version: 2.6.5-7.202.7-default (various suse patches applied to
   2.6.5 kernel)

I am not sure where to go from here.  Any help, experience, or references
to documentation that would help me better understand the problem would be
appreciated.

Sincerely,
Dennison Williams

"And for all the good or evil, creation or destruction
your living might have of accomplished, you might have
just never have lived at all"
-The Sleeping Beauty



From adilger at clusterfs.com  Thu Jan 19 10:32:42 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 19 Jan 2006 03:32:42 -0700
Subject: Mounting problem
In-Reply-To: <20060118021115.93572.qmail@web34702.mail.mud.yahoo.com>
References: <20060117071146.GB8009@schatzie.adilger.int>
	<20060118021115.93572.qmail@web34702.mail.mud.yahoo.com>
Message-ID: <20060119103242.GT4124@schatzie.adilger.int>

On Jan 18, 2006  10:11 +0800, Stephen Liu wrote:
> Tks for your advice.  The file is attached to this posting.

It is pretty clear that there is some type of single-bit corruption
with your disk:

000000 00000000 00000000 00000000 00000000
*
000400 0009f400 0013e0e1 0000fc8b 000ef82c
000410 0009a4c4 00000000 00000002 00000002
000420 00008000 00008000 00003dc0 43cc70d6
000430 43cc76cc 02250015 0203ef53 02000201
000440 43ad2a1a 02ed4e00 02000200 02000201
000450 02000200 0200020b 02000280 02000204
000460 02000206 02000201 2be35aee d34ae63d
000470 6eeb1bb3 c368ff68 02000200 02000200
000480 02000200 02000200 02000200 02000200
*
0004e0 02000208 02000200 02000200 5a46b3f1
0004f0 87471f23 6f6ed694 87f63eb6 02000302
000500 02000200 02000200 43ad2a1a 02000207
000510 02000208 02000209 0200020a 0200020b
000520 0200020c 0200020d 0200020e 0200020f
000530 02000210 02000211 02000212 02000213
000540 02000614 02000200 02000200 02000200
000550 02000200 02000200 02000200 02000200
*
001000

Note that all of the "02000200" bits are set for most of the superblock.
I'd suspect either something bad in the controller or maybe a cable.  If
this is present throughout the disk then there is nothing that can be
done about it, except restore from backup.
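
For anyone following along, the dump can be cross-checked against the
earlier error messages with a few lines of Python.  This is only a sketch:
the field offsets (s_rev_level at 0x4c and s_feature_incompat at 0x60,
relative to the superblock at 0x400) are taken from the ext2 on-disk
layout as I understand it, and the SUPPORTED mask is an assumption about
which incompat features this kernel accepts.

```python
STUCK = 0x02000200  # the bit pattern that shows up all over the dump

# 32-bit words copied from dump rows 0x440-0x460 above.
rows = {
    0x440: "43ad2a1a 02ed4e00 02000200 02000201",
    0x450: "02000200 0200020b 02000280 02000204",
    0x460: "02000206 02000201 2be35aee d34ae63d",
}


def word_at(offset):
    """Return the 32-bit word at a byte offset within the dump."""
    row = offset & ~0xF
    return int(rows[row].split()[(offset - row) // 4], 16)


rev_level = word_at(0x400 + 0x4C)  # s_rev_level (assumed offset)
incompat = word_at(0x400 + 0x60)   # s_feature_incompat (assumed offset)

# Assumed set of incompat features the kernel understands:
# FILETYPE (0x2), RECOVER (0x4), META_BG (0x10).
SUPPORTED = 0x0002 | 0x0004 | 0x0010

# Revision as stored, and with the stuck bits masked off.
print(hex(rev_level), hex(rev_level & ~STUCK))   # 0x2000201 0x1
# Feature bits the kernel would reject as unsupported.
print(hex(incompat & ~SUPPORTED))                # 0x2000200
```

Under those assumptions the leftover feature bits are exactly the
"(2000200)" from the kernel message, and the inflated revision field would
account for e2fsck's "Filesystem revision too high".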


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From adilger at clusterfs.com  Thu Jan 19 12:26:39 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 19 Jan 2006 05:26:39 -0700
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060118160515.M49352@periphery.2inches.com>
References: <20060118160515.M49352@periphery.2inches.com>
Message-ID: <20060119122639.GW4124@schatzie.adilger.int>

On Jan 18, 2006  16:16 -0800, Dennis Williams wrote:
> I looked through the archives a bit and could not find anything relevant,
> if you know otherwise please point me in the right direction.
> 
> I have a ~3T ext3 filesystem on linux software raid that had been behaving
> correctly for some time.  Not too long ago it gave the following error after
> trying to mount it:
> 
> mount: wrong fs type, bad option, bad superblock on /dev/md0,
>        or too many mounted file systems

This sounds like the superblock has been overwritten.  There are occasional
reports from > 2TB filesystem users of similar corruption.  It isn't clear
if the problem exists in ext3 or if it is in the block or SCSI layer.

> some more information on the system:
> os flavor: Suse 9.1
> kernel version: 2.6.5-7.202.7-default (various suse patches applied to
>    2.6.5 kernel)

RHEL4 (2.6.9) claims support for up to 8TB filesystems.  I don't know what
patches they made, if any, in order to have this working.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From menscher at uiuc.edu  Thu Jan 19 16:35:56 2006
From: menscher at uiuc.edu (Damian Menscher)
Date: Thu, 19 Jan 2006 10:35:56 -0600 (CST)
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060119122639.GW4124@schatzie.adilger.int>
References: <20060118160515.M49352@periphery.2inches.com>
	<20060119122639.GW4124@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.63.0601191028110.22306@zeus.itg.uiuc.edu>

On Thu, 19 Jan 2006, Andreas Dilger wrote:
> On Jan 18, 2006  16:16 -0800, Dennis Williams wrote:
>>
>> I have a ~3T ext3 filesystem on linux software raid that had been behaving
>> correctly for some time.  Not too long ago it gave the following error after
>> trying to mount it:
>>
>> mount: wrong fs type, bad option, bad superblock on /dev/md0,
>>        or too many mounted file systems
>
> This sounds like the superblock has been overwritten.  There are occasional
> reports from > 2TB filesystem users of similar corruption.  It isn't clear
> if the problem exists in ext3 or if it is in the block or SCSI layer.
>
>> some more information on the system:
>> os flavor: Suse 9.1
>> kernel version: 2.6.5-7.202.7-default (various suse patches applied to
>>    2.6.5 kernel)

32bit or 64bit?

> RHEL4 (2.6.9) claims support for up to 8TB filesystems.  I don't know what
> patches they made, if any, in order to have this working.

FWIW, when we first tried using a >2TB filesystem on linux (I think it 
was FC3 at the time), we discovered filesystem corruption once data had 
been written past the 2TB mark on a 32-bit machine.  I'm guessing this 
is what you're seeing also.

We have been using (and filling) >2TB filesystems on 64-bit machines 
(FC4 and RHEL4) for some time now without problems.

Note that we didn't bother doing a detailed analysis of configurations, 
but rather tried a couple of variations until we found one that worked, 
so this could be a red herring (for those not familiar with the term, 
that means a clue that leads you in the wrong direction).

Damian Menscher
-- 
-=#| <menscher at uiuc.edu> www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-



From evoltech at 2inches.com  Thu Jan 19 21:25:29 2006
From: evoltech at 2inches.com (Dennis Williams)
Date: Thu, 19 Jan 2006 13:25:29 -0800 (PST)
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060118160515.M49352@periphery.2inches.com>
References: <20060118160515.M49352@periphery.2inches.com>
Message-ID: <20060119125736.Q66109@periphery.2inches.com>

This is a 64 bit system running a 64 bit kernel.

After reading through the manpage for e2fsck a bit I noticed that mke2fs
can be used to determine additional superblock backups with the -n flag.
Not knowing how the fs was created I assumed that the default blocksize
was used.

terrorbytes:~ # mke2fs -n /dev/md0
mke2fs 1.34 (25-Jul-2003)
warning: 160 blocks unused.

Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
403685856 inodes, 805797888 blocks
40289902 blocks (5.00%) reserved for the super user
First data block=0
24591 block groups
32768 blocks per group, 32768 fragments per group
16416 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544


terrorbytes:~ # e2fsck -yb 229376 /dev/md0

The system has now been correcting errors for the past 12 hours.  I hope
when it finishes, it will mount without complaints.

Sincerely,
Dennis Williams

"And for all the good or evil, creation or destruction
your living might have of accomplished, you might have
just never have lived at all"
-From: The Sleeping Beauty




From satimis at yahoo.com  Fri Jan 20 15:45:54 2006
From: satimis at yahoo.com (Stephen Liu)
Date: Fri, 20 Jan 2006 23:45:54 +0800 (CST)
Subject: Mounting problem
In-Reply-To: <20060119103242.GT4124@schatzie.adilger.int>
Message-ID: <20060120154554.79769.qmail@web34712.mail.mud.yahoo.com>

Hi Andreas,

Tks for your advice.

> It is pretty clear that there is some type of 
> single-bit corruption with your disk:

- snip -

> I'd suspect either something bad in the controller 
> or maybe a cable. 

I recall that the screen once hung, forcing me to do a hard reboot.
Later I found out the cause was a bad contact on the hard disk's power
cable.


I got the problem fixed after running:

# fsck.ext3 -b 32768   /dev/hda8
and answering several questions.

Partition /dev/hda8 is now working.

Tks again for your assistance.

B.R.
SL



From evoltech at 2inches.com  Fri Jan 20 17:22:03 2006
From: evoltech at 2inches.com (Dennis Williams)
Date: Fri, 20 Jan 2006 09:22:03 -0800 (PST)
Subject: ext3 fs errors 3T fs
In-Reply-To: <E1EzzVc-0002iK-44@BlackLife.kmv.ru>
References: <E1EzzVc-0002iK-44@BlackLife.kmv.ru>
Message-ID: <20060120091159.H84552@periphery.2inches.com>


> > The system has now been correcting errors for the past 12 hours.  I hope
> > when it finishes, it will mount without complaints.
>
> Never believe fsck here. It may check heavily corrupted filesystems for
> several DAYS. For me (a corrupted 120 GB ext3 partition) "fsck.ext3 -y"
> ran for 3 days before I interrupted it. In manual mode (avoid
> 'duplicate inode clone' and answer yes to 'delete file') it took only
> 30 minutes.
>

Just out of morbid curiosity what does 'duplicate inode clone' mean?  And
how does the fs get in that state?

The fsck finished this morning with the following final statements:

/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****

/dev/md0: ********** WARNING: Filesystem still has errors **********

/dev/md0: 1472505/403685856 files (10.3% non-contiguous),
673983041/805797888 blocks

1) Why would the fs still have errors?  Is it correct to assume that
running fsck again is the answer? (I hope so)

2) What does the last line of this message mean?

I did notice that the fs mounted correctly after this with the following
errors in /var/log/messages:

Jan 21 02:09:48 terrorbytes kernel: kjournald starting.  Commit interval 5
seconds
Jan 21 02:09:48 terrorbytes kernel: EXT3-fs warning (device md0):
ext3_clear_journal_err: Filesystem error recorded from previous mount: IO
failure
Jan 21 02:09:48 terrorbytes kernel: EXT3-fs warning (device md0):
ext3_clear_journal_err: Marking fs in need of filesystem check.
Jan 21 02:09:48 terrorbytes kernel: EXT3-fs warning: mounting unchecked
fs, running e2fsck is recommended
Jan 21 02:09:48 terrorbytes kernel: EXT3 FS on md0, internal journal
Jan 21 02:09:48 terrorbytes kernel: EXT3-fs: mounted filesystem with
ordered data mode.

After unmounting the filesystem, I ran a standard fsck again:
terrorbytes:~ # e2fsck /dev/md0
e2fsck 1.34 (25-Jul-2003)
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes

Thank you to everyone who has responded to my posts with their
suggestions.

Sincerely,
Dennison Williams



From evoltech at 2inches.com  Sat Jan 21 07:07:07 2006
From: evoltech at 2inches.com (Dennis Williams)
Date: Fri, 20 Jan 2006 23:07:07 -0800 (PST)
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060120091159.H84552@periphery.2inches.com>
References: <E1EzzVc-0002iK-44@BlackLife.kmv.ru>
	<20060120091159.H84552@periphery.2inches.com>
Message-ID: <20060120225709.A95489@periphery.2inches.com>

Hello,
After the fsck finished this evening there were no final statements
referring to problems.  I remounted the filesystem without any errors.
After noticing that there were a number of files missing, I started to
attempt to recover from the lost+found directory.  I was repeatedly able
to get the filesystem to error and remount read only when find
traversed a specific directory in lost+found.  This is the error message I
received from /var/log/messages:

Jan 21 16:00:26 terrorbytes kernel: EXT3-fs error (device md0):
ext3_readdir: bad entry in directory #73117155: directory entry across
blocks - offset=0, inode=0, rec_len=8196, name_len=84
Jan 21 16:00:26 terrorbytes kernel: Aborting journal on device md0.
Jan 21 16:00:26 terrorbytes kernel: ext3_abort called.
Jan 21 16:00:26 terrorbytes kernel: EXT3-fs abort (device md0):
ext3_journal_start: Detected aborted journal
Jan 21 16:00:26 terrorbytes kernel: Remounting filesystem read-only

1) Can someone explain what this means, and/or why it might happen?
2) Why might this condition exist even after a successful fsck?

I am planning on running a fsck yet again.

Sincerely,
Dennis Williams




From adilger at clusterfs.com  Sun Jan 22 19:25:25 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Sun, 22 Jan 2006 12:25:25 -0700
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060120225709.A95489@periphery.2inches.com>
References: <E1EzzVc-0002iK-44@BlackLife.kmv.ru>
	<20060120091159.H84552@periphery.2inches.com>
	<20060120225709.A95489@periphery.2inches.com>
Message-ID: <20060122192525.GL4124@schatzie.adilger.int>

On Jan 20, 2006  23:07 -0800, Dennis Williams wrote:
> After the fsck finished this evening there were no final statements
> referring to problems.  I remounted the filesystem without any errors.
> After noticing that there were a number of files missing, I started to
> attempt to recover from the lost+found directory.  I was repeatedly able
> to get the filesystem to error and remount read only when find
> traversed a specific directory in lost+found.  This is the error message I
> received from /var/log/messages:
> 
> Jan 21 16:00:26 terrorbytes kernel: EXT3-fs error (device md0):
> ext3_readdir: bad entry in directory #73117155: directory entry across
> blocks - offset=0, inode=0, rec_len=8196, name_len=84
> Jan 21 16:00:26 terrorbytes kernel: Aborting journal on device md0.
> Jan 21 16:00:26 terrorbytes kernel: ext3_abort called.
> Jan 21 16:00:26 terrorbytes kernel: EXT3-fs abort (device md0):
> ext3_journal_start: Detected aborted journal
> Jan 21 16:00:26 terrorbytes kernel: Remounting filesystem read-only
> 
> 1) Can someone explain what this means, and/or why it might happen?
> 2) Why might this condition exist even after a successful fsck?

In case it wasn't clear before (I thought it was) you are having problems
because this fs is > 2TB.  Why, I'm not sure - it may relate to LVM/MD,
it may be the block layer, or it may be an ext3 bug.  The fact that it is
at 2TB makes it seem like a block layer bug or lower.
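
To put a number on the 2TB boundary: 2^32 sectors of 512 bytes is exactly
2 TiB, which is why a 32-bit sector number somewhere in the stack is a
natural suspect.  The sketch below uses the block count from the mke2fs -n
output earlier in the thread and shows where that boundary falls in 4 KiB
filesystem blocks.  That a sector number actually wraps is only a
hypothesis here, not a diagnosis, and wrapped_block() is a made-up helper.

```python
SECTOR_SIZE = 512
BLOCK_SIZE = 4096
TOTAL_BLOCKS = 805797888            # from the mke2fs -n output

SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE
limit_bytes = 2**32 * SECTOR_SIZE   # reach of a 32-bit sector number
limit_block = limit_bytes // BLOCK_SIZE


def wrapped_block(block):
    """Block that would really be hit if the sector number wrapped at 32 bits."""
    sector = block * SECTORS_PER_BLOCK
    return (sector % 2**32) // SECTORS_PER_BLOCK


print(limit_bytes == 2 * 2**40)      # True: exactly 2 TiB
print(limit_block)                   # 536870912
print(TOTAL_BLOCKS - limit_block)    # 268926976 blocks lie past the boundary
print(wrapped_block(limit_block))    # 0
```

If such a wrap did happen, the first block past the boundary would land on
block 0, the start of the device, where the primary superblock lives.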

I would start by making a backup if you haven't already.

I think debugging it would be easiest if you had a backup and were
willing to overwrite the device with a test pattern.

If you can isolate the corruption to a single file or dir, you may get some
insight into the problem by running filefrag on it (or "stat {path}" in
debugfs).

> I am planning on running a fsck yet again.

Won't prevent problems from recurring.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From rkimber at ntlworld.com  Mon Jan 23 13:16:40 2006
From: rkimber at ntlworld.com (R Kimber)
Date: Mon, 23 Jan 2006 13:16:40 +0000
Subject: Oops
Message-ID: <20060123131640.7a7c355b.rkimber@ntlworld.com>


I don't know enough about it to know whether this is a known problem (I
couldn't make much sense of what I found on Google), but it seems to be
a journal-related issue.  Is it likely that data has been corrupted?
Do I need to take any action?

2.6.12-9-amd64-k8-smp, Ubuntu 5.10, dual opteron 2GB

Jan 23 03:08:22 infinity kernel: [24686.841032] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
Jan 23 03:08:22 infinity kernel: [24686.841039] <ffffffff8810d3e2>{:jbd:journal_commit_transaction+2594}
Jan 23 03:08:22 infinity kernel: [24686.841064] PGD 6ea20067 PUD 6ea97067 PMD 0
Jan 23 03:08:22 infinity kernel: [24686.841069] Oops: 0000 [1] SMP
Jan 23 03:08:22 infinity kernel: [24686.841073] CPU 1
Jan 23 03:08:22 infinity kernel: [24686.841075] Modules linked in: ext2 binfmt_misc ipt_limit
    iptable_mangle ipt_LOG ipt_MASQUERADE iptable_nat ipt_TOS ipt_REJECT
    ip_conntrack_irc ip_conntrack_ftp ipt_state ip_conntrack iptable_filter
    ip_tables ipv6 pcspkr snd_seq_dummy snd_seq_oss snd_seq_midi
    snd_seq_midi_event snd_seq snd_via82xx gameport snd_ac97_codec
    snd_mpu401_uart snd_rawmidi snd_seq_device bt878 snd_bt87x snd_pcm_oss
    snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tuner
    tvaudio bttv video_buf firmware_class i2c_algo_bit v4l2_common
    btcx_risc tveeprom videodev nls_iso8859_1 nls_cp437 vfat fat dm_mod
    tsdev evdev nvidia w83627hf eeprom i2c_sensor i2c_isa i2c_viapro
    i2c_core rtc psmouse mousedev parport_pc lp parport sd_mod md ext3 jbd
    mbcache thermal processor fan ide_cd cdrom usb_storage scsi_mod
    ehci_hcd uhci_hcd tg3 pdc202xx_new ide_disk ide_generic via82cxxx
    ide_core unix vesafb capability commoncap vga16fb vgastate softcursor
    cfbimgblt cfbfillrect cfbcopyarea fbcon tileblit font bitblit
Jan 23 03:08:22 infinity kernel: [24686.841128] Pid: 3318, comm: kjournald Tainted: P      2.6.12-9-amd64-k8-smp
Jan 23 03:08:22 infinity kernel: [24686.841132] RIP: 0010:[_end+130728930/2132406272] <ffffffff8810d3e2>{:jbd:journal_commit_transaction+2594}
Jan 23 03:08:22 infinity kernel: [24686.841146] RSP: 0018:ffff81007cf47d88  EFLAGS: 00010286
Jan 23 03:08:22 infinity kernel: [24686.841150] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000035
Jan 23 03:08:22 infinity kernel: [24686.841155] RDX: ffff81004d06ada0 RSI: ffff810071d0bb88 RDI: ffff810071d0bb88
Jan 23 03:08:22 infinity kernel: [24686.841159] RBP: ffff81004d06ae00 R08: 0000000000000015 R09: 0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841162] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841166] R13: ffff810064213d40 R14: ffff81007d4c4400 R15: 0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841171] FS: 00002aaaadd8ebe0(0000) GS:ffffffff804286c0(0000) knlGS:0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841175] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 23 03:08:22 infinity kernel: [24686.841179] CR2: 0000000000000000 CR3: 000000006e619000 CR4: 00000000000006e0
Jan 23 03:08:22 infinity kernel: [24686.841184] Process kjournald (pid: 3318, threadinfo ffff81007cf46000, task ffff81007d6090b0)
Jan 23 03:08:22 infinity kernel: [24686.841187] Stack: ffff81007d4c4424 ffff81007d4c455c 00000fd400000000 ffff810058bb002c
Jan 23 03:08:22 infinity kernel: [24686.841196] 0000000002c0e4a0 ffff81007d628000 ffff81004d06ace0 0000000000001163
Jan 23 03:08:22 infinity kernel: [24686.841203]        ffffffff8032c340 0000007300000003
Jan 23 03:08:22 infinity kernel: [24686.841208] Call Trace:<ffffffff8012e413>{__wake_up+67} <ffffffff8810fb54>{:jbd:kjournald+276}
Jan 23 03:08:22 infinity kernel: [24686.841260] <ffffffff80148e10>{autoremove_wake_function+0} <ffffffff80148e10>{autoremove_wake_function+0}
Jan 23 03:08:22 infinity kernel: [24686.841283]        <ffffffff8810fa20>{:jbd:commit_timeout+0} <ffffffff8010e61b>{child_rip+8}
Jan 23 03:08:22 infinity kernel: [24686.841318]        <ffffffff8810fa40>{:jbd:kjournald+0} <ffffffff8010e613>{child_rip+0}
Jan 23 03:08:22 infinity kernel: [24686.841350]
Jan 23 03:08:22 infinity kernel: [24686.841357]
Jan 23 03:08:22 infinity kernel: [24686.841358] Code: 8b 03 a8 04 74 18 8b 03 a8 04 75 07 8b 43 18 85 c0 75 e0 48
Jan 23 03:08:22 infinity kernel: [24686.841369] RIP <ffffffff8810d3e2>{:jbd:journal_commit_transaction+2594} RSP <ffff81007cf47d88>
Jan 23 03:08:22 infinity kernel: [24686.841382] CR2: 0000000000000000

Thanks
-- 
Richard Kimber
http://www.psr.keele.ac.uk/



From rkimber at ntlworld.com  Mon Jan 23 13:16:40 2006
From: rkimber at ntlworld.com (R Kimber)
Date: Mon, 23 Jan 2006 13:16:40 +0000
Subject: Oops
Message-ID: <20060123131640.7a7c355b.rkimber@ntlworld.com>


I don't know enough about it to know whether this is a known problem (I
couldn't make much sense of what I found on Google), but it seems to be
a journal-related issue.  Is it likely that data has been corrupted?
Do I need to take any action?

2.6.12-9-amd64-k8-smp, Ubuntu 5.10, dual opteron 2GB

Jan 23 03:08:22 infinity kernel: [24686.841032] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
Jan 23 03:08:22 infinity kernel: [24686.841039] <ffffffff8810d3e2>{:jbd:journal_commit_transaction+2594}
Jan 23 03:08:22 infinity kernel: [24686.841064] PGD 6ea20067 PUD 6ea97067 PMD 0
Jan 23 03:08:22 infinity kernel: [24686.841069] Oops: 0000 [1] SMP
Jan 23 03:08:22 infinity kernel: [24686.841073] CPU 1
Jan 23 03:08:22 infinity kernel: [24686.841075] Modules linked in: ext2 binfmt_misc ipt_limit iptable_mangle ipt_LOG ipt_MASQUERADE iptable_nat ipt_TOS ipt_REJECT ip_conntrack_irc ip_conntrack_ftp ipt_state ip_conntrack iptable_filter ip_tables ipv6 pcspkr snd_seq_dummy snd_seq_oss snd_seq_midi snd_seq_midi_event snd_seq snd_via82xx gameport snd_ac97_codec snd_mpu401_uart snd_rawmidi snd_seq_device bt878 snd_bt87x snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tuner tvaudio bttv video_buf firmware_class i2c_algo_bit v4l2_common btcx_risc tveeprom videodev nls_iso8859_1 nls_cp437 vfat fat dm_mod tsdev evdev nvidia w83627hf eeprom i2c_sensor i2c_isa i2c_viapro i2c_core rtc psmouse mousedev parport_pc lp parport sd_mod md ext3 jbd mbcache thermal processor fan ide_cd cdrom usb_storage scsi_mod ehci_hcd uhci_hcd tg3 pdc202xx_new ide_disk ide_generic via82cxxx ide_core unix vesafb capability commoncap vga16fb vgastate softcursor cfbimgblt cfbfillrect cfbcopyarea fbcon tileblit font bitblit
Jan 23 03:08:22 infinity kernel: [24686.841128] Pid: 3318, comm: kjournald Tainted: P      2.6.12-9-amd64-k8-smp
Jan 23 03:08:22 infinity kernel: [24686.841132] RIP: 0010:[_end+130728930/2132406272] <ffffffff8810d3e2>{:jbd:journal_commit_transaction+2594}
Jan 23 03:08:22 infinity kernel: [24686.841146] RSP: 0018:ffff81007cf47d88  EFLAGS: 00010286
Jan 23 03:08:22 infinity kernel: [24686.841150] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000035
Jan 23 03:08:22 infinity kernel: [24686.841155] RDX: ffff81004d06ada0 RSI: ffff810071d0bb88 RDI: ffff810071d0bb88
Jan 23 03:08:22 infinity kernel: [24686.841159] RBP: ffff81004d06ae00 R08: 0000000000000015 R09: 0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841162] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841166] R13: ffff810064213d40 R14: ffff81007d4c4400 R15: 0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841171] FS: 00002aaaadd8ebe0(0000) GS:ffffffff804286c0(0000) knlGS:0000000000000000
Jan 23 03:08:22 infinity kernel: [24686.841175] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 23 03:08:22 infinity kernel: [24686.841179] CR2: 0000000000000000 CR3: 000000006e619000 CR4: 00000000000006e0
Jan 23 03:08:22 infinity kernel: [24686.841184] Process kjournald (pid: 3318, threadinfo ffff81007cf46000, task ffff81007d6090b0)
Jan 23 03:08:22 infinity kernel: [24686.841187] Stack: ffff81007d4c4424 ffff81007d4c455c 00000fd400000000 ffff810058bb002c
Jan 23 03:08:22 infinity kernel: [24686.841196] 0000000002c0e4a0 ffff81007d628000 ffff81004d06ace0 0000000000001163
Jan 23 03:08:22 infinity kernel: [24686.841203]        ffffffff8032c340 0000007300000003
Jan 23 03:08:22 infinity kernel: [24686.841208] Call Trace:<ffffffff8012e413>{__wake_up+67} <ffffffff8810fb54>{:jbd:kjournald+276}
Jan 23 03:08:22 infinity kernel: [24686.841260] <ffffffff80148e10>{autoremove_wake_function+0} <ffffffff80148e10>{autoremove_wake_function+0}
Jan 23 03:08:22 infinity kernel: [24686.841283]        <ffffffff8810fa20>{:jbd:commit_timeout+0} <ffffffff8010e61b>{child_rip+8}
Jan 23 03:08:22 infinity kernel: [24686.841318]        <ffffffff8810fa40>{:jbd:kjournald+0} <ffffffff8010e613>{child_rip+0}
Jan 23 03:08:22 infinity kernel: [24686.841350]
Jan 23 03:08:22 infinity kernel: [24686.841357]
Jan 23 03:08:22 infinity kernel: [24686.841358] Code: 8b 03 a8 04 74 18 8b 03 a8 04 75 07 8b 43 18 85 c0 75 e0 48
Jan 23 03:08:22 infinity kernel: [24686.841369] RIP <ffffffff8810d3e2>{:jbd:journal_commit_transaction+2594} RSP <ffff81007cf47d88>
Jan 23 03:08:22 infinity kernel: [24686.841382] CR2: 0000000000000000

Thanks
-- 
Richard Kimber
http://www.psr.keele.ac.uk/



From evoltech at 2inches.com  Mon Jan 23 17:09:23 2006
From: evoltech at 2inches.com (Dennis Williams)
Date: Mon, 23 Jan 2006 09:09:23 -0800 (PST)
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060122192525.GL4124@schatzie.adilger.int>
References: <E1EzzVc-0002iK-44@BlackLife.kmv.ru>
	<20060120091159.H84552@periphery.2inches.com>
	<20060120225709.A95489@periphery.2inches.com>
	<20060122192525.GL4124@schatzie.adilger.int>
Message-ID: <20060123085621.X57575@periphery.2inches.com>

> In case it wasn't clear before (I thought it was) you are having problems
> because this fs is > 2TB.  Why, I'm not sure - it may relate to LVM/MD,
> it may be the block layer, or it may be an ext3 bug.  The fact that it is
> at 2TB makes it seem like a block layer bug or lower.

_you_ were clear, though others led me to believe that on a 64-bit system
I should have no problem with an ext3 fs > 2T. Furthermore, there are a
number of claims on the Internet that ext3 should have no problem being
larger than 2T: http://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits.
That being said, I do plan on rebuilding the raid + filesystem in
chunks < 2T as soon as I get additional storage set up.

> If you can isolate the corruption to a single file or dir, you may get some
> insight into the problem by running filefrag on it (or "stat {path}" in
> debugfs.

I was able to isolate the problem to 2 different directories repeatedly.
Both of them were in the lost+found directory.  I ran "stat {path}" in
debugfs on them but did not see any info that stood out as abnormal.
When I get access to the system again, I will repost the output.

> I think debugging it would be easiest if you had a backup and were
> willing to overwrite the device with a test pattern.

I would like to debug this situation when I get backup storage.  What
steps would you recommend to do this?

Thanks again, to everyone who has offered suggestions.

Sincerely,
Dennison Williams



From bladilo at rice.edu  Wed Jan 25 01:12:46 2006
From: bladilo at rice.edu (Franco M. Bladilo)
Date: Tue, 24 Jan 2006 19:12:46 -0600
Subject: EXT3: failed to claim external journal device.
Message-ID: <43D6D08E.2060107@rice.edu>

We are having problems remounting an ext3 filesystem that uses an external
journal device. The filesystem in question was working fine until the
server was rebooted.
This is what we see in dmesg when trying to mount:
EXT3: failed to claim external journal device.
The external journal lives on an LVM2 logical volume and it seems to be
accessible (we can run dumpe2fs on it and see filesystem information).

Here's the system information and the command lines used to create the
filesystem:
SuSE SLES9 2 , kernel 2.6.5
ada718-5:/ # rpm -qa | grep e2fs
e2fsprogs-1.36-6.2
-----------------------------------------
mke2fs -O journal_dev   /dev/mapper/home_jou_vol_grp-home_jou 400000
mke2fs -E stride=16 -O sparse_super,dir_index -j -J device=/dev/mapper/home_jou_vol_grp-home_jou /dev/mapper/home_vol_grp-home

Any ideas?

Thanks in advance,

-- 
Franco Bladilo
Linux/HPCC Administrator
Research Computing Support Group
Rice University
bladilo at rice.edu



From adilger at clusterfs.com  Wed Jan 25 10:05:09 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 25 Jan 2006 03:05:09 -0700
Subject: EXT3: failed to claim external journal device.
In-Reply-To: <43D6D08E.2060107@rice.edu>
References: <43D6D08E.2060107@rice.edu>
Message-ID: <20060125100509.GJ11642@schatzie.adilger.int>

On Jan 24, 2006  19:12 -0600, Franco M. Bladilo wrote:
> We are having problems remounting an ext3 filesystem using an external 
> journal device. The filesystem in question was working fine until the 
> server was rebooted.
> This is what we see on dmesg when trying to mount:
> EXT3: failed to claim external journal device.
> The external journal lives on a LVM2 logical volume and it seems to be 
> accessible ( we can dumpe2fs and see filesystem information).
> 
> Here's the system information and command line used to create the 
> filesystem :
> SuSE SLES9 2 , kernel 2.6.5
> ada718-5:/ # rpm -qa | grep e2fs
> e2fsprogs-1.36-6.2
> -----------------------------------------
> mke2fs -O journal_dev   /dev/mapper/home_jou_vol_grp-home_jou 400000
> mke2fs -E stride=16 -O sparse_super,dir_index -j -J 
> device=/dev/mapper/home_jou_vol_grp-home_jou  /dev/mapper/home_vol_grp-home
> 
> Any ideas?

I believe the kernel does the journal device lookup by the device major/minor,
and those are not fixed for LVM devices.  Bull recently posted a patch here
for mount to automatically find the correct block device for this journal
UUID.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From johann.lombardi at bull.net  Thu Jan 26 09:52:34 2006
From: johann.lombardi at bull.net (Johann Lombardi)
Date: Thu, 26 Jan 2006 10:52:34 +0100
Subject: EXT3: failed to claim external journal device.
In-Reply-To: <20060125100509.GJ11642@schatzie.adilger.int>
References: <43D6D08E.2060107@rice.edu>
	<20060125100509.GJ11642@schatzie.adilger.int>
Message-ID: <200601261052.35134.johann.lombardi@bull.net>

> > Here's the system information and command line used to create the
> > filesystem :
> > SuSE SLES9 2 , kernel 2.6.5
> > ada718-5:/ # rpm -qa | grep e2fs
> > e2fsprogs-1.36-6.2
> > -----------------------------------------
> > mke2fs -O journal_dev   /dev/mapper/home_jou_vol_grp-home_jou 400000
> > mke2fs -E stride=16 -O sparse_super,dir_index -j -J
> > device=/dev/mapper/home_jou_vol_grp-home_jou 
> > /dev/mapper/home_vol_grp-home
> >
> > Any ideas?
>
> I believe the kernel does the journal device lookup by the device
> major/minor, and those are not fixed for LVM devices.  

If the filesystem was _cleanly_ unmounted, you can try to remove/reattach the
external journal. It will update the superblock with the new major/minor
numbers.
You can proceed as follows:
# tune2fs -f -O^has_journal /dev/mapper/home_vol_grp-home
# tune2fs -J device=/dev/mapper/home_jou_vol_grp-home_jou /dev/mapper/home_vol_grp-home

It will work until the journal device's major/minor numbers change again 
(the next reboot?).
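
A minimal sketch of the comparison behind this, assuming the device number
recorded in the superblock has already been obtained separately (how it is
printed by the tools is not shown here). The function names and the example
numbers 253:4 and 253:6 are hypothetical; only os.major, os.minor, os.makedev
and os.stat are real library calls.

```python
import os

def split_devno(devno):
    """Return (major, minor) for a device number such as st_rdev."""
    return os.major(devno), os.minor(devno)

def current_devno(path):
    """Device number of the block device node at `path` (follows symlinks)."""
    return os.stat(path).st_rdev

def journal_moved(recorded_devno, actual_devno):
    """True if the number stored in the superblock no longer matches."""
    return split_devno(recorded_devno) != split_devno(actual_devno)

# Hypothetical numbers: the journal LV was 253:4 when the filesystem was
# created, but came up as 253:6 after the reboot.
recorded = os.makedev(253, 4)
actual = os.makedev(253, 6)
print(split_devno(recorded), split_devno(actual), journal_moved(recorded, actual))
```

On the real system, current_devno("/dev/mapper/home_jou_vol_grp-home_jou")
would supply the second value. A mismatch would be consistent with the
"failed to claim external journal device" message, but it does not prove it.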

> Bull recently posted a patch here for mount to automatically find the
> correct block device for this journal UUID.

Actually, it was on ext2-devel:
http://thread.gmane.org/gmane.comp.file-systems.ext2.devel/2950

Johann
-------------- next part --------------
An embedded message was scrubbed...
From: Pekka Enberg <penberg at cs.helsinki.fi>
Subject: [Ext2-devel] [PATCH] ext2: return FSID for statvfs
Date: Tue, 06 Dec 2005 22:22:48 +0200
Size: 5581
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20060126/20cc0461/attachment.eml>

From johann.lombardi at bull.net  Thu Jan 26 10:20:22 2006
From: johann.lombardi at bull.net (Johann Lombardi)
Date: Thu, 26 Jan 2006 11:20:22 +0100
Subject: EXT3: failed to claim external journal device.
In-Reply-To: <200601261052.35134.johann.lombardi@bull.net>
References: <43D6D08E.2060107@rice.edu>
	<20060125100509.GJ11642@schatzie.adilger.int>
	<200601261052.35134.johann.lombardi@bull.net>
Message-ID: <20060126102022.GA19355@lombardij>

oops, my mistake. Please do not pay attention to the attachment of my
previous post.



From bladilo at rice.edu  Thu Jan 26 16:19:17 2006
From: bladilo at rice.edu (Franco M. Bladilo)
Date: Thu, 26 Jan 2006 10:19:17 -0600
Subject: EXT3: failed to claim external journal device.
In-Reply-To: <200601261052.35134.johann.lombardi@bull.net>
References: <43D6D08E.2060107@rice.edu>
	<20060125100509.GJ11642@schatzie.adilger.int>
	<200601261052.35134.johann.lombardi@bull.net>
Message-ID: <43D8F685.2060701@rice.edu>

Johann, Andreas,
Thanks for the pointers, they certainly explain the issue we were 
seeing. Has the mount/util-linux external journal patch been accepted ?

Franco.

Johann Lombardi wrote:

>>>Here's the system information and command line used to create the
>>>filesystem :
>>>SuSE SLES9 2 , kernel 2.6.5
>>>ada718-5:/ # rpm -qa | grep e2fs
>>>e2fsprogs-1.36-6.2
>>>-----------------------------------------
>>>mke2fs -O journal_dev   /dev/mapper/home_jou_vol_grp-home_jou 400000
>>>mke2fs -E stride=16 -O sparse_super,dir_index -j -J
>>>device=/dev/mapper/home_jou_vol_grp-home_jou 
>>>/dev/mapper/home_vol_grp-home
>>>
>>>Any ideas?
>>>      
>>>
>>I believe the kernel does the journal device lookup by the device
>>major/minor, and those are not fixed for LVM devices.  
>>    
>>
>
>If the filesystem was _cleanly_ unmounted, you can try to remove/reattach the
>external journal. It will update the superblock with the new major/minor
>numbers.
>You can proceed as follows:
># tune2fs -f -O^has_journal /dev/mapper/home_vol_grp-home
># tune2fs -J device=/dev/mapper/home_jou_vol_grp-home_jou /dev/mapper/home_vol_grp-home
>
>It will work until the journal device's major/minor numbers change again 
>(the next reboot?).
>
>  
>
>>Bull recently posted a patch here for mount to automatically find the
>>correct block device for this journal UUID.
>>    
>>
>
>Actually, it was on ext2-devel:
>http://thread.gmane.org/gmane.comp.file-systems.ext2.devel/2950
>
>Johann
>  
>
>
> ------------------------------------------------------------------------
>
> Subject:
> [Ext2-devel] [PATCH] ext2: return FSID for statvfs
> From:
> Pekka Enberg <penberg at cs.helsinki.fi>
> Date:
> Tue, 06 Dec 2005 22:22:48 +0200
> To:
> akpm at osdl.org
> CC:
> linux-kernel at vger.kernel.org, ext2-devel at lists.sourceforge.net
>
>
>This patch changes ext2_statfs() to return a FSID based on least significant
>64-bits of the 128-bit filesystem UUID. This patch is a partial fix for
>Bugzilla Bug <http://bugzilla.kernel.org/show_bug.cgi?id=136>.
>
>Signed-off-by: Pekka Enberg <penberg at cs.helsinki.fi>
>---
>
> super.c |   13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
>Index: 2.6/fs/ext2/super.c
>===================================================================
>--- 2.6.orig/fs/ext2/super.c
>+++ 2.6/fs/ext2/super.c
>@@ -1038,6 +1038,7 @@ restore_opts:
> static int ext2_statfs (struct super_block * sb, struct kstatfs * buf)
> {
> 	struct ext2_sb_info *sbi = EXT2_SB(sb);
>+	struct ext2_super_block *es = sbi->s_es;
> 	unsigned long overhead;
> 	int i;
> 
>@@ -1052,7 +1053,7 @@ static int ext2_statfs (struct super_blo
> 		 * All of the blocks before first_data_block are
> 		 * overhead
> 		 */
>-		overhead = le32_to_cpu(sbi->s_es->s_first_data_block);
>+		overhead = le32_to_cpu(es->s_first_data_block);
> 
> 		/*
> 		 * Add the overhead attributed to the superblock and
>@@ -1073,14 +1074,16 @@ static int ext2_statfs (struct super_blo
> 
> 	buf->f_type = EXT2_SUPER_MAGIC;
> 	buf->f_bsize = sb->s_blocksize;
>-	buf->f_blocks = le32_to_cpu(sbi->s_es->s_blocks_count) - overhead;
>+	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
> 	buf->f_bfree = ext2_count_free_blocks(sb);
>-	buf->f_bavail = buf->f_bfree - le32_to_cpu(sbi->s_es->s_r_blocks_count);
>-	if (buf->f_bfree < le32_to_cpu(sbi->s_es->s_r_blocks_count))
>+	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
>+	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
> 		buf->f_bavail = 0;
>-	buf->f_files = le32_to_cpu(sbi->s_es->s_inodes_count);
>+	buf->f_files = le32_to_cpu(es->s_inodes_count);
> 	buf->f_ffree = ext2_count_free_inodes (sb);
> 	buf->f_namelen = EXT2_NAME_LEN;
>+	buf->f_fsid.val[0] = le32_to_cpup((void *)es->s_uuid);
>+	buf->f_fsid.val[1] = le32_to_cpup((void *)es->s_uuid + sizeof(u32));
> 	return 0;
> }
> 
>
>
>
>
>  
>


-- 
Franco Bladilo
Linux/HPCC Administrator
Research Computing Support Group
Rice University
bladilo at rice.edu


From johann.lombardi at bull.net  Thu Jan 26 17:11:48 2006
From: johann.lombardi at bull.net (Johann Lombardi)
Date: Thu, 26 Jan 2006 18:11:48 +0100
Subject: EXT3: failed to claim external journal device.
In-Reply-To: <43D8F685.2060701@rice.edu>
References: <43D6D08E.2060107@rice.edu>
	<200601261052.35134.johann.lombardi@bull.net>
	<43D8F685.2060701@rice.edu>
Message-ID: <200601261811.48527.johann.lombardi@bull.net>

> Has the mount/util-linux external journal patch been accepted ?
Not at the moment.



From adilger at clusterfs.com  Fri Jan 27 01:03:04 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 26 Jan 2006 18:03:04 -0700
Subject: ext3 fs errors 3T fs
In-Reply-To: <20060123085621.X57575@periphery.2inches.com>
References: <E1EzzVc-0002iK-44@BlackLife.kmv.ru>
	<20060120091159.H84552@periphery.2inches.com>
	<20060120225709.A95489@periphery.2inches.com>
	<20060122192525.GL4124@schatzie.adilger.int>
	<20060123085621.X57575@periphery.2inches.com>
Message-ID: <20060127010304.GY11642@schatzie.adilger.int>

On Jan 23, 2006  09:09 -0800, Dennis Williams wrote:
> I was able to isolate the problem to 2 different directories repeatedly.
> Both of them were in the lost+found directory.  I ran "stat {path}" in
> debugfs. on them but did not see any info that stood out as abnormal.
> When I get access to the system again, I will repost the output.

What would be of interest is the block numbers of the lost+found dir,
and all of the files therein.  Anything with a block number > 250M
(at the 2TB =  4B sector boundary) would be of interest.
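
The block number that corresponds to this boundary depends on the filesystem
block size, which is not stated in this thread. A small sketch of the
arithmetic, assuming 512-byte sectors so that 2^32 sectors is 2 TiB; the
function name is made up for the illustration.

```python
SECTOR = 512  # bytes; assumed sector size

def boundary_block(boundary_bytes, block_size):
    """First filesystem block that starts at or beyond the byte boundary."""
    return -(-boundary_bytes // block_size)  # ceiling division

two_tib = 2**32 * SECTOR  # 2**41 bytes
for block_size in (1024, 2048, 4096):
    print(block_size, boundary_block(two_tib, block_size))
```

With 4 kB blocks this gives block 536870912 (2^29). 2^28, about 268 million,
would be the 1 TiB mark at the same block size, so it may be worth looking
at block numbers around both values.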

> > I think debugging it would be easiest if you had a backup and were
> > willing to overwrite the device with a test pattern.
> 
> I would like to debug this situation when I get backup storage.  What
> steps would you recommend to do this?

If possible, it would be desirable to isolate the exact operation that
is causing the corruption.  Since we are fairly sure it is corrupting
the beginning of the filesystem (which likely aliases to just beyond
the 2TB device boundary) we could do a test like the following:

- do a backup of the first, say, 128kB of the device with dd
- read 50MB of data at 2TB offset
- compare this data - it should probably not be the same
- rewrite out the 50MB of data beyond 2TB
- verify that the first 128kB of data in the device did not change
- do some operation on _one_ file in the lost+found
- verify that the first 128kB of data does not change
- run e2fsck
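
The read-only parts of these steps (take a fingerprint of the first 128kB,
compare it with data at the 2TB offset, check later that it has not changed)
can be sketched as follows. This is only an illustration: the helper names
are invented, the device path in the comments is hypothetical, and the
writing steps are deliberately left out.

```python
import hashlib

HEAD_LEN = 128 * 1024   # first 128 kB, as in the steps above
TWO_TIB = 2**41         # byte offset of the 2TB boundary
CHUNK = 1024 * 1024     # read in 1 MiB pieces

def region_digest(path, offset, length):
    """SHA-1 of up to `length` bytes starting at `offset`; stops at EOF."""
    h = hashlib.sha1()
    with open(path, "rb") as dev:
        dev.seek(offset)
        remaining = length
        while remaining > 0:
            data = dev.read(min(CHUNK, remaining))
            if not data:
                break
            h.update(data)
            remaining -= len(data)
    return h.hexdigest()

def head_unchanged(path, before_digest, length=HEAD_LEN):
    """True if the start of the device still hashes to `before_digest`."""
    return region_digest(path, 0, length) == before_digest

# Usage on a hypothetical device (needs read permission):
#   before = region_digest("/dev/md0", 0, HEAD_LEN)
#   alias  = region_digest("/dev/md0", TWO_TIB, HEAD_LEN)
#   print(before == alias)            # expected to be False
#   ... perform exactly one operation ...
#   print(head_unchanged("/dev/md0", before))
```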

I don't have anything else specific, just in the nature of "play around"
and see what breaks.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From fk at linuxburg.de  Mon Jan 30 18:38:54 2006
From: fk at linuxburg.de (Felix E. Klee)
Date: Mon, 30 Jan 2006 19:38:54 +0100
Subject: df reports false size
Message-ID: <200601301938.54145.fk@linuxburg.de>

On a customer's machine running SuSE 9.2, the size of the occupied space on  
the harddisk is reported incorrectly by "df -h".  After we noticed the 
problem, I rebooted the machine and had it checked by "e2fsck" (check forced 
with "tune2fs -C 40", we are not on location).  Right after the reboot I 
proceeded as follows, but I could not find any information about the cause, 
and the problem is still there - see below.  That the value reported by "du 
-shx" is close to the correct one was verified by copying the data to an 
identical partition on a second harddisk: On this disk "du" and "df" both 
reported a size of about 4 GB, and not 7.6G, which is completely off the 
mark.

# df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.6G  7.0G  216M  98% /
# du -shx /
4.2G    /
# find / -xdev | wc -l
161021
# tune2fs -l /dev/sda1
tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          a3f40d6f-51be-448b-bf71-76292772fea0
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              1005888
Block count:              2010125
Reserved block count:     100506
Free blocks:              155746
Free inodes:              744793
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16224
Inode blocks per group:   507
Filesystem created:       Sat Nov  5 19:00:05 2005
Last mount time:          Mon Jan 30 13:28:19 2006
Last write time:          Mon Jan 30 13:28:19 2006
Mount count:              1
Maximum mount count:      39
Last checked:             Mon Jan 30 13:28:19 2006
Check interval:           15552000 (6 months)
Next check after:         Sat Jul 29 14:28:19 2006
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
First orphan inode:       357173
Default directory hash:   tea
Directory Hash Seed:      59ce6d12-990c-40ad-8268-212ae9bb8291
Journal backup:           inode blocks

Later we also tried out the following commands - apparently sparse files or 
unlinked files are not to blame:

# lsof -s | grep deleted
isam       6354   david    0r      REG        8,1         55     357173 /tmp/sh-thd-1138650835 (deleted)
vmware-vm 15452    arzt   48u      REG        8,1   11948032     357177 /tmp/ram0 (deleted)
# df --sync -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.6G  7.0G  212M  98% /
# du -shx --apparent-size /
3.9G    .

Any idea what may be the cause of the problem?

-- 
Dipl.-Phys. Felix E. Klee
Email: fk at linuxburg.de (work), felix.klee at inka.de (home)
Tel: +49 721 8307937, Fax: +49 721 8307936
Linuxburg, Goethestr. 15A, 76135 Karlsruhe, Germany



From adilger at clusterfs.com  Mon Jan 30 23:10:36 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 30 Jan 2006 16:10:36 -0700
Subject: df reports false size
In-Reply-To: <200601301938.54145.fk@linuxburg.de>
References: <200601301938.54145.fk@linuxburg.de>
Message-ID: <20060130231036.GR11642@schatzie.adilger.int>

On Jan 30, 2006  19:38 +0100, Felix E. Klee wrote:
> On a customer's machine running SuSE 9.2, the size of the occupied space on  
> the harddisk is reported incorrectly by "df -h".
> 
> # df -h /
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda1             7.6G  7.0G  216M  98% /
> # du -shx /
> 4.2G    /
> # find / -xdev | wc -l
> 161021

> # tune2fs -l /dev/sda1
> tune2fs 1.35 (28-Feb-2004)
> Inode count:              1005888
> Block count:              2010125
> Reserved block count:     100506
> Free blocks:              155746
> Free inodes:              744793
> First orphan inode:       357173

The "find | wc -l" definitely does not agree with the superblock info.
That reports (1005888 - 744793 = 261095) in use inodes, not 161021.
If those numbers agreed, I'd suspect some space leakage (though not
after an e2fsck run).  With 2.6 kernels the ext3 superblock info does
not get updated on disk, except at shutdown (though it would be nice to
have this done at, say, statfs time).

There are no EAs consuming blocks (this would be 4kB per file, so 1GB
in total for 250k files).
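
The arithmetic above, worked through with the numbers from the quoted
tune2fs output (block size 4096 is taken from the original post). The block
total includes filesystem metadata, so it is only roughly comparable with
what df prints; the variable names are mine.

```python
BLOCK_SIZE = 4096
inode_count, free_inodes = 1005888, 744793
block_count, free_blocks = 2010125, 155746

inodes_in_use = inode_count - free_inodes
blocks_in_use = block_count - free_blocks
used_gib = blocks_in_use * BLOCK_SIZE / 2**30

# Worst case for extended attributes: one 4 kB block for each of ~250k files.
ea_worst_case_gib = 250_000 * BLOCK_SIZE / 2**30

print(inodes_in_use)                # 261095
print(round(used_gib, 2))           # 7.07
print(round(ea_worst_case_gib, 2))  # 0.95
```

So the superblock counters agree with the 7.0G that df reports, and the gap
to du is about 3 GB, more than the EA estimate alone would explain.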

> Later we also tried out the following commands - apparently sparse files or 
> unlinked files are not to blame:
> 
> # lsof -s | grep deleted
> vmware-vm 15452    arzt   48u      REG        8,1   11948032     
> 357177 /tmp/ram0 (deleted)
> isam       6354   david    0r      REG        8,1         55     
> 357173 /tmp/sh-thd-1138650835 (deleted)

This is also the file shown in the orphan inode list, so at least it
is consistent.  I also wouldn't expect files to be orphaned after e2fsck.

The other thing you can do is run "dumpe2fs /dev/sda1" to see what the
block group descriptors report for free blocks/inodes.  You'd need some
scripting to add this up, but fairly easy.
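
A possible shape for that script, assuming each block group section of the
dumpe2fs output contains a line of the form "N free blocks, M free inodes,
K directories" (check this against the installed e2fsprogs version). The
sample text is made up for the illustration and the function name is mine.

```python
import re

GROUP_LINE = re.compile(r"(\d+) free blocks, (\d+) free inodes")

def sum_group_free(dumpe2fs_text):
    """Add up per-group free block and inode counts from dumpe2fs output."""
    blocks = inodes = 0
    for m in GROUP_LINE.finditer(dumpe2fs_text):
        blocks += int(m.group(1))
        inodes += int(m.group(2))
    return blocks, inodes

# Made-up two-group sample, not output from the machine in this thread.
sample = """\
Group 0: (Blocks 0-32767)
  Primary superblock at 0, Group descriptors at 1-1
  1200 free blocks, 16000 free inodes, 2 directories
Group 1: (Blocks 32768-65535)
  300 free blocks, 16224 free inodes, 0 directories
"""
print(sum_group_free(sample))  # (1500, 32224)
```

The totals can then be compared with the "Free blocks" and "Free inodes"
lines of the superblock summary.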

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



From fk at linuxburg.de  Tue Jan 31 09:36:41 2006
From: fk at linuxburg.de (Felix E. Klee)
Date: Tue, 31 Jan 2006 10:36:41 +0100
Subject: df reports false size
In-Reply-To: <20060130231036.GR11642@schatzie.adilger.int>
References: <200601301938.54145.fk@linuxburg.de>
	<20060130231036.GR11642@schatzie.adilger.int>
Message-ID: <200601311036.41707.fk@linuxburg.de>

Am Dienstag, 31. Januar 2006 00:10 schrieb Andreas Dilger:
> The "find | wc -l" definitely does not agree with the superblock info.
> That reports (1005888 - 744793 = 261095) in use inodes, not 161021.
> If those numbers agreed, I'd suspect some space leakage (though not
> after an e2fsck run).  With 2.6 kernels the ext3 superblock info does
> not get updated on disk, except at shutdown (though it would be nice to
> have this done at, say, statfs time).
>
> There are no EAs consuming blocks (this would be 4kB per file, so 1GB
> in total for 250k files).

I understand what you're after, but what's an EA?

> This is also the file shown in the orphan inode list, so at least it
> is consistent.  I also wouldn't expect files to be orphaned after e2fsck.
>
> The other thing you can do is run "dumpe2fs /dev/sda1" to see what the
> block group descriptors report for free blocks/inodes.  You'd need some
> scripting to add this up, but fairly easy.

Thanks for the hint.  However, before following your suggestions, I'd like to 
try something else:

In another ML someone mentioned that the problem could be caused by another
partition mounted on a non-empty subdirectory.  This sounds quite plausible,
especially since we're dealing with the partition containing the root
directory.  How do I get the complete list of files on an ext3 FS?

-- 
Dipl.-Phys. Felix E. Klee
Email: fk at linuxburg.de (work), felix.klee at inka.de (home)
Tel: +49 721 8307937, Fax: +49 721 8307936
Linuxburg, Goethestr. 15A, 76135 Karlsruhe, Germany



From bryan at kadzban.is-a-geek.net  Tue Jan 31 11:54:40 2006
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Tue, 31 Jan 2006 06:54:40 -0500
Subject: df reports false size
In-Reply-To: <200601311036.41707.fk@linuxburg.de>
References: <200601301938.54145.fk@linuxburg.de>	<20060130231036.GR11642@schatzie.adilger.int>
	<200601311036.41707.fk@linuxburg.de>
Message-ID: <43DF5000.8030408@kadzban.is-a-geek.net>

Felix E. Klee wrote:
> In another ML someone mentioned that the problem could be caused by
> another partition mounted on a non-empty subdirectory.  This sounds
> quite plausible, especially since we're dealing with partition
> containing the root directory. How do I get the complete list of
> files on an ext3 FS?

find / -xdev -type f

?

From fk at linuxburg.de  Tue Jan 31 12:06:51 2006
From: fk at linuxburg.de (Felix E. Klee)
Date: Tue, 31 Jan 2006 13:06:51 +0100
Subject: df reports false size
In-Reply-To: <43DF5000.8030408@kadzban.is-a-geek.net>
References: <200601301938.54145.fk@linuxburg.de>
	<200601311036.41707.fk@linuxburg.de>
	<43DF5000.8030408@kadzban.is-a-geek.net>
Message-ID: <200601311306.51975.fk@linuxburg.de>

Am Dienstag, 31. Januar 2006 12:54 schrieb Bryan Kadzban:
> find / -xdev -type f
>
> ?

It doesn't work if files are hidden by a mounted partition:

$ mkdir /tmp/foo
$ touch /tmp/foo/bar
$ find / -xdev | grep '^/tmp/foo/bar$'
/tmp/foo/bar
$ mount /dev/hdb1 /tmp/foo
$ find / -xdev | grep '^/tmp/foo/bar$'
[nothing found]

-- 
Dipl.-Phys. Felix E. Klee
Email: fk at linuxburg.de (work), felix.klee at inka.de (home)
Tel: +49 721 8307937, Fax: +49 721 8307936
Linuxburg, Goethestr. 15A, 76135 Karlsruhe, Germany



From fk at linuxburg.de  Tue Jan 31 12:31:20 2006
From: fk at linuxburg.de (Felix E. Klee)
Date: Tue, 31 Jan 2006 13:31:20 +0100
Subject: df reports false size
In-Reply-To: <200601311306.51975.fk@linuxburg.de>
References: <200601301938.54145.fk@linuxburg.de>
	<43DF5000.8030408@kadzban.is-a-geek.net>
	<200601311306.51975.fk@linuxburg.de>
Message-ID: <200601311331.20929.fk@linuxburg.de>

Am Dienstag, 31. Januar 2006 13:06 schrieb Felix E. Klee:
> It doesn't work if files are hidden by a mounted partition:

Hey I just found something cool: "debugfs".  Here one can see all files on the 
file system, even ones that are hidden by mounted partitions.  And, as it 
looks, this is indeed our problem:

$ mount | grep ' /nfsroot'
/dev/sda7 on /nfsroot type ext3 (rw,acl,user_xattr)

# debugfs /dev/sda1
debugfs 1.35 (28-Feb-2004)
debugfs:  ls /nfsroot
 519169  (12) .    2  (12) ..    519171  (4072) 9.2   

Problem most likely found!  Now, we need to solve it - fortunately someone is 
on location today.

-- 
Dipl.-Phys. Felix E. Klee
Email: fk at linuxburg.de (work), felix.klee at inka.de (home)
Tel: +49 721 8307937, Fax: +49 721 8307936
Linuxburg, Goethestr. 15A, 76135 Karlsruhe, Germany



From adilger at clusterfs.com  Tue Jan 31 17:59:24 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 31 Jan 2006 10:59:24 -0700
Subject: df reports false size
In-Reply-To: <200601311331.20929.fk@linuxburg.de>
References: <200601301938.54145.fk@linuxburg.de>
	<43DF5000.8030408@kadzban.is-a-geek.net>
	<200601311306.51975.fk@linuxburg.de>
	<200601311331.20929.fk@linuxburg.de>
Message-ID: <20060131175924.GA11642@schatzie.adilger.int>

On Jan 31, 2006  13:31 +0100, Felix E. Klee wrote:
> Hey I just found something cool: "debugfs".  Here one can see all files on the 
> file system, even ones that are hidden by mounted partitions.  And, as it 
> looks, this is indeed our problem:
> 
> $ mount | grep ' /nfsroot'
> /dev/sda7 on /nfsroot type ext3 (rw,acl,user_xattr)
> 
> # debugfs /dev/sda1
> debugfs 1.35 (28-Feb-2004)
> debugfs:  ls /nfsroot
>  519169  (12) .    2  (12) ..    519171  (4072) 9.2   
> 
> Problem most likely found!  Now, we need to solve it - fortunately someone is 
> on location today.

You can use "mount --bind / /mnt" and then "/mnt/nfsroot" will be the
underlying directory.
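
To know which directories are worth inspecting under such a bind mount, one
can list the mount points that sit below the root filesystem. A small
sketch, assuming text in the usual /proc/mounts layout (device, mount point,
type, options, ...); the function name and the sample lines are illustrative.

```python
def mounts_below(root, proc_mounts_text):
    """Mount points strictly below `root`, from /proc/mounts-style text."""
    base = root.rstrip("/")
    prefix = base + "/"
    found = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        mount_point = fields[1]
        # Keep anything under root, but not root itself.
        if mount_point.startswith(prefix) and mount_point.rstrip("/") != base:
            found.add(mount_point)
    return sorted(found)

# Illustrative sample; on a live system use open("/proc/mounts").read().
sample = """\
/dev/sda1 / ext3 rw 0 0
proc /proc proc rw 0 0
/dev/sda7 /nfsroot ext3 rw,acl,user_xattr 0 0
"""
print(mounts_below("/", sample))  # ['/nfsroot', '/proc']
```

Each directory in the result hides whatever the root filesystem holds at
that path, so running du on the corresponding path under /mnt shows how much
space is tied up there.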

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.