From codevana at gmail.com  Wed Jun  4 02:43:18 2008
From: codevana at gmail.com (Srinivas Murthy)
Date: Tue, 3 Jun 2008 19:43:18 -0700
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
Message-ID: <df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>

Hi

I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system.

The crash is intermittent and seems to happen w/ md raid 1 sync.

As you can see one of the cpu's is running the md_thread while the
other was in kjournald. Is there a known race condn between kjournald
and md_thread threads?

Anyone knows the fix for this?

Thanks.



<6>md: md0 stopped.
<6>md: bind<sda1>
<6>md: bind<sdb1>
<6>raid1: raid set md0 active with 2 out of 2 mirrors
<6>md: md1 stopped.
<6>md: bind<sda2>
<6>md: bind<sdb2>
<6>raid1: raid set md1 active with 2 out of 2 mirrors
<6>md: md2 stopped.
<6>md: bind<sda3>
<6>md: bind<sdb3>
<6>raid1: raid set md2 active with 2 out of 2 mirrors
<6>md: md3 stopped.
<6>md: bind<sda5>
<6>md: bind<sdb5>
<6>raid1: raid set md3 active with 2 out of 2 mirrors
<6>md: md4 stopped.
<6>md: bind<sda6>
<6>md: bind<sdb6>
<6>raid1: raid set md4 active with 2 out of 2 mirrors
<6>md: md5 stopped.
<6>md: bind<sda7>
<6>md: bind<sdb7>
<3>md: md5: raid array is not clean -- starting background reconstruction
<6>raid1: raid set md5 active with 2 out of 2 mirrors
<6>md: resync of RAID array md5
<6>md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
<6>md: using maximum available idle IO bandwidth (but not more than
20000 KB/sec) for resync.
<6>md: using 128k window, over a total of 7339904 blocks.
<6>md: md6 stopped.
<6>md: bind<sda8>
<6>md: bind<sdb8>
<6>raid1: raid set md6 active with 2 out of 2 mirrors
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3-fs: mounted filesystem with ordered data mode.
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3-fs: mounted filesystem with ordered data mode.
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS on md2, internal journal
<6>EXT3-fs: mounted filesystem with ordered data mode.
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS on md5, internal journal
<6>EXT3-fs: mounted filesystem with ordered data mode.
<6>EXT3 FS on md0, internal journal
<6>Adding 4138872k swap on /dev/md3.  Priority:-1 extents:1 across:4138872k
<6>EXT3 FS on md0, internal journal
<6>bonding: bond0: setting mode to balance-rr (0).
<6>tg3: eth0: Link is up at 1000 Mbps, full duplex.
<6>tg3: eth0: Flow control is off for TX and off for RX.
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS on md4, internal journal
<6>EXT3-fs: mounted filesystem with ordered data mode.
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS on md6, internal journal
<6>EXT3-fs: mounted filesystem with ordered data mode.
<0>Assertion failure in journal_commit_transaction() at
fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
<0>------------[ cut here ]------------
<2>kernel BUG at fs/jbd/commit.c:693!
<0>invalid opcode: 0000 [#1]
<0>PREEMPT SMP
<0>CPU:    1
<0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI
<0>EFLAGS: 00010296   (2.6.23.waas #4)
<0>EIP is at journal_commit_transaction+0x879/0xe00
<0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
<0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
<0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
<0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
<0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
00000000 f7f63414
<0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
c6402000 00000000
<0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
00000202 c70f8000
<0>Call Trace:
<0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
<0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
<0> [<c0203d46>] show_registers+0x1d6/0x340
<0> [<c020403d>] die+0x10d/0x220
<0> [<c02041e1>] do_trap+0x91/0xd0
<0> [<c0204419>] do_invalid_op+0x89/0xa0
<0> [<c06317e2>] error_code+0x72/0x78
<0> [<c02c3845>] kjournald+0xb5/0x1f0
<0> [<c0232a5c>] kthread+0x5c/0xa0
<0> [<c020388b>] kernel_thread_helper+0x7/0x1c
<0> =======================
<0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
<0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
<0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
<6>SysRq : Changing Loglevel
<4>Loglevel set to 7

[0]kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0-1
Stack traceback for pid 1609
0xc69ce000     1609        2  1    0   R  0xc69ce1e0 *md5_resync
esp        eip        Function (args)
0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060)
0xc69e5d78 0xc028ffbe bio_alloc+0xe
0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0)
0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid)
0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0,
0xc69e5ea0, 0x0)
0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00)
0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80)
0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid)
Stack traceback for pid 1684
0xc39db580     1684        2  1    1   R  0xc39db760  kjournald
esp        eip        Function (args)
kdb_bb: address 0xffffffff not recognised
Using old style backtrace, unreliable with no arguments
esp        eip        Function (args)
0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879
0xc6549f28 0xc0227945 lock_timer_base+0x25
0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a
0xc6549f60 0xc02c3845 kjournald+0xb5
0xc6549f88 0xc0233040 autoremove_wake_function
0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1
0xc6549fa8 0xc0233040 autoremove_wake_function
[0]kdb>



From sandeen at redhat.com  Wed Jun  4 02:47:06 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 03 Jun 2008 21:47:06 -0500
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
Message-ID: <4846022A.1010707@redhat.com>

Srinivas Murthy wrote:
> Hi
> 
> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system.

> <0>Assertion failure in journal_commit_transaction() at
> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
> <0>------------[ cut here ]------------
> <2>kernel BUG at fs/jbd/commit.c:693!
> <0>invalid opcode: 0000 [#1]
> <0>PREEMPT SMP
> <0>CPU:    1
> <0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI

What's the proprietary kernel; does it happen without the tainted kernel?

-Eric

> <0>EFLAGS: 00010296   (2.6.23.waas #4)
> <0>EIP is at journal_commit_transaction+0x879/0xe00
> <0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
> <0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
> <0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
> 00000000 f7f63414
> <0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
> c6402000 00000000
> <0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
> 00000202 c70f8000
> <0>Call Trace:
> <0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
> <0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
> <0> [<c0203d46>] show_registers+0x1d6/0x340
> <0> [<c020403d>] die+0x10d/0x220
> <0> [<c02041e1>] do_trap+0x91/0xd0
> <0> [<c0204419>] do_invalid_op+0x89/0xa0
> <0> [<c06317e2>] error_code+0x72/0x78
> <0> [<c02c3845>] kjournald+0xb5/0x1f0
> <0> [<c0232a5c>] kthread+0x5c/0xa0
> <0> [<c020388b>] kernel_thread_helper+0x7/0x1c
> <0> =======================
> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
> <0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
> <6>SysRq : Changing Loglevel
> <4>Loglevel set to 7
> 
> [0]kdb> btc
> btc: cpu status: Currently on cpu 0
> Available cpus: 0-1
> Stack traceback for pid 1609
> 0xc69ce000     1609        2  1    0   R  0xc69ce1e0 *md5_resync
> esp        eip        Function (args)
> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060)
> 0xc69e5d78 0xc028ffbe bio_alloc+0xe
> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0)
> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid)
> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0,
> 0xc69e5ea0, 0x0)
> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00)
> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80)
> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid)
> Stack traceback for pid 1684
> 0xc39db580     1684        2  1    1   R  0xc39db760  kjournald
> esp        eip        Function (args)
> kdb_bb: address 0xffffffff not recognised
> Using old style backtrace, unreliable with no arguments
> esp        eip        Function (args)
> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879
> 0xc6549f28 0xc0227945 lock_timer_base+0x25
> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a
> 0xc6549f60 0xc02c3845 kjournald+0xb5
> 0xc6549f88 0xc0233040 autoremove_wake_function
> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1
> 0xc6549fa8 0xc0233040 autoremove_wake_function
> [0]kdb>
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users



From codevana at gmail.com  Wed Jun  4 02:49:31 2008
From: codevana at gmail.com (Srinivas Murthy)
Date: Tue, 3 Jun 2008 19:49:31 -0700
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <4846022A.1010707@redhat.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
	<4846022A.1010707@redhat.com>
Message-ID: <df1bdeb10806031949i1aa5c227p69b1587fdb8563c8@mail.gmail.com>

The changes we have are in the networking part. Nothing in the fs or
block layers.

Thanks,
_Sri





On Tue, Jun 3, 2008 at 7:47 PM, Eric Sandeen <sandeen at redhat.com> wrote:
> Srinivas Murthy wrote:
>> Hi
>>
>> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system.
>
>> <0>Assertion failure in journal_commit_transaction() at
>> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
>> <0>------------[ cut here ]------------
>> <2>kernel BUG at fs/jbd/commit.c:693!
>> <0>invalid opcode: 0000 [#1]
>> <0>PREEMPT SMP
>> <0>CPU:    1
>> <0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI
>
> What's the proprietary kernel; does it happen without the tainted kernel?
>
> -Eric
>
>> <0>EFLAGS: 00010296   (2.6.23.waas #4)
>> <0>EIP is at journal_commit_transaction+0x879/0xe00
>> <0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
>> <0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
>> <0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
>> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
>> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
>> 00000000 f7f63414
>> <0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
>> c6402000 00000000
>> <0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
>> 00000202 c70f8000
>> <0>Call Trace:
>> <0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
>> <0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
>> <0> [<c0203d46>] show_registers+0x1d6/0x340
>> <0> [<c020403d>] die+0x10d/0x220
>> <0> [<c02041e1>] do_trap+0x91/0xd0
>> <0> [<c0204419>] do_invalid_op+0x89/0xa0
>> <0> [<c06317e2>] error_code+0x72/0x78
>> <0> [<c02c3845>] kjournald+0xb5/0x1f0
>> <0> [<c0232a5c>] kthread+0x5c/0xa0
>> <0> [<c020388b>] kernel_thread_helper+0x7/0x1c
>> <0> =======================
>> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
>> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
>> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
>> <0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
>> <6>SysRq : Changing Loglevel
>> <4>Loglevel set to 7
>>
>> [0]kdb> btc
>> btc: cpu status: Currently on cpu 0
>> Available cpus: 0-1
>> Stack traceback for pid 1609
>> 0xc69ce000     1609        2  1    0   R  0xc69ce1e0 *md5_resync
>> esp        eip        Function (args)
>> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060)
>> 0xc69e5d78 0xc028ffbe bio_alloc+0xe
>> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0)
>> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid)
>> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0,
>> 0xc69e5ea0, 0x0)
>> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00)
>> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80)
>> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid)
>> Stack traceback for pid 1684
>> 0xc39db580     1684        2  1    1   R  0xc39db760  kjournald
>> esp        eip        Function (args)
>> kdb_bb: address 0xffffffff not recognised
>> Using old style backtrace, unreliable with no arguments
>> esp        eip        Function (args)
>> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879
>> 0xc6549f28 0xc0227945 lock_timer_base+0x25
>> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a
>> 0xc6549f60 0xc02c3845 kjournald+0xb5
>> 0xc6549f88 0xc0233040 autoremove_wake_function
>> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1
>> 0xc6549fa8 0xc0233040 autoremove_wake_function
>> [0]kdb>
>>
>> _______________________________________________
>> Ext3-users mailing list
>> Ext3-users at redhat.com
>> https://www.redhat.com/mailman/listinfo/ext3-users
>
>



From sandeen at redhat.com  Wed Jun  4 02:58:23 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 03 Jun 2008 21:58:23 -0500
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <df1bdeb10806031949i1aa5c227p69b1587fdb8563c8@mail.gmail.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>	
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>	
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>	
	<4846022A.1010707@redhat.com>
	<df1bdeb10806031949i1aa5c227p69b1587fdb8563c8@mail.gmail.com>
Message-ID: <484604CF.10709@redhat.com>

Srinivas Murthy wrote:
> The changes we have are in the networking part. Nothing in the fs or
> block layers.
> 
> Thanks,
> _Sri
> 

Ok - and does it still happen without the taint? :)

networking can corrupt memory as well as anything else.

I'm not saying that's it for sure but it's worth testing.

-Eric

> 
> 
> 
> On Tue, Jun 3, 2008 at 7:47 PM, Eric Sandeen <sandeen at redhat.com> wrote:
>> Srinivas Murthy wrote:
>>> Hi
>>>
>>> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system.
>>> <0>Assertion failure in journal_commit_transaction() at
>>> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
>>> <0>------------[ cut here ]------------
>>> <2>kernel BUG at fs/jbd/commit.c:693!
>>> <0>invalid opcode: 0000 [#1]
>>> <0>PREEMPT SMP
>>> <0>CPU:    1
>>> <0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI
>> What's the proprietary kernel; does it happen without the tainted kernel?
>>
>> -Eric
>>
>>> <0>EFLAGS: 00010296   (2.6.23.waas #4)
>>> <0>EIP is at journal_commit_transaction+0x879/0xe00
>>> <0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
>>> <0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
>>> <0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
>>> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
>>> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
>>> 00000000 f7f63414
>>> <0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
>>> c6402000 00000000
>>> <0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
>>> 00000202 c70f8000
>>> <0>Call Trace:
>>> <0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
>>> <0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
>>> <0> [<c0203d46>] show_registers+0x1d6/0x340
>>> <0> [<c020403d>] die+0x10d/0x220
>>> <0> [<c02041e1>] do_trap+0x91/0xd0
>>> <0> [<c0204419>] do_invalid_op+0x89/0xa0
>>> <0> [<c06317e2>] error_code+0x72/0x78
>>> <0> [<c02c3845>] kjournald+0xb5/0x1f0
>>> <0> [<c0232a5c>] kthread+0x5c/0xa0
>>> <0> [<c020388b>] kernel_thread_helper+0x7/0x1c
>>> <0> =======================
>>> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
>>> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
>>> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
>>> <0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
>>> <6>SysRq : Changing Loglevel
>>> <4>Loglevel set to 7
>>>
>>> [0]kdb> btc
>>> btc: cpu status: Currently on cpu 0
>>> Available cpus: 0-1
>>> Stack traceback for pid 1609
>>> 0xc69ce000     1609        2  1    0   R  0xc69ce1e0 *md5_resync
>>> esp        eip        Function (args)
>>> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060)
>>> 0xc69e5d78 0xc028ffbe bio_alloc+0xe
>>> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0)
>>> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid)
>>> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0,
>>> 0xc69e5ea0, 0x0)
>>> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00)
>>> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80)
>>> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid)
>>> Stack traceback for pid 1684
>>> 0xc39db580     1684        2  1    1   R  0xc39db760  kjournald
>>> esp        eip        Function (args)
>>> kdb_bb: address 0xffffffff not recognised
>>> Using old style backtrace, unreliable with no arguments
>>> esp        eip        Function (args)
>>> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879
>>> 0xc6549f28 0xc0227945 lock_timer_base+0x25
>>> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a
>>> 0xc6549f60 0xc02c3845 kjournald+0xb5
>>> 0xc6549f88 0xc0233040 autoremove_wake_function
>>> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1
>>> 0xc6549fa8 0xc0233040 autoremove_wake_function
>>> [0]kdb>
>>>
>>> _______________________________________________
>>> Ext3-users mailing list
>>> Ext3-users at redhat.com
>>> https://www.redhat.com/mailman/listinfo/ext3-users
>>



From codevana at gmail.com  Wed Jun  4 03:04:18 2008
From: codevana at gmail.com (Srinivas Murthy)
Date: Tue, 3 Jun 2008 20:04:18 -0700
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <484604CF.10709@redhat.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
	<4846022A.1010707@redhat.com>
	<df1bdeb10806031949i1aa5c227p69b1587fdb8563c8@mail.gmail.com>
	<484604CF.10709@redhat.com>
Message-ID: <df1bdeb10806032004v3050e8c1wf08f0faf4017351a@mail.gmail.com>

Sorry. Understand.

Yes, I am told it does. No way to be sure.


On Tue, Jun 3, 2008 at 7:58 PM, Eric Sandeen <sandeen at redhat.com> wrote:
> Srinivas Murthy wrote:
>> The changes we have are in the networking part. Nothing in the fs or
>> block layers.
>>
>> Thanks,
>> _Sri
>>
>
> Ok - and does it still happen without the taint? :)
>
> networking can corrupt memory as well as anything else.
>
> I'm not saying that's it for sure but it's worth testing.
>
> -Eric
>
>>
>>
>>
>> On Tue, Jun 3, 2008 at 7:47 PM, Eric Sandeen <sandeen at redhat.com> wrote:
>>> Srinivas Murthy wrote:
>>>> Hi
>>>>
>>>> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system.
>>>> <0>Assertion failure in journal_commit_transaction() at
>>>> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
>>>> <0>------------[ cut here ]------------
>>>> <2>kernel BUG at fs/jbd/commit.c:693!
>>>> <0>invalid opcode: 0000 [#1]
>>>> <0>PREEMPT SMP
>>>> <0>CPU:    1
>>>> <0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI
>>> What's the proprietary kernel; does it happen without the tainted kernel?
>>>
>>> -Eric
>>>
>>>> <0>EFLAGS: 00010296   (2.6.23.waas #4)
>>>> <0>EIP is at journal_commit_transaction+0x879/0xe00
>>>> <0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
>>>> <0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
>>>> <0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
>>>> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
>>>> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
>>>> 00000000 f7f63414
>>>> <0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
>>>> c6402000 00000000
>>>> <0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
>>>> 00000202 c70f8000
>>>> <0>Call Trace:
>>>> <0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
>>>> <0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
>>>> <0> [<c0203d46>] show_registers+0x1d6/0x340
>>>> <0> [<c020403d>] die+0x10d/0x220
>>>> <0> [<c02041e1>] do_trap+0x91/0xd0
>>>> <0> [<c0204419>] do_invalid_op+0x89/0xa0
>>>> <0> [<c06317e2>] error_code+0x72/0x78
>>>> <0> [<c02c3845>] kjournald+0xb5/0x1f0
>>>> <0> [<c0232a5c>] kthread+0x5c/0xa0
>>>> <0> [<c020388b>] kernel_thread_helper+0x7/0x1c
>>>> <0> =======================
>>>> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
>>>> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
>>>> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
>>>> <0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
>>>> <6>SysRq : Changing Loglevel
>>>> <4>Loglevel set to 7
>>>>
>>>> [0]kdb> btc
>>>> btc: cpu status: Currently on cpu 0
>>>> Available cpus: 0-1
>>>> Stack traceback for pid 1609
>>>> 0xc69ce000     1609        2  1    0   R  0xc69ce1e0 *md5_resync
>>>> esp        eip        Function (args)
>>>> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060)
>>>> 0xc69e5d78 0xc028ffbe bio_alloc+0xe
>>>> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0)
>>>> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid)
>>>> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0,
>>>> 0xc69e5ea0, 0x0)
>>>> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00)
>>>> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80)
>>>> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid)
>>>> Stack traceback for pid 1684
>>>> 0xc39db580     1684        2  1    1   R  0xc39db760  kjournald
>>>> esp        eip        Function (args)
>>>> kdb_bb: address 0xffffffff not recognised
>>>> Using old style backtrace, unreliable with no arguments
>>>> esp        eip        Function (args)
>>>> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879
>>>> 0xc6549f28 0xc0227945 lock_timer_base+0x25
>>>> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a
>>>> 0xc6549f60 0xc02c3845 kjournald+0xb5
>>>> 0xc6549f88 0xc0233040 autoremove_wake_function
>>>> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1
>>>> 0xc6549fa8 0xc0233040 autoremove_wake_function
>>>> [0]kdb>
>>>>
>>>> _______________________________________________
>>>> Ext3-users mailing list
>>>> Ext3-users at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/ext3-users
>>>
>
>



From sandeen at redhat.com  Wed Jun  4 03:06:17 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 03 Jun 2008 22:06:17 -0500
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
Message-ID: <484606A9.3070403@redhat.com>

Srinivas Murthy wrote:

> <6>EXT3-fs: mounted filesystem with ordered data mode.
> <0>Assertion failure in journal_commit_transaction() at
> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
> <0>------------[ cut here ]------------
> <2>kernel BUG at fs/jbd/commit.c:693!
> <0>invalid opcode: 0000 [#1]
> <0>PREEMPT SMP
> <0>CPU:    1
> <0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI
> <0>EFLAGS: 00010296   (2.6.23.waas #4)
> <0>EIP is at journal_commit_transaction+0x879/0xe00
> <0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
> <0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
> <0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
> 00000000 f7f63414
> <0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
> c6402000 00000000
> <0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
> 00000202 c70f8000
> <0>Call Trace:
> <0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
> <0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
> <0> [<c0203d46>] show_registers+0x1d6/0x340
> <0> [<c020403d>] die+0x10d/0x220
> <0> [<c02041e1>] do_trap+0x91/0xd0
> <0> [<c0204419>] do_invalid_op+0x89/0xa0
> <0> [<c06317e2>] error_code+0x72/0x78
> <0> [<c02c3845>] kjournald+0xb5/0x1f0
> <0> [<c0232a5c>] kthread+0x5c/0xa0
> <0> [<c020388b>] kernel_thread_helper+0x7/0x1c
> <0> =======================
> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
> <0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
> <6>SysRq : Changing Loglevel
> <4>Loglevel set to 7
> 
> [0]kdb> btc
> btc: cpu status: Currently on cpu 0

Also, I'd backtrace pid 1684 (kjournald) and dump the bh, see what it
looks like...

kdb> btp 1684
kdb> bh <whatever the address of the buffer head is>

if i remember correctly...

-Eric



From codevana at gmail.com  Wed Jun  4 03:29:51 2008
From: codevana at gmail.com (Srinivas Murthy)
Date: Tue, 3 Jun 2008 20:29:51 -0700
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <484606A9.3070403@redhat.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
	<484606A9.3070403@redhat.com>
Message-ID: <df1bdeb10806032029y6ceee314x96f449118da52177@mail.gmail.com>

[0]kdb> btp 1684
Stack traceback for pid 1684
0xc39db580     1684        2  1    1   R  0xc39db760  kjournald
esp        eip        Function (args)
kdb_bb: address 0xffffffff not recognised
Using old style backtrace, unreliable with no arguments
esp        eip        Function (args)
0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879
0xc6549f28 0xc0227945 lock_timer_base+0x25
0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a
0xc6549f60 0xc02c3845 kjournald+0xb5
0xc6549f88 0xc0233040 autoremove_wake_function
0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1
0xc6549fa8 0xc0233040 autoremove_wake_function


Based on this code below :

0xc02c10e3 journal_commit_transaction+0x813:   jmp    0xc02c10e3
journal_commit_transaction+0x813
0xc02c10e5 journal_commit_transaction+0x815:   movl   $0xc0651de8,(%esp)
0xc02c10ec journal_commit_transaction+0x81c:   mov    $0xc0651e44,%ecx
0xc02c10f1 journal_commit_transaction+0x821:   mov    $0xc0651dcd,%edx
0xc02c10f6 journal_commit_transaction+0x826:   mov    %ecx,0x8(%esp)
0xc02c10fa journal_commit_transaction+0x82a:   mov    $0xc0651f8a,%esi
0xc02c10ff journal_commit_transaction+0x82f:   mov    $0x2bd,%ebx
0xc02c1104 journal_commit_transaction+0x834:   mov    %esi,0x10(%esp)
0xc02c1108 journal_commit_transaction+0x838:   mov    %ebx,0xc(%esp)
0xc02c110c journal_commit_transaction+0x83c:   mov    %edx,0x4(%esp)
0xc02c1110 journal_commit_transaction+0x840:   call   0xc021efa0 printk
[0]kdb>
0xc02c1115 journal_commit_transaction+0x845:   ud2a
0xc02c1117 journal_commit_transaction+0x847:   jmp    0xc02c1117
journal_commit_transaction+0x847
0xc02c1119 journal_commit_transaction+0x849:   movl   $0xc0651de8,(%esp)
0xc02c1120 journal_commit_transaction+0x850:   mov    $0xc0651fa0,%eax
0xc02c1125 journal_commit_transaction+0x855:   mov    $0xc0651dcd,%edi
0xc02c112a journal_commit_transaction+0x85a:   mov    %eax,0x10(%esp)
0xc02c112e journal_commit_transaction+0x85e:   mov    $0x2b5,%eax
0xc02c1133 journal_commit_transaction+0x863:   mov    %eax,0xc(%esp)
0xc02c1137 journal_commit_transaction+0x867:   mov    $0xc0651e44,%eax
0xc02c113c journal_commit_transaction+0x86c:   mov    %edi,0x4(%esp)
0xc02c1140 journal_commit_transaction+0x870:   mov    %eax,0x8(%esp)
0xc02c1144 journal_commit_transaction+0x874:   call   0xc021efa0 printk
0xc02c1149 journal_commit_transaction+0x879:   ud2a
0xc02c114b journal_commit_transaction+0x87b:   jmp    0xc02c114b
journal_commit_transaction+0x87b
0xc02c114d journal_commit_transaction+0x87d:   mov    0x34(%ebx),%eax
0xc02c1150 journal_commit_transaction+0x880:   test   %eax,%eax
[0]kdb>
0xc02c1152 journal_commit_transaction+0x882:   jne    0xc02c11a2
journal_commit_transaction+0x8d2
0xc02c1154 journal_commit_transaction+0x884:   mov    0x38(%ebx),%edx
0xc02c1157 journal_commit_transaction+0x887:   test   %edx,%edx
0xc02c1159 journal_commit_transaction+0x889:   je     0xc02c11fd
journal_commit_transaction+0x92d
0xc02c115f journal_commit_transaction+0x88f:   mov    0x24(%edx),%edi
0xc02c1162 journal_commit_transaction+0x892:   mov    (%edi),%esi
0xc02c1164 journal_commit_transaction+0x894:   mov    (%esi),%eax
0xc02c1166 journal_commit_transaction+0x896:   test   $0x4,%al
0xc02c1168 journal_commit_transaction+0x898:   jne    0xc02c11e0
journal_commit_transaction+0x910
0xc02c116a journal_commit_transaction+0x89a:   call   0xc06302f0 cond_resched
0xc02c116f journal_commit_transaction+0x89f:   test   %eax,%eax
0xc02c1171 journal_commit_transaction+0x8a1:   jne    0xc02c1154
journal_commit_transaction+0x884
0xc02c1173 journal_commit_transaction+0x8a3:   mov    (%esi),%eax
0xc02c1175 journal_commit_transaction+0x8a5:   test   $0x1,%al
0xc02c1177 journal_commit_transaction+0x8a7:   mov    $0xfffffffb,%eax
0xc02c117c journal_commit_transaction+0x8ac:   cmovne 0xffffff98(%ebp),%eax
[0]kdb> rd
eax = 0x00000096 ebx = 0xf76bcf00 ecx = 0xffffffff edx = 0xf7588ac0
esi = 0xf6c66f88 edi = 0xc0651dcd esp = 0xc6549ec4 eip = 0xc02c1149
ebp = 0xc6549f5c xss = 0xc0580068 xcs = 0x00000060 eflags = 0x00010296
xds = 0xc065007b xes = 0xc654007b origeax = 0xffffffff &regs = 0xc6549e8c

and,

(gdb) p &(((struct buffer_head *)0)->b_count)
$1 = (atomic_t *) 0x34

I think bh is, 0xf76bcf00

but,

[0]kdb> md 0xf76bcf00
0xf76bcf00 f7f63400 00701310 00000004 000001ca   .4....p.........
0xf76bcf10 00000000 00000000 00000000 00000000   ................
0xf76bcf20 00000000 c6320b98 00000000 00000000   ......2.........
0xf76bcf30 00000000 f7386498 f7386b28 00000001   .....d8.(k8.....
0xf76bcf40 00000000 00000000 00000000 00000000   ................
0xf76bcf50 00000000 ffffefab 00000008 00000000   ................
0xf76bcf60 f76bc4e0 00100100 00200200 f76bcf70   ..k....... .p.k.
0xf76bcf70 00000001 00000000 f88eef70 f76bcf7c   ........p...|.k.
[0]kdb>
0xf76bcf80 f76bcf7c f76251e0 0000000d 0011ffff   |.k..Qb.........
0xf76bcf90 00000000 00000001 00000000 00000000   ................
0xf76bcfa0 00000000 f7ae9840 f88f140c deadc0de   .... at ...........
0xf76bcfb0 00000019 00000000 00000000 00000004   ................
0xf76bcfc0 00000000 00000000 00000000 00000000   ................
0xf76bcfd0-0xf76bcfef zero suppressed
0xf76bcff0 00000000 00000000 00000000 00000000   ................
[0]kdb>
0xf76bd000 00000000 00000000 00000000 00000000   ................
0xf76bd010-0xf76bd06f zero suppressed
0xf76bd070 00000000 00000000 00000000 00000000   ................

Not sure I'm reading bh correctly.


On Tue, Jun 3, 2008 at 8:06 PM, Eric Sandeen <sandeen at redhat.com> wrote:
> Srinivas Murthy wrote:
>
>> <6>EXT3-fs: mounted filesystem with ordered data mode.
>> <0>Assertion failure in journal_commit_transaction() at
>> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0"
>> <0>------------[ cut here ]------------
>> <2>kernel BUG at fs/jbd/commit.c:693!
>> <0>invalid opcode: 0000 [#1]
>> <0>PREEMPT SMP
>> <0>CPU:    1
>> <0>EIP:    0060:[<c02c1149>]    Tainted: P        VLI
>> <0>EFLAGS: 00010296   (2.6.23.waas #4)
>> <0>EIP is at journal_commit_transaction+0x879/0xe00
>> <0>eax: 00000096   ebx: f76bcf00   ecx: ffffffff   edx: f7588ac0
>> <0>esi: f6c66f88   edi: c0651dcd   ebp: c6549f5c   esp: c6549ec4
>> <0>ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
>> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000)
>> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000
>> 00000000 f7f63414
>> <0>       f7f634dc 00000000 00000fcc f7435034 00000000 00000000
>> c6402000 00000000
>> <0>       f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74
>> 00000202 c70f8000
>> <0>Call Trace:
>> <0> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
>> <0> [<c0203aea>] show_stack_log_lvl+0x9a/0xc0
>> <0> [<c0203d46>] show_registers+0x1d6/0x340
>> <0> [<c020403d>] die+0x10d/0x220
>> <0> [<c02041e1>] do_trap+0x91/0xd0
>> <0> [<c0204419>] do_invalid_op+0x89/0xa0
>> <0> [<c06317e2>] error_code+0x72/0x78
>> <0> [<c02c3845>] kjournald+0xb5/0x1f0
>> <0> [<c0232a5c>] kthread+0x5c/0xa0
>> <0> [<c020388b>] kernel_thread_helper+0x7/0x1c
>> <0> =======================
>> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00
>> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff
>> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00
>> <0>EIP: [<c02c1149>] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4
>> <6>SysRq : Changing Loglevel
>> <4>Loglevel set to 7
>>
>> [0]kdb> btc
>> btc: cpu status: Currently on cpu 0
>
> Also, I'd backtrace pid 1684 (kjournald) and dump the bh, see what it
> looks like...
>
> kdb> btp 1684
> kdb> bh <whatever the address of the buffer head is>
>
> if i remember correctly...
>
> -Eric
>
>



From sandeen at redhat.com  Wed Jun  4 03:52:27 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 03 Jun 2008 22:52:27 -0500
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <df1bdeb10806032029y6ceee314x96f449118da52177@mail.gmail.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>	
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>	
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>	
	<484606A9.3070403@redhat.com>
	<df1bdeb10806032029y6ceee314x96f449118da52177@mail.gmail.com>
Message-ID: <4846117B.1010206@redhat.com>

Srinivas Murthy wrote:

> 
> [0]kdb> md 0xf76bcf00
> 0xf76bcf00 f7f63400 00701310 00000004 000001ca   .4....p.........
> 0xf76bcf10 00000000 00000000 00000000 00000000   ................
> 0xf76bcf20 00000000 c6320b98 00000000 00000000   ......2.........
> 0xf76bcf30 00000000 f7386498 f7386b28 00000001   .....d8.(k8.....

... doesn't look right ...

If you hit this often enough (and since you have kdb) you could modify
the assert to print the bh address first ....

then it'd be easy to print out, might offer some clues.

-Eric



From sebastia at l00-bugdead-prods.de  Wed Jun  4 07:55:00 2008
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Wed, 04 Jun 2008 09:55:00 +0200
Subject: problem with default mask in acls
Message-ID: <20080604075501.17AF24971F@smtp.l00-bugdead-prods.de>

Hi,

when I copy a file to a directory, using whatever tool, it seems the 
behavior of the mask is wrong.

user1 at host1:~> getfacl source/test1
# file: source/test1
# owner: user1
# group: grp1
user::rw-
group::r--
other::r--

user1 at host1:~> getfacl target/
# file: target
# owner: user1
# group: grp1
user::rwx
group::---
group:grp1:rwx
mask::rwx
other::---
default:user::rwx
default:group::---
default:group:grp1:rwx
default:mask::rwx
default:other::---

user1 at host1:~> cp source/test1 target/
user1 at host1:~> getfacl target/test1
# file: target/test1
# owner: user1
# group: grp1
user::rw-
group::---
group:grp1:rwx                 #effective:r--
mask::r--
other::---

I'd expected the effective mask of the file in the destination directory to 
be rwx. Is there anything I'm doing wrong?

I'm on a SLES10SP1 x86_64.

Linux nfspublic 2.6.16.57-0.9-xen #1 SMP Mon Jan 21 19:55:27 UTC 2008 x86_64 
x86_64 x86_64 GNU/Linux

I guess I do sth. wrong, but what?

thanks
sebastian



From jelledejong at powercraft.nl  Fri Jun  6 18:24:49 2008
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Fri, 06 Jun 2008 20:24:49 +0200
Subject: needs help, root inode gone after usb bus reset on sata disks
In-Reply-To: <20080529212048.GI8065@mit.edu>
References: <483BCCC0.5020502@powercraft.nl> <20080527124711.GI7515@mit.edu>
	<483C07EE.1060905@powercraft.nl> <483D6FC5.30109@powercraft.nl>
	<20080528232452.GO6843@mit.edu> <483E7955.7020508@powercraft.nl>
	<20080529125816.GD8065@mit.edu> <483EC138.5090200@powercraft.nl>
	<20080529200140.GF8065@mit.edu> <483F0ECC.7030505@powercraft.nl>
	<20080529212048.GI8065@mit.edu>
Message-ID: <484980F1.1040604@powercraft.nl>

Theodore Tso wrote:
> On Thu, May 29, 2008 at 10:15:08PM +0200, Jelle de Jong wrote:
>> I did the following:
>>
>> debugfs -w /dev/sda1
>> debugfs: features dir_index filetype sparse_super
>> debugfs: quit
>>
>> then i run
>>
>> e2fsck -nf /dev/sda1
>>
>> to see if it still wanted to relocate inodes. This was not the case 
>> anymore, however it still wanted to relocate the root inode...
>>
>> I then run:
>>
>> e2fsck -f /dev/sda1
>>
>> and manual answer yes to the question until i had to enter a lot of "y" 
>> (see logs) and killed the program with ctrl-c
> 
> what answers did you answer yes to?  I don't have a log of your
> "e2fsck -f /dev/sda1" run, and so I can't tell what happened.  The
> e2fsck -fy run you gave me was large, but information-free, since it
> just had pass #5 messages regarding adjusting accounting information.
> 
> If it was just deleting the root inode (because it was corrupted), and
> creating a new root inode, that doesn't explain why all of the inodes
> disappeared, unless the inode table had somehow gotten completely
> zero'ed out
> 
> At this point, what I would probably suggest is that you run 
> 
> 	e2image -r /dev/hda1 - | bzip2 > hda1.e2i.bz2
> 
> ... and put it someplace where I can download it and take a look at
> what the heck happened to your filesystem.
> 
> By the way, please look at the "script" command ("man script"); it is
> very handy for capturing a record of what an interactive session with
> a program like e2fsck.
> 

Thanks for all the info Ted,

http://www.powercraft.nl/temp/e2image-r-sda1-v0.1.1.e2i.bz2

I did some experimenting and to see if I can find some data on the disk 
by doing the below command on an unaltered backup:

e2fsck -fy /dev/sda1 > e2fsck-fy-info-sda1-v0.1.1j.txt 2>&1

However no files where found, so maybe something when wrong with the dd 
backup.

I don't now if there is a way to see if there is actual data on the disk.

So for now i am giving up on recovery the data, maybe you can get a glue 
of what the hack happened to the file system and learn something new...

The only thing i would like to now is how to backup and restore the 
filesystem. (for example i am going to setup a raid setup but this kind 
of file system crashes are not covered with a raid setup)

Thanks in advance,

Jelle



From codevana at gmail.com  Sat Jun  7 01:24:56 2008
From: codevana at gmail.com (Srinivas Murthy)
Date: Fri, 6 Jun 2008 18:24:56 -0700
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <4846117B.1010206@redhat.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>
	<484606A9.3070403@redhat.com>
	<df1bdeb10806032029y6ceee314x96f449118da52177@mail.gmail.com>
	<4846117B.1010206@redhat.com>
Message-ID: <df1bdeb10806061824h6d7d695eub8661a6e97a65d41@mail.gmail.com>

Eric,

 I got the output you asked for.

<3>journal_commit_transaction 694 c60b54d0
<4>WARNING: at fs/jbd/commit.c:695 journal_commit_transaction()
<4> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
<4> [<c0203a42>] show_trace+0x12/0x20
<4> [<c0203b66>] dump_stack+0x16/0x20
<4> [<c02c0e9d>] journal_commit_transaction+0x5cd/0xe60
<4> [<c02c38a5>] kjournald+0xb5/0x1f0
<4> [<c0232a5c>] kthread+0x5c/0xa0
[1]more> q
[1]kdb> bh 0xc60b54d0
buffer_head at 0xc60b54d0
  bno 3297 size 4096 dev 0x900005
  count 1 state 0x8029 [Uptodate Req Mapped Private]
  b_data 0xf5c80000
  b_page 0xc16b9000 b_this_page 0x00000000 b_private 0xf7fb05b0
  b_end_io 0xc02c03a0 journal_end_buffer_io_sync
[1]kdb>

What do you think?

Thanks.

On Tue, Jun 3, 2008 at 8:52 PM, Eric Sandeen <sandeen at redhat.com> wrote:
> Srinivas Murthy wrote:
>
>>
>> [0]kdb> md 0xf76bcf00
>> 0xf76bcf00 f7f63400 00701310 00000004 000001ca   .4....p.........
>> 0xf76bcf10 00000000 00000000 00000000 00000000   ................
>> 0xf76bcf20 00000000 c6320b98 00000000 00000000   ......2.........
>> 0xf76bcf30 00000000 f7386498 f7386b28 00000001   .....d8.(k8.....
>
> ... doesn't look right ...
>
> If you hit this often enough (and since you have kdb) you could modify
> the assert to print the bh address first ....
>
> then it'd be easy to print out, might offer some clues.
>
> -Eric
>



From ross at biostat.ucsf.edu  Sun Jun  8 05:30:38 2008
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Sat, 07 Jun 2008 22:30:38 -0700
Subject: spd_readdir.c and readdir_r
Message-ID: <1212903039.7158.31.camel@corn.betterworld.us>

I still haven't been able to pinpoint exactly where bacula hangs up when
LD_PRELOAD is set to use spd_readdir, but I have a suspect.  bacula-fd
gets directory entries with readdir_r, which is a function that is not
reimplemented in spd_readdir.

So when bacula calls opendir it gets the shadow version, which calls the
original open, read, and closedir functions.  It then returns its
private dir_s structure.  The (unshadowed) readdir_r then tries to work
with dir_s.

It looks as if I (or one of you gurus?) need to implement a wrapper for
readdir_r.  A quick looks suggests there may be a couple of subtleties
(the spd_readdir struct dir_s is allocated, and so thread safe, but it's
dir entry is not; and readdir_r is expecting some "real" system data
structures back and users may have problems with fake ones).

Ross



From sandeen at redhat.com  Mon Jun  9 04:03:45 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Sun, 08 Jun 2008 23:03:45 -0500
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <df1bdeb10806061824h6d7d695eub8661a6e97a65d41@mail.gmail.com>
References: <df1bdeb10806021835v71897b2es83e29a8188ca97f1@mail.gmail.com>	
	<df1bdeb10806031348y61568f2fo8abbc1fb0b9a3141@mail.gmail.com>	
	<df1bdeb10806031943l33971020re7c9a38763859bea@mail.gmail.com>	
	<484606A9.3070403@redhat.com>	
	<df1bdeb10806032029y6ceee314x96f449118da52177@mail.gmail.com>	
	<4846117B.1010206@redhat.com>
	<df1bdeb10806061824h6d7d695eub8661a6e97a65d41@mail.gmail.com>
Message-ID: <484CABA1.2040202@redhat.com>

Srinivas Murthy wrote:
> Eric,
> 
>  I got the output you asked for.
> 
> <3>journal_commit_transaction 694 c60b54d0
> <4>WARNING: at fs/jbd/commit.c:695 journal_commit_transaction()
> <4> [<c0203a1a>] show_trace_log_lvl+0x1a/0x30
> <4> [<c0203a42>] show_trace+0x12/0x20
> <4> [<c0203b66>] dump_stack+0x16/0x20
> <4> [<c02c0e9d>] journal_commit_transaction+0x5cd/0xe60
> <4> [<c02c38a5>] kjournald+0xb5/0x1f0
> <4> [<c0232a5c>] kthread+0x5c/0xa0
> [1]more> q
> [1]kdb> bh 0xc60b54d0
> buffer_head at 0xc60b54d0
>   bno 3297 size 4096 dev 0x900005
>   count 1 state 0x8029 [Uptodate Req Mapped Private]
>   b_data 0xf5c80000
>   b_page 0xc16b9000 b_this_page 0x00000000 b_private 0xf7fb05b0
>   b_end_io 0xc02c03a0 journal_end_buffer_io_sync
> [1]kdb>
> 
> What do you think?

I think that it looks  more like a buffer head accounting problem than a
corruption problem; the rest of the buffer head looks sane...

Think you could narrow down a test case for this problem?

-Eric



From ross at biostat.ucsf.edu  Mon Jun  9 04:26:28 2008
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Sun, 08 Jun 2008 21:26:28 -0700
Subject: spd_readdir.c and readdir_r [new version]
In-Reply-To: <1212903039.7158.31.camel@corn.betterworld.us>
References: <1212903039.7158.31.camel@corn.betterworld.us>
Message-ID: <1212985588.32113.13.camel@corn.betterworld.us>

I've attached a modified version of Ted's spd_readdir.c that adds
support for readdir_r and readdir64_r.  It appears to be working
(readdir64_r is the only new routine getting exercised), but should be
taken as a rough cut.  I also added a Makefile and a test program.

It also looks as if this is giving me a huge speed improvement (at least
x4) of my backups of my ext3 partitions.  I'll try to report after a
full and incremental backup complete, which will be a couple of days.

Originally I tried taking the threading code from the system
implementations of the original readdir_r.  When that didn't work (since
it was designed to be part of a libc build) I switched to pthreads.  I
don't know if recursive locking is essential; I activated it at one
point while trying to get things to work.

For big directories this code could use quite a lot of memory.  It
allows an optional max size, beyond which it reverts to the original
system calls.  I wonder if instead taking large directories in chunks
would preserve much of the speedup while putting a bound on memory use.

Ross Boylan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: RBspd_dir.tgz
Type: application/x-compressed-tar
Size: 889 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080608/edf71b04/attachment.bin>

From santi at usansolo.net  Mon Jun  9 17:33:48 2008
From: santi at usansolo.net (santi at usansolo.net)
Date: Mon, 09 Jun 2008 19:33:48 +0200
Subject: 2GB memory limit running fsck on a +6TB device
Message-ID: <13126f2f5661d30187551469b3793fa7@usansolo.net>


Dear Srs,

That's the scenario: +6TB device on a 3ware 9550SX RAID controller, running
Debian Etch 32bits, with 2.6.25.4 kernel, and defaults e2fsprogs version,
"1.39+1.40-WIP-2006.11.14+dfsg-2etch1".

Running "tune2fs" returns that filesystem is in EXT3_ERROR_FS state, "clean
with errors":

# tune2fs -l /dev/sda4   
tune2fs 1.40.10 (21-May-2008)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          7701b70e-f776-417b-bf31-3693dba56f86
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index filetype needs_recovery
sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              792576000
Block count:              1585146848

It's a backup storage server, with more than 113 million files, this's the
output of "df -i":

# df -i /backup/
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda4            792576000 113385959 679190041   15% /backup


Running fsck.ext3 or  fsck.ext2 I get:

# fsck.ext3 /dev/sda4
e2fsck 1.40.10 (21-May-2008)
Adding dirhash hint to filesystem.

/dev/sda4 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error allocating directory block array: Memory allocation failed
e2fsck: aborted

With some straces:

================================================================================
gettimeofday({1213032482, 940738}, NULL) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 16001}, ...}) = 0
write(1, "Pass 1: Checking ", 17Pass 1: Checking )       = 17
write(1, "inode", 5inode)                    = 5
write(1, "s, ", 3s, )                      = 3
write(1, "block", 5block)                    = 5
write(1, "s, and sizes\n", 13s, and sizes
)          = 13
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x404fa000
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x46376000
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x4c1f2000
mmap2(NULL, 198148096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x5206e000
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x5dd66000
mmap2(NULL, 748892160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x63be2000
mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x77488000)                         = 0x80ab000
mmap2(NULL, 1866375168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,
-1, 0) = 0x90615000
munmap(0x90615000, 962560)              = 0
munmap(0x90800000, 86016)               = 0
mprotect(0x90700000, 135168, PROT_READ|PROT_WRITE) = 0
mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
================================================================================

Appears that fsck is trying to use more than 2GB memory to store inode
table relationship. System has 4GB of physical RAM and 4GB of swap, is
there anyway to limit the memory used by fsck or any solution to check this
filesystem? Running fsck with a 64bit LiveCD will solve the problem?

I also tried with last e2fsprogs stable release 1.40.10, getting the same
error :-/

Regards,

--
Santi Saez



From tytso at mit.edu  Mon Jun  9 21:33:20 2008
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 9 Jun 2008 17:33:20 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <13126f2f5661d30187551469b3793fa7@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
Message-ID: <20080609213320.GB26759@mit.edu>

On Mon, Jun 09, 2008 at 07:33:48PM +0200, santi at usansolo.net wrote:
> It's a backup storage server, with more than 113 million files, this's the
> output of "df -i":
> 
> Appears that fsck is trying to use more than 2GB memory to store inode
> table relationship. System has 4GB of physical RAM and 4GB of swap, is
> there anyway to limit the memory used by fsck or any solution to check this
> filesystem? Running fsck with a 64bit LiveCD will solve the problem?

Yes, running with a 64-bit Live CD is one way to solve the problem.

If you are using e2fsprogs 1.40.10, there is another solution that may
help.  Create an /etc/e2fsck.conf file with the following contents:

[scratch_files]
	directory = /var/cache/e2fsck

...and then make sure /var/cache/e2fsck exists by running the command
"mkdir /var/cache/e2fsck".

This will cause e2fsck to store certain data structures which grow
large with backup servers that have a vast number of hard-linked files
in /var/cache/e2fsck instead of in memory.  This will slow down e2fsck
by approximately 25%, but for large filesystems where you couldn't
otherwise get e2fsck to complete because you're exhausting the 2GB VM
per-process limitation for 32-bit systems, it should allow you to run
through to completion.

					- Ted



From adilger at sun.com  Mon Jun  9 21:50:32 2008
From: adilger at sun.com (Andreas Dilger)
Date: Mon, 09 Jun 2008 15:50:32 -0600
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <13126f2f5661d30187551469b3793fa7@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
Message-ID: <20080609215031.GC3726@webber.adilger.int>

On Jun 09, 2008  19:33 +0200, santi at usansolo.net wrote:
> That's the scenario: +6TB device on a 3ware 9550SX RAID controller, running
> Debian Etch 32bits, with 2.6.25.4 kernel, and defaults e2fsprogs version,
> "1.39+1.40-WIP-2006.11.14+dfsg-2etch1".
> 
> Running "tune2fs" returns that filesystem is in EXT3_ERROR_FS state, "clean
> with errors":
> 
> # tune2fs -l /dev/sda4   
> tune2fs 1.40.10 (21-May-2008)
> Filesystem volume name:   <none>
> Last mounted on:          <not available>
> Filesystem UUID:          7701b70e-f776-417b-bf31-3693dba56f86
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal dir_index filetype needs_recovery
> sparse_super large_file
> Default mount options:    (none)
> Filesystem state:         clean with errors
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              792576000
> Block count:              1585146848
> 
> It's a backup storage server, with more than 113 million files, this's the
> output of "df -i":
> 
> # df -i /backup/
> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> /dev/sda4            792576000 113385959 679190041   15% /backup
> 
> 
> Running fsck.ext3 or  fsck.ext2 I get:
> 
> # fsck.ext3 /dev/sda4
> e2fsck 1.40.10 (21-May-2008)
> Adding dirhash hint to filesystem.
> 
> /dev/sda4 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes

I recall that e2fsck allocates on the order of 3 * block_count / 8 bytes,
and 5 * inode_count / 8 bytes, so in your case this is about:

(5 * 1585146848 + 3 * 792576000) / 8 = 1287932780 bytes = 1.2GB

at a minimum, but my estimates might be incorrect.

> mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x404fa000

Judging by the return values of these functions, this is a 32-bit system,
and it is entirely possible that you are exceeding the per-process memory
allocation limit.

> mmap2(NULL, 748892160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x63be2000
> mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1, 0) = -1 ENOMEM (Cannot allocate memory)

Hmm, it seems a bit excessive to allocate 1.8GB in a single chunk.

> Error allocating directory block array: Memory allocation failed
> e2fsck: aborted

This message is a bit tricky to nail down because it doesn't exist anywhere
in the code directly.  It is encoded into "e2fsck abbreviations", and
the expansion that is normally in the corresponding comment is different.
It is PR_1_ALLOCATE_DBCOUNT returned from the call chain:
	ext2fs_init_dblist->
	  make_dblist->
	    ext2fs_get_num_dirs()

which is counting the number of directories in the filesystem, and allocating
two 12-byte array element for each one.  This implies you have 77M directories
in your filesystem, or an average of only 10 files per directory?

> Appears that fsck is trying to use more than 2GB memory to store inode
> table relationship. System has 4GB of physical RAM and 4GB of swap, is
> there anyway to limit the memory used by fsck or any solution to check this
> filesystem?

I don't know offhand how important the dblist structure is, so I'm not
sure if there is a way to reduce the memory usage for it.  I believe
that in low-memory situations it is possible to use tdb in newer versions
of e2fsck for the dblist, but I don't know much of the details.

> Running fsck with a 64bit LiveCD will solve the problem?

Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
for e2fsck and be able to check the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From carlo at alinoe.com  Mon Jun  9 22:08:56 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Tue, 10 Jun 2008 00:08:56 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609215031.GC3726@webber.adilger.int>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609215031.GC3726@webber.adilger.int>
Message-ID: <20080609220856.GA21530@alinoe.com>

On Mon, Jun 09, 2008 at 03:50:32PM -0600, Andreas Dilger wrote:
> > Running fsck with a 64bit LiveCD will solve the problem?
> 
> Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
> for e2fsck and be able to check the filesystem.

We had a simular problem with ext3grep.
You have to realize that every mmap uses memory
address space, even if it's a map to disk.
Therefore, on a 32bit machine, if the total
of all normal allocations plus all simultaneous
mmap's exceeds 4GB then you "run out of memory",
even if -say- only 1 GB is really allocated
and >3GB of the disk is mmap-ed.

In that case a 64bit machine would solve the
problem because then all ram (2 GB I read in
the Subject) can be used for normal allocations
while any disk mmap has cazillion address space
left for itself.

-- 
Carlo Wood <carlo at alinoe.com>



From tytso at mit.edu  Mon Jun  9 22:37:36 2008
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 9 Jun 2008 18:37:36 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609215031.GC3726@webber.adilger.int>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609215031.GC3726@webber.adilger.int>
Message-ID: <20080609223736.GA7069@mit.edu>

On Mon, Jun 09, 2008 at 03:50:32PM -0600, Andreas Dilger wrote:
> This message is a bit tricky to nail down because it doesn't exist anywhere
> in the code directly.  It is encoded into "e2fsck abbreviations", and
> the expansion that is normally in the corresponding comment is different.
> It is PR_1_ALLOCATE_DBCOUNT returned from the call chain:
> 	ext2fs_init_dblist->
> 	  make_dblist->
> 	    ext2fs_get_num_dirs()
> 
> which is counting the number of directories in the filesystem, and allocating
> two 12-byte array element for each one.  This implies you have 77M directories
> in your filesystem, or an average of only 10 files per directory?

There are a number of backup solutions that use hardlinks to conserve
space between increment snapshots.  So yeah, with these worklodas
you'll see something like 80-85M inodes, of which 77M-odd will be
directories.  When you combine the vast number of directories used by
these filesystems, and the fact that e2fsck tries to opimize memory
use by observing that on most normal filesystems, most files have
n_link count of 1, which is NOT true on these filesystems used for
backups, e2fsck's tricks to optimize for speed by caching information
to avoid re-reading them from disk end up costing a large amount of
memory.

> I don't know offhand how important the dblist structure is, so I'm not
> sure if there is a way to reduce the memory usage for it.  I believe
> that in low-memory situations it is possible to use tdb in newer versions
> of e2fsck for the dblist, but I don't know much of the details.

Yep, please see [scratch_files] section in e2fsck.conf.  It is
described in the e2fsck.conf(5) man page.

					- Ted



From adilger at sun.com  Mon Jun  9 22:57:59 2008
From: adilger at sun.com (Andreas Dilger)
Date: Mon, 09 Jun 2008 16:57:59 -0600
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609223736.GA7069@mit.edu>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609215031.GC3726@webber.adilger.int>
	<20080609223736.GA7069@mit.edu>
Message-ID: <20080609225759.GG3726@webber.adilger.int>

On Jun 09, 2008  18:37 -0400, Theodore Ts'o wrote:
> On Mon, Jun 09, 2008 at 03:50:32PM -0600, Andreas Dilger wrote:
> > I don't know offhand how important the dblist structure is, so I'm not
> > sure if there is a way to reduce the memory usage for it.  I believe
> > that in low-memory situations it is possible to use tdb in newer versions
> > of e2fsck for the dblist, but I don't know much of the details.
> 
> Yep, please see [scratch_files] section in e2fsck.conf.  It is
> described in the e2fsck.conf(5) man page.

Hmm, maybe if the ext2fs_init_dblist() function returns PR_1_ALLOCATE_DBCOUNT
this should be a user-fixable problem that asks if the user wants to use
an on-disk tdb file in /var/tmp, and if that is a "no" then point them at
the right section in /etc/e2fsck.conf?

I don't think it is reasonable to default to using /tmp, because it might
be a RAM-backed filesystem, and I suspect in most cases the root filesystem
will not run out of memory in this way...  Even if it fails because /var/tmp
is read-only, or too small, it is no worse off than it is today.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From gregt at maths.otago.ac.nz  Tue Jun 10 03:36:52 2008
From: gregt at maths.otago.ac.nz (Greg Trounson)
Date: Tue, 10 Jun 2008 15:36:52 +1200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609215031.GC3726@webber.adilger.int>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609215031.GC3726@webber.adilger.int>
Message-ID: <484DF6D4.4050700@maths.otago.ac.nz>

Andreas Dilger wrote:
> On Jun 09, 2008  19:33 +0200, santi at usansolo.net wrote:
...
>> Running fsck with a 64bit LiveCD will solve the problem?
> 
> Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
> for e2fsck and be able to check the filesystem.

Couldn't you achieve the same thing just by enabling PAE on your 32-bit kernel?

Greg



From tytso at mit.edu  Tue Jun 10 13:18:28 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 10 Jun 2008 09:18:28 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <484DF6D4.4050700@maths.otago.ac.nz>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609215031.GC3726@webber.adilger.int>
	<484DF6D4.4050700@maths.otago.ac.nz>
Message-ID: <20080610131828.GC18768@mit.edu>

On Tue, Jun 10, 2008 at 03:36:52PM +1200, Greg Trounson wrote:
> Andreas Dilger wrote:
>> On Jun 09, 2008  19:33 +0200, santi at usansolo.net wrote:
> ...
>>> Running fsck with a 64bit LiveCD will solve the problem?
>> Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
>> for e2fsck and be able to check the filesystem.
>
> Couldn't you achieve the same thing just by enabling PAE on your 32-bit 
> kernel?

No, that doesn't increase the amount address space available to the
user process, which is the limitation here.  You can have 16 GB of
physical memory, but 2**32 is still 4GB, and the kernel needs address
space, so that means userspace will have at most 3GB of space to
itself.

						- Ted



From santi at usansolo.net  Tue Jun 10 15:34:35 2008
From: santi at usansolo.net (santi at usansolo.net)
Date: Tue, 10 Jun 2008 17:34:35 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609213320.GB26759@mit.edu>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
Message-ID: <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>



On Mon, 9 Jun 2008 17:33:20 -0400, Theodore Tso <tytso at mit.edu> wrote:
 
> If you are using e2fsprogs 1.40.10, there is another solution that may
> help.  Create an /etc/e2fsck.conf file with the following contents:
> 
> [scratch_files]
> 	directory = /var/cache/e2fsck

(..)

> This will cause e2fsck to store certain data structures which grow
> large with backup servers that have a vast number of hard-linked files
> in /var/cache/e2fsck instead of in memory.  This will slow down e2fsck
> by approximately 25%, but for large filesystems where you couldn't
> otherwise get e2fsck to complete because you're exhausting the 2GB VM
> per-process limitation for 32-bit systems, it should allow you to run
> through to completion.

I'm trying with fsck.ext3 v1.40.8, backported from Lenny's package to Etch,
instead of v1.40.10 because we have the same sceneario in all backup
servers running BackupPC, and package must be distributed. If needed, we
can make test with the latest version ;-)

fsck.ext3 started 4 hours ago, and still is in "Pass 1: Checking inodes,
blocks, and sizes", that's normal knowing that the filesystem has +113
million inodes?

I will send more info as requested Ted in "Call for testers w/ using
BackupPC" [1], but now this is the scenario:

- fsck.ext3 is using more than 2GB of memory and no swap, server has 4GB
phisycal RAM + 2GB of swap, this's the output of "pmap -d"  with memory
map:

# pmap -d 7014
7014:   fsck.ext3 -y /dev/sda4
Address   Kbytes Mode  Offset           Device    Mapping
(..)
242fd000 1834768 rw--- 00000000242fd000 000:00000   [ anon ]
942c2000  582604 rw--- 00000000942c2000 000:00000   [ anon ]
(..)

All the output is available at: http://pastebin.com/f67115de2


- Files in "/var/cache/e2fsck" appears that grow very slow, I think, 300Kb
per hour aprox, now that's the size:

# ls -lh /var/cache/e2fsck/
total 170M
-rw------- 1 root root 76M 2008-06-10 17:24
7701b70e-f776-417b-bf31-3693dba56f86-dirinfo-VkmFXP
-rw------- 1 root root 95M 2008-06-10 17:24
7701b70e-f776-417b-bf31-3693dba56f86-icount-YO08bu


- fsck is using 100% of one CPU, it's dual processor motherboard, output of
strace available at:

http://pastebin.com/f68389cce


- More info:
   * Kernel 2.6.25.4, i686 arch on a Debian Etch box.
   * Storage: 3ware 9550SXU-16ML, 5.91 TB in a RAID-5 with 14 500GB SATA
disks (ST3500630AS), 64kB stripe size (array is in optimal state)


Thanks all for the advices :-)

[1] http://www.redhat.com/archives/ext3-users/2007-April/msg00017.html

--
Santi Saez



From tytso at mit.edu  Tue Jun 10 18:38:55 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 10 Jun 2008 14:38:55 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
Message-ID: <20080610183855.GB8397@mit.edu>

On Tue, Jun 10, 2008 at 05:34:35PM +0200, santi at usansolo.net wrote:
> 
> fsck.ext3 started 4 hours ago, and still is in "Pass 1: Checking inodes,
> blocks, and sizes", that's normal knowing that the filesystem has +113
> million inodes?
> 

It depends on a lot of things; how big are your files on average, the
speed of your hard drive, and whether /var/cache/e2fsck is on the same
disk as the partition which you are checking, or on a separate spindle
(guess which is better :-).

It's always a good idea when running e2fsck (aka fsck.ext3) directly
and/or on a terminal/console to include as command-line options "-C
0".  This will display a progress bar, so you can gauge how it is
doing.  (0 through 70% is pass 1, which requires scanning the inode
table and following all of the indirect blocks.)

						- Ted



From santi at usansolo.net  Tue Jun 10 22:24:27 2008
From: santi at usansolo.net (Santi Saez)
Date: Wed, 11 Jun 2008 00:24:27 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080610183855.GB8397@mit.edu>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu>
Message-ID: <484EFF1B.1010104@usansolo.net>

Theodore Tso escribi?:
> It's always a good idea when running e2fsck (aka fsck.ext3) directly
> and/or on a terminal/console to include as command-line options "-C
> 0".  This will display a progress bar, so you can gauge how it is
> doing.  (0 through 70% is pass 1, which requires scanning the inode
> table and following all of the indirect blocks.)
>   

Thanks for the tip! :-)

'/var/cache/e2fsck' is in the _same_ disk, perhaps mounting via iSCSI, 
NFS, etc.. this directory will improve, we will work with this in other 
test.

I have enabled progress bar sending SIGUSR1 signal to the process, and 
it's still on 2% ;-(

"scratch_files" directory size is now 251M, it has grown 81MB in the 
last 7 hours:

# ls -lh /var/cache/e2fsck/
total 251M
-rw------- 1 root root 112M 2008-06-11 00:09 
7701b70e-f776-417b-bf31-3693dba56f86-dirinfo-VkmFXP
-rw------- 1 root root 139M 2008-06-11 00:09 
7701b70e-f776-417b-bf31-3693dba56f86-icount-YO08bu

strace's output is the same, and also memory usage is the same.

I will let the process more time.. but I think it will take too much 
time to complete, at least to finish the pass 1, perhaps more than 50 
hours? According that now is only on 2% of the process + take 12 hours 
to complete, and pass 1 is from 0% through 70%.. is there any other 
solution to solve this?

ext4 will solve this problem? I have not tested ext4 already, but I have 
read that it will improve fast filesytem checking...

Regards,

--
Santi Saez



From tytso at mit.edu  Tue Jun 10 23:01:24 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 10 Jun 2008 19:01:24 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <484EFF1B.1010104@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
Message-ID: <20080610230124.GH8397@mit.edu>

On Wed, Jun 11, 2008 at 12:24:27AM +0200, Santi Saez wrote:
>
> '/var/cache/e2fsck' is in the _same_ disk, perhaps mounting via iSCSI, NFS, 
> etc.. this directory will improve, we will work with this in other test.
>
> I have enabled progress bar sending SIGUSR1 signal to the process, and it's 
> still on 2% ;-(
>
> "scratch_files" directory size is now 251M, it has grown 81MB in the last 7 
> hours:

hmm.....  can you send me the output of dumpe2fs /dev/sdXX?  You can
run that command while e2fsck is running, since it's read-only.  I'm
curious exactly how big the filesystem is, and how many directories
are in the first part of the filesystem.

How big is the filesystem(s) that you are backing up via BackupPC, in
terms of size (megabytes) and files (number of inodes)?  And how many
days of incremental backups are you keeping?  Also, how often do files
change?  Can you give a rough estimate of how many files get modified
per backup cycle?

Thanks,

						- Ted



From santi at usansolo.net  Tue Jun 10 23:48:35 2008
From: santi at usansolo.net (Santi Saez)
Date: Wed, 11 Jun 2008 01:48:35 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080610230124.GH8397@mit.edu>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu>
Message-ID: <484F12D3.2050201@usansolo.net>

Theodore Tso escribi?:
> hmm.....  can you send me the output of dumpe2fs /dev/sdXX?  You can
> run that command while e2fsck is running, since it's read-only.  I'm
> curious exactly how big the filesystem is, and how many directories
> are in the first part of the filesystem.
>   
Upsss... dumpe2fs takes about 3 minutes to complete and generates about 
133MB output file:

dumpe2fs 1.40.8 (13-Mar-2008)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          7701b70e-f776-417b-bf31-3693dba56f86
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index filetype sparse_super 
large_file
Default mount options:    (none)
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              792576000
Block count:              1585146848
Reserved block count:     0
Free blocks:              913341561
Free inodes:              678201512
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Mon Nov 13 10:12:49 2006
Last mount time:          Mon Jun  9 19:37:12 2008
Last write time:          Tue Jun 10 12:18:25 2008
Mount count:              37
Maximum mount count:      -1
Last checked:             Mon Nov 13 10:12:49 2006
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      afabe3f6-4405-44f4-934b-76c23945db7b
Journal backup:           inode blocks
Journal size:             32M

Some example output from group 0 to 5 is available at:

http://pastebin.com/f5341d121

> How big is the filesystem(s) that you are backing up via BackupPC, in
> terms of size (megabytes) and files (number of inodes)?  And how many
> days of incremental backups are you keeping?  Also, how often do files
> change?  Can you give a rough estimate of how many files get modified
> per backup cycle?
>   

Where are backing up several servers, near about 15 in this case, with 
60-80GB data size to backup in each server and +2-3 millon inodes, with 
15 day incrementals. I think near about 2-3% of the files changes each 
day, but I will ask for more info to the backup administrator.

I have found and old doc with some build info for this server, the 
partition was formated with:

    # mkfs.ext3 -b 4096 -j -m 0 -O dir_index /dev/sda4
    # tune2fs -c 0 -i 0 /dev/sda4
    # mount -o data=writeback,noatime,nodiratime,commit=60 /dev/sda4 /backup

I'm going to fetch more info about BackupPC and backup cycles, thanks Ted!!

Regards,

--
Santi Saez



From tytso at mit.edu  Wed Jun 11 02:18:00 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 10 Jun 2008 22:18:00 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <484F12D3.2050201@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net>
Message-ID: <20080611021759.GI8397@mit.edu>

On Wed, Jun 11, 2008 at 01:48:35AM +0200, Santi Saez wrote:
> Theodore Tso escribi?:
>> hmm.....  can you send me the output of dumpe2fs /dev/sdXX?  You can
>> run that command while e2fsck is running, since it's read-only.  I'm
>> curious exactly how big the filesystem is, and how many directories
>> are in the first part of the filesystem.
>>   
> Upsss... dumpe2fs takes about 3 minutes to complete and generates about 
> 133MB output file:

True, but it compresses well.  :-) And the aside from the first part
of the dumpe2fs, the part that I was most interested could have been
summarized by simply doing a "grep directories dumpe2fs.out".  

But simply looking at your dumpe2fs, and take an average from the
first 6 block groups which you included in the pastebin, I can
extrapolate and guess that you have about 63 million directories, out
of approximately 114 million total inodes (so about 51 million regular
files, nearly all of which have hard link counts > 1).  Unfortunately,
BackupPC blows out of the water all of our memory reduction
hueristics.  I estimate you need something like 2.6GB to 3GB of memory
just for these data structures alone.  (Not to mention 94 MB for each
inode bitmap, and 188 MB for each block bitmap.)  The good news is
that 4GB of memory should do you --- just.  (I'd probably put in a bit
more physical memory just to be on the safe side, or enable swap
before running e2fsck).  The bad news is you really, REALLY need a
64-bit kernel on your system.

Because /var/cache/e2fsck is on the same disk spindle as the
filesystem you are checking, you're probably getting killed on seeks.
Moving /var/cache/e2fsck to another disk partition will help (or
better yet, battery backed memory device), but the best thing you can
do is get a 64-bit kernel and not need to use the auxiliary storage in
the first place.  

As far as what to advice to give you, why are you running e2fsck?  Was
this an advisory thing caused by the mount count and/or length of time
between filesystem checks?  Or do you have real reason to believe the
filesystem may be corrupt?

						- Ted



From ext3 at kalucki.com  Wed Jun 11 05:18:46 2008
From: ext3 at kalucki.com (John Kalucki)
Date: Tue, 10 Jun 2008 22:18:46 -0700
Subject: Poor Performance WhenNumber of Files > 1M
Message-ID: <484F6036.8020900@kalucki.com>

I am seeing similar problems to Sean McCauliff (2007-08-02) using ext3. 
I have a simple test that times file creations in a hashed directory 
structure. File creation time inexorably increases as the number of 
files in the filesystem increases. Altering variables can change the 
absolute performance, but I always see the steady performance degradation.

All of the following have no material effect on the steady drop in 
performance:

File length (1k, 4k, 16k)
Directory depth (5, 10, 15)
Average & Max files per directory (10, 20, 100)
Single or multi-threaded test
Moving test directory to a new name on same filesystem, restarting test.
Directory hash
RAID10 vs. simple disk
Linux version (RHE, Ubuntu)
System memory (32gig, 2gig)
Syncing after each close
Free space
Partition Age (old, perhaps fragmented, a bit dirty, new fs)

Performance seems to always map directly to the number of files in the 
ext3 filesystem.

After some initial run-fast time, perhaps once dirty pages begin to be 
written aggressively, for every 5,000 files added, my files created per 
second tends to drop by about one. So, depending on the variables, say 
with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop, 
then more slowly drop to ~300 files/sec at perhaps 1 million files, then 
see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc.

As you'd expect, there isn't much CPU utilization, other than iowait, 
and some kjournald activity.

Is this a known limitation of ext3? Is expecting to write to 
O(10^6)-O(10^7) files in something approaching constant time expecting 
too much from a filesystem? What, exactly, am I stressing to cause this 
unbounded performance degradation?

Thanks,
-John Kalucki
ext3 at kalucki.com




----

    Hi all,

    I plan on having about 100M files totaling about 8.5TiBytes.   To see
    how ext3 would perform with large numbers of files I've written a test
    program which creates a configurable number of files into a 
configurable
    number of directories, reads from those files, lists them and then
    deletes them.  Even up to 1M files ext3 seems to perform well and scale
    linearly; the time to execute the program on 1M files is about double
    the time it takes it to execute on .5M files.  But past 1M files it
    seems to have n^2 scalability.  Test details appear below.

    Looking at the various options for ext3 nothing jumps out as the 
obvious
    one to use to improve performance.

    Any recommendations?

    Thanks!
    Sean





From sandeen at redhat.com  Wed Jun 11 05:33:20 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 11 Jun 2008 00:33:20 -0500
Subject: Poor Performance WhenNumber of Files > 1M
In-Reply-To: <484F6036.8020900@kalucki.com>
References: <484F6036.8020900@kalucki.com>
Message-ID: <484F63A0.50606@redhat.com>

John Kalucki wrote:

> Performance seems to always map directly to the number of files in the 
> ext3 filesystem.
> 
> After some initial run-fast time, perhaps once dirty pages begin to be 
> written aggressively, for every 5,000 files added, my files created per 
> second tends to drop by about one. So, depending on the variables, say 
> with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop, 
> then more slowly drop to ~300 files/sec at perhaps 1 million files, then 
> see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc.
> 
> As you'd expect, there isn't much CPU utilization, other than iowait, 
> and some kjournald activity.
> 
> Is this a known limitation of ext3? Is expecting to write to 
> O(10^6)-O(10^7) files in something approaching constant time expecting 
> too much from a filesystem? What, exactly, am I stressing to cause this 
> unbounded performance degradation?

I think this is a linear search through the block groups for the new
inode allocation, which always starts at the parent directory's block
group; and starts over from there each time.  See find_group_other().

So if the parent's group is full and so are the next 1000 block groups,
it will search 1000 groups and find space in the 1001st.  On the next
inode allocation it will re-search(!) those 1000 groups, and again find
space in the 1001st.  And so on.  Until the 1001st is full, and then
it'll search 1001 groups and find space in the 1002nd... etc (If I'm
remembering/reading correctly, but this does jive with what you see.).

I've toyed  with keeping track (in the parent's inode) where the last
successful child allocation happened, and start the search there.  I'm a
bit leery of how this might age, though... plus I'm not sure if it
should be on-disk or just in memory.... But this behavior clearly needs
some help.  I should probably just get it sent out for comment.

-Eric



From santi at usansolo.net  Wed Jun 11 08:14:45 2008
From: santi at usansolo.net (santi at usansolo.net)
Date: Wed, 11 Jun 2008 10:14:45 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080611021759.GI8397@mit.edu>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net>
	<20080611021759.GI8397@mit.edu>
Message-ID: <00bf4cac93645dc74c04229696a20f11@usansolo.net>



On Tue, 10 Jun 2008 22:18:00 -0400, Theodore Tso <tytso at mit.edu> wrote:

> True, but it compresses well.  :-) And the aside from the first part
> of the dumpe2fs, the part that I was most interested could have been
> summarized by simply doing a "grep directories dumpe2fs.out".

:D

"grep directories" is available at:

http://santi.usansolo.net/tmp/dumpe2fs_directories.txt.gz (317K)

Full "dumpe2fs" output compressed is 34M and available at:

http://santi.usansolo.net/tmp/dumpe2fs.txt.gz



> But simply looking at your dumpe2fs, and take an average from the
> first 6 block groups which you included in the pastebin, I can
> extrapolate and guess that you have about 63 million directories, out
> of approximately 114 million total inodes (so about 51 million regular
> files, nearly all of which have hard link counts > 1).

# grep directories dumpe2fs.txt | awk '{sum += $7} END {print sum}'
78283294



> BackupPC blows out of the water all of our memory reduction
> hueristics.  I estimate you need something like 2.6GB to 3GB of memory
> just for these data structures alone.  (Not to mention 94 MB for each
> inode bitmap, and 188 MB for each block bitmap.)  The good news is
> that 4GB of memory should do you --- just.  (I'd probably put in a bit
> more physical memory just to be on the safe side, or enable swap
> before running e2fsck).  The bad news is you really, REALLY need a
> 64-bit kernel on your system.

Unfortunately, I have killed the process, in 21 hours only 2.5% of the fsck
is completed ;-(

'scratch_files' directory has grown to 311M

===================================================================
# time fsck -y /dev/sda4
fsck 1.40.8 (13-Mar-2008)
e2fsck 1.40.8 (13-Mar-2008)
Adding dirhash hint to filesystem.

/dev/sda4 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes

/dev/sda4: e2fsck canceled.                                                
                                                       
/dev/sda4: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda4: ********** WARNING: Filesystem still has errors **********

real    1303m19.306s
user    1079m58.898s
sys     217m10.130s
===================================================================
 
> Because /var/cache/e2fsck is on the same disk spindle as the
> filesystem you are checking, you're probably getting killed on seeks.
> Moving /var/cache/e2fsck to another disk partition will help (or
> better yet, battery backed memory device), but the best thing you can
> do is get a 64-bit kernel and not need to use the auxiliary storage in
> the first place.

I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o
size=2048M", but appears that will take a long time to complete too.. so
the next test will be with a 64-bit LiveCD :)



> As far as what to advice to give you, why are you running e2fsck?  Was
> this an advisory thing caused by the mount count and/or length of time
> between filesystem checks?  Or do you have real reason to believe the
> filesystem may be corrupt?

No, it's not related with mount count and/or length of time between
filesystem checks. When booting we get this error/warning:

EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
EXT3 FS on sda4, internal journal
EXT3-fs: mounted filesystem with writeback data mode.

And "tune2fs" returns that ext3 is in "clean with errors" state.. so, we
think that completing a full fsck process is a good idea; what means in
this case "clean with errors" state, running a fsck is not needed?

Thanks again for all the help and advices!!

--
Santi Saez



From santi at usansolo.net  Wed Jun 11 11:51:17 2008
From: santi at usansolo.net (santi at usansolo.net)
Date: Wed, 11 Jun 2008 13:51:17 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <00bf4cac93645dc74c04229696a20f11@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net>
	<20080611021759.GI8397@mit.edu>
	<00bf4cac93645dc74c04229696a20f11@usansolo.net>
Message-ID: <d907e7ae73fd64adb02e0c2edb878e4e@usansolo.net>


On Wed, 11 Jun 2008 10:14:45 +0200, <santi at usansolo.net> wrote:

>> Because /var/cache/e2fsck is on the same disk spindle as the
>> filesystem you are checking, you're probably getting killed on seeks.
>> Moving /var/cache/e2fsck to another disk partition will help (or
>> better yet, battery backed memory device), but the best thing you can
>> do is get a 64-bit kernel and not need to use the auxiliary storage in
>> the first place.
> 
> I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o
> size=2048M", but appears that will take a long time to complete too.. so
> the next test will be with a 64-bit LiveCD :)

Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. 3
times faster ;-)

Making some fast test with e2fsck v1.40.10 appears that is a bit faster
than v1.40.8, last version improves this feature? Anyway, finally I had to
cancel the process..

# ./e2fsck -nfvttC0 /dev/sda4
e2fsck 1.40.10 (21-May-2008)
Pass 1: Checking inodes, blocks, and sizes
/dev/sda4: e2fsck canceled.                                                
     

/dev/sda4: ********** WARNING: Filesystem still has errors **********

Memory used: 260k/581088k (183k/78k)

Regards,

--
Santi Saez



From adilger at sun.com  Wed Jun 11 14:59:08 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 11 Jun 2008 08:59:08 -0600
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <d907e7ae73fd64adb02e0c2edb878e4e@usansolo.net>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net>
	<20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net>
	<20080611021759.GI8397@mit.edu>
	<00bf4cac93645dc74c04229696a20f11@usansolo.net>
	<d907e7ae73fd64adb02e0c2edb878e4e@usansolo.net>
Message-ID: <20080611145908.GP3726@webber.adilger.int>

On Jun 11, 2008  13:51 +0200, santi at usansolo.net wrote:
> On Wed, 11 Jun 2008 10:14:45 +0200, <santi at usansolo.net> wrote:
> 
> >> Because /var/cache/e2fsck is on the same disk spindle as the
> >> filesystem you are checking, you're probably getting killed on seeks.
> >> Moving /var/cache/e2fsck to another disk partition will help (or
> >> better yet, battery backed memory device), but the best thing you can
> >> do is get a 64-bit kernel and not need to use the auxiliary storage in
> >> the first place.
> > 
> > I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o
> > size=2048M", but appears that will take a long time to complete too.. so
> > the next test will be with a 64-bit LiveCD :)
> 
> Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. 3
> times faster ;-)

...but, isn't the problem that you don't have enough RAM?  Using tdb+ramfs
isn't going to be faster than using the RAM directly.

I suspect that the only way you are going to check this filesystem efficiently
is to boot a 64-bit kernel (even just from a rescue disk), set up some swap
just in case, and run e2fsck from there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From bryan at kadzban.is-a-geek.net  Wed Jun 11 16:49:04 2008
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Wed, 11 Jun 2008 12:49:04 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080611145908.GP3726@webber.adilger.int>
References: <20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net>
	<20080611021759.GI8397@mit.edu>
	<00bf4cac93645dc74c04229696a20f11@usansolo.net>
	<d907e7ae73fd64adb02e0c2edb878e4e@usansolo.net>
	<20080611145908.GP3726@webber.adilger.int>
Message-ID: <20080611164904.GA10071@kadzban.is-a-geek.net>

On Wed, Jun 11, 2008 at 08:59:08AM -0600, Andreas Dilger wrote:
> On Jun 11, 2008  13:51 +0200, santi at usansolo.net wrote:
> > On Wed, 11 Jun 2008 10:14:45 +0200, <santi at usansolo.net> wrote:
> > 
> > >> Moving /var/cache/e2fsck to another disk partition will help (or
> > >> better yet, battery backed memory device), but the best thing you
> > >> can do is get a 64-bit kernel and not need to use the auxiliary
> > >> storage in the first place.
> > > 
> > > I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs
> > > -o size=2048M", but appears that will take a long time to complete
> > > too.. so the next test will be with a 64-bit LiveCD :)
> > 
> > Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox.
> > 3 times faster ;-)
> 
> ...but, isn't the problem that you don't have enough RAM?  Using
> tdb+ramfs isn't going to be faster than using the RAM directly.

It won't be faster, no, but it will be faster than tdb-on-disk, and much
faster than tdb on the same disk as the one that's being checked.

And it *might* allow e2fsck to allocate all the virtual memory that it
needs, depending on how the tmpfs driver works.  If tmpfs uses the same
VA space as e2fsck and the rest of the kernel, then it probably won't
help.  But if tmpfs can use a different pool somehow (whether that's
because the kernel uses a different set of pagetables, or whatever),
then it might.

> I suspect that the only way you are going to check this filesystem
> efficiently is to boot a 64-bit kernel (even just from a rescue disk),
> set up some swap just in case, and run e2fsck from there.

And try to run a 64-bit e2fsck binary, too.  The virtual address space
usage estimate that someone (Ted?) came up with earlier in this thread
was close to 4G, which means that even with a 64-bit kernel, a 32-bit
e2fsck binary might still run out of virtual address space.  (It will
need to map lots of disk, plus any real RAM usage, plus itself and any
libraries.  That last bit *might* push it over 4G, depending on how
accurate the estimate of 4G turns out to be.)

The easiest way to do this is probably run the e2fsck from the LiveCD
itself; don't try to run the 32-bit version that the system has
installed.  That version *might* work, but it'll be tight; a 64-bit
version that can use 40-odd bits in its virtual addresses (44?  48?  I
think it depends on the exact CPU model -- and the kernel, of course)
will have a *lot* more headroom.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080611/ce465b98/attachment.sig>

From ext3 at kalucki.com  Wed Jun 11 22:04:17 2008
From: ext3 at kalucki.com (John Kalucki)
Date: Wed, 11 Jun 2008 15:04:17 -0700
Subject: Poor Performance WhenNumber of Files > 1M
In-Reply-To: <484F63A0.50606@redhat.com>
References: <484F6036.8020900@kalucki.com> <484F63A0.50606@redhat.com>
Message-ID: <48504BE1.2000104@kalucki.com>

Eric Sandeen wrote:
> John Kalucki wrote:
>
>   
>> Performance seems to always map directly to the number of files in the 
>> ext3 filesystem.
>>
>> After some initial run-fast time, perhaps once dirty pages begin to be 
>> written aggressively, for every 5,000 files added, my files created per 
>> second tends to drop by about one. So, depending on the variables, say 
>> with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop, 
>> then more slowly drop to ~300 files/sec at perhaps 1 million files, then 
>> see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc.
>>
>> As you'd expect, there isn't much CPU utilization, other than iowait, 
>> and some kjournald activity.
>>
>> Is this a known limitation of ext3? Is expecting to write to 
>> O(10^6)-O(10^7) files in something approaching constant time expecting 
>> too much from a filesystem? What, exactly, am I stressing to cause this 
>> unbounded performance degradation?
>>     
>
> I think this is a linear search through the block groups for the new
> inode allocation, which always starts at the parent directory's block
> group; and starts over from there each time.  See find_group_other().
>
> So if the parent's group is full and so are the next 1000 block groups,
> it will search 1000 groups and find space in the 1001st.  On the next
> inode allocation it will re-search(!) those 1000 groups, and again find
> space in the 1001st.  And so on.  Until the 1001st is full, and then
> it'll search 1001 groups and find space in the 1002nd... etc (If I'm
> remembering/reading correctly, but this does jive with what you see.).
>
> I've toyed  with keeping track (in the parent's inode) where the last
> successful child allocation happened, and start the search there.  I'm a
> bit leery of how this might age, though... plus I'm not sure if it
> should be on-disk or just in memory.... But this behavior clearly needs
> some help.  I should probably just get it sent out for comment.
>
> -Eric
>   

This is the best explanation I've read so far. There does indeed appear 
to be some O(n) behavior that is exacerbated by having many directories 
in the working set (not open, just referenced often) and perhaps 
moderate fragmentation. I read up on ext3 inode allocation, and the 
attempt to place files in the same cylinder group as directories. Trying 
to work with this system, I started on a fresh filesystem and flattened 
the directory depth to just 4 levels, I've managed to boost performance 
greatly, and flatten the degradation curve quite a bit.

I can get to about 2,800,000 files before performance starts to slowly 
drop from a nearly constant ~1,700 file/sec. At ~4,000,000 files, I see 
about ~1,500 files/sec, and afterwards I start to see the old behavior 
of greater decline. By 5,500,000 files, it's down to 1,230 files/sec. 
I've used 9% of the space and 8% of the inodes at this point.

Changing journal size and /proc/sys/fs/file-max had no effect. Even 
dir_index had only marginal impact, as my directories have only about 
300 files each.

I think the biggest factor to making performance nearly linear is the 
number of directories in the working set. If this grows too large, the 
linear allocation behavior is magnified, and performance drops. My 
version of RHEL doesn't seem to allow tweaking of directory cache 
behavior, perhaps a deprecated feature from the 2.4 days.

If I discover anything else, I'll be sure to update this thread.
-John










From ext3 at kalucki.com  Wed Jun 11 22:25:17 2008
From: ext3 at kalucki.com (John Kalucki)
Date: Wed, 11 Jun 2008 15:25:17 -0700
Subject: Poor Performance WhenNumber of Files > 1M
In-Reply-To: <484FD343.1060308@redhat.com>
References: <484F6036.8020900@kalucki.com> <484F63A0.50606@redhat.com>
	<484FD343.1060308@redhat.com>
Message-ID: <485050CD.8070403@kalucki.com>

Ric Wheeler wrote:
> Eric Sandeen wrote:
>> John Kalucki wrote:
>>
>>  
>>> Performance seems to always map directly to the number of files in 
>>> the ext3 filesystem.
>>>
>>> After some initial run-fast time, perhaps once dirty pages begin to 
>>> be written aggressively, for every 5,000 files added, my files 
>>> created per second tends to drop by about one. So, depending on the 
>>> variables, say with 6 RAID10 spindles, I might start at ~700 
>>> files/sec, quickly drop, then more slowly drop to ~300 files/sec at 
>>> perhaps 1 million files, then see 299 files/sec for the next 5,000 
>>> creations, 298 files/sec, etc. etc.
>>>
>>> As you'd expect, there isn't much CPU utilization, other than 
>>> iowait, and some kjournald activity.
>>>
>>> Is this a known limitation of ext3? Is expecting to write to 
>>> O(10^6)-O(10^7) files in something approaching constant time 
>>> expecting too much from a filesystem? What, exactly, am I stressing 
>>> to cause this unbounded performance degradation?
>>>     
>>
>> I think this is a linear search through the block groups for the new
>> inode allocation, which always starts at the parent directory's block
>> group; and starts over from there each time.  See find_group_other().
>>
>> So if the parent's group is full and so are the next 1000 block groups,
>> it will search 1000 groups and find space in the 1001st.  On the next
>> inode allocation it will re-search(!) those 1000 groups, and again find
>> space in the 1001st.  And so on.  Until the 1001st is full, and then
>> it'll search 1001 groups and find space in the 1002nd... etc (If I'm
>> remembering/reading correctly, but this does jive with what you see.).
>>
>> I've toyed  with keeping track (in the parent's inode) where the last
>> successful child allocation happened, and start the search there.  I'm a
>> bit leery of how this might age, though... plus I'm not sure if it
>> should be on-disk or just in memory.... But this behavior clearly needs
>> some help.  I should probably just get it sent out for comment.
>>
>> -Eric
>>
>>   
> I run a very similar test, but normally run with a synchronous write 
> work load (i.e., fsync before close). In my testing, you will see a 
> slow but gradual decline in the files/sec. For example, on a 1TB S-ATA 
> drive, the latest test run started off at a rate of 22 files/sec (each 
> file is 40k) and is currently chugging along at a bit over 17 
> files/sec when it has hit 2.8 million files in one directory. I am 
> using the ext3 run to get a baseline for a similar run of xfs and btrfs.
>
> One other random tuning thought - you can help by writing into 
> separate directories, but you will need to make sure that you don't 
> produce a random write pattern when you select your target 
> subdirectory. I think that the use case mentioned using a hashed 
> directory structure which is fine, but you want to hash in a way that 
> writes into a shared subdirectory for some period of time (say get a 
> rotation of every X files or Y seconds).  Easiest way to do this is to 
> use a GUID with a time stamp and hash on the time stamp bits.
>
> Note that there is a multi-threaded performance bug in ext3 (Josef 
> Bacik had looked at fixing this) which throttles writes/sec down to 
> around 230 when you do synchronous transactions so you might be 
> hitting that as well.
>
> ric

Unfortunately, I don't have the opportunity to limit the directories. My 
application is taking random-ish data and organizing it into logical 
groups for subsequent quick reading. But I did take your suggestion into 
account and it contains what seems to be the important nugget -- too 
many active directories makes a bad situation worse.

But still, my test reaches a steady state of active directories pretty 
quickly -- or so I'd like to think. The performance does indeed continue 
to creep downwards.

I'm doing everything single-threaded. Introducing a second thread seems 
to be an immediate disaster, even though I'm stripped across 3 disks. 
Unfortunate. Perhaps moving the journal to another filesystem would 
allow better multi-threaded throughput, but I'm not sure that this is 
important to me.

xfs, zfs, btrfs, and reiser could be attractive for my use-case.

Thanks for your response,
John









From tytso at mit.edu  Thu Jun 12 05:24:29 2008
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 12 Jun 2008 01:24:29 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080611145908.GP3726@webber.adilger.int>
References: <20080609213320.GB26759@mit.edu>
	<7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>
	<20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net>
	<20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net>
	<20080611021759.GI8397@mit.edu>
	<00bf4cac93645dc74c04229696a20f11@usansolo.net>
	<d907e7ae73fd64adb02e0c2edb878e4e@usansolo.net>
	<20080611145908.GP3726@webber.adilger.int>
Message-ID: <20080612052429.GA18229@mit.edu>

On Wed, Jun 11, 2008 at 08:59:08AM -0600, Andreas Dilger wrote:
> > Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. 3
> > times faster ;-)
> 
> ...but, isn't the problem that you don't have enough RAM?  Using tdb+ramfs
> isn't going to be faster than using the RAM directly.

Tmpfs is swap backed, if swap has been configured.  So it can help. 

Another possibility is to use a statically linked e2fsck, since the
shared libraries chew up a lot of VM address space.  But in this
particular case, it probably wouldn't be enough.

I think the best thing to do is this case to use a 64-bit kernel and a
64-bit compiled e2fsck binary.

						- Ted



From ross at biostat.ucsf.edu  Mon Jun 16 03:46:21 2008
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Sun, 15 Jun 2008 20:46:21 -0700
Subject: spd_readdir.c and readdir_r [real new version]
In-Reply-To: <1212985588.32113.13.camel@corn.betterworld.us>
References: <1212903039.7158.31.camel@corn.betterworld.us>
	<1212985588.32113.13.camel@corn.betterworld.us>
Message-ID: <1213587981.8578.189.camel@corn.betterworld.us>

My previous attachment had only a link for the main file; the current
one should have the real thing.

For the full backup, using the preload library changed the backup time
from  over 35 hours to 22 hours for a full backup.  The full backup got
much slower as it progressed; my guess is something other than the
preload library (perhaps the snapshotting itself, bacula, or postgresql)
accounts for that.

The percentage change for incremental backups, which involve relatively
more time scanning, is larger: from 3 hours to under .5 hours.

There's no obvious speedup for the jobs involving Reiser filesystems.

All in all, a big win.  Thanks to everyone for your help, and especially
to Ted for the original code.

Ross Boylan

On Sun, 2008-06-08 at 21:26 -0700, Ross Boylan wrote:
> I've attached a modified version of Ted's spd_readdir.c that adds
> support for readdir_r and readdir64_r.  It appears to be working
> (readdir64_r is the only new routine getting exercised), but should be
> taken as a rough cut.  I also added a Makefile and a test program.
> 
> It also looks as if this is giving me a huge speed improvement (at least
> x4) of my backups of my ext3 partitions.  I'll try to report after a
> full and incremental backup complete, which will be a couple of days.
> 
> Originally I tried taking the threading code from the system
> implementations of the original readdir_r.  When that didn't work (since
> it was designed to be part of a libc build) I switched to pthreads.  I
> don't know if recursive locking is essential; I activated it at one
> point while trying to get things to work.
> 
> For big directories this code could use quite a lot of memory.  It
> allows an optional max size, beyond which it reverts to the original
> system calls.  I wonder if instead taking large directories in chunks
> would preserve much of the speedup while putting a bound on memory use.
> 
> Ross Boylan
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RBspd_dir.tgz
Type: application/x-compressed-tar
Size: 3147 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080615/d814b0fa/attachment.bin>

From magawake at gmail.com  Thu Jun 19 00:05:57 2008
From: magawake at gmail.com (Mag Gam)
Date: Wed, 18 Jun 2008 20:05:57 -0400
Subject: stride
Message-ID: <1cbd6f830806181705s4acc3817x409cb4ce5f5cb9bb@mail.gmail.com>

I am trying to understand the stride option for ext3 .

If I am using a Hardware RAID (3ware) with 6 disks and I decide to go with
RAID 5 with stripe of 128KB (default on my controller) and no spare.
By reading documentation I should do 128/4 as my stride size when creating
the file system. I am not understanding how this number works and what
exactly stride does. Can someone care to explain this to me?


TIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080618/00810a2a/attachment.htm>

From magawake at gmail.com  Thu Jun 19 00:14:29 2008
From: magawake at gmail.com (Mag Gam)
Date: Wed, 18 Jun 2008 20:14:29 -0400
Subject: stride
Message-ID: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>

I am trying to understand the stride setting for ext3 .

If I am using a Hardware RAID (3ware) with 6 disks and I decide to go with
RAID 5 with stripe of 128KB (default on my controller) and no spare.
By reading documentation I should do 128/4 as my stride size when creating
the file system. I am not understanding how this number works and what
exactly stride does. Can someone care to explain this to me?


TIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080618/47a9a33d/attachment.htm>

From adilger at sun.com  Thu Jun 19 05:47:50 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 18 Jun 2008 23:47:50 -0600
Subject: stride
In-Reply-To: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
Message-ID: <20080619054750.GO3726@webber.adilger.int>

On Jun 18, 2008  20:14 -0400, Mag Gam wrote:
> If I am using a Hardware RAID (3ware) with 6 disks and I decide to go with
> RAID 5 with stripe of 128KB (default on my controller) and no spare.
> By reading documentation I should do 128/4 as my stride size when creating
> the file system. I am not understanding how this number works and what
> exactly stride does. Can someone care to explain this to me?

The "stride" option changes the location of some of the filesystem metadata
so that it isn't all located on the same disk.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From magawake at gmail.com  Thu Jun 19 10:21:24 2008
From: magawake at gmail.com (Mag Gam)
Date: Thu, 19 Jun 2008 06:21:24 -0400
Subject: stride
In-Reply-To: <20080619054750.GO3726@webber.adilger.int>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
	<20080619054750.GO3726@webber.adilger.int>
Message-ID: <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>

ok, in a way its like a stripe? I though when you do a stripe you put the
metadata on number of disks too. How is that different? Is there a diagram I
can refer to?


TIA



On Thu, Jun 19, 2008 at 1:47 AM, Andreas Dilger <adilger at sun.com> wrote:

> On Jun 18, 2008  20:14 -0400, Mag Gam wrote:
> > If I am using a Hardware RAID (3ware) with 6 disks and I decide to go
> with
> > RAID 5 with stripe of 128KB (default on my controller) and no spare.
> > By reading documentation I should do 128/4 as my stride size when
> creating
> > the file system. I am not understanding how this number works and what
> > exactly stride does. Can someone care to explain this to me?
>
> The "stride" option changes the location of some of the filesystem metadata
> so that it isn't all located on the same disk.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080619/b6361f2c/attachment.htm>

From tytso at mit.edu  Thu Jun 19 11:42:44 2008
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 19 Jun 2008 07:42:44 -0400
Subject: stride
In-Reply-To: <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
	<20080619054750.GO3726@webber.adilger.int>
	<1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>
Message-ID: <20080619114244.GD11516@mit.edu>

On Thu, Jun 19, 2008 at 06:21:24AM -0400, Mag Gam wrote:
> ok, in a way its like a stripe? I though when you do a stripe you put the
> metadata on number of disks too. How is that different? Is there a diagram I
> can refer to?

Yes, which is why the mke2fs man page states:

	stride=<stripe-size>
		Configure  the	filesystem  for	 a  RAID  array with
		<stripe-size> filesystem blocks per stripe.

So if the size of a stripe on each a disk is 64k, and you are using a
4k filesystem blocksize, then 64k/4k == 16, and that would be an
"ideal" stride size, in that for each successive block group, the
inode and block bitmap would increased by an offset of 16 blocks from
the beginning of the block group.

The reason for doing this is to avoid problems where the block bitmap
ends up on the same disk for every single block group.  The classic
case where this would happen is if you have a 5 disks in a RAID 5
configuration, which means with 4 disks per stripe, and 8192 blocks in
a blockgroup, then if the block bitmap is always at the same offset
from the beginning of the block group, one disk will get all of the
block bitmaps, and that ends up being a major hot spot problem for the
hard drive.

As it turns out, if you use 4 disks in a RAID 5 configuration, or 6
disks in a RAID 5 configuration, this problem doesn't arise at all,
and you don't need to use the stride option.  And in most cases,
simply using a stride=1, that is actually enough to make sure that
each block and inode bitmaps will get forced onto successively
different disks.

With ext4's flex_bg enhancement, the need to specify stride option of
RAID arrays will also go away.

							- Ted



From magawake at gmail.com  Fri Jun 20 01:17:45 2008
From: magawake at gmail.com (Mag Gam)
Date: Thu, 19 Jun 2008 21:17:45 -0400
Subject: stride
In-Reply-To: <20080619114244.GD11516@mit.edu>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
	<20080619054750.GO3726@webber.adilger.int>
	<1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>
	<20080619114244.GD11516@mit.edu>
Message-ID: <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com>

What happens if you use a hardware raid, should the stride option be
considered? It seems you are referring to software raid, correct?
TIA


On Thu, Jun 19, 2008 at 7:42 AM, Theodore Tso <tytso at mit.edu> wrote:

> On Thu, Jun 19, 2008 at 06:21:24AM -0400, Mag Gam wrote:
> > ok, in a way its like a stripe? I though when you do a stripe you put the
> > metadata on number of disks too. How is that different? Is there a
> diagram I
> > can refer to?
>
> Yes, which is why the mke2fs man page states:
>
>        stride=<stripe-size>
>                Configure  the  filesystem  for  a  RAID  array with
>                <stripe-size> filesystem blocks per stripe.
>
> So if the size of a stripe on each a disk is 64k, and you are using a
> 4k filesystem blocksize, then 64k/4k == 16, and that would be an
> "ideal" stride size, in that for each successive block group, the
> inode and block bitmap would increased by an offset of 16 blocks from
> the beginning of the block group.
>
> The reason for doing this is to avoid problems where the block bitmap
> ends up on the same disk for every single block group.  The classic
> case where this would happen is if you have a 5 disks in a RAID 5
> configuration, which means with 4 disks per stripe, and 8192 blocks in
> a blockgroup, then if the block bitmap is always at the same offset
> from the beginning of the block group, one disk will get all of the
> block bitmaps, and that ends up being a major hot spot problem for the
> hard drive.
>
> As it turns out, if you use 4 disks in a RAID 5 configuration, or 6
> disks in a RAID 5 configuration, this problem doesn't arise at all,
> and you don't need to use the stride option.  And in most cases,
> simply using a stride=1, that is actually enough to make sure that
> each block and inode bitmaps will get forced onto successively
> different disks.
>
> With ext4's flex_bg enhancement, the need to specify stride option of
> RAID arrays will also go away.
>
>                                                        - Ted
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080619/87d39f6a/attachment.htm>

From tytso at mit.edu  Fri Jun 20 02:08:47 2008
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 19 Jun 2008 22:08:47 -0400
Subject: stride
In-Reply-To: <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
	<20080619054750.GO3726@webber.adilger.int>
	<1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>
	<20080619114244.GD11516@mit.edu>
	<1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com>
Message-ID: <20080620020847.GE9119@mit.edu>

On Thu, Jun 19, 2008 at 09:17:45PM -0400, Mag Gam wrote:
> What happens if you use a hardware raid, should the stride option be
> considered? It seems you are referring to software raid, correct?

It doesn't matter whethre it is hardware or software raid.  What
matters is the *geometry* of the RAID array.  i.e., how many
filesystem blocks are in an individual disk's stripe, and how many
disks are in use (minus how many parity disks are in use).  This
information may be somewhat more hidden in a hardware raid array, but
it is possible to extract this information, and most hardware raid
arrays will allow you to configure these parameters as well, to
varying degrees of flexibility.

					- Ted



From magawake at gmail.com  Fri Jun 20 10:21:37 2008
From: magawake at gmail.com (Mag Gam)
Date: Fri, 20 Jun 2008 06:21:37 -0400
Subject: stride
In-Reply-To: <20080620020847.GE9119@mit.edu>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
	<20080619054750.GO3726@webber.adilger.int>
	<1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>
	<20080619114244.GD11516@mit.edu>
	<1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com>
	<20080620020847.GE9119@mit.edu>
Message-ID: <1cbd6f830806200321r18deb81cx5089f21520b1a838@mail.gmail.com>

Ted,

This is the type of information I was looking for. No seems to explain this
well.

Also, on the same topic. For a very large filesystem ie, 3TB, should I
consider anything special, something like -O dir_index? I am looking for
peek performance.


TIA


On Thu, Jun 19, 2008 at 10:08 PM, Theodore Tso <tytso at mit.edu> wrote:

> On Thu, Jun 19, 2008 at 09:17:45PM -0400, Mag Gam wrote:
> > What happens if you use a hardware raid, should the stride option be
> > considered? It seems you are referring to software raid, correct?
>
> It doesn't matter whethre it is hardware or software raid.  What
> matters is the *geometry* of the RAID array.  i.e., how many
> filesystem blocks are in an individual disk's stripe, and how many
> disks are in use (minus how many parity disks are in use).  This
> information may be somewhat more hidden in a hardware raid array, but
> it is possible to extract this information, and most hardware raid
> arrays will allow you to configure these parameters as well, to
> varying degrees of flexibility.
>
>                                        - Ted
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080620/d702e13b/attachment.htm>

From lists at nerdbynature.de  Sun Jun 22 00:34:47 2008
From: lists at nerdbynature.de (Christian Kujau)
Date: Sun, 22 Jun 2008 02:34:47 +0200 (CEST)
Subject: stride
In-Reply-To: <1cbd6f830806200321r18deb81cx5089f21520b1a838@mail.gmail.com>
References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com>
	<20080619054750.GO3726@webber.adilger.int>
	<1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com>
	<20080619114244.GD11516@mit.edu>
	<1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com>
	<20080620020847.GE9119@mit.edu>
	<1cbd6f830806200321r18deb81cx5089f21520b1a838@mail.gmail.com>
Message-ID: <alpine.DEB.1.10.0806220230210.15824@sheep.housecafe.de>

On Fri, 20 Jun 2008, Mag Gam wrote:
> consider anything special, something like -O dir_index? I am looking for
> peek performance.

Depends on how many files, directories, small/big files, 
reads/writes...etc.

There are various benchmarks and tuning hints for ext3 around, but if you 
want peak performance, you're better off testing *your* application with 
different mkfs/mount options and see what's best for *you*.

my 2 cents,
C.
-- 
BOFH excuse #391:

We already sent around a notice about that.



From magawake at gmail.com  Sun Jun 22 02:03:03 2008
From: magawake at gmail.com (Mag Gam)
Date: Sat, 21 Jun 2008 22:03:03 -0400
Subject: indexing symbolic links
Message-ID: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>

Is there a way to index symbolic links in ext3? For example, I want to keep
track of all symbolic links on the filesystem (soft mainly). I think I would
have to write a wrapper around ln to keep it in a database, but I was
wondering if anyone has done something similar to this.

TIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080621/d9b0e064/attachment.htm>

From alex at alex.org.uk  Sun Jun 22 08:18:51 2008
From: alex at alex.org.uk (Alex Bligh)
Date: Sun, 22 Jun 2008 09:18:51 +0100
Subject: indexing symbolic links
In-Reply-To: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
Message-ID: <CF03C0719F380D8579A79157@nimrod.local>



--On 21 June 2008 22:03:03 -0400 Mag Gam <magawake at gmail.com> wrote:

> Is there a way to index symbolic links in ext3? For example, I want to
> keep track of all symbolic links on the filesystem (soft mainly). I think
> I would have to write a wrapper around ln to keep it in a database, but I
> was wondering if anyone has done something similar to this.

How about
  find [mount point] -type l -x -print

Wrapping ln won't do the job completely as (a) it won't track the links
being removed (e.g. via rm), and (b) it won't track links being created
by programs other than ln which use the library or the system call
directly.

When you say "mainly soft", remember EVERY file /is/ a hard link. Just
some files have more than one. Look at the "-links" option to find, which
is easy enough for normal files though you will have to do a bit of thinking
re hard linked directories, "." and "..".

Alex



From magawake at gmail.com  Sun Jun 22 13:12:26 2008
From: magawake at gmail.com (Mag Gam)
Date: Sun, 22 Jun 2008 09:12:26 -0400
Subject: indexing symbolic links
In-Reply-To: <CF03C0719F380D8579A79157@nimrod.local>
References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
	<CF03C0719F380D8579A79157@nimrod.local>
Message-ID: <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>

Find or ls I can check for symbolic links, but the file system is very
large. About 250GB and I have several of them.
I was wondering if  ext3 kept track of these things, apparently it does not.


At my university, we have physical storage in a filesystem, and we assign
professors and students space by doing a symbolic link. Basically I want to
keep track of physical storage with virtual/logical storage. Thats why I ask
:-)

TIA



On Sun, Jun 22, 2008 at 4:18 AM, Alex Bligh <alex at alex.org.uk> wrote:

>
>
> --On 21 June 2008 22:03:03 -0400 Mag Gam <magawake at gmail.com> wrote:
>
>  Is there a way to index symbolic links in ext3? For example, I want to
>> keep track of all symbolic links on the filesystem (soft mainly). I think
>> I would have to write a wrapper around ln to keep it in a database, but I
>> was wondering if anyone has done something similar to this.
>>
>
> How about
>  find [mount point] -type l -x -print
>
> Wrapping ln won't do the job completely as (a) it won't track the links
> being removed (e.g. via rm), and (b) it won't track links being created
> by programs other than ln which use the library or the system call
> directly.
>
> When you say "mainly soft", remember EVERY file /is/ a hard link. Just
> some files have more than one. Look at the "-links" option to find, which
> is easy enough for normal files though you will have to do a bit of
> thinking
> re hard linked directories, "." and "..".
>
> Alex
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080622/5e5efb3a/attachment.htm>

From darkonc at gmail.com  Sun Jun 22 16:05:15 2008
From: darkonc at gmail.com (Stephen Samuel)
Date: Sun, 22 Jun 2008 09:05:15 -0700
Subject: indexing symbolic links
In-Reply-To: <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>
References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
	<CF03C0719F380D8579A79157@nimrod.local>
	<1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>
Message-ID: <6cd50f9f0806220905y339f4a11hd2ad9a2d7a7a65c3@mail.gmail.com>

If you're only counting when YOU create and remove links, then you could put
a hook and count from there. (without depending on anything within ext3)
If, on the other hand, you're depending on when ANYBODY creates or removes a
link (hard or soft), then you have a good bit more work to do.  The only way
that I can think of to do that would be to put a link into the ext3 driver
-- but you wouldn't just have to log the symlink calls. you'd also have to
track things like renames  (in-directory vs cross-directory vs
cross-filesystem) and unlinks (rm)

Given that it sounds like you're doing symlinks and the target files aren't
actually being owned by the person in question, it doesn't sound like the
quota system would do the job for you, so you're probably going to need tro
either do some kernel hacking, or write a batch job that runs regularly that
does the information collection for you.

2008/6/22 Mag Gam <magawake at gmail.com>:

> Find or ls I can check for symbolic links, but the file system is very
> large. About 250GB and I have several of them.
> I was wondering if  ext3 kept track of these things, apparently it does
> not.
>
> At my university, we have physical storage in a filesystem, and we assign
> professors and students space by doing a symbolic link. Basically I want to
> keep track of physical storage with virtual/logical storage. Thats why I ask
> :-)
>
> TIA


-- 
Stephen Samuel http://www.bcgreen.com
778-861-7641
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080622/5ec0a155/attachment.htm>

From magawake at gmail.com  Sun Jun 22 18:25:16 2008
From: magawake at gmail.com (Mag Gam)
Date: Sun, 22 Jun 2008 14:25:16 -0400
Subject: indexing symbolic links
In-Reply-To: <6cd50f9f0806220905y339f4a11hd2ad9a2d7a7a65c3@mail.gmail.com>
References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
	<CF03C0719F380D8579A79157@nimrod.local>
	<1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>
	<6cd50f9f0806220905y339f4a11hd2ad9a2d7a7a65c3@mail.gmail.com>
Message-ID: <1cbd6f830806221125j75b628e1l66b6150793a649fe@mail.gmail.com>

wow, i didn't think about renames and all.

I am not a strong C programmer so I don't think hacking the kernel is an
option :-(

I bet there is more to this also...

Thanks for your thoughts.


On Sun, Jun 22, 2008 at 12:05 PM, Stephen Samuel <darkonc at gmail.com> wrote:

> If you're only counting when YOU create and remove links, then you could
> put a hook and count from there. (without depending on anything within ext3)
> If, on the other hand, you're depending on when ANYBODY creates or removes
> a link (hard or soft), then you have a good bit more work to do.  The only
> way that I can think of to do that would be to put a link into the ext3
> driver -- but you wouldn't just have to log the symlink calls. you'd also
> have to track things like renames  (in-directory vs cross-directory vs
> cross-filesystem) and unlinks (rm)
>
> Given that it sounds like you're doing symlinks and the target files aren't
> actually being owned by the person in question, it doesn't sound like the
> quota system would do the job for you, so you're probably going to need tro
> either do some kernel hacking, or write a batch job that runs regularly that
> does the information collection for you.
>
> 2008/6/22 Mag Gam <magawake at gmail.com>:
>
>> Find or ls I can check for symbolic links, but the file system is very
>> large. About 250GB and I have several of them.
>> I was wondering if  ext3 kept track of these things, apparently it does
>> not.
>>
>> At my university, we have physical storage in a filesystem, and we assign
>> professors and students space by doing a symbolic link. Basically I want to
>> keep track of physical storage with virtual/logical storage. Thats why I ask
>> :-)
>>
>> TIA
>
>
> --
> Stephen Samuel http://www.bcgreen.com
> 778-861-7641
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080622/484941ba/attachment.htm>

From alex at alex.org.uk  Sun Jun 22 19:04:17 2008
From: alex at alex.org.uk (Alex Bligh)
Date: Sun, 22 Jun 2008 20:04:17 +0100
Subject: indexing symbolic links
In-Reply-To: <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>
References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
	<CF03C0719F380D8579A79157@nimrod.local>
	<1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>
Message-ID: <61CF57D9DB48898E17DC2EF7@Ximines.local>



--On 22 June 2008 09:12:26 -0400 Mag Gam <magawake at gmail.com> wrote:

> At my university, we have physical storage in a filesystem, and we assign
> professors and students space by doing a symbolic link. Basically I want
> to keep track of physical storage with virtual/logical storage. Thats why
> I ask :-)

If you want to track space usage, I suggest you track it using quota
or similar. "man quota" will give you a start.

Alex



From magawake at gmail.com  Sun Jun 22 20:37:59 2008
From: magawake at gmail.com (Mag Gam)
Date: Sun, 22 Jun 2008 16:37:59 -0400
Subject: indexing symbolic links
In-Reply-To: <61CF57D9DB48898E17DC2EF7@Ximines.local>
References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com>
	<CF03C0719F380D8579A79157@nimrod.local>
	<1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com>
	<61CF57D9DB48898E17DC2EF7@Ximines.local>
Message-ID: <1cbd6f830806221337y5dbc5173qbf4e7222b3fa9f67@mail.gmail.com>

Unfortunately, tracking space wasn't me goal. I want to keep track of my
symbolic links :-)



On Sun, Jun 22, 2008 at 3:04 PM, Alex Bligh <alex at alex.org.uk> wrote:

>
>
> --On 22 June 2008 09:12:26 -0400 Mag Gam <magawake at gmail.com> wrote:
>
>  At my university, we have physical storage in a filesystem, and we assign
>> professors and students space by doing a symbolic link. Basically I want
>> to keep track of physical storage with virtual/logical storage. Thats why
>> I ask :-)
>>
>
> If you want to track space usage, I suggest you track it using quota
> or similar. "man quota" will give you a start.
>
> Alex
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080622/5e4eb33e/attachment.htm>

From rjackson at mason.gmu.edu  Tue Jun 24 12:30:29 2008
From: rjackson at mason.gmu.edu (Richard Jackson)
Date: Tue, 24 Jun 2008 08:30:29 -0400 (EDT)
Subject: stride (fwd)
Message-ID: <200806241230.m5OCUTqq004576@mason.gmu.edu>

Two things;

1. Most likely I missed it but I could not find how to report the stride 
   setting for a ext3 filesystem.  I do not see stride mentioned in the
   man pages for dumpe2fs and tune2fs nor in the dumpe2fs report.

2. It has been pointed out the mke2fs man page description for stride needs
   improvement.  Andreas Dilger in a post last year,
   http://osdir.com/ml/file-systems.ext3.user/2007-06/msg00003.html,
   mentioned a patch was submitted.  I assume to address the mke2fs man page.

   If this is not the case then I suggest adding something similar to Ted's
   or Andreas' descriptions to replace the current stride mke2fs man page.

   If nothing else change from

 	stride=<stripe-size>
 		Configure  the	filesystem  for	 a  RAID  array with
 		<stripe-size> filesystem blocks per stripe.

   to

 	stride=<stride-size>

                The number of filesystem blocks on a single disk.  The purpose
  		is to spread the filesystem metadata across the disks.  For
		example, if the RAID chunk/segment size is 64KB and the 
 		filesystem block size is 4KB, then the stride size is 16
		(64KB/4KB).

These types of explanations are more helpful than something like...

  -f fragment-size
              Specify the size of fragments in bytes.

taken from the mke2fs man pages.  As you can see the explanation adds very
little value.  The stride explanation simply seems wrong.

Richard

Forwarded message:
> On Thu, Jun 19, 2008 at 06:21:24AM -0400, Mag Gam wrote:
> > ok, in a way its like a stripe? I though when you do a stripe you put the
> > metadata on number of disks too. How is that different? Is there a diagram I
> > can refer to?
> 
> Yes, which is why the mke2fs man page states:
> 
> 	stride=<stripe-size>
> 		Configure  the	filesystem  for	 a  RAID  array with
> 		<stripe-size> filesystem blocks per stripe.
> 
> So if the size of a stripe on each a disk is 64k, and you are using a
> 4k filesystem blocksize, then 64k/4k == 16, and that would be an
> "ideal" stride size, in that for each successive block group, the
> inode and block bitmap would increased by an offset of 16 blocks from
> the beginning of the block group.
> 
> The reason for doing this is to avoid problems where the block bitmap
> ends up on the same disk for every single block group.  The classic
> case where this would happen is if you have a 5 disks in a RAID 5
> configuration, which means with 4 disks per stripe, and 8192 blocks in
> a blockgroup, then if the block bitmap is always at the same offset
> from the beginning of the block group, one disk will get all of the
> block bitmaps, and that ends up being a major hot spot problem for the
> hard drive.
> 
> As it turns out, if you use 4 disks in a RAID 5 configuration, or 6
> disks in a RAID 5 configuration, this problem doesn't arise at all,
> and you don't need to use the stride option.  And in most cases,
> simply using a stride=1, that is actually enough to make sure that
> each block and inode bitmaps will get forced onto successively
> different disks.
> 
> With ext4's flex_bg enhancement, the need to specify stride option of
> RAID arrays will also go away.
> 
> 							- Ted
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
> 


-- 
Regards,                                                   /~\ The ASCII
Richard Jackson                                            \ / Ribbon Campaign
Computer Systems Engineer,                                  X  Against HTML
Information Technology Unit, Technology Systems Division   / \ Email!
Enterprise Servers and Operations Department
George Mason University, Fairfax, Virginia



From adilger at sun.com  Wed Jun 25 08:36:37 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 25 Jun 2008 02:36:37 -0600
Subject: stride (fwd)
In-Reply-To: <200806241230.m5OCUTqq004576@mason.gmu.edu>
References: <200806241230.m5OCUTqq004576@mason.gmu.edu>
Message-ID: <20080625083637.GW6239@webber.adilger.int>

On Jun 24, 2008  08:30 -0400, Richard Jackson wrote:
>    If this is not the case then I suggest adding something similar to Ted's
>    or Andreas' descriptions to replace the current stride mke2fs man page.
> 
>    If nothing else change from
> 
>  	stride=<stripe-size>
>  		Configure  the	filesystem  for	 a  RAID  array with
>  		<stripe-size> filesystem blocks per stripe.
> 
>    to
> 
>  	stride=<stride-size>
> 
>                 The number of filesystem blocks on a single disk.  The purpose
>   		is to spread the filesystem metadata across the disks.  For
> 		example, if the RAID chunk/segment size is 64KB and the 
>  		filesystem block size is 4KB, then the stride size is 16
> 		(64KB/4KB).

The patch to add the "stride" and "stripe-size" options to mke2fs and
mke2fs(8) man pages were already included upstream for 1.40.7 or earlier.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From yamin_yossi at diligent.com  Wed Jun 25 12:55:24 2008
From: yamin_yossi at diligent.com (Yamin, Yossi)
Date: Wed, 25 Jun 2008 15:55:24 +0300
Subject: "Attempt to access beyond end of device" problem 
Message-ID: <E6BDB590E306534485134219F431E0A853CBB6@ILEX01.corp.diligent.com>

Hi,

We are using Ext3 on with RedHat 4 U3 File Sysetm.

We got the following errors at the /var/log/messages file

 

Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

--

Jun 23 09:37:47 diligent1 kernel: lpfc1: BUFF seg 5 free 946 numblks
1024

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

 

 

running fsck.ext3 -n -f on this device yield the Below  output.

"

 Pass 1: Checking inodes, blocks, and sizes

Inode 192086020 has illegal block(s).  Clear? no

 

Illegal block #18620428 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620429 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620430 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620431 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620432 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620433 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620434 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620435 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620436 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620437 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620438 (774778414) in inode 192086020.  IGNORED.

Too many illegal blocks in inode 192086020.

Clear inode? no

 

Suppress messages? no

 

Illegal block #18620439 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620440 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620441 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620442 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620443 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620444 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620445 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620446 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620447 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620448 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620449 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620450 (774778414) in inode 192086020.  IGNORED.

Too many illegal blocks in inode 192086020.

Clear inode? no

 

Suppress messages? no

 

Illegal block #18620451 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620452 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620453 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620454 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620455 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620456 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620457 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620458 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620459 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620460 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620461 (774778414) in inode 192086020.  IGNORED.

Illegal block #18620462 (774778414) in inode 192086020.  IGNORED.

Too many illegal blocks in inode 192086020.

Clear inode? no

 

Suppress messages? no

 

 "

Do you think that running fsck with corrective actions will destroy part
of my data consistency? if the answer is yes is there any other way to
recover?

 

What do you think was the root cause for this issue ?

 

Please notice that this specific FS is more than 2TB size but configured
with msdos partition label.

 

 

Best regards

Yossi Yamin

Sr.Technical specialist 

Diligent Technologies, an IBM Company

 

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080625/b34a4743/attachment.htm>

From sandeen at redhat.com  Wed Jun 25 13:56:04 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Jun 2008 08:56:04 -0500
Subject: "Attempt to access beyond end of device" problem
In-Reply-To: <E6BDB590E306534485134219F431E0A853CBB6@ILEX01.corp.diligent.com>
References: <E6BDB590E306534485134219F431E0A853CBB6@ILEX01.corp.diligent.com>
Message-ID: <48624E74.1070306@redhat.com>

Yamin, Yossi wrote:
> Hi,
> 
> We are using Ext3 on with RedHat 4 U3 File Sysetm.
> 
> We got the following errors at the /var/log/messages file
> 
>  
> 
> Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device
> 
> Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
> limit=4183807887

...

> What do you think was the root cause for this issue ?

> Please notice that this specific FS is more than 2TB size but configured
> with msdos partition label.

Hm, I think you just stated the root cause.  Did the filesystem work
fine until you rebooted it?  Did it used to be 8T?  And did you use
parted to create it?

Parted wrongly allows you to make a >2T msdos partition table and pokes
it directly into the kernel, but the on-disk format cannot hold the
large value.  So when you reboot, it's read as something smaller.

You might be able to do a trick where you create a new GPT label in
place of the old DOS label, with the same start point as the dos label,
but with a correct endpoint.

I would not repair the fs; if my guess is right, 3/4 of it is now
unreachable and fsck will probably do heavy damage.

-Eric



From yamin_yossi at diligent.com  Wed Jun 25 14:58:59 2008
From: yamin_yossi at diligent.com (Yamin, Yossi)
Date: Wed, 25 Jun 2008 17:58:59 +0300
Subject: "Attempt to access beyond end of device" problem
In-Reply-To: <48624E74.1070306@redhat.com>
References: <E6BDB590E306534485134219F431E0A853CBB6@ILEX01.corp.diligent.com>
	<48624E74.1070306@redhat.com>
Message-ID: <E6BDB590E306534485134219F431E0A853CBBA@ILEX01.corp.diligent.com>

HI,
Thanks for the quick response.
I think the situation is different then what you describe since we have
abut  10 FS with the same size 2142.1 GB that have no problem.
Thanks,
Yossi

-----Original Message-----
From: Eric Sandeen [mailto:sandeen at redhat.com] 
Sent: Wednesday, June 25, 2008 4:56 PM
To: Yamin, Yossi
Cc: ext3-users at redhat.com
Subject: Re: "Attempt to access beyond end of device" problem

Yamin, Yossi wrote:
> Hi,
> 
> We are using Ext3 on with RedHat 4 U3 File Sysetm.
> 
> We got the following errors at the /var/log/messages file
> 
>  
> 
> Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of
device
> 
> Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
> limit=4183807887

...

> What do you think was the root cause for this issue ?

> Please notice that this specific FS is more than 2TB size but
configured
> with msdos partition label.

Hm, I think you just stated the root cause.  Did the filesystem work
fine until you rebooted it?  Did it used to be 8T?  And did you use
parted to create it?

Parted wrongly allows you to make a >2T msdos partition table and pokes
it directly into the kernel, but the on-disk format cannot hold the
large value.  So when you reboot, it's read as something smaller.

You might be able to do a trick where you create a new GPT label in
place of the old DOS label, with the same start point as the dos label,
but with a correct endpoint.

I would not repair the fs; if my guess is right, 3/4 of it is now
unreachable and fsck will probably do heavy damage.

-Eric



From sandeen at redhat.com  Wed Jun 25 15:03:29 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Jun 2008 10:03:29 -0500
Subject: "Attempt to access beyond end of device" problem
In-Reply-To: <E6BDB590E306534485134219F431E0A853CBBA@ILEX01.corp.diligent.com>
References: <E6BDB590E306534485134219F431E0A853CBB6@ILEX01.corp.diligent.com>
	<48624E74.1070306@redhat.com>
	<E6BDB590E306534485134219F431E0A853CBBA@ILEX01.corp.diligent.com>
Message-ID: <48625E41.5020104@redhat.com>

Yamin, Yossi wrote:
> HI,
> Thanks for the quick response.
> I think the situation is different then what you describe since we have
> abut  10 FS with the same size 2142.1 GB that have no problem.

Hm, ok, you said that it was > 2T... I guess that's TiB vs. TB.

Then perhaps it is just localized corruption (hard to say from *what*)
and an e2fsck might fix it up just fine.

-Eric



From yamin_yossi at diligent.com  Tue Jun 24 16:45:46 2008
From: yamin_yossi at diligent.com (Yamin, Yossi)
Date: Tue, 24 Jun 2008 19:45:46 +0300
Subject: "Attempt to access beyond end of device" problem 
Message-ID: <E6BDB590E306534485134219F431E0A82DE712@ILEX01.corp.diligent.com>

Hi,

We are using Ext3 on with RedHat 4 U3 File Sysetm.

We got the following errors at the /var/log/messages file

 

Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

--

Jun 23 09:37:47 diligent1 kernel: lpfc1: BUFF seg 5 free 946 numblks
1024

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device

Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392,
limit=4183807887

 

 

running fsck.ext3 -n -f on this device yield the attached output.

 

 

Do you think that running fsck with corrective actions will destroy part
of my data consistency? if the answer is yes is there any other way to
recover?

 

What do you think was the root cause for this issue ?

 

Please notice that this specific FS is more than 2TB size but configured
with msdos partition label.

 

 

Best regards

Yossi Yamin

Sr.Technical specialist 

Diligent Technologies, an IBM Company

 

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080624/a07341d7/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fsck_sddlmbj1.zip
Type: application/x-zip-compressed
Size: 463793 bytes
Desc: fsck_sddlmbj1.zip
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080624/a07341d7/attachment.bin>

From howachen at gmail.com  Fri Jun 27 04:28:58 2008
From: howachen at gmail.com (howard chen)
Date: Fri, 27 Jun 2008 12:28:58 +0800
Subject: Recommended number of files stored under a single folder
Message-ID: <b66ddc900806262128y2966ecbep470b2dffbb38ec0f@mail.gmail.com>

Hi,

I have a web site for storing images and serve to public. In my site, I
need to set a rule for controlling the max. number of files that would
be allowed for client to upload,
asI know that performance of FS degrade when number of files increase,
can anyone
suggest a number so I can stop client from uploading too many files?
E.g. 10K would
be okay?


Thanks.



From magawake at gmail.com  Sat Jun 28 04:13:30 2008
From: magawake at gmail.com (Mag Gam)
Date: Sat, 28 Jun 2008 00:13:30 -0400
Subject: inode and filesystem question
Message-ID: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>

While reading for fun, I noticed inode does not carry filename. I always
though it did. I read that it is carried by the directory structure and the
kernel interpolates it. Can someone please explain this to me

TIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080628/32cd656a/attachment.htm>

From bruno at wolff.to  Sat Jun 28 04:22:18 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Fri, 27 Jun 2008 23:22:18 -0500
Subject: inode and filesystem question
In-Reply-To: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>
References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>
Message-ID: <20080628042218.GA17730@wolff.to>

On Sat, Jun 28, 2008 at 00:13:30 -0400,
  Mag Gam <magawake at gmail.com> wrote:
> While reading for fun, I noticed inode does not carry filename. I always
> though it did. I read that it is carried by the directory structure and the
> kernel interpolates it. Can someone please explain this to me

A file can have more than one name. You can read up on "hard link" for
more information.



From magawake at gmail.com  Sat Jun 28 11:39:55 2008
From: magawake at gmail.com (Mag Gam)
Date: Sat, 28 Jun 2008 07:39:55 -0400
Subject: inode and filesystem question
In-Reply-To: <20080628042218.GA17730@wolff.to>
References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>
	<20080628042218.GA17730@wolff.to>
Message-ID: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>

Well, I guess this is more for a theoretical question. How the filename is
determined if its not in the inode.



On Sat, Jun 28, 2008 at 12:22 AM, Bruno Wolff III <bruno at wolff.to> wrote:

> On Sat, Jun 28, 2008 at 00:13:30 -0400,
>  Mag Gam <magawake at gmail.com> wrote:
> > While reading for fun, I noticed inode does not carry filename. I always
> > though it did. I read that it is carried by the directory structure and
> the
> > kernel interpolates it. Can someone please explain this to me
>
> A file can have more than one name. You can read up on "hard link" for
> more information.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080628/ba675c19/attachment.htm>

From alex at alex.org.uk  Sat Jun 28 11:54:20 2008
From: alex at alex.org.uk (Alex Bligh)
Date: Sat, 28 Jun 2008 12:54:20 +0100
Subject: inode and filesystem question
In-Reply-To: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>
References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>
	<20080628042218.GA17730@wolff.to>
	<1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>
Message-ID: <2903310EA2E80B55C5C080F7@Ximines.local>



--On 28 June 2008 07:39:55 -0400 Mag Gam <magawake at gmail.com> wrote:

> Well, I guess this is more for a theoretical question. How the filename
> is determined if its not in the inode.

It isn't. There is no easy way to get back from an inode number to a
filename (or filenames, as there can be more than one - think how hard
links work, multiple directory entries (and hence filenames) pointing to
one inode) apart from recurse through the entire directory tree and find
which directory entries contain that inode number. That's because there
is (fsck type operations apart) in general no need to go from an inode
number to the list of directory entries that point to it. Indeed some
inodes can have no directory entry pointing to them (e.g. if you
open a file, then unlink it (with rm) before closing it).

This isn't ext3 specific, this is the way UNIX file systems work. I suggest
doing some background reading on UNIX filesystems in general rather than
asking on an ext3 specific list. For a very simple intro see:
 http://en.wikipedia.org/wiki/Inode

Alex



From davids at webmaster.com  Sat Jun 28 18:02:54 2008
From: davids at webmaster.com (David Schwartz)
Date: Sat, 28 Jun 2008 11:02:54 -0700
Subject: inode and filesystem question
In-Reply-To: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>
Message-ID: <MDEHLPKNGKAHNMBLJOLKGEDPNMAC.davids@webmaster.com>



> Well, I guess this is more for a theoretical question.
> How the filename is determined if its not in the inode.

Simple, files don't have names. Directory entries do. A directory entry's name is stored in the directory entry, along with the inode number of the file it references.

This is the UNIX way, love it or hate it.

DS





From yamin_yossi at diligent.com  Sat Jun 28 20:50:16 2008
From: yamin_yossi at diligent.com (Yamin, Yossi)
Date: Sat, 28 Jun 2008 23:50:16 +0300
Subject: debugfs question 
In-Reply-To: <E6BDB590E306534485134219F431E0A82DE712@ILEX01.corp.diligent.com>
References: <E6BDB590E306534485134219F431E0A82DE712@ILEX01.corp.diligent.com>
Message-ID: <E6BDB590E306534485134219F431E0A82DE719@ILEX01.corp.diligent.com>

Hi, 

I am trying to read a file directly from the disk using debugfs utility.

I am running "stat" on the file I want, filter out IND and Bind blocks,
and then copy the data blocks using dd directly from the Disk.

On small files it work'd (13MB).

On big files (3.5 , 440 GB)  the size is the same but md5sum get differ.

What am I doing wrong?

I umount the FS before I start so the file is not changing.

 

Best regards

Yossi Yamin

Sr.Technical specialist 

Diligent Technologies, an IBM Company

 

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080628/1e51f6c5/attachment.htm>

From bruno at wolff.to  Sun Jun 29 13:37:16 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Sun, 29 Jun 2008 08:37:16 -0500
Subject: inode and filesystem question
In-Reply-To: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>
References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>
	<20080628042218.GA17730@wolff.to>
	<1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>
Message-ID: <20080629133716.GA25425@wolff.to>

On Sat, Jun 28, 2008 at 07:39:55 -0400,
  Mag Gam <magawake at gmail.com> wrote:
> Well, I guess this is more for a theoretical question. How the filename is
> determined if its not in the inode.

Filenames are matched to inodes in the directory blocks. (I am assuming
that's the question you meant to ask. The phrasing of your question is a bit
odd and you may have really been asking a different question.)



From magawake at gmail.com  Sun Jun 29 17:14:46 2008
From: magawake at gmail.com (Mag Gam)
Date: Sun, 29 Jun 2008 13:14:46 -0400
Subject: inode and filesystem question
In-Reply-To: <20080629133716.GA25425@wolff.to>
References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com>
	<20080628042218.GA17730@wolff.to>
	<1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com>
	<20080629133716.GA25425@wolff.to>
Message-ID: <1cbd6f830806291014s46be11f5j7045d145a887574d@mail.gmail.com>

Thanks Bruno. Thats exactly what I was asking.

Some people got angry at me for asking here since its a "basic" Unix
question.  Sorry about that


On Sun, Jun 29, 2008 at 9:37 AM, Bruno Wolff III <bruno at wolff.to> wrote:

> On Sat, Jun 28, 2008 at 07:39:55 -0400,
>   Mag Gam <magawake at gmail.com> wrote:
> > Well, I guess this is more for a theoretical question. How the filename
> is
> > determined if its not in the inode.
>
> Filenames are matched to inodes in the directory blocks. (I am assuming
> that's the question you meant to ask. The phrasing of your question is a
> bit
> odd and you may have really been asking a different question.)
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080629/2de6a972/attachment.htm>