From codevana at gmail.com Wed Jun 4 02:43:18 2008 From: codevana at gmail.com (Srinivas Murthy) Date: Tue, 3 Jun 2008 19:43:18 -0700 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: References: Message-ID: Hi I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system. The crash is intermittent and seems to happen w/ md raid 1 sync. As you can see one of the cpu's is running the md_thread while the other was in kjournald. Is there a known race condn between kjournald and md_thread threads? Anyone knows the fix for this? Thanks. <6>md: md0 stopped. <6>md: bind <6>md: bind <6>raid1: raid set md0 active with 2 out of 2 mirrors <6>md: md1 stopped. <6>md: bind <6>md: bind <6>raid1: raid set md1 active with 2 out of 2 mirrors <6>md: md2 stopped. <6>md: bind <6>md: bind <6>raid1: raid set md2 active with 2 out of 2 mirrors <6>md: md3 stopped. <6>md: bind <6>md: bind <6>raid1: raid set md3 active with 2 out of 2 mirrors <6>md: md4 stopped. <6>md: bind <6>md: bind <6>raid1: raid set md4 active with 2 out of 2 mirrors <6>md: md5 stopped. <6>md: bind <6>md: bind <3>md: md5: raid array is not clean -- starting background reconstruction <6>raid1: raid set md5 active with 2 out of 2 mirrors <6>md: resync of RAID array md5 <6>md: minimum _guaranteed_ speed: 1000 KB/sec/disk. <6>md: using maximum available idle IO bandwidth (but not more than 20000 KB/sec) for resync. <6>md: using 128k window, over a total of 7339904 blocks. <6>md: md6 stopped. <6>md: bind <6>md: bind <6>raid1: raid set md6 active with 2 out of 2 mirrors <6>kjournald starting. Commit interval 5 seconds <6>EXT3-fs: mounted filesystem with ordered data mode. <6>kjournald starting. Commit interval 5 seconds <6>EXT3-fs: mounted filesystem with ordered data mode. <6>kjournald starting. Commit interval 5 seconds <6>EXT3 FS on md2, internal journal <6>EXT3-fs: mounted filesystem with ordered data mode. <6>kjournald starting. Commit interval 5 seconds <6>EXT3 FS on md5, internal journal <6>EXT3-fs: mounted filesystem with ordered data mode. <6>EXT3 FS on md0, internal journal <6>Adding 4138872k swap on /dev/md3. Priority:-1 extents:1 across:4138872k <6>EXT3 FS on md0, internal journal <6>bonding: bond0: setting mode to balance-rr (0). <6>tg3: eth0: Link is up at 1000 Mbps, full duplex. <6>tg3: eth0: Flow control is off for TX and off for RX. <6>kjournald starting. Commit interval 5 seconds <6>EXT3 FS on md4, internal journal <6>EXT3-fs: mounted filesystem with ordered data mode. <6>kjournald starting. Commit interval 5 seconds <6>EXT3 FS on md6, internal journal <6>EXT3-fs: mounted filesystem with ordered data mode. <0>Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" <0>------------[ cut here ]------------ <2>kernel BUG at fs/jbd/commit.c:693! 
<0>invalid opcode: 0000 [#1] <0>PREEMPT SMP <0>CPU: 1 <0>EIP: 0060:[] Tainted: P VLI <0>EFLAGS: 00010296 (2.6.23.waas #4) <0>EIP is at journal_commit_transaction+0x879/0xe00 <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 00000000 f7f63414 <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 c6402000 00000000 <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 00000202 c70f8000 <0>Call Trace: <0> [] show_trace_log_lvl+0x1a/0x30 <0> [] show_stack_log_lvl+0x9a/0xc0 <0> [] show_registers+0x1d6/0x340 <0> [] die+0x10d/0x220 <0> [] do_trap+0x91/0xd0 <0> [] do_invalid_op+0x89/0xa0 <0> [] error_code+0x72/0x78 <0> [] kjournald+0xb5/0x1f0 <0> [] kthread+0x5c/0xa0 <0> [] kernel_thread_helper+0x7/0x1c <0> ======================= <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 <6>SysRq : Changing Loglevel <4>Loglevel set to 7 [0]kdb> btc btc: cpu status: Currently on cpu 0 Available cpus: 0-1 Stack traceback for pid 1609 0xc69ce000 1609 2 1 0 R 0xc69ce1e0 *md5_resync esp eip Function (args) 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060) 0xc69e5d78 0xc028ffbe bio_alloc+0xe 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0) 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid) 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0, 0xc69e5ea0, 0x0) 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00) 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80) 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid) Stack traceback for pid 1684 0xc39db580 1684 2 1 1 R 0xc39db760 kjournald esp eip Function (args) kdb_bb: address 0xffffffff not recognised Using old style backtrace, unreliable with no arguments esp eip Function (args) 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879 0xc6549f28 0xc0227945 lock_timer_base+0x25 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a 0xc6549f60 0xc02c3845 kjournald+0xb5 0xc6549f88 0xc0233040 autoremove_wake_function 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1 0xc6549fa8 0xc0233040 autoremove_wake_function [0]kdb> From sandeen at redhat.com Wed Jun 4 02:47:06 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 03 Jun 2008 21:47:06 -0500 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: References: Message-ID: <4846022A.1010707@redhat.com> Srinivas Murthy wrote: > Hi > > I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system. > <0>Assertion failure in journal_commit_transaction() at > fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" > <0>------------[ cut here ]------------ > <2>kernel BUG at fs/jbd/commit.c:693! > <0>invalid opcode: 0000 [#1] > <0>PREEMPT SMP > <0>CPU: 1 > <0>EIP: 0060:[] Tainted: P VLI What's the proprietary kernel; does it happen without the tainted kernel? 
-Eric > <0>EFLAGS: 00010296 (2.6.23.waas #4) > <0>EIP is at journal_commit_transaction+0x879/0xe00 > <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 > <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 > <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 > <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) > <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 > 00000000 f7f63414 > <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 > c6402000 00000000 > <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 > 00000202 c70f8000 > <0>Call Trace: > <0> [] show_trace_log_lvl+0x1a/0x30 > <0> [] show_stack_log_lvl+0x9a/0xc0 > <0> [] show_registers+0x1d6/0x340 > <0> [] die+0x10d/0x220 > <0> [] do_trap+0x91/0xd0 > <0> [] do_invalid_op+0x89/0xa0 > <0> [] error_code+0x72/0x78 > <0> [] kjournald+0xb5/0x1f0 > <0> [] kthread+0x5c/0xa0 > <0> [] kernel_thread_helper+0x7/0x1c > <0> ======================= > <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 > 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff > <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 > <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 > <6>SysRq : Changing Loglevel > <4>Loglevel set to 7 > > [0]kdb> btc > btc: cpu status: Currently on cpu 0 > Available cpus: 0-1 > Stack traceback for pid 1609 > 0xc69ce000 1609 2 1 0 R 0xc69ce1e0 *md5_resync > esp eip Function (args) > 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060) > 0xc69e5d78 0xc028ffbe bio_alloc+0xe > 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0) > 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid) > 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0, > 0xc69e5ea0, 0x0) > 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00) > 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80) > 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid) > Stack traceback for pid 1684 > 0xc39db580 1684 2 1 1 R 0xc39db760 kjournald > esp eip Function (args) > kdb_bb: address 0xffffffff not recognised > Using old style backtrace, unreliable with no arguments > esp eip Function (args) > 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879 > 0xc6549f28 0xc0227945 lock_timer_base+0x25 > 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a > 0xc6549f60 0xc02c3845 kjournald+0xb5 > 0xc6549f88 0xc0233040 autoremove_wake_function > 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1 > 0xc6549fa8 0xc0233040 autoremove_wake_function > [0]kdb> > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users From codevana at gmail.com Wed Jun 4 02:49:31 2008 From: codevana at gmail.com (Srinivas Murthy) Date: Tue, 3 Jun 2008 19:49:31 -0700 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: <4846022A.1010707@redhat.com> References: <4846022A.1010707@redhat.com> Message-ID: The changes we have are in the networking part. Nothing in the fs or block layers. Thanks, _Sri On Tue, Jun 3, 2008 at 7:47 PM, Eric Sandeen wrote: > Srinivas Murthy wrote: >> Hi >> >> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system. > >> <0>Assertion failure in journal_commit_transaction() at >> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" >> <0>------------[ cut here ]------------ >> <2>kernel BUG at fs/jbd/commit.c:693! 
>> <0>invalid opcode: 0000 [#1] >> <0>PREEMPT SMP >> <0>CPU: 1 >> <0>EIP: 0060:[] Tainted: P VLI > > What's the proprietary kernel; does it happen without the tainted kernel? > > -Eric > >> <0>EFLAGS: 00010296 (2.6.23.waas #4) >> <0>EIP is at journal_commit_transaction+0x879/0xe00 >> <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 >> <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 >> <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 >> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) >> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 >> 00000000 f7f63414 >> <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 >> c6402000 00000000 >> <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 >> 00000202 c70f8000 >> <0>Call Trace: >> <0> [] show_trace_log_lvl+0x1a/0x30 >> <0> [] show_stack_log_lvl+0x9a/0xc0 >> <0> [] show_registers+0x1d6/0x340 >> <0> [] die+0x10d/0x220 >> <0> [] do_trap+0x91/0xd0 >> <0> [] do_invalid_op+0x89/0xa0 >> <0> [] error_code+0x72/0x78 >> <0> [] kjournald+0xb5/0x1f0 >> <0> [] kthread+0x5c/0xa0 >> <0> [] kernel_thread_helper+0x7/0x1c >> <0> ======================= >> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 >> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff >> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 >> <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 >> <6>SysRq : Changing Loglevel >> <4>Loglevel set to 7 >> >> [0]kdb> btc >> btc: cpu status: Currently on cpu 0 >> Available cpus: 0-1 >> Stack traceback for pid 1609 >> 0xc69ce000 1609 2 1 0 R 0xc69ce1e0 *md5_resync >> esp eip Function (args) >> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060) >> 0xc69e5d78 0xc028ffbe bio_alloc+0xe >> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0) >> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid) >> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0, >> 0xc69e5ea0, 0x0) >> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00) >> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80) >> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid) >> Stack traceback for pid 1684 >> 0xc39db580 1684 2 1 1 R 0xc39db760 kjournald >> esp eip Function (args) >> kdb_bb: address 0xffffffff not recognised >> Using old style backtrace, unreliable with no arguments >> esp eip Function (args) >> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879 >> 0xc6549f28 0xc0227945 lock_timer_base+0x25 >> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a >> 0xc6549f60 0xc02c3845 kjournald+0xb5 >> 0xc6549f88 0xc0233040 autoremove_wake_function >> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1 >> 0xc6549fa8 0xc0233040 autoremove_wake_function >> [0]kdb> >> >> _______________________________________________ >> Ext3-users mailing list >> Ext3-users at redhat.com >> https://www.redhat.com/mailman/listinfo/ext3-users > > From sandeen at redhat.com Wed Jun 4 02:58:23 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 03 Jun 2008 21:58:23 -0500 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: References: <4846022A.1010707@redhat.com> Message-ID: <484604CF.10709@redhat.com> Srinivas Murthy wrote: > The changes we have are in the networking part. Nothing in the fs or > block layers. > > Thanks, > _Sri > Ok - and does it still happen without the taint? :) networking can corrupt memory as well as anything else. I'm not saying that's it for sure but it's worth testing. 
-Eric > > > > On Tue, Jun 3, 2008 at 7:47 PM, Eric Sandeen wrote: >> Srinivas Murthy wrote: >>> Hi >>> >>> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system. >>> <0>Assertion failure in journal_commit_transaction() at >>> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" >>> <0>------------[ cut here ]------------ >>> <2>kernel BUG at fs/jbd/commit.c:693! >>> <0>invalid opcode: 0000 [#1] >>> <0>PREEMPT SMP >>> <0>CPU: 1 >>> <0>EIP: 0060:[] Tainted: P VLI >> What's the proprietary kernel; does it happen without the tainted kernel? >> >> -Eric >> >>> <0>EFLAGS: 00010296 (2.6.23.waas #4) >>> <0>EIP is at journal_commit_transaction+0x879/0xe00 >>> <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 >>> <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 >>> <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 >>> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) >>> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 >>> 00000000 f7f63414 >>> <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 >>> c6402000 00000000 >>> <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 >>> 00000202 c70f8000 >>> <0>Call Trace: >>> <0> [] show_trace_log_lvl+0x1a/0x30 >>> <0> [] show_stack_log_lvl+0x9a/0xc0 >>> <0> [] show_registers+0x1d6/0x340 >>> <0> [] die+0x10d/0x220 >>> <0> [] do_trap+0x91/0xd0 >>> <0> [] do_invalid_op+0x89/0xa0 >>> <0> [] error_code+0x72/0x78 >>> <0> [] kjournald+0xb5/0x1f0 >>> <0> [] kthread+0x5c/0xa0 >>> <0> [] kernel_thread_helper+0x7/0x1c >>> <0> ======================= >>> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 >>> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff >>> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 >>> <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 >>> <6>SysRq : Changing Loglevel >>> <4>Loglevel set to 7 >>> >>> [0]kdb> btc >>> btc: cpu status: Currently on cpu 0 >>> Available cpus: 0-1 >>> Stack traceback for pid 1609 >>> 0xc69ce000 1609 2 1 0 R 0xc69ce1e0 *md5_resync >>> esp eip Function (args) >>> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060) >>> 0xc69e5d78 0xc028ffbe bio_alloc+0xe >>> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0) >>> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid) >>> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0, >>> 0xc69e5ea0, 0x0) >>> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00) >>> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80) >>> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid) >>> Stack traceback for pid 1684 >>> 0xc39db580 1684 2 1 1 R 0xc39db760 kjournald >>> esp eip Function (args) >>> kdb_bb: address 0xffffffff not recognised >>> Using old style backtrace, unreliable with no arguments >>> esp eip Function (args) >>> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879 >>> 0xc6549f28 0xc0227945 lock_timer_base+0x25 >>> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a >>> 0xc6549f60 0xc02c3845 kjournald+0xb5 >>> 0xc6549f88 0xc0233040 autoremove_wake_function >>> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1 >>> 0xc6549fa8 0xc0233040 autoremove_wake_function >>> [0]kdb> >>> >>> _______________________________________________ >>> Ext3-users mailing list >>> Ext3-users at redhat.com >>> https://www.redhat.com/mailman/listinfo/ext3-users >> From codevana at gmail.com Wed Jun 4 03:04:18 2008 From: codevana at gmail.com (Srinivas Murthy) Date: Tue, 3 Jun 2008 20:04:18 -0700 
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: <484604CF.10709@redhat.com> References: <4846022A.1010707@redhat.com> <484604CF.10709@redhat.com> Message-ID: Sorry. Understand. Yes, I am told it does. No way to be sure. On Tue, Jun 3, 2008 at 7:58 PM, Eric Sandeen wrote: > Srinivas Murthy wrote: >> The changes we have are in the networking part. Nothing in the fs or >> block layers. >> >> Thanks, >> _Sri >> > > Ok - and does it still happen without the taint? :) > > networking can corrupt memory as well as anything else. > > I'm not saying that's it for sure but it's worth testing. > > -Eric > >> >> >> >> On Tue, Jun 3, 2008 at 7:47 PM, Eric Sandeen wrote: >>> Srinivas Murthy wrote: >>>> Hi >>>> >>>> I have the following kernel (2.6.23) crash on a 2-cpu smp 32b x86 system. >>>> <0>Assertion failure in journal_commit_transaction() at >>>> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" >>>> <0>------------[ cut here ]------------ >>>> <2>kernel BUG at fs/jbd/commit.c:693! >>>> <0>invalid opcode: 0000 [#1] >>>> <0>PREEMPT SMP >>>> <0>CPU: 1 >>>> <0>EIP: 0060:[] Tainted: P VLI >>> What's the proprietary kernel; does it happen without the tainted kernel? >>> >>> -Eric >>> >>>> <0>EFLAGS: 00010296 (2.6.23.waas #4) >>>> <0>EIP is at journal_commit_transaction+0x879/0xe00 >>>> <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 >>>> <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 >>>> <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 >>>> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) >>>> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 >>>> 00000000 f7f63414 >>>> <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 >>>> c6402000 00000000 >>>> <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 >>>> 00000202 c70f8000 >>>> <0>Call Trace: >>>> <0> [] show_trace_log_lvl+0x1a/0x30 >>>> <0> [] show_stack_log_lvl+0x9a/0xc0 >>>> <0> [] show_registers+0x1d6/0x340 >>>> <0> [] die+0x10d/0x220 >>>> <0> [] do_trap+0x91/0xd0 >>>> <0> [] do_invalid_op+0x89/0xa0 >>>> <0> [] error_code+0x72/0x78 >>>> <0> [] kjournald+0xb5/0x1f0 >>>> <0> [] kthread+0x5c/0xa0 >>>> <0> [] kernel_thread_helper+0x7/0x1c >>>> <0> ======================= >>>> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 >>>> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff >>>> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 >>>> <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 >>>> <6>SysRq : Changing Loglevel >>>> <4>Loglevel set to 7 >>>> >>>> [0]kdb> btc >>>> btc: cpu status: Currently on cpu 0 >>>> Available cpus: 0-1 >>>> Stack traceback for pid 1609 >>>> 0xc69ce000 1609 2 1 0 R 0xc69ce1e0 *md5_resync >>>> esp eip Function (args) >>>> 0xc69e5d4c 0xc028fef3 bio_alloc_bioset+0xb3 (0x11200, invalid, 0xc70e3060) >>>> 0xc69e5d78 0xc028ffbe bio_alloc+0xe >>>> 0xc69e5d80 0xc054f6d7 r1buf_pool_alloc+0x37 (0x11200, 0xc39ca0c0) >>>> 0xc69e5da4 0xc024aff6 mempool_alloc+0x26 (0xf7e7dcc0, invalid) >>>> 0xc69e5de0 0xc0552624 sync_request+0x1f4 (0xf7f40a00, 0xa2ce80, 0x0, >>>> 0xc69e5ea0, 0x0) >>>> 0xc69e5e40 0xc056667f md_do_sync+0x4ef (0xf7f40a00) >>>> 0xc69e5f78 0xc0564f55 md_thread+0x35 (0xf7e7dc80) >>>> 0xc69e5fd0 0xc0232a5c kthread+0x5c (invalid) >>>> Stack traceback for pid 1684 >>>> 0xc39db580 1684 2 1 1 R 0xc39db760 kjournald >>>> esp eip Function (args) >>>> kdb_bb: address 0xffffffff not recognised >>>> Using old style backtrace, unreliable with no arguments 
>>>> esp eip Function (args) >>>> 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879 >>>> 0xc6549f28 0xc0227945 lock_timer_base+0x25 >>>> 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a >>>> 0xc6549f60 0xc02c3845 kjournald+0xb5 >>>> 0xc6549f88 0xc0233040 autoremove_wake_function >>>> 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1 >>>> 0xc6549fa8 0xc0233040 autoremove_wake_function >>>> [0]kdb> >>>> >>>> _______________________________________________ >>>> Ext3-users mailing list >>>> Ext3-users at redhat.com >>>> https://www.redhat.com/mailman/listinfo/ext3-users >>> > > From sandeen at redhat.com Wed Jun 4 03:06:17 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 03 Jun 2008 22:06:17 -0500 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: References: Message-ID: <484606A9.3070403@redhat.com> Srinivas Murthy wrote: > <6>EXT3-fs: mounted filesystem with ordered data mode. > <0>Assertion failure in journal_commit_transaction() at > fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" > <0>------------[ cut here ]------------ > <2>kernel BUG at fs/jbd/commit.c:693! > <0>invalid opcode: 0000 [#1] > <0>PREEMPT SMP > <0>CPU: 1 > <0>EIP: 0060:[] Tainted: P VLI > <0>EFLAGS: 00010296 (2.6.23.waas #4) > <0>EIP is at journal_commit_transaction+0x879/0xe00 > <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 > <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 > <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 > <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) > <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 > 00000000 f7f63414 > <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 > c6402000 00000000 > <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 > 00000202 c70f8000 > <0>Call Trace: > <0> [] show_trace_log_lvl+0x1a/0x30 > <0> [] show_stack_log_lvl+0x9a/0xc0 > <0> [] show_registers+0x1d6/0x340 > <0> [] die+0x10d/0x220 > <0> [] do_trap+0x91/0xd0 > <0> [] do_invalid_op+0x89/0xa0 > <0> [] error_code+0x72/0x78 > <0> [] kjournald+0xb5/0x1f0 > <0> [] kthread+0x5c/0xa0 > <0> [] kernel_thread_helper+0x7/0x1c > <0> ======================= > <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 > 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff > <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 > <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 > <6>SysRq : Changing Loglevel > <4>Loglevel set to 7 > > [0]kdb> btc > btc: cpu status: Currently on cpu 0 Also, I'd backtrace pid 1684 (kjournald) and dump the bh, see what it looks like... kdb> btp 1684 kdb> bh if i remember correctly... 
-Eric From codevana at gmail.com Wed Jun 4 03:29:51 2008 From: codevana at gmail.com (Srinivas Murthy) Date: Tue, 3 Jun 2008 20:29:51 -0700 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: <484606A9.3070403@redhat.com> References: <484606A9.3070403@redhat.com> Message-ID: [0]kdb> btp 1684 Stack traceback for pid 1684 0xc39db580 1684 2 1 1 R 0xc39db760 kjournald esp eip Function (args) kdb_bb: address 0xffffffff not recognised Using old style backtrace, unreliable with no arguments esp eip Function (args) 0xc6549eb8 0xc02c1149 journal_commit_transaction+0x879 0xc6549f28 0xc0227945 lock_timer_base+0x25 0xc6549f40 0xc0227b6a try_to_del_timer_sync+0x4a 0xc6549f60 0xc02c3845 kjournald+0xb5 0xc6549f88 0xc0233040 autoremove_wake_function 0xc6549f94 0xc062f8e1 __sched_text_start+0x1f1 0xc6549fa8 0xc0233040 autoremove_wake_function Based on this code below : 0xc02c10e3 journal_commit_transaction+0x813: jmp 0xc02c10e3 journal_commit_transaction+0x813 0xc02c10e5 journal_commit_transaction+0x815: movl $0xc0651de8,(%esp) 0xc02c10ec journal_commit_transaction+0x81c: mov $0xc0651e44,%ecx 0xc02c10f1 journal_commit_transaction+0x821: mov $0xc0651dcd,%edx 0xc02c10f6 journal_commit_transaction+0x826: mov %ecx,0x8(%esp) 0xc02c10fa journal_commit_transaction+0x82a: mov $0xc0651f8a,%esi 0xc02c10ff journal_commit_transaction+0x82f: mov $0x2bd,%ebx 0xc02c1104 journal_commit_transaction+0x834: mov %esi,0x10(%esp) 0xc02c1108 journal_commit_transaction+0x838: mov %ebx,0xc(%esp) 0xc02c110c journal_commit_transaction+0x83c: mov %edx,0x4(%esp) 0xc02c1110 journal_commit_transaction+0x840: call 0xc021efa0 printk [0]kdb> 0xc02c1115 journal_commit_transaction+0x845: ud2a 0xc02c1117 journal_commit_transaction+0x847: jmp 0xc02c1117 journal_commit_transaction+0x847 0xc02c1119 journal_commit_transaction+0x849: movl $0xc0651de8,(%esp) 0xc02c1120 journal_commit_transaction+0x850: mov $0xc0651fa0,%eax 0xc02c1125 journal_commit_transaction+0x855: mov $0xc0651dcd,%edi 0xc02c112a journal_commit_transaction+0x85a: mov %eax,0x10(%esp) 0xc02c112e journal_commit_transaction+0x85e: mov $0x2b5,%eax 0xc02c1133 journal_commit_transaction+0x863: mov %eax,0xc(%esp) 0xc02c1137 journal_commit_transaction+0x867: mov $0xc0651e44,%eax 0xc02c113c journal_commit_transaction+0x86c: mov %edi,0x4(%esp) 0xc02c1140 journal_commit_transaction+0x870: mov %eax,0x8(%esp) 0xc02c1144 journal_commit_transaction+0x874: call 0xc021efa0 printk 0xc02c1149 journal_commit_transaction+0x879: ud2a 0xc02c114b journal_commit_transaction+0x87b: jmp 0xc02c114b journal_commit_transaction+0x87b 0xc02c114d journal_commit_transaction+0x87d: mov 0x34(%ebx),%eax 0xc02c1150 journal_commit_transaction+0x880: test %eax,%eax [0]kdb> 0xc02c1152 journal_commit_transaction+0x882: jne 0xc02c11a2 journal_commit_transaction+0x8d2 0xc02c1154 journal_commit_transaction+0x884: mov 0x38(%ebx),%edx 0xc02c1157 journal_commit_transaction+0x887: test %edx,%edx 0xc02c1159 journal_commit_transaction+0x889: je 0xc02c11fd journal_commit_transaction+0x92d 0xc02c115f journal_commit_transaction+0x88f: mov 0x24(%edx),%edi 0xc02c1162 journal_commit_transaction+0x892: mov (%edi),%esi 0xc02c1164 journal_commit_transaction+0x894: mov (%esi),%eax 0xc02c1166 journal_commit_transaction+0x896: test $0x4,%al 0xc02c1168 journal_commit_transaction+0x898: jne 0xc02c11e0 journal_commit_transaction+0x910 0xc02c116a journal_commit_transaction+0x89a: call 0xc06302f0 cond_resched 0xc02c116f journal_commit_transaction+0x89f: test %eax,%eax 0xc02c1171 journal_commit_transaction+0x8a1: jne 
0xc02c1154 journal_commit_transaction+0x884 0xc02c1173 journal_commit_transaction+0x8a3: mov (%esi),%eax 0xc02c1175 journal_commit_transaction+0x8a5: test $0x1,%al 0xc02c1177 journal_commit_transaction+0x8a7: mov $0xfffffffb,%eax 0xc02c117c journal_commit_transaction+0x8ac: cmovne 0xffffff98(%ebp),%eax [0]kdb> rd eax = 0x00000096 ebx = 0xf76bcf00 ecx = 0xffffffff edx = 0xf7588ac0 esi = 0xf6c66f88 edi = 0xc0651dcd esp = 0xc6549ec4 eip = 0xc02c1149 ebp = 0xc6549f5c xss = 0xc0580068 xcs = 0x00000060 eflags = 0x00010296 xds = 0xc065007b xes = 0xc654007b origeax = 0xffffffff ®s = 0xc6549e8c and, (gdb) p &(((struct buffer_head *)0)->b_count) $1 = (atomic_t *) 0x34 I think bh is, 0xf76bcf00 but, [0]kdb> md 0xf76bcf00 0xf76bcf00 f7f63400 00701310 00000004 000001ca .4....p......... 0xf76bcf10 00000000 00000000 00000000 00000000 ................ 0xf76bcf20 00000000 c6320b98 00000000 00000000 ......2......... 0xf76bcf30 00000000 f7386498 f7386b28 00000001 .....d8.(k8..... 0xf76bcf40 00000000 00000000 00000000 00000000 ................ 0xf76bcf50 00000000 ffffefab 00000008 00000000 ................ 0xf76bcf60 f76bc4e0 00100100 00200200 f76bcf70 ..k....... .p.k. 0xf76bcf70 00000001 00000000 f88eef70 f76bcf7c ........p...|.k. [0]kdb> 0xf76bcf80 f76bcf7c f76251e0 0000000d 0011ffff |.k..Qb......... 0xf76bcf90 00000000 00000001 00000000 00000000 ................ 0xf76bcfa0 00000000 f7ae9840 f88f140c deadc0de .... at ........... 0xf76bcfb0 00000019 00000000 00000000 00000004 ................ 0xf76bcfc0 00000000 00000000 00000000 00000000 ................ 0xf76bcfd0-0xf76bcfef zero suppressed 0xf76bcff0 00000000 00000000 00000000 00000000 ................ [0]kdb> 0xf76bd000 00000000 00000000 00000000 00000000 ................ 0xf76bd010-0xf76bd06f zero suppressed 0xf76bd070 00000000 00000000 00000000 00000000 ................ Not sure I'm reading bh correctly. On Tue, Jun 3, 2008 at 8:06 PM, Eric Sandeen wrote: > Srinivas Murthy wrote: > >> <6>EXT3-fs: mounted filesystem with ordered data mode. >> <0>Assertion failure in journal_commit_transaction() at >> fs/jbd/commit.c:693: "((&bh->b_count)->counter) == 0" >> <0>------------[ cut here ]------------ >> <2>kernel BUG at fs/jbd/commit.c:693! 
>> <0>invalid opcode: 0000 [#1] >> <0>PREEMPT SMP >> <0>CPU: 1 >> <0>EIP: 0060:[] Tainted: P VLI >> <0>EFLAGS: 00010296 (2.6.23.waas #4) >> <0>EIP is at journal_commit_transaction+0x879/0xe00 >> <0>eax: 00000096 ebx: f76bcf00 ecx: ffffffff edx: f7588ac0 >> <0>esi: f6c66f88 edi: c0651dcd ebp: c6549f5c esp: c6549ec4 >> <0>ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 >> <0>Process kjournald (pid: 1684, ti=c6548000 task=c39db580 task.ti=c6548000) >> <0>Stack: c0651de8 c0651dcd c0651e44 000002b5 c0651fa0 00000000 >> 00000000 f7f63414 >> <0> f7f634dc 00000000 00000fcc f7435034 00000000 00000000 >> c6402000 00000000 >> <0> f7f63400 f7386fc0 000005d7 f77fb580 c39db580 f70bdd74 >> 00000202 c70f8000 >> <0>Call Trace: >> <0> [] show_trace_log_lvl+0x1a/0x30 >> <0> [] show_stack_log_lvl+0x9a/0xc0 >> <0> [] show_registers+0x1d6/0x340 >> <0> [] die+0x10d/0x220 >> <0> [] do_trap+0x91/0xd0 >> <0> [] do_invalid_op+0x89/0xa0 >> <0> [] error_code+0x72/0x78 >> <0> [] kjournald+0xb5/0x1f0 >> <0> [] kthread+0x5c/0xa0 >> <0> [] kernel_thread_helper+0x7/0x1c >> <0> ======================= >> <0>Code: 65 c0 b8 a0 1f 65 c0 bf cd 1d 65 c0 89 44 24 10 b8 b5 02 00 >> 00 89 44 24 0c b8 44 1e 65 c0 89 7c 24 04 89 44 24 08 e8 57 de f5 ff >> <0f> 0b eb fe 8b 43 34 85 c0 75 4e 8b 53 38 85 d2 0f 84 9e 00 00 >> <0>EIP: [] journal_commit_transaction+0x879/0xe00 SS:ESP 0068:c6549ec4 >> <6>SysRq : Changing Loglevel >> <4>Loglevel set to 7 >> >> [0]kdb> btc >> btc: cpu status: Currently on cpu 0 > > Also, I'd backtrace pid 1684 (kjournald) and dump the bh, see what it > looks like... > > kdb> btp 1684 > kdb> bh > > if i remember correctly... > > -Eric > > From sandeen at redhat.com Wed Jun 4 03:52:27 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 03 Jun 2008 22:52:27 -0500 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: References: <484606A9.3070403@redhat.com> Message-ID: <4846117B.1010206@redhat.com> Srinivas Murthy wrote: > > [0]kdb> md 0xf76bcf00 > 0xf76bcf00 f7f63400 00701310 00000004 000001ca .4....p......... > 0xf76bcf10 00000000 00000000 00000000 00000000 ................ > 0xf76bcf20 00000000 c6320b98 00000000 00000000 ......2......... > 0xf76bcf30 00000000 f7386498 f7386b28 00000001 .....d8.(k8..... ... doesn't look right ... If you hit this often enough (and since you have kdb) you could modify the assert to print the bh address first .... then it'd be easy to print out, might offer some clues. -Eric From sebastia at l00-bugdead-prods.de Wed Jun 4 07:55:00 2008 From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach) Date: Wed, 04 Jun 2008 09:55:00 +0200 Subject: problem with default mask in acls Message-ID: <20080604075501.17AF24971F@smtp.l00-bugdead-prods.de> Hi, when I copy a file to a directory, using whatever tool, it seems the behavior of the mask is wrong. user1 at host1:~> getfacl source/test1 # file: source/test1 # owner: user1 # group: grp1 user::rw- group::r-- other::r-- user1 at host1:~> getfacl target/ # file: target # owner: user1 # group: grp1 user::rwx group::--- group:grp1:rwx mask::rwx other::--- default:user::rwx default:group::--- default:group:grp1:rwx default:mask::rwx default:other::--- user1 at host1:~> cp source/test1 target/ user1 at host1:~> getfacl target/test1 # file: target/test1 # owner: user1 # group: grp1 user::rw- group::--- group:grp1:rwx #effective:r-- mask::r-- other::--- I'd expected the effective mask of the file in the destination directory to be rwx. Is there anything I'm doing wrong? I'm on a SLES10SP1 x86_64. 
Linux nfspublic 2.6.16.57-0.9-xen #1 SMP Mon Jan 21 19:55:27 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux

I guess I'm doing something wrong, but what?

thanks

sebastian

From jelledejong at powercraft.nl Fri Jun 6 18:24:49 2008
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Fri, 06 Jun 2008 20:24:49 +0200
Subject: needs help, root inode gone after usb bus reset on sata disks
In-Reply-To: <20080529212048.GI8065@mit.edu>
References: <483BCCC0.5020502@powercraft.nl> <20080527124711.GI7515@mit.edu> <483C07EE.1060905@powercraft.nl> <483D6FC5.30109@powercraft.nl> <20080528232452.GO6843@mit.edu> <483E7955.7020508@powercraft.nl> <20080529125816.GD8065@mit.edu> <483EC138.5090200@powercraft.nl> <20080529200140.GF8065@mit.edu> <483F0ECC.7030505@powercraft.nl> <20080529212048.GI8065@mit.edu>
Message-ID: <484980F1.1040604@powercraft.nl>

Theodore Tso wrote:
> On Thu, May 29, 2008 at 10:15:08PM +0200, Jelle de Jong wrote:
>> I did the following:
>>
>> debugfs -w /dev/sda1
>> debugfs: features dir_index filetype sparse_super
>> debugfs: quit
>>
>> then i run
>>
>> e2fsck -nf /dev/sda1
>>
>> to see if it still wanted to relocate inodes. This was not the case
>> anymore, however it still wanted to relocate the root inode...
>>
>> I then run:
>>
>> e2fsck -f /dev/sda1
>>
>> and manual answer yes to the question until i had to enter a lot of "y"
>> (see logs) and killed the program with ctrl-c
>
> what answers did you answer yes to? I don't have a log of your
> "e2fsck -f /dev/sda1" run, and so I can't tell what happened. The
> e2fsck -fy run you gave me was large, but information-free, since it
> just had pass #5 messages regarding adjusting accounting information.
>
> If it was just deleting the root inode (because it was corrupted), and
> creating a new root inode, that doesn't explain why all of the inodes
> disappeared, unless the inode table had somehow gotten completely
> zero'ed out
>
> At this point, what I would probably suggest is that you run
>
> e2image -r /dev/hda1 - | bzip2 > hda1.e2i.bz2
>
> ... and put it someplace where I can download it and take a look at
> what the heck happened to your filesystem.
>
> By the way, please look at the "script" command ("man script"); it is
> very handy for capturing a record of what an interactive session with
> a program like e2fsck.
>

Thanks for all the info Ted,

http://www.powercraft.nl/temp/e2image-r-sda1-v0.1.1.e2i.bz2

I did some experimenting to see if I can find some data on the disk by running the command below on an unaltered backup:

e2fsck -fy /dev/sda1 > e2fsck-fy-info-sda1-v0.1.1j.txt 2>&1

However, no files were found, so maybe something went wrong with the dd backup. I don't know if there is a way to see whether there is actual data on the disk. So for now I am giving up on recovering the data; maybe you can get a clue of what the heck happened to the file system and learn something new...

The only thing I would like to know is how to back up and restore the filesystem. (For example, I am going to set up a RAID array, but a RAID setup does not protect against this kind of file system crash.)

Thanks in advance,

Jelle

From codevana at gmail.com Sat Jun 7 01:24:56 2008
From: codevana at gmail.com (Srinivas Murthy)
Date: Fri, 6 Jun 2008 18:24:56 -0700
Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86
In-Reply-To: <4846117B.1010206@redhat.com>
References: <484606A9.3070403@redhat.com> <4846117B.1010206@redhat.com>
Message-ID:

Eric,

I got the output you asked for.
<3>journal_commit_transaction 694 c60b54d0 <4>WARNING: at fs/jbd/commit.c:695 journal_commit_transaction() <4> [] show_trace_log_lvl+0x1a/0x30 <4> [] show_trace+0x12/0x20 <4> [] dump_stack+0x16/0x20 <4> [] journal_commit_transaction+0x5cd/0xe60 <4> [] kjournald+0xb5/0x1f0 <4> [] kthread+0x5c/0xa0 [1]more> q [1]kdb> bh 0xc60b54d0 buffer_head at 0xc60b54d0 bno 3297 size 4096 dev 0x900005 count 1 state 0x8029 [Uptodate Req Mapped Private] b_data 0xf5c80000 b_page 0xc16b9000 b_this_page 0x00000000 b_private 0xf7fb05b0 b_end_io 0xc02c03a0 journal_end_buffer_io_sync [1]kdb> What do you think? Thanks. On Tue, Jun 3, 2008 at 8:52 PM, Eric Sandeen wrote: > Srinivas Murthy wrote: > >> >> [0]kdb> md 0xf76bcf00 >> 0xf76bcf00 f7f63400 00701310 00000004 000001ca .4....p......... >> 0xf76bcf10 00000000 00000000 00000000 00000000 ................ >> 0xf76bcf20 00000000 c6320b98 00000000 00000000 ......2......... >> 0xf76bcf30 00000000 f7386498 f7386b28 00000001 .....d8.(k8..... > > ... doesn't look right ... > > If you hit this often enough (and since you have kdb) you could modify > the assert to print the bh address first .... > > then it'd be easy to print out, might offer some clues. > > -Eric > From ross at biostat.ucsf.edu Sun Jun 8 05:30:38 2008 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Sat, 07 Jun 2008 22:30:38 -0700 Subject: spd_readdir.c and readdir_r Message-ID: <1212903039.7158.31.camel@corn.betterworld.us> I still haven't been able to pinpoint exactly where bacula hangs up when LD_PRELOAD is set to use spd_readdir, but I have a suspect. bacula-fd gets directory entries with readdir_r, which is a function that is not reimplemented in spd_readdir. So when bacula calls opendir it gets the shadow version, which calls the original open, read, and closedir functions. It then returns its private dir_s structure. The (unshadowed) readdir_r then tries to work with dir_s. It looks as if I (or one of you gurus?) need to implement a wrapper for readdir_r. A quick looks suggests there may be a couple of subtleties (the spd_readdir struct dir_s is allocated, and so thread safe, but it's dir entry is not; and readdir_r is expecting some "real" system data structures back and users may have problems with fake ones). Ross From sandeen at redhat.com Mon Jun 9 04:03:45 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Sun, 08 Jun 2008 23:03:45 -0500 Subject: Fwd: md_thread and kjournald race w/ raid1 on 2-way x86 In-Reply-To: References: <484606A9.3070403@redhat.com> <4846117B.1010206@redhat.com> Message-ID: <484CABA1.2040202@redhat.com> Srinivas Murthy wrote: > Eric, > > I got the output you asked for. > > <3>journal_commit_transaction 694 c60b54d0 > <4>WARNING: at fs/jbd/commit.c:695 journal_commit_transaction() > <4> [] show_trace_log_lvl+0x1a/0x30 > <4> [] show_trace+0x12/0x20 > <4> [] dump_stack+0x16/0x20 > <4> [] journal_commit_transaction+0x5cd/0xe60 > <4> [] kjournald+0xb5/0x1f0 > <4> [] kthread+0x5c/0xa0 > [1]more> q > [1]kdb> bh 0xc60b54d0 > buffer_head at 0xc60b54d0 > bno 3297 size 4096 dev 0x900005 > count 1 state 0x8029 [Uptodate Req Mapped Private] > b_data 0xf5c80000 > b_page 0xc16b9000 b_this_page 0x00000000 b_private 0xf7fb05b0 > b_end_io 0xc02c03a0 journal_end_buffer_io_sync > [1]kdb> > > What do you think? I think that it looks more like a buffer head accounting problem than a corruption problem; the rest of the buffer head looks sane... Think you could narrow down a test case for this problem? 
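(For reference, the instrumentation whose output is quoted above, the printk at fs/jbd/commit.c:694 followed by the WARNING at line 695, amounts to something along these lines. This is only a sketch against a 2.6.23-era fs/jbd/commit.c; the exact assertion macro and surrounding code differ between trees, so treat the names and line placement as approximate:

    if (atomic_read(&bh->b_count) != 0) {
            /* Print the buffer_head pointer before the assertion below
             * goes off, so it can be inspected from kdb with "bh <addr>". */
            printk(KERN_ERR "journal_commit_transaction: %d bh %p\n",
                   __LINE__, bh);
            WARN_ON(1);
    }
    J_ASSERT_BH(bh, atomic_read(&bh->b_count) == 0);

With the pointer printed this way, the kdb "bh" command can dump the full buffer_head, as shown above.)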
-Eric From ross at biostat.ucsf.edu Mon Jun 9 04:26:28 2008 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Sun, 08 Jun 2008 21:26:28 -0700 Subject: spd_readdir.c and readdir_r [new version] In-Reply-To: <1212903039.7158.31.camel@corn.betterworld.us> References: <1212903039.7158.31.camel@corn.betterworld.us> Message-ID: <1212985588.32113.13.camel@corn.betterworld.us> I've attached a modified version of Ted's spd_readdir.c that adds support for readdir_r and readdir64_r. It appears to be working (readdir64_r is the only new routine getting exercised), but should be taken as a rough cut. I also added a Makefile and a test program. It also looks as if this is giving me a huge speed improvement (at least x4) of my backups of my ext3 partitions. I'll try to report after a full and incremental backup complete, which will be a couple of days. Originally I tried taking the threading code from the system implementations of the original readdir_r. When that didn't work (since it was designed to be part of a libc build) I switched to pthreads. I don't know if recursive locking is essential; I activated it at one point while trying to get things to work. For big directories this code could use quite a lot of memory. It allows an optional max size, beyond which it reverts to the original system calls. I wonder if instead taking large directories in chunks would preserve much of the speedup while putting a bound on memory use. Ross Boylan -------------- next part -------------- A non-text attachment was scrubbed... Name: RBspd_dir.tgz Type: application/x-compressed-tar Size: 889 bytes Desc: not available URL: From santi at usansolo.net Mon Jun 9 17:33:48 2008 From: santi at usansolo.net (santi at usansolo.net) Date: Mon, 09 Jun 2008 19:33:48 +0200 Subject: 2GB memory limit running fsck on a +6TB device Message-ID: <13126f2f5661d30187551469b3793fa7@usansolo.net> Dear Srs, That's the scenario: +6TB device on a 3ware 9550SX RAID controller, running Debian Etch 32bits, with 2.6.25.4 kernel, and defaults e2fsprogs version, "1.39+1.40-WIP-2006.11.14+dfsg-2etch1". Running "tune2fs" returns that filesystem is in EXT3_ERROR_FS state, "clean with errors": # tune2fs -l /dev/sda4 tune2fs 1.40.10 (21-May-2008) Filesystem volume name: Last mounted on: Filesystem UUID: 7701b70e-f776-417b-bf31-3693dba56f86 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal dir_index filetype needs_recovery sparse_super large_file Default mount options: (none) Filesystem state: clean with errors Errors behavior: Continue Filesystem OS type: Linux Inode count: 792576000 Block count: 1585146848 It's a backup storage server, with more than 113 million files, this's the output of "df -i": # df -i /backup/ Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda4 792576000 113385959 679190041 15% /backup Running fsck.ext3 or fsck.ext2 I get: # fsck.ext3 /dev/sda4 e2fsck 1.40.10 (21-May-2008) Adding dirhash hint to filesystem. /dev/sda4 contains a file system with errors, check forced. 
Pass 1: Checking inodes, blocks, and sizes Error allocating directory block array: Memory allocation failed e2fsck: aborted With some straces: ================================================================================ gettimeofday({1213032482, 940738}, NULL) = 0 getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 16001}, ...}) = 0 write(1, "Pass 1: Checking ", 17Pass 1: Checking ) = 17 write(1, "inode", 5inode) = 5 write(1, "s, ", 3s, ) = 3 write(1, "block", 5block) = 5 write(1, "s, and sizes\n", 13s, and sizes ) = 13 mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x404fa000 mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x46376000 mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4c1f2000 mmap2(NULL, 198148096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x5206e000 mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x5dd66000 mmap2(NULL, 748892160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x63be2000 mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) brk(0x77488000) = 0x80ab000 mmap2(NULL, 1866375168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x90615000 munmap(0x90615000, 962560) = 0 munmap(0x90800000, 86016) = 0 mprotect(0x90700000, 135168, PROT_READ|PROT_WRITE) = 0 mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) ================================================================================ Appears that fsck is trying to use more than 2GB memory to store inode table relationship. System has 4GB of physical RAM and 4GB of swap, is there anyway to limit the memory used by fsck or any solution to check this filesystem? Running fsck with a 64bit LiveCD will solve the problem? I also tried with last e2fsprogs stable release 1.40.10, getting the same error :-/ Regards, -- Santi Saez From tytso at mit.edu Mon Jun 9 21:33:20 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 9 Jun 2008 17:33:20 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <13126f2f5661d30187551469b3793fa7@usansolo.net> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> Message-ID: <20080609213320.GB26759@mit.edu> On Mon, Jun 09, 2008 at 07:33:48PM +0200, santi at usansolo.net wrote: > It's a backup storage server, with more than 113 million files, this's the > output of "df -i": > > Appears that fsck is trying to use more than 2GB memory to store inode > table relationship. System has 4GB of physical RAM and 4GB of swap, is > there anyway to limit the memory used by fsck or any solution to check this > filesystem? Running fsck with a 64bit LiveCD will solve the problem? Yes, running with a 64-bit Live CD is one way to solve the problem. If you are using e2fsprogs 1.40.10, there is another solution that may help. Create an /etc/e2fsck.conf file with the following contents: [scratch_files] directory = /var/cache/e2fsck ...and then make sure /var/cache/e2fsck exists by running the command "mkdir /var/cache/e2fsck". This will cause e2fsck to store certain data structures which grow large with backup servers that have a vast number of hard-linked files in /var/cache/e2fsck instead of in memory. 
This will slow down e2fsck by approximately 25%, but for large filesystems where you couldn't otherwise get e2fsck to complete because you're exhausting the 2GB VM per-process limitation for 32-bit systems, it should allow you to run through to completion. - Ted From adilger at sun.com Mon Jun 9 21:50:32 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 09 Jun 2008 15:50:32 -0600 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <13126f2f5661d30187551469b3793fa7@usansolo.net> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> Message-ID: <20080609215031.GC3726@webber.adilger.int> On Jun 09, 2008 19:33 +0200, santi at usansolo.net wrote: > That's the scenario: +6TB device on a 3ware 9550SX RAID controller, running > Debian Etch 32bits, with 2.6.25.4 kernel, and defaults e2fsprogs version, > "1.39+1.40-WIP-2006.11.14+dfsg-2etch1". > > Running "tune2fs" returns that filesystem is in EXT3_ERROR_FS state, "clean > with errors": > > # tune2fs -l /dev/sda4 > tune2fs 1.40.10 (21-May-2008) > Filesystem volume name: > Last mounted on: > Filesystem UUID: 7701b70e-f776-417b-bf31-3693dba56f86 > Filesystem magic number: 0xEF53 > Filesystem revision #: 1 (dynamic) > Filesystem features: has_journal dir_index filetype needs_recovery > sparse_super large_file > Default mount options: (none) > Filesystem state: clean with errors > Errors behavior: Continue > Filesystem OS type: Linux > Inode count: 792576000 > Block count: 1585146848 > > It's a backup storage server, with more than 113 million files, this's the > output of "df -i": > > # df -i /backup/ > Filesystem Inodes IUsed IFree IUse% Mounted on > /dev/sda4 792576000 113385959 679190041 15% /backup > > > Running fsck.ext3 or fsck.ext2 I get: > > # fsck.ext3 /dev/sda4 > e2fsck 1.40.10 (21-May-2008) > Adding dirhash hint to filesystem. > > /dev/sda4 contains a file system with errors, check forced. > Pass 1: Checking inodes, blocks, and sizes I recall that e2fsck allocates on the order of 3 * block_count / 8 bytes, and 5 * inode_count / 8 bytes, so in your case this is about: (5 * 1585146848 + 3 * 792576000) / 8 = 1287932780 bytes = 1.2GB at a minimum, but my estimates might be incorrect. > mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, > 0) = 0x404fa000 Judging by the return values of these functions, this is a 32-bit system, and it is entirely possible that you are exceeding the per-process memory allocation limit. > mmap2(NULL, 748892160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, > 0) = 0x63be2000 > mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, > -1, 0) = -1 ENOMEM (Cannot allocate memory) Hmm, it seems a bit excessive to allocate 1.8GB in a single chunk. > Error allocating directory block array: Memory allocation failed > e2fsck: aborted This message is a bit tricky to nail down because it doesn't exist anywhere in the code directly. It is encoded into "e2fsck abbreviations", and the expansion that is normally in the corresponding comment is different. It is PR_1_ALLOCATE_DBCOUNT returned from the call chain: ext2fs_init_dblist-> make_dblist-> ext2fs_get_num_dirs() which is counting the number of directories in the filesystem, and allocating two 12-byte array element for each one. This implies you have 77M directories in your filesystem, or an average of only 10 files per directory? > Appears that fsck is trying to use more than 2GB memory to store inode > table relationship. 
System has 4GB of physical RAM and 4GB of swap, is > there anyway to limit the memory used by fsck or any solution to check this > filesystem?

I don't know offhand how important the dblist structure is, so I'm not
sure if there is a way to reduce the memory usage for it. I believe
that in low-memory situations it is possible to use tdb in newer versions
of e2fsck for the dblist, but I don't know much of the details.

> Running fsck with a 64bit LiveCD will solve the problem?

Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
for e2fsck and be able to check the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From carlo at alinoe.com Mon Jun 9 22:08:56 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Tue, 10 Jun 2008 00:08:56 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609215031.GC3726@webber.adilger.int>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609215031.GC3726@webber.adilger.int>
Message-ID: <20080609220856.GA21530@alinoe.com>

On Mon, Jun 09, 2008 at 03:50:32PM -0600, Andreas Dilger wrote:
> > Running fsck with a 64bit LiveCD will solve the problem?
>
> Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
> for e2fsck and be able to check the filesystem.

We had a similar problem with ext3grep. You have to realize that every
mmap uses memory address space, even if it's a map to disk. Therefore,
on a 32-bit machine, if the total of all normal allocations plus all
simultaneous mmaps exceeds 4GB then you "run out of memory", even if
-say- only 1 GB is really allocated and >3GB of the disk is mmap-ed.

In that case a 64-bit machine would solve the problem because then all
ram (2 GB I read in the Subject) can be used for normal allocations
while any disk mmap has cazillions of address space left for itself.

--
Carlo Wood

From tytso at mit.edu Mon Jun 9 22:37:36 2008
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 9 Jun 2008 18:37:36 -0400
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609215031.GC3726@webber.adilger.int>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609215031.GC3726@webber.adilger.int>
Message-ID: <20080609223736.GA7069@mit.edu>

On Mon, Jun 09, 2008 at 03:50:32PM -0600, Andreas Dilger wrote:
> This message is a bit tricky to nail down because it doesn't exist anywhere
> in the code directly. It is encoded into "e2fsck abbreviations", and
> the expansion that is normally in the corresponding comment is different.
> It is PR_1_ALLOCATE_DBCOUNT returned from the call chain:
> ext2fs_init_dblist->
> make_dblist->
> ext2fs_get_num_dirs()
>
> which is counting the number of directories in the filesystem, and allocating
> two 12-byte array element for each one. This implies you have 77M directories
> in your filesystem, or an average of only 10 files per directory?

There are a number of backup solutions that use hardlinks to conserve
space between incremental snapshots. So yeah, with these workloads
you'll see something like 80-85M inodes, of which 77M-odd will be
directories.

When you combine the vast number of directories used by these filesystems
with the fact that e2fsck tries to optimize memory use by observing that
on most normal filesystems most files have an n_link count of 1 (which is
NOT true on these filesystems used for backups), e2fsck's tricks to
optimize for speed by caching information to avoid re-reading it from
disk end up costing a large amount of memory.
> I don't know offhand how important the dblist structure is, so I'm not > sure if there is a way to reduce the memory usage for it. I believe > that in low-memory situations it is possible to use tdb in newer versions > of e2fsck for the dblist, but I don't know much of the details. Yep, please see [scratch_files] section in e2fsck.conf. It is described in the e2fsck.conf(5) man page. - Ted From adilger at sun.com Mon Jun 9 22:57:59 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 09 Jun 2008 16:57:59 -0600 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080609223736.GA7069@mit.edu> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609215031.GC3726@webber.adilger.int> <20080609223736.GA7069@mit.edu> Message-ID: <20080609225759.GG3726@webber.adilger.int> On Jun 09, 2008 18:37 -0400, Theodore Ts'o wrote: > On Mon, Jun 09, 2008 at 03:50:32PM -0600, Andreas Dilger wrote: > > I don't know offhand how important the dblist structure is, so I'm not > > sure if there is a way to reduce the memory usage for it. I believe > > that in low-memory situations it is possible to use tdb in newer versions > > of e2fsck for the dblist, but I don't know much of the details. > > Yep, please see [scratch_files] section in e2fsck.conf. It is > described in the e2fsck.conf(5) man page. Hmm, maybe if the ext2fs_init_dblist() function returns PR_1_ALLOCATE_DBCOUNT this should be a user-fixable problem that asks if the user wants to use an on-disk tdb file in /var/tmp, and if that is a "no" then point them at the right section in /etc/e2fsck.conf? I don't think it is reasonable to default to using /tmp, because it might be a RAM-backed filesystem, and I suspect in most cases the root filesystem will not run out of memory in this way... Even if it fails because /var/tmp is read-only, or too small, it is no worse off than it is today. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From gregt at maths.otago.ac.nz Tue Jun 10 03:36:52 2008 From: gregt at maths.otago.ac.nz (Greg Trounson) Date: Tue, 10 Jun 2008 15:36:52 +1200 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080609215031.GC3726@webber.adilger.int> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609215031.GC3726@webber.adilger.int> Message-ID: <484DF6D4.4050700@maths.otago.ac.nz> Andreas Dilger wrote: > On Jun 09, 2008 19:33 +0200, santi at usansolo.net wrote: ... >> Running fsck with a 64bit LiveCD will solve the problem? > > Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM > for e2fsck and be able to check the filesystem. Couldn't you achieve the same thing just by enabling PAE on your 32-bit kernel? Greg From tytso at mit.edu Tue Jun 10 13:18:28 2008 From: tytso at mit.edu (Theodore Tso) Date: Tue, 10 Jun 2008 09:18:28 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <484DF6D4.4050700@maths.otago.ac.nz> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609215031.GC3726@webber.adilger.int> <484DF6D4.4050700@maths.otago.ac.nz> Message-ID: <20080610131828.GC18768@mit.edu> On Tue, Jun 10, 2008 at 03:36:52PM +1200, Greg Trounson wrote: > Andreas Dilger wrote: >> On Jun 09, 2008 19:33 +0200, santi at usansolo.net wrote: > ... >>> Running fsck with a 64bit LiveCD will solve the problem? >> Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM >> for e2fsck and be able to check the filesystem. 
>
> Couldn't you achieve the same thing just by enabling PAE on your 32-bit
> kernel?

No, that doesn't increase the amount of address space available to the
user process, which is the limitation here. You can have 16 GB of
physical memory, but 2**32 is still 4GB, and the kernel needs address
space, so that means userspace will have at most 3GB of space to itself.

- Ted

From santi at usansolo.net Tue Jun 10 15:34:35 2008
From: santi at usansolo.net (santi at usansolo.net)
Date: Tue, 10 Jun 2008 17:34:35 +0200
Subject: 2GB memory limit running fsck on a +6TB device
In-Reply-To: <20080609213320.GB26759@mit.edu>
References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu>
Message-ID: <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net>

On Mon, 9 Jun 2008 17:33:20 -0400, Theodore Tso wrote:
> If you are using e2fsprogs 1.40.10, there is another solution that may
> help. Create an /etc/e2fsck.conf file with the following contents:
>
> [scratch_files]
> directory = /var/cache/e2fsck
(..)
> This will cause e2fsck to store certain data structures which grow
> large with backup servers that have a vast number of hard-linked files
> in /var/cache/e2fsck instead of in memory. This will slow down e2fsck
> by approximately 25%, but for large filesystems where you couldn't
> otherwise get e2fsck to complete because you're exhausting the 2GB VM
> per-process limitation for 32-bit systems, it should allow you to run
> through to completion.

I'm trying with fsck.ext3 v1.40.8, backported from Lenny's package to Etch, instead of v1.40.10, because we have the same scenario in all backup servers running BackupPC and the package must be distributed. If needed, we can run tests with the latest version ;-)

fsck.ext3 started 4 hours ago and is still in "Pass 1: Checking inodes, blocks, and sizes"; is that normal, given that the filesystem has more than 113 million inodes?

I will send more info as Ted requested in "Call for testers w/ using BackupPC" [1], but for now this is the scenario:

- fsck.ext3 is using more than 2GB of memory and no swap; the server has 4GB of physical RAM + 2GB of swap. This is the output of "pmap -d" with the memory map:

# pmap -d 7014
7014: fsck.ext3 -y /dev/sda4
Address Kbytes Mode Offset Device Mapping
(..)
242fd000 1834768 rw--- 00000000242fd000 000:00000 [ anon ]
942c2000 582604 rw--- 00000000942c2000 000:00000 [ anon ]
(..)

All the output is available at: http://pastebin.com/f67115de2

- Files in "/var/cache/e2fsck" appear to grow very slowly, roughly 300KB per hour; this is their current size:

# ls -lh /var/cache/e2fsck/
total 170M
-rw------- 1 root root 76M 2008-06-10 17:24 7701b70e-f776-417b-bf31-3693dba56f86-dirinfo-VkmFXP
-rw------- 1 root root 95M 2008-06-10 17:24 7701b70e-f776-417b-bf31-3693dba56f86-icount-YO08bu

- fsck is using 100% of one CPU (it's a dual-processor motherboard); strace output is available at: http://pastebin.com/f68389cce

- More info:

* Kernel 2.6.25.4, i686 arch on a Debian Etch box.
* Storage: 3ware 9550SXU-16ML, 5.91 TB in a RAID-5 with 14 500GB SATA disks (ST3500630AS), 64kB stripe size (array is in optimal state) Thanks all for the advices :-) [1] http://www.redhat.com/archives/ext3-users/2007-April/msg00017.html -- Santi Saez From tytso at mit.edu Tue Jun 10 18:38:55 2008 From: tytso at mit.edu (Theodore Tso) Date: Tue, 10 Jun 2008 14:38:55 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> Message-ID: <20080610183855.GB8397@mit.edu> On Tue, Jun 10, 2008 at 05:34:35PM +0200, santi at usansolo.net wrote: > > fsck.ext3 started 4 hours ago, and still is in "Pass 1: Checking inodes, > blocks, and sizes", that's normal knowing that the filesystem has +113 > million inodes? > It depends on a lot of things; how big are your files on average, the speed of your hard drive, and whether /var/cache/e2fsck is on the same disk as the partition which you are checking, or on a separate spindle (guess which is better :-). It's always a good idea when running e2fsck (aka fsck.ext3) directly and/or on a terminal/console to include as command-line options "-C 0". This will display a progress bar, so you can gauge how it is doing. (0 through 70% is pass 1, which requires scanning the inode table and following all of the indirect blocks.) - Ted From santi at usansolo.net Tue Jun 10 22:24:27 2008 From: santi at usansolo.net (Santi Saez) Date: Wed, 11 Jun 2008 00:24:27 +0200 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080610183855.GB8397@mit.edu> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> Message-ID: <484EFF1B.1010104@usansolo.net> Theodore Tso escribi?: > It's always a good idea when running e2fsck (aka fsck.ext3) directly > and/or on a terminal/console to include as command-line options "-C > 0". This will display a progress bar, so you can gauge how it is > doing. (0 through 70% is pass 1, which requires scanning the inode > table and following all of the indirect blocks.) > Thanks for the tip! :-) '/var/cache/e2fsck' is in the _same_ disk, perhaps mounting via iSCSI, NFS, etc.. this directory will improve, we will work with this in other test. I have enabled progress bar sending SIGUSR1 signal to the process, and it's still on 2% ;-( "scratch_files" directory size is now 251M, it has grown 81MB in the last 7 hours: # ls -lh /var/cache/e2fsck/ total 251M -rw------- 1 root root 112M 2008-06-11 00:09 7701b70e-f776-417b-bf31-3693dba56f86-dirinfo-VkmFXP -rw------- 1 root root 139M 2008-06-11 00:09 7701b70e-f776-417b-bf31-3693dba56f86-icount-YO08bu strace's output is the same, and also memory usage is the same. I will let the process more time.. but I think it will take too much time to complete, at least to finish the pass 1, perhaps more than 50 hours? According that now is only on 2% of the process + take 12 hours to complete, and pass 1 is from 0% through 70%.. is there any other solution to solve this? ext4 will solve this problem? I have not tested ext4 already, but I have read that it will improve fast filesytem checking... 
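For completeness, the two ways of getting the progress display mentioned
above look roughly like this (a sketch -- the device and process names are
just this box's):

# fsck.ext3 -y -C 0 /dev/sda4           <- start with an in-terminal progress bar
# kill -USR1 $(pidof fsck.ext3)         <- turn progress info on for a running check
# kill -USR2 $(pidof fsck.ext3)         <- turn it off again
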
Regards, -- Santi Saez From tytso at mit.edu Tue Jun 10 23:01:24 2008 From: tytso at mit.edu (Theodore Tso) Date: Tue, 10 Jun 2008 19:01:24 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <484EFF1B.1010104@usansolo.net> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> Message-ID: <20080610230124.GH8397@mit.edu> On Wed, Jun 11, 2008 at 12:24:27AM +0200, Santi Saez wrote: > > '/var/cache/e2fsck' is in the _same_ disk, perhaps mounting via iSCSI, NFS, > etc.. this directory will improve, we will work with this in other test. > > I have enabled progress bar sending SIGUSR1 signal to the process, and it's > still on 2% ;-( > > "scratch_files" directory size is now 251M, it has grown 81MB in the last 7 > hours: hmm..... can you send me the output of dumpe2fs /dev/sdXX? You can run that command while e2fsck is running, since it's read-only. I'm curious exactly how big the filesystem is, and how many directories are in the first part of the filesystem. How big is the filesystem(s) that you are backing up via BackupPC, in terms of size (megabytes) and files (number of inodes)? And how many days of incremental backups are you keeping? Also, how often do files change? Can you give a rough estimate of how many files get modified per backup cycle? Thanks, - Ted From santi at usansolo.net Tue Jun 10 23:48:35 2008 From: santi at usansolo.net (Santi Saez) Date: Wed, 11 Jun 2008 01:48:35 +0200 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080610230124.GH8397@mit.edu> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> Message-ID: <484F12D3.2050201@usansolo.net> Theodore Tso escribi?: > hmm..... can you send me the output of dumpe2fs /dev/sdXX? You can > run that command while e2fsck is running, since it's read-only. I'm > curious exactly how big the filesystem is, and how many directories > are in the first part of the filesystem. > Upsss... 
dumpe2fs takes about 3 minutes to complete and generates about 133MB output file: dumpe2fs 1.40.8 (13-Mar-2008) Filesystem volume name: Last mounted on: Filesystem UUID: 7701b70e-f776-417b-bf31-3693dba56f86 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal dir_index filetype sparse_super large_file Default mount options: (none) Filesystem state: clean with errors Errors behavior: Continue Filesystem OS type: Linux Inode count: 792576000 Block count: 1585146848 Reserved block count: 0 Free blocks: 913341561 Free inodes: 678201512 First block: 0 Block size: 4096 Fragment size: 4096 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 16384 Inode blocks per group: 512 Filesystem created: Mon Nov 13 10:12:49 2006 Last mount time: Mon Jun 9 19:37:12 2008 Last write time: Tue Jun 10 12:18:25 2008 Mount count: 37 Maximum mount count: -1 Last checked: Mon Nov 13 10:12:49 2006 Check interval: 0 () Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 Default directory hash: tea Directory Hash Seed: afabe3f6-4405-44f4-934b-76c23945db7b Journal backup: inode blocks Journal size: 32M Some example output from group 0 to 5 is available at: http://pastebin.com/f5341d121 > How big is the filesystem(s) that you are backing up via BackupPC, in > terms of size (megabytes) and files (number of inodes)? And how many > days of incremental backups are you keeping? Also, how often do files > change? Can you give a rough estimate of how many files get modified > per backup cycle? > Where are backing up several servers, near about 15 in this case, with 60-80GB data size to backup in each server and +2-3 millon inodes, with 15 day incrementals. I think near about 2-3% of the files changes each day, but I will ask for more info to the backup administrator. I have found and old doc with some build info for this server, the partition was formated with: # mkfs.ext3 -b 4096 -j -m 0 -O dir_index /dev/sda4 # tune2fs -c 0 -i 0 /dev/sda4 # mount -o data=writeback,noatime,nodiratime,commit=60 /dev/sda4 /backup I'm going to fetch more info about BackupPC and backup cycles, thanks Ted!! Regards, -- Santi Saez From tytso at mit.edu Wed Jun 11 02:18:00 2008 From: tytso at mit.edu (Theodore Tso) Date: Tue, 10 Jun 2008 22:18:00 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <484F12D3.2050201@usansolo.net> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net> Message-ID: <20080611021759.GI8397@mit.edu> On Wed, Jun 11, 2008 at 01:48:35AM +0200, Santi Saez wrote: > Theodore Tso escribi?: >> hmm..... can you send me the output of dumpe2fs /dev/sdXX? You can >> run that command while e2fsck is running, since it's read-only. I'm >> curious exactly how big the filesystem is, and how many directories >> are in the first part of the filesystem. >> > Upsss... dumpe2fs takes about 3 minutes to complete and generates about > 133MB output file: True, but it compresses well. :-) And the aside from the first part of the dumpe2fs, the part that I was most interested could have been summarized by simply doing a "grep directories dumpe2fs.out". 
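(Side note: if all you need is the superblock summary -- inode/block counts,
free counts, filesystem state -- then "dumpe2fs -h" skips the per-group
descriptor dump entirely and avoids the 133MB of output; the grep just
mentioned is only needed to total up the per-group directory counts. Device
name below is just this box's.)

# dumpe2fs -h /dev/sda4
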
But simply looking at your dumpe2fs, and take an average from the first 6 block groups which you included in the pastebin, I can extrapolate and guess that you have about 63 million directories, out of approximately 114 million total inodes (so about 51 million regular files, nearly all of which have hard link counts > 1). Unfortunately, BackupPC blows out of the water all of our memory reduction hueristics. I estimate you need something like 2.6GB to 3GB of memory just for these data structures alone. (Not to mention 94 MB for each inode bitmap, and 188 MB for each block bitmap.) The good news is that 4GB of memory should do you --- just. (I'd probably put in a bit more physical memory just to be on the safe side, or enable swap before running e2fsck). The bad news is you really, REALLY need a 64-bit kernel on your system. Because /var/cache/e2fsck is on the same disk spindle as the filesystem you are checking, you're probably getting killed on seeks. Moving /var/cache/e2fsck to another disk partition will help (or better yet, battery backed memory device), but the best thing you can do is get a 64-bit kernel and not need to use the auxiliary storage in the first place. As far as what to advice to give you, why are you running e2fsck? Was this an advisory thing caused by the mount count and/or length of time between filesystem checks? Or do you have real reason to believe the filesystem may be corrupt? - Ted From ext3 at kalucki.com Wed Jun 11 05:18:46 2008 From: ext3 at kalucki.com (John Kalucki) Date: Tue, 10 Jun 2008 22:18:46 -0700 Subject: Poor Performance WhenNumber of Files > 1M Message-ID: <484F6036.8020900@kalucki.com> I am seeing similar problems to Sean McCauliff (2007-08-02) using ext3. I have a simple test that times file creations in a hashed directory structure. File creation time inexorably increases as the number of files in the filesystem increases. Altering variables can change the absolute performance, but I always see the steady performance degradation. All of the following have no material effect on the steady drop in performance: File length (1k, 4k, 16k) Directory depth (5, 10, 15) Average & Max files per directory (10, 20, 100) Single or multi-threaded test Moving test directory to a new name on same filesystem, restarting test. Directory hash RAID10 vs. simple disk Linux version (RHE, Ubuntu) System memory (32gig, 2gig) Syncing after each close Free space Partition Age (old, perhaps fragmented, a bit dirty, new fs) Performance seems to always map directly to the number of files in the ext3 filesystem. After some initial run-fast time, perhaps once dirty pages begin to be written aggressively, for every 5,000 files added, my files created per second tends to drop by about one. So, depending on the variables, say with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop, then more slowly drop to ~300 files/sec at perhaps 1 million files, then see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc. As you'd expect, there isn't much CPU utilization, other than iowait, and some kjournald activity. Is this a known limitation of ext3? Is expecting to write to O(10^6)-O(10^7) files in something approaching constant time expecting too much from a filesystem? What, exactly, am I stressing to cause this unbounded performance degradation? Thanks, -John Kalucki ext3 at kalucki.com ---- Hi all, I plan on having about 100M files totaling about 8.5TiBytes. 
To see how ext3 would perform with large numbers of files I've written a test program which creates a configurable number of files into a configurable number of directories, reads from those files, lists them and then deletes them. Even up to 1M files ext3 seems to perform well and scale linearly; the time to execute the program on 1M files is about double the time it takes it to execute on .5M files. But past 1M files it seems to have n^2 scalability. Test details appear below. Looking at the various options for ext3 nothing jumps out as the obvious one to use to improve performance. Any recommendations? Thanks! Sean From sandeen at redhat.com Wed Jun 11 05:33:20 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 11 Jun 2008 00:33:20 -0500 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <484F6036.8020900@kalucki.com> References: <484F6036.8020900@kalucki.com> Message-ID: <484F63A0.50606@redhat.com> John Kalucki wrote: > Performance seems to always map directly to the number of files in the > ext3 filesystem. > > After some initial run-fast time, perhaps once dirty pages begin to be > written aggressively, for every 5,000 files added, my files created per > second tends to drop by about one. So, depending on the variables, say > with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop, > then more slowly drop to ~300 files/sec at perhaps 1 million files, then > see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc. > > As you'd expect, there isn't much CPU utilization, other than iowait, > and some kjournald activity. > > Is this a known limitation of ext3? Is expecting to write to > O(10^6)-O(10^7) files in something approaching constant time expecting > too much from a filesystem? What, exactly, am I stressing to cause this > unbounded performance degradation? I think this is a linear search through the block groups for the new inode allocation, which always starts at the parent directory's block group; and starts over from there each time. See find_group_other(). So if the parent's group is full and so are the next 1000 block groups, it will search 1000 groups and find space in the 1001st. On the next inode allocation it will re-search(!) those 1000 groups, and again find space in the 1001st. And so on. Until the 1001st is full, and then it'll search 1001 groups and find space in the 1002nd... etc (If I'm remembering/reading correctly, but this does jive with what you see.). I've toyed with keeping track (in the parent's inode) where the last successful child allocation happened, and start the search there. I'm a bit leery of how this might age, though... plus I'm not sure if it should be on-disk or just in memory.... But this behavior clearly needs some help. I should probably just get it sent out for comment. -Eric From santi at usansolo.net Wed Jun 11 08:14:45 2008 From: santi at usansolo.net (santi at usansolo.net) Date: Wed, 11 Jun 2008 10:14:45 +0200 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080611021759.GI8397@mit.edu> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net> <20080611021759.GI8397@mit.edu> Message-ID: <00bf4cac93645dc74c04229696a20f11@usansolo.net> On Tue, 10 Jun 2008 22:18:00 -0400, Theodore Tso wrote: > True, but it compresses well. 
:-) And the aside from the first part > of the dumpe2fs, the part that I was most interested could have been > summarized by simply doing a "grep directories dumpe2fs.out". :D "grep directories" is available at: http://santi.usansolo.net/tmp/dumpe2fs_directories.txt.gz (317K) Full "dumpe2fs" output compressed is 34M and available at: http://santi.usansolo.net/tmp/dumpe2fs.txt.gz > But simply looking at your dumpe2fs, and take an average from the > first 6 block groups which you included in the pastebin, I can > extrapolate and guess that you have about 63 million directories, out > of approximately 114 million total inodes (so about 51 million regular > files, nearly all of which have hard link counts > 1). # grep directories dumpe2fs.txt | awk '{sum += $7} END {print sum}' 78283294 > BackupPC blows out of the water all of our memory reduction > hueristics. I estimate you need something like 2.6GB to 3GB of memory > just for these data structures alone. (Not to mention 94 MB for each > inode bitmap, and 188 MB for each block bitmap.) The good news is > that 4GB of memory should do you --- just. (I'd probably put in a bit > more physical memory just to be on the safe side, or enable swap > before running e2fsck). The bad news is you really, REALLY need a > 64-bit kernel on your system. Unfortunately, I have killed the process, in 21 hours only 2.5% of the fsck is completed ;-( 'scratch_files' directory has grown to 311M =================================================================== # time fsck -y /dev/sda4 fsck 1.40.8 (13-Mar-2008) e2fsck 1.40.8 (13-Mar-2008) Adding dirhash hint to filesystem. /dev/sda4 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes /dev/sda4: e2fsck canceled. /dev/sda4: ***** FILE SYSTEM WAS MODIFIED ***** /dev/sda4: ********** WARNING: Filesystem still has errors ********** real 1303m19.306s user 1079m58.898s sys 217m10.130s =================================================================== > Because /var/cache/e2fsck is on the same disk spindle as the > filesystem you are checking, you're probably getting killed on seeks. > Moving /var/cache/e2fsck to another disk partition will help (or > better yet, battery backed memory device), but the best thing you can > do is get a 64-bit kernel and not need to use the auxiliary storage in > the first place. I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o size=2048M", but appears that will take a long time to complete too.. so the next test will be with a 64-bit LiveCD :) > As far as what to advice to give you, why are you running e2fsck? Was > this an advisory thing caused by the mount count and/or length of time > between filesystem checks? Or do you have real reason to believe the > filesystem may be corrupt? No, it's not related with mount count and/or length of time between filesystem checks. When booting we get this error/warning: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended EXT3 FS on sda4, internal journal EXT3-fs: mounted filesystem with writeback data mode. And "tune2fs" returns that ext3 is in "clean with errors" state.. so, we think that completing a full fsck process is a good idea; what means in this case "clean with errors" state, running a fsck is not needed? Thanks again for all the help and advices!! 
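(For reference, the state line comes straight from the superblock and can be
re-read at any time without a full dumpe2fs run; /dev/sda4 is this box's
device:)

# tune2fs -l /dev/sda4 | grep -i state
Filesystem state:        clean with errors
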
-- Santi Saez From santi at usansolo.net Wed Jun 11 11:51:17 2008 From: santi at usansolo.net (santi at usansolo.net) Date: Wed, 11 Jun 2008 13:51:17 +0200 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <00bf4cac93645dc74c04229696a20f11@usansolo.net> References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net> <20080611021759.GI8397@mit.edu> <00bf4cac93645dc74c04229696a20f11@usansolo.net> Message-ID: On Wed, 11 Jun 2008 10:14:45 +0200, wrote: >> Because /var/cache/e2fsck is on the same disk spindle as the >> filesystem you are checking, you're probably getting killed on seeks. >> Moving /var/cache/e2fsck to another disk partition will help (or >> better yet, battery backed memory device), but the best thing you can >> do is get a 64-bit kernel and not need to use the auxiliary storage in >> the first place. > > I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o > size=2048M", but appears that will take a long time to complete too.. so > the next test will be with a 64-bit LiveCD :) Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. 3 times faster ;-) Making some fast test with e2fsck v1.40.10 appears that is a bit faster than v1.40.8, last version improves this feature? Anyway, finally I had to cancel the process.. # ./e2fsck -nfvttC0 /dev/sda4 e2fsck 1.40.10 (21-May-2008) Pass 1: Checking inodes, blocks, and sizes /dev/sda4: e2fsck canceled. /dev/sda4: ********** WARNING: Filesystem still has errors ********** Memory used: 260k/581088k (183k/78k) Regards, -- Santi Saez From adilger at sun.com Wed Jun 11 14:59:08 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 11 Jun 2008 08:59:08 -0600 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: References: <13126f2f5661d30187551469b3793fa7@usansolo.net> <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net> <20080611021759.GI8397@mit.edu> <00bf4cac93645dc74c04229696a20f11@usansolo.net> Message-ID: <20080611145908.GP3726@webber.adilger.int> On Jun 11, 2008 13:51 +0200, santi at usansolo.net wrote: > On Wed, 11 Jun 2008 10:14:45 +0200, wrote: > > >> Because /var/cache/e2fsck is on the same disk spindle as the > >> filesystem you are checking, you're probably getting killed on seeks. > >> Moving /var/cache/e2fsck to another disk partition will help (or > >> better yet, battery backed memory device), but the best thing you can > >> do is get a 64-bit kernel and not need to use the auxiliary storage in > >> the first place. > > > > I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o > > size=2048M", but appears that will take a long time to complete too.. so > > the next test will be with a 64-bit LiveCD :) > > Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. 3 > times faster ;-) ...but, isn't the problem that you don't have enough RAM? Using tdb+ramfs isn't going to be faster than using the RAM directly. I suspect that the only way you are going to check this filesystem efficiently is to boot a 64-bit kernel (even just from a rescue disk), set up some swap just in case, and run e2fsck from there. Cheers, Andreas -- Andreas Dilger Sr. 
Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From bryan at kadzban.is-a-geek.net Wed Jun 11 16:49:04 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Wed, 11 Jun 2008 12:49:04 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080611145908.GP3726@webber.adilger.int> References: <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net> <20080611021759.GI8397@mit.edu> <00bf4cac93645dc74c04229696a20f11@usansolo.net> <20080611145908.GP3726@webber.adilger.int> Message-ID: <20080611164904.GA10071@kadzban.is-a-geek.net> On Wed, Jun 11, 2008 at 08:59:08AM -0600, Andreas Dilger wrote: > On Jun 11, 2008 13:51 +0200, santi at usansolo.net wrote: > > On Wed, 11 Jun 2008 10:14:45 +0200, wrote: > > > > >> Moving /var/cache/e2fsck to another disk partition will help (or > > >> better yet, battery backed memory device), but the best thing you > > >> can do is get a 64-bit kernel and not need to use the auxiliary > > >> storage in the first place. > > > > > > I'm trying a fast test with "mount tmpfs /var/cache/e2fsck -t tmpfs > > > -o size=2048M", but appears that will take a long time to complete > > > too.. so the next test will be with a 64-bit LiveCD :) > > > > Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. > > 3 times faster ;-) > > ...but, isn't the problem that you don't have enough RAM? Using > tdb+ramfs isn't going to be faster than using the RAM directly. It won't be faster, no, but it will be faster than tdb-on-disk, and much faster than tdb on the same disk as the one that's being checked. And it *might* allow e2fsck to allocate all the virtual memory that it needs, depending on how the tmpfs driver works. If tmpfs uses the same VA space as e2fsck and the rest of the kernel, then it probably won't help. But if tmpfs can use a different pool somehow (whether that's because the kernel uses a different set of pagetables, or whatever), then it might. > I suspect that the only way you are going to check this filesystem > efficiently is to boot a 64-bit kernel (even just from a rescue disk), > set up some swap just in case, and run e2fsck from there. And try to run a 64-bit e2fsck binary, too. The virtual address space usage estimate that someone (Ted?) came up with earlier in this thread was close to 4G, which means that even with a 64-bit kernel, a 32-bit e2fsck binary might still run out of virtual address space. (It will need to map lots of disk, plus any real RAM usage, plus itself and any libraries. That last bit *might* push it over 4G, depending on how accurate the estimate of 4G turns out to be.) The easiest way to do this is probably run the e2fsck from the LiveCD itself; don't try to run the 32-bit version that the system has installed. That version *might* work, but it'll be tight; a 64-bit version that can use 40-odd bits in its virtual addresses (44? 48? I think it depends on the exact CPU model -- and the kernel, of course) will have a *lot* more headroom. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From ext3 at kalucki.com Wed Jun 11 22:04:17 2008 From: ext3 at kalucki.com (John Kalucki) Date: Wed, 11 Jun 2008 15:04:17 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <484F63A0.50606@redhat.com> References: <484F6036.8020900@kalucki.com> <484F63A0.50606@redhat.com> Message-ID: <48504BE1.2000104@kalucki.com> Eric Sandeen wrote: > John Kalucki wrote: > > >> Performance seems to always map directly to the number of files in the >> ext3 filesystem. >> >> After some initial run-fast time, perhaps once dirty pages begin to be >> written aggressively, for every 5,000 files added, my files created per >> second tends to drop by about one. So, depending on the variables, say >> with 6 RAID10 spindles, I might start at ~700 files/sec, quickly drop, >> then more slowly drop to ~300 files/sec at perhaps 1 million files, then >> see 299 files/sec for the next 5,000 creations, 298 files/sec, etc. etc. >> >> As you'd expect, there isn't much CPU utilization, other than iowait, >> and some kjournald activity. >> >> Is this a known limitation of ext3? Is expecting to write to >> O(10^6)-O(10^7) files in something approaching constant time expecting >> too much from a filesystem? What, exactly, am I stressing to cause this >> unbounded performance degradation? >> > > I think this is a linear search through the block groups for the new > inode allocation, which always starts at the parent directory's block > group; and starts over from there each time. See find_group_other(). > > So if the parent's group is full and so are the next 1000 block groups, > it will search 1000 groups and find space in the 1001st. On the next > inode allocation it will re-search(!) those 1000 groups, and again find > space in the 1001st. And so on. Until the 1001st is full, and then > it'll search 1001 groups and find space in the 1002nd... etc (If I'm > remembering/reading correctly, but this does jive with what you see.). > > I've toyed with keeping track (in the parent's inode) where the last > successful child allocation happened, and start the search there. I'm a > bit leery of how this might age, though... plus I'm not sure if it > should be on-disk or just in memory.... But this behavior clearly needs > some help. I should probably just get it sent out for comment. > > -Eric > This is the best explanation I've read so far. There does indeed appear to be some O(n) behavior that is exacerbated by having many directories in the working set (not open, just referenced often) and perhaps moderate fragmentation. I read up on ext3 inode allocation, and the attempt to place files in the same cylinder group as directories. Trying to work with this system, I started on a fresh filesystem and flattened the directory depth to just 4 levels, I've managed to boost performance greatly, and flatten the degradation curve quite a bit. I can get to about 2,800,000 files before performance starts to slowly drop from a nearly constant ~1,700 file/sec. At ~4,000,000 files, I see about ~1,500 files/sec, and afterwards I start to see the old behavior of greater decline. By 5,500,000 files, it's down to 1,230 files/sec. I've used 9% of the space and 8% of the inodes at this point. Changing journal size and /proc/sys/fs/file-max had no effect. Even dir_index had only marginal impact, as my directories have only about 300 files each. 
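To make the shape of the test concrete, here is a rough sketch of the kind of
creation benchmark described at the top of this thread (not the actual test
program -- it assumes a scratch ext3 filesystem mounted at /mnt/test and a
GNU userland):

#!/bin/sh
# Spread small files over a 64x64 hashed directory tree and report the
# creation rate every 5000 files, to watch for the gradual slowdown.
MNT=/mnt/test          # scratch ext3 mount point (assumption)
TOTAL=1000000
BATCH=5000

# pre-create the directory tree so mkdir is not part of the timing
for a in $(seq 0 63); do
    for b in $(seq 0 63); do
        mkdir -p "$MNT/$a/$b"
    done
done

i=0
start=$(date +%s)
while [ "$i" -lt "$TOTAL" ]; do
    d1=$(( i % 64 ))            # round-robin over directories, so many
    d2=$(( (i / 64) % 64 ))     # directories stay "active" at once
    head -c 4096 /dev/zero > "$MNT/$d1/$d2/f$i"
    i=$(( i + 1 ))
    if [ $(( i % BATCH )) -eq 0 ]; then
        now=$(date +%s)
        elapsed=$(( now - start ))
        [ "$elapsed" -eq 0 ] && elapsed=1
        echo "$i files: ~$(( BATCH / elapsed )) files/sec over the last batch"
        start=$now
    fi
done
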
I think the biggest factor to making performance nearly linear is the number of directories in the working set. If this grows too large, the linear allocation behavior is magnified, and performance drops. My version of RHEL doesn't seem to allow tweaking of directory cache behavior, perhaps a deprecated feature from the 2.4 days. If I discover anything else, I'll be sure to update this thread. -John From ext3 at kalucki.com Wed Jun 11 22:25:17 2008 From: ext3 at kalucki.com (John Kalucki) Date: Wed, 11 Jun 2008 15:25:17 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <484FD343.1060308@redhat.com> References: <484F6036.8020900@kalucki.com> <484F63A0.50606@redhat.com> <484FD343.1060308@redhat.com> Message-ID: <485050CD.8070403@kalucki.com> Ric Wheeler wrote: > Eric Sandeen wrote: >> John Kalucki wrote: >> >> >>> Performance seems to always map directly to the number of files in >>> the ext3 filesystem. >>> >>> After some initial run-fast time, perhaps once dirty pages begin to >>> be written aggressively, for every 5,000 files added, my files >>> created per second tends to drop by about one. So, depending on the >>> variables, say with 6 RAID10 spindles, I might start at ~700 >>> files/sec, quickly drop, then more slowly drop to ~300 files/sec at >>> perhaps 1 million files, then see 299 files/sec for the next 5,000 >>> creations, 298 files/sec, etc. etc. >>> >>> As you'd expect, there isn't much CPU utilization, other than >>> iowait, and some kjournald activity. >>> >>> Is this a known limitation of ext3? Is expecting to write to >>> O(10^6)-O(10^7) files in something approaching constant time >>> expecting too much from a filesystem? What, exactly, am I stressing >>> to cause this unbounded performance degradation? >>> >> >> I think this is a linear search through the block groups for the new >> inode allocation, which always starts at the parent directory's block >> group; and starts over from there each time. See find_group_other(). >> >> So if the parent's group is full and so are the next 1000 block groups, >> it will search 1000 groups and find space in the 1001st. On the next >> inode allocation it will re-search(!) those 1000 groups, and again find >> space in the 1001st. And so on. Until the 1001st is full, and then >> it'll search 1001 groups and find space in the 1002nd... etc (If I'm >> remembering/reading correctly, but this does jive with what you see.). >> >> I've toyed with keeping track (in the parent's inode) where the last >> successful child allocation happened, and start the search there. I'm a >> bit leery of how this might age, though... plus I'm not sure if it >> should be on-disk or just in memory.... But this behavior clearly needs >> some help. I should probably just get it sent out for comment. >> >> -Eric >> >> > I run a very similar test, but normally run with a synchronous write > work load (i.e., fsync before close). In my testing, you will see a > slow but gradual decline in the files/sec. For example, on a 1TB S-ATA > drive, the latest test run started off at a rate of 22 files/sec (each > file is 40k) and is currently chugging along at a bit over 17 > files/sec when it has hit 2.8 million files in one directory. I am > using the ext3 run to get a baseline for a similar run of xfs and btrfs. > > One other random tuning thought - you can help by writing into > separate directories, but you will need to make sure that you don't > produce a random write pattern when you select your target > subdirectory. 
I think that the use case mentioned using a hashed > directory structure which is fine, but you want to hash in a way that > writes into a shared subdirectory for some period of time (say get a > rotation of every X files or Y seconds). Easiest way to do this is to > use a GUID with a time stamp and hash on the time stamp bits. > > Note that there is a multi-threaded performance bug in ext3 (Josef > Bacik had looked at fixing this) which throttles writes/sec down to > around 230 when you do synchronous transactions so you might be > hitting that as well. > > ric Unfortunately, I don't have the opportunity to limit the directories. My application is taking random-ish data and organizing it into logical groups for subsequent quick reading. But I did take your suggestion into account and it contains what seems to be the important nugget -- too many active directories makes a bad situation worse. But still, my test reaches a steady state of active directories pretty quickly -- or so I'd like to think. The performance does indeed continue to creep downwards. I'm doing everything single-threaded. Introducing a second thread seems to be an immediate disaster, even though I'm stripped across 3 disks. Unfortunate. Perhaps moving the journal to another filesystem would allow better multi-threaded throughput, but I'm not sure that this is important to me. xfs, zfs, btrfs, and reiser could be attractive for my use-case. Thanks for your response, John From tytso at mit.edu Thu Jun 12 05:24:29 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 12 Jun 2008 01:24:29 -0400 Subject: 2GB memory limit running fsck on a +6TB device In-Reply-To: <20080611145908.GP3726@webber.adilger.int> References: <20080609213320.GB26759@mit.edu> <7f8a3e70ccd659f304ede5f067ff46c7@usansolo.net> <20080610183855.GB8397@mit.edu> <484EFF1B.1010104@usansolo.net> <20080610230124.GH8397@mit.edu> <484F12D3.2050201@usansolo.net> <20080611021759.GI8397@mit.edu> <00bf4cac93645dc74c04229696a20f11@usansolo.net> <20080611145908.GP3726@webber.adilger.int> Message-ID: <20080612052429.GA18229@mit.edu> On Wed, Jun 11, 2008 at 08:59:08AM -0600, Andreas Dilger wrote: > > Note that putting '/var/cache/e2fsck' in a memory filesystem is aprox. 3 > > times faster ;-) > > ...but, isn't the problem that you don't have enough RAM? Using tdb+ramfs > isn't going to be faster than using the RAM directly. Tmpfs is swap backed, if swap has been configured. So it can help. Another possibility is to use a statically linked e2fsck, since the shared libraries chew up a lot of VM address space. But in this particular case, it probably wouldn't be enough. I think the best thing to do is this case to use a 64-bit kernel and a 64-bit compiled e2fsck binary. - Ted From ross at biostat.ucsf.edu Mon Jun 16 03:46:21 2008 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Sun, 15 Jun 2008 20:46:21 -0700 Subject: spd_readdir.c and readdir_r [real new version] In-Reply-To: <1212985588.32113.13.camel@corn.betterworld.us> References: <1212903039.7158.31.camel@corn.betterworld.us> <1212985588.32113.13.camel@corn.betterworld.us> Message-ID: <1213587981.8578.189.camel@corn.betterworld.us> My previous attachment had only a link for the main file; the current one should have the real thing. For the full backup, using the preload library changed the backup time from over 35 hours to 22 hours for a full backup. 
The full backup got much slower as it progressed; my guess is something other than the preload library (perhaps the snapshotting itself, bacula, or postgresql) accounts for that. The percentage change for incremental backups, which involve relatively more time scanning, is larger: from 3 hours to under .5 hours. There's no obvious speedup for the jobs involving Reiser filesystems. All in all, a big win. Thanks to everyone for your help, and especially to Ted for the original code. Ross Boylan On Sun, 2008-06-08 at 21:26 -0700, Ross Boylan wrote: > I've attached a modified version of Ted's spd_readdir.c that adds > support for readdir_r and readdir64_r. It appears to be working > (readdir64_r is the only new routine getting exercised), but should be > taken as a rough cut. I also added a Makefile and a test program. > > It also looks as if this is giving me a huge speed improvement (at least > x4) of my backups of my ext3 partitions. I'll try to report after a > full and incremental backup complete, which will be a couple of days. > > Originally I tried taking the threading code from the system > implementations of the original readdir_r. When that didn't work (since > it was designed to be part of a libc build) I switched to pthreads. I > don't know if recursive locking is essential; I activated it at one > point while trying to get things to work. > > For big directories this code could use quite a lot of memory. It > allows an optional max size, beyond which it reverts to the original > system calls. I wonder if instead taking large directories in chunks > would preserve much of the speedup while putting a bound on memory use. > > Ross Boylan > -------------- next part -------------- A non-text attachment was scrubbed... Name: RBspd_dir.tgz Type: application/x-compressed-tar Size: 3147 bytes Desc: not available URL: From magawake at gmail.com Thu Jun 19 00:05:57 2008 From: magawake at gmail.com (Mag Gam) Date: Wed, 18 Jun 2008 20:05:57 -0400 Subject: stride Message-ID: <1cbd6f830806181705s4acc3817x409cb4ce5f5cb9bb@mail.gmail.com> I am trying to understand the stride option for ext3 . If I am using a Hardware RAID (3ware) with 6 disks and I decide to go with RAID 5 with stripe of 128KB (default on my controller) and no spare. By reading documentation I should do 128/4 as my stride size when creating the file system. I am not understanding how this number works and what exactly stride does. Can someone care to explain this to me? TIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From magawake at gmail.com Thu Jun 19 00:14:29 2008 From: magawake at gmail.com (Mag Gam) Date: Wed, 18 Jun 2008 20:14:29 -0400 Subject: stride Message-ID: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> I am trying to understand the stride setting for ext3 . If I am using a Hardware RAID (3ware) with 6 disks and I decide to go with RAID 5 with stripe of 128KB (default on my controller) and no spare. By reading documentation I should do 128/4 as my stride size when creating the file system. I am not understanding how this number works and what exactly stride does. Can someone care to explain this to me? TIA -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adilger at sun.com Thu Jun 19 05:47:50 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 18 Jun 2008 23:47:50 -0600 Subject: stride In-Reply-To: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> Message-ID: <20080619054750.GO3726@webber.adilger.int> On Jun 18, 2008 20:14 -0400, Mag Gam wrote: > If I am using a Hardware RAID (3ware) with 6 disks and I decide to go with > RAID 5 with stripe of 128KB (default on my controller) and no spare. > By reading documentation I should do 128/4 as my stride size when creating > the file system. I am not understanding how this number works and what > exactly stride does. Can someone care to explain this to me? The "stride" option changes the location of some of the filesystem metadata so that it isn't all located on the same disk. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From magawake at gmail.com Thu Jun 19 10:21:24 2008 From: magawake at gmail.com (Mag Gam) Date: Thu, 19 Jun 2008 06:21:24 -0400 Subject: stride In-Reply-To: <20080619054750.GO3726@webber.adilger.int> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> <20080619054750.GO3726@webber.adilger.int> Message-ID: <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> ok, in a way its like a stripe? I though when you do a stripe you put the metadata on number of disks too. How is that different? Is there a diagram I can refer to? TIA On Thu, Jun 19, 2008 at 1:47 AM, Andreas Dilger wrote: > On Jun 18, 2008 20:14 -0400, Mag Gam wrote: > > If I am using a Hardware RAID (3ware) with 6 disks and I decide to go > with > > RAID 5 with stripe of 128KB (default on my controller) and no spare. > > By reading documentation I should do 128/4 as my stride size when > creating > > the file system. I am not understanding how this number works and what > > exactly stride does. Can someone care to explain this to me? > > The "stride" option changes the location of some of the filesystem metadata > so that it isn't all located on the same disk. > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tytso at mit.edu Thu Jun 19 11:42:44 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 19 Jun 2008 07:42:44 -0400 Subject: stride In-Reply-To: <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> <20080619054750.GO3726@webber.adilger.int> <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> Message-ID: <20080619114244.GD11516@mit.edu> On Thu, Jun 19, 2008 at 06:21:24AM -0400, Mag Gam wrote: > ok, in a way its like a stripe? I though when you do a stripe you put the > metadata on number of disks too. How is that different? Is there a diagram I > can refer to? Yes, which is why the mke2fs man page states: stride= Configure the filesystem for a RAID array with filesystem blocks per stripe. So if the size of a stripe on each a disk is 64k, and you are using a 4k filesystem blocksize, then 64k/4k == 16, and that would be an "ideal" stride size, in that for each successive block group, the inode and block bitmap would increased by an offset of 16 blocks from the beginning of the block group. 
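To put numbers on the array that started this thread (3ware RAID-5, 6 disks
so 5 data disks, 128KB chunk, 4KB filesystem blocks), the stride would be
128KB/4KB = 32. A sketch, with the device name as a placeholder:

# mke2fs -j -b 4096 -E stride=32 /dev/sdb1

(Newer e2fsprogs also understands a full stripe-width extended option --
stride times the number of data disks, 160 here -- which, as noted later in
this thread, went upstream around 1.40.7.)
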
The reason for doing this is to avoid problems where the block bitmap ends up on the same disk for every single block group. The classic case where this would happen is if you have a 5 disks in a RAID 5 configuration, which means with 4 disks per stripe, and 8192 blocks in a blockgroup, then if the block bitmap is always at the same offset from the beginning of the block group, one disk will get all of the block bitmaps, and that ends up being a major hot spot problem for the hard drive. As it turns out, if you use 4 disks in a RAID 5 configuration, or 6 disks in a RAID 5 configuration, this problem doesn't arise at all, and you don't need to use the stride option. And in most cases, simply using a stride=1, that is actually enough to make sure that each block and inode bitmaps will get forced onto successively different disks. With ext4's flex_bg enhancement, the need to specify stride option of RAID arrays will also go away. - Ted From magawake at gmail.com Fri Jun 20 01:17:45 2008 From: magawake at gmail.com (Mag Gam) Date: Thu, 19 Jun 2008 21:17:45 -0400 Subject: stride In-Reply-To: <20080619114244.GD11516@mit.edu> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> <20080619054750.GO3726@webber.adilger.int> <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> <20080619114244.GD11516@mit.edu> Message-ID: <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com> What happens if you use a hardware raid, should the stride option be considered? It seems you are referring to software raid, correct? TIA On Thu, Jun 19, 2008 at 7:42 AM, Theodore Tso wrote: > On Thu, Jun 19, 2008 at 06:21:24AM -0400, Mag Gam wrote: > > ok, in a way its like a stripe? I though when you do a stripe you put the > > metadata on number of disks too. How is that different? Is there a > diagram I > > can refer to? > > Yes, which is why the mke2fs man page states: > > stride= > Configure the filesystem for a RAID array with > filesystem blocks per stripe. > > So if the size of a stripe on each a disk is 64k, and you are using a > 4k filesystem blocksize, then 64k/4k == 16, and that would be an > "ideal" stride size, in that for each successive block group, the > inode and block bitmap would increased by an offset of 16 blocks from > the beginning of the block group. > > The reason for doing this is to avoid problems where the block bitmap > ends up on the same disk for every single block group. The classic > case where this would happen is if you have a 5 disks in a RAID 5 > configuration, which means with 4 disks per stripe, and 8192 blocks in > a blockgroup, then if the block bitmap is always at the same offset > from the beginning of the block group, one disk will get all of the > block bitmaps, and that ends up being a major hot spot problem for the > hard drive. > > As it turns out, if you use 4 disks in a RAID 5 configuration, or 6 > disks in a RAID 5 configuration, this problem doesn't arise at all, > and you don't need to use the stride option. And in most cases, > simply using a stride=1, that is actually enough to make sure that > each block and inode bitmaps will get forced onto successively > different disks. > > With ext4's flex_bg enhancement, the need to specify stride option of > RAID arrays will also go away. > > - Ted > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tytso at mit.edu Fri Jun 20 02:08:47 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 19 Jun 2008 22:08:47 -0400 Subject: stride In-Reply-To: <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> <20080619054750.GO3726@webber.adilger.int> <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> <20080619114244.GD11516@mit.edu> <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com> Message-ID: <20080620020847.GE9119@mit.edu> On Thu, Jun 19, 2008 at 09:17:45PM -0400, Mag Gam wrote: > What happens if you use a hardware raid, should the stride option be > considered? It seems you are referring to software raid, correct? It doesn't matter whethre it is hardware or software raid. What matters is the *geometry* of the RAID array. i.e., how many filesystem blocks are in an individual disk's stripe, and how many disks are in use (minus how many parity disks are in use). This information may be somewhat more hidden in a hardware raid array, but it is possible to extract this information, and most hardware raid arrays will allow you to configure these parameters as well, to varying degrees of flexibility. - Ted From magawake at gmail.com Fri Jun 20 10:21:37 2008 From: magawake at gmail.com (Mag Gam) Date: Fri, 20 Jun 2008 06:21:37 -0400 Subject: stride In-Reply-To: <20080620020847.GE9119@mit.edu> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> <20080619054750.GO3726@webber.adilger.int> <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> <20080619114244.GD11516@mit.edu> <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com> <20080620020847.GE9119@mit.edu> Message-ID: <1cbd6f830806200321r18deb81cx5089f21520b1a838@mail.gmail.com> Ted, This is the type of information I was looking for. No seems to explain this well. Also, on the same topic. For a very large filesystem ie, 3TB, should I consider anything special, something like -O dir_index? I am looking for peek performance. TIA On Thu, Jun 19, 2008 at 10:08 PM, Theodore Tso wrote: > On Thu, Jun 19, 2008 at 09:17:45PM -0400, Mag Gam wrote: > > What happens if you use a hardware raid, should the stride option be > > considered? It seems you are referring to software raid, correct? > > It doesn't matter whethre it is hardware or software raid. What > matters is the *geometry* of the RAID array. i.e., how many > filesystem blocks are in an individual disk's stripe, and how many > disks are in use (minus how many parity disks are in use). This > information may be somewhat more hidden in a hardware raid array, but > it is possible to extract this information, and most hardware raid > arrays will allow you to configure these parameters as well, to > varying degrees of flexibility. > > - Ted > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lists at nerdbynature.de Sun Jun 22 00:34:47 2008 From: lists at nerdbynature.de (Christian Kujau) Date: Sun, 22 Jun 2008 02:34:47 +0200 (CEST) Subject: stride In-Reply-To: <1cbd6f830806200321r18deb81cx5089f21520b1a838@mail.gmail.com> References: <1cbd6f830806181714q7112605dm198cb752956956aa@mail.gmail.com> <20080619054750.GO3726@webber.adilger.int> <1cbd6f830806190321j2e77ee83k469844a20e13461f@mail.gmail.com> <20080619114244.GD11516@mit.edu> <1cbd6f830806191817y24e32e9bh2e174d77a3f1541c@mail.gmail.com> <20080620020847.GE9119@mit.edu> <1cbd6f830806200321r18deb81cx5089f21520b1a838@mail.gmail.com> Message-ID: On Fri, 20 Jun 2008, Mag Gam wrote: > consider anything special, something like -O dir_index? I am looking for > peek performance. Depends on how many files, directories, small/big files, reads/writes...etc. There are various benchmarks and tuning hints for ext3 around, but if you want peak performance, you're better off testing *your* application with different mkfs/mount options and see what's best for *you*. my 2 cents, C. -- BOFH excuse #391: We already sent around a notice about that. From magawake at gmail.com Sun Jun 22 02:03:03 2008 From: magawake at gmail.com (Mag Gam) Date: Sat, 21 Jun 2008 22:03:03 -0400 Subject: indexing symbolic links Message-ID: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> Is there a way to index symbolic links in ext3? For example, I want to keep track of all symbolic links on the filesystem (soft mainly). I think I would have to write a wrapper around ln to keep it in a database, but I was wondering if anyone has done something similar to this. TIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex at alex.org.uk Sun Jun 22 08:18:51 2008 From: alex at alex.org.uk (Alex Bligh) Date: Sun, 22 Jun 2008 09:18:51 +0100 Subject: indexing symbolic links In-Reply-To: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> Message-ID: --On 21 June 2008 22:03:03 -0400 Mag Gam wrote: > Is there a way to index symbolic links in ext3? For example, I want to > keep track of all symbolic links on the filesystem (soft mainly). I think > I would have to write a wrapper around ln to keep it in a database, but I > was wondering if anyone has done something similar to this. How about find [mount point] -type l -x -print Wrapping ln won't do the job completely as (a) it won't track the links being removed (e.g. via rm), and (b) it won't track links being created by programs other than ln which use the library or the system call directly. When you say "mainly soft", remember EVERY file /is/ a hard link. Just some files have more than one. Look at the "-links" option to find, which is easy enough for normal files though you will have to do a bit of thinking re hard linked directories, "." and "..". Alex From magawake at gmail.com Sun Jun 22 13:12:26 2008 From: magawake at gmail.com (Mag Gam) Date: Sun, 22 Jun 2008 09:12:26 -0400 Subject: indexing symbolic links In-Reply-To: References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> Message-ID: <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> Find or ls I can check for symbolic links, but the file system is very large. About 250GB and I have several of them. I was wondering if ext3 kept track of these things, apparently it does not. 
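(The closest thing I can see to an index is just rebuilding one periodically;
a sketch assuming GNU find, with the mount point and output file made up:)

# find /backup -xdev -type l -printf '%p -> %l\n' > /var/lib/symlink-index.txt
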
At my university, we have physical storage in a filesystem, and we assign professors and students space by doing a symbolic link. Basically I want to keep track of physical storage with virtual/logical storage. Thats why I ask :-) TIA On Sun, Jun 22, 2008 at 4:18 AM, Alex Bligh wrote: > > > --On 21 June 2008 22:03:03 -0400 Mag Gam wrote: > > Is there a way to index symbolic links in ext3? For example, I want to >> keep track of all symbolic links on the filesystem (soft mainly). I think >> I would have to write a wrapper around ln to keep it in a database, but I >> was wondering if anyone has done something similar to this. >> > > How about > find [mount point] -type l -x -print > > Wrapping ln won't do the job completely as (a) it won't track the links > being removed (e.g. via rm), and (b) it won't track links being created > by programs other than ln which use the library or the system call > directly. > > When you say "mainly soft", remember EVERY file /is/ a hard link. Just > some files have more than one. Look at the "-links" option to find, which > is easy enough for normal files though you will have to do a bit of > thinking > re hard linked directories, "." and "..". > > Alex > -------------- next part -------------- An HTML attachment was scrubbed... URL: From darkonc at gmail.com Sun Jun 22 16:05:15 2008 From: darkonc at gmail.com (Stephen Samuel) Date: Sun, 22 Jun 2008 09:05:15 -0700 Subject: indexing symbolic links In-Reply-To: <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> Message-ID: <6cd50f9f0806220905y339f4a11hd2ad9a2d7a7a65c3@mail.gmail.com> If you're only counting when YOU create and remove links, then you could put a hook and count from there. (without depending on anything within ext3) If, on the other hand, you're depending on when ANYBODY creates or removes a link (hard or soft), then you have a good bit more work to do. The only way that I can think of to do that would be to put a link into the ext3 driver -- but you wouldn't just have to log the symlink calls. you'd also have to track things like renames (in-directory vs cross-directory vs cross-filesystem) and unlinks (rm) Given that it sounds like you're doing symlinks and the target files aren't actually being owned by the person in question, it doesn't sound like the quota system would do the job for you, so you're probably going to need tro either do some kernel hacking, or write a batch job that runs regularly that does the information collection for you. 2008/6/22 Mag Gam : > Find or ls I can check for symbolic links, but the file system is very > large. About 250GB and I have several of them. > I was wondering if ext3 kept track of these things, apparently it does > not. > > At my university, we have physical storage in a filesystem, and we assign > professors and students space by doing a symbolic link. Basically I want to > keep track of physical storage with virtual/logical storage. Thats why I ask > :-) > > TIA -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From magawake at gmail.com Sun Jun 22 18:25:16 2008 From: magawake at gmail.com (Mag Gam) Date: Sun, 22 Jun 2008 14:25:16 -0400 Subject: indexing symbolic links In-Reply-To: <6cd50f9f0806220905y339f4a11hd2ad9a2d7a7a65c3@mail.gmail.com> References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> <6cd50f9f0806220905y339f4a11hd2ad9a2d7a7a65c3@mail.gmail.com> Message-ID: <1cbd6f830806221125j75b628e1l66b6150793a649fe@mail.gmail.com> wow, i didn't think about renames and all. I am not a strong C programmer so I don't think hacking the kernel is an option :-( I bet there is more to this also... Thanks for your thoughts. On Sun, Jun 22, 2008 at 12:05 PM, Stephen Samuel wrote: > If you're only counting when YOU create and remove links, then you could > put a hook and count from there. (without depending on anything within ext3) > If, on the other hand, you're depending on when ANYBODY creates or removes > a link (hard or soft), then you have a good bit more work to do. The only > way that I can think of to do that would be to put a link into the ext3 > driver -- but you wouldn't just have to log the symlink calls. you'd also > have to track things like renames (in-directory vs cross-directory vs > cross-filesystem) and unlinks (rm) > > Given that it sounds like you're doing symlinks and the target files aren't > actually being owned by the person in question, it doesn't sound like the > quota system would do the job for you, so you're probably going to need tro > either do some kernel hacking, or write a batch job that runs regularly that > does the information collection for you. > > 2008/6/22 Mag Gam : > >> Find or ls I can check for symbolic links, but the file system is very >> large. About 250GB and I have several of them. >> I was wondering if ext3 kept track of these things, apparently it does >> not. >> >> At my university, we have physical storage in a filesystem, and we assign >> professors and students space by doing a symbolic link. Basically I want to >> keep track of physical storage with virtual/logical storage. Thats why I ask >> :-) >> >> TIA > > > -- > Stephen Samuel http://www.bcgreen.com > 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex at alex.org.uk Sun Jun 22 19:04:17 2008 From: alex at alex.org.uk (Alex Bligh) Date: Sun, 22 Jun 2008 20:04:17 +0100 Subject: indexing symbolic links In-Reply-To: <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> Message-ID: <61CF57D9DB48898E17DC2EF7@Ximines.local> --On 22 June 2008 09:12:26 -0400 Mag Gam wrote: > At my university, we have physical storage in a filesystem, and we assign > professors and students space by doing a symbolic link. Basically I want > to keep track of physical storage with virtual/logical storage. Thats why > I ask :-) If you want to track space usage, I suggest you track it using quota or similar. "man quota" will give you a start. 
Alex From magawake at gmail.com Sun Jun 22 20:37:59 2008 From: magawake at gmail.com (Mag Gam) Date: Sun, 22 Jun 2008 16:37:59 -0400 Subject: indexing symbolic links In-Reply-To: <61CF57D9DB48898E17DC2EF7@Ximines.local> References: <1cbd6f830806211903i4cc02814gc5517934e3952694@mail.gmail.com> <1cbd6f830806220612k1e3126e5t2c91a1164321c9e5@mail.gmail.com> <61CF57D9DB48898E17DC2EF7@Ximines.local> Message-ID: <1cbd6f830806221337y5dbc5173qbf4e7222b3fa9f67@mail.gmail.com> Unfortunately, tracking space wasn't me goal. I want to keep track of my symbolic links :-) On Sun, Jun 22, 2008 at 3:04 PM, Alex Bligh wrote: > > > --On 22 June 2008 09:12:26 -0400 Mag Gam wrote: > > At my university, we have physical storage in a filesystem, and we assign >> professors and students space by doing a symbolic link. Basically I want >> to keep track of physical storage with virtual/logical storage. Thats why >> I ask :-) >> > > If you want to track space usage, I suggest you track it using quota > or similar. "man quota" will give you a start. > > Alex > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rjackson at mason.gmu.edu Tue Jun 24 12:30:29 2008 From: rjackson at mason.gmu.edu (Richard Jackson) Date: Tue, 24 Jun 2008 08:30:29 -0400 (EDT) Subject: stride (fwd) Message-ID: <200806241230.m5OCUTqq004576@mason.gmu.edu> Two things; 1. Most likely I missed it but I could not find how to report the stride setting for a ext3 filesystem. I do not see stride mentioned in the man pages for dumpe2fs and tune2fs nor in the dumpe2fs report. 2. It has been pointed out the mke2fs man page description for stride needs improvement. Andreas Dilger in a post last year, http://osdir.com/ml/file-systems.ext3.user/2007-06/msg00003.html, mentioned a patch was submitted. I assume to address the mke2fs man page. If this is not the case then I suggest adding something similar to Ted's or Andreas' descriptions to replace the current stride mke2fs man page. If nothing else change from stride= Configure the filesystem for a RAID array with filesystem blocks per stripe. to stride= The number of filesystem blocks on a single disk. The purpose is to spread the filesystem metadata across the disks. For example, if the RAID chunk/segment size is 64KB and the filesystem block size is 4KB, then the stride size is 16 (64KB/4KB). These types of explanations are more helpful than something like... -f fragment-size Specify the size of fragments in bytes. taken from the mke2fs man pages. As you can see the explanation adds very little value. The stride explanation simply seems wrong. Richard Forwarded message: > On Thu, Jun 19, 2008 at 06:21:24AM -0400, Mag Gam wrote: > > ok, in a way its like a stripe? I though when you do a stripe you put the > > metadata on number of disks too. How is that different? Is there a diagram I > > can refer to? > > Yes, which is why the mke2fs man page states: > > stride= > Configure the filesystem for a RAID array with > filesystem blocks per stripe. > > So if the size of a stripe on each a disk is 64k, and you are using a > 4k filesystem blocksize, then 64k/4k == 16, and that would be an > "ideal" stride size, in that for each successive block group, the > inode and block bitmap would increased by an offset of 16 blocks from > the beginning of the block group. > > The reason for doing this is to avoid problems where the block bitmap > ends up on the same disk for every single block group. 
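To make the arithmetic in the quoted explanation concrete, here is a short sketch using the same example numbers (64KB RAID chunk, 4KB filesystem blocks); the device name is a placeholder and the commented stripe-width option only exists in newer e2fsprogs releases, so treat both as assumptions rather than part of the original advice:

# stride = RAID chunk size per disk / filesystem block size
CHUNK_KB=64
BLOCK_KB=4
STRIDE=$((CHUNK_KB / BLOCK_KB))   # 64KB / 4KB = 16 blocks
echo "stride=$STRIDE"

# Hypothetical mkfs invocation using that value:
# mke2fs -j -b 4096 -E stride=$STRIDE /dev/mdX
# (newer e2fsprogs also accept stripe-width=<data disks * stride> in -E)

On the first question in Richard's mail: whether the value can be reported afterwards depends on the e2fsprogs version; older releases used stride only at mkfs time without recording it in the superblock, which may be why dumpe2fs and tune2fs show nothing for it here.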
The classic > case where this would happen is if you have a 5 disks in a RAID 5 > configuration, which means with 4 disks per stripe, and 8192 blocks in > a blockgroup, then if the block bitmap is always at the same offset > from the beginning of the block group, one disk will get all of the > block bitmaps, and that ends up being a major hot spot problem for the > hard drive. > > As it turns out, if you use 4 disks in a RAID 5 configuration, or 6 > disks in a RAID 5 configuration, this problem doesn't arise at all, > and you don't need to use the stride option. And in most cases, > simply using a stride=1, that is actually enough to make sure that > each block and inode bitmaps will get forced onto successively > different disks. > > With ext4's flex_bg enhancement, the need to specify stride option of > RAID arrays will also go away. > > - Ted > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Regards, /~\ The ASCII Richard Jackson \ / Ribbon Campaign Computer Systems Engineer, X Against HTML Information Technology Unit, Technology Systems Division / \ Email! Enterprise Servers and Operations Department George Mason University, Fairfax, Virginia From adilger at sun.com Wed Jun 25 08:36:37 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 25 Jun 2008 02:36:37 -0600 Subject: stride (fwd) In-Reply-To: <200806241230.m5OCUTqq004576@mason.gmu.edu> References: <200806241230.m5OCUTqq004576@mason.gmu.edu> Message-ID: <20080625083637.GW6239@webber.adilger.int> On Jun 24, 2008 08:30 -0400, Richard Jackson wrote: > If this is not the case then I suggest adding something similar to Ted's > or Andreas' descriptions to replace the current stride mke2fs man page. > > If nothing else change from > > stride= > Configure the filesystem for a RAID array with > filesystem blocks per stripe. > > to > > stride= > > The number of filesystem blocks on a single disk. The purpose > is to spread the filesystem metadata across the disks. For > example, if the RAID chunk/segment size is 64KB and the > filesystem block size is 4KB, then the stride size is 16 > (64KB/4KB). The patch to add the "stride" and "stripe-size" options to mke2fs and mke2fs(8) man pages were already included upstream for 1.40.7 or earlier. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From yamin_yossi at diligent.com Wed Jun 25 12:55:24 2008 From: yamin_yossi at diligent.com (Yamin, Yossi) Date: Wed, 25 Jun 2008 15:55:24 +0300 Subject: "Attempt to access beyond end of device" problem Message-ID: Hi, We are using Ext3 on with RedHat 4 U3 File Sysetm. 
We got the following errors at the /var/log/messages file Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 -- Jun 23 09:37:47 diligent1 kernel: lpfc1: BUFF seg 5 free 946 numblks 1024 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 running fsck.ext3 -n -f on this device yield the Below output. " Pass 1: Checking inodes, blocks, and sizes Inode 192086020 has illegal block(s). Clear? no Illegal block #18620428 (774778414) in inode 192086020. IGNORED. Illegal block #18620429 (774778414) in inode 192086020. IGNORED. Illegal block #18620430 (774778414) in inode 192086020. IGNORED. Illegal block #18620431 (774778414) in inode 192086020. IGNORED. Illegal block #18620432 (774778414) in inode 192086020. IGNORED. Illegal block #18620433 (774778414) in inode 192086020. IGNORED. Illegal block #18620434 (774778414) in inode 192086020. IGNORED. Illegal block #18620435 (774778414) in inode 192086020. IGNORED. Illegal block #18620436 (774778414) in inode 192086020. IGNORED. Illegal block #18620437 (774778414) in inode 192086020. IGNORED. Illegal block #18620438 (774778414) in inode 192086020. IGNORED. Too many illegal blocks in inode 192086020. Clear inode? no Suppress messages? no Illegal block #18620439 (774778414) in inode 192086020. IGNORED. Illegal block #18620440 (774778414) in inode 192086020. IGNORED. Illegal block #18620441 (774778414) in inode 192086020. IGNORED. Illegal block #18620442 (774778414) in inode 192086020. IGNORED. Illegal block #18620443 (774778414) in inode 192086020. IGNORED. Illegal block #18620444 (774778414) in inode 192086020. IGNORED. Illegal block #18620445 (774778414) in inode 192086020. IGNORED. Illegal block #18620446 (774778414) in inode 192086020. IGNORED. Illegal block #18620447 (774778414) in inode 192086020. IGNORED. Illegal block #18620448 (774778414) in inode 192086020. IGNORED. Illegal block #18620449 (774778414) in inode 192086020. IGNORED. Illegal block #18620450 (774778414) in inode 192086020. IGNORED. Too many illegal blocks in inode 192086020. Clear inode? no Suppress messages? no Illegal block #18620451 (774778414) in inode 192086020. IGNORED. Illegal block #18620452 (774778414) in inode 192086020. IGNORED. Illegal block #18620453 (774778414) in inode 192086020. IGNORED. Illegal block #18620454 (774778414) in inode 192086020. IGNORED. Illegal block #18620455 (774778414) in inode 192086020. IGNORED. Illegal block #18620456 (774778414) in inode 192086020. IGNORED. Illegal block #18620457 (774778414) in inode 192086020. IGNORED. Illegal block #18620458 (774778414) in inode 192086020. IGNORED. Illegal block #18620459 (774778414) in inode 192086020. IGNORED. Illegal block #18620460 (774778414) in inode 192086020. IGNORED. Illegal block #18620461 (774778414) in inode 192086020. IGNORED. Illegal block #18620462 (774778414) in inode 192086020. IGNORED. 
Too many illegal blocks in inode 192086020. Clear inode? no Suppress messages? no " Do you think that running fsck with corrective actions will destroy part of my data consistency? if the answer is yes is there any other way to recover? What do you think was the root cause for this issue ? Please notice that this specific FS is more than 2TB size but configured with msdos partition label. Best regards Yossi Yamin Sr.Technical specialist Diligent Technologies, an IBM Company -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeen at redhat.com Wed Jun 25 13:56:04 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 25 Jun 2008 08:56:04 -0500 Subject: "Attempt to access beyond end of device" problem In-Reply-To: References: Message-ID: <48624E74.1070306@redhat.com> Yamin, Yossi wrote: > Hi, > > We are using Ext3 on with RedHat 4 U3 File Sysetm. > > We got the following errors at the /var/log/messages file > > > > Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device > > Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, > limit=4183807887 ... > What do you think was the root cause for this issue ? > Please notice that this specific FS is more than 2TB size but configured > with msdos partition label. Hm, I think you just stated the root cause. Did the filesystem work fine until you rebooted it? Did it used to be 8T? And did you use parted to create it? Parted wrongly allows you to make a >2T msdos partition table and pokes it directly into the kernel, but the on-disk format cannot hold the large value. So when you reboot, it's read as something smaller. You might be able to do a trick where you create a new GPT label in place of the old DOS label, with the same start point as the dos label, but with a correct endpoint. I would not repair the fs; if my guess is right, 3/4 of it is now unreachable and fsck will probably do heavy damage. -Eric From yamin_yossi at diligent.com Wed Jun 25 14:58:59 2008 From: yamin_yossi at diligent.com (Yamin, Yossi) Date: Wed, 25 Jun 2008 17:58:59 +0300 Subject: "Attempt to access beyond end of device" problem In-Reply-To: <48624E74.1070306@redhat.com> References: <48624E74.1070306@redhat.com> Message-ID: HI, Thanks for the quick response. I think the situation is different then what you describe since we have abut 10 FS with the same size 2142.1 GB that have no problem. Thanks, Yossi -----Original Message----- From: Eric Sandeen [mailto:sandeen at redhat.com] Sent: Wednesday, June 25, 2008 4:56 PM To: Yamin, Yossi Cc: ext3-users at redhat.com Subject: Re: "Attempt to access beyond end of device" problem Yamin, Yossi wrote: > Hi, > > We are using Ext3 on with RedHat 4 U3 File Sysetm. > > We got the following errors at the /var/log/messages file > > > > Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device > > Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, > limit=4183807887 ... > What do you think was the root cause for this issue ? > Please notice that this specific FS is more than 2TB size but configured > with msdos partition label. Hm, I think you just stated the root cause. Did the filesystem work fine until you rebooted it? Did it used to be 8T? And did you use parted to create it? Parted wrongly allows you to make a >2T msdos partition table and pokes it directly into the kernel, but the on-disk format cannot hold the large value. So when you reboot, it's read as something smaller. 
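As a quick sanity check on the numbers in that log line (a sketch of the arithmetic only, not a diagnosis): want and limit in this kernel message are counted in 512-byte sectors, so

# want:  6332971392 * 512 = 3242481352704 bytes, roughly 2.95 TiB
# limit: 4183807887 * 512 = 2142109638144 bytes, roughly 1.95 TiB (about 2142.1 GB)
# msdos partition-table ceiling: 2^32 * 512 = 2199023255552 bytes, exactly 2 TiB
echo $((6332971392 * 512)) $((4183807887 * 512)) $((4294967296 * 512))

so the device is seen as just under the 2 TiB msdos limit while a write is being attempted well past it; the limit figure matches the ~2142.1 GB filesystem size mentioned later in the thread.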
You might be able to do a trick where you create a new GPT label in place of the old DOS label, with the same start point as the dos label, but with a correct endpoint. I would not repair the fs; if my guess is right, 3/4 of it is now unreachable and fsck will probably do heavy damage. -Eric From sandeen at redhat.com Wed Jun 25 15:03:29 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 25 Jun 2008 10:03:29 -0500 Subject: "Attempt to access beyond end of device" problem In-Reply-To: References: <48624E74.1070306@redhat.com> Message-ID: <48625E41.5020104@redhat.com> Yamin, Yossi wrote: > HI, > Thanks for the quick response. > I think the situation is different then what you describe since we have > abut 10 FS with the same size 2142.1 GB that have no problem. Hm, ok, you said that it was > 2T... I guess that's TiB vs. TB. Then perhaps it is just localized corruption (hard to say from *what*) and an e2fsck might fix it up just fine. -Eric From yamin_yossi at diligent.com Tue Jun 24 16:45:46 2008 From: yamin_yossi at diligent.com (Yamin, Yossi) Date: Tue, 24 Jun 2008 19:45:46 +0300 Subject: "Attempt to access beyond end of device" problem Message-ID: Hi, We are using Ext3 on with RedHat 4 U3 File Sysetm. We got the following errors at the /var/log/messages file Jun 23 09:28:29 diligent1 kernel: attempt to access beyond end of device Jun 23 09:28:29 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 -- Jun 23 09:37:47 diligent1 kernel: lpfc1: BUFF seg 5 free 946 numblks 1024 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 Jun 23 09:44:03 diligent1 kernel: attempt to access beyond end of device Jun 23 09:44:03 diligent1 kernel: sddlmbj1: rw=1, want=6332971392, limit=4183807887 running fsck.ext3 -n -f on this device yield the attached output. Do you think that running fsck with corrective actions will destroy part of my data consistency? if the answer is yes is there any other way to recover? What do you think was the root cause for this issue ? Please notice that this specific FS is more than 2TB size but configured with msdos partition label. Best regards Yossi Yamin Sr.Technical specialist Diligent Technologies, an IBM Company -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: fsck_sddlmbj1.zip Type: application/x-zip-compressed Size: 463793 bytes Desc: fsck_sddlmbj1.zip URL: From howachen at gmail.com Fri Jun 27 04:28:58 2008 From: howachen at gmail.com (howard chen) Date: Fri, 27 Jun 2008 12:28:58 +0800 Subject: Recommended number of files stored under a single folder Message-ID: Hi, I have a web site for storing images and serve to public. In my site, I need to set a rule for controlling the max. number of files that would be allowed for client to upload, asI know that performance of FS degrade when number of files increase, can anyone suggest a number so I can stop client from uploading too many files? E.g. 10K would be okay? Thanks. 
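No reply to the last question appears in this part of the archive, so purely as a hedged sketch (the device name is a placeholder; whether these options help depends on the kernel and e2fsprogs in use), two things commonly looked at before settling on a per-directory cap are whether dir_index (hashed directories) is enabled and whether uploads can be spread across subdirectories:

# Check whether the filesystem already has hashed directories enabled
# (look for "dir_index" in the feature list; /dev/sdb1 is an example):
tune2fs -l /dev/sdb1 | grep -i features

# If absent, it can be enabled and existing directories re-indexed
# (on an unmounted filesystem):
#   tune2fs -O dir_index /dev/sdb1
#   e2fsck -fD /dev/sdb1

# Application-side alternative: hash the filename and store it as e.g.
# ab/cd/abcdef123.jpg so no single directory grows without bound.

With dir_index, tens of thousands of entries in one directory are generally workable; without it, lookups degrade noticeably well before that, so a hard number is difficult to give without measuring.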
From magawake at gmail.com Sat Jun 28 04:13:30 2008 From: magawake at gmail.com (Mag Gam) Date: Sat, 28 Jun 2008 00:13:30 -0400 Subject: inode and filesystem question Message-ID: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> While reading for fun, I noticed inode does not carry filename. I always though it did. I read that it is carried by the directory structure and the kernel interpolates it. Can someone please explain this to me TIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From bruno at wolff.to Sat Jun 28 04:22:18 2008 From: bruno at wolff.to (Bruno Wolff III) Date: Fri, 27 Jun 2008 23:22:18 -0500 Subject: inode and filesystem question In-Reply-To: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> Message-ID: <20080628042218.GA17730@wolff.to> On Sat, Jun 28, 2008 at 00:13:30 -0400, Mag Gam wrote: > While reading for fun, I noticed inode does not carry filename. I always > though it did. I read that it is carried by the directory structure and the > kernel interpolates it. Can someone please explain this to me A file can have more than one name. You can read up on "hard link" for more information. From magawake at gmail.com Sat Jun 28 11:39:55 2008 From: magawake at gmail.com (Mag Gam) Date: Sat, 28 Jun 2008 07:39:55 -0400 Subject: inode and filesystem question In-Reply-To: <20080628042218.GA17730@wolff.to> References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> <20080628042218.GA17730@wolff.to> Message-ID: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> Well, I guess this is more for a theoretical question. How the filename is determined if its not in the inode. On Sat, Jun 28, 2008 at 12:22 AM, Bruno Wolff III wrote: > On Sat, Jun 28, 2008 at 00:13:30 -0400, > Mag Gam wrote: > > While reading for fun, I noticed inode does not carry filename. I always > > though it did. I read that it is carried by the directory structure and > the > > kernel interpolates it. Can someone please explain this to me > > A file can have more than one name. You can read up on "hard link" for > more information. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex at alex.org.uk Sat Jun 28 11:54:20 2008 From: alex at alex.org.uk (Alex Bligh) Date: Sat, 28 Jun 2008 12:54:20 +0100 Subject: inode and filesystem question In-Reply-To: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> <20080628042218.GA17730@wolff.to> <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> Message-ID: <2903310EA2E80B55C5C080F7@Ximines.local> --On 28 June 2008 07:39:55 -0400 Mag Gam wrote: > Well, I guess this is more for a theoretical question. How the filename > is determined if its not in the inode. It isn't. There is no easy way to get back from an inode number to a filename (or filenames, as there can be more than one - think how hard links work, multiple directory entries (and hence filenames) pointing to one inode) apart from recurse through the entire directory tree and find which directory entries contain that inode number. That's because there is (fsck type operations apart) in general no need to go from an inode number to the list of directory entries that point to it. Indeed some inodes can have no directory entry pointing to them (e.g. if you open a file, then unlink it (with rm) before closing it). 
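A small sketch that makes the point above concrete (run in a scratch directory; all names are arbitrary): the inode number is shared by every hard link, the name lives only in the directory entry, and going from an inode number back to names means searching the tree:

cd /tmp && mkdir -p inode-demo && cd inode-demo
echo hello > data
ln data alias                 # second hard link: a new name, same inode
ls -li data alias             # both names show the same inode number
stat -c 'inode=%i links=%h name=%n' data alias

# Mapping an inode number back to its names means scanning directories:
INO=$(stat -c %i data)
find . -xdev -inum "$INO"     # prints ./data and ./alias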
This isn't ext3 specific, this is the way UNIX file systems work. I suggest doing some background reading on UNIX filesystems in general rather than asking on an ext3 specific list. For a very simple intro see: http://en.wikipedia.org/wiki/Inode Alex From davids at webmaster.com Sat Jun 28 18:02:54 2008 From: davids at webmaster.com (David Schwartz) Date: Sat, 28 Jun 2008 11:02:54 -0700 Subject: inode and filesystem question In-Reply-To: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> Message-ID: > Well, I guess this is more for a theoretical question. > How the filename is determined if its not in the inode. Simple, files don't have names. Directory entries do. A directory entry's name is stored in the directory entry, along with the inode number of the file it references. This is the UNIX way, love it or hate it. DS From yamin_yossi at diligent.com Sat Jun 28 20:50:16 2008 From: yamin_yossi at diligent.com (Yamin, Yossi) Date: Sat, 28 Jun 2008 23:50:16 +0300 Subject: debugfs question In-Reply-To: References: Message-ID: Hi, I am trying to read a file directly from the disk using debugfs utility. I am running "stat" on the file I want, filter out IND and Bind blocks, and then copy the data blocks using dd directly from the Disk. On small files it work'd (13MB). On big files (3.5 , 440 GB) the size is the same but md5sum get differ. What am I doing wrong? I umount the FS before I start so the file is not changing. Best regards Yossi Yamin Sr.Technical specialist Diligent Technologies, an IBM Company -------------- next part -------------- An HTML attachment was scrubbed... URL: From bruno at wolff.to Sun Jun 29 13:37:16 2008 From: bruno at wolff.to (Bruno Wolff III) Date: Sun, 29 Jun 2008 08:37:16 -0500 Subject: inode and filesystem question In-Reply-To: <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> <20080628042218.GA17730@wolff.to> <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> Message-ID: <20080629133716.GA25425@wolff.to> On Sat, Jun 28, 2008 at 07:39:55 -0400, Mag Gam wrote: > Well, I guess this is more for a theoretical question. How the filename is > determined if its not in the inode. Filenames are matched to inodes in the directory blocks. (I am assuming that's the question you meant to ask. The phrasing of your question is a bit odd and you may have really been asking a different question.) From magawake at gmail.com Sun Jun 29 17:14:46 2008 From: magawake at gmail.com (Mag Gam) Date: Sun, 29 Jun 2008 13:14:46 -0400 Subject: inode and filesystem question In-Reply-To: <20080629133716.GA25425@wolff.to> References: <1cbd6f830806272113o43bf9a54x65ced9c917e2c07b@mail.gmail.com> <20080628042218.GA17730@wolff.to> <1cbd6f830806280439s10b2daefocea7a6d08c84326f@mail.gmail.com> <20080629133716.GA25425@wolff.to> Message-ID: <1cbd6f830806291014s46be11f5j7045d145a887574d@mail.gmail.com> Thanks Bruno. Thats exactly what I was asking. Some people got angry at me for asking here since its a "basic" Unix question. Sorry about that On Sun, Jun 29, 2008 at 9:37 AM, Bruno Wolff III wrote: > On Sat, Jun 28, 2008 at 07:39:55 -0400, > Mag Gam wrote: > > Well, I guess this is more for a theoretical question. How the filename > is > > determined if its not in the inode. > > Filenames are matched to inodes in the directory blocks. (I am assuming > that's the question you meant to ask. The phrasing of your question is a > bit > odd and you may have really been asking a different question.) 
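On the debugfs question a few messages up, no reply appears in this part of the archive; as a hedged alternative to reassembling data blocks by hand with dd, debugfs can copy a file out itself (the device and both paths below are placeholders), which is worth comparing against the manual method on a small file first:

# Copy a file straight out of an unmounted ext3 filesystem, letting
# debugfs walk the block map instead of doing it by hand:
debugfs -R 'dump /path/inside/fs/bigfile /tmp/bigfile.copy' /dev/sdXN
md5sum /tmp/bigfile.copy

# Note: "stat" in debugfs lists IND, DIND and (for very large files) TIND
# indirect blocks; a manual dd reconstruction has to skip all of them and
# splice the remaining data blocks in order, which is one plausible place
# for a mismatch to creep in on multi-gigabyte files.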