From subscribe at sydes.la Tue May 3 02:20:30 2005
From: subscribe at sydes.la (Jason Sydes)
Date: Mon, 2 May 2005 19:20:30 -0700
Subject: several ext3 and mysql kernel crashes
Message-ID: <20050503022030.GH23016@hq.newdream.net>

Hi Ext3!

I'm running about 30 dedicated MySQL machines under quite decent loads, and they are occasionally crashing. I've been logging console messages recently in an effort to find the cause, and some appear to be related to

I perused your lists and found the message I'm replying to. If you don't mind, I've included messages and ksymoops from two crashes that I had recently. The two crashes were different. I'm not sure if you have fixes for them in the new kernel, so I'll be upgrading a few machines tonight.

I'm running 2.6.10 with the "data=journal" mount option. Is that the best / safest option for running with MySQL?

In any case, I'm logging all console messages now, so hopefully I can have more ksymoops output for you soon enough. I've included the output for each below.

Thank you for your time!

Jason

First Machine ("Scratchy")
==========================

Assertion failure in __journal_drop_transaction() at fs/jbd/checkpoint.c:613: "transaction->t_forget == NULL"
------------[ cut here ]------------
kernel BUG at fs/jbd/checkpoint.c:613!
invalid operand: 0000 [#1] SMP CPU: 2 EIP: 0060:[] Not tainted VLI EFLAGS: 00010282 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) EIP is at __journal_drop_transaction+0x128/0x290 eax: 00000071 ebx: d1abf680 ecx: c04ea524 edx: 00000286 esi: f6877400 edi: 00000013 ebp: d1abf680 esp: f5e59dc0 ds: 007b es: 007b ss: 0068 Process kjournald (pid: 1086, threadinfo=f5e58000 task=f6801a60) Stack: c04295a0 c0429764 c0429567 00000265 c0429805 f6877400 f188cf8c c01f71c1 f6877400 d1abf680 f6877400 f6877414 f6877414 00000000 f68774c0 f6877454 f687743c d1abf6b8 f6877414 f6877414 ed3021b8 f6877478 f5e58000 00000000 Call Trace: [] journal_commit_transaction+0xf09/0xf68 [] rcu_check_quiescent_state+0x55/0x64 [] rcu_check_quiescent_state+0x5f/0x64 [] autoremove_wake_function+0x0/0x40 [] autoremove_wake_function+0x0/0x40 [] find_busiest_group+0xeb/0x2d8 [] scheduler_tick+0x443/0x450 [] run_timer_softirq+0x150/0x158 [] __do_softirq+0x6a/0xd4 [] irq_exit+0x2d/0x30 [] smp_apic_timer_interrupt+0xce/0xd4 [] apic_timer_interrupt+0x1c/0x30 [] del_timer_sync+0xa3/0xdc [] kjournald+0xd3/0x228 [] kjournald+0x0/0x228 [] autoremove_wake_function+0x0/0x40 [] autoremove_wake_function+0x0/0x40 [] commit_timeout+0x0/0xc [] kernel_thread_helper+0x5/0xc Code: 95 42 c0 83 c4 14 90 83 7b 24 00 74 2a 68 05 98 42 c0 68 65 02 00 00 68 67 95 42 c0 68 64 97 42 c0 68 a0 95 42 c0 e8 50 c4 f5 ff <0f> 0b 65 02 67 95 42 c0 83 c4 14 90 83 7b 2c 00 74 2a 68 40 98 scratchy: 07:17pm# ksymoops -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189 /tmp/scratchy.apr30th.or.something.edited ksymoops 2.4.5 on i686 2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189. 
Options used -V (default) -k /proc/ksyms (default) -l /proc/modules (default) -o /lib/modules/2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189/ (default) -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189 (specified) Error (regular_file): read_ksyms stat /proc/ksyms failed ksymoops: No such file or directory No modules in ksyms, skipping objects No ksyms, skipping lsmod kernel BUG at fs/jbd/checkpoint.c:613! invalid operand: 0000 [#1] CPU: 2 EIP: 0060:[] Not tainted VLI Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010282 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) eax: 00000071 ebx: d1abf680 ecx: c04ea524 edx: 00000286 esi: f6877400 edi: 00000013 ebp: d1abf680 esp: f5e59dc0 ds: 007b es: 007b ss: 0068 Stack: c04295a0 c0429764 c0429567 00000265 c0429805 f6877400 f188cf8c c01f71c1 f6877400 d1abf680 f6877400 f6877414 f6877414 00000000 f68774c0 f6877454 f687743c d1abf6b8 f6877414 f6877414 ed3021b8 f6877478 f5e58000 00000000 [] journal_commit_transaction+0xf09/0xf68 [] rcu_check_quiescent_state+0x55/0x64 [] rcu_check_quiescent_state+0x5f/0x64 [] autoremove_wake_function+0x0/0x40 [] autoremove_wake_function+0x0/0x40 [] find_busiest_group+0xeb/0x2d8 [] scheduler_tick+0x443/0x450 [] run_timer_softirq+0x150/0x158 [] __do_softirq+0x6a/0xd4 [] irq_exit+0x2d/0x30 [] smp_apic_timer_interrupt+0xce/0xd4 [] apic_timer_interrupt+0x1c/0x30 [] del_timer_sync+0xa3/0xdc [] kjournald+0xd3/0x228 [] kjournald+0x0/0x228 [] autoremove_wake_function+0x0/0x40 [] autoremove_wake_function+0x0/0x40 [] commit_timeout+0x0/0xc [] kernel_thread_helper+0x5/0xc Code: 95 42 c0 83 c4 14 90 83 7b 24 00 74 2a 68 05 98 42 c0 68 65 02 00 00 68 67 95 42 c0 68 64 97 42 c0 68 a0 95 42 c0 e8 50 c4 f5 ff <0f> 0b 65 02 67 95 42 c0 83 c4 14 90 83 7b 2c 00 74 2a 68 40 98 >>EIP; c01f8404 <__journal_drop_transaction+128/290> <===== >>ebx; d1abf680 >>ecx; c04ea524 >>esi; f6877400 >>ebp; d1abf680 >>esp; f5e59dc0 Code; c01f83d9 
<__journal_drop_transaction+fd/290> 00000000 <_EIP>: Code; c01f83d9 <__journal_drop_transaction+fd/290> 0: 95 xchg %eax,%ebp Code; c01f83da <__journal_drop_transaction+fe/290> 1: 42 inc %edx Code; c01f83db <__journal_drop_transaction+ff/290> 2: c0 83 c4 14 90 83 7b rolb $0x7b,0x839014c4(%ebx) Code; c01f83e2 <__journal_drop_transaction+106/290> 9: 24 00 and $0x0,%al Code; c01f83e4 <__journal_drop_transaction+108/290> b: 74 2a je 37 <_EIP+0x37> c01f8410 <__journal_drop_transaction+134/290> Code; c01f83e6 <__journal_drop_transaction+10a/290> d: 68 05 98 42 c0 push $0xc0429805 Code; c01f83eb <__journal_drop_transaction+10f/290> 12: 68 65 02 00 00 push $0x265 Code; c01f83f0 <__journal_drop_transaction+114/290> 17: 68 67 95 42 c0 push $0xc0429567 Code; c01f83f5 <__journal_drop_transaction+119/290> 1c: 68 64 97 42 c0 push $0xc0429764 Code; c01f83fa <__journal_drop_transaction+11e/290> 21: 68 a0 95 42 c0 push $0xc04295a0 Code; c01f83ff <__journal_drop_transaction+123/290> 26: e8 50 c4 f5 ff call fff5c47b <_EIP+0xfff5c47b> c0154854 Code; c01f8404 <__journal_drop_transaction+128/290> <===== 2b: 0f 0b ud2a <===== Code; c01f8406 <__journal_drop_transaction+12a/290> 2d: 65 02 67 95 add %gs:0xffffff95(%edi),%ah Code; c01f840a <__journal_drop_transaction+12e/290> 31: 42 inc %edx Code; c01f840b <__journal_drop_transaction+12f/290> 32: c0 83 c4 14 90 83 7b rolb $0x7b,0x839014c4(%ebx) Code; c01f8412 <__journal_drop_transaction+136/290> 39: 2c 00 sub $0x0,%al Code; c01f8414 <__journal_drop_transaction+138/290> 3b: 74 2a je 67 <_EIP+0x67> c01f8440 <__journal_drop_transaction+164/290> Code; c01f8416 <__journal_drop_transaction+13a/290> 3d: 68 .byte 0x68 Code; c01f8417 <__journal_drop_transaction+13b/290> 3e: 40 inc %eax Code; c01f8418 <__journal_drop_transaction+13c/290> 3f: 98 cwtl 1 error issued. Results may not be reliable. 
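As a cross-check on the disassembly above, here is a minimal sketch in Python. It assumes the i386 BUG() layout used when CONFIG_DEBUG_BUGVERBOSE is enabled: the ud2 opcode (0f 0b), then a 16-bit little-endian line number, then a 32-bit pointer to the file name. ksymoops disassembles those trailing bytes as instructions (the "add %gs:..." right after ud2a), but read as data they should give the line number from the "kernel BUG at" message. The helper name decode_bug_trailer is made up for illustration.

```python
import struct


def decode_bug_trailer(code_bytes):
    """Return (line, file_ptr) stored after the first ud2 (0f 0b) opcode.

    Assumes the i386 BUG() encoding: ud2, .word line, .long file pointer.
    """
    raw = bytes.fromhex(code_bytes.replace(" ", ""))
    idx = raw.index(b"\x0f\x0b")  # position of the ud2 opcode
    # "<HI" = little-endian 16-bit line number, then 32-bit pointer, unpadded.
    line, file_ptr = struct.unpack_from("<HI", raw, idx + 2)
    return line, file_ptr


# Bytes taken from the "Code:" line of the Scratchy oops, starting at <0f>.
line, file_ptr = decode_bug_trailer("0f 0b 65 02 67 95 42 c0")
print(line, hex(file_ptr))  # 613 0xc0429567

# The "push $0x265" before the call carries the same line number.
print(int("265", 16))  # 613
```

Both values agree with "kernel BUG at fs/jbd/checkpoint.c:613", and the pointer matches the "push $0xc0429567" in the listing, so the bytes after ud2a are data rather than real instructions.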
scratchy: 07:17pm#

Second Machine ("Tib")
======================

Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c01fab35
*pgd = c040fa1800000000
*pmd = 0000000000000000
Oops: 0000 [#1]
SMP
CPU: 2
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010246 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189)
EIP is at __journal_remove_journal_head+0x9/0x130
eax: 00000000 ebx: 00000000 ecx: f7d4f200 edx: 00000014
esi: f1920320 edi: 00000013 ebp: c0d6f280 esp: f46d5dcc
ds: 007b es: 007b ss: 0068
Process kjournald (pid: 1091, threadinfo=f46d4000 task=f597ca60)
Stack: f1920320 da3ee14c c01fac83 f1920320 f1920320 c01f70fb f1920320 f7d4f400
       f7d4f414 f7d4f414 00000000 f7d4f4c0 f7d4f454 f7d4f43c c0d6f2b8 f7d4f414
       f7d4f414 eba83db8 f7d4f478 f46d4000 00000000 00000ebc d7c53144 00000000
Call Trace:
 [] journal_remove_journal_head+0x27/0x44
 [] journal_commit_transaction+0xe43/0xf68
 [] d_callback+0x27/0x2c
 [] autoremove_wake_function+0x0/0x40
 [] autoremove_wake_function+0x0/0x40
 [] kjournald+0xd3/0x228
 [] kjournald+0x0/0x228
 [] autoremove_wake_function+0x0/0x40
 [] autoremove_wake_function+0x0/0x40
 [] commit_timeout+0x0/0xc
 [] kernel_thread_helper+0x5/0xc
Code: 74 06 8b 5a 28 ff 43 04 8b 02 a9 00 00 10 00 75 08 0f 0b 19 02 c0 9b 42 c0 f0 0f ba 32 14 89 d8 5b c3 56 53 8b 74 24 0c 8b 5e 28 <83> 7b 04 00 7d 29 68 e0 a4 42 c0 68 e3 06 00 00 68 cc 9c 42 c0

tib: 07:00pm# ksymoops -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189 /root/oops
ksymoops 2.4.5 on i686 2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189.
Options used -V (default) -k /proc/ksyms (default) -l /proc/modules (default) -o /lib/modules/2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189/ (default) -m /boot/System.map-2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189(specified) Error (regular_file): read_ksyms stat /proc/ksyms failed ksymoops: No such file or directory No modules in ksyms, skipping objects No ksyms, skipping lsmod Unable to handle kernel NULL pointer dereference at virtual address 00000004 c01fab35 *pgd = c040fa1800000000 Oops: 0000 [#1] CPU: 2 EIP: 0060:[] Not tainted VLI Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010246 (2.6.10-grsec+gg3+e+fhs6b+nfs+gr0501+++p4+c4a+gr6b-reslog-v6.189) eax: 00000000 ebx: 00000000 ecx: f7d4f200 edx: 00000014 esi: f1920320 edi: 00000013 ebp: c0d6f280 esp: f46d5dcc ds: 007b es: 007b ss: 0068 Stack: f1920320 da3ee14c c01fac83 f1920320 f1920320 c01f70fb f1920320 f7d4f400 f7d4f414 f7d4f414 00000000 f7d4f4c0 f7d4f454 f7d4f43c c0d6f2b8 f7d4f414 f7d4f414 eba83db8 f7d4f478 f46d4000 00000000 00000ebc d7c53144 00000000 [] journal_remove_journal_head+0x27/0x44 [] journal_commit_transaction+0xe43/0xf68 [] d_callback+0x27/0x2c [] autoremove_wake_function+0x0/0x40 [] autoremove_wake_function+0x0/0x40 [] kjournald+0xd3/0x228 [] kjournald+0x0/0x228 [] autoremove_wake_function+0x0/0x40 [] autoremove_wake_function+0x0/0x40 [] commit_timeout+0x0/0xc [] kernel_thread_helper+0x5/0xc Code: 74 06 8b 5a 28 ff 43 04 8b 02 a9 00 00 10 00 75 08 0f 0b 19 02 c0 9b 42 c0 f0 0f ba 32 14 89 d8 5b c3 56 53 8b 74 24 0c 8b 5e 28 <83> 7b 04 00 7d 29 68 e0 a4 42 c0 68 e3 06 00 00 68 cc 9c 42 c0 >>EIP; c01fab35 <__journal_remove_journal_head+9/130> <===== >>ecx; f7d4f200 >>esi; f1920320 >>ebp; c0d6f280 >>esp; f46d5dcc Code; c01fab0a 00000000 <_EIP>: Code; c01fab0a 0: 74 06 je 8 <_EIP+0x8> c01fab12 Code; c01fab0c 2: 8b 5a 28 mov 0x28(%edx),%ebx Code; c01fab0f 5: ff 43 04 incl 0x4(%ebx) Code; c01fab12 8: 8b 02 mov (%edx),%eax Code; c01fab14 a: a9 00 00 
10 00 test $0x100000,%eax Code; c01fab19 f: 75 08 jne 19 <_EIP+0x19> c01fab23 Code; c01fab1b 11: 0f 0b ud2a Code; c01fab1d 13: 19 02 sbb %eax,(%edx) Code; c01fab1f 15: c0 9b 42 c0 f0 0f ba rcrb $0xba,0xff0c042(%ebx) Code; c01fab26 1c: 32 14 89 xor (%ecx,%ecx,4),%dl Code; c01fab29 1f: d8 5b c3 fcomps 0xffffffc3(%ebx) Code; c01fab2c <__journal_remove_journal_head+0/130> 22: 56 push %esi Code; c01fab2d <__journal_remove_journal_head+1/130> 23: 53 push %ebx Code; c01fab2e <__journal_remove_journal_head+2/130> 24: 8b 74 24 0c mov 0xc(%esp,1),%esi Code; c01fab32 <__journal_remove_journal_head+6/130> 28: 8b 5e 28 mov 0x28(%esi),%ebx Code; c01fab35 <__journal_remove_journal_head+9/130> <===== 2b: 83 7b 04 00 cmpl $0x0,0x4(%ebx) <===== Code; c01fab39 <__journal_remove_journal_head+d/130> 2f: 7d 29 jge 5a <_EIP+0x5a> c01fab64 <__journal_remove_journal_head+38/130> Code; c01fab3b <__journal_remove_journal_head+f/130> 31: 68 e0 a4 42 c0 push $0xc042a4e0 Code; c01fab40 <__journal_remove_journal_head+14/130> 36: 68 e3 06 00 00 push $0x6e3 Code; c01fab45 <__journal_remove_journal_head+19/130> 3b: 68 cc 9c 42 c0 push $0xc0429ccc On Mon, May 02, 2005 at 06:17:35PM -0700, Jason Sydes wrote: > Mike Fedyk writes: > > > Nicolas Kowalski wrote: > >> Mike Fedyk writes: > >> > >>>Nicolas Kowalski wrote: > >>> > >>>>I will try to reproduce these errors on a non-production server now. > >>> > >>>Beautiful. > >>> > >>>It might be good if you put a stack_dump() call just after the > >>>printk() call in the ext3 source. > >> I apologize, (I am not familiar with kernel debugging), but when > >> compiling the kernel with this call inserted after the printk in the > >> sources, it fails with an resolved symbol error. ... > >> fs/fs.o: In function `__jbd_unexpected_dirty_buffer': > >> fs/fs.o(.text+0x3ab8a): undefined reference to `stack_dump' > >> ... > >> I must be missing an option, but which one ? > > > > Oh crap. It's called dump_stack(). > > Ok. 
I had another similar error this morning: > > Unexpected dirty buffer encountered at do_get_write_access:618 (08:11 > blocknr 920701) > dba1fddc dba1fe04 c017565e c03054a0 c0305483 c030373b 0000026a c03fc5e0 > 000e0c7d d1072580 dba1fe4c c016f76b c030373b 0000026a d34f1d80 > d1072580 > df4c1e94 d34f1d80 c01701dd 00000000 00000000 00000003 df4c1e00 > d3615430 > Call Trace: [] [] [] [] > [] > [] [] [] [] [] > [] > [] [] > > > ksymoops gives me: > > Trace; c017565e <__jbd_unexpected_dirty_buffer+3a/74> > Trace; c016f76b > Trace; c01701dd > Trace; c016fc10 > Trace; c0167c88 > Trace; c0167c4e > Trace; c0167e21 > Trace; c0167c74 > Trace; c012f67a > Trace; c012fb14 > Trace; c01657e2 > Trace; c013c807 > Trace; c0108be3 > > > Does this help ? > > -- > Nicolas > > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From evilninja at gmx.net Sun May 8 01:20:33 2005 From: evilninja at gmx.net (Christian) Date: Sun, 08 May 2005 03:20:33 +0200 Subject: 2.6.12-rc3-mm2 benchmarks Message-ID: <427D6961.6080000@gmx.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 [!! i've Cc'ed several fs lists, please remove when when replying !!] hi all, from time to time i do some benchmarks for several filesystems and several crypto-algorithms too, details here: http://nerdbynature.de/bench/ latest results here: http://nerdbynature.de/bench/prinz/2.6.12-rc3-mm2/bonnie.html http://nerdbynature.de/bench/prinz/2.6.12-rc3-mm2/tiobench.txt Christian. 
- --
BOFH excuse #173: Recursive traversal of loopback mount points
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCfWlhC/PVm5+NVoYRAmCBAJ9D+UrpvNJ+AoJijJwCN3DVs1Da/QCgkMoC
Ea5VVCQ1Q2XrJNahJQoif1c=
=m8tN
-----END PGP SIGNATURE-----

From hans.yperman at gmail.com Thu May 12 22:35:16 2005
From: hans.yperman at gmail.com (Hans Yperman)
Date: Fri, 13 May 2005 00:35:16 +0200
Subject: Smashing EXT3 for fun and profit (or: how to loose all your data)
Message-ID:

Hello everyone,

I've just lost my whole EXT3 linux partition by what was probably a bug. For your reading pleasure, and in the hope there is enough information to fix this problem in the future, here is the story of a violent ending:

This tragic history actually starts on Windows: MS Word had wiped out an important file on a floppy, and I got the task of retrieving what was possible. Using Linux, I made an image with dd, and put it on the now extinct EXT3 partition. I used an undelete program, and then mounted the image with a loopback device:
mount -o loop /tmp/image.img /floppy
As it turns out, the undeleter managed to screw up the FAT, and the loopback device complained about reading past the end of the device. After fixing the floppy on another computer, I came back to the linux computer. The console was full of error messages.

What happened? A first bug: Linux remounted the loopback device read-only because of the bad FAT on the image. BUT this did not work out right: not only the loopback device, but the whole EXT3 partition was now read-only. Every little write action resulted in an error, hence all the messages. I did not really think much of it at that point, and just did a
mount -o remount,rw /

At this point, I am already screwed, but I don't realize it yet: The computer works completely normally from here on.
The problem happens the next time I boot: fsck complains about problems (weird, fsck is not supposed to run for EXT3). Specifically, fsck complains about double-allocated blocks, does a pass 1B and 1C (I'd never seen these before either), dumps pages and pages and pages of block numbers, gets very very veeeeryyy slow, and crashes. I restart fsck. This time it starts asking me tons of yes/no questions because it wants to know what to do with the double-allocated blocks. I answer yes to them all (there is no real right answer anyhow) and reboot.

And that was it: init starts, and complains about not having an /etc/inittab (and asks me which runlevel to start. Never seen that before either). Then it crashes. Booting with knoppix reveals lots and lots of damaged files. Everything that was cached seems to be damaged, and some random files are also dead (my guess is ext3 screwed up while updating atimes or something like that). Game over.

I guess these 2 facts need fixing:
1) loopback devices should not pass errors over to their underlying filesystems.
2) ext3 suicidally allows remounting read-write when parts of its data are invalid.

Now I don't complain much. I have a 1 day old backup of my home directory (thanks, unison). I lost all my tweaks to /etc, but, well, the hard drive image was copied/resized from computer to computer to computer, and initially started its life under linux 2.0.35 on a Pentium 133 MHz. A rewrite was probably a good idea anyway. I lost all my MP3's, but a very nice girl promised to help me re-rip them all from my CD's. (Thanks to ext3 I get to spend some time with a very sexy girl. Lots of it by talking and laughing while we wait for lame to end. I actually start to think my hard drive should get erased more often ;-) ).

Other people might not like losing a whole partition, so I mail this sad story to you all. A bit of advice: if you ever see ext3 complaining about being read-only, press the reset button. It might save your partition.
I did not test my claim of the loopback being the bug, as I am busy reinstalling right now (on EXT2 this time).

Have a nice day, everyone,
Hans.

From theman at josephdwagner.info Fri May 13 19:55:55 2005
From: theman at josephdwagner.info (Joseph D. Wagner)
Date: Fri, 13 May 2005 14:55:55 -0500
Subject: Smashing EXT3 for fun and profit (or: how to loose all your data)
In-Reply-To:
Message-ID: <200505131955.j4DJtWaB003721@josephdwagner.info>

> I guess these 2 facts need fixing:
> 1) loopback devices should not pass errors over
> to their underlying filesystems.

I have a test partition set up for these circumstances. I'll try to reproduce the read-write/read-only error spreading to an underlying file system when the loopback file system has the error. However, I will have to double check with the file system designers. There may be a good reason it behaves this way.

> 2) ext3 suicidally allows remounting read-write
> when parts of its data are invalid.

When you are logged in as root, it will let you do whatever suicidal -- or imho stupid -- things you tell it to do. That is not going to change.

It actually takes something serious to bring down a file system mid-stride, not just an atime update. In other words, by the time Linux is remounting your file system as read-only, something is already fubar. The remount as read-only is really only a stop-gap measure to prevent further damage while you save your work -- on other partitions -- and reboot.

If all you have is one honkin' / (root) partition, you may just want to change that behavior to panic. After all, if you only have 1 partition, there's nowhere else to save your work.

So long as you're redoing your partitions, be sure to separate out /tmp, /var, and just to be safe /home too, so next time all you lose is the one bad partition.

Joseph D.
Wagner

From tytso at mit.edu Sat May 14 02:28:03 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Fri, 13 May 2005 22:28:03 -0400
Subject: Smashing EXT3 for fun and profit (or: how to loose all your data)
In-Reply-To:
References:
Message-ID: <20050514022803.GA26057@thunk.org>

On Fri, May 13, 2005 at 12:35:16AM +0200, Hans Yperman wrote:
> This tragic history starts actually on windows: MS Word had wiped out
> an important file on a floppy, and I got the task of retrieving what
> was possible. Using Linux, I made an image with dd,and put it on the
> now extinct EXT3 partition. I used an undelete programma , and then
> mounted the image with a loopback device:
> mount -o loop /tmp/image.img /floppy
> As it turns out,the undeleter managed to screw up the FAT, and the
> loopback device complains about reading past the end of the device.
> After fixing the floppy on another computer, I come back to the linux
> computer. The console is full of error messages.

What version of the kernel are you using? What undelete program were you using? Most undelete programs don't require that you mount the filesystem; in fact, they often require that you *don't* mount it.

> What happened? A first bug: Linux remounted the loopback-device
> read-only because of the bad FAT on the image. BUT this did not work
> out right: not only the loopback device, but the whole EXT3-partition
> were now read-only. Every little write action results in an error,
> hence all the messages. I did not really think much of it at that
> point, and just did a
> mount -o remount,rw /

Without the logs, it sounds like the ext3 filesystem got corrupted, and so it was remounted read-only. How this happened is not clear, and you didn't give us enough information to determine that; but it's consistent with e2fsck displaying errors.

> At this point, I am already screwed, but I don't realize it yet: The
> computer works completely normal from here on.
The problem happens
> the next time I boot: fsck complains about problems (weird, fsck is
> not supposed to run for EXT3).

When the kernel discovers filesystem corruption, it marks the filesystem as containing errors, and remounts it read-only. When fsck next runs, it will note the fact that the filesystem has problems, and try to fix it.

> Specifically, fsck complains about
> double-allocated blocks, does a pass 1B and 1C (I'd never seen these
> before either), dumps pages and pages and pages of block numbers,
> get's very very veeeeryyy slow, and crashes. I restart fsck. This
> time it starts asking me tons of yes/no questions because it wants to
> know what to do with the double-allocated block. I yes them all
> (There is no real right answer anyhow) and reboot.

What version of e2fsck are you running? It must be an ancient one if it got really slow like that. You wouldn't be running Debian Obsolete^H^H^H^H^H^H^H Stable, would you?

> And that was it: init starts, and complains about not having an
> /etc/inittab (and asks me which runlevel to start. Never seen that
> before either). Then it crashes. Booting with knoppix reveals lots
> and lost of damaged files. Everything that was cached seems to be
> damaged, and some random files are also dead (my gues is ext3 screwed
> up while updating atimes or something like that). Game over.

The filesystem was probably screwed up much earlier than that. It probably happened when the undelete program was run, or perhaps because you remounted the filesystem read-write after errors were uncovered, but it's going to be hard to reconstruct without a lot more details. (What specific messages were printed by the kernel describing the errors, exactly what version of the kernel, e2fsprogs, and undelete program you were using, etc.)
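A rough sketch of collecting those details in one place, in Python. The helper names (first_line_of, ext3_messages, build_report) are made up for illustration, and it assumes e2fsck is on the PATH and that the kernel messages were saved as text; the sample log lines are only placeholders in the syslog format.

```python
import platform
import subprocess


def first_line_of(cmd):
    """Run cmd and return the first line it prints, or a note if it cannot run."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"unavailable ({type(exc).__name__})"
    # Some tools print their version banner on stderr, so look at both streams.
    text = (done.stdout + done.stderr).strip()
    return text.splitlines()[0] if text else "no output"


def ext3_messages(log_text):
    """Keep only the saved kernel lines that mention ext3 or the journal."""
    keys = ("EXT3-fs", "journal")
    return [ln for ln in log_text.splitlines() if any(k in ln for k in keys)]


def build_report(log_text):
    """Assemble kernel version, e2fsck version and the relevant log lines."""
    lines = [
        f"kernel:    {platform.release()}",
        f"e2fsprogs: {first_line_of(['e2fsck', '-V'])}",
        "kernel messages:",
    ]
    lines += ["  " + ln for ln in ext3_messages(log_text)]
    return "\n".join(lines)


sample_log = (
    "May 15 04:03:30 localhost kernel: Aborting journal on device sdd1.\n"
    "May 15 04:03:31 localhost crond: unrelated line\n"
    "May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in "
    "start_transaction: Journal has aborted\n"
)
print(build_report(sample_log))
```

The version of the undelete program and the e2fsck transcript still have to be added by hand; the point is only that the kernel messages are saved before the next reboot.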
I will say that while remounting a filesystem read/write after errors is dangerous, the fact that e2fsck displayed pages and pages of block numbers tends to indicate that there was something more that went wrong. Merely remounting a filesystem read/write might result in some multiply-claimed blocks, which pass 1b/1c/1d are designed to resolve, but how many you have depends on how many files are written and how badly corrupted the block allocation bitmaps were. Assuming that you didn't run the system for very long before you rebooted, or didn't write a lot of files during this interim, it seems somewhat unlikely that it would have resulted in "pages and pages and pages" of block numbers. That would tend to argue that portions of the inode table got written to the wrong location, which is generally caused by a hardware error. It might have been caused by the undelete program, but that seems hard to believe. But then again, I don't know which undelete program you used, and it does seem very surprising that the undelete program would work with a mounted filesystem, so that part sounds like another user error (but not one that would be expected to cause major filesystem corruption).

So the bottom line is I can't really tell you what could have happened with the limited facts that you've given me.

> I guess these 2 facts need fixing:
> 1) loopback devices should not pass errors over to their underlying filesystems.

Loopback devices don't pass errors back over to their underlying filesystems.

> 2) ext3 suicidally allows remounting read-write when parts of its data
> are invalid.

Linux will allow you to do many things that might be, well, ill-advised. When the kernel printed all of the warnings, it warned you that the filesystem had errors. Remounting it read/write was a really bad idea --- but then again, so is running the command "dd if=/dev/zero of=/dev/hda1" as root.

> Other people might not like loosing a whole partition, so I mail this
> sad story to you all.
A bit of advice: if you ever see ext3 > complaining about being read-only, press the reset button. It might > save your partition. Or run e2fsck manually yourself; there are a number of things that you can do. Blindly remounting the filesystem read/write is certainly not one of them. Saving all of the error messages from the kernel describing the filesystem corruption is a really good idea. As is saving the messages from e2fsck, so people can figure out what happened after the fact. The one good thing is that you kept good backups, so you didn't lose that much; I definitely commend that. :-) - Ted From dclunie at dclunie.com Sun May 15 13:56:53 2005 From: dclunie at dclunie.com (David Clunie) Date: Sun, 15 May 2005 09:56:53 -0400 Subject: Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3 Message-ID: <42875525.8010202@dclunie.com> Hi I have a Firewire connected Micronet 1.5TB RAID with a single large ext3 filesystem on one partition on a dual Xeon system. I am checking out from an extremely large cvs repository (don't ask) to this drive over the course of many days, and intermittently I get bad blocks and the filesystem goes read-only. This is not related to any power failure or anything similar. The RAID is currently about 40% full; this started to happen around the 15% mark as I recall. I checked the RAID firmware setup, found that caching was set to write-back, and changed it to write-through to see if that would help (since I gather the Linux kernel presumes write-through, though why it should make a difference in the absence of a reboot or power failure I don't understand). This reduced the frequency of the error from once a night to once every couple of nights; interestingly mostly at about 04:03 AM or so. Looking at cron.daily, only mrtg and sa seem to be starting up at about that time. I suspect the timing is related to a change in the pattern of disk activity rather than anything else. 
I have no reason to suspect that there is anything actually wrong with the RAID itself, which just appears as a really big firewire external disk. It is new however, so this can't be ruled out. My next step is to just turn off journaling and see if doing this with just ext2 works OK. Journaling doesn't seem to be doing much good as I am stuck regularly running ordinary fsck's with all these errors anyway ! I just thought I would ask if anyone else has had a similar experience, and whether such issues are known to be with ext3, or the firewire interface, or both together. PS. I did actually create the partition and did the mkfs on an AMD64 FC3 system at a different site, though that is not the system to which the RAID is currently connected. Just mention that in case this makes a difference, but I presume an fsck would have noticed and fixed anything fundamentally wrong in this regard. David May 15 04:03:30 localhost kernel: Aborting journal on device sdd1. May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343526: bad block 165510584 May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted May 15 04:03:30 localhost kernel: inode_doinit_with_dentry: getxattr returned 5 for dev=sdd1 ino=63343526 May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343381: bad block 141623810 May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63947123: bad block 203323361 Linux localhost.localdomain 2.6.9-1.667smp #1 SMP Tue Nov 2 14:59:52 EST 2004 i686 i686 i386 GNU/Linux From theman at josephdwagner.info Sun May 15 22:48:44 2005 From: theman at josephdwagner.info (Joseph D. 
Wagner)
Date: Sun, 15 May 2005 17:48:44 -0500
Subject: Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3
In-Reply-To: <42875525.8010202@dclunie.com>
Message-ID: <200505152248.j4FMmI7b031303@josephdwagner.info>

> May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1):
> ext3_xattr_get: inode 63343526: bad block 165510584
> May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1):
> ext3_xattr_get: inode 63343381: bad block 141623810
> May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1):
> ext3_xattr_get: inode 63947123: bad block 203323361

These errors cannot be caused by a bug in the file system. It is possible, although highly unlikely, that a bug in the device driver could generate these errors. The most likely cause is that there actually are bad blocks on your new 1.5TB file system.

Do us all a favor and run:

badblocks -v -b block_size /dev/device

And let us know about the results.

Joseph D. Wagner

From anandtiwari at softhome.net Mon May 16 23:39:00 2005
From: anandtiwari at softhome.net (anandtiwari at softhome.net)
Date: Mon, 16 May 2005 17:39:00 -0600
Subject: Ext3 journal corruption
In-Reply-To: <42875525.8010202@dclunie.com>
References: <42875525.8010202@dclunie.com>
Message-ID:

Hi all,

I had an ext3 filesystem with writeback. Yesterday my system crashed, and now when I try to mount it, it gives me "Invalid argument". Following is the command line
#mount -t ext3 /dev/hda1 /mnt/home

I tried debugging it, and later I found out it was complaining about the journal inode. Is there any way to recover my files? I did clone the disk and mounted it as ext2 after a few tries, but there was nothing in it.
Any help or pointers will be appreciated,

Thanks
anand

From tytso at mit.edu Tue May 17 01:08:26 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 16 May 2005 21:08:26 -0400
Subject: Ext3 journal corruption
In-Reply-To:
References: <42875525.8010202@dclunie.com>
Message-ID: <20050517010826.GC11282@thunk.org>

On Mon, May 16, 2005 at 05:39:00PM -0600, anandtiwari at softhome.net wrote:
> Hi all,
>
> I was having a ext3 filesystem with writeback. yesterday my system crashed
> and now when i try to mount it, it gives me "Invalid argument". Following
> is the command line
> #mount -t ext3 /dev/hda1 /mnt/home
>
> i tried debugging it and later i found out, its was complaining about
> journaling inode. Is there any way to recover my files, i did clone the
> disk and mounted it as ext2 after few tries but there was nothing in it.
> any help or pointers will be appreciated,

1) Run e2fsck to correct any filesystem errors. This may remove the journal inode.

2) If it didn't, to be safe, remove the journal:
"tune2fs -O ^has_journal /dev/hdXX"

3) Then recreate the journal: "tune2fs -j /dev/hdXX"

Ted

From anandtiwari at softhome.net Tue May 17 02:05:16 2005
From: anandtiwari at softhome.net (Anand Tiwari)
Date: Mon, 16 May 2005 20:05:16 -0600
Subject: Ext3 journal corruption
References: <42875525.8010202@dclunie.com> <20050517010826.GC11282@thunk.org>
Message-ID: <001e01c55a84$dff2ee70$fa00a8c0@darkstar>

OK, but just curious: if it is not cleanly unmounted, mount shouldn't be able to mount it as ext2fs.

----- Original Message -----
From: "Theodore Ts'o"
To:
Cc:
Sent: Monday, May 16, 2005 7:08 PM
Subject: Re: Ext3 journal corruption

> On Mon, May 16, 2005 at 05:39:00PM -0600, anandtiwari at softhome.net wrote:
> > Hi all,
> >
> > I was having a ext3 filesystem with writeback. yesterday my system crashed
> > and now when i try to mount it, it gives me "Invalid argument".
Following > > is the command line > > #mount -t ext3 /dev/hda1 /mnt/home > > > > i tried debugging it and later i found out, its was complaining about > > journaling inode. Is there any way to recover my files, i did clone the > > disk and mounted it as ext2 after few tries but there was nothing in it. > > any help or pointers will be appreciated, > > 1) Run e2fsck to correct any filesystem errors. This may remove the journal inode. > > 2) If it didn't, to be safe, remove the journal: > "tune2fs -O ^has_journal /dev/hdXX" > > 3) Then recreate the journal: "tune2fs -j /dev/hdXX" > > Ted From adilger at clusterfs.com Tue May 17 06:04:44 2005 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 17 May 2005 00:04:44 -0600 Subject: Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3 In-Reply-To: <42875525.8010202@dclunie.com> References: <42875525.8010202@dclunie.com> Message-ID: <20050517060444.GJ1499@schnapps.adilger.int> On May 15, 2005 09:56 -0400, David Clunie wrote: > I have a Firewire connected Micronet 1.5TB RAID with a single > large ext3 filesystem on one partition on a dual Xeon system. For some kernels (maybe even current ones) it is possible that there is a problem with IO beyond 1 TB. What I would do (if you don't mind overwriting the disk, presumably not if it is just new and doesn't contain important data) is to write a small test program to write the byte offset at the start of every 4kB block on the disk, then read them all back and verify it is correct. This will tell you if there is aliasing in the block device (possibly e.g. an int used instead of __u32 or sector_t). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From theman at josephdwagner.info Tue May 17 07:42:15 2005 From: theman at josephdwagner.info (Joseph D. 
Wagner)
Date: Tue, 17 May 2005 02:42:15 -0500
Subject: Intermittent ext3 corruption on external firewire Micronet1.5Tb RAID on FC3
In-Reply-To: <20050517060444.GJ1499@schnapps.adilger.int>
Message-ID: <200505170741.j4H7fnl2031520@josephdwagner.info>

> What I would do (if you don't mind overwriting the disk, presumably
> not if it is just new and doesn't contain important data) is to
> write a small test program to write the byte offset at the start of
> every 4kB block on the disk, then read them all back and verify it
> is correct.

That's what badblocks is for when doing a destructive write test.

Joseph D. Wagner

From adilger at clusterfs.com Tue May 17 08:44:37 2005
From: adilger at clusterfs.com ('Andreas Dilger')
Date: Tue, 17 May 2005 02:44:37 -0600
Subject: Intermittent ext3 corruption on external firewire Micronet1.5Tb RAID on FC3
In-Reply-To: <200505170741.j4H7fnl2031520@josephdwagner.info>
References: <20050517060444.GJ1499@schnapps.adilger.int> <200505170741.j4H7fnl2031520@josephdwagner.info>
Message-ID: <20050517084437.GN1499@schnapps.adilger.int>

On May 17, 2005 02:42 -0500, Joseph D. Wagner wrote:
> > What I would do (if you don't mind overwriting the disk, presumably
> > not if it is just new and doesn't contain important data) is to
> > write a small test program to write the byte offset at the start of
> > every 4kB block on the disk, then read them all back and verify it
> > is correct.
>
> That's what badblocks is for when doing a destructive write test.

Looking at the badblocks man page, I don't think this is true (though I could be wrong). If badblocks is only writing out a repetitive pattern, and only verifying in 64-block chunks, this will not detect device block address aliasing because (a) the pattern doesn't depend on the offset so it will verify correctly, and (b) 64 blocks is likely aligned to the same offset as where the device would conceivably wrap.

Having this feature as part of badblocks (e.g.
add "-t offset" pattern) is probably a great place to do this because it is widely available and already has most of the framework for this. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From cchan at outblaze.com Thu May 19 11:05:34 2005 From: cchan at outblaze.com (Christopher Chan) Date: Thu, 19 May 2005 19:05:34 +0800 Subject: ext3 journal problems Message-ID: <428C72FE.3080506@outblaze.com> This caused a crash on a 2.6.10-1.12_FC2smp kernel May 19 09:56:35 spf1 kernel: Assertion failure in log_do_checkpoint() at fs/jbd/checkpoint.c:361: "drop_count != 0 || cleanup_ret != 0" May 19 09:56:35 spf1 kernel: ------------[ cut here ]------------ May 19 09:56:37 spf1 kernel: kernel BUG at fs/jbd/checkpoint.c:361! May 19 09:56:37 spf1 kernel: invalid operand: 0000 [#1] May 19 09:56:37 spf1 kernel: SMP May 19 09:56:37 spf1 kernel: Modules linked in: md5 ipv6 autofs4 e100 mii ipt_REJECT iptable_filter ip_tables microcode dm_mod ohci_hcd ext3 jbd raid1 raid0 May 19 09:56:37 spf1 kernel: CPU: 0 May 19 09:56:37 spf1 kernel: EIP: 0060:[] Not tainted VLI May 19 09:56:37 spf1 kernel: EFLAGS: 00010202 (2.6.10-1.12_FC2smp) May 19 09:56:37 spf1 kernel: EIP is at log_do_checkpoint+0x106/0x146 [jbd] May 19 09:56:37 spf1 kernel: eax: 0000006e ebx: eaadccbc ecx: e453ab90 edx: f883d756 May 19 09:56:37 spf1 kernel: esi: f6a11a00 edi: 00000000 ebp: c091d5e0 esp: e453ab8c May 19 09:56:37 spf1 kernel: ds: 007b es: 007b ss: 0068 May 19 09:56:37 spf1 kernel: Process cleanup (pid: 29696, threadinfo=e453a000 task=e907b060) May 19 09:56:37 spf1 kernel: Stack: f883d756 f883c91d f883d742 00000169 f883d811 034aa511 dd06f92c eaadccbc May 19 09:56:38 spf1 kernel: 00000000 00000000 c628f3bc f5be8764 c0154c62 00001000 f6a11c00 f14c3498 May 19 09:56:38 spf1 kernel: f5d8c360 00000001 f14c3498 f5c87480 f5d8c360 f5d8c290 f14c3498 f8870c79 May 19 09:56:38 spf1 kernel: Call Trace: May 19 09:56:38 spf1 kernel: [] __getblk+0x24/0x42 May 19 09:56:38 spf1 kernel: [] 
ext3_do_update_inode+0x2fb/0x322 [ext3] May 19 09:56:38 spf1 kernel: [] journal_get_write_access+0x25/0x2c [jbd] May 19 09:56:38 spf1 kernel: [] ext3_mark_iloc_dirty+0x10/0x18 [ext3] May 19 09:56:38 spf1 kernel: [] ext3_mark_inode_dirty+0x33/0x3a [ext3] May 19 09:56:38 spf1 kernel: [] ext3_splice_branch+0xeb/0x18c [ext3] May 19 09:56:38 spf1 kernel: [] do_get_write_access+0x54f/0x56b [jbd] May 19 09:56:38 spf1 kernel: [] __find_get_block+0xb5/0xbe May 19 09:56:38 spf1 kernel: [] __mod_timer+0xf1/0xfb May 19 09:56:38 spf1 kernel: [] __log_wait_for_space+0xa4/0xc7 [jbd] May 19 09:56:38 spf1 kernel: [] start_this_handle+0x2f8/0x33e [jbd] May 19 09:56:38 spf1 kernel: [] __wake_up+0x29/0x3c May 19 09:56:38 spf1 kernel: [] journal_start+0x78/0x9e [jbd] May 19 09:56:38 spf1 kernel: [] ext3_prepare_write+0x32/0xf4 [ext3] May 19 09:56:38 spf1 kernel: [] generic_file_buffered_write+0x1a3/0x499 May 19 09:56:38 spf1 kernel: [] inode_update_time+0x6e/0x96 May 19 09:56:38 spf1 kernel: [] __generic_file_aio_write_nolock+0x38e/0x3bc May 19 09:56:38 spf1 kernel: [] generic_file_aio_write_nolock+0x39/0x7f May 19 09:56:38 spf1 kernel: [] generic_file_aio_write+0x6e/0xbe May 19 09:56:38 spf1 kernel: [] ext3_file_write+0x19/0x8a [ext3] May 19 09:56:38 spf1 kernel: [] do_sync_write+0x97/0xc9 May 19 09:56:38 spf1 kernel: [] poll_freewait+0x33/0x3a May 19 09:56:38 spf1 kernel: [] autoremove_wake_function+0x0/0x2d May 19 09:56:38 spf1 kernel: [] scheduler_tick+0x3b3/0x3c9 May 19 09:56:38 spf1 kernel: [] vfs_write+0xb8/0xe4 May 19 09:56:38 spf1 kernel: [] sys_write+0x3c/0x62 May 19 09:56:38 spf1 kernel: [] syscall_call+0x7/0xb May 19 09:56:38 spf1 kernel: Code: ff ff 83 7c 24 10 00 75 2d 85 c0 75 29 68 11 d8 83 f8 68 69 01 00 00 68 42 d7 83 f8 68 1d c9 83 f8 68 56 d7 83 f8 e8 63 45 8e c7 <0f> 0b 69 01 42 d7 83 f8 83 c4 14 39 6e 40 75 0a 83 7e 40 00 0f From theman at josephdwagner.info Thu May 19 17:19:11 2005 From: theman at josephdwagner.info (Joseph D. 
Wagner) Date: Thu, 19 May 2005 12:19:11 -0500 Subject: ext3 journal problems In-Reply-To: <428C72FE.3080506@outblaze.com> Message-ID: <200505191718.j4JHIhmh016701@josephdwagner.info> > May 19 09:56:37 spf1 kernel: kernel BUG at fs/jbd/checkpoint.c:361! fs/jbd is not ext3. Please direct this to the jbd people. Joseph D. Wagner From adilger at clusterfs.com Thu May 19 17:36:21 2005 From: adilger at clusterfs.com (Andreas Dilger) Date: Thu, 19 May 2005 11:36:21 -0600 Subject: ext3 journal problems In-Reply-To: <200505191718.j4JHIhmh016701@josephdwagner.info> References: <428C72FE.3080506@outblaze.com> <200505191718.j4JHIhmh016701@josephdwagner.info> Message-ID: <20050519173621.GG1499@schnapps.adilger.int> On May 19, 2005 12:19 -0500, Joseph D. Wagner wrote: > > May 19 09:56:37 spf1 kernel: kernel BUG at fs/jbd/checkpoint.c:361! > > fs/jbd is not ext3. > > Please direct this to the jbd people. ??? Maybe you are thinking of "jfs", but jbd is developed by Stephen explicitly for ext3. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From mvolaski at aecom.yu.edu Thu May 19 17:40:34 2005 From: mvolaski at aecom.yu.edu (Maurice Volaski) Date: Thu, 19 May 2005 13:40:34 -0400 Subject: mke2fs options for very large filesystems In-Reply-To: <20050208170005.3F34E72E1E@hormel.redhat.com> References: <20050208170005.3F34E72E1E@hormel.redhat.com> Message-ID: >Yes, if you are creating larger files. By default e2fsck assumes the average >file size is 8kB and allocates a corresponding number of inodes there. If, >for example, you are storing lots of larger files there (digital photos, MP3s, >etc) that are in the MB range you can use "-t largefile" or "-t largefile4" >to specify an average file size of 1MB or 4MB respectively. You can also >use -i or -N (see man page) to override the default bytes-per-inode value. Wouldn't -T largefile already be making choices about the default bytes-per-inode? 
How could I make my own determination about what values are most appropriate for -i and -N? My filesystems are generally several hundreds of gigabytes, filled with files that average about one megabyte in size. -- Maurice Volaski, mvolaski at aecom.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University From mvolaski at aecom.yu.edu Thu May 19 17:49:28 2005 From: mvolaski at aecom.yu.edu (Maurice Volaski) Date: Thu, 19 May 2005 13:49:28 -0400 Subject: [Q] Where does all the space go? Message-ID: I created a filesystem as follows: mke2fs -j -O dir_index -O sparse_super -T largefile /dev/drbd/6 Here's the the output from df Filesystem Size Used Avail Use% /dev/drbd/6 475G 33M 452G 1% It seems that ext3 has taken 23 GB, which is about 5% of the total disk size, for itself. Is that right? If that is, indeed, the case, why does df just list 33M as being used? -- Maurice Volaski, mvolaski at aecom.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University From menscher at uiuc.edu Thu May 19 17:55:04 2005 From: menscher at uiuc.edu (Damian Menscher) Date: Thu, 19 May 2005 12:55:04 -0500 (CDT) Subject: [Q] Where does all the space go? In-Reply-To: References: Message-ID: On Thu, 19 May 2005, Maurice Volaski wrote: > mke2fs -j -O dir_index -O sparse_super -T largefile /dev/drbd/6 > > Filesystem Size Used Avail Use% > /dev/drbd/6 475G 33M 452G 1% > > It seems that ext3 has taken 23 GB, which is about 5% of the total disk size, > for itself. Is that right? It's not reserved for the filesystem, but rather for root. Read about the -m option in the manpage to adjust that 5%. > If that is, indeed, the case, why does df just list 33M as being used? I think the 33M is the space used by the journal, or the filesystem itself. Damian Menscher -- -=#| Physics Grad Student & SysAdmin @ U Illinois Urbana-Champaign |#=- -=#| 488 LLP, 1110 W. 
Green St, Urbana, IL 61801 Ofc:(217)333-0038 |#=-
-=#| 4602 Beckman, VMIL/MS, Imaging Technology Group:(217)244-3074 |#=-
-=#| www.uiuc.edu/~menscher/ Fax:(217)333-9819 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-

From kwijibo at zianet.com Thu May 19 17:55:02 2005
From: kwijibo at zianet.com (kwijibo at zianet.com)
Date: Thu, 19 May 2005 11:55:02 -0600
Subject: [Q] Where does all the space go?
In-Reply-To:
References:
Message-ID: <428CD2F6.8030105@zianet.com>

Investigate the -m option of mkfs.ext2/3 or tune2fs.
The default is 5%.

Maurice Volaski wrote:
> I created a filesystem as follows:
>
> mke2fs -j -O dir_index -O sparse_super -T largefile /dev/drbd/6
>
> Here's the the output from df
>
> Filesystem            Size  Used Avail Use%
> /dev/drbd/6           475G   33M  452G   1%
>
> It seems that ext3 has taken 23 GB, which is about 5% of the total disk
> size, for itself. Is that right?
>
> If that is, indeed, the case, why does df just list 33M as being used?

From theman at josephdwagner.info Thu May 19 23:52:01 2005
From: theman at josephdwagner.info (Joseph D. Wagner)
Date: Thu, 19 May 2005 18:52:01 -0500
Subject: ext3 journal problems
In-Reply-To: <20050519173621.GG1499@schnapps.adilger.int>
References: <428C72FE.3080506@outblaze.com> <200505191718.j4JHIhmh016701@josephdwagner.info> <20050519173621.GG1499@schnapps.adilger.int>
Message-ID: <20050519234943.M23142@josephdwagner.info>

> ??? Maybe you are thinking of "jfs", but jbd is developed by Stephen
> explicitly for ext3.

Oops. My bad. Sorry, I'm new to this file system development thing. I'm finding it to be quite a steep learning curve.

Joseph D.
Wagner From cchan at outblaze.com Fri May 20 02:00:35 2005 From: cchan at outblaze.com (Christopher Chan) Date: Fri, 20 May 2005 10:00:35 +0800 Subject: ext3 journal problems In-Reply-To: <20050519234943.M23142@josephdwagner.info> References: <428C72FE.3080506@outblaze.com> <200505191718.j4JHIhmh016701@josephdwagner.info> <20050519173621.GG1499@schnapps.adilger.int> <20050519234943.M23142@josephdwagner.info> Message-ID: <428D44C3.4040206@outblaze.com> Joseph D. Wagner wrote: >>??? Maybe you are thinking of "jfs", but jbd is developed by Stephen >>explicitly for ext3. > > > Oops. My bad. Sorry, I'm new to this file system development thing. I'm > find it to be quite a steep learning curve. > > Joseph D. Wagner > No problem. Please make sure of your homework. I don't run anything but ext3 on the box with the problem. From mvolaski at aecom.yu.edu Fri May 20 17:14:23 2005 From: mvolaski at aecom.yu.edu (Maurice Volaski) Date: Fri, 20 May 2005 13:14:23 -0400 Subject: [Q] Where does all the space go? In-Reply-To: <20050520160006.C8245736F1@hormel.redhat.com> References: <20050520160006.C8245736F1@hormel.redhat.com> Message-ID: >It's not reserved for the filesystem, but rather for root. Read about >the -m option in the manpage to adjust that 5%. >Investigate the -m option of mkfs.ext2/3 or tune2fs. >The default is 5%. Thanks for the info. I found a post previously that claims it is required to prevent high levels of fragmentation as well as "other, very important" reasons. I wonder how accurate this statement is. >Ummm, the 5% reservation is to prevent the high levels of >fragmentation that occur when the filesystem is near full (something >that I wish Windows would adopt as standard too ;-). It is also to >keep your system from "hanging" if system/root processes need to >write to the filesystem, so they aren't at the "mercy" of users >filling it up. And there are a few other, very important reasons >too. 
--
Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University

From theman at josephdwagner.info Fri May 20 17:50:14 2005
From: theman at josephdwagner.info (Joseph D. Wagner)
Date: Fri, 20 May 2005 12:50:14 -0500
Subject: [Q] Where does all the space go?
In-Reply-To:
Message-ID: <200505201749.j4KHnisY013617@josephdwagner.info>

> I found a post previously that claims it is required to prevent high
> levels of fragmentation as well as "other, very important" reasons. I
> wonder how accurate this statement is.

Very accurate. Fragmentation increases exponentially. The harder it is for the file system to find contiguous space for a file (as the file system gets more and more full) the exponentially worse fragmentation gets.

There's some argument about exactly what the cut-off point should be -- 90%, 95%, etc -- but by the time your TB file system is that full, you've got more serious problems anyway.

There are several studies on this out there on the web, somewhere.

Joseph D. Wagner

From tytso at mit.edu Sat May 21 02:40:45 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Fri, 20 May 2005 22:40:45 -0400
Subject: mke2fs options for very large filesystems
In-Reply-To:
References: <20050208170005.3F34E72E1E@hormel.redhat.com>
Message-ID: <20050521024045.GC6708@thunk.org>

On Thu, May 19, 2005 at 01:40:34PM -0400, Maurice Volaski wrote:
> Wouldn't -T largefile already be making choices about the default
> bytes-per-inode?
>
> How could I make my own determination about what values are most
> appropriate for -i and -N? My filesystems are generally several
> hundreds of gigabytes, filled with files that average about one
> megabyte in size.

Well, "mke2fs -i 1048576" will create an inode for every megabyte (1,048,576 bytes) of space on the filesystem. However, once you create a filesystem, it's not possible to increase the number of inodes in that filesystem afterwards.
Also, symbolic links also take up inodes, as do block and character devices. So in general you want to overallocate inodes somewhat. For example, if you specify "mke2fs -i 524288" then you will be creating twice as many inodes, since you are asking mke2fs to create an inode for every 512k of space. - Ted From dbond at nrggos.com.au Sun May 22 22:53:28 2005 From: dbond at nrggos.com.au (Darryl Bond) Date: Mon, 23 May 2005 08:53:28 +1000 Subject: FSCK of corrupted ext3 filesystem Message-ID: <42910D68.4090303@nrggos.com.au> Hello, I have a 1.3TB ext3 filesystem that has been in service for about 3 months. About 6 days ago the Emulex fibrechannel controller logged a SCSI error and the filesystem changed to RO. It appears that the filesystem instantly changes to RO and prevents the journal from working, therefore invalidating the filesystem. The filesystem was unmounted and a remount was attempted. The mount failed due to errors and an fsck came up with errors. Top output looks like this: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4562 root 25 0 780m 214m 236 R 99.9 42.6 6211:44 fsck.ext3 The fsck has been running for 6 days without printing anything to the screen. It seems to be working as an strace produces the following. 
Process 4562 attached - interrupt to quit
_llseek(5, 5979127808, [5979127808], SEEK_SET) = 0
read(5, "\377\276\340oY\\i\17\346N\231\370\216\v\276\361\255\245"..., 4096) = 4096
_llseek(5, 299281825792, [299281825792], SEEK_SET) = 0
write(5, "\323\265-Q\33<\331\216\325\304U\3V\221\213\301e\32Q\220"..., 4096) = 4096
_llseek(5, 5979131904, [5979131904], SEEK_SET) = 0
read(5, "\327\347\2435\210\253^\222H\253\302\331\360\245\323\352"..., 4096) = 4096
_llseek(5, 299281829888, [299281829888], SEEK_SET) = 0
write(5, "\242\355\370A\2759Q\251\31>\254\240\301\34\320\226J5\22"..., 4096) = 4096
_llseek(5, 5979136000, [5979136000], SEEK_SET) = 0
read(5, "X\220ik\266\312\306\\ \266\32\220A\362\3319\250\27&\f\357"..., 4096) = 4096
_llseek(5, 299281833984, [299281833984], SEEK_SET) = 0
write(5, "U\352\255\303`\262\372h\242\275\312\333_\352\3\322\313"..., 4096) = 4096
_llseek(5, 5979140096, [5979140096], SEEK_SET) = 0
read(5, "\33\265#\367\332{\250Bj\215\277[\313\201\23\340\223\216"..., 4096) = 4096
_llseek(5, 299281838080, [299281838080], SEEK_SET) = 0
write(5, "\313-\234z\236\253/\3\360\232\222\237p\t5L\353\v\363t%"..., 4096) = 4096
Process 4562 detached

How long should I let the fsck run?

Regards
Darryl Bond

From tytso at mit.edu Mon May 23 17:40:37 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 23 May 2005 13:40:37 -0400
Subject: FSCK of corrupted ext3 filesystem
In-Reply-To: <42910D68.4090303@nrggos.com.au>
References: <42910D68.4090303@nrggos.com.au>
Message-ID: <20050523174037.GA30505@thunk.org>

On Mon, May 23, 2005 at 08:53:28AM +1000, Darryl Bond wrote:
> Hello,
> I have a 1.3TB ext3 filesystem that has been in service for about 3 months.
> About 6 days ago the Emulex fibrechannel controller logged a SCSI error
> and the filesystem changed to RO.
> It appears that the filesystem instantly changes to RO and prevents the
> journal from working, therefore invalidating the filesystem.
> The filesystem was unmounted and a remount was attempted. The mount
> failed due to errors and an fsck came up with errors.

What version of e2fsck are you using, and what kernel messages were displayed when the filesystem was remounted read-only? What version of the kernel/distribution are you using? What messages were printed by e2fsck?

- Ted

From dbond at nrggos.com.au Wed May 25 10:57:25 2005
From: dbond at nrggos.com.au (Darryl Bond)
Date: Wed, 25 May 2005 20:57:25 +1000
Subject: FSCK of corrupted ext3 filesystem
In-Reply-To:
References: <42910D68.4090303@nrggos.com.au>
Message-ID: <42945A15.1080206@nrggos.com.au>

Perhaps, but should I stop it? It doesn't seem to be thrashing. The box is still quite responsive.
After 8 days it is still working. If I stop it, will I have a mountable filesystem that I can get as much as possible off?
I have ordered some 400G disks to try to get as much as possible.

Regards

Per Andreas Buer wrote:

>Darryl Bond writes:
>
>
>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 4562 root 25 0 780m 214m 236 R 99.9 42.6
>>6211:44 fsck.ext3
>>
>>
>
>It looks like fsck.ext3 has eaten all of your memory.
Your system is
>probably thrashing. Buy more memory.
>
>
>

From perbu at linpro.no Wed May 25 08:25:52 2005
From: perbu at linpro.no (Per Andreas Buer)
Date: 25 May 2005 10:25:52 +0200
Subject: FSCK of corrupted ext3 filesystem
In-Reply-To: <42910D68.4090303@nrggos.com.au>
References: <42910D68.4090303@nrggos.com.au>
Message-ID:

Darryl Bond writes:

> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4562 root 25 0 780m 214m 236 R 99.9 42.6
> 6211:44 fsck.ext3

It looks like fsck.ext3 has eaten all of your memory. Your system is probably thrashing. Buy more memory.

--
Per Andreas Buer

From menscher at uiuc.edu Wed May 25 14:29:49 2005
From: menscher at uiuc.edu (Damian Menscher)
Date: Wed, 25 May 2005 09:29:49 -0500 (CDT)
Subject: FSCK of corrupted ext3 filesystem
In-Reply-To:
References: <42910D68.4090303@nrggos.com.au>
Message-ID:

On Wed, 25 May 2005, Per Andreas Buer wrote:
> Darryl Bond writes:
>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 4562 root 25 0 780m 214m 236 R 99.9 42.6
>> 6211:44 fsck.ext3
>
> It looks like fsck.ext3 has eaten all of your memory. Your system is
> probably thrashing. Buy more memory.

No. Look at the columns again, reformatted properly:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4562 root      25   0  780m 214m  236 R 99.9 42.6  6211:44  fsck.ext3

My _uninformed_ suggestion would be to kill it and run it again. It might help. Or not. At least it's unlikely to make matters worse.

Damian Menscher
--
-=#| Physics Grad Student & SysAdmin @ U Illinois Urbana-Champaign |#=-
-=#| 488 LLP, 1110 W. Green St, Urbana, IL 61801 Ofc:(217)333-0038 |#=-
-=#| 4602 Beckman, VMIL/MS, Imaging Technology Group:(217)244-3074 |#=-
-=#| www.uiuc.edu/~menscher/ Fax:(217)333-9819 |#=-
-=#| The above opinions are not necessarily those of my employers.
|#=- From mvolaski at aecom.yu.edu Thu May 26 23:23:49 2005 From: mvolaski at aecom.yu.edu (Maurice Volaski) Date: Thu, 26 May 2005 19:23:49 -0400 Subject: Confusing -t for -T causes bad block count error In-Reply-To: <20050208170005.3F34E72E1E@hormel.redhat.com> References: <20050208170005.3F34E72E1E@hormel.redhat.com> Message-ID: Just in case anyone ever reads this old post below and tries making a file system with the little, lower case letter "t" below, it results in a baffling bad block count error. The correct option is the upper case, capital letter "T" :) >Yes, if you are creating larger files. By default e2fsck assumes the average >file size is 8kB and allocates a corresponding number of inodes there. If, >for example, you are storing lots of larger files there (digital photos, MP3s, >etc) that are in the MB range you can use "-t largefile" or "-t largefile4" >to specify an average file size of 1MB or 4MB respectively. You can also >use -i or -N (see man page) to override the default bytes-per-inode value. >This will also speed up e2fsck noticably. -- Maurice Volaski, mvolaski at aecom.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University From sct at redhat.com Fri May 27 15:13:35 2005 From: sct at redhat.com (Stephen C. Tweedie) Date: Fri, 27 May 2005 16:13:35 +0100 Subject: Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3 In-Reply-To: <200505152248.j4FMmI7b031303@josephdwagner.info> References: <200505152248.j4FMmI7b031303@josephdwagner.info> Message-ID: <1117206814.1957.42.camel@sisko.sctweedie.blueyonder.co.uk> Hi, On Sun, 2005-05-15 at 23:48, Joseph D. 
Wagner wrote: > > May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): > > ext3_xattr_get: inode 63343526: bad block 165510584 > > May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): > > ext3_xattr_get: inode 63343381: bad block 141623810 > > May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): > > ext3_xattr_get: inode 63947123: bad block 203323361 > > These errors cannot be caused by a bug in the file system. Yes they can, and almost certainly were: I'm not sure why you'd assert otherwise. These messages are coming straight back from ext3 when it doesn't find the right magic number in an xattr block. Looking at the kernel version in the initial error: > Linux localhost.localdomain 2.6.9-1.667smp #1 SMP Tue Nov 2 14:59:52 > EST 2004 i686 i686 i386 GNU/Linux Andreas and I found and fixed an xattr sharing bug in December 2004, about five months ago. It's a race when one process is deleting an unshared xattr block while another process is simultaneously trying to share it, and it seems to be particularly visible when you've got SELinux on. The fix is in the core mbcache.c code, but directly affects ext3 xattrs. This was fixed both upstream and in Fedora updates quite some time ago. "yum update" is your friend in this case. :-) Cheers, Stephen From sct at redhat.com Fri May 27 15:24:24 2005 From: sct at redhat.com (Stephen C. Tweedie) Date: Fri, 27 May 2005 16:24:24 +0100 Subject: Intermittent ext3 corruption on external firewire Micronet1.5Tb RAID on FC3 In-Reply-To: <200505170741.j4H7fnl2031520@josephdwagner.info> References: <200505170741.j4H7fnl2031520@josephdwagner.info> Message-ID: <1117207464.1957.53.camel@sisko.sctweedie.blueyonder.co.uk> Hi, On Tue, 2005-05-17 at 08:42, Joseph D. 
Wagner wrote: > > What I would do (if you don't mind overwriting the disk, presumably > > not if it is just new and doesn't contain important data) is to > > write a small test program to write the byte offset at the start of > > every 4kB block on the disk, then read them all back and verify it > > is correct. > > That's what badblocks is for when doing a destructive write test. No, badblocks just tells you if an IO succeeded. It's really not designed to make sure that the IO went to the correct disk block in the presence of block aliasing, which is what you need to detect wraps. I wrote a program to test such things a couple of months ago, and have recently been polishing it up and writing documentation for it for public consumption. It's called "verify-data", and it does a write-then-read verify scan designed for large block devices. It uses 1MB IOs by default, with the buffer carefully constructed to be easily recognisable: buffers contain a repeating pattern of block offset, byte offset, magic number and pass number, so any IOs going astray are instantly recognisable. Everything should be 64-bit safe, and I've used it on block devices up to 13TB in size. By default it just writes and verifies one chunk every 128GB throughout the device, but you can tell it to walk the whole device (MUCH slower!). I've found it very good for detecting edge-conditions, wraps etc. on large block devices. (It also includes a query mode, -Q, to interrogate the GETBLKSIZE[64] ioctls too.) It's called "verify-data" and can be found at http://people.redhat.com/sct/src/verify-data/ I've got it in git locally, and can push the git repo to http too if people find it useful. Cheers, Stephen
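
For anyone following the thread who wants to see the idea in code: below is a minimal Python sketch of the offset-stamping test Andreas describes (write each 4kB block's own byte offset into it, then read everything back and check). It is not verify-data and does not reproduce its buffer layout or options; the names MAGIC, make_block, stamp_blocks and verify_blocks are made up for illustration, and the sketch assumes the size passed in is a multiple of the block size. It also reads back through the page cache, so on a real device you would need to drop caches or use direct I/O before the verify pass means anything.

```python
import os
import struct

BLOCK_SIZE = 4096               # 4 kB blocks, as suggested in the thread
MAGIC = 0x5645524946593031      # arbitrary 64-bit marker (hypothetical)
STAMP = struct.Struct("<QQQ")   # byte offset, magic, pass number


def make_block(offset, pass_no):
    """Build one block whose first bytes encode where it belongs."""
    stamp = STAMP.pack(offset, MAGIC, pass_no)
    return stamp + bytes(BLOCK_SIZE - len(stamp))


def stamp_blocks(path, total_bytes, pass_no=0):
    """DESTRUCTIVE: overwrite every block with a stamp of its own offset."""
    with open(path, "r+b") as dev:
        for offset in range(0, total_bytes, BLOCK_SIZE):
            dev.seek(offset)
            dev.write(make_block(offset, pass_no))
        dev.flush()
        os.fsync(dev.fileno())


def verify_blocks(path, total_bytes, pass_no=0):
    """Return the offsets whose contents do not match their location."""
    bad = []
    with open(path, "rb") as dev:
        for offset in range(0, total_bytes, BLOCK_SIZE):
            dev.seek(offset)
            if dev.read(BLOCK_SIZE) != make_block(offset, pass_no):
                bad.append(offset)
    return bad
```

Try it on a scratch image file first, e.g. stamp_blocks("scratch.img", size) followed by verify_blocks("scratch.img", size). Because the pattern depends on the offset, a write that wraps and lands on the wrong block shows up as a mismatch at the block it clobbered, which a fixed repeating pattern cannot reveal.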