From shweta.vichare at tcs.com Mon Feb 2 14:56:25 2009
From: shweta.vichare at tcs.com (Shweta Vichare)
Date: Mon, 2 Feb 2009 20:26:25 +0530
Subject: Query on EXT3 Online Resize
Message-ID: 

Hello,

Do we have any handy patch for online resize of EXT3 filesystems using e2fsprogs 1.41 (resize2fs) with a 32-bit user space and a 64-bit kernel?

Shweta

=====-----=====-----=====
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you

From tytso at mit.edu Mon Feb 2 15:01:27 2009
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 2 Feb 2009 10:01:27 -0500
Subject: Query on EXT3 Online Resize
In-Reply-To: 
References: 
Message-ID: <20090202150127.GA14762@mit.edu>

On Mon, Feb 02, 2009 at 08:26:25PM +0530, Shweta Vichare wrote:
>
> Do we have any handy patch for online resize of EXT3 filesystems using
> e2fsprogs 1.41 (resize2fs) with a 32-bit user space and a 64-bit kernel?

What specific version of e2fsprogs and (more importantly) the kernel are you using? It should Just Work, although there were some compatibility bugs that were fixed sometime around 2.6.26 or 2.6.27 if memory serves correctly (hmm... although only for ext4 if memory serves correctly; I should double-check and see if the bug was fixed for ext3). It's not something which gets a lot of testing, though, so it's possible it got broken and no one noticed.

- Ted

From Mike.Miller at hp.com Mon Feb 2 15:55:50 2009
From: Mike.Miller at hp.com (Miller, Mike (OS Dev))
Date: Mon, 2 Feb 2009 15:55:50 +0000
Subject: barrier and commit options?
In-Reply-To: <20090130220245.GA27950@mit.edu>
References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> <498324E7.3000705@redhat.com> <20090130220245.GA27950@mit.edu>
Message-ID: <0F5B06BAB751E047AB5C87D1F77A778859F9E41D5F@GVW0547EXC.americas.hpqcorp.net>

Theodore Tso wrote:
>
> Well, we still need the barrier on the block I/O elevator
> side to make sure that requests don't get reordered in the
> block layer. But what you're saying is that once the write
> is posted to the array, it is guaranteed that it is on
> "stable storage" (even if it is BBWC) such that if someone
> hits the Big Red Switch at the exit to the data center, and
> power is forcibly cut from the entire data center in case of
> a fire, the battery will still keep the cache alive, at least
> until the sprinklers go off, anyway, right? :-)

That's an accurate assessment. ;-)

>
> In that case, I suspect the right thing for the cciss array
> to do is to ignore the barrier, but not to return an error.

We agree and will fix the IO error.

> If you return an error, and refuse the write with barrier
> operation (which is what the cciss driver seems to be doing
> starting in 2.6.29-rcX), ext4 will retry the write without
> the barrier, at which point we are vulnerable to the block
> layer reordering things at the I/O scheduler layer. In
> effect, you're claiming that every single write to cciss is
> implicitly a "barrier write" in that once it is received by
> the device, it is guaranteed not to be lost even if the power
> to the entire system is forcibly removed.

Of course, we can't cover all possible scenarios like the data center exploding or something crazy. But under _most_ circumstances the data will remain in cache for up to 72 hours of no power. So if there is a complete power outage the controller will write any cached data (in order) to the disks on the next power up.

-- mikem

> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>
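A note for readers without battery-backed write cache: ext3 in this era did not enable write barriers by default, so they usually had to be requested per mount. A minimal sketch; the device and mount point are illustrative, not from this thread:

mount -o barrier=1 /dev/sda1 /mnt       # enable barriers at mount time
mount -o remount,barrier=1 /mnt         # or on an already-mounted filesystem

If the underlying device cannot honor barriers, the kernel logs something along the lines of "JBD: barrier-based sync failed ... disabling barriers". ext4 enables barriers by default, which is part of why the cciss behavior discussed above surfaced there first.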
In > effect, you're claiming that every single write to cciss is > implicitly a "barrier write" in that once it is received by > the device, it is guaranteed not to be lost even if the power > to the entire system is forcibly removed. Of course, we can't cover all possible scenarios like the data center exploding or something crazy. But under _most_ circumstances the data will remain in cache for up to 72 hours of no power. So if there is a complete power outage the controller will write any cached data (in order) to the disks on the next power up. -- mikem > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From Curtis at GreenKey.net Wed Feb 4 02:23:23 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Tue, 3 Feb 2009 18:23:23 -0800 (PST) Subject: ext4 resize/fsck Message-ID: <20090204022324.07F876F06C@alopias.GreenKey.net> Horsing around with ext4 again...on F-10. This time a fsck was required after both an offline shrink and an online grow. Why? ----8<---- 13:22]stratus~# resize2fs -M -p /dev/foo/bar resize2fs 1.41.3 (12-Oct-2008) Please run 'e2fsck -f /dev/foo/bar' first. 13:22]stratus~# e2fsck -C0 -f /dev/foo/bar e2fsck 1.41.3 (12-Oct-2008) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information bar: 43186/172800 files (0.2% non-contiguous), 295511/1753088 blocks 13:24]stratus~# resize2fs -M -p /dev/foo/bar resize2fs 1.41.3 (12-Oct-2008) Resizing the filesystem on /dev/foo/bar to 426236 (4k) blocks. Begin pass 2 (max = 101383) Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Begin pass 3 (max = 54) Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Begin pass 4 (max = 5278) Updating inode references XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX The filesystem on /dev/foo/bar is now 426236 blocks long. 13:25]stratus~# fsck.ext4 -C0 -f /dev/foo/bar e2fsck 1.41.3 (12-Oct-2008) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -(17--18) -(33--34) -(2835--3234) Fix? yes Free blocks count wrong for group #0 (0, counted=404). Fix? yes Free blocks count wrong (138405, counted=138809). Fix? yes bar: ***** FILE SYSTEM WAS MODIFIED ***** bar: 43186/44800 files (0.2% non-contiguous), 287427/426236 blocks ----8<---- Then a bit later, I shrunk the lv to 4G, and then mounted the filesystem (just for fun), and finally online expanded the fs into it. ----8<---- 15:54]stratus~# lvreduce -L4G foo/bar WARNING: Reducing active and open logical volume to 4.00 GB THIS MAY DESTROY YOUR DATA (filesystem etc.) Do you really want to reduce bar? [y/n]: y Reducing logical volume bar to 4.00 GB Logical volume bar successfully resized 15:54]stratus~# resize2fs -p /dev/foo/bar resize2fs 1.41.3 (12-Oct-2008) Filesystem at /dev/foo/bar is mounted on /home; on-line resizing required old desc_blocks = 1, new_desc_blocks = 1 Performing an on-line resize of /dev/foo/bar to 1048576 (4k) blocks. The filesystem on /dev/foo/bar is now 1048576 blocks long. 
15:55]stratus~# umount /home
15:55]stratus~# fsck.ext4 -C0 -f /dev/foo/bar
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Directories count wrong for group #16 (62, counted=0).
Fix? yes
Directories count wrong for group #19 (12, counted=0).
Fix? yes
Directories count wrong for group #25 (1, counted=0).
Fix? yes
Directories count wrong for group #26 (2, counted=0).
Fix? yes
Directories count wrong for group #29 (38, counted=0).
Fix? yes

bar: ***** FILE SYSTEM WAS MODIFIED *****
bar: 43186/102400 files (0.2% non-contiguous), 291069/1048576 blocks
----8<----

Is this all normal? I can suppose a fsck is required after a shrink. But after an online expand, it seems odd.

Prior to this little experiment, the lv/ext4fs were at 6.7G. Shrinking brought it to 1.7G (with 1.1G in use). And obviously, I ended with 4G even.

../C

From sandeen at redhat.com Wed Feb 4 03:09:59 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 03 Feb 2009 21:09:59 -0600
Subject: ext4 resize/fsck
In-Reply-To: <20090204022324.07F876F06C@alopias.GreenKey.net>
References: <20090204022324.07F876F06C@alopias.GreenKey.net>
Message-ID: <49890707.4020007@redhat.com>

Curtis Doty wrote:
> Horsing around with ext4 again...on F-10.
>
> This time a fsck was required after both an offline shrink and an online
> grow. Why?

Could you please try again with 1.41.4 from rawhide (or koji: http://kojipkgs.fedoraproject.org/packages/e2fsprogs/1.41.4/2.fc11/ - might need to rebuild if there are any library dependency problems) and see if this persists or is fixed? Several resize fixes went into 1.41.4 that should take care of this.

I'll probably push 1.41.4 to f10 testing soon, if people are hitting these problems.

Thanks,
-Eric

From Curtis at GreenKey.net Wed Feb 4 03:38:27 2009
From: Curtis at GreenKey.net (Curtis Doty)
Date: Tue, 3 Feb 2009 19:38:27 -0800 (PST)
Subject: ext4 resize/fsck
In-Reply-To: <49890707.4020007@redhat.com>
References: <20090204022324.07F876F06C@alopias.GreenKey.net> <49890707.4020007@redhat.com>
Message-ID: <20090204033827.1E6646F06C@alopias.GreenKey.net>

9:09pm Eric Sandeen said:

> Curtis Doty wrote:
>> Horsing around with ext4 again...on F-10.
>>
>> This time a fsck was required after both an offline shrink and an online
>> grow. Why?
>
> Could you please try again with 1.41.4 from rawhide (or koji:
> http://kojipkgs.fedoraproject.org/packages/e2fsprogs/1.41.4/2.fc11/ -
> might need to rebuild if there are any library dependency problems) and
> see if this persists or is fixed? Several resize fixes went into 1.41.4
> that should take care of this.
>
> I'll probably push 1.41.4 to f10 testing soon, if people are hitting
> these problems.
>

Is this an improvement or luck? Fewer issues, but still a ghost dir.

19:30]stratus~# lvextend -L6G foo/bar
Extending logical volume bar to 6.00 GB
Logical volume bar successfully resized

19:31]stratus~# resize2fs -p /dev/foo/bar
resize2fs 1.41.4 (27-Jan-2009)
Filesystem at /dev/foo/bar is mounted on /home; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/foo/bar to 1572864 (4k) blocks.
The filesystem on /dev/foo/bar is now 1572864 blocks long.
19:31]stratus~# umount /home 19:32]stratus~# fsck.ext4 -C0 -f /dev/foo/bar e2fsck 1.41.4 (27-Jan-2009) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Directories count wrong for group #37 (1, counted=0). Fix? yes bar: ***** FILE SYSTEM WAS MODIFIED ***** bar: 43188/153600 files (0.1% non-contiguous), 294524/1572864 blocks ../C From sandeen at redhat.com Wed Feb 4 03:45:06 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 03 Feb 2009 21:45:06 -0600 Subject: ext4 resize/fsck In-Reply-To: <20090204033827.1E6646F06C@alopias.GreenKey.net> References: <20090204022324.07F876F06C@alopias.GreenKey.net> <49890707.4020007@redhat.com> <20090204033827.1E6646F06C@alopias.GreenKey.net> Message-ID: <49890F42.1060007@redhat.com> Curtis Doty wrote: > 9:09pm Eric Sandeen said: > >> Curtis Doty wrote: >>> Horsing around with ext4 again...on F-10. >>> >>> This time a fsck was required after both an offline shrink and an online >>> grow. Why? >> Could you please try again with 1.41.4 from rawhide (or koji: >> http://kojipkgs.fedoraproject.org/packages/e2fsprogs/1.41.4/2.fc11/ - >> might need to rebuild if there are any library dependency problems) and >> see if this persists or is fixed? Several resize fixes went into 1.41.4 >> that should take care of this. >> >> I'll probably push 1.41.4 to f10 testing soon, if people are hitting >> these problems. >> > > Is this an improvement or luck? Fewer issues, but still a ghost dir. > > 19:30]stratus~# lvextend -L6G foo/bar > Extending logical volume bar to 6.00 GB > Logical volume bar successfully resized > > 19:31]stratus~# resize2fs -p /dev/foo/bar > resize2fs 1.41.4 (27-Jan-2009) > Filesystem at /dev/foo/bar is mounted on /home; on-line resizing required > old desc_blocks = 1, new_desc_blocks = 1 > Performing an on-line resize of /dev/foo/bar to 1572864 (4k) blocks. > The filesystem on /dev/foo/bar is now 1572864 blocks long. > > 19:31]stratus~# umount /home > 19:32]stratus~# fsck.ext4 -C0 -f /dev/foo/bar > e2fsck 1.41.4 (27-Jan-2009) > Pass 1: Checking inodes, blocks, and sizes > Pass 2: Checking directory structure > Pass 3: Checking directory connectivity > Pass 4: Checking reference counts > Pass 5: Checking group summary information > Directories count wrong for group #37 (1, counted=0). > Fix? yes > > bar: ***** FILE SYSTEM WAS MODIFIED ***** > bar: 43188/153600 files (0.1% non-contiguous), 294524/1572864 blocks > > ../C > I hope it's an improvement ;) If you can reproduce it, you might capture an e2image of the fs prior to resize, and we could probably investigate the issue pretty easily... -Eric From Curtis at GreenKey.net Wed Feb 4 04:11:29 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Tue, 3 Feb 2009 20:11:29 -0800 (PST) Subject: ext4 resize/fsck In-Reply-To: <49890F42.1060007@redhat.com> References: <20090204022324.07F876F06C@alopias.GreenKey.net> <49890707.4020007@redhat.com> <20090204033827.1E6646F06C@alopias.GreenKey.net> <49890F42.1060007@redhat.com> Message-ID: <20090204041129.CC0046F06C@alopias.GreenKey.net> 9:45pm Eric Sandeen said: > Curtis Doty wrote: >> 9:09pm Eric Sandeen said: >> >>> Curtis Doty wrote: >>>> Horsing around with ext4 again...on F-10. >>>> >>>> This time a fsck was required after both an offline shrink and an online >>>> grow. Why? 
>>> Could you please try again with 1.41.4 from rawhide (or koji: >>> http://kojipkgs.fedoraproject.org/packages/e2fsprogs/1.41.4/2.fc11/ - >>> might need to rebuild if there are any library dependency problems) and >>> see if this persists or is fixed? Several resize fixes went into 1.41.4 >>> that should take care of this. >>> >>> I'll probably push 1.41.4 to f10 testing soon, if people are hitting >>> these problems. >>> >> >> Is this an improvement or luck? Fewer issues, but still a ghost dir. >> >> 19:30]stratus~# lvextend -L6G foo/bar >> Extending logical volume bar to 6.00 GB >> Logical volume bar successfully resized >> >> 19:31]stratus~# resize2fs -p /dev/foo/bar >> resize2fs 1.41.4 (27-Jan-2009) >> Filesystem at /dev/foo/bar is mounted on /home; on-line resizing required >> old desc_blocks = 1, new_desc_blocks = 1 >> Performing an on-line resize of /dev/foo/bar to 1572864 (4k) blocks. >> The filesystem on /dev/foo/bar is now 1572864 blocks long. >> >> 19:31]stratus~# umount /home >> 19:32]stratus~# fsck.ext4 -C0 -f /dev/foo/bar >> e2fsck 1.41.4 (27-Jan-2009) >> Pass 1: Checking inodes, blocks, and sizes >> Pass 2: Checking directory structure >> Pass 3: Checking directory connectivity >> Pass 4: Checking reference counts >> Pass 5: Checking group summary information >> Directories count wrong for group #37 (1, counted=0). >> Fix? yes >> >> bar: ***** FILE SYSTEM WAS MODIFIED ***** >> bar: 43188/153600 files (0.1% non-contiguous), 294524/1572864 blocks >> >> ../C >> > > I hope it's an improvement ;) > > If you can reproduce it, you might capture an e2image of the fs prior to > resize, and we could probably investigate the issue pretty easily... > Ak. Just re-shrunk offline and then re-grew online. With e2images in between each time. However, nothing was inconsistent these times! Could it be that one ghost dir was indeed missed by 1.41.3 and caught/cleaned by 1.41.4? The symptom appears gone here now. ../C From tytso at mit.edu Wed Feb 4 06:26:14 2009 From: tytso at mit.edu (Theodore Tso) Date: Wed, 4 Feb 2009 01:26:14 -0500 Subject: ext4 resize/fsck In-Reply-To: <20090204041129.CC0046F06C@alopias.GreenKey.net> References: <20090204022324.07F876F06C@alopias.GreenKey.net> <49890707.4020007@redhat.com> <20090204033827.1E6646F06C@alopias.GreenKey.net> <49890F42.1060007@redhat.com> <20090204041129.CC0046F06C@alopias.GreenKey.net> Message-ID: <20090204062614.GA14762@mit.edu> It might be fixed with this commit: commit fdff73f094e7220602cc3f8959c7230517976412 Author: Theodore Ts'o Date: Mon Jan 26 19:06:41 2009 -0500 ext4: Initialize the new group descriptor when resizing the filesystem Make sure all of the fields of the group descriptor are properly initialized. Previously, we allowed bg_flags field to be contain random garbage, which could trigger non-deterministic behavior, including a kernel OOPS. http://bugzilla.kernel.org/show_bug.cgi?id=12433 Signed-off-by: "Theodore Ts'o" Cc: stable at kernel.org The patch was merged with mainline shortly after 2.6.29-rc3. - Ted From puhuri at iki.fi Wed Feb 4 08:41:32 2009 From: puhuri at iki.fi (Markus Peuhkuri) Date: Wed, 04 Feb 2009 10:41:32 +0200 Subject: ext4 and unexpected eh_depth Message-ID: <498954BC.9050805@iki.fi> Hi, I'm running Debian lenny with linux-image-2.6.26-1-amd64 (deb 2.6.26-11). 
I have an LVM stripe over three SATA disks (3.5TB total) that is shared over NFS, and I am getting the following errors:

EXT4-fs error (device dm-0): ext4_ext_search_right: bad header in inode #269200: unexpected eh_depth - magic f30a, entries 18, max 340(0), depth 1(2)

A user is having errors concatenating large files (100+GB): basically the resulting file seems to be the right size and ends with the right data, but he nevertheless gets the following error:

cat: write error: Input/output error

on a system that has imported the partition over NFS. I'm not sure if the file he was accessing had the same inode. And once I got the BUG below. I cannot upgrade the system right now as there are some long-running analyses going, but I can do some tests at some point, and upgrade in a few days.

------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1161!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ext4dev jbd2 crc16 dag(P) ipv6 dm_mod dagmem(P) loop snd_pcm snd_timer snd soundcore snd_page_alloc intel_rng i2c_i801 rng_core i2c_core parport_pc parport pcspkr iTCO_wdt container shpchp pci_hotplug i5000_edac button edac_core evdev ext3 jbd mbcache sd_mod ahci libata scsi_mod dock floppy ehci_hcd uhci_hcd e1000e thermal processor fan thermal_sys
Pid: 3501, comm: nfsd Tainted: P 2.6.26-1-amd64 #1
RIP: 0010:[] [] :jbd2:jbd2_journal_dirty_metadata+0x5f/0xe3
RSP: 0018:ffff810009543c90 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81007cd38880 RCX: 00000000ffffffc0
RDX: ffff81001f9e74c0 RSI: ffff81007cd38880 RDI: ffff8100425833a8
RBP: ffff810045b7c490 R08: ffff810034a5a4d8 R09: ffffffffa024be70
R10: 000000000000005c R11: ffff81007cd38880 R12: ffff81003790c000
R13: ffff8100425833a8 R14: 00000000000020dc R15: 0000000000000000
FS: 00007f13c048e6e0(0000) GS:ffffffff8053b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8404606210 CR3: 00000000049a0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nfsd (pid: 3501, threadinfo ffff810009542000, task ffff81007f05e850)
Stack: 0000000000000000 ffff81000948af80 ffff8100425833a8 ffff81007cd38880
ffffffffa024be70 ffffffffa0242122 ffff810034a5a4d8 ffff81000948af80
ffff810005a34b80 ffff810004ca4c90 ffff81007d5c8c00 ffffffffa0230dab
Call Trace:
[] ? :ext4dev:__ext4_journal_dirty_metadata+0x1e/0x46
[] ? :ext4dev:ext4_free_inode+0x2b7/0x324
[] ? :ext4dev:ext4_delete_inode+0xb7/0xd5
[] ? :ext4dev:ext4_delete_inode+0x0/0xd5
[] ? generic_delete_inode+0xab/0x11f
[] ? d_delete+0x49/0xb1
[] ? vfs_unlink+0xe3/0x102
[] ? :nfsd:nfsd_unlink+0x1e9/0x267
[] ? :nfsd:nfsd3_proc_remove+0x9d/0xaa
[] ? :nfsd:nfsd_dispatch+0xde/0x1b6
[] ? :sunrpc:svc_process+0x408/0x6e9
[] ? __down_read+0x12/0xa1
[] ? :nfsd:nfsd+0x0/0x2a4
[] ? :nfsd:nfsd+0x194/0x2a4
[] ? schedule_tail+0x27/0x5c
[] ? child_rip+0xa/0x12
[] ? :nfsd:nfsd+0x0/0x2a4
[] ?
child_rip+0x0/0x12
Code: 03 25 00 00 20 00 48 85 c0 75 f1 f0 0f ba 2b 15 19 c0 85 c0 75 e8 83 7d 10 00 75 19 c7 45 10 01 00 00 00 41 8b 45 08 85 c0 7f 04 <0f> 0b eb fe ff c8 41 89 45 08 48 39 55 28 75 11 83 7d 0c 02 75
RIP [] :jbd2:jbd2_journal_dirty_metadata+0x5f/0xe3
RSP
---[ end trace 1336f55a961cc4ae ]---

# dumpe2fs /dev/work/wdata
dumpe2fs 1.41.3 (12-Oct-2008)
Filesystem volume name: 
Last mounted on: 
Filesystem UUID: 05867e70-54e2-48bf-8c67-5439e98c5982
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash test_filesystem
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 917536
Block count: 939525120
Reserved block count: 46976256
Free blocks: 507628950
Free inodes: 619199
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 799
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 32
Inode blocks per group: 2
Flex block group size: 16
Filesystem created: Thu Dec 11 10:36:48 2008
Last mount time: Mon Jan 26 13:05:07 2009
Last write time: Sat Jan 31 19:53:25 2009
Mount count: 1
Maximum mount count: 26
Last checked: Mon Jan 26 12:53:17 2009
Check interval: 15552000 (6 months)
Next check after: Sat Jul 25 13:53:17 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2714a303-cb2a-4bbc-8159-29bf52c617ca
Journal backup: inode blocks
Journal size: 128M

(rest of dumpe2fs output omitted: 32MiB; I can put it somewhere if needed)

t. Markus

From tytso at mit.edu Wed Feb 4 15:55:10 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 4 Feb 2009 10:55:10 -0500
Subject: ext4 and unexpected eh_depth
In-Reply-To: <498954BC.9050805@iki.fi>
References: <498954BC.9050805@iki.fi>
Message-ID: <20090204155510.GG14762@mit.edu>

On Wed, Feb 04, 2009 at 10:41:32AM +0200, Markus Peuhkuri wrote:
> Hi, I'm running Debian lenny with linux-image-2.6.26-1-amd64 (deb
> 2.6.26-11). I have an LVM stripe over three SATA disks (3.5TB total)
> that is shared over NFS, and I am getting the following errors:
>
> EXT4-fs error (device dm-0): ext4_ext_search_right: bad header in inode #269200: unexpected eh_depth - magic f30a, entries 18, max 340(0), depth 1(2)

I can't recall the patch which fixed this, but I'm 95% certain we've seen this before, and it's been fixed since 2.6.26; I think in 2.6.27 or 2.6.28. Note that there have been a *huge* number of bug fixes for ext4 since 2.6.26 and 2.6.27. If you must use such an old kernel I'd suggest moving to at least 2.6.27.x or 2.6.28.x after Greg pulls in the latest set of bug fixes. Critical bug fixes are still being back-ported to 2.6.27.x, although you won't see various performance improvements unless you track a much newer kernel. I'd suggest at least 2.6.28.y at this point.

- Ted

From Ralf.Hildebrandt at charite.de Thu Feb 5 12:58:48 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Thu, 5 Feb 2009 13:58:48 +0100
Subject: Questions regarding journal replay
Message-ID: <20090205125847.GR23918@charite.de>

Today I had to uncleanly shut down one of our machines due to an error in 2.6.28.3. During the boot sequence, the ext4 partition /home experienced a journal replay.
/home looks like this:

/dev/mapper/volg1-logv1 on /home type ext4 (rw,noexec,nodev,noatime,errors=remount-ro)

Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/volg1-logv1  2,4T  1,4T 1022G  58% /home

Filesystem               Inodes    IUsed    IFree    IUse% Mounted on
/dev/mapper/volg1-logv1  19519488  8793310  10726178   46% /home

The journal replay took quite a while. About 800 seconds.

# dumpe2fs -h /dev/mapper/volg1-logv1
dumpe2fs 1.41.3 (12-Oct-2008)
Filesystem volume name: 
Last mounted on: 
Filesystem UUID: 032613d3-6035-4872-bc0a-11db92feec5e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 19519488
Block count: 624605184
Reserved block count: 0
Free blocks: 267655114
Free inodes: 10726118
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 875
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 1024
Inode blocks per group: 32
Filesystem created: Tue May 8 21:04:31 2007
Last mount time: Thu Feb 5 11:08:27 2009
Last write time: Thu Feb 5 11:08:27 2009
Mount count: 12
Maximum mount count: -1
Last checked: Sat Dec 27 23:16:47 2008
Check interval: 0 ()
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
First orphan inode: 17529831
Default directory hash: tea
Directory Hash Seed: 44337061-e542-44bb-afb9-40597ccf1c6d
Journal backup: inode blocks
Journal size: 128M

Questions:
==========

* Why does it take so long?
* What happens during that time?
* Is my journal maybe too big?

-- 
Ralf Hildebrandt  Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin  Tel. +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk   Fax. +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de Thu Feb 5 13:23:02 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Thu, 5 Feb 2009 14:23:02 +0100
Subject: External journal with ext4
Message-ID: <20090205132301.GS23918@charite.de>

Can I still use:

mke2fs -O journal_dev /dev/journaldevice
tune2fs -O ^has_journal /dev/sda1
tune2fs -o journal_data -j -J device=/dev/journaldevice /dev/sda1

-- 
Ralf Hildebrandt  Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin  Tel. +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk   Fax. +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de Thu Feb 5 16:26:32 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Thu, 5 Feb 2009 17:26:32 +0100
Subject: External journal with ext4
In-Reply-To: <20090205132301.GS23918@charite.de>
References: <20090205132301.GS23918@charite.de>
Message-ID: <20090205162632.GF9737@charite.de>

* Ralf Hildebrandt :
> Can I still use:
>
> mke2fs -O journal_dev /dev/journaldevice
> tune2fs -O ^has_journal /dev/sda1
> tune2fs -j -J device=/dev/journaldevice /dev/sda1

Yes, one can.

-- 
Ralf Hildebrandt  Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin  Tel. +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk   Fax. +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin
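On the "is my journal maybe too big" question above: dumpe2fs -h already reports the size (128M here), and the internal journal can be recreated at a different size offline. A rough sketch, assuming the filesystem can be unmounted and that 32 MB is the desired size (the size value is illustrative):

umount /home
tune2fs -O ^has_journal /dev/mapper/volg1-logv1   # remove the old internal journal
e2fsck -f /dev/mapper/volg1-logv1                 # journal removal wants a clean fs
tune2fs -J size=32 /dev/mapper/volg1-logv1        # create a new 32 MB journal
mount /home

Note that replay time depends on how much of the journal held uncheckpointed transactions at crash time, so a smaller journal bounds the worst case rather than guaranteeing a fast replay.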
From Curtis at GreenKey.net Fri Feb 6 14:26:41 2009
From: Curtis at GreenKey.net (Curtis Doty)
Date: Fri, 6 Feb 2009 06:26:41 -0800 (PST)
Subject: Questions regarding journal replay
In-Reply-To: <20090205125847.GR23918@charite.de>
References: <20090205125847.GR23918@charite.de>
Message-ID: <20090206142641.9FE446F064@alopias.GreenKey.net>

Yesterday Ralf Hildebrandt said:

> The journal replay took quite a while. About 800 seconds.
>

Were there any other background iops on the underlying volume devices? Like maybe raid reconstruction?

../C

From Ralf.Hildebrandt at charite.de Fri Feb 6 14:28:22 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Fri, 6 Feb 2009 15:28:22 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090206142641.9FE446F064@alopias.GreenKey.net>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net>
Message-ID: <20090206142822.GE31519@charite.de>

* Curtis Doty :
> Yesterday Ralf Hildebrandt said:
>
>> The journal replay took quite a while.
>>
>
> Were there any other background iops on the underlying volume
> devices? Like maybe raid reconstruction?

I don't think so. The machine never powered off...

-- 
Ralf Hildebrandt  Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin  Tel. +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk   Fax. +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From joschi at fliegergruppe-donzdorf.de Mon Feb 9 12:47:02 2009
From: joschi at fliegergruppe-donzdorf.de (Jochen Rueter)
Date: Mon, 09 Feb 2009 13:47:02 +0100
Subject: un'stat'able files - fs corruption?
Message-ID: <499025C6.6070301@fliegergruppe-donzdorf.de>

Hello list,

I have some serious problems on my ext3 filesystem. Several folders contain files, which cannot be accessed in any way, not even a stat() on these files is possible:

[~]$ ls -l
-rwxrwxr-x 1 yvonne users 30208 2007-09-16 12:49 Stoffverteilungsplan tw kl4 07.doc
?--------- ? ? ? ? ? Teddyb?r.docx
?--------- ? ? ? ? ? Termine f?r montag kiga.doc
-rwxrwxr-x 1 yvonne users 28672 2001-11-18 17:29 tiere bei den indios.doc
[~]$ rm Teddy*
Teddyb?r.docx: No such file or directory

In this example, e.g. the file named Teddyb?r.docx has these problems. I know it has some non-printable characters in its filename, which seems to be somehow related to the problem; however, I have other files containing such characters which can be accessed fine when escaping the characters correctly. Also, those other files show up correctly in the output of 'ls', and stat() works on these files. Also, e2fsck does not find any errors on this filesystem.

Can anybody help me get rid of these files?

Thanks a lot,
Jochen

From forest at alittletooquiet.net Mon Feb 9 13:33:05 2009
From: forest at alittletooquiet.net (Forest Bond)
Date: Mon, 9 Feb 2009 08:33:05 -0500
Subject: un'stat'able files - fs corruption?
In-Reply-To: <499025C6.6070301@fliegergruppe-donzdorf.de>
References: <499025C6.6070301@fliegergruppe-donzdorf.de>
Message-ID: <20090209133304.GJ12167@storm.local.network>

Hi,

On Mon, Feb 09, 2009 at 01:47:02PM +0100, Jochen Rueter wrote:
> Hello list,
>
> I have some serious problems on my ext3 filesystem. Several folders
> contain files, which cannot be accessed in any way,
> not even a stat() on these files is possible:
>
> [~]$ ls -l
> -rwxrwxr-x 1 yvonne users 30208 2007-09-16 12:49 Stoffverteilungsplan
> tw kl4 07.doc
> ?--------- ? ? ? ? ? Teddyb?r.docx
>
> ?--------- ? ? ? ? ?
Termine f?r montag > kiga.doc > -rwxrwxr-x 1 yvonne users 28672 2001-11-18 17:29 tiere bei den indios.doc > [~]$ rm Teddy* > Teddyb?r.docx: No such file or directory [...] I recall having a similar problem with UTF-8 filenames when I took an external drive from a x86 machine and plugged it into a powerpc machine. After spending hours trying to fix this mysterious filesystem "corruption," I got home and plugged it into the original machine. Everything was back to normal. I speculated at the time that the byte order of the machine was somehow affecting filename encoding. In hindsight, this doesn't make a lot of sense to me (UTF-8 defines byte order). Some bug in userspace? I don't know. -Forest -- Forest Bond http://www.alittletooquiet.net http://www.pytagsfs.org -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: Digital signature URL: From lists at nerdbynature.de Thu Feb 12 06:35:36 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Wed, 11 Feb 2009 22:35:36 -0800 (PST) Subject: un'stat'able files - fs corruption? In-Reply-To: <499025C6.6070301@fliegergruppe-donzdorf.de> References: <499025C6.6070301@fliegergruppe-donzdorf.de> Message-ID: On Mon, 9 Feb 2009, Jochen Rueter wrote: > ?--------- ? ? ? ? ? Teddyb?r.docx > ?--------- ? ? ? ? ? Termine f?r montag [...] > Also e2fsck does not find any errors on this filesystem. Hm, strange that e2fsck (current version?) does not report any errors, because I somehow find it hard to believe that it should be related to the umlauts in the filename. Did you try moving the files via its inode#? Something like: # ls -li 1234 ?--------- ? ? ? ? ? Teddyb?r.docx # find . -inum 1234 -exec mv '{}' Teddybaer.docx \; ...does that work? Christian. -- BOFH excuse #102: Power company testing new voltage spike (creation) equipment From vegard at svanberg.no Thu Feb 12 09:54:40 2009 From: vegard at svanberg.no (Vegard Svanberg) Date: Thu, 12 Feb 2009 10:54:40 +0100 Subject: Fsck takes too long on multiply-claimed blocks Message-ID: <20090212095440.GG20749@svanberg.no> After a power failure, a ~500G filesystem crashed. Fsck has been running for days. The problem seems to be multiply-claimed blocks. Example: File /directory/file.name/foo (inode #1234567, mod time Tue Feb 10 08:14:40 2008) has 1800000 multiply-claimed block(s), shared with 1 file(s): /directory/file.name/bar (inode #1234567, mod time Wed Dec 1 15:30:00 2008) Clone multiply-claimed blocks? y This takes like forever, probably due to the large number of multiply-claimed blocks. This number can be from 6-2000000, where the slow ones are fixed quicky and the large ones takes hours/days. I was wondering if: - I can get a list of the impacted files/inodes - Wipe them with debugfs Is this safe? How do I do it? Fsck says it's 538 inodes with this problem. If I could get a file list and be able to wipe the inodes, I could restore the missing files from backup and get the machine online again quickly. Hints/tips? Thanks! -- Vegard Svanberg [*Takapa at IRC (EFnet)] From joschi at fliegergruppe-donzdorf.de Thu Feb 12 13:41:41 2009 From: joschi at fliegergruppe-donzdorf.de (Jochen Rueter) Date: Thu, 12 Feb 2009 14:41:41 +0100 Subject: un'stat'able files - fs corruption? In-Reply-To: References: <499025C6.6070301@fliegergruppe-donzdorf.de> Message-ID: <49942715.1040401@fliegergruppe-donzdorf.de> My e2fsck is version e2fsck 1.40-WIP (14-Nov-2006), which is included in debian lenny. 
Actually, ls seems even not to be able to determine the inode number: 21022170 -rwxrwxr-x 1 yvonne users 30208 2007-09-16 12:49 Stoffverteilungsplan tw kl4 07.doc ? ?--------- ? ? ? ? ? Termine f?r Dienstag kiga.doc Maybe it's worth noting that this is running on arm: 2.6.21 #1 PREEMPT Tue May 8 21:05:53 CEST 2007 armv5tel GNU/Linux Jochen Christian Kujau schrieb: > On Mon, 9 Feb 2009, Jochen Rueter wrote: > >> ?--------- ? ? ? ? ? Teddyb?r.docx >> ?--------- ? ? ? ? ? Termine f?r montag >> > [...] > >> Also e2fsck does not find any errors on this filesystem. >> > > Hm, strange that e2fsck (current version?) does not report any errors, > because I somehow find it hard to believe that it should be related to the > umlauts in the filename. Did you try moving the files via its inode#? > Something like: > > # ls -li > 1234 ?--------- ? ? ? ? ? Teddyb?r.docx > # find . -inum 1234 -exec mv '{}' Teddybaer.docx \; > > ...does that work? > > Christian. > From tytso at mit.edu Thu Feb 12 14:19:49 2009 From: tytso at mit.edu (Theodore Tso) Date: Thu, 12 Feb 2009 09:19:49 -0500 Subject: Fsck takes too long on multiply-claimed blocks In-Reply-To: <20090212095440.GG20749@svanberg.no> References: <20090212095440.GG20749@svanberg.no> Message-ID: <20090212141948.GB13040@mini-me.lan> On Thu, Feb 12, 2009 at 10:54:40AM +0100, Vegard Svanberg wrote: > After a power failure, a ~500G filesystem crashed. Fsck has been running > for days. The problem seems to be multiply-claimed blocks. Example: > > File /directory/file.name/foo (inode #1234567, mod time Tue Feb > 10 08:14:40 2008) > has 1800000 multiply-claimed block(s), shared with 1 file(s): > > /directory/file.name/bar > (inode #1234567, mod time Wed Dec 1 15:30:00 2008) > Clone multiply-claimed blocks? y > > This takes like forever, probably due to the large number of > multiply-claimed blocks. You are using a version of e2fsprogs/e2fsck newer than 1.28, right? If not, there's your problem; upgrade to something newer. Older e2fsck's had O(n**2) algorithms that made this very slow, causing this pass to be CPU-bound. It could be slow because of memory pressure issues; the data structures for keeping track of all of those blocks aren't small. >I was wondering if: > > - I can get a list of the impacted files/inodes Yes; you can; they were listed by e2fsck during pass 1B, actually: Look for entries like this: Pass 1B: Rescanning for multiply-claimed blocks Multiply-claimed block(s) in inode 12: 25 26 Multiply-claimed block(s) in inode 13: 25 26 57 58 Multiply-claimed block(s) in inode 14: 57 58 > - Wipe them with debugfs You could wipe them all out via debugfs's clri function, like this: debugfs -R "clri <12> <13> <14>" /dev/sdXX The angle brackets indicate that you are passing in an inode number, instead of a pathname; and I've left it as an exercise to the reader how to use your choice of tools (emacs, grep/awk, perl) to pull out the necessary inode numbers from e2fsck's Pass1B output. Then run e2fsck, and it will clear the resulting inodes. To get the filenames, do this first, before the clri command: debugfs -R "ncheck 12 13 14" /dev/sdXX (No angle brackets are needed because ncheck only takes inode numbers and converts them to pathnames.) > Is this safe? How do I do it? Fsck says it's 538 inodes with this > problem. If I could get a file list and be able to wipe the inodes, I > could restore the missing files from backup and get the machine online > again quickly. However, it's not strictly necessary to wipe all 538 inodes. 
It's likely that you only need to wipe approximately half of them. What happened is that somehow, the disk drive got confused and wrote data to the wrong location on disk. Or, the journal was corrupted (one of the reasons why ext4 has journal checksums) so inode table blocks got written to the wrong place on disk. So that means what you'll see is something like this:

Multiply-claimed block(s) in inode 32: 200 201 203
Multiply-claimed block(s) in inode 33: 210 211 212 213 214
Multiply-claimed block(s) in inode 34: 215 216 217 218
...
Multiply-claimed block(s) in inode 128: 200 201 203
Multiply-claimed block(s) in inode 129: 210 211 212 213 214
Multiply-claimed block(s) in inode 130: 215 216 217 218

You may not see 16 or 32 inodes in each group of duplicate inodes (there are 32 inodes in each 4k block, 16 inodes per 4k block if you are using 256 byte inodes), since some inodes may have been deleted or never allocated before. In any case, only one set of inodes will be correct; after you determine which one set seems correct given the mapping between pathnames and file contents, you can clri the other set. Or if that's too much effort, you can clri them all and recover them from backups....

- Ted
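For the "exercise to the reader" above, something like this is a workable starting point, assuming the e2fsck output was saved to a file named fsck.log (a hypothetical name) and that the Pass 1B lines look like Ted's examples:

# pull the inode numbers out of the Pass 1B report
awk '/Multiply-claimed block\(s\) in inode/ { gsub(":",""); print $5 }' fsck.log | sort -un > inodes

# map them to pathnames first
debugfs -R "ncheck $(tr '\n' ' ' < inodes)" /dev/sdXX

# then clear the chosen ones (angle brackets mean "inode number" to debugfs)
debugfs -R "clri $(sed 's/.*/<&>/' inodes | tr '\n' ' ')" /dev/sdXX

This is a sketch, not a tested recipe: review the generated list by hand before clearing anything, and note that a single command line will only comfortably hold a few hundred inode numbers.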
From ross at biostat.ucsf.edu Fri Feb 13 02:09:26 2009
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Thu, 12 Feb 2009 18:09:26 -0800
Subject: ext2_check_mount_point: No such file ...
Message-ID: <1234490966.4722.73.camel@iron.psg.net>

I am trying to shrink an ext3 filesystem mounted on top of software RAID. The ultimate goal is to shrink the RAID to make room for a new installation that will use LVM over RAID. I get the error in the subject, and wonder what I need to do to avoid it. Details follow:

Shrinking needs to be done offline, right?

After being unable to get Knoppix started I used the break=bottom to stop the boot process while I was still in the initrd. When I dismounted the filesystem I discovered my initrd didn't have the ext3 utilities. I remounted the file system and copied /sbin, /lib, and /etc onto my ramdisk. Without /etc, fsck complained about being unable to find fstab; afterwards, it ran.

However, both fsck and resize2fs complained
ext2_check_mount_point: No such file or directory while determining whether /dev/md1 is mounted.

Adding the -f flag did not help.

There is no /etc/mtab file.

I realize this is all pretty dodgy, but is there a way I can deal with this problem? What file or directory is it looking for?

-- 
Ross Boylan  wk: (415) 514-8146
185 Berry St #5700  ross at biostat.ucsf.edu
Dept of Epidemiology and Biostatistics  fax: (415) 514-8150
University of California, San Francisco
San Francisco, CA 94107-1739  hm: (415) 550-1062

From darkonc at gmail.com Fri Feb 13 08:10:19 2009
From: darkonc at gmail.com (Stephen Samuel)
Date: Fri, 13 Feb 2009 00:10:19 -0800
Subject: un'stat'able files - fs corruption?
In-Reply-To: <49942715.1040401@fliegergruppe-donzdorf.de>
References: <499025C6.6070301@fliegergruppe-donzdorf.de> <49942715.1040401@fliegergruppe-donzdorf.de>
Message-ID: <6cd50f9f0902130010n2b8b4485w41c476430acdef12@mail.gmail.com>

When you ran the fsck, did you force the check (-f)?
Usually fsck will refuse to run a full test if it thinks that the filesystem was unmounted cleanly or if it thinks that running the log will be sufficient cleanup. Also, have you moved this filesystem between machines (and, most notably, between architectures)? On Thu, Feb 12, 2009 at 5:41 AM, Jochen Rueter < joschi at fliegergruppe-donzdorf.de> wrote: > My e2fsck is version e2fsck 1.40-WIP (14-Nov-2006), which is included in > debian lenny. > Actually, ls seems even not to be able to determine the inode number: > > 21022170 -rwxrwxr-x 1 yvonne users 30208 2007-09-16 12:49 > Stoffverteilungsplan tw kl4 07.doc > ? ?--------- ? ? ? ? ? Termine f?r > Dienstag kiga.doc > > Maybe it's worth noting that this is running on arm: 2.6.21 #1 PREEMPT Tue > May 8 21:05:53 CEST 2007 armv5tel GNU/Linux > > Jochen > > Christian Kujau schrieb: > >> On Mon, 9 Feb 2009, Jochen Rueter wrote: >> >> >>> ?--------- ? ? ? ? ? Teddyb?r.docx >>> ?--------- ? ? ? ? ? Termine f?r montag >>> >>> >> [...] >> >> >>> Also e2fsck does not find any errors on this filesystem. >>> >>> >> -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeen at redhat.com Fri Feb 13 16:39:59 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 13 Feb 2009 10:39:59 -0600 Subject: ext2_check_mount_point: No such file ... In-Reply-To: <1234490966.4722.73.camel@iron.psg.net> References: <1234490966.4722.73.camel@iron.psg.net> Message-ID: <4995A25F.5020709@redhat.com> Ross Boylan wrote: > I am trying to shrink an ext3 filesystem mounted on top of software > RAID. The ultimate goal is to shrink the RAID to make room for a new > installation that will use LVM over RAID. I get the error in the > subject, and wonder what I need to do to avoid it. Details follow: > > Shrinking needs to be done offline, right? > > After being unable to get Knoppix started I used the break=bottom to > stop the boot process while I was still in the initrd. When I > dismounted the filesystem I discovered my initrd didn't have the ext3 > utilities. I remount the file system and copied /sbin, /lib, and /etc > onto my ramdisk. Without /etc fsck complained about being unable to > find fstab; afterwords, it ran. > > However, both fsck and resize2fs complained > ext2_check_mount_point: No such file or directory while determining > whether /dev/md1 is mounted. > > Adding the -f flag did not help. > > There is no /etc/mtab file. > > I realize this is all pretty dodgy, but is there a way I can deal with > this problem? What file or directory is it looking for? try "cp /proc/mounts to /etc/mtab" perhaps. Maybe the tools should check both (if they don't already, I haven't actually checked yet) :) -eric From tytso at mit.edu Sat Feb 14 14:36:42 2009 From: tytso at mit.edu (Theodore Tso) Date: Sat, 14 Feb 2009 09:36:42 -0500 Subject: ext2_check_mount_point: No such file ... In-Reply-To: <4995A25F.5020709@redhat.com> References: <1234490966.4722.73.camel@iron.psg.net> <4995A25F.5020709@redhat.com> Message-ID: <20090214143642.GF26628@mini-me.lan> On Fri, Feb 13, 2009 at 10:39:59AM -0600, Eric Sandeen wrote: > > However, both fsck and resize2fs complained > > ext2_check_mount_point: No such file or directory while determining > > whether /dev/md1 is mounted. > > > > Adding the -f flag did not help. > > > > There is no /etc/mtab file. > > > > I realize this is all pretty dodgy, but is there a way I can deal with > > this problem? What file or directory is it looking for? 
> > try "cp /proc/mounts to /etc/mtab" perhaps. Maybe the tools should > check both (if they don't already, I haven't actually checked yet) :) The tools do check both already. So the easist solution is mount -t proc proc /proc - Ted From ross at biostat.ucsf.edu Sat Feb 14 18:40:14 2009 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Sat, 14 Feb 2009 10:40:14 -0800 Subject: ext2_check_mount_point: No such file ... In-Reply-To: <4995A25F.5020709@redhat.com> References: <1234490966.4722.73.camel@iron.psg.net> <4995A25F.5020709@redhat.com> Message-ID: <1234636814.19527.17.camel@corn.betterworld.us> On Fri, 2009-02-13 at 10:39 -0600, Eric Sandeen wrote: > Ross Boylan wrote: > > I am trying to shrink an ext3 filesystem mounted on top of software > > RAID. The ultimate goal is to shrink the RAID to make room for a new > > installation that will use LVM over RAID. I get the error in the > > subject, and wonder what I need to do to avoid it. Details follow: > > > > Shrinking needs to be done offline, right? > > > > After being unable to get Knoppix started I used the break=bottom to > > stop the boot process while I was still in the initrd. When I > > dismounted the filesystem I discovered my initrd didn't have the ext3 > > utilities. I remount the file system and copied /sbin, /lib, and /etc > > onto my ramdisk. Without /etc fsck complained about being unable to > > find fstab; afterwords, it ran. > > > > However, both fsck and resize2fs complained > > ext2_check_mount_point: No such file or directory while determining > > whether /dev/md1 is mounted. > > > > Adding the -f flag did not help. > > > > There is no /etc/mtab file. > > > > I realize this is all pretty dodgy, but is there a way I can deal with > > this problem? What file or directory is it looking for? > > try "cp /proc/mounts to /etc/mtab" perhaps. Maybe the tools should > check both (if they don't already, I haven't actually checked yet) :) Thanks; that worked. Unfortunately, after resizing the filesystem I had trouble resizing the underlying partitions. I backed up, and now I'm going to do a fresh install. From sandeen at redhat.com Sat Feb 14 20:09:30 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Sat, 14 Feb 2009 14:09:30 -0600 Subject: ext2_check_mount_point: No such file ... In-Reply-To: <1234636814.19527.17.camel@corn.betterworld.us> References: <1234490966.4722.73.camel@iron.psg.net> <4995A25F.5020709@redhat.com> <1234636814.19527.17.camel@corn.betterworld.us> Message-ID: <499724FA.3050703@redhat.com> Ross Boylan wrote: > On Fri, 2009-02-13 at 10:39 -0600, Eric Sandeen wrote: >> Ross Boylan wrote: ... >>> There is no /etc/mtab file. >>> >>> I realize this is all pretty dodgy, but is there a way I can deal with >>> this problem? What file or directory is it looking for? >> try "cp /proc/mounts to /etc/mtab" perhaps. Maybe the tools should >> check both (if they don't already, I haven't actually checked yet) :) > > Thanks; that worked. Unfortunately, after resizing the filesystem I had > trouble resizing the underlying partitions. I backed up, and now I'm > going to do a fresh install. Just to double check; did you have to mount /proc first? 
From adilger at sun.com Tue Feb 17 20:47:35 2009
From: adilger at sun.com (Andreas Dilger)
Date: Tue, 17 Feb 2009 13:47:35 -0700
Subject: Fsck takes too long on multiply-claimed blocks
In-Reply-To: <20090212141948.GB13040@mini-me.lan>
References: <20090212095440.GG20749@svanberg.no> <20090212141948.GB13040@mini-me.lan>
Message-ID: <20090217204735.GC3199@webber.adilger.int>

On Feb 12, 2009 09:19 -0500, Theodore Ts'o wrote:
> On Thu, Feb 12, 2009 at 10:54:40AM +0100, Vegard Svanberg wrote:
> > After a power failure, a ~500G filesystem crashed. Fsck has been running
> > for days. The problem seems to be multiply-claimed blocks. Example:
> >
> > File /directory/file.name/foo (inode #1234567, mod time Tue Feb
> > 10 08:14:40 2008)
> > has 1800000 multiply-claimed block(s), shared with 1 file(s):
> >
> > /directory/file.name/bar
> > (inode #1234567, mod time Wed Dec 1 15:30:00 2008)
> > Clone multiply-claimed blocks? y
> >
> > This takes like forever, probably due to the large number of
> > multiply-claimed blocks.
>
> You are using a version of e2fsprogs/e2fsck newer than 1.28, right?
> If not, there's your problem; upgrade to something newer. Older
> e2fsck's had O(n**2) algorithms that made this very slow, causing this
> pass to be CPU-bound. It could be slow because of memory pressure
> issues; the data structures for keeping track of all of those blocks
> aren't small.

The "inode badness" patch in the Lustre e2fsprogs does a reasonably good job at handling this. It will automatically mark one/both of these inodes as "fatally corrupted" and delete it/them. That will not happen if only a handful of blocks are shared, so it would not delete files in cases with, e.g., simple bitflips and such.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From darkonc at gmail.com Fri Feb 20 08:29:37 2009
From: darkonc at gmail.com (Stephen Samuel)
Date: Fri, 20 Feb 2009 00:29:37 -0800
Subject: fast builds for ext3 filesystems.
Message-ID: <6cd50f9f0902200029m3e56d030w6b8e1f06fae2b02d@mail.gmail.com>

I'm investigating ways of doing fast builds on a system. The machines that we're building are essentially identical, but the hardware is just short of random. (We rebuild systems from donated machines for donation to non-profits, and thrift-store sales.) Currently we use the OEM install process, but I'm having problems with the current system, so I decided to implement my idea for a fast build process.

I've got a build that was copied onto a 6GB partition, then I made a partimage backup of the system. On new systems, I restore the 6GB partition onto the (almost always larger) partition on the new disk (might be between 15GB and 80GB), then use resize2fs to fit the filesystem into the new partition. The last thing I do is run a script to reset the UUIDs for fstab and grub.

Question is: what are the disadvantages of using partimage to install the new system? I'm thinking that the only real disadvantage would be performance problems associated with the placement of OS data on the expanded filesystem. How bad would that be, and are there other issues to look at?

My script to reset the UUIDs on the new system is below. Am I missing any critical locations for changing the UUID?

=================
# presumes that the mounted filesystem for /dev/sdXX is at /tmp/sdXX
rootfs=/dev/sda8
swapfs=/dev/sda6
rootdev=${rootfs/%[0-9]/}
rootdev=${rootdev/%[0-9]/}  # strip a second digit for 2-digit partition numbers
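# Caveat not in the original posting: partimage clones the swap signature
# too, so at this point $swapfs still carries the image's UUID and the
# new_sw_uuid read below would match the old one. Re-initializing swap
# first -- e.g.:
#   mkswap $swapfs
# -- regenerates the UUID before vol_id reads it (untested sketch).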
grub-install --root-directory=/tmp/${rootfs#/dev/} $rootdev
tune2fs -U random $rootfs

fs_uuid=05ea19df-a029-4fb3-9ef7-2c497e641a60
sw_uuid=675bf141-9964-4593-9a29-2c0d40c129d5

cd /tmp/${rootfs#/dev/}
new_fs_uuid=`vol_id --uuid $rootfs`
new_sw_uuid=`vol_id --uuid $swapfs`
sed -i "s/$fs_uuid/$new_fs_uuid/g;s/$sw_uuid/$new_sw_uuid/g" /tmp/${rootfs#/dev/}/etc/fstab
sed -i "s/$fs_uuid/$new_fs_uuid/g;s/$sw_uuid/$new_sw_uuid/g" /tmp/${rootfs#/dev/}/boot/grub/menu.lst

# clear out ethN udev cache
sed -i '/^# PCI device /,$d' /tmp/${rootfs#/dev/}/etc/udev/rules.d/70-persistent-net.rules
=========================

-- 
Stephen Samuel  http://www.bcgreen.com  778-861-7641

From ross at biostat.ucsf.edu Fri Feb 20 19:13:09 2009
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Fri, 20 Feb 2009 11:13:09 -0800
Subject: advice on partitioning
Message-ID: <1235157189.18050.13.camel@iron.psg.net>

I use Cyrus, which writes each email message to a separate file on disk. I have a lot of mail. Is it OK to put this on the general /var partition, or are the optimal parameters so different that it really should be separate?

I've encrypted /var, so making another encrypted partition means another prompt to deal with on startup. I'm hoping to avoid that. Either way, everything would be on the same physical disks (software RAID1).

I'd also appreciate any hints about which parameters I should tune.

Thanks.

-- 
Ross Boylan  wk: (415) 514-8146
185 Berry St #5700  ross at biostat.ucsf.edu
Dept of Epidemiology and Biostatistics  fax: (415) 514-8150
University of California, San Francisco
San Francisco, CA 94107-1739  hm: (415) 550-1062

From lakshmipathi.g at gmail.com Sun Feb 22 11:19:33 2009
From: lakshmipathi.g at gmail.com (lakshmi pathi)
Date: Sun, 22 Feb 2009 16:49:33 +0530
Subject: new features in giis4.4
Message-ID: 

Hi,

giis 4.4 has the following features included:

1) Can recover files deleted on a specific date, or deleted before or after a specific date, or even within a specific date range.
2) Files can be recovered with their original access permission types and file owner and group details.
3) A user-friendly configuration file was added, which supports adding new directories even after installation.
4) Large directories are supported.

If you have any issues/comments, please let me know.

Download url: www.giis.co.in/

Cheers,
Lakshmipathi.G

From magawake at gmail.com Tue Feb 24 04:36:24 2009
From: magawake at gmail.com (Mag Gam)
Date: Mon, 23 Feb 2009 23:36:24 -0500
Subject: newbie filesystem question
Message-ID: <1cbd6f830902232036k30612e7bw3262f81134fcd4b2@mail.gmail.com>

Since there are experts here, I thought this would be the best place to ask the question:

As I understand it, ext2 and ext3 preallocate inodes when a filesystem is being created. It basically writes "zeros" to the volume. (Please correct me if I am wrong.) Once the filesystem is created it creates an inode table which keeps all the inode information. The inode table changes when there are changes on the filesystem (I/O).

I was wondering: how come some other filesystems have a dynamic inode table, where you can have an infinite number of inodes?

Sorry if this is a dumb question. I am trying to learn some Unix basics.

TIA
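Since ext2/3 really do size the inode table once at mkfs time, the premise above is easy to check from user space. A small sketch; /dev/sdb1 is a purely illustrative scratch device:

mke2fs -j -N 500000 /dev/sdb1                    # choose the inode count at creation
dumpe2fs -h /dev/sdb1 | grep -i 'inode count'    # this number never changes afterwards
df -i /mnt                                       # once mounted: IUsed grows, Inodes does not

mke2fs does initialize those inode table blocks up front (one reason mkfs of a big ext3 filesystem takes a while; ext4's uninit_bg feature later relaxed this), while filesystems such as XFS allocate inode clusters on demand, so their inode count is bounded only by free space rather than fixed at creation.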
From jjneely at ncsu.edu  Wed Feb 25 16:13:37 2009
From: jjneely at ncsu.edu (Jack Neely)
Date: Wed, 25 Feb 2009 11:13:37 -0500
Subject: ext3 kernel panic
Message-ID: <20090225161337.GJ14821@virge.linuxczar.net>

Folks,

I have a RHEL 3 production Cyrus IMAP server that has kernel panicked
twice in as many weeks with something similar to the below. The /imap
partition (where the IO was in question) is about 98% full with about
6.2G free and is hosted on a fiber-connected EMC Clariion LUN.
Currently running kernel 2.4.21-15.0.4.ELsmp.

I know things are old here, and the machine is on the upgrade list, but
I need to do due diligence to figure out how serious this error is.
Does anyone have any advice why this is happening?

Thanks,
Jack Neely

Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
f88d2863
*pde = 299d3001
*pte = 00000000
Oops: 0002
audit autofs openafs e1000 iptable_filter ip_tables floppy microcode emcphr emcpmpap emcpmpaa emcpmpc emcpmp sg emcp emcpsf loop lvm-mod keybdev mousedev hid
CPU:    0
EIP:    0060:[]    Tainted: P
EFLAGS: 00010213

EIP is at ext3_orphan_del [ext3] 0x73 (2.4.21-15.0.4.ELsmp/i686)
eax: c520d360   ebx: e3fc8900   ecx: e3fc8aac   edx: 00000000
esi: 00000000   edi: c520d000   ebp: c520d360   esp: eed8fddc
ds: 0068   es: 0068   ss: 0068
Process imapd (pid: 7628, stackpage=eed8f000)
Stack: e4de9940 eed8e000 c69b6200 e9ffde00 f88b8445 000f8094 c520d0fc e1ee4e80
       00000007 00000292 e3fc8900 00000000 eed8e000 f88cc72c c69b6200 e3fc8900
       e4de9940 eed8e000 e9ffde00 f88cc84c e4de9940 e3fc8900 ee6d7500 e3fc8900
Call Trace:
[] journal_start_Rsmp_25661df5 [jbd] 0xa5 (0xeed8fdec)
[] start_transaction [ext3] 0x8c (0xeed8fe10)
[] ext3_delete_inode [ext3] 0x8c (0xeed8fe28)
[] ext3_delete_inode [ext3] 0x0 (0xeed8fe3c)
[] iput [kernel] 0x150 (0xeed8fe44)
[] dput [kernel] 0xca (0xeed8fe60)
[] __fput [kernel] 0xbb (0xeed8fe74)
[] filp_close [kernel] 0x8e (0xeed8fe90)
[] put_files_struct [kernel] 0x6c (0xeed8feac)
[] do_exit [kernel] 0x1ba (0xeed8fec8)
[] do_group_exit [kernel] 0x8b (0xeed8fee4)
[] get_signal_to_deliver [kernel] 0x20b (0xeed8fef8)
[] do_signal [kernel] 0x64 (0xeed8ff20)
[] sys_select [kernel] 0x296 (0xeed8ff60)

Code: 89 42 04 89 10 c7 41 04 00 00 00 00 89 8b ac 01 00 00 89 8b

Kernel panic: Fatal exception
--
Jack Neely
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89

From sandeen at redhat.com  Wed Feb 25 16:22:46 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 10:22:46 -0600
Subject: ext3 kernel panic
In-Reply-To: <20090225161337.GJ14821@virge.linuxczar.net>
References: <20090225161337.GJ14821@virge.linuxczar.net>
Message-ID: <49A57056.4050506@redhat.com>

Jack Neely wrote:
> Folks,
>
> I have a RHEL 3 production Cyrus IMAP server that has kernel panicked
> twice in as many weeks with something similar to the below. The /imap
> partition (where the IO was in question) is about 98% full with about
> 6.2G free and is hosted on a fiber-connected EMC Clariion LUN.
> Currently running kernel 2.4.21-15.0.4.ELsmp.
>
> I know things are old here, and the machine is on the upgrade list, but
> I need to do due diligence to figure out how serious this error is.
> Does anyone have any advice why this is happening?
>
> Thanks,
> Jack Neely

That's not only an old distro, but an un-updated installation.
I'd get the latest RHEL3 kernel and peruse the changelog, for starters,
to see if it looks like this might have been fixed (although I don't see
anything offhand).

-Eric

From Ralf.Hildebrandt at charite.de  Wed Feb 25 16:24:26 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 17:24:26 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090206142822.GE31519@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de>
Message-ID: <20090225162426.GA26291@charite.de>

* Ralf Hildebrandt :
> * Curtis Doty :
> > Yesterday Ralf Hildebrandt said:
> >
> >> The journal replay took quite a while. About 800 seconds.
> >
> > Were there any other background iops on the underlying volume
> > devices? Like maybe raid reconstruction?
>
> I don't think so. The machine never powered off...

Again, 2.6.28.7 failed us and now we're encountering another journal
replay. Taking ages. This sucks.

Questions:

How can I find out (during normal operation) HOW MUCH of the
journal is actually in use?

How can I resize the journal to be smaller, thus making a journal
replay faster?

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin
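Neither question has a perfect answer from userspace, as far as I know:
there is no interface that reports how full the journal currently is
during normal operation, but the journal's size is easy to inspect, and
the journal can be recreated at a different size on an unmounted, clean
filesystem. A rough sketch (the device name is a placeholder, and whether
a smaller journal actually shortens replay is exactly what this thread
goes on to question):

----8<----
# show the journal parameters the filesystem was created with
dumpe2fs -h /dev/XXX | grep -i journal

# recreate the journal at a smaller size: remove, fsck, re-add
umount /dev/XXX
tune2fs -O ^has_journal /dev/XXX
e2fsck -f /dev/XXX
tune2fs -j -J size=32 /dev/XXX
----8<----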
From sandeen at redhat.com  Wed Feb 25 16:31:42 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 10:31:42 -0600
Subject: Questions regarding journal replay
In-Reply-To: <20090225162426.GA26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de>
Message-ID: <49A5726E.6030703@redhat.com>

Ralf Hildebrandt wrote:
> * Ralf Hildebrandt :
>> * Curtis Doty :
>>> Yesterday Ralf Hildebrandt said:
>>>
>>>> The journal replay took quite a while. About 800 seconds.
>>>
>>> Were there any other background iops on the underlying volume
>>> devices? Like maybe raid reconstruction?
>>
>> I don't think so. The machine never powered off...
>
> Again, 2.6.28.7 failed us and now we're encountering another journal
> replay. Taking ages. This sucks.
>
> Questions:
>
> How can I find out (during normal operation) HOW MUCH of the
> journal is actually in use?
>
> How can I resize the journal to be smaller, thus making a journal
> replay faster?

It'd be better to get to the bottom of the problem ... maybe iostat
while it's happening to see if IO is actually happening; run blktrace to
see where IO is going, do a few sysrq-t's to see where threads are at, etc.

Can you find a way to reproduce this at will?

Journal replay should *never* take this long, AFAIK.

-Eric

From Ralf.Hildebrandt at charite.de  Wed Feb 25 17:20:36 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 18:20:36 +0100
Subject: dumpe2fs and external journal: Illegal inode number while reading journal inode
Message-ID: <20090225172036.GD26291@charite.de>

I created an ext4 fs with an external journal.
I wanted to check how big the journal was, and tried:

# dumpe2fs -h /dev/mapper/volg1-logv1
dumpe2fs 1.41.3 (12-Oct-2008)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          032613d3-6035-4872-bc0a-11db92feec5e
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              19519488
Block count:              624605184
Reserved block count:     0
Free blocks:              257817321
Free inodes:              10481629
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      875
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         1024
Inode blocks per group:   32
Filesystem created:       Tue May  8 21:04:31 2007
Last mount time:          Wed Feb 25 18:01:47 2009
Last write time:          Wed Feb 25 18:01:47 2009
Mount count:              19
Maximum mount count:      -1
Last checked:             Sat Dec 27 23:16:47 2008
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal UUID:             1a7063f5-8965-40f2-9feb-e37d6ac467e9
Journal device:           0x6806
First orphan inode:       622943
Default directory hash:   tea
Directory Hash Seed:      44337061-e542-44bb-afb9-40597ccf1c6d
Journal backup:           inode blocks
dumpe2fs: Illegal inode number while reading journal inode

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin
From Ralf.Hildebrandt at charite.de  Wed Feb 25 17:23:34 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 18:23:34 +0100
Subject: Questions regarding journal replay
In-Reply-To: <49A5726E.6030703@redhat.com>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com>
Message-ID: <20090225172334.GF26291@charite.de>

* Eric Sandeen :
> Ralf Hildebrandt wrote:
> > * Ralf Hildebrandt :
> >> * Curtis Doty :
> >>> Yesterday Ralf Hildebrandt said:
> >>>
> >>>> The journal replay took quite a while. About 800 seconds.
> >>>
> >>> Were there any other background iops on the underlying volume
> >>> devices? Like maybe raid reconstruction?
> >> I don't think so. The machine never powered off...
> >
> > Again, 2.6.28.7 failed us and now we're encountering another journal
> > replay. Taking ages. This sucks.
> >
> > Questions:
> >
> > How can I find out (during normal operation) HOW MUCH of the
> > journal is actually in use?
> >
> > How can I resize the journal to be smaller, thus making a journal
> > replay faster?
>
> It'd be better to get to the bottom of the problem ... maybe iostat
> while it's happening to see if IO is actually happening; run blktrace to
> see where IO is going, do a few sysrq-t's to see where threads are at, etc.

We had 24GB of reading from the journal device (or 12GB if it's
512-byte blocks). I wonder why?

> Can you find a way to reproduce this at will?

Yes. My users will kill me, though.

> Journal replay should *never* take this long, AFAIK.

Amen

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From tytso at mit.edu  Wed Feb 25 17:34:59 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 12:34:59 -0500
Subject: Questions regarding journal replay
In-Reply-To: <49A5726E.6030703@redhat.com>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com>
Message-ID: <20090225173459.GO7064@mit.edu>

On Wed, Feb 25, 2009 at 10:31:42AM -0600, Eric Sandeen wrote:
>
> It'd be better to get to the bottom of the problem ... maybe iostat
> while it's happening to see if IO is actually happening; run blktrace to
> see where IO is going, do a few sysrq-t's to see where threads are at, etc.
>
> Can you find a way to reproduce this at will?
>
> Journal replay should *never* take this long, AFAIK.

Indeed. The journal is 128 megs, as I recall. So even if the journal
was completely full, if it's taking 800 seconds, that's a write rate
of 0.16 MB/s (164 KB/second). That is indeed way too slow.

I assume this wasn't your boot partition, so the journal replay was
being done by e2fsck, right? Or are you guys skipping e2fsck and the
journal replay was happening when you mounted the partition?

If the journal replay is happening via e2fsck, is fsck running any
other filesystem checks in parallel?

Also, what is the geometry of your raid? How many disks, what RAID
level, and what is the chunk size?
The journal replay is done a filesystem block at a time, so it could be
that it's turning into a large number of read-modify-writes, which is
trashing your performance if the chunk size is really large.

The other thing that might explain the performance problem is if somehow
the number of multiple outstanding requests allowed by the hard drive
has been clamped down to a very small number, and so a large number of
small read/write requests is really killing performance. The system
dmesg log might have some hidden clues about that.

- Ted

From sandeen at redhat.com  Wed Feb 25 17:36:25 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 11:36:25 -0600
Subject: Questions regarding journal replay
In-Reply-To: <20090225172334.GF26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de>
Message-ID: <49A58199.2060101@redhat.com>

Ralf Hildebrandt wrote:
> * Eric Sandeen :
>> Ralf Hildebrandt wrote:
>>> * Ralf Hildebrandt :
>>>> * Curtis Doty :
>>>>> Yesterday Ralf Hildebrandt said:
>>>>>
>>>>>> The journal replay took quite a while. About 800 seconds.
>>>>>
>>>>> Were there any other background iops on the underlying volume
>>>>> devices? Like maybe raid reconstruction?
>>>> I don't think so. The machine never powered off...
>>> Again, 2.6.28.7 failed us and now we're encountering another journal
>>> replay. Taking ages. This sucks.
>>>
>>> Questions:
>>>
>>> How can I find out (during normal operation) HOW MUCH of the
>>> journal is actually in use?
>>>
>>> How can I resize the journal to be smaller, thus making a journal
>>> replay faster?
>>>
>> It'd be better to get to the bottom of the problem ... maybe iostat
>> while it's happening to see if IO is actually happening; run blktrace to
>> see where IO is going, do a few sysrq-t's to see where threads are at, etc.
>
> We had 24GB of reading from the journal device (or 12GB if it's
> 512-byte blocks). I wonder why?

24GB of reading from the journal device (during that 800s of replay
during mount?), and your journal is 128M ... well that's odd.

You say journal device; is this an external journal? I didn't think so
from your first email, but is it?

>> Can you find a way to reproduce this at will?
>
> Yes. My users will kill me, though.

No spare box, eh :(

>> Journal replay should *never* take this long, AFAIK.
>
> Amen

so let's figure it out :)

-Eric

From Ralf.Hildebrandt at charite.de  Wed Feb 25 17:39:07 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 18:39:07 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090225173459.GO7064@mit.edu>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu>
Message-ID: <20090225173907.GG26291@charite.de>

* Theodore Tso :
> On Wed, Feb 25, 2009 at 10:31:42AM -0600, Eric Sandeen wrote:
> >
> > It'd be better to get to the bottom of the problem ... maybe iostat
> > while it's happening to see if IO is actually happening; run blktrace to
> > see where IO is going, do a few sysrq-t's to see where threads are at, etc.
> >
> > Can you find a way to reproduce this at will?
> >
> > Journal replay should *never* take this long, AFAIK.
>
> Indeed. The journal is 128 megs, as I recall.
> So even if the journal was completely full, if it's taking 800 seconds,
> that's a write rate of 0.16 MB/s (164 KB/second). That is indeed way
> too slow.

The problem seems to be with the external journal which I recently
changed to. It's a 32GB partition. My timings seem to indicate that
ALL OF IT was being replayed.

> I assume this wasn't your boot partition, so the journal replay was
> being done by e2fsck, right?

Yes

> Or are you guys skipping e2fsck and the journal replay was happening
> when you mounted the partition?

Both. We tried both ways :)

> If the journal replay is happening via e2fsck, is fsck running any
> other filesystem checks in parallel?

No, it's running alone.

> Also, what is the geometry of your raid? How many disks, what RAID
> level, and what is the chunk size? The journal replay is done a
> filesystem block at a time, so it could be that it's turning into a
> large number of read-modify-writes, which is trashing your performance
> if the chunk size is really large.

The RAID is made up from one logical volume, consisting of two drives
sda and sdb, each containing 6 disks in a hardware RAID5 setup.

> The other thing that might explain the performance problem is if
> somehow the number of multiple outstanding requests allowed by the
> hard drive has been clamped down to a very small number, and so a
> large number of small read/write requests is really killing
> performance. The system dmesg log might have some hidden clues about
> that.

dmesg is silent

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de  Wed Feb 25 17:40:38 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 18:40:38 +0100
Subject: Questions regarding journal replay
In-Reply-To: <49A58199.2060101@redhat.com>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com>
Message-ID: <20090225174038.GH26291@charite.de>

* Eric Sandeen :
> >> It'd be better to get to the bottom of the problem ... maybe iostat
> >> while it's happening to see if IO is actually happening; run blktrace to
> >> see where IO is going, do a few sysrq-t's to see where threads are at, etc.
> >
> > We had 24GB of reading from the journal device (or 12GB if it's
> > 512-byte blocks). I wonder why?
>
> 24GB of reading from the journal device (during that 800s of replay
> during mount?), and your journal is 128M ... well that's odd.

After my initial report I removed the journal and created an external
journal on a 32GB partition. Hoping it would be faster, since
accoriding to the docs. the journal size is limited to 128MB.

> You say journal device; is this an external journal? I didn't think so
> from your first email, but is it?

It is now.
# dumpe2fs -h /dev/cciss/c0d0p6
dumpe2fs 1.41.3 (12-Oct-2008)
Filesystem volume name:   journal_device
Last mounted on:          <not available>
Filesystem UUID:          1a7063f5-8965-40f2-9feb-e37d6ac467e9
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      journal_dev
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              0
Block count:              8488436
Reserved block count:     0
Free blocks:              0
Free inodes:              0
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         0
Inode blocks per group:   0
Filesystem created:       Thu Feb  5 14:05:36 2009
Last mount time:          n/a
Last write time:          Thu Feb  5 14:15:26 2009
Mount count:              0
Maximum mount count:      30
Last checked:             Thu Feb  5 14:05:36 2009
Check interval:           15552000 (6 months)
Next check after:         Tue Aug  4 15:05:36 2009
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      fddb247a-97df-4582-bfcd-816ef8c17ab2
Journal block size:       4096
Journal length:           8488436
Journal first block:      2
Journal sequence:         0x0027c611
Journal start:            2
Journal number of users:  1
Journal users:            032613d3-6035-4872-bc0a-11db92feec5e

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de  Wed Feb 25 17:42:14 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 18:42:14 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090225174038.GH26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de>
Message-ID: <20090225174214.GI26291@charite.de>

* Ralf Hildebrandt :

> After my initial report I removed the journal and created an external
> journal on a 32GB partition. Hoping it would be faster, since
> accoriding to the docs. the journal size is limited to 128MB.

That should read:

Hoping it would be faster, since -- according to the docs -- the journal
size is limited to 128MB.

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From sandeen at redhat.com  Wed Feb 25 17:44:10 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 11:44:10 -0600
Subject: Questions regarding journal replay
In-Reply-To: <20090225173907.GG26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de>
Message-ID: <49A5836A.7050508@redhat.com>

Ralf Hildebrandt wrote:
> * Theodore Tso :
>> On Wed, Feb 25, 2009 at 10:31:42AM -0600, Eric Sandeen wrote:
>>> It'd be better to get to the bottom of the problem ... maybe iostat
>>> while it's happening to see if IO is actually happening; run blktrace to
>>> see where IO is going, do a few sysrq-t's to see where threads are at, etc.
>>> Can you find a way to reproduce this at will?
>>>
>>> Journal replay should *never* take this long, AFAIK.
>>
>> Indeed. The journal is 128 megs, as I recall. So even if the journal
>> was completely full, if it's taking 800 seconds, that's a write rate
>> of 0.16 MB/s (164 KB/second). That is indeed way too slow.
>
> The problem seems to be with the external journal which I recently
> changed to. It's a 32GB partition. My timings seem to indicate that
> ALL OF IT was being replayed.

But you also saw this with an internal journal?

Perhaps you have uncovered 2 bugs ... :)

TBH external journals probably aren't tested that much (though they
certainly should work)

I'll give it a quick sanity test on ext4.

-Eric

From sandeen at redhat.com  Wed Feb 25 18:08:17 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 12:08:17 -0600
Subject: dumpe2fs and external journal: Illegal inode number while reading journal inode
In-Reply-To: <20090225172036.GD26291@charite.de>
References: <20090225172036.GD26291@charite.de>
Message-ID: <49A58911.5030805@redhat.com>

Ralf Hildebrandt wrote:
> I created an ext4 fs with an external journal.
> I wanted to check how big the journal was, and tried:
>
> # dumpe2fs -h /dev/mapper/volg1-logv1
> dumpe2fs 1.41.3 (12-Oct-2008)
...
> dumpe2fs: Illegal inode number while reading journal inode

this should be fixed by:

commit a11d0746b4fb2ac41dcb5e7acf31942b1e8925e2
Author: Theodore Ts'o
Date:   Sat Nov 15 15:05:51 2008 -0500

    dumpe2fs: Only print inline journal information if the journal is internal

    Currently dumpe2fs displays an error if run on a filesystem with an
    external journal.

    Signed-off-by: "Theodore Ts'o"

in e2fsprogs-1.41.4

-Eric

From Ralf.Hildebrandt at charite.de  Wed Feb 25 18:11:08 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 19:11:08 +0100
Subject: Questions regarding journal replay
In-Reply-To: <49A5836A.7050508@redhat.com>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de> <49A5836A.7050508@redhat.com>
Message-ID: <20090225181108.GA8554@charite.de>

* Eric Sandeen :
> > The problem seems to be with the external journal which I recently
> > changed to. It's a 32GB partition. My timings seem to indicate that
> > ALL OF IT was being replayed.
>
> But you also saw this with an internal journal?

Yes.

> Perhaps you have uncovered 2 bugs ... :)
>
> TBH external journals probably aren't tested that much (though they
> certainly should work)
>
> I'll give it a quick sanity test on ext4.

They DO work, but apparently the docs are wrong! I mean, no sane
person needs 32GB of journal.

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin
From jjneely at ncsu.edu  Wed Feb 25 18:22:34 2009
From: jjneely at ncsu.edu (Jack Neely)
Date: Wed, 25 Feb 2009 13:22:34 -0500
Subject: ext3 kernel panic
In-Reply-To: <49A57056.4050506@redhat.com>
References: <20090225161337.GJ14821@virge.linuxczar.net> <49A57056.4050506@redhat.com>
Message-ID: <20090225182234.GK14821@virge.linuxczar.net>

On Wed, Feb 25, 2009 at 10:22:46AM -0600, Eric Sandeen wrote:
> Jack Neely wrote:
> > Folks,
> >
> > I have a RHEL 3 production Cyrus IMAP server that has kernel panicked
> > twice in as many weeks with something similar to the below. The /imap
> > partition (where the IO was in question) is about 98% full with about
> > 6.2G free and is hosted on a fiber-connected EMC Clariion LUN.
> > Currently running kernel 2.4.21-15.0.4.ELsmp.
> >
> > I know things are old here, and the machine is on the upgrade list, but
> > I need to do due diligence to figure out how serious this error is.
> > Does anyone have any advice why this is happening?
> >
> > Thanks,
> > Jack Neely
>
> That's not only an old distro, but an un-updated installation. I'd get
> the latest RHEL3 kernel and peruse the changelog, for starters, to see
> if it looks like this might have been fixed (although I don't see
> anything offhand).
>
> -Eric

I'm caught between a rock and a hard place due to the EMC PowerPath
binary-only kernel crack, which makes it painful for both me and my
customers to regularly upgrade the kernel. Not to mention the EMC
supportability matrix of doom.

I have 11 other imap servers configured identically that are not
regularly panicking. I'm trying to figure out what specifically could
be affecting this one machine that isn't affecting the others. The only
changelog entry that seems close is:

- fix O_SYNC EIO error propagation through ext3/jbd (Stephen Tweedie)

from kernel-2.4.21-34.EL. Is that anywhere close?

Jack
--
Jack Neely
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89

From tytso at mit.edu  Wed Feb 25 18:44:48 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 13:44:48 -0500
Subject: Questions regarding journal replay
In-Reply-To: <20090225174214.GI26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de> <20090225174214.GI26291@charite.de>
Message-ID: <20090225184448.GP7064@mit.edu>

On Wed, Feb 25, 2009 at 06:42:14PM +0100, Ralf Hildebrandt wrote:
> * Ralf Hildebrandt :
>
> > After my initial report I removed the journal and created an external
> > journal on a 32GB partition. Hoping it would be faster, since
> > accoriding to the docs. the journal size is limited to 128MB.
>
> That should read:
>
> Hoping it would be faster, since -- according to the docs -- the journal
> size is limited to 128MB.

Increasing the journal size may speed up certain filesystem workloads
which are causing the journal to wrap very frequently. However,
increasing the journal *will* increase the time to replay the journal....

How long did the journal replay take when you were using the 128MB
internal journal? Was the 800 seconds to replay for the case when you
were using the internal journal or the external journal?

- Ted
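For reference, the external-journal arrangement being discussed here is
built with mke2fs's journal_dev feature and attached with tune2fs, and
the journal takes up the whole block device it is created on -- which is
how a 32GB partition becomes a 32GB journal. A sketch (device names are
placeholders):

----8<----
# create a journal device on its own partition; the block size must
# match the filesystem that will use it
mke2fs -O journal_dev -b 4096 /dev/sdc1

# point an existing (unmounted) filesystem at it
tune2fs -O ^has_journal /dev/sdb1
tune2fs -j -J device=/dev/sdc1 /dev/sdb1
----8<----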
From tytso at mit.edu  Wed Feb 25 18:46:17 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 13:46:17 -0500
Subject: Questions regarding journal replay
In-Reply-To: <20090225173907.GG26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de>
Message-ID: <20090225184617.GQ7064@mit.edu>

On Wed, Feb 25, 2009 at 06:39:07PM +0100, Ralf Hildebrandt wrote:
>
> The RAID is made up from one logical volume, consisting of two drives
> sda and sdb, each containing 6 disks in a hardware RAID5 setup.

Do you know what the chunk size or strip size is for your hardware RAID5?

- Ted

From tytso at mit.edu  Wed Feb 25 18:48:16 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 13:48:16 -0500
Subject: dumpe2fs and external journal: Illegal inode number while reading journal inode
In-Reply-To: <20090225172036.GD26291@charite.de>
References: <20090225172036.GD26291@charite.de>
Message-ID: <20090225184816.GR7064@mit.edu>

On Wed, Feb 25, 2009 at 06:20:36PM +0100, Ralf Hildebrandt wrote:
> I created an ext4 fs with an external journal.
> I wanted to check how big the journal was, and tried:
>
> # dumpe2fs -h /dev/mapper/volg1-logv1
> Journal backup:           inode blocks
> dumpe2fs: Illegal inode number while reading journal inode

This bug was fixed in e2fsprogs 1.41.4. (By commenting out the code
that printed the journal size; I was in a hurry to get 1.41.4 out the
door.)

You can get the size of an external journal by running dumpe2fs on the
external journal.

- Ted

From Ralf.Hildebrandt at charite.de  Wed Feb 25 18:50:57 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 19:50:57 +0100
Subject: dumpe2fs and external journal: Illegal inode number while reading journal inode
In-Reply-To: <20090225184816.GR7064@mit.edu>
References: <20090225172036.GD26291@charite.de> <20090225184816.GR7064@mit.edu>
Message-ID: <20090225185057.GB8554@charite.de>

* Theodore Tso :
> This bug was fixed in e2fsprogs 1.41.4. (By commenting out the code
> that printed the journal size; I was in a hurry to get 1.41.4 out the
> door.)
>
> You can get the size of an external journal by running dumpe2fs on the
> external journal.

Yes, I found out. Anyway, I'm back to a 128MB internal journal now.

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From tytso at mit.edu  Wed Feb 25 18:52:48 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 13:52:48 -0500
Subject: Questions regarding journal replay
In-Reply-To: <20090225181108.GA8554@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de> <49A5836A.7050508@redhat.com> <20090225181108.GA8554@charite.de>
Message-ID: <20090225185248.GA1363@mit.edu>

On Wed, Feb 25, 2009 at 07:11:08PM +0100, Ralf Hildebrandt wrote:
> > TBH external journals probably aren't tested that much (though they
> > certainly should work)
> >
> > I'll give it a quick sanity test on ext4.
>
> They DO work, but apparently the docs are wrong!
> I mean, no sane
> person needs 32GB of journal.

The docs don't warn against needing that large a journal, yes. One of
the things which never got finished (although it was in the original
design of the jbd layer) was the ability to share the journal across
multiple filesystems. This would mean that it might make more sense to
have a single large journal. Probably not 32GB in size, though.

Did you find some documentation that actually recommends that large an
external journal?

- Ted

From Ralf.Hildebrandt at charite.de  Wed Feb 25 19:02:15 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 20:02:15 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090225185248.GA1363@mit.edu>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de> <49A5836A.7050508@redhat.com> <20090225181108.GA8554@charite.de> <20090225185248.GA1363@mit.edu>
Message-ID: <20090225190215.GD8554@charite.de>

* Theodore Tso :
> The docs don't warn against needing that large a journal, yes. One of
> the things which never got finished (although it was in the original
> design of the jbd layer) was the ability to share the journal across
> multiple filesystems. This would mean that it might make more sense to
> have a single large journal. Probably not 32GB in size, though.
>
> Did you find some documentation that actually recommends that large an
> external journal?

It says the "journal has a maximum size of 128M"

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de  Wed Feb 25 19:01:31 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 20:01:31 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090225184448.GP7064@mit.edu>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de> <20090225174214.GI26291@charite.de> <20090225184448.GP7064@mit.edu>
Message-ID: <20090225190131.GC8554@charite.de>

* Theodore Tso :
> Increasing the journal size may speed up certain filesystem workloads
> which are causing the journal to wrap very frequently. However,
> increasing the journal *will* increase the time to replay the journal....

Indeed. This is a Maildir-style mailbox server. Many small writes,
reads and deletes.

> How long did the journal replay take when you were using the 128MB
> internal journal?

800s

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin
From sandeen at redhat.com  Wed Feb 25 19:21:06 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 13:21:06 -0600
Subject: ext3 kernel panic
In-Reply-To: <20090225182234.GK14821@virge.linuxczar.net>
References: <20090225161337.GJ14821@virge.linuxczar.net> <49A57056.4050506@redhat.com> <20090225182234.GK14821@virge.linuxczar.net>
Message-ID: <49A59A22.6020509@redhat.com>

Jack Neely wrote:
> I'm caught between a rock and a hard place due to the EMC PowerPath
> binary-only kernel crack, which makes it painful for both me and my
> customers to regularly upgrade the kernel. Not to mention the EMC
> supportability matrix of doom.
>
> I have 11 other imap servers configured identically that are not
> regularly panicking. I'm trying to figure out what specifically could
> be affecting this one machine that isn't affecting the others. The only
> changelog entry that seems close is:
>
> - fix O_SYNC EIO error propagation through ext3/jbd (Stephen Tweedie)
>
> from kernel-2.4.21-34.EL. Is that anywhere close?

I kind of doubt it; as I said, I don't see anything in the changelogs
that looks immediately relevant...

-Eric

From sandeen at redhat.com  Wed Feb 25 19:28:35 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 25 Feb 2009 13:28:35 -0600
Subject: Questions regarding journal replay
In-Reply-To: <20090225174038.GH26291@charite.de>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de>
Message-ID: <49A59BE3.6070906@redhat.com>

Ralf Hildebrandt wrote:
> * Eric Sandeen :
>
>>>> It'd be better to get to the bottom of the problem ... maybe iostat
>>>> while it's happening to see if IO is actually happening; run blktrace to
>>>> see where IO is going, do a few sysrq-t's to see where threads are at, etc.
>>> We had 24GB of reading from the journal device (or 12GB if it's
>>> 512-byte blocks). I wonder why?
>> 24GB of reading from the journal device (during that 800s of replay
>> during mount?), and your journal is 128M ... well that's odd.
>
> After my initial report I removed the journal and created an external
> journal on a 32GB partition. Hoping it would be faster, since
> accoriding to the docs. the journal size is limited to 128MB.
>
>> You say journal device; is this an external journal? I didn't think so
>> from your first email, but is it?
>
> It is now.
...
> Journal block size:       4096
> Journal length:           8488436
> Journal first block:      2
> Journal sequence:         0x0027c611
> Journal start:            2
> Journal number of users:  1
> Journal users:            032613d3-6035-4872-bc0a-11db92feec5e

Ok, we might be getting a little off-track here. Your journal is indeed
32G in size. But you also saw this with an internal journal, which
should be limited to 128M, and yet you still saw a very long replay,
right?

-Eric
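If the slow replay can be reproduced, the blktrace capture Eric
suggested earlier in the thread would show exactly what the replay is
reading and in what pattern. A minimal sketch (the mount point is a
placeholder, and blktrace needs debugfs mounted):

----8<----
mount -t debugfs debugfs /sys/kernel/debug

# trace the journal device while the replay runs
blktrace -d /dev/cciss/c0d0p6 -o replay &
mount /dev/mapper/volg1-logv1 /mnt    # triggers the replay
kill %1

# summarize the captured IO
blkparse -i replay | less
----8<----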
From Ralf.Hildebrandt at charite.de  Wed Feb 25 19:31:08 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 20:31:08 +0100
Subject: Questions regarding journal replay
In-Reply-To: <49A59BE3.6070906@redhat.com>
References: <20090205125847.GR23918@charite.de> <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de> <49A59BE3.6070906@redhat.com>
Message-ID: <20090225193108.GE8554@charite.de>

* Eric Sandeen :
> > Journal block size:       4096
> > Journal length:           8488436
> > Journal first block:      2
> > Journal sequence:         0x0027c611
> > Journal start:            2
> > Journal number of users:  1
> > Journal users:            032613d3-6035-4872-bc0a-11db92feec5e
>
> Ok, we might be getting a little off-track here. Your journal is indeed
> 32G in size. But you also saw this with an internal journal, which
> should be limited to 128M, and yet you still saw a very long replay,
> right?

800s for 128M, yes

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From tytso at mit.edu  Wed Feb 25 21:11:18 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 16:11:18 -0500
Subject: Questions regarding journal replay
In-Reply-To: <20090225190215.GD8554@charite.de>
References: <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de> <49A5836A.7050508@redhat.com> <20090225181108.GA8554@charite.de> <20090225185248.GA1363@mit.edu> <20090225190215.GD8554@charite.de>
Message-ID: <20090225211118.GC1363@mit.edu>

On Wed, Feb 25, 2009 at 08:02:15PM +0100, Ralf Hildebrandt wrote:
> >
> > Did you find some documentation that actually recommends that large an
> > external journal?
>
> It says the "journal has a maximum size of 128M"

That's clearly not right. Where did you see that? We should make
sure it gets fixed...

- Ted

From tytso at mit.edu  Wed Feb 25 21:15:31 2009
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 25 Feb 2009 16:15:31 -0500
Subject: Questions regarding journal replay
In-Reply-To: <20090225190131.GC8554@charite.de>
References: <20090206142641.9FE446F064@alopias.GreenKey.net> <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de> <20090225174214.GI26291@charite.de> <20090225184448.GP7064@mit.edu> <20090225190131.GC8554@charite.de>
Message-ID: <20090225211531.GE1363@mit.edu>

On Wed, Feb 25, 2009 at 08:01:31PM +0100, Ralf Hildebrandt wrote:
> * Theodore Tso :
>
> > Increasing the journal size may speed up certain filesystem workloads
> > which are causing the journal to wrap very frequently. However,
> > increasing the journal *will* increase the time to replay the journal....
>
> Indeed. This is a Maildir-style mailbox server. Many small writes,
> reads and deletes.
>
> > How long did the journal replay take when you were using the 128MB
> > internal journal?
>
> 800s

So maybe I missed it, but about how long did it take with your 32GB
external journal?
- Ted

From Ralf.Hildebrandt at charite.de  Wed Feb 25 21:47:14 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Wed, 25 Feb 2009 22:47:14 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090225211531.GE1363@mit.edu>
References: <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225172334.GF26291@charite.de> <49A58199.2060101@redhat.com> <20090225174038.GH26291@charite.de> <20090225174214.GI26291@charite.de> <20090225184448.GP7064@mit.edu> <20090225190131.GC8554@charite.de> <20090225211531.GE1363@mit.edu>
Message-ID: <20090225214714.GK8554@charite.de>

* Theodore Tso :
> On Wed, Feb 25, 2009 at 08:01:31PM +0100, Ralf Hildebrandt wrote:
> > * Theodore Tso :
> >
> > > Increasing the journal size may speed up certain filesystem workloads
> > > which are causing the journal to wrap very frequently. However,
> > > increasing the journal *will* increase the time to replay the journal....
> >
> > Indeed. This is a Maildir-style mailbox server. Many small writes,
> > reads and deletes.
> >
> > > How long did the journal replay take when you were using the 128MB
> > > internal journal?
> >
> > 800s
>
> So maybe I missed it, but about how long did it take with your 32GB
> external journal?

One hour :(

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de  Fri Feb 27 16:40:41 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Fri, 27 Feb 2009 17:40:41 +0100
Subject: Questions regarding journal replay
In-Reply-To: <20090225211118.GC1363@mit.edu>
References: <20090206142822.GE31519@charite.de> <20090225162426.GA26291@charite.de> <49A5726E.6030703@redhat.com> <20090225173459.GO7064@mit.edu> <20090225173907.GG26291@charite.de> <49A5836A.7050508@redhat.com> <20090225181108.GA8554@charite.de> <20090225185248.GA1363@mit.edu> <20090225190215.GD8554@charite.de> <20090225211118.GC1363@mit.edu>
Message-ID: <20090227164041.GE7136@charite.de>

* Theodore Tso :
> On Wed, Feb 25, 2009 at 08:02:15PM +0100, Ralf Hildebrandt wrote:
> > >
> > > Did you find some documentation that actually recommends that large an
> > > external journal?
> >
> > It says the "journal has a maximum size of 128M"
>
> That's clearly not right. Where did you see that? We should make
> sure it gets fixed...

The journal options in the tune2fs man page say:

****** CITE *********
size=journal-size
       Create a journal stored in the filesystem of size journal-size
       megabytes. The size of the journal must be at least 1024
       filesystem blocks (i.e., 1MB if using 1k blocks, 4MB if using
       4k blocks, etc.) and may be no more than 102,400 filesystem
       blocks. There must be enough free space in the filesystem to
       create a journal of that size.

device=external-journal
       Attach the filesystem to the journal block device located on
       external-journal. The external journal must have been already
       created using the command

              mke2fs -O journal_dev external-journal

       Note that external-journal must be formatted with the same
       block size as filesystems which will be using it. In addition,
       while there is support for attaching multiple filesystems to a
       single external journal, the Linux kernel and e2fsck(8) do not
       currently support shared external journals yet.
****** CITE *********

It would be nice if the manpage included a sentence like:

    If an external journal is used, the whole journal block device will
    be used as the journal.

The sentence "... be no more than 102,400 filesystem blocks." gives the
impression (to me, that is) that the same restriction applies to an
external journal as well! Yes, I know *NOW* that's not the case.

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin

From Ralf.Hildebrandt at charite.de  Fri Feb 27 16:44:28 2009
From: Ralf.Hildebrandt at charite.de (Ralf Hildebrandt)
Date: Fri, 27 Feb 2009 17:44:28 +0100
Subject: tune2fs options
Message-ID: <20090227164428.GG7136@charite.de>

Is there any way of making operations like dropping and adding a journal:

tune2fs -O ^has_journal /dev/local/my_dev

and

tune2fs -o journal_data -j -J device=LABEL=my-journal-device /dev/local/my_dev

more verbose? It would be nice to know what's going on, since just
sitting there can be quite unnerving! In the end, it all worked OK, but
actually seeing progress can be soothing.

--
Ralf Hildebrandt                           Ralf.Hildebrandt at charite.de
Charite - Universitätsmedizin Berlin       Tel.  +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk        Fax.  +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin
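As far as I know, neither tune2fs operation has a verbosity or progress
flag in e2fsprogs 1.41.x. One workaround is to watch the affected
devices from a second terminal while tune2fs runs -- a rough sketch
(iostat comes from the sysstat package):

----8<----
# terminal 1: drop and re-add the journal
tune2fs -O ^has_journal /dev/local/my_dev
tune2fs -o journal_data -j -J device=LABEL=my-journal-device /dev/local/my_dev

# terminal 2: per-device throughput every 5 seconds; journal creation
# shows up as a sustained stream of writes to the journal device
iostat -x 5
----8<----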