From worleys at gmail.com Wed Oct 1 18:18:21 2008
From: worleys at gmail.com (Chris Worley)
Date: Wed, 1 Oct 2008 12:18:21 -0600
Subject: When is a block free?
In-Reply-To: <20080929163917.GB10831@mit.edu>
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu>
Message-ID:

On Mon, Sep 29, 2008 at 10:39 AM, Theodore Tso wrote:
> On Mon, Sep 29, 2008 at 09:24:33AM -0600, Chris Worley wrote:
>> On Tue, Sep 16, 2008 at 3:32 PM, Chris Worley wrote:
>> > For example, in balloc.c I'm seeing ext3_free_blocks_sb
>> > calls ext3_clear_bit_atomic at the bottom... is that when the block is
>> > freed?  Are all blocks freed here?
>>
>> David Woodhouse, in an article at http://lwn.net/Articles/293658/, is
>> implementing the T10/T13 committees' "Trim" request in 2.6.28 kernels.
>>
>> Would it be appropriate to call "blkdev_issue_discard" at the bottom
>> of ext3_free_blocks_sb where ext3_clear_bit_atomic is being called?
>
> Unfortunately, it's not as simple as that.  The problem is that as
> soon as you call trim, the drive is allowed to discard the contents of
> that block so that future attempts to read from that block return all
> zeros.  Therefore we can't call Trim until after the transaction has
> committed.  That means we have to keep a linked list of block extents
> that are to be trimmed attached to the commit object, and only send
> the trim requests once the commit block has been written to disk.
>
> It's on the ext4 developers' TODO list to add Trim support to ext3 and
> ext4.

I was perusing David Woodhouse's 2.6.27-rc2 kernel at
git://git.infradead.org/users/drzeus/discard-2.6.git, and noticed he
has the discard built in to where I was talking about for ext2... so I
coded our driver to handle discards, and it works very nicely!!!

The journaling issue you raise is not a show-stopper on the block
device side: if the block device has to maintain a couple of blocks
that are not really in use, it's no big deal (eventually the blocks
will be re-written and the universe will be in order again)... for the
users, I can understand if the discard is preserved on the block
device, while the fs still thinks there's good data in there (we'll
give you back all zeros on read).

Chris

From tytso at mit.edu Wed Oct 1 18:59:09 2008
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 1 Oct 2008 14:59:09 -0400
Subject: When is a block free?
In-Reply-To:
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu>
Message-ID: <20081001185908.GC10080@mit.edu>

On Wed, Oct 01, 2008 at 12:18:21PM -0600, Chris Worley wrote:
>
> I was perusing David Woodhouse's 2.6.27-rc2 kernel at
> git://git.infradead.org/users/drzeus/discard-2.6.git, and noticed he
> has the discard built in to where I was talking about for ext2... so I
> coded our driver to handle discards, and it works very nicely!!!

I'm not sure what you mean by "our driver"?

> The journaling issue you raise is not a show-stopper on the block
> device side: if the block device has to maintain a couple of blocks
> that are not really in use, it's no big deal (eventually the blocks
> will be re-written and the universe will be in order again)... for the
> users, I can understand if the discard is preserved on the block
> device, while the fs still thinks there's good data in there (we'll
> give you back all zeros on read).

It's no issue on the block device side at all, but from the user's
point of view it can be quite disastrous.
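Concretely, the deferral I described in my last mail would look
something like the sketch below.  To be clear, this is only a sketch:
the t_discard_list field and both helpers are hypothetical names that
don't exist in today's jbd, and blkdev_issue_discard() only exists in
trees carrying David's discard patches (its signature may differ).
The point is purely the ordering: remember the freed extents on the
running transaction, and only issue the discards after the commit
block is known to be on disk.

	/*
	 * Sketch only: hypothetical structure, field, and helper
	 * names -- not current ext3/jbd code.
	 */
	#include <linux/list.h>
	#include <linux/slab.h>
	#include <linux/blkdev.h>
	#include <linux/jbd.h>

	struct discard_extent {
		struct list_head list;
		sector_t sector;	/* first sector to discard */
		sector_t count;		/* number of sectors */
	};

	/* Called where ext3_free_blocks_sb() clears the bitmap bits
	 * today, instead of issuing the discard immediately. */
	static void ext3_remember_discard(transaction_t *tx,
					  sector_t sector, sector_t count)
	{
		struct discard_extent *de = kmalloc(sizeof(*de), GFP_NOFS);

		if (!de)
			return;		/* dropping a discard is always safe */
		de->sector = sector;
		de->count = count;
		/* t_discard_list is a hypothetical field */
		list_add_tail(&de->list, &tx->t_discard_list);
	}

	/* Called from the commit path, strictly after the commit
	 * block has been written to disk. */
	static void ext3_issue_discards(struct block_device *bdev,
					transaction_t *tx)
	{
		struct discard_extent *de, *next;

		list_for_each_entry_safe(de, next, &tx->t_discard_list, list) {
			blkdev_issue_discard(bdev, de->sector, de->count);
			list_del(&de->list);
			kfree(de);
		}
	}

Until something along those lines is in place, issuing the discard
right where the bitmap bit is cleared opens a window in which data the
filesystem still considers valid can silently turn into zeros.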
Consider the following shell script:

	cp /etc/passwd /etc/passwd.vipw
	vi /etc/passwd.vipw
	# atomically update /etc/passwd
	mv /etc/passwd.vipw /etc/passwd

Now assume that we crash right after the "mv" command, but before the
transaction has committed.  The net result will be that the contents
of the /etc/passwd file will be all zeros, which some might
consider.... unfortunate.

This is exactly the same reason why we can't just zero data blocks on
the unlink command, but instead have to wait until the unlink
operation has actually been committed in the journal.

						- Ted

From worleys at gmail.com Wed Oct 1 19:46:00 2008
From: worleys at gmail.com (Chris Worley)
Date: Wed, 1 Oct 2008 13:46:00 -0600
Subject: When is a block free?
In-Reply-To: <20081001185908.GC10080@mit.edu>
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu>
Message-ID:

On Wed, Oct 1, 2008 at 12:59 PM, Theodore Tso wrote:
> On Wed, Oct 01, 2008 at 12:18:21PM -0600, Chris Worley wrote:
>>
>> I was perusing David Woodhouse's 2.6.27-rc2 kernel at
>> git://git.infradead.org/users/drzeus/discard-2.6.git, and noticed he
>> has the discard built in to where I was talking about for ext2... so I
>> coded our driver to handle discards, and it works very nicely!!!
>
> I'm not sure what you mean by "our driver"?

Our driver for the ioDrive:

http://fusionio.com/Products.aspx

So far, all I've implemented is the "discard" in the read/write
callback; no barrier, no ioctl.

>
>> The journaling issue you raise is not a show-stopper on the block
>> device side: if the block device has to maintain a couple of blocks
>> that are not really in use, it's no big deal (eventually the blocks
>> will be re-written and the universe will be in order again)... for the
>> users, I can understand if the discard is preserved on the block
>> device, while the fs still thinks there's good data in there (we'll
>> give you back all zeros on read).
>
> It's no issue on the block device side at all, but from the user's
> point of view it can be quite disastrous.

Maybe that should affect the priority of implementation for ext[34]?

Chris

From tytso at mit.edu Wed Oct 1 21:29:40 2008
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 1 Oct 2008 17:29:40 -0400
Subject: When is a block free?
In-Reply-To:
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu>
Message-ID: <20081001212940.GI10080@mit.edu>

On Wed, Oct 01, 2008 at 01:46:00PM -0600, Chris Worley wrote:
>
> Maybe that should affect the priority of implementation for ext[34]?
>

It's on our todo list, but at the moment you can't even *get* SSD's
that have the trim command, apparently for love or money.  So that
affects the priority as well.

If someone wants to ship me an SSD that has trim support, ideally in a
2.5" 9mm hard drive SATA form factor with at least 128gigs, I promise
you that would affect priority of that feature, at least for me.  :-)

						- Ted

From balu.manyam at gmail.com Thu Oct 2 05:36:38 2008
From: balu.manyam at gmail.com (Balu manyam)
Date: Thu, 2 Oct 2008 11:06:38 +0530
Subject: When is a block free?
In-Reply-To: <20081001212940.GI10080@mit.edu>
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu> <20081001212940.GI10080@mit.edu>
Message-ID: <995392220810012236i756ca53at112ead6b03d8f8c1@mail.gmail.com>

On Thu, Oct 2, 2008 at 2:59 AM, Theodore Tso wrote:
> On Wed, Oct 01, 2008 at 01:46:00PM -0600, Chris Worley wrote:
> >
> > Maybe that should affect the priority of implementation for ext[34]?
> >
>

Also, am I inferring correctly that the SAN array vendors who are now
implementing thin provisioning (i.e. allocating space on writes) can
benefit from this?  Now that the array can know which blocks are free
and update its own list of free blocks?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From worleys at gmail.com Thu Oct 2 13:40:30 2008
From: worleys at gmail.com (Chris Worley)
Date: Thu, 2 Oct 2008 07:40:30 -0600
Subject: When is a block free?
In-Reply-To: <995392220810012236i756ca53at112ead6b03d8f8c1@mail.gmail.com>
References: <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu> <20081001212940.GI10080@mit.edu> <995392220810012236i756ca53at112ead6b03d8f8c1@mail.gmail.com>
Message-ID:

On Wed, Oct 1, 2008 at 11:36 PM, Balu manyam wrote:
>
>
> On Thu, Oct 2, 2008 at 2:59 AM, Theodore Tso wrote:
>>
>> On Wed, Oct 01, 2008 at 01:46:00PM -0600, Chris Worley wrote:
>> >
>> > Maybe that should affect the priority of implementation for ext[34]?
>> >
>> >
>
> Also, am I inferring correctly that the SAN array vendors who are now
> implementing thin provisioning (i.e. allocating space on writes) can
> benefit from this?

Absolutely.

Chris

> Now that the array can know which blocks are free and update its
> own list of free blocks?
>

From articpenguin3800 at gmail.com Sat Oct 11 21:01:16 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Sat, 11 Oct 2008 17:01:16 -0400
Subject: Backup Superblocks
Message-ID: <48F1141C.4040604@gmail.com>

Where does ext3 store the backup superblock?  Does it have one at the
very beginning of the partition and one at the very end?

From samuel at bcgreen.com Sat Oct 11 21:41:09 2008
From: samuel at bcgreen.com (Stephen Samuel)
Date: Sat, 11 Oct 2008 14:41:09 -0700
Subject: Backup Superblocks
In-Reply-To: <48F1141C.4040604@gmail.com>
References: <48F1141C.4040604@gmail.com>
Message-ID: <6cd50f9f0810111441h5b4d5235g2104ef5a337f8bcd@mail.gmail.com>

It stores them in various places, depending on the size of your
filesystem.  If your filesystem is large enough (>~ 1/2 GB) you'll
probably find it at block #32768.  For smaller filesystems, it appears
to put the first backup at block #8193.

You can get more details by using the -n option to mkfs.  If you used
nonstandard options in your original mkfs, you might want to provide
those details here, as well.  (( -n has mkfs.ext[23] not actually
write to the partition but simply say what it *WOULD* do if it did. ))

mkfs -t ext2 -n /dev/mydevice

On Sat, Oct 11, 2008 at 2:01 PM, John Nelson wrote:
> Where does ext3 store the backup superblock?  Does it have one at the very
> beginning of the partition and one at the very end?
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

--
Stephen Samuel http://www.bcgreen.com  778-861-7641

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carlo at alinoe.com Wed Oct 15 01:43:10 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Wed, 15 Oct 2008 03:43:10 +0200
Subject: How are 'files with holes' stored?
Message-ID: <20081015014310.GA1649@alinoe.com>

Hi, I don't know what to call them, but it seems
that ext3 allows files to be stored that
have a very large size (when doing an 'ls -l')
but do not actually allocate all blocks.

I assume this is achieved by using 0 as block number
for indirect blocks.

What are the exact requirements for such files?
Is it allowed to have a double indirect block
that consists entirely of zeroes?  Is it possible
that there are 0 entries in the triple indirect
block?  Etc.

--
Carlo Wood

From lm at bitmover.com Wed Oct 15 01:47:55 2008
From: lm at bitmover.com (Larry McVoy)
Date: Tue, 14 Oct 2008 18:47:55 -0700
Subject: How are 'files with holes' stored?
In-Reply-To: <20081015014310.GA1649@alinoe.com>
References: <20081015014310.GA1649@alinoe.com>
Message-ID: <20081015014755.GB32378@bitmover.com>

I don't remember how UFS did this but I could go figure it out in 10
or 20 minutes if that helped.  ext* - no idea.  I'd expect that your
"block number is 0" is a darn good guess, that's what I would do.
That or -1.

On Wed, Oct 15, 2008 at 03:43:10AM +0200, Carlo Wood wrote:
> Hi, I don't know what to call them, but it seems
> that ext3 allows files to be stored that
> have a very large size (when doing an 'ls -l')
> but do not actually allocate all blocks.
>
> I assume this is achieved by using 0 as block number
> for indirect blocks.
>
> What are the exact requirements for such files?
> Is it allowed to have a double indirect block
> that consists entirely of zeroes?  Is it possible
> that there are 0 entries in the triple indirect
> block?  Etc.
>
> --
> Carlo Wood
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

--
---
Larry McVoy                lm at bitmover.com           http://www.bitkeeper.com

From ling at fnal.gov Wed Oct 15 16:56:06 2008
From: ling at fnal.gov (Ling C. Ho)
Date: Wed, 15 Oct 2008 11:56:06 -0500
Subject: Need help recovering files.
Message-ID: <48F620A6.6060508@fnal.gov>

Hello,

I am trying to recover a huge ext3 filesystem (5.5TB) and fsck has
been running for almost a week.  It's still at PASS 1D at this point,
showing messages like

File ... (inode #138235018, mod time Tue Sep 23 03:04:23 2008)
has 1016 multiply-claimed block(s), shared with 327 file(s):
... (inode #375491526, mod time Wed Jun 4 17:05:37 2008)
...

I am wondering if it is possible for me to use debugfs to dump the
contents of inodes, if I write a script to go through all the inodes
that are used.  But when I try using ncheck to find out the path name
(the filename would be enough) I get these messages:

ncheck: EXT2 directory corrupted while calling ext2_dir_iterate

The root inode, according to fsck and debugfs, is gone.  If I were to
do an ls, it says "Ext2 inode is not a directory".

The tools I am using are from e2fsprogs 1.41.2.  The file systems were
originally created and mounted on a Fermi Linux SLF4.5 system (similar
to RHEL 4.5).

Is there any way for me to dump individual files, or to search for
valid directory inodes and use rdump?

Thanks,
...
ling From Curtis at GreenKey.net Sat Oct 18 19:55:56 2008 From: Curtis at GreenKey.net (Curtis Doty) Date: Sat, 18 Oct 2008 12:55:56 -0700 (PDT) Subject: recovering failed resize2fs Message-ID: <20081018195556.EB4016F064@alopias.GreenKey.net> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel deadlocked. (I have photo of screen/oops if anybody's interested.) Now after recovery, the filesystem won't mount EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in group (block 0)!<3>EXT4-fs: group descriptors corrupted! and fsck won't run: fsck.ext4: Group descriptors look bad... trying backup blocks... inst: recovering journal fsck.ext4: unable to set superblock flags on inst I peeked at all backup superblocks, but they all appear the same--the larger/newer 2.18T geometry. :-( What is the best way to recover? I know exactly how the original filesystem was created. Is there a way to just replay the old superblocks and trick it into thinking it never resized? ../C dumpe2fs 1.41.0 (10-Jul-2008) Filesystem volume name: inst Last mounted on: Filesystem UUID: ddabbf0c-bf3f-495a-9777-e832cc14e9df Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent sparse_super large_file Filesystem flags: signed_directory_hash test_filesystem Default mount options: journal_data_writeback Filesystem state: clean Errors behavior: Remount read-only Filesystem OS type: Linux Inode count: 9080832 Block count: 581173248 Reserved block count: 5808792 Free blocks: 6282771 Free inodes: 4331111 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 885 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 512 Inode blocks per group: 16 RAID stride: 32 RAID stripe width: 64 Filesystem created: Sun Jul 27 21:02:11 2008 Last mount time: Wed Aug 13 18:28:35 2008 Last write time: Sat Oct 18 12:39:13 2008 Mount count: 2 Maximum mount count: -1 Last checked: Sun Jul 27 21:02:11 2008 Check interval: 0 () Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 Default directory hash: tea Directory Hash Seed: c6e2cfa3-0545-46a4-8240-ccb987191b88 Journal backup: inode blocks Journal size: 256M From tytso at mit.edu Sat Oct 18 20:29:36 2008 From: tytso at mit.edu (Theodore Tso) Date: Sat, 18 Oct 2008 16:29:36 -0400 Subject: recovering failed resize2fs In-Reply-To: <20081018195556.EB4016F064@alopias.GreenKey.net> References: <20081018195556.EB4016F064@alopias.GreenKey.net> Message-ID: <20081018202936.GC8383@mit.edu> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote: > While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel > deadlocked. (I have photo of screen/oops if anybody's interested.) Yes, that would be useful, thanks. > Now after recovery, the filesystem won't mount > > EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in > group (block 0)!<3>EXT4-fs: group descriptors corrupted! > > and fsck won't run: > > fsck.ext4: Group descriptors look bad... trying backup blocks... > inst: recovering journal > fsck.ext4: unable to set superblock flags on inst Hmm... This sounds like the needs recovery flag was set on the backup superblock, which should never happen. Before we try something more extreme, see if this helps you: e2fsck -b 32768 -B 4096 /dev/where-inst-is-located That forces the use of the backup superblock right away, and might help you get past the initial error. 
- Ted

From Curtis at GreenKey.net Sat Oct 18 23:20:13 2008
From: Curtis at GreenKey.net (Curtis Doty)
Date: Sat, 18 Oct 2008 16:20:13 -0700 (PDT)
Subject: recovering failed resize2fs
In-Reply-To: <20081018202936.GC8383@mit.edu>
References: <20081018195556.EB4016F064@alopias.GreenKey.net> <20081018202936.GC8383@mit.edu>
Message-ID: <20081018232013.591E26F064@alopias.GreenKey.net>

4:29pm Theodore Tso said:

> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>
> Yes, that would be useful, thanks.

Three photos of same: http://www.greenkey.net/~curtis/linux/

The rest had scrolled off, so maybe that soft lockup was a secondary
effect rather than the true cause?  It was re-appearing every minute.

>
>> Now after recovery, the filesystem won't mount
>>
>> EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in
>> group (block 0)!<3>EXT4-fs: group descriptors corrupted!
>>
>> and fsck won't run:
>>
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> inst: recovering journal
>> fsck.ext4: unable to set superblock flags on inst
>
> Hmm... This sounds like the needs recovery flag was set on the backup
> superblock, which should never happen.  Before we try something more
> extreme, see if this helps you:
>
> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>
> That forces the use of the backup superblock right away, and might
> help you get past the initial error.

Same as before. :-(

# e2fsck -b32768 -B4096 -C0 /dev/dat/inst
e2fsck 1.41.0 (10-Jul-2008)
inst: recovering journal
e2fsck: unable to set superblock flags on inst

It appears *all* superblocks are the same as that first one at 32768;
iterating over all the superblock locations shown in the mkfs -n
output confirms it.

I'm inclined to just force reduce the underlying lvm.  It was 100%
full before I extended and tried to resize.  And I know the only
writes on the new lvm extent would have been from resize2fs.  Is that
wise?

../C

From tytso at mit.edu Mon Oct 20 01:53:09 2008
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 19 Oct 2008 21:53:09 -0400
Subject: recovering failed resize2fs
In-Reply-To: <20081018232013.591E26F064@alopias.GreenKey.net>
References: <20081018195556.EB4016F064@alopias.GreenKey.net> <20081018202936.GC8383@mit.edu> <20081018232013.591E26F064@alopias.GreenKey.net>
Message-ID: <20081020015309.GB8162@mit.edu>

On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
> 4:29pm Theodore Tso said:
>
>> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>>
>> Yes, that would be useful, thanks.
>
> Three photos of same: http://www.greenkey.net/~curtis/linux/
>
> The rest had scrolled off, so maybe that soft lockup was a secondary
> effect rather than the true cause?  It was re-appearing every minute.

Looks like the kernel wedged due to running out of memory.  The calls
to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
etc. tend to indicate that the system was frantically trying to find
free physical memory at the time.  It may or may not have been caused
by the online resize; how much memory does your system have, and what
else was going on at the time?  It may have been that something *else*
had been leaking memory at the time, and this pushed it over the line.
It's also the case that the online resize is journaled, so it should
have been safe; but I'm guessing that the system was thrashing so
hard, and you didn't have barriers enabled, and this resulted in the
filesystem getting corrupted.

>> Hmm... This sounds like the needs recovery flag was set on the backup
>> superblock, which should never happen.  Before we try something more
>> extreme, see if this helps you:
>>
>> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>>
>> That forces the use of the backup superblock right away, and might
>> help you get past the initial error.
>
> Same as before. :-(
>
> # e2fsck -b32768 -B4096 -C0 /dev/dat/inst
> e2fsck 1.41.0 (10-Jul-2008)
> inst: recovering journal
> e2fsck: unable to set superblock flags on inst
>
> It appears *all* superblocks are the same as that first one at 32768;
> iterating over all the superblock locations shown in the mkfs -n
> output confirms it.
>
> I'm inclined to just force reduce the underlying lvm.  It was 100%
> full before I extended and tried to resize.  And I know the only
> writes on the new lvm extent would have been from resize2fs.  Is that
> wise?

No, force reducing the underlying LVM is only going to make things
worse, since it doesn't fix the filesystem.

So this is what I would do.  Create a snapshot and try this on the
snapshot first:

% lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
% debugfs -w /dev/dat/inst-snapshot
debugfs: features ^needs_recovery
debugfs: quit
% e2fsck -C 0 /dev/dat/inst

This will skip running the journal, but there's no guarantee the
journal is valid anyway.

If this turns into a mess, you can throw away the snapshot and try
something else.  (The something else would require writing a C program
that removes the needs_recovery from all the backup superblocks, but
keeping it set on the master superblock.  That's more work, so let's
try this way first.)

						- Ted

From rdavidson at obsidian.com.au Mon Oct 20 02:35:54 2008
From: rdavidson at obsidian.com.au (Robert Davidson)
Date: Mon, 20 Oct 2008 13:35:54 +1100
Subject: ext3 file system I/O blocks until reboot
Message-ID: <48FBEE8A.80608@obsidian.com.au>

Hi all,

We have a server that has a 580GB ext3 file system on it.  Until
recently we ran around 15 virtual servers from this file system.  It
was fine for at least a few months, then the file system would
periodically become inaccessible, getting more frequent as time went
on.  Eventually we wouldn't even get through a 15-hour period without
having to reboot the server.

When the I/O got blocked, all processes accessing files on
/var/lib/vservers (its mount point) would get stuck waiting for I/O to
complete ("D" state) and I couldn't find any way to revive it apart
from rebooting the server.  I tried sending various signals (TERM and
KILL) to some kernel threads but that didn't help at all.  The
"kjournald" process also got stuck in the "D" state.

The server is running kernel 2.6.22.19 with the Linux-Vserver patch
vs2.2.0.7, DRBD 8.2.6 and the Areca RAID driver updated to
1.20.0X.15-80603 which was the latest available from Areca at the
time.  The OS is Debian etch.

As part of troubleshooting the problem I'd taken DRBD out of the mix,
tried updating the RAID driver in the kernel, replaced the RAID card
with another one with slightly later firmware, and also replaced the
power supply with a known-good one at the same time and disabled the
swap space.  None of that helped.  What did help was copying the files
from the existing file system to a newly formatted ext3 file system.
The newly formatted file system is only around 320GB, but is also set
up the same as the existing one (both are hardware RAID-6, running on
the same host, same controller, same physical disks, etc).

When the file system would become inaccessible, there were no notices
from the kernel about any issue at all.  We have a serial console on
this server and nothing was captured by the serial console when this
happened, nor is there anything in the system logs (which should have
been writable all this time as they are not on the broken file
system).

I used 'dd' to check if I could read from the underlying device files
that the file system was on (/dev/sdc1 and /dev/drbd1), and there was
no problem doing that.  I didn't test writes to these devices though
since I don't know of any safe way to do so, but using the SysRq
feature, an emergency sync would not complete, nor would an emergency
umount, so I assume writes were out of the question.  Doing an 'ls' on
/var/lib/vservers just left me with yet another process stuck in the
"D" state.

A forced fsck of the file system (using a fresh build of e2fsprogs
1.41.3 with the matching libraries) provides no hint of any problems.

The root file system is an ext3 file system as well, and there were no
problems reading/writing to that file system while the ext3 file
system on /var/lib/vservers was inaccessible.  The filesystem is also
on the same RAID card, physical disks, etc.

One reason I've not moved to a newer kernel yet is because there isn't
a stable linux-vserver patch for anything newer than 2.6.22.19, so I'm
kind of stuck with that kernel until there is.  I made a start on
backporting the ext3 code from 2.6.26.5 to 2.6.22.19 but it's not
something I trust myself to get right, so I'd rather avoid that
approach unless there is another way of doing that.

So my questions are:

Are there any further diagnostics I can perform on the old file system
to try and track down the problem?  If so, what are they?

Is this a known bug/problem with ext3 or something related to it?

Is it likely that one of the 3 or so deadlocks that have been fixed in
kernels since 2.6.22.19 would have cured this problem, or would these
deadlocks have taken down the whole box and not just affected the one
file system?  Or even this bug:
http://bugzilla.kernel.org/show_bug.cgi?id=10882 (the softlockup part;
I think not, though, because I was able to copy everything off that
file system and on to a new one without having any lockups or any
other complaints from the kernel).

Thanks.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au

From bruno at wolff.to Mon Oct 20 13:34:49 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Mon, 20 Oct 2008 08:34:49 -0500
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <48FBEE8A.80608@obsidian.com.au>
References: <48FBEE8A.80608@obsidian.com.au>
Message-ID: <20081020133449.GB26855@wolff.to>

On Mon, Oct 20, 2008 at 13:35:54 +1100, Robert Davidson wrote:
>
> So my questions are:
>
> Are there any further diagnostics I can perform on the old file system
> to try and track down the problem?  If so, what are they?
>
> Is this a known bug/problem with ext3 or something related to it?

I saw stuff like this happening starting with later 2.6.20 kernels that
wasn't fixed until the 2.6.24 kernels.  (See bug 235043.)  I wasn't using
VM's, so it might not be the same as the bug you are seeing.
I do remember seeing
some other similar problems people were having that didn't appear
to be the same bug as I had when I did bugzilla searches.  So you might
want to do your own bugzilla search to see what you can find.

I have also been getting disk IO lockups in F10, but in a more limited set
of circumstances.  (Memory pressure on an X86_64 system.)

From rdavidson at obsidian.com.au Tue Oct 21 00:40:06 2008
From: rdavidson at obsidian.com.au (Robert Davidson)
Date: Tue, 21 Oct 2008 11:40:06 +1100
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <20081020133449.GB26855@wolff.to>
References: <48FBEE8A.80608@obsidian.com.au> <20081020133449.GB26855@wolff.to>
Message-ID: <48FD24E6.1060106@obsidian.com.au>

Bruno Wolff III wrote:
> I saw stuff like this happening starting with later 2.6.20 kernels that
> wasn't fixed until the 2.6.24 kernels.  (See bug 235043.)  I wasn't using
> VM's, so it might not be the same as the bug you are seeing.  I do remember
> seeing some other similar problems people were having that didn't appear
> to be the same bug as I had when I did bugzilla searches.  So you might
> want to do your own bugzilla search to see what you can find.
>
> I have also been getting disk IO lockups in F10, but in a more limited set
> of circumstances.  (Memory pressure on an X86_64 system.)
>

Hi Bruno,

I've had a look through bugzilla but couldn't find any similar bugs (the
closest I can find is 439548 but I doubt very much that that's it).  Your
bug 235043 does sound rather different since it sounds like new
processes would be able to access the file system without a problem,
whereas on my system any new attempt to read (writing wasn't tested)
just resulted in one more process stuck in the "D" state.

I might try taking a byte-for-byte copy of the FS and see if I can find
a way to reliably reproduce the problem on a similar server.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au

From bruno at wolff.to Tue Oct 21 03:37:22 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Mon, 20 Oct 2008 22:37:22 -0500
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <48FD24E6.1060106@obsidian.com.au>
References: <48FBEE8A.80608@obsidian.com.au> <20081020133449.GB26855@wolff.to> <48FD24E6.1060106@obsidian.com.au>
Message-ID: <20081021033722.GA24998@wolff.to>

On Tue, Oct 21, 2008 at 11:40:06 +1100, Robert Davidson wrote:
>
> I've had a look through bugzilla but couldn't find any similar bugs (the
> closest I can find is 439548 but I doubt very much that that's it).  Your
> bug 235043 does sound rather different since it sounds like new
> processes would be able to access the file system without a problem,
> whereas on my system any new attempt to read (writing wasn't tested)
> just resulted in one more process stuck in the "D" state.

For a while.  Eventually everything would lock up.
From Curtis at GreenKey.net Tue Oct 21 21:44:33 2008
From: Curtis at GreenKey.net (Curtis Doty)
Date: Tue, 21 Oct 2008 14:44:33 -0700 (PDT)
Subject: recovering failed resize2fs
In-Reply-To: <20081020015309.GB8162@mit.edu>
References: <20081018195556.EB4016F064@alopias.GreenKey.net> <20081018202936.GC8383@mit.edu> <20081018232013.591E26F064@alopias.GreenKey.net> <20081020015309.GB8162@mit.edu>
Message-ID: <20081021214433.DA4416F064@alopias.GreenKey.net>

Sunday Theodore Tso said:

> On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
>> 4:29pm Theodore Tso said:
>>
>>> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>>>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>>>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>>>
>>> Yes, that would be useful, thanks.
>>
>> Three photos of same: http://www.greenkey.net/~curtis/linux/
>>
>> The rest had scrolled off, so maybe that soft lockup was a secondary
>> effect rather than the true cause?  It was re-appearing every minute.
>
> Looks like the kernel wedged due to running out of memory.  The calls
> to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
> etc. tend to indicate that the system was frantically trying to find
> free physical memory at the time.  It may or may not have been caused
> by the online resize; how much memory does your system have, and what
> else was going on at the time?  It may have been that something *else*
> had been leaking memory at the time, and this pushed it over the line.
>

The system had been up a couple of months, doing significant I/O on
the ext4 volume.  And indeed it had been having periodic memory/swap
issues:

http://www.greenkey.net/~curtis/linux/cracker-kernel.2008-10-21

> It's also the case that the online resize is journaled, so it should
> have been safe; but I'm guessing that the system was thrashing so
> hard, and you didn't have barriers enabled, and this resulted in the
> filesystem getting corrupted.

Some other observations...

- a snapshot in a different vg blew up a few days prior; it was deleted
- ran vgs a few times in another vty during resize2fs *immediately*
  before crash

>
>>> Hmm... This sounds like the needs recovery flag was set on the backup
>>> superblock, which should never happen.  Before we try something more
>>> extreme, see if this helps you:
>>>
>>> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>>>
>>> That forces the use of the backup superblock right away, and might
>>> help you get past the initial error.
>>
>> Same as before. :-(
>>
>> # e2fsck -b32768 -B4096 -C0 /dev/dat/inst
>> e2fsck 1.41.0 (10-Jul-2008)
>> inst: recovering journal
>> e2fsck: unable to set superblock flags on inst
>>
>> It appears *all* superblocks are the same as that first one at 32768;
>> iterating over all the superblock locations shown in the mkfs -n
>> output confirms it.
>>
>> I'm inclined to just force reduce the underlying lvm.  It was 100%
>> full before I extended and tried to resize.  And I know the only
>> writes on the new lvm extent would have been from resize2fs.  Is that
>> wise?
>
> No, force reducing the underlying LVM is only going to make things
> worse, since it doesn't fix the filesystem.
>
> So this is what I would do.  Create a snapshot and try this on the
> snapshot first:
>
> % lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
> % debugfs -w /dev/dat/inst-snapshot
> debugfs: features ^needs_recovery
> debugfs: quit
> % e2fsck -C 0 /dev/dat/inst

Done, but no change.
:-(

EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in
group (block 0)!<3>EXT4-fs: group descriptors corrupted!

>
> This will skip running the journal, but there's no guarantee the
> journal is valid anyway.
>
> If this turns into a mess, you can throw away the snapshot and try
> something else.  (The something else would require writing a C program
> that removes the needs_recovery from all the backup superblocks, but
> keeping it set on the master superblock.  That's more work, so let's
> try this way first.)

How does that something else work?

../C

From rbock at eudoxos.de Thu Oct 23 08:00:44 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Thu, 23 Oct 2008 10:00:44 +0200
Subject: Block bitmap differences
Message-ID: <49002F2C.9040209@eudoxos.de>

Hi,

a few weeks ago, an unhealthy combination of firmware in an Adaptec
Raid controller and Seagate disks damaged my Raid6 filesystem.  A
bunch of files were damaged or lost at that time after the firmware
was updated and I had run e2fsck.  Luckily, I was able to restore
everything from a backup.  A subsequent check with e2fsck reported no
errors.

Yesterday, I ran e2fsck -n again, to see if the system is still OK.
It isn't and I have no idea how to interpret the messages (see
attachment).

What is the meaning and severity of
- Block bitmap differences?
- Free blocks count wrong for group?

Thanks and regards,

Roland

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: e2fsck-n-2008-10-22
URL:

From rbock at eudoxos.de Thu Oct 23 08:18:31 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Thu, 23 Oct 2008 10:18:31 +0200
Subject: Undeletable files
Message-ID: <49003357.8040704@eudoxos.de>

Hi,

an e2fsck-run left a few files sprinkled over the file system which
seem to be undeletable.  Although the FS is mounted RW, even root does
not seem to be able to delete them.

#: ls -l
total 4888
-rwxrwx--t 1 18416192 21168618 45056 2007-03-08 17:57 00000000_0000_AAM_TUI_002.txt
-r-xr-x-wT 1 2617499625 1426397418 45056 1920-07-13 22:52 00000000_0000_AAM_TUI_007.txt
-rw-rwSrw- 1 51446267 130941264 49152 2006-11-03 07:37 00000000_0000_AAM_TUI_015.txt
-rwxrwxrwt 1 33686018 59768993 49152 1909-11-25 08:49 00000000_0000_AAM_TUI_021.txt
--w----rwt 1 64588982 2634154654 49152 2007-08-14 19:50 00000000_0000_AAM_TUI_034.txt
-r-----r-T 1 66500841 4060231152 49152 2007-09-19 01:13 19991001_0000_AAM_TUI_000.txt
------xrw- 1 2885846505 33621835 4243456 2005-11-10 06:47 20011112_0000_AAM_TUI_000.txt
-r-x--xrwt 1 2214740202 33685997 49152 2004-09-10 00:51 20040116_0000_AAM_TUI_000.txt
---x-w---t 1 18200553 19138330 49152 2022-08-08 13:51 20051109_0000_AAM_TUI_000.txt
--w-rw---x 1 93782446 2533491176 45056 2004-08-22 20:46 20060609_0000_AAM_TUI_000.txt
--wxrw-r-- 1 38929139 26715113 49152 2007-09-29 07:38 20061220_0000_AAM_TUI_000.txt
---xr-xrwx 1 33661673 26673902 49152 2007-03-10 04:59 20061221_0000_AAM_TUI_001.txt
---xrw---t 1 30977769 989954793 49152 2004-09-06 05:47 20070117_0000_AAM_TUI_000.txt
-rw-rw-rwt 1 80150873 3204594410 49152 2006-12-19 14:22 20070308_0000_AAM_TUI_000.txt
--w-r-x--T 1 37617711 58786132 49152 2007-04-07 21:02 20070308_0000_AAM_TUI_002.txt
-rwxr-xrwt 1 3137470985 16843449 49152 2012-03-25 11:06 20070419_0000_AAM_TUI_000.txt
----r-Srwt 1 159806442 268563177 49152 2007-08-07 23:15 20070607_0000_AAM_TUI_000.txt

None of these files can be deleted or modified.  None of the
user/group IDs is valid.  Root cannot change any of the attributes.
For example: #: rm 20011112_0000_AAM_TUI_000.txt rm: cannot remove `20011112_0000_AAM_TUI_000.txt': Operation not permitted #: chown root:root 20051109_0000_AAM_TUI_000.txt chown: changing ownership of `20051109_0000_AAM_TUI_000.txt': Operation not permitted #: chmod a+w 20061221_0000_AAM_TUI_001.txt chmod: changing permissions of `20061221_0000_AAM_TUI_001.txt': Operation not permitted Any idea of how to get rid of these files? I have about a 100 million files on that file system. "About" 30.000 are in such a state as described above. The rest behaves normally (can be modified, deleted, etc). Thanks in advance, Roland From jpiszcz at lucidpixels.com Thu Oct 23 11:30:39 2008 From: jpiszcz at lucidpixels.com (Justin Piszcz) Date: Thu, 23 Oct 2008 07:30:39 -0400 (EDT) Subject: Undeletable files In-Reply-To: <49003357.8040704@eudoxos.de> References: <49003357.8040704@eudoxos.de> Message-ID: On Thu, 23 Oct 2008, Roland Bock wrote: > Hi, > > an e2fsck-run left a few files sprinkled over the file system which seem to > be undeletable. Although the FS is mounted RW, even root does not seem to be > able to delete them. > [ .. ] > For example: > #: rm 20011112_0000_AAM_TUI_000.txt > rm: cannot remove `20011112_0000_AAM_TUI_000.txt': Operation not permitted > > #: chown root:root 20051109_0000_AAM_TUI_000.txt > chown: changing ownership of `20051109_0000_AAM_TUI_000.txt': Operation not > permitted > > #: chmod a+w 20061221_0000_AAM_TUI_001.txt > chmod: changing permissions of `20061221_0000_AAM_TUI_001.txt': Operation not > permitted > > > Any idea of how to get rid of these files? I have about a 100 million files > on that file system. "About" 30.000 are in such a state as described above. > The rest behaves normally (can be modified, deleted, etc). Either the FS is damaged or the files are chattr'd +i, lsattr -l filename. Are they immutable by chance? Justin. From rbock at eudoxos.de Thu Oct 23 11:51:53 2008 From: rbock at eudoxos.de (Roland Bock) Date: Thu, 23 Oct 2008 13:51:53 +0200 Subject: Undeletable files In-Reply-To: References: <49003357.8040704@eudoxos.de> Message-ID: <49006559.9020803@eudoxos.de> Justin, thanks for the hint! Yes, some of them are immutable, e.g. 00000000_0000_AAM_TUI_007.txt Synchronous_Directory_Updates, Immutable, No_Atime, Compression_Raw_Access Others aren't, e.g. 00000000_0000_AAM_TUI_002.txt Secure_Deletion, Append_Only, No_Atime, Compression_Raw_Access, Top_of_Directory_Hierarchie Got rid of them by: chattr -i -a *; rm * Thanks again, Roland Justin Piszcz wrote: > > > On Thu, 23 Oct 2008, Roland Bock wrote: > >> Hi, >> >> an e2fsck-run left a few files sprinkled over the file system which >> seem to be undeletable. Although the FS is mounted RW, even root does >> not seem to be able to delete them. >> > > [ .. ] > >> For example: >> #: rm 20011112_0000_AAM_TUI_000.txt >> rm: cannot remove `20011112_0000_AAM_TUI_000.txt': Operation not >> permitted >> >> #: chown root:root 20051109_0000_AAM_TUI_000.txt >> chown: changing ownership of `20051109_0000_AAM_TUI_000.txt': >> Operation not permitted >> >> #: chmod a+w 20061221_0000_AAM_TUI_001.txt >> chmod: changing permissions of `20061221_0000_AAM_TUI_001.txt': >> Operation not permitted >> >> >> Any idea of how to get rid of these files? I have about a 100 million >> files on that file system. "About" 30.000 are in such a state as >> described above. The rest behaves normally (can be modified, deleted, >> etc). > > Either the FS is damaged or the files are chattr'd +i, lsattr -l filename. 
> > Are they immutable by chance? > > Justin. > From tytso at mit.edu Thu Oct 23 14:08:33 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 23 Oct 2008 10:08:33 -0400 Subject: Block bitmap differences In-Reply-To: <49002F2C.9040209@eudoxos.de> References: <49002F2C.9040209@eudoxos.de> Message-ID: <20081023140833.GB5529@mit.edu> On Thu, Oct 23, 2008 at 10:00:44AM +0200, Roland Bock wrote: > Hi, > > a few weeks ago, an unhealthy combination of firmware in an Adaptec Raid > controller and Seagate disks damaged my Raid6 filesystem. A bunch of > files were damaged or lost at that time after the firmaware was updated > and I had run e2fsck. Luckily, I was able to restore everything from a > backup. A subsequent check with e2fsck reported no errors. > > Yesterday, I ran e2fsck -n again, to see if the system is still OK. It > isn't and I have no idea how to interpret the messages (see attachment). You ran the e2fsck while the filesystem is mounted. So the output reported is not trustworthy, and block allocation bitmap differences and free block/inode accounting information being wrong is normal when running e2fsck -n on a mounted filesystem. This message, however, is cause for concern: > /dev/sdb1 contains a file system with errors, check forced. This means the filesystem noticed some discrepancy (for example, when freeing a block, it noticed that the block bitmap already showed the block as being not in use, which should never happen and indicates filesystem corruption). I would recommend that you schedule downtime so you can run e2fsck on the filesystem while it is unmounted. Given the errors that you saw when running e2fsck while it was mounted, it's unlikely that you will see anything serious, but it is still something that you should do. Regards, - Ted From rbock at eudoxos.de Thu Oct 23 16:05:57 2008 From: rbock at eudoxos.de (Roland Bock) Date: Thu, 23 Oct 2008 18:05:57 +0200 Subject: Block bitmap differences In-Reply-To: <20081023140833.GB5529@mit.edu> References: <49002F2C.9040209@eudoxos.de> <20081023140833.GB5529@mit.edu> Message-ID: <4900A0E5.1000403@eudoxos.de> Ted, thank you for your answers. Is it normal to encounter file systems with minor errors? We run 8 systems with Ubuntu 8.04 64bit and e2fsck reports " contains file system with errors" for at least one partition on every machine. Since there are 4 different types of hardware configurations, I tend to say that hardware is rather not to be blamed... If it is not normal, what could be the reasons? Are there any options to turn on logging which could give more insight (what would be the performance impact)? Thanks and regards, Roland Theodore Tso wrote: > On Thu, Oct 23, 2008 at 10:00:44AM +0200, Roland Bock wrote: >> Hi, >> >> a few weeks ago, an unhealthy combination of firmware in an Adaptec Raid >> controller and Seagate disks damaged my Raid6 filesystem. A bunch of >> files were damaged or lost at that time after the firmaware was updated >> and I had run e2fsck. Luckily, I was able to restore everything from a >> backup. A subsequent check with e2fsck reported no errors. >> >> Yesterday, I ran e2fsck -n again, to see if the system is still OK. It >> isn't and I have no idea how to interpret the messages (see attachment). > > You ran the e2fsck while the filesystem is mounted. So the output > reported is not trustworthy, and block allocation bitmap differences > and free block/inode accounting information being wrong is normal when > running e2fsck -n on a mounted filesystem. 
> > This message, however, is cause for concern: > >> /dev/sdb1 contains a file system with errors, check forced. > > This means the filesystem noticed some discrepancy (for example, when > freeing a block, it noticed that the block bitmap already showed the > block as being not in use, which should never happen and indicates > filesystem corruption). > > I would recommend that you schedule downtime so you can run e2fsck on > the filesystem while it is unmounted. Given the errors that you saw > when running e2fsck while it was mounted, it's unlikely that you will > see anything serious, but it is still something that you should do. > > Regards, > > - Ted From sandeen at redhat.com Thu Oct 23 16:07:54 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Thu, 23 Oct 2008 11:07:54 -0500 Subject: Block bitmap differences In-Reply-To: <4900A0E5.1000403@eudoxos.de> References: <49002F2C.9040209@eudoxos.de> <20081023140833.GB5529@mit.edu> <4900A0E5.1000403@eudoxos.de> Message-ID: <4900A15A.9080602@redhat.com> Roland Bock wrote: > Ted, > > thank you for your answers. > > Is it normal to encounter file systems with minor errors? We run 8 > systems with Ubuntu 8.04 64bit and e2fsck reports " contains > file system with errors" for at least one partition on every machine. > > Since there are 4 different types of hardware configurations, I tend to > say that hardware is rather not to be blamed... > > If it is not normal, what could be the reasons? Look in your system logs; if the fs is flagged with errors, it should have issued a message when the error occurred. -Eric > Are there any options to turn on logging which could give more insight > (what would be the performance impact)? > > > Thanks and regards, > > Roland From rbock at eudoxos.de Thu Oct 23 17:53:26 2008 From: rbock at eudoxos.de (Roland Bock) Date: Thu, 23 Oct 2008 19:53:26 +0200 Subject: Block bitmap differences In-Reply-To: <4900A15A.9080602@redhat.com> References: <49002F2C.9040209@eudoxos.de> <20081023140833.GB5529@mit.edu> <4900A0E5.1000403@eudoxos.de> <4900A15A.9080602@redhat.com> Message-ID: <4900BA16.3040905@eudoxos.de> Eric, what should I be looking for? In /var/log I grep'ed for ext and fs (case insensitively) in all syslog, messages and kern.log files. I found nothing which indicated an error to me. Just occasional mount/umount messages and the like. Well, to be exact: I did find some error messages from the time when we had hardware issues on one machine. But nothing since these were resolved two weeks ago. e2fsck was happy then. Thanks and regards, Roland Eric Sandeen wrote: > Roland Bock wrote: >> Ted, >> >> thank you for your answers. >> >> Is it normal to encounter file systems with minor errors? We run 8 >> systems with Ubuntu 8.04 64bit and e2fsck reports " contains >> file system with errors" for at least one partition on every machine. >> >> Since there are 4 different types of hardware configurations, I tend to >> say that hardware is rather not to be blamed... >> >> If it is not normal, what could be the reasons? > > Look in your system logs; if the fs is flagged with errors, it should > have issued a message when the error occurred. > > -Eric > >> Are there any options to turn on logging which could give more insight >> (what would be the performance impact)? 
>> >> >> Thanks and regards, >> >> Roland > > From carlo at alinoe.com Fri Oct 24 01:16:30 2008 From: carlo at alinoe.com (Carlo Wood) Date: Fri, 24 Oct 2008 03:16:30 +0200 Subject: System crash during mke2fs Message-ID: <20081024011630.GA14432@alinoe.com> Hiya, don't know where else to report this. Please correct me if this isn't the right place. I just ran into a serious bug :(( We were trying to create a virtual filesystem in an image (file) of around 238 GB. Let the files name be foo.img, then we did: losetup /dev/loop0 foo.img and then used fdisk /dev/loop0 to create this partition table: uxley:~>fdisk -lu /dev/loop0 Disk /dev/loop0: 238.3 GB, 238370684928 bytes 255 heads, 63 sectors/track, 28980 cylinders, total 465567744 sectors Units = sectors of 1 * 512 = 512 bytes Device Boot Start End Blocks Id System /dev/loop0p1 * 63 401624 200781 83 Linux /dev/loop0p2 401625 16048934 7823655 83 Linux /dev/loop0p3 16048935 21928724 2939895 82 Linux swap / Solaris /dev/loop0p4 21928725 465563699 221817487+ 5 Extended /dev/loop0p5 21928788 27808514 2939863+ 83 Linux /dev/loop0p6 27808578 47359619 9775521 83 Linux /dev/loop0p7 47359683 57143204 4891761 83 Linux /dev/loop0p8 57143268 465563699 204210216 83 Linux Next we did: losetup -o $((512 * 63)) /dev/loop1 /dev/loop0 which should make the first partition available under /dev/loop1 (this certainly works if that partition already contains a fs, we then can mount it). Finally, I wanted to create a filesystem and ran the following command: uxley:~>mke2fs -j -L "/boot" /dev/loop1 mke2fs 1.40-WIP (14-Nov-2006) Filesystem label=/boot OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 29097984 inodes, 58195960 blocks 2909798 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=0 1776 block groups 32768 blocks per group, 32768 fragments per group 16384 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872 Writing inode tables: 306/1776 Here the machine completely halted/crashed. I don't know what happened, because it's a remote machine. The writing of the inode table started very fast, but it was already slowing down the last few - and completely stopped at 306, which was 12 minutes ago (my ssh connection to the machine still didn't time out, weird enough). I can still ping the machine I see. Note that mke2fs says: 29097984 inodes, 58195960 blocks That is 58195960 * 4096 = 238370652160 the full size of the image file?!? This partition is only 200MB though! Did I do something very stupid, or is this a bug in mke2fs ? 
-- Carlo Wood From jordi.prats at gmail.com Fri Oct 24 06:47:31 2008 From: jordi.prats at gmail.com (Jordi Prats) Date: Fri, 24 Oct 2008 08:47:31 +0200 Subject: System crash during mke2fs In-Reply-To: <20081024011630.GA14432@alinoe.com> References: <20081024011630.GA14432@alinoe.com> Message-ID: <1908f30810232347x7378a6d0o2476e794153b7a68@mail.gmail.com> I don't know how this can hang your system, but instead of doing this: losetup -o $((512 * 63)) /dev/loop1 /dev/loop0 You could use kpartx: kpartx -a /dev/loop0 You are going to find in /dev/mapper your loop0p1: Here you can find an example: [root at shuVak ~]# dd if=/dev/zero of=caca bs=1024k count=100 100+0 records in 100+0 records out 104857600 bytes (105 MB) copied, 0.656977 seconds, 160 MB/s [root at shuVak ~]# losetup /dev/loop0 caca [root at shuVak ~]# fdisk /dev/loop0 Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel Building a new DOS disklabel. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable. Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): p Disk /dev/loop0: 104 MB, 104857600 bytes 255 heads, 63 sectors/track, 12 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System Command (m for help): n Command action e extended p primary partition (1-4) p Partition number (1-4): 1 First cylinder (1-12, default 1): Using default value 1 Last cylinder or +size or +sizeM or +sizeK (1-12, default 12): Using default value 12 Command (m for help): p Disk /dev/loop0: 104 MB, 104857600 bytes 255 heads, 63 sectors/track, 12 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/loop0p1 1 12 96358+ 83 Linux Command (m for help): t Selected partition 1 Hex code (type L to list codes): 8e Changed system type of partition 1 to 8e (Linux LVM) Command (m for help): w The partition table has been altered! Calling ioctl() to re-read partition table. WARNING: Re-reading the partition table failed with error 22: Invalid argument. The kernel still uses the old table. The new table will be used at the next reboot. Syncing disks. [root at shuVak ~]# ls /dev/loop* loop0 loop1 loop2 loop3 loop4 loop5 loop6 loop7 [root at shuVak ~]# kpartx -a /dev/loop0 [root at shuVak ~]# ls /dev/mapper/loop0p1 /dev/mapper/loop0p1 regards, Jordi On Fri, Oct 24, 2008 at 3:16 AM, Carlo Wood wrote: > Hiya, don't know where else to report this. Please > correct me if this isn't the right place. > > I just ran into a serious bug :(( > > We were trying to create a virtual filesystem > in an image (file) of around 238 GB. 
> > Let the files name be foo.img, then we did: > > losetup /dev/loop0 foo.img > > and then used fdisk /dev/loop0 to create this partition > table: > > uxley:~>fdisk -lu /dev/loop0 > > Disk /dev/loop0: 238.3 GB, 238370684928 bytes > 255 heads, 63 sectors/track, 28980 cylinders, total 465567744 sectors > Units = sectors of 1 * 512 = 512 bytes > > Device Boot Start End Blocks Id System > /dev/loop0p1 * 63 401624 200781 83 Linux > /dev/loop0p2 401625 16048934 7823655 83 Linux > /dev/loop0p3 16048935 21928724 2939895 82 Linux swap / Solaris > /dev/loop0p4 21928725 465563699 221817487+ 5 Extended > /dev/loop0p5 21928788 27808514 2939863+ 83 Linux > /dev/loop0p6 27808578 47359619 9775521 83 Linux > /dev/loop0p7 47359683 57143204 4891761 83 Linux > /dev/loop0p8 57143268 465563699 204210216 83 Linux > > Next we did: > > losetup -o $((512 * 63)) /dev/loop1 /dev/loop0 > > which should make the first partition available under /dev/loop1 > (this certainly works if that partition already contains a fs, > we then can mount it). > > Finally, I wanted to create a filesystem and ran the following > command: > > uxley:~>mke2fs -j -L "/boot" /dev/loop1 > mke2fs 1.40-WIP (14-Nov-2006) > Filesystem label=/boot > OS type: Linux > Block size=4096 (log=2) > Fragment size=4096 (log=2) > 29097984 inodes, 58195960 blocks > 2909798 blocks (5.00%) reserved for the super user > First data block=0 > Maximum filesystem blocks=0 > 1776 block groups > 32768 blocks per group, 32768 fragments per group > 16384 inodes per group > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872 > > Writing inode tables: 306/1776 > > > Here the machine completely halted/crashed. I don't know what > happened, because it's a remote machine. > > The writing of the inode table started very fast, but it was > already slowing down the last few - and completely stopped > at 306, which was 12 minutes ago (my ssh connection to the > machine still didn't time out, weird enough). > > I can still ping the machine I see. > > Note that mke2fs says: 29097984 inodes, 58195960 blocks > That is 58195960 * 4096 = 238370652160 the full size of > the image file?!? > > This partition is only 200MB though! > > Did I do something very stupid, or is this a bug in mke2fs ? > > -- > Carlo Wood > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Jordi From tytso at mit.edu Fri Oct 24 10:54:40 2008 From: tytso at mit.edu (Theodore Tso) Date: Fri, 24 Oct 2008 06:54:40 -0400 Subject: System crash during mke2fs In-Reply-To: <20081024011630.GA14432@alinoe.com> References: <20081024011630.GA14432@alinoe.com> Message-ID: <20081024105440.GC8658@mit.edu> On Fri, Oct 24, 2008 at 03:16:30AM +0200, Carlo Wood wrote: > Hiya, don't know where else to report this. Please > correct me if this isn't the right place. > > I just ran into a serious bug :(( > > We were trying to create a virtual filesystem > in an image (file) of around 238 GB. [Using double losetup configuration] > > Here the machine completely halted/crashed. I don't know what > happened, because it's a remote machine. > > The writing of the inode table started very fast, but it was > already slowing down the last few - and completely stopped > at 306, which was 12 minutes ago (my ssh connection to the > machine still didn't time out, weird enough). That's a classic case of mke2fs tickling a VM bug. 
The VM should be able to do proper write throttling, but mke2fs writes a blocks very quickly, and so it's a great test of the kernel virtual memory subsystem. :-) So the fact that your system hung is a kernel bug, probably caued by the double /dev/loop configuration. What version of the kernel are you using? There is a workaround that might help: "export MKE2FS_SYNC=10". This will force an explicit sync system call every 10 blockgroups, which tends to work around the kernel VM bug. It's not the default mainly because mke2fs is such a great kernel test tool, and the VM really needs to be able to handle this case. > Note that mke2fs says: 29097984 inodes, 58195960 blocks > That is 58195960 * 4096 = 238370652160 the full size of > the image file?!? > > This partition is only 200MB though! That's because you created /dev/loop1 as a loop device with an offset of 512*63 bytes from the beginning of /dev/loop0. There is no way to set the maximum size of a loop device (it's not something which is currently defined as part of the interface of the LOOP_SET_STATUS ioctl. If you want to do things manually like this, you'll need to explicitly specify the size of the desired filesystem to mke2fs; it's a shortcoming in the loop device. The other way to do things would be to create an image file of the desired partition length, and then assemble it by hand afterwards; sorry, the loop device wasn't designed to be used to emulate a partitioned disk. It could be, but kernel patches would be required to extend its functionality. Regards, - Ted From rbock at eudoxos.de Fri Oct 24 15:19:25 2008 From: rbock at eudoxos.de (Roland Bock) Date: Fri, 24 Oct 2008 17:19:25 +0200 Subject: e2fsck discrepancies Message-ID: <4901E77D.6010602@eudoxos.de> Hi, yesterday I ran e2fsck -n on a mounted file system and got: /dev/sdb1 contains a file system with errors, check forced. According to Ted, the lines that followed were not to be trusted due to the fact that the file system was mounted. But this error statement suggests to run a check with the fs unmounted. Today, we scheduled a downtime and ran the check. It came of completely clean: ~: e2fsck -fy /dev/sdb1 e2fsck 1.40.8 (13-Mar-2008) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/sdb1: 32028520/536870912 files (0.5% non-contiguous), 802465197/2147460933 blocks Does this mean that read-only checks are generally not trustworthy, even the statement that the filesystem has errors? Or something like Read-only reports clean: fine Read-only reports error: not necessarily really an error Thanks and regards, Roland From sandeen at redhat.com Fri Oct 24 15:20:33 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 24 Oct 2008 10:20:33 -0500 Subject: System crash during mke2fs In-Reply-To: <20081024011630.GA14432@alinoe.com> References: <20081024011630.GA14432@alinoe.com> Message-ID: <4901E7C1.4050201@redhat.com> Carlo Wood wrote: > Hiya, don't know where else to report this. Please > correct me if this isn't the right place. > > I just ran into a serious bug :(( ... > Finally, I wanted to create a filesystem and ran the following > command: > > uxley:~>mke2fs -j -L "/boot" /dev/loop1 ... > Here the machine completely halted/crashed. I don't know what > happened, because it's a remote machine. It'd be very good to have a console so you can see what really truly happened. 
From rbock at eudoxos.de  Fri Oct 24 15:19:25 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Fri, 24 Oct 2008 17:19:25 +0200
Subject: e2fsck discrepancies
Message-ID: <4901E77D.6010602@eudoxos.de>

Hi,

yesterday I ran e2fsck -n on a mounted file system and got:

/dev/sdb1 contains a file system with errors, check forced.

According to Ted, the lines that followed were not to be trusted due
to the fact that the file system was mounted. But this error statement
suggests running a check with the fs unmounted.

Today, we scheduled a downtime and ran the check. It came off
completely clean:

~: e2fsck -fy /dev/sdb1

e2fsck 1.40.8 (13-Mar-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb1: 32028520/536870912 files (0.5% non-contiguous),
802465197/2147460933 blocks

Does this mean that read-only checks are generally not trustworthy,
even the statement that the filesystem has errors? Or something like:

Read-only reports clean: fine
Read-only reports error: not necessarily really an error

Thanks and regards,

Roland

From sandeen at redhat.com  Fri Oct 24 15:20:33 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 24 Oct 2008 10:20:33 -0500
Subject: System crash during mke2fs
In-Reply-To: <20081024011630.GA14432@alinoe.com>
References: <20081024011630.GA14432@alinoe.com>
Message-ID: <4901E7C1.4050201@redhat.com>

Carlo Wood wrote:
> Hiya, don't know where else to report this. Please
> correct me if this isn't the right place.
>
> I just ran into a serious bug :((

...

> Finally, I wanted to create a filesystem and ran the following
> command:
>
> uxley:~>mke2fs -j -L "/boot" /dev/loop1

...

> Here the machine completely halted/crashed. I don't know what
> happened, because it's a remote machine.

It'd be very good to have a console so you can see what really, truly
happened. A remote machine w/o a console would scare me in any case. :)

Is the image file sparse, or is it filled in with zeros? Is it hosted
on ext3?

Especially if it's sparse, but in either case, I'd be curious to know
if it works out any better or worse with other filesystems hosting the
image file - trying ext4 and/or xfs just as an experiment might be
interesting...

-Eric

From sandeen at redhat.com  Fri Oct 24 15:30:40 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 24 Oct 2008 10:30:40 -0500
Subject: e2fsck discrepancies
In-Reply-To: <4901E77D.6010602@eudoxos.de>
References: <4901E77D.6010602@eudoxos.de>
Message-ID: <4901EA20.8070800@redhat.com>

Roland Bock wrote:
> Hi,
>
> yesterday I ran e2fsck -n on a mounted file system and got:
>
> /dev/sdb1 contains a file system with errors, check forced.
>
> According to Ted, the lines that followed were not to be trusted due
> to the fact that the file system was mounted. But this error statement
> suggests running a check with the fs unmounted.
>
> Today, we scheduled a downtime and ran the check. It came off
> completely clean:
>
> ~: e2fsck -fy /dev/sdb1
>
> e2fsck 1.40.8 (13-Mar-2008)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/sdb1: 32028520/536870912 files (0.5% non-contiguous),
> 802465197/2147460933 blocks
>
> Does this mean that read-only checks are generally not trustworthy,
> even the statement that the filesystem has errors? Or something like:
>
> Read-only reports clean: fine
> Read-only reports error: not necessarily really an error

I think that's possible. When e2fsck starts off, main() does:

main()
  check_super_block()
    if some sanity tests fail
      ext2fs_unmark_valid()
  check_if_skip()
    if EXT2_ERROR_FS || !ext2fs_test_valid()
      "... contains a file system with errors"

check_if_skip is what issues the "contains a file system with errors"
message, and it may do so if the filesystem is marked with errors, OR
if a call to ext2fs_test_valid() fails.

Prior to this, check_super_block() may call ext2fs_unmark_valid() for
a variety of reasons, some of which could, I think, be caused by the
filesystem being live and not necessarily consistent when viewed by
e2fsck.

So I think that the message is a bit misleading; "filesystem with
errors" sounds to me like EXT2_ERROR_FS, which should always issue
some sort of message to the syslog when set - but you may also get the
"filesystem with errors" message due to some inconsistencies that may
be wholly due to the filesystem being mounted and in flux as fsck
tries to read it.

-Eric
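One way to get a read-only check whose results *can* be trusted - not raised in this thread, so treat it as an aside - is to run e2fsck against a point-in-time snapshot instead of the live device. A sketch, assuming the filesystem lives on an LVM volume (the name /dev/vg0/data and the 1G snapshot size are illustrative):

    # Take a snapshot; it sees a frozen, crash-consistent image of the fs.
    lvcreate -s -L 1G -n data-snap /dev/vg0/data

    # A forced read-only check of the snapshot; complaints here are real,
    # not artifacts of the filesystem changing underneath e2fsck.  (With
    # ext3, the snapshot looks like a crashed fs, so expect it to report
    # that the journal needs recovery.)
    e2fsck -fn /dev/vg0/data-snap

    # Drop the snapshot when done.
    lvremove -f /dev/vg0/data-snap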
From lll+ext3 at m4x.org  Fri Oct 24 15:59:22 2008
From: lll+ext3 at m4x.org (Loic Le Loarer)
Date: Fri, 24 Oct 2008 17:59:22 +0200
Subject: see current superblock information
Message-ID: <20081024155922.GF24933@pavuc.le-loarer.org>

Hi all,

I would like to get the current used/free inode count of a mounted
ext3 fs. It is very useful for debugging a situation where you cannot
create new files even though the fs isn't full according to "df"
(i.e. when the free inode count is zero).

My first idea was to use "tune2fs -l /dev/device"; it gives all the
information I need, but it reflects only the on-disk superblock, which
seems to never be written while the fs is mounted.

So I'm looking for a way to either force a flush of the superblock or
to just get the current used/free inode count.

I hope I have contacted the correct mailing list.

Thank you in advance for your answers.

Best regards.

--
Loïc

"heaven is not a place, it's a feeling"

From carlo at alinoe.com  Fri Oct 24 17:42:27 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Fri, 24 Oct 2008 19:42:27 +0200
Subject: System crash during mke2fs
In-Reply-To: <20081024105440.GC8658@mit.edu>
References: <20081024011630.GA14432@alinoe.com> <20081024105440.GC8658@mit.edu>
Message-ID: <20081024174227.GA24607@alinoe.com>

On Fri, Oct 24, 2008 at 06:54:40AM -0400, Theodore Tso wrote:
> probably caused by the double /dev/loop configuration. What version
> of the kernel are you using?

It's running 2.6.18-6-686.

We rebooted the machine and nothing seemed corrupted or wrong, except
the virtual machine file; it took another 7 hours to recreate that
(it's a vmware thing).

--
Carlo Wood

From carlo at alinoe.com  Fri Oct 24 17:46:33 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Fri, 24 Oct 2008 19:46:33 +0200
Subject: System crash during mke2fs
In-Reply-To: <4901E7C1.4050201@redhat.com>
References: <20081024011630.GA14432@alinoe.com> <4901E7C1.4050201@redhat.com>
Message-ID: <20081024174633.GC24607@alinoe.com>

On Fri, Oct 24, 2008 at 10:20:33AM -0500, Eric Sandeen wrote:
> Is the image file sparse, or is it filled in with zeros? Is it hosted
> on ext3?

Not sparse, but probably filled with zeroes. And yes, it is hosted on
ext3.

> Especially if it's sparse, but in either case, I'd be curious to know
> if it works out any better or worse with other filesystems hosting the
> image file - trying ext4 and/or xfs just as an experiment might be
> interesting...

We're just trying to save this company that has been down for three
weeks now ;)  No time for experiments :p

Anyway, thanks for your comments. In the meantime we're back on track,
fortunately.

--
Carlo Wood

From Curtis at GreenKey.net  Fri Oct 24 17:47:55 2008
From: Curtis at GreenKey.net (Curtis Doty)
Date: Fri, 24 Oct 2008 10:47:55 -0700 (PDT)
Subject: see current superblock information
In-Reply-To: <20081024155922.GF24933@pavuc.le-loarer.org>
References: <20081024155922.GF24933@pavuc.le-loarer.org>
Message-ID: <20081024174756.49A5E6F064@alopias.GreenKey.net>

5:59pm Loic Le Loarer said:
>
> So I'm looking for a way to either force a flush of the superblock or
> to just get the current used/free inode count.

df -i

Is that what you seek?

../C
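The reason df -i works where tune2fs -l does not: df asks the kernel via statfs(2), which reports the live in-memory counters rather than the stale on-disk superblock. A quick illustration (mount point and figures are made up):

    # Live inode usage; IFree hitting 0 explains ENOSPC with space left.
    $ df -i /mnt/data
    Filesystem      Inodes   IUsed    IFree IUse% Mounted on
    /dev/sdb1     36700160 36700160        0  100% /mnt/data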
From rbock at eudoxos.de  Fri Oct 24 17:49:57 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Fri, 24 Oct 2008 19:49:57 +0200
Subject: e2fsck discrepancies
In-Reply-To: <4901EA20.8070800@redhat.com>
References: <4901E77D.6010602@eudoxos.de> <4901EA20.8070800@redhat.com>
Message-ID: <49020AC5.9000703@eudoxos.de>

Eric:

thanks for the confirmation. Now that I read the man page again, I
wonder how I could miss that part:

"[...] However, even if it is safe to do so, the results printed by
e2fsck are not valid if the filesystem is mounted."

Blessed is he who can read :-)

Best regards,

Roland


Eric Sandeen wrote:
> Roland Bock wrote:
>> Hi,
>>
>> yesterday I ran e2fsck -n on a mounted file system and got:
>>
>> /dev/sdb1 contains a file system with errors, check forced.
>>
>> According to Ted, the lines that followed were not to be trusted due
>> to the fact that the file system was mounted. But this error statement
>> suggests running a check with the fs unmounted.
>>
>> Today, we scheduled a downtime and ran the check. It came off
>> completely clean:
>>
>> ~: e2fsck -fy /dev/sdb1
>>
>> e2fsck 1.40.8 (13-Mar-2008)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> /dev/sdb1: 32028520/536870912 files (0.5% non-contiguous),
>> 802465197/2147460933 blocks
>>
>> Does this mean that read-only checks are generally not trustworthy,
>> even the statement that the filesystem has errors? Or something like:
>>
>> Read-only reports clean: fine
>> Read-only reports error: not necessarily really an error
>
> I think that's possible. When e2fsck starts off, main() does:
>
> main()
>   check_super_block()
>     if some sanity tests fail
>       ext2fs_unmark_valid()
>   check_if_skip()
>     if EXT2_ERROR_FS || !ext2fs_test_valid()
>       "... contains a file system with errors"
>
> check_if_skip is what issues the "contains a file system with errors"
> message, and it may do so if the filesystem is marked with errors, OR
> if a call to ext2fs_test_valid() fails.
>
> Prior to this, check_super_block() may call ext2fs_unmark_valid() for
> a variety of reasons, some of which could, I think, be caused by the
> filesystem being live and not necessarily consistent when viewed by
> e2fsck.
>
> So I think that the message is a bit misleading; "filesystem with
> errors" sounds to me like EXT2_ERROR_FS, which should always issue
> some sort of message to the syslog when set - but you may also get the
> "filesystem with errors" message due to some inconsistencies that may
> be wholly due to the filesystem being mounted and in flux as fsck
> tries to read it.
>
> -Eric

From lll+ext3 at m4x.org  Fri Oct 24 22:48:01 2008
From: lll+ext3 at m4x.org (Loic Le Loarer)
Date: Sat, 25 Oct 2008 00:48:01 +0200
Subject: see current superblock information
In-Reply-To: <20081024174756.49A5E6F064@alopias.GreenKey.net>
References: <20081024155922.GF24933@pavuc.le-loarer.org> <20081024174756.49A5E6F064@alopias.GreenKey.net>
Message-ID: <20081024224801.GH24933@pavuc.le-loarer.org>

On Friday, 24 October 2008 at 10:47:55 -0700, Curtis Doty wrote:
> 5:59pm Loic Le Loarer said:
>>
>> So I'm looking for a way to either force a flush of the superblock or
>> to just get the current used/free inode count.
>
> df -i
>
> Is that what you seek?

Exactly - it's so obvious now that you say it.

Thank you for the help!

--
Loïc

From lists at nerdbynature.de  Sat Oct 25 23:22:02 2008
From: lists at nerdbynature.de (Christian Kujau)
Date: Sat, 25 Oct 2008 16:22:02 -0700 (PDT)
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <48FBEE8A.80608@obsidian.com.au>
References: <48FBEE8A.80608@obsidian.com.au>
Message-ID: 

Probably too late anyway, but:

On Mon, 20 Oct 2008, Robert Davidson wrote:
> The "kjournald" process also got stuck in the "D" state.

Did you try a SysRq-w to show all blocked tasks? Or even -d or -t. You
mentioned /var/log was on a different filesystem, so this information
might make it to the disks. If not, your serial console should catch
it. Maybe then we'll find out *why* these processes are in "D" state.

Christian.

--
BOFH excuse #25:

Decreasing electron flux
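For anyone wanting to follow this advice on a remote box, a sketch of driving SysRq without a keyboard (the loglevel step matters, as the follow-up below shows - output that never reaches the console is easy to mistake for no output):

    # Make sure the magic SysRq key is enabled at all.
    echo 1 > /proc/sys/kernel/sysrq

    # Raise the console loglevel so SysRq output reaches the (serial) console.
    echo 8 > /proc/sys/kernel/printk

    # SysRq-w: dump blocked (D-state) tasks.  -d and -t work the same way.
    echo w > /proc/sysrq-trigger
    echo t > /proc/sysrq-trigger   # full task dump; can be very long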
From rdavidson at obsidian.com.au  Mon Oct 27 01:10:12 2008
From: rdavidson at obsidian.com.au (Robert Davidson)
Date: Mon, 27 Oct 2008 12:10:12 +1100
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: 
References: <48FBEE8A.80608@obsidian.com.au>
Message-ID: <490514F4.4060801@obsidian.com.au>

Christian Kujau wrote:
> Probably too late anyway, but:
>
> On Mon, 20 Oct 2008, Robert Davidson wrote:
>> The "kjournald" process also got stuck in the "D" state.
>
> Did you try a SysRq-w to show all blocked tasks? Or even -d or -t. You
> mentioned /var/log was on a different filesystem, so this information
> might make it to the disks. If not, your serial console should catch
> it. Maybe then we'll find out *why* these processes are in "D" state.

Hi Christian,

Not too late - this is still an ongoing problem. I'm currently trying
to get some newer vserver patches so I can build a newer kernel and
try that. Currently I'm stuck with 2.6.22.19.

I've tried doing various SysRq requests; none of them would give me
anything back on the serial console, but it seems that may have been
my own fault for having the console logging set too low. I've fixed
that up now. In any case, the responses you'd expect to see from the
kernel for the various SysRq commands never made it into the logs.

About a month ago, when the server last had problems, I made a new
ext3 filesystem and copied everything from the old filesystem to the
new one. I thought that had worked, but then last night we lost the
same filesystem again and had to reboot. After copying everything off
the original filesystem (also ext3), I ran a forced fsck.ext3 on it
and it didn't find any problems.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au

From puhuri at iki.fi  Mon Oct 27 09:40:21 2008
From: puhuri at iki.fi (Markus Peuhkuri)
Date: Mon, 27 Oct 2008 11:40:21 +0200
Subject: Unlink performance
Message-ID: <49058C85.8060901@iki.fi>

Hi,

I have problems with ext3 deletes blocking filesystem access or
slowing down write speeds. My system is the following:

* One process reads real-time data (with a few seconds of buffering)
  and, after processing, writes it out at a top speed of 2x10 Mbyte/s
  (two streams to different disks).
* Two further processes read data from those same disks, process it
  further, and copy it to yet another pair of disks.
* Yet another process deletes older files to keep disk usage below 85%.

The reason for this kind of processing is that the second step is too
slow to happen in real time; the incoming data is bursty in nature,
and at peak load the processors are not fast enough to process the
data. On average (given the 2x900 GB disk buffer) the system is,
however, fast enough to post-process the data.

However, my delete script malfunctioned, and at one point it had
2x100 GB of files to delete; it thus ran 'rm file' one after another
for those 400 files, about 500 MB each. What resulted was that the
real-time data processing became too slow and the buffers overflowed.

Of course, I could force the delete script to sleep a few seconds
between file deletes to let the write process recover, but this still
feels like an unreliable patch.

I looked at IO schedulers, but while I'm quite familiar with
networking queues, IO scheduling is largely unknown territory for me.
I assume that you cannot assign per-process priorities with IO
schedulers? If that were possible, I would max out the priority for
the real-time process and give the delete function the lowest one.

Any ideas how I could make sure that the system does its best to
provide good service for the real-time processing? The secondary
processing is niced but, if I recall right, the delete was running
with nice 0.

I had a few ideas for improving things, but have not yet had time to
implement them:

* I could use a tee-like program for post-processing. At first it
  would try to process the data in real time (reading from the raw
  stream after it has been written to disk, so the data could still be
  in the buffer cache), but if it could not keep up, it would just
  queue the post-processing and continue later, when load allows.
* Smaller files would of course make the blocking time shorter.

If it matters, the systems use sata disks (both native and scsi-raid)
and run kernel 2.6.26 (Debian Lenny).

Markus
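On the scheduler question: on 2.6.26 the default CFQ I/O scheduler does in fact support per-process I/O priorities; they are set with ionice(1) from util-linux rather than with nice(1). A sketch of how the competing processes could be classed (paths and PIDs are illustrative):

    # Run the cleanup in the "idle" class: it gets disk time only when
    # nothing else wants it (requires the CFQ scheduler on that disk).
    ionice -c3 rm /buffer/disk1/old-capture-0001.dat

    # Or demote an already-running delete script by PID:
    ionice -c3 -p 4242

    # Best-effort class, highest priority, for the real-time writer:
    ionice -c2 -n0 -p 4241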
From alex at alex.org.uk  Mon Oct 27 09:30:18 2008
From: alex at alex.org.uk (Alex Bligh)
Date: Mon, 27 Oct 2008 10:30:18 +0100
Subject: Unlink performance
In-Reply-To: <49058C85.8060901@iki.fi>
References: <49058C85.8060901@iki.fi>
Message-ID: <52F49968757FFFD36073E072@Ximines.local>

--On 27 October 2008 11:40:21 +0200 Markus Peuhkuri wrote:

> However, my delete script malfunctioned, and at one point it had
> 2x100 GB of files to delete; it thus ran 'rm file' one after another
> for those 400 files, about 500 MB each. What resulted was that the
> real-time data processing became too slow and the buffers overflowed.

Are all the files in the same directory? Even with HTREE there seem
to be cases where this is surprisingly slow. Look into using nested
directories (e.g. A/B/C/D/foo where A, B, C, D are truncated hashes
of the file name).

Or, if you don't mind losing data in a power-off and the job suits,
unlink the file name immediately after your processor has opened it.
Then it will be deleted on close.

Alex
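Sketches of both suggestions (a bash sketch; the file name, the four-level hash depth, and the hypothetical postprocess command are illustrative):

    # 1) Nested hash directories: spread files so no single directory
    #    grows huge.
    name="capture-20081027-1140.dat"
    h=$(printf '%s' "$name" | md5sum | cut -c1-4)
    dir="/buffer/${h:0:1}/${h:1:1}/${h:2:1}/${h:3:1}"
    mkdir -p "$dir"
    mv "/incoming/$name" "$dir/$name"

    # 2) Unlink-after-open: the name disappears at once, but the blocks
    #    are freed only when fd 3 closes - and the file is simply gone
    #    after a power-off, as Alex notes.
    exec 3< "$dir/$name"
    rm -- "$dir/$name"
    postprocess <&3     # hypothetical consumer reading from fd 3
    exec 3<&-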
Doing the "unlink; sleep 1" will keep the traffic to the journal lower, as would deleting fewer files more often to ensure you don't delete 200GB of data at one time if you have real-time requirements. If you are not creating files faster than 1/s unlinks should be able to keep up. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.