From sct at redhat.com Mon Sep 6 14:26:57 2004 From: sct at redhat.com (Stephen C. Tweedie) Date: 06 Sep 2004 15:26:57 +0100 Subject: strange non-eraseable mixture between file and directory on ext3 FS In-Reply-To: <20040905181805.GA7547@dominikbrodowski.de> References: <20040905181805.GA7547@dominikbrodowski.de> Message-ID: <1094480817.2687.6.camel@sisko.scot.redhat.com> Hi, On Sun, 2004-09-05 at 19:18, Dominik Brodowski wrote: > On my notebook I got a strange ext3-related problem: > > root at jura98 /usr/portage/app-misc/obexftp # ls -l > total 4 > dr-Sr-sr-t 2 8242 15720 4096 Dec 28 1993 metadata.xml > [how this got created is beyond my knowledge, possibly because of some > random memory corruption a bad kernel patch caused a few weeks ago] > root at jura98 /usr/portage/app-misc/obexftp # rm -r metadata.xml/ > rm: cannot remove directory `metadata.xml/': Operation not permitted Try "lsattr" on it --- is the immutable or append-only bit set? Use "chattr -ai" to clear them. Cheers, Stephen From linux at dominikbrodowski.de Sun Sep 5 18:18:05 2004 From: linux at dominikbrodowski.de (Dominik Brodowski) Date: Sun, 5 Sep 2004 20:18:05 +0200 Subject: strange non-eraseable mixture between file and directory on ext3 FS Message-ID: <20040905181805.GA7547@dominikbrodowski.de> Hi! 
On my notebook I got a strange ext3-related problem: root at jura98 /usr/portage/app-misc/obexftp # ls -l total 4 dr-Sr-sr-t 2 8242 15720 4096 Dec 28 1993 metadata.xml [how this got created is beyond my knowledge, possibly because of some random memory corruption a bad kernel patch caused a few weeks ago] root at jura98 /usr/portage/app-misc/obexftp # rm metadata.xml/ rm: cannot remove `metadata.xml/': Is a directory root at jura98 /usr/portage/app-misc/obexftp # rm metadata.xml rm: cannot remove `metadata.xml': Operation not permitted root at jura98 /usr/portage/app-misc/obexftp # rm -r metadata.xml/ rm: cannot remove directory `metadata.xml/': Operation not permitted Running e2fsck 1.35 (28-Feb-2004) Using EXT2FS Library version 1.35, 28-Feb-2004 doesn't help: root at jura98 /usr # e2fsck -f /dev/hda5 e2fsck 1.35 (28-Feb-2004) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/hda5: 97724/1221600 files (0.9% non-contiguous), 977409/2441872 blocks Does anybody have any idea on what could be done? Thanks, Dominik From sudhirs at ibrix.com Sat Sep 4 20:38:58 2004 From: sudhirs at ibrix.com (Sudhir Srinivasan) Date: Sat, 4 Sep 2004 16:38:58 -0400 Subject: "bit already cleared" messages Message-ID: <004601c492bf$406e5d40$8302a8c0@SUDHIRIBRIX> There's been a lot of discussion in the past on these somewhat-mysterious "bit already cleared" messages appearing in the logs when using Ext3 file systems. Unfortunately, I didn't see a conclusion to any of these threads. We're still seeing these messages pop up occasionally and fsck's of the file system reveal a lot of orphaned inodes and such. Anybody else see these 'bit already cleared' messages anymore? Anybody know where this thread has gone in the past - i.e. was there ever a definitive culprit? 
Thanks -Sudhir From sct at redhat.com Fri Sep 10 16:09:27 2004 From: sct at redhat.com (Stephen C. Tweedie) Date: 10 Sep 2004 17:09:27 +0100 Subject: "bit already cleared" messages In-Reply-To: <004601c492bf$406e5d40$8302a8c0@SUDHIRIBRIX> References: <004601c492bf$406e5d40$8302a8c0@SUDHIRIBRIX> Message-ID: <1094832566.2047.147.camel@sisko.scot.redhat.com> Hi, On Sat, 2004-09-04 at 21:38, Sudhir Srinivasan wrote: > There's been a lot of discussion in the past on > these somewhat-mysterious "bit already cleared" > messages appearing in the logs when using Ext3 > file systems. Well, ultimately it means that the data on disk is corrupt. But whether that's a software or a hardware fault requires much more data to analyse, and each case may be unique. --Stephen From mika.liljeberg at welho.com Sat Sep 11 16:02:39 2004 From: mika.liljeberg at welho.com (Mika Liljeberg) Date: Sat, 11 Sep 2004 19:02:39 +0300 Subject: External journal on flash drive Message-ID: <1094918559.17777.9.camel@hades> Hi, I'd like to use a flash drive as a journal device, with the purpose of keeping the main disk drive spun down as long as possible. I have a couple of questions: 1) Does the journaling code spread write accesses to the journal device evenly, as I hope, or are there blocks that are particularly "hot"? I.e., do I have to worry about the flash device dying quickly because of frequent erase/rewrite cycles to a small number of blocks? 2) Currently, the main drive seems to spin up within 60 seconds after a write access. I would like the checkpointing to occur only when the journal device is getting full. How can I tune this? 
Thanks, MikaL From pegasus at nerv.eu.org Sun Sep 12 20:48:39 2004 From: pegasus at nerv.eu.org (Jure Pečar) Date: Sun, 12 Sep 2004 22:48:39 +0200 Subject: External journal on flash drive In-Reply-To: <1094918559.17777.9.camel@hades> References: <1094918559.17777.9.camel@hades> Message-ID: <20040912224839.1fc92b0a.pegasus@nerv.eu.org> On Sat, 11 Sep 2004 19:02:39 +0300 Mika Liljeberg wrote: > 1) Does the journaling code spread write accesses to the journal device > evenly, as I hope, or are there blocks that are particularly "hot"? > I.e., do I have to worry about the flash device dying quickly because of > frequent erase/rewrite cycles to a small number of blocks? Any smart enough journal device should do this internally. A journal, by its nature, is like a circular file. Can't say much about how its I/Os are laid out on disk, but it shouldn't matter really. > 2) Currently, the main drive seems to spin up within 60 seconds after a > write access. I would like the checkpointing to occur only when the > journal device is getting full. How can I tune this? Check the vm.laptop_mode and vm.bdflush sysctl settings or their appropriate /proc entries and the elvtune command. -- Jure Pečar From mika.liljeberg at welho.com Mon Sep 13 13:26:29 2004 From: mika.liljeberg at welho.com (Mika Liljeberg) Date: Mon, 13 Sep 2004 16:26:29 +0300 Subject: External journal on flash drive In-Reply-To: <20040912224839.1fc92b0a.pegasus@nerv.eu.org> References: <1094918559.17777.9.camel@hades> <20040912224839.1fc92b0a.pegasus@nerv.eu.org> Message-ID: <1095081989.17777.43.camel@hades> On Sun, 2004-09-12 at 23:48, Jure Pečar wrote: > On Sat, 11 Sep 2004 19:02:39 +0300 > Mika Liljeberg wrote: > > 2) Currently, the main drive seems to spin up within 60 seconds after a > > write access. I would like the checkpointing to occur only when the > > journal device is getting full. How can I tune this? 
> > Check vm.laptop_mode and vm.bdflush sysctl settings or their appropriate > /proc entries and elvtune command. Can you be a bit more specific? As far as I can see, none of these seems to do exactly what I need (i.e. delay writes from journal to fs). Thanks, MikaL From pegasus at nerv.eu.org Mon Sep 13 17:54:38 2004 From: pegasus at nerv.eu.org (Jure Pečar) Date: Mon, 13 Sep 2004 19:54:38 +0200 Subject: External journal on flash drive In-Reply-To: <1095081989.17777.43.camel@hades> References: <1094918559.17777.9.camel@hades> <20040912224839.1fc92b0a.pegasus@nerv.eu.org> <1095081989.17777.43.camel@hades> Message-ID: <20040913195438.59181d79.pegasus@nerv.eu.org> On Mon, 13 Sep 2004 16:26:29 +0300 Mika Liljeberg wrote: > Can you be a bit more specific? As far as I can see, none of these seems > to do exactly what I need (i.e. delay writes from journal to fs). Check for example http://www-106.ibm.com/developerworks/linux/library/l-fs8/ or some other google hits about tweaking bdflush. -- Jure Pečar From mika.liljeberg at welho.com Mon Sep 13 19:26:53 2004 From: mika.liljeberg at welho.com (Mika Liljeberg) Date: Mon, 13 Sep 2004 22:26:53 +0300 Subject: External journal on flash drive In-Reply-To: <20040913195438.59181d79.pegasus@nerv.eu.org> References: <1094918559.17777.9.camel@hades> <20040912224839.1fc92b0a.pegasus@nerv.eu.org> <1095081989.17777.43.camel@hades> <20040913195438.59181d79.pegasus@nerv.eu.org> Message-ID: <1095103613.17780.59.camel@hades> On Mon, 2004-09-13 at 20:54, Jure Pečar wrote: > On Mon, 13 Sep 2004 16:26:29 +0300 > Mika Liljeberg wrote: > > > Can you be a bit more specific? As far as I can see, none of these seems > > to do exactly what I need (i.e. delay writes from journal to fs). > > Check for example http://www-106.ibm.com/developerworks/linux/library/l-fs8/ > or some other google hits about tweaking bdflush. Thanks, but I meant a bit MORE specific. 
Stopping kupdated or maxing out the kupdate interval doesn't seem like a safe thing to do and I don't see anything else that applies. MikaL From sct at redhat.com Mon Sep 13 22:02:46 2004 From: sct at redhat.com (Stephen C. Tweedie) Date: 13 Sep 2004 23:02:46 +0100 Subject: External journal on flash drive In-Reply-To: <1094918559.17777.9.camel@hades> References: <1094918559.17777.9.camel@hades> Message-ID: <1095112966.2765.61.camel@sisko.scot.redhat.com> Hi, On Sat, 2004-09-11 at 17:02, Mika Liljeberg wrote: > 1) Does the journaling code spread write accesses to the journal device > evenly, as I hope, Pretty much so, yes. Sequential writes are much more efficient than random ones, so the journal IO tries to avoid seeking; a side-effect of that is that the wear should be largely uniform. > or are there blocks that are particularly "hot"? There is one: the journal superblock. It's not updated _hugely_ often, but it is updated whenever we "checkpoint" the journal (ie. when we remove old transactions from the tail end of the journal.) I haven't measured it but I'd expect we're updating that maybe 2 or 3 times more rapidly than other journal blocks. > 2) Currently, the main drive seems to spin up within 60 seconds after a > write access. I would like the checkpointing to occur only when the > journal device is getting full. How can I tune this? That's not related to journal activity --- that's normal writeback. The way the journal works is that it makes sure we update transactions atomically, in the journal, before they are allowed to undergo normal writeback. However, once the transaction _has_ committed, the journal is almost entirely out of the picture. The only interest the journal retains in the updated metadata is that we have to make sure that we don't reuse the journal record for that transaction until all of the metadata has undergone its normal writeback (otherwise we'd risk having no record of it after a crash!) 
Other than that, it's up to the normal VM writeback to write the updated metadata to its home location on disk at that point. Only if the journal wraps and we need to reclaim journal space urgently will we _force_ the write from the journal code. --Stephen From mika.liljeberg at welho.com Mon Sep 13 22:39:55 2004 From: mika.liljeberg at welho.com (Mika Liljeberg) Date: Tue, 14 Sep 2004 01:39:55 +0300 Subject: External journal on flash drive In-Reply-To: <1095112966.2765.61.camel@sisko.scot.redhat.com> References: <1094918559.17777.9.camel@hades> <1095112966.2765.61.camel@sisko.scot.redhat.com> Message-ID: <1095115195.17779.80.camel@hades> On Tue, 2004-09-14 at 01:02, Stephen C. Tweedie wrote: > > or are there blocks that are particularly "hot"? > > There is one: the journal superblock. It's not updated _hugely_ often, > but it is updated whenever we "checkpoint" the journal (ie. when we > remove old transactions from the tail end of the journal.) I haven't > measured it but I'd expect we're updating that maybe 2 or 3 times more > rapidly than other journal blocks. That's less than ideal but not too bad. Presumably a large journal will help here? > > 2) Currently, the main drive seems to spin up within 60 seconds after a > > write access. I would like the checkpointing to occur only when the > > journal device is getting full. How can I tune this? > > That's not related to journal activity --- that's normal writeback. > > The way the journal works is that it makes sure we update transactions > atomically, in the journal, before they are allowed to undergo normal > writeback. Ah, I see. Does this also hold for data_journal mode? > However, once the transaction _has_ committed, the journal is almost > entirely out of the picture. 
> The only interest the journal retains in > the updated metadata is that we have to make sure that we don't reuse > the journal record for that transaction until all of the metadata has > undergone its normal writeback (otherwise we'd risk having no record of > it after a crash!) Other than that, it's up to the normal VM writeback > to write the updated metadata to its home location on disk at that > point. Sounds like my best bet would be to have my filesystems in data_journal mode and to configure the bdflush parameters to delay writeback as much as possible. If I understand correctly, that should allow for maximal spin down times and still maintain full data integrity after a crash. For ext3 filesystems, anyway. :-| Thanks for the explanation, MikaL From ogi at fmi.uni-sofia.bg Tue Sep 14 04:23:15 2004 From: ogi at fmi.uni-sofia.bg (Ognyan Kulev) Date: Tue, 14 Sep 2004 07:23:15 +0300 Subject: External journal on flash drive In-Reply-To: <1095115195.17779.80.camel@hades> References: <1094918559.17777.9.camel@hades> <1095112966.2765.61.camel@sisko.scot.redhat.com> <1095115195.17779.80.camel@hades> Message-ID: <41467233.1030207@fmi.uni-sofia.bg> Mika Liljeberg wrote: > On Tue, 2004-09-14 at 01:02, Stephen C. Tweedie wrote: > >>> or are there blocks that are particularly "hot"? >> >>There is one: the journal superblock. It's not updated _hugely_ often, >>but it is updated whenever we "checkpoint" the journal (ie. when we >>remove old transactions from the tail end of the journal.) I haven't >>measured it but I'd expect we're updating that maybe 2 or 3 times more >>rapidly than other journal blocks. > > That's less than ideal but not too bad. Presumably a large journal will > help here? A larger journal will be even worse. I think it's possible for the superblock to be updated less often: when a checkpoint transaction is finished, we don't move the tail, but remember (in the kernel) where the tail has moved and what the superblock currently records. 
So we have two tails: one in the superblock and one in the kernel. The one in the superblock is moved up to the one in the kernel only when the head of the journal reaches the tail recorded in the superblock. It should be correct: replaying will just replay more transactions. And it will have a performance impact: requesting a journal block will sometimes require the superblock to be updated, thus slowing down the transaction a bit. Regards, ogi From ogi at fmi.uni-sofia.bg Tue Sep 14 04:31:18 2004 From: ogi at fmi.uni-sofia.bg (Ognyan Kulev) Date: Tue, 14 Sep 2004 07:31:18 +0300 Subject: External journal on flash drive In-Reply-To: <41467233.1030207@fmi.uni-sofia.bg> References: <1094918559.17777.9.camel@hades> <1095112966.2765.61.camel@sisko.scot.redhat.com> <1095115195.17779.80.camel@hades> <41467233.1030207@fmi.uni-sofia.bg> Message-ID: <41467416.2070305@fmi.uni-sofia.bg> Ognyan Kulev wrote: > A larger journal will be even worse. Actually, the size of the journal doesn't affect how often its superblock is updated. Regards, ogi From sct at redhat.com Tue Sep 14 10:40:04 2004 From: sct at redhat.com (Stephen C. Tweedie) Date: 14 Sep 2004 11:40:04 +0100 Subject: External journal on flash drive In-Reply-To: <1095115195.17779.80.camel@hades> References: <1094918559.17777.9.camel@hades> <1095112966.2765.61.camel@sisko.scot.redhat.com> <1095115195.17779.80.camel@hades> Message-ID: <1095158404.2006.13.camel@sisko.scot.redhat.com> Hi, On Mon, 2004-09-13 at 23:39, Mika Liljeberg wrote: > > > 2) Currently, the main drive seems to spin up within 60 seconds after a > > > write access. > > That's not related to journal activity --- that's normal writeback. > Ah, I see. Does this also hold for data_journal mode? Yes. You can tune the bdflush parameters to tweak the 60 second interval. > Sounds like my best bet would be to have my filesystems in data_journal > mode I'd rather go with data=ordered; flash isn't the fastest of media, and you have much more journal traffic with data=journal. 
And in data=ordered, if you ... > configure the bdflush parameters to delay writeback as much > as possible ...then you should be able to avoid the spinups. > . If I understand correctly, that should allow for maximal > spin down times and still maintain full data integrity after a crash. > For ext3 filesystems, anyway. :-| data=ordered (the default) already preserves full data integrity after a crash. It's only data=writeback which relaxes the integrity guarantees. Cheers, Stephen From mika.liljeberg at welho.com Tue Sep 14 16:24:29 2004 From: mika.liljeberg at welho.com (Mika Liljeberg) Date: Tue, 14 Sep 2004 19:24:29 +0300 Subject: External journal on flash drive In-Reply-To: <1095158404.2006.13.camel@sisko.scot.redhat.com> References: <1094918559.17777.9.camel@hades> <1095112966.2765.61.camel@sisko.scot.redhat.com> <1095115195.17779.80.camel@hades> <1095158404.2006.13.camel@sisko.scot.redhat.com> Message-ID: <1095179069.17781.123.camel@hades> On Tue, 2004-09-14 at 13:40, Stephen C. Tweedie wrote: > I'd rather go with data=ordered; flash isn't the fastest of media, and > you have much more journal traffic with data=journal. Well, I plan to have several flash drives in a RAID0 configuration and use as large a journal as I can get away with. :) > > . If I understand correctly, that should allow for maximal > > spin down times and still maintain full data integrity after a crash. > > For ext3 filesystems, anyway. :-| > > data=ordered (the default) already preserves full data integrity after a > crash. It's only data=writeback which relaxes the integrity guarantees. However, as I'm shooting for spin down times of hours rather than minutes I would risk losing several hours of work and recent email if the machine crashed. I guess I'll just give it a go and see how it works. Thanks a lot for the advice! 
MikaL From vijayan at cs.wisc.edu Thu Sep 16 21:34:11 2004 From: vijayan at cs.wisc.edu (Vijayan Prabhakaran) Date: Thu, 16 Sep 2004 16:34:11 -0500 (CDT) Subject: kupdate daemon Message-ID: Hi, I'm using linux 2.4.25 and I'm trying to change the wakeup interval of kupdate daemon. Is there a way to change that ? I once used 'update' tool for that. But that is not there in 2.4.25. Has it been removed ? I appreciate any help regarding this. thanks, Vijayan From akpm at osdl.org Thu Sep 16 21:50:59 2004 From: akpm at osdl.org (Andrew Morton) Date: Thu, 16 Sep 2004 14:50:59 -0700 Subject: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal In-Reply-To: References: Message-ID: <20040916145059.44a7e800.akpm@osdl.org> Seiji Kihara wrote: > > We found that fsync and fdatasync syscalls sometimes don't sync > data in an ext3 file system under the following conditions. > > 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2). > 2. Ext3's journalling mode is "data=journal". > 3. Create a file (whose size is 1Mbytes) and execute umount/mount. > 4. lseek to a random position within the file, write 8192 bytes > data, and fsync or fdatasync. > > We presume the data was not written to the corresponding disk > before returning from fsync or fdatasync syscall on the evidence > as follows: > > 1. The response time of fsync() and fdatasync() was extremely > short. > > We use the "diskio" tool, which is downloadable from OSDL page > (http://developer.osdl.jp/projects/doubt/). The program showed > that the response time was under 10 microseconds. This time > cannot be achieved with data transfer on IDE and PCI bus! > > 2. The IDE writing routine ide_start_dma() was not called under > DMA enabled. > > We inserted the print messages in the sys_write(), sys_fsync() > and ide_start_dma() by the attached patch. Sometimes the > "ide_start_dma: ..." message was not shown between "write: in > ..." and "fsync: out ...". 
> > The problem has occurred since 2.6.5-bk1, which includes the patch > "[PATCH] ext3 fsync() and fdatasync() speedup". We found that the > problem was solved by deleting the part of the patch which > modifies ext3_sync_file(). Maybe, i_state is not correctly set to > I_DIRTY when the related page cache is dirty (is it true?) I forgot about this one. > Attached file is a tarball (tar + bzip2) which contains the following > files. The patches are for 2.6.8.1-kernel (applicable to > 2.6.9-rc2), and the results were also produced with > 2.6.8.1-kernel. We really don't need a 100k tarball to communicate a three-line patch :( Yes, the I_DIRTY test is bogus because data pages are not marked dirty at write() time when the filesystem is mounted in data=journal mode. However your patch will disable the above optimisation for data=writeback and data=ordered modes as well. I don't think that's necessary? How about this? --- 25/fs/ext3/fsync.c~ext3-journal-data-fsync-fix Thu Sep 16 14:47:21 2004 +++ 25-akpm/fs/ext3/fsync.c Thu Sep 16 14:47:33 2004 @@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s J_ASSERT(ext3_journal_current_handle() == 0); - smp_mb(); /* prepare for lockless i_state read */ - if (!(inode->i_state & I_DIRTY)) - goto out; - /* * data=writeback: * The caller's filemap_fdatawrite()/wait will sync the data. @@ -76,6 +72,10 @@ int ext3_sync_file(struct file * file, s goto out; } + smp_mb(); /* prepare for lockless i_state read */ + if (!(inode->i_state & I_DIRTY)) + goto out; + /* * The VFS has written the file data. If the inode is unaltered * then we need not start a commit. _ From evilninja at gmx.net Thu Sep 16 22:58:34 2004 From: evilninja at gmx.net (evilninja) Date: Fri, 17 Sep 2004 00:58:34 +0200 Subject: kupdate daemon In-Reply-To: References: Message-ID: <414A1A9A.9020301@gmx.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Vijayan Prabhakaran wrote: > Hi, > > I'm using linux 2.4.25 and I'm trying to change the wakeup interval of > kupdate daemon. 
> Is there a way to change that ? i don't recall whether this is a sysctl or bootparam tunable, perhaps you can only alter the source. why do you want to do this at all? > I once used 'update' tool for that. But that is not there in 2.4.25. > Has it been removed ? > from the "update" package of debian/unstable: Description: daemon to periodically flush filesystem buffers. The update daemon flushes the filesystem buffers at regular intervals. This version does not spawn a bdflush daemon, as this is now done by the kernel's kupdate thread. . This package is not needed with Linux 2.2.8 and above. If you do not plan to run a 2.0.x series kernel on this system, you can safely remove this package. update may still be useful in sync mode (as opposed to flush mode) on more recent kernels for the extra paranoid. Christian. - -- BOFH excuse #351: PEBKAC (Problem Exists Between Keyboard And Chair) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBShqaC/PVm5+NVoYRAsSjAJ92n7NvmY9/UytGnFZsLi24v0NqmACg5P2K VGPUVWy/68paoWAdhQW2/kg= =ta5U -----END PGP SIGNATURE----- From kihara.seiji at lab.ntt.co.jp Fri Sep 17 01:27:25 2004 From: kihara.seiji at lab.ntt.co.jp (Seiji Kihara) Date: Fri, 17 Sep 2004 10:27:25 +0900 Subject: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal In-Reply-To: <20040916145059.44a7e800.akpm@osdl.org> References: <20040916145059.44a7e800.akpm@osdl.org> Message-ID: Hi, Thank you for your reply, Mr. Morton. I apologize to everyone that the mail I sent last night was not delivered to the lists because of its size (maybe). At Thu, 16 Sep 2004 14:50:59 -0700, Andrew Morton wrote: (snip) > Yes, the I_DIRTY test is bogus because data pages are not marked dirty at > write() time when the filesystem is mounted in data=journal mode. > > However your patch will disable the above optimisation for data=writeback > and data=ordered modes as well. I don't think that's necessary? 
> > How about this? I have not yet understood what the real problem is, but I agree that your patch is better than mine, and confirmed that no problem occurred during the test with the "diskio" tool and the 2.6.8.1 kernel with your patch. Thank you again. Seiji -- Seiji Kihara Open Source Software Computing Project, NTT Cyber Space Laboratories, Yokosuka, JAPAN From cchan at outblaze.com Fri Sep 17 08:51:50 2004 From: cchan at outblaze.com (Christopher Chan) Date: Fri, 17 Sep 2004 16:51:50 +0800 Subject: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal In-Reply-To: <20040916145059.44a7e800.akpm@osdl.org> References: <20040916145059.44a7e800.akpm@osdl.org> Message-ID: <414AA5A6.3020907@outblaze.com> Andrew Morton wrote: > Seiji Kihara wrote: > >>We found that fsync and fdatasync syscalls sometimes don't sync >>data in an ext3 file system under the following conditions. >> >>1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2). >>2. Ext3's journalling mode is "data=journal". >> >>The problem has occurred since 2.6.5-bk1, which includes the patch >>"[PATCH] ext3 fsync() and fdatasync() speedup". We found that the >>problem was solved by deleting the part of the patch which >>modifies ext3_sync_file(). Maybe, i_state is not correctly set to >>I_DIRTY when the related page cache is dirty (is it true?) > I have a few qmail (about the heaviest fsync-using mta software around) boxes that have their queues on ext3. On a 2.6.7 kernel, these guys are guaranteed to crash within hours if I use data=journal for the fs on which the qmail queues are. I say this because I ran two of them with data=journal mode and they crashed once or more a day. Another one which stayed with ordered had no problems during the same period. Going back to ordered meant that they ran stable for days (weeks now). 
The only thing I could get from the logs is: --------------------------- Aug 17 05:58:22 mta1-7 kernel: Assertion failure in __journal_drop_transaction() at fs/jbd/checkpoint.c:613: "transaction->t_forget == NULL" Aug 17 05:58:22 mta1-7 kernel: ------------[ cut here ]------------ Aug 17 05:58:22 mta1-7 kernel: kernel BUG at fs/jbd/checkpoint.c:613! Aug 17 05:58:22 mta1-7 kernel: invalid operand: 0000 [#1] Aug 17 05:58:22 mta1-7 kernel: SMP Aug 17 05:58:22 mta1-7 kernel: Modules linked in: nfs lockd sunrpc e1000 e100 mii usbcore Aug 17 05:58:22 mta1-7 kernel: CPU: 0 Aug 17 05:58:22 mta1-7 kernel: EIP: 0060:[] Not tainted Aug 17 05:58:22 mta1-7 kernel: EFLAGS: 00010202 (2.6.7) From kihara.seiji at lab.ntt.co.jp Thu Sep 16 12:56:20 2004 From: kihara.seiji at lab.ntt.co.jp (Seiji Kihara) Date: Thu, 16 Sep 2004 21:56:20 +0900 Subject: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal Message-ID: Hello, We found that fsync and fdatasync syscalls sometimes don't sync data in an ext3 file system under the following conditions. 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2). 2. Ext3's journalling mode is "data=journal". 3. Create a file (whose size is 1Mbytes) and execute umount/mount. 4. lseek to a random position within the file, write 8192 bytes data, and fsync or fdatasync. We presume the data was not written to the corresponding disk before returning from fsync or fdatasync syscall on the evidence as follows: 1. The response time of fsync() and fdatasync() was extremely short. We use the "diskio" tool, which is downloadable from OSDL page (http://developer.osdl.jp/projects/doubt/). The program showed that the response time was under 10 microseconds. This time cannot be achieved with data transfer on IDE and PCI bus! 2. The IDE writing routine ide_start_dma() was not called under DMA enabled. We inserted the print messages in the sys_write(), sys_fsync() and ide_start_dma() by the attached patch. Sometimes the "ide_start_dma: ..." 
message was not shown between "write: in ..." and "fsync: out ...". The problem has occurred since 2.6.5-bk1, which includes the patch "[PATCH] ext3 fsync() and fdatasync() speedup". We found that the problem was solved by deleting the part of the patch which modifies ext3_sync_file(). Maybe, i_state is not correctly set to I_DIRTY when the related page cache is dirty (is it true?) The attached file is a tarball (tar + bzip2) which contains the following files. The patches are for the 2.6.8.1 kernel (applicable to 2.6.9-rc2), and the results were also produced with the 2.6.8.1 kernel. - fsync.c.patch: patch for solving the problem - kernel.printk.patch: patch for showing the problem by printks - kernel.printk.log: printks from the kernel with the printk patch - results/*: output from "diskio" program (result for fsync, and O_SYNC for comparison) - kernel-2.6.8.1.config: .config file which we used to build the kernel We used the following hardware to gather the data: CPU: Intel Pentium 4 3GHz Hyper Threading, 2nd cache 512KB RAM: 1GBytes IDE controller: on-board ICH5 (ATA100) HDD: E-IDE (7200rpm, 8MB cache, ATA133) x 2 (one for system and the other for writing test data) (The problem was also reproduced with a SCSI system) Regards, Seiji -- Seiji Kihara Open Source Software Computing Project, NTT Cyber Space Laboratories, Yokosuka, JAPAN -------------- next part -------------- A non-text attachment was scrubbed... Name: patches+results.tar.bz2 Type: application/octet-stream Size: 80725 bytes Desc: not available URL: From okuyamak at dd.iij4u.or.jp Sat Sep 18 20:47:41 2004 From: okuyamak at dd.iij4u.or.jp (Kenichi Okuyama) Date: Sun, 19 Sep 2004 05:47:41 +0900 (JST) Subject: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal In-Reply-To: <20040916145059.44a7e800.akpm@osdl.org> References: <20040916145059.44a7e800.akpm@osdl.org> Message-ID: <20040919.054741.01370775.okuyamak@dd.iij4u.or.jp> Dear Mr. 
Morton, Seiji, and all, >>>>> "AM" == Andrew Morton writes: AM> Yes, the I_DIRTY test is bogus because data pages are not marked dirty at AM> write() time when the filesystem is mounted in data=journal mode. AM> However your patch will disable the above optimisation for data=writeback AM> and data=ordered modes as well. I don't think that's necessary? I don't think Mr. Morton's code has any advantages over Seiji's patch. Please look at the lines below. The lines starting with AM> + are the point where Mr. Morton added code (the point where code was removed is a bit above, and not in these lines). 74 if (ext3_should_journal_data(inode)) { 75 ret = ext3_force_commit(inode->i_sb); 76 goto out; 77 } AM> + smp_mb(); /* prepare for lockless i_state read */ AM> + if (!(inode->i_state & I_DIRTY)) AM> + goto out; AM> + 78 79 /* 80 * The VFS has written the file data. If the inode is unaltered 81 * then we need not start a commit. 82 */ 83 if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { 84 struct writeback_control wbc = { 85 .sync_mode = WB_SYNC_ALL, 86 .nr_to_write = 0, /* sys_fsync did this */ 87 }; 88 ret = sync_inode(inode, &wbc); 89 } 90 out: 91 return ret; Now. Please note that #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) is the definition of the macro 'I_DIRTY'. As a result, Mr. Morton's patch is saying that: if (!(inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))) goto out; if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, .nr_to_write = 0, /* sys_fsync did this */ }; ret = sync_inode(inode, &wbc); } out: But this is equivalent to the following code (think carefully :-) if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, .nr_to_write = 0, /* sys_fsync did this */ }; ret = sync_inode(inode, &wbc); } out: which turns out to be what Seiji's patch was. Hence, Mr. Morton's patch has no OPTIMIZATION over Seiji's code. 
(If gcc is smart enough, Mr. Morton's code should have no effect on the binary. If not, it's overhead.) My worry is as follows. Basically, Seiji's patch is better. But in that case, the smp_mb() call right before accessing inode->i_state will disappear. Is this safe..... I am not sure, because even without Seiji's patch, the code at line 83 already existed. And it was working... wasn't it? If smp_mb() was simply not necessary, Seiji's patch will do everything. If smp_mb() was necessary, we were lacking one right before line 83. best regards, ---- Kenichi Okuyama From jc at info-systems.de Wed Sep 22 15:13:49 2004 From: jc at info-systems.de (Jakob Curdes) Date: Wed, 22 Sep 2004 17:13:49 +0200 Subject: status of dir_index in 2.4 kernels ? Message-ID: <415196AD.5060805@info-systems.de> Hi, I stumbled on the dir_index option of the recent ext2 implementation which would be interesting for our mailservers (running dovecot with maildir) and found out that the newer ext2progs are able to cope with it; the 2.6 kernels do also but for the 2.4 kernels I still require a patch. Is that correct? Which patch would be the correct one to apply to the current 2.4.27 kernel ? Can anybody comment on the stability of dir_index-enabled filesystems ? Thank you for hints, Jakob Curdes From adam.cassar at netregistry.com.au Thu Sep 23 00:45:09 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 23 Sep 2004 10:45:09 +1000 Subject: status of dir_index in 2.4 kernels ? In-Reply-To: <415196AD.5060805@info-systems.de> References: <415196AD.5060805@info-systems.de> Message-ID: <1095900309.6157.8.camel@akira2.nro.au.com> We have a similar set up but the maildirs are exported via NFS. Performance was actually worse. There is a userspace preload lib that supposedly fixes the problem but by that time I didn't bother continuing. Search the archives for the thread. 
On Thu, 2004-09-23 at 01:13, Jakob Curdes wrote: > Hi, > > I stumbled on the dir_index option of recent ext2 implementation which > would be interesting for our mailservers (running dovecot with maildir) > and found out that the newer ext2progs are able to cope with it; the 2.6 > kernels do also but for the 2.4 kernels I still require a patch. Is that > correct? Which patch would be the correct one to apply to the current > 2.4.27 kernel ? Can anybody comment on the stability of > dir_index-enabled filesystems ? > > Thank you for hints, > > Jakob Curdes > > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From sneakums at zork.net Thu Sep 23 12:53:36 2004 From: sneakums at zork.net (Sean Neakums) Date: Thu, 23 Sep 2004 13:53:36 +0100 Subject: status of dir_index in 2.4 kernels ? In-Reply-To: <1095900309.6157.8.camel@akira2.nro.au.com> (Adam Cassar's message of "Thu, 23 Sep 2004 10:45:09 +1000") References: <415196AD.5060805@info-systems.de> <1095900309.6157.8.camel@akira2.nro.au.com> Message-ID: <6ullf116zz.fsf@zork.zork.net> Adam Cassar writes: > On Thu, 2004-09-23 at 01:13, Jakob Curdes wrote: >> Hi, >> >> I stumbled on the dir_index option of recent ext2 implementation which >> would be interesting for our mailservers (running dovecot with maildir) >> and found out that the newer ext2progs are able to cope with it; the 2.6 >> kernels do also but for the 2.4 kernels I still require a patch. Is that >> correct? Which patch would be the correct one to apply to the current >> 2.4.27 kernel ? 
>> Can anybody comment on the stability of
>> dir_index-enabled filesystems ?
>
> We have a similar set up but the maildirs are exported via NFS.
>
> Performance was actually worse. There is a userspace preload lib that
> supposedly fixes the problem but by that time I didn't bother
> continuing.
>
> Search the archives for the thread.

This is probably the one.

https://listman.redhat.com/archives/ext3-users/2003-December/msg00040.html

From tytso at mit.edu Fri Sep 24 00:27:00 2004 From: tytso at mit.edu (Theodore Ts'o) Date: Thu, 23 Sep 2004 20:27:00 -0400 Subject: status of dir_index in 2.4 kernels ? In-Reply-To: <1095900309.6157.8.camel@akira2.nro.au.com> References: <415196AD.5060805@info-systems.de> <1095900309.6157.8.camel@akira2.nro.au.com> Message-ID: <20040924002700.GB3300@thunk.org>

On Thu, Sep 23, 2004 at 10:45:09AM +1000, Adam Cassar wrote:
> We have a similar set up but the maildirs are exported via NFS.
>
> Performance was actually worse. There is a userspace preload lib that
> supposedly fixes the problem but by that time I didn't bother
> continuing.

Here's the userspace preload library. It would also be possible to
patch the application to qsort the returned entries from readdir
before stat'ing them.

	- Ted

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -shared spd_readdir.c -ldl
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 *
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <dlfcn.h>

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;

	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	if (i_a < i_b)
		return -1;
	return (i_a > i_b);
}

DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct dir_s));
	if (!dirstruct) {
		(*real_closedir)(dir);
		errno = ENOMEM;
		return NULL;
	}
	dirstruct->num = 0;
	dirstruct->max = 0;
	dirstruct->dp = 0;
	dirstruct->pos = 0;
	dirstruct->dir = 0;
	dirstruct->fd = -1;

	if (max_dirsize && (stat(name, &st) == 0) &&
	    (st.st_size > max_dirsize)) {
		DEBUG_DIR(printf("Directory size %ld, using direct readdir\n",
				 st.st_size));
		dirstruct->dir = dir;
		return (DIR *) dirstruct;
	}

	while ((d = (*real_readdir64)(dir)) != NULL) {
		if (dirstruct->num >= dirstruct->max) {
			dirstruct->max += ALLOC_STEPSIZE;
			DEBUG_DIR(printf("Reallocating to size %d\n",
					 dirstruct->max));
			dnew = realloc(dirstruct->dp,
				       dirstruct->max * sizeof(struct dirent_s));
			if (!dnew)
				goto nomem;
			dirstruct->dp = dnew;
		}
		ds = &dirstruct->dp[dirstruct->num++];
		ds->d_ino = d->d_ino;
		ds->d_off = d->d_off;
		ds->d_reclen = d->d_reclen;
		ds->d_type = d->d_type;
		if ((ds->d_name = malloc(strlen(d->d_name)+1)) == NULL) {
			dirstruct->num--;
			goto nomem;
		}
		strcpy(ds->d_name, d->d_name);
		DEBUG_DIR(printf("readdir: %lu %s\n",
				 (unsigned long) d->d_ino, d->d_name));
	}
	dirstruct->fd = dup((*real_dirfd)(dir));
	(*real_closedir)(dir);
	qsort(dirstruct->dp, dirstruct->num, sizeof(struct dirent_s), ino_cmp);
	return ((DIR *) dirstruct);
nomem:
	DEBUG_DIR(printf("No memory, backing off to direct readdir\n"));
	free_cached_dir(dirstruct);
	dirstruct->dir = dir;
	return ((DIR *) dirstruct);
}

int closedir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	DEBUG_DIR(printf("Closedir (%d open)\n", --num_open));
	if (dirstruct->dir)
		(*real_closedir)(dirstruct->dir);

	if (dirstruct->fd >= 0)
		close(dirstruct->fd);
	free_cached_dir(dirstruct);
	free(dirstruct);
	return 0;
}

struct dirent *readdir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->dir)
		return (*real_readdir)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir.d_ino = ds->d_ino;
	dirstruct->ret_dir.d_off = ds->d_off;
	dirstruct->ret_dir.d_reclen = ds->d_reclen;
	dirstruct->ret_dir.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir.d_name));

	return (&dirstruct->ret_dir);
}

struct dirent64 *readdir64(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->dir)
		return (*real_readdir64)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir64.d_ino = ds->d_ino;
	dirstruct->ret_dir64.d_off = ds->d_off;
	dirstruct->ret_dir64.d_reclen = ds->d_reclen;
	dirstruct->ret_dir64.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir64.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir64.d_name));

	return (&dirstruct->ret_dir64);
}

off_t telldir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_telldir)(dirstruct->dir);

	return ((off_t) dirstruct->pos);
}

void seekdir(DIR *dir, off_t offset)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir) {
		(*real_seekdir)(dirstruct->dir, offset);
		return;
	}
	dirstruct->pos = offset;
}

int dirfd(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_dirfd)(dirstruct->dir);

	return (dirstruct->fd);
}

From jc at info-systems.de Fri Sep 24 07:13:34 2004 From: jc at info-systems.de (Jakob Curdes) Date: Fri, 24 Sep 2004 09:13:34 +0200 Subject: status of dir_index in 2.4 kernels ?
In-Reply-To: <20040924002700.GB3300@thunk.org> References: <415196AD.5060805@info-systems.de> <1095900309.6157.8.camel@akira2.nro.au.com> <20040924002700.GB3300@thunk.org> Message-ID: <4153C91E.5070605@info-systems.de>

Ok, I'll try to summarize what I gathered from this and the earlier thread to help other "htree newbies" (please correct my statements if necessary):

- To use htree-enabled ext filesystems in 2.4 kernels, you need

a) an htree-enabled kernel; a patch against 2.4.21 can be found at

http://thunk.org/tytso/linux/extfs-2.4-update/

Curiously enough, I could not apply this patch against a clean 2.4.27 tree - is there anything special to be aware of ??

b) the current e2fsprogs, which you can get from

http://e2fsprogs.sourceforge.net/

c) you have to enable dir_index on the filesystem with

# umount /dev/xyz
# tune2fs -O dir_index /dev/xyz
# e2fsck -fD /dev/xyz
# mount /dev/xyz

d) The performance of the htree-indexed filesystem depends on the usage by the userspace programs; if they open all files in a directory after gaining directory information with readdir(), the performance is worse than with a vanilla ext3 fs, at least if we have many files in that directory [as is the case with maildir structures]. This can be cured by an additional userspace library, which can be found in the message

https://www.redhat.com/archives/ext3-users/2004-September/msg00025.html

e) One question remains open: Is the htree feature in its current state considered stable enough to be used in production systems? I read some reports on filesystem corruption, but most of these applied to older versions of htree. Are there differences between the 2.6 implementation and the 2.4 backport?

Thank you for comments,

Jakob Curdes
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From adam.cassar at netregistry.com.au Sat Sep 25 00:18:09 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Fri, 24 Sep 2004 17:18:09 -0700 Subject: status of dir_index in 2.4 kernels ? In-Reply-To: <4153C91E.5070605@info-systems.de> References: <415196AD.5060805@info-systems.de> <1095900309.6157.8.camel@akira2.nro.au.com> <20040924002700.GB3300@thunk.org> <4153C91E.5070605@info-systems.de> Message-ID: <4154B941.6020700@netregistry.com.au> I have had some issues with htree under high load but I believe that this is fixed in newer 2.6 kernels. Jakob Curdes wrote: > Ok, I try to summarize what I gathered from this and the earlier > thread to help other "htree newbies" > (Please correct my statements if necessary): > > - To use htree enabled ext - Filesystems in 2.4 kernels, you need > a) a htree enabled kernel; patch against 2.4.21 can be found at > > http://thunk.org/tytso/linux/extfs-2.4-update/ > > Curiously enough, I colud not apply this patch against a clean 2.4.27 > tree - is there anything special to be aware of ?? > > b) the current e2fsprogs which you can get from > > http://e2fsprogs.sourceforge.net/ > > c) you have to enable dir_index on the filesystem with > > <> # umount /dev/xyz > # tune2fs -O dir_index /dev/xyz > # e2fsck -fD /dev/xyz > # mount /dev/xyz > > d) The performance of the htree indexed filesystem depends on the > usage by the userspace programs; if they open all files in a directory > after gaining directory information with readdir() the performance is > worse than with a vanilla ext3 fs, at least if we have many files in > that directory [as it is the case with maildir structures]. This can > be cured by an additional userspace library which can be found in the > message > > https://www.redhat.com/archives/ext3-users/2004-September/msg00025.html > > e) One question remains open : Is the htree feature in its current > state considered stable enough to be used in production systems ? 
> I read some reports on filesystem corruption, but most of these > applied to older versions of htree. Are there differences between the > 2.6 implementation and the 2.4 backport ? > > > Thank you for comments, > > Jakob Curdes > > > > >------------------------------------------------------------------------ > >_______________________________________________ >Ext3-users mailing list >Ext3-users at redhat.com >https://www.redhat.com/mailman/listinfo/ext3-users

From my_qa2004 at yahoo.com Fri Sep 24 14:01:53 2004 From: my_qa2004 at yahoo.com (Ash) Date: Fri, 24 Sep 2004 07:01:53 -0700 (PDT) Subject: Corrupted journal Message-ID: <20040924140153.97511.qmail@web53209.mail.yahoo.com>

Hi,

I was running a few tests on an ext3 filesystem with an external journal, basically trying to check recovery in crash scenarios. I started with simple scripts doing some filesystem operations on the ext3 partition and crashed the system with a direct poweroff. On reboot, I also corrupted the journal device by "dd"ing it out with blocks of zeroes.

Now, when I try to mount the filesystem I get "mount: wrong fs type, bad option, bad superblock on ..." which, I guess, is expected. After this I tried to run e2fsck (specifying the journal device with the -j option) and this gives me "External journal has bad superblock", which is also understandable since the journal device was also corrupted.

Next, I tried to clear the journal parameter in the hope of being able to run fsck after that is done. But "tune2fs -O ^has_journal" tells me "The needs_recovery flag is set. Please run e2fsck before clearing the has_journal flag." And of course, I can't reset the "needs_recovery" flag either.

So now e2fsck is telling me that the journal is corrupted, and if I try to clear the journal parameters, tune2fs tells me to run e2fsck first. I seem to be stuck in this loop. What can I do to recover my filesystem, clear the existing journal information and attach a new journal to the filesystem ?
Any help/pointers will be appreciated.

Thanks,
Ash

__________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com

From tytso at mit.edu Fri Sep 24 19:59:05 2004 From: tytso at mit.edu (Theodore Ts'o) Date: Fri, 24 Sep 2004 15:59:05 -0400 Subject: status of dir_index in 2.4 kernels ? In-Reply-To: <4153C91E.5070605@info-systems.de> References: <415196AD.5060805@info-systems.de> <1095900309.6157.8.camel@akira2.nro.au.com> <20040924002700.GB3300@thunk.org> <4153C91E.5070605@info-systems.de> Message-ID: <20040924195905.GC20320@thunk.org>

On Fri, Sep 24, 2004 at 09:13:34AM +0200, Jakob Curdes wrote:
> d) The performance of the htree indexed filesystem depends on the usage
> by the userspace programs; if they open all files in a directory after
> gaining directory information with readdir() the performance is worse
> than with a vanilla ext3 fs, at least if we have many files in that
> directory [as it is the case with maildir structures].

Correct. The performance of the htree indexed filesystem can be worse than that of a vanilla ext3 filesystem if the application opens all of the files in readdir() order. This is because readdir() has to return the directory entries in hash sort order. In contrast, in a vanilla ext3 filesystem, normally directory entries are added in the order that they were created, and inodes are created in sequential order. So on a normal ext3 filesystem w/o htree, opening the files in readdir order is roughly equivalent to reading them in inode number sort order, which is a big win since it avoids the disk seeking all over the place. This difference can be diminished if the directory has a lot of file creates and deletes, such that over time, readdir() order != inode number sort order. This is particularly true in maildir directories, if mail messages are deleted, refiled, etc.
So if the directory is badly out of order, the spd_readdir.so preload library can make a big difference to performance in this scenario as well. Why can't we do this spd_readdir trick in the kernel? Because directories can be very large, and we don't want to be allocating this much memory in the kernel.

	- Ted

From andy13 at gmx.net Sun Sep 26 21:37:58 2004 From: andy13 at gmx.net (andy13 at gmx.net) Date: Sun, 26 Sep 2004 23:37:58 +0200 (MEST) Subject: low level search for deleted data Message-ID: <11643.1096234678@www16.gmx.net>

Hi everyone,

I lost my complete home directory and am facing the problem of retrieving some of the deleted data. I have searched the web for this matter, but the only information I found is that it's not possible for a program to do this, and that I have to puzzle the files together by scanning the disk (or a disk image) with tools like sleuthkit (www.sleuthkit.org) or lde (lde.sourceforge.net). That's ok, since the only files I'd like to recover are text files (C and Java source code). But even though I read the ext2-undelete-minihowto (which doesn't apply to ext3, I know), I honestly don't know how to start. The partition is 11GB and the data could be anywhere. Can anyone please describe a sensible approach to this task? I'm really helpless :(

Thanks in advance

Andreas
-- GMX ProMail mit bestem Virenschutz http://www.gmx.net/de/go/mail +++ Empfehlung der Redaktion +++ Internet Professionell 10/04 +++

From a.gietl at e-admin.de Sun Sep 26 22:50:17 2004 From: a.gietl at e-admin.de (Andreas Gietl) Date: Mon, 27 Sep 2004 00:50:17 +0200 Subject: low level search for deleted data In-Reply-To: <11643.1096234678@www16.gmx.net> References: <11643.1096234678@www16.gmx.net> Message-ID: <200409270050.18040.a.gietl@e-admin.de>

On Sunday 26 September 2004 23:37, andy13 at gmx.net wrote:

I use debugfs for that purpose.

> Hi everyone, > > I lost my complete home directory and am facing the problem of retrieving > some of the deleted data.
> I have search the web for this matter, but the only information I found is, > that it's not possible for a program to do this and that I have to puzzle > the files together by scanning the disk (or disk image) with tools like > sleuthkit (www.sleuthkit.org) or lde (lde.sourceforge.net). That's ok, > since the only files I like to recover are text files (c and java > sourcecode). But even though I read the ext2-undelete-minihowto (which > doesn't apply to ext3, I know) I honestly don't know how to start. The > partition is 11GB and the data could be anywhere. Can anyone please > describe a sensible approach to this task. I'm really helpless :( > > Thanks in advance > > Andreas

-- e-admin internet gmbh Andreas Gietl tel +49 941 3810884 Ludwig-Thoma-Strasse 35 93051 Regensburg mobil +49 171 6070008 PGP/GPG-Key unter http://www.e-admin.de/gpg.html

From andy13 at gmx.net Mon Sep 27 14:56:57 2004 From: andy13 at gmx.net (Andreas Burtzlaff) Date: Mon, 27 Sep 2004 16:56:57 +0200 (MEST) Subject: low level search for deleted data References: <200409270050.18040.a.gietl@e-admin.de> Message-ID: <21656.1096297017@www58.gmx.net>

Hi,

> On Sunday 26 September 2004 23:37, andy13 at gmx.net wrote:
>
> i use debugfs for that purpose

It doesn't work the "easy" way. lsdel doesn't find any deleted inodes. I think this was already covered in another thread.

> > > I lost my complete home directory and am facing the problem of > retrieving > > some of the deleted data.

Well, I've played around a bit with autopsy/sleuthkit and a 10 MB test image. It isn't hard to recover text files, but I only tried to copy a file to the file system and delete it afterwards. I expect fragmentation to make matters worse. Can anyone point me to a simple overview of things like blocks, groups, inodes, etc. and fragmentation? What happens if I append some text to an already existing text file? It's really hard to find information on that on the net.

Thanks in advance

Andreas

P.S.
If I succeed I promise I'll write a mini howto about it

-- +++ GMX DSL Premiumtarife 3 Monate gratis* + WLAN-Router 0,- EUR* +++ Clevere DSL-Nutzer wechseln jetzt zu GMX: http://www.gmx.net/de/go/dsl

From tytso at mit.edu Mon Sep 27 16:08:15 2004 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 27 Sep 2004 12:08:15 -0400 Subject: low level search for deleted data In-Reply-To: <11643.1096234678@www16.gmx.net> References: <11643.1096234678@www16.gmx.net> Message-ID: <20040927160815.GD15589@thunk.org>

On Sun, Sep 26, 2004 at 11:37:58PM +0200, andy13 at gmx.net wrote:
> Hi everyone,
>
> I lost my complete home directory and am facing the problem of retrieving
> some of the deleted data.
> I have search the web for this matter, but the only information I found is,
> that it's not possible for a program to do this and that I have to puzzle
> the files together by scanning the disk (or disk image) with tools like
> sleuthkit (www.sleuthkit.org) or lde (lde.sourceforge.net). That's ok, since
> the only files I like to recover are text files (c and java sourcecode). But
> even though I read the ext2-undelete-minihowto (which doesn't apply to ext3,
> I know) I honestly don't know how to start. The partition is 11GB and the
> data could be anywhere. Can anyone please describe a sensible approach to
> this task. I'm really helpless :(

If you have some specific text that you know was in a file that you're trying to recover, the following command may be of use:

	grep -ab <regexp> /dev/hda1 | awk -F: '{printf("%d\n", ($1 + 4095) / 4096)}'

Replace <regexp> with the regular expression or string that you are trying to find, and if the filesystem is using a 1k blocksize, replace 4095/4096 with 1023/1024. This will give you block numbers which you can then feed into lde.

Good luck!!

	- Ted

From sct at redhat.com Thu Sep 30 19:44:38 2004 From: sct at redhat.com (Stephen C.
Tweedie) Date: 30 Sep 2004 20:44:38 +0100 Subject: Corrupted journal In-Reply-To: <20040924140153.97511.qmail@web53209.mail.yahoo.com> References: <20040924140153.97511.qmail@web53209.mail.yahoo.com> Message-ID: <1096573477.1977.425.camel@sisko.scot.redhat.com>

Hi,

On Fri, 2004-09-24 at 15:01, Ash wrote:
> Next, I tried to clear the journal parameter in hope
> of being able to run fsck after that is done.
> But "tune2fs -O ^has_journal" tells me
> "The needs_recovery flag is set. Please run e2fsck
> before clearing the has_journal flag."

From "man tune2fs":

       -f     Force the tune2fs operation to complete even in the face of
              errors. This option is useful when removing the has_journal
              filesystem feature from a filesystem which has an external
              journal (or is corrupted such that it appears to have an
              external journal), but that external journal is not
              available.

Does this help you to get further?

Cheers,
Stephen