From simon.guilhot at gmail.com  Sun Dec  2 14:59:37 2007
From: simon.guilhot at gmail.com (Simon Guilhot)
Date: Sun, 2 Dec 2007 15:59:37 +0100
Subject: Meta-data in Ext3
Message-ID: 

Hi everyone,

I have a student project that is quite interesting and quite hard, and the information I have found on the internet about ext3 isn't relevant. That's why I need your help. My aim is to add some metadata to my files (like Public, Private, Draft ... the list must be extensible) and to be able to display it with the command-line tools (ls, rm, ...). There are lots of ways to do that. The most appropriate, for me, is to implement it directly in the inode of ext3 (is it easier in ext2?).

Concretely I see it like this (I'm probably wrong): the standard inode contains information like creation/modification dates, permissions, number of links ... and I wanted to add a string field (or some bytes, it's the same) where I could put my metadata. Of course the system won't be bootable (and won't be stable). Here is a representation:

class|host|device|start_time
ils|shirley||1151770485
st_ino|st_alloc|st_uid|st_gid|st_mtime|st_atime|st_ctime|st_mode|st_nlink|st_size|st_block0|st_block1
1|a|0|0|1151770448|1151770448|1151770448|0|0|0|0|0|MY_FIELD
2|a|0|0|1151770448|1151770448|1151770448|40755|3|1024|201|0|MY_FIELD
3|a|0|0|0|0|0|0|0|0|0|0|MY_FIELD

Of course, I'm maybe dreaming and it's probably very hard, but it's interesting to ask someone more experienced. Forgive me for my English.

Guilhot Simon

From ross at biostat.ucsf.edu  Tue Dec  4 18:52:32 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Tue, 04 Dec 2007 10:52:32 -0800
Subject: Ancient Very slow directory traversal thread
In-Reply-To: <47545B04.7050905@tigershaunt.com>
References: <47545B04.7050905@tigershaunt.com>
Message-ID: <1196794352.8953.113.camel@corn.betterworld.us>

On Mon, 2007-12-03 at 14:37 -0500, Rashkae wrote:
> I just came across your message to a mailing list here:
[message concerned it taking hours to go through directories in a mail spool on ext3]
>
> https://www.redhat.com/archives/ext3-users/2007-October/msg00019.html
>
> This might be a problem you resolved for yourself a long time ago, but I
> thought you might be interested to know that Theodore's spd_readdir
> library works great with star (even though I also cannot get it to work
> with tar or even du).

That's interesting. Since I couldn't get it to work with tar, got no response, and wasn't sure how or if to get it to work with the daemon that really needed it, I haven't made any progress. I wonder what determines whether the library helps or hurts.

> star with spd_readdir is what I use to back up maildir spools, and
> something I consider absolutely necessary for any such storage on ext3
> or reiserfs filesystems.

I hadn't noticed this was an issue with reiser.

I also just noticed that e2fsck has an option, -D, to optimize directories. The man page says this will reindex directories. Does anyone know if that could help?

Ross
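[A note on usage: Theodore Ts'o's spd_readdir is an LD_PRELOAD library that caches a whole directory and hands its entries back sorted by inode number, which avoids the seek-heavy access pattern of hash-ordered ext3 directories. A minimal sketch of how it is typically built and invoked - the file names, archive path and mail spool here are only placeholders, and the exact compile line may differ from the version of the source you have:

    gcc -o spd_readdir.so -shared -fPIC spd_readdir.c -ldl
    LD_PRELOAD=./spd_readdir.so star -c f=/backup/mail.star /var/spool/mail
    LD_PRELOAD=./spd_readdir.so du -sh /var/spool/mail

Whether the preload helps a given tool depends on that tool going through the C library's opendir()/readdir() rather than issuing its own getdents() calls or being statically linked, which may be why it works with star but not with every program.]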
From adilger at sun.com  Wed Dec  5 23:47:38 2007
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 5 Dec 2007 16:47:38 -0700
Subject: Ancient Very slow directory traversal thread
In-Reply-To: <1196794352.8953.113.camel@corn.betterworld.us>
References: <47545B04.7050905@tigershaunt.com> <1196794352.8953.113.camel@corn.betterworld.us>
Message-ID: <20071205234737.GD3604@webber.adilger.int>

On Dec 04, 2007 10:52 -0800, Ross Boylan wrote:
> On Mon, 2007-12-03 at 14:37 -0500, Rashkae wrote:
> > I just came across your message to a mailing list here:
> [message concerned it taking hours to go through directories in a mail
> spool on ext3]
> >
> > https://www.redhat.com/archives/ext3-users/2007-October/msg00019.html
> >
> > This might be a problem you resolved for yourself a long time ago, but I
> > thought you might be interested to know that Theodore's spd_readdir
> > library works great with star (even though I also cannot get it to work
> > with tar or even du).
>
> That's interesting. Since I couldn't get it to work with tar, got no
> response, and wasn't sure how or if to get it to work with the daemon
> that really needed it, I haven't made any progress.
>
> I wonder what determines whether the library helps or hurts.

Maybe it depends on whether the app is using normal readdir() calls, or is implementing the directory traversal itself?

> I also just noticed that e2fsck has an option, -D, to optimize
> directories. The man page says this will reindex directories. Does
> anyone know if that could help?

That will compress empty space from directories and rebuild the hash table for directories that are not indexed (e.g. older directories created before the DIR_INDEX feature was enabled in the filesystem). It will keep them in hash order, so it won't help this issue.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From Sven_Rudolph at drewag.de  Tue Dec 11 12:29:33 2007
From: Sven_Rudolph at drewag.de (Sven Rudolph)
Date: Tue, 11 Dec 2007 13:29:33 +0100
Subject: Ext3 Performance Tuning - the journal
Message-ID: 

Hello,

I have some performance problems in a file server system. It is used as a Samba and NFS file server. I have some ideas about what might be causing the problems, and I want to try them step by step. First I have to learn more about these areas.

First I have some questions about tuning/sizing the ext3 journal. The most extensive list I found on ext3 performance tuning is .

I learned that the ext3 journal is flushed when either the journal is full or the commit interval is over (set by the mount option "commit="). So I started trying these settings.

I didn't manage to determine the size of the journal of an already existing filesystem. tune2fs only tells me the journal inode:

~# tune2fs -l /dev/vg0/lvol0 | grep -i journal
Filesystem features:  has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Journal inode:        8
Journal backup:       inode blocks

Is there a way to get the size of the journal? And how do I find out how much of the journal is used? Or how often a journal flush actually happens? Or whether the journal flushes happen because the commit interval has finished or because the journal was full? This would give me hints for the sizing of the journal.

And I tried to increase the journal flush interval.
~# umount /data/
~# mount -o commit=30 /dev/vg0/lvol0 /data/
~# grep /data /proc/mounts
/dev/vg0/lvol0 /data ext3 rw,data=ordered 0 0
~#

Watching the disk activity LEDs makes me believe that this works, but I expected the mount option "commit=30" to be listed in /proc/mounts. Did I do something wrong, or is there another way to explain it?

As you can see above in /proc/mounts, I use data=ordered. The file server offers both NFS and Samba. "data=journal" might be better for NFS, but I believe that NFS is the smaller part of the file server load. Is there a way to measure or estimate how large the impact of NFS on the journal size and transfer rate is?

If I used "data=journal" I would need a larger journal, and the journal data transfer rate would increase. I fear this might introduce a new bottleneck, but I have no idea how to measure this or how to estimate it in advance.

Currently I have an internal journal, and the filesystem resides on RAID6. I guess this is another potential performance problem. When discussions on external journals appeared some years ago it was mentioned that the external journal code was quite new (see ).

I think nowadays I have the option to use an external journal and place it on a dedicated RAID1. Has anyone seen performance advantages from doing this? Even while using "data=journal"?

That's all. Thanks for reading this far ;-)

Sven

From lists-ext3-users at bruce-guenter.dyndns.org  Tue Dec 11 22:15:04 2007
From: lists-ext3-users at bruce-guenter.dyndns.org (Bruce Guenter)
Date: Tue, 11 Dec 2007 16:15:04 -0600
Subject: PROBLEM: Duplicated entries in large NFS shared directory
Message-ID: <20071211221504.GA12096@untroubled.org>

Hi.

I have a large directory (almost 40,000 entries) on an ext3 filesystem that is shared over NFS. I discovered recently that when listing the directory on the client, one of the files appears twice. The same file does not appear twice on the server.

I did a capture using Wireshark, and discovered that the offending file name is being sent twice -- once as the last entry in a readdir reply packet and then again as the first entry in the next readdir reply. If I'm reading the trace right, the readdir call sends the cookie for the last entry in the previous readdir reply and the server responds with the next set of entries. In this case, the server responds with the entry containing the same cookie again.

The server is running vanilla 2.6.23.8. I would be happy to provide any further information that would help resolve this bug.

I posted this to the NFS maintainers, and Neil Brown suggested:

> My guess is that you have lucked-out and got two directory entries
> that hash to the same value, and they appear either side of a readdir
> block boundary.
>
> It is an awkward design limitation of ext3 that is rarely a problem
> and could possibly be worked around to some extent...

--
Bruce Guenter                                http://untroubled.org/
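[For anyone who wants to confirm that the hashed-directory (htree) code is involved before chasing this further, a small diagnostic sketch - the device name and path are placeholders:

    tune2fs -l /dev/sdXN | grep 'Filesystem features'   # is dir_index listed?
    lsattr -d /path/to/big-directory                    # an 'I' flag means this directory is htree-indexed

As a drastic experiment one could clear the feature and re-pack the directories on an unmounted filesystem with "tune2fs -O ^dir_index /dev/sdXN" followed by "e2fsck -fD /dev/sdXN", but with almost 40,000 entries the cost of falling back to linear directory scans may well be worse than the duplicate readdir entry, so treat this as a diagnostic idea rather than a recommended fix.]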
From adilger at sun.com  Thu Dec 13 09:20:01 2007
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 13 Dec 2007 02:20:01 -0700
Subject: Ext3 Performance Tuning - the journal
In-Reply-To: 
References: 
Message-ID: <20071213092001.GA3214@webber.adilger.int>

On Dec 11, 2007 13:29 +0100, Sven Rudolph wrote:
> I didn't manage to determine the size of the journal of an already
> existing filesystem. tune2fs only tells me the journal inode:
>
> ~# tune2fs -l /dev/vg0/lvol0 | grep -i journal
> Filesystem features:  has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
> Journal inode:        8
> Journal backup:       inode blocks
>
> Is there a way to get the size of the journal?

debugfs -c -R "stat <8>" /dev/vg0/lvol0

> And how do I find out how much of the journal is used? Or how often a
> journal flush actually happens? Or whether the journal flushes happen
> because the commit interval has finished or because the journal was
> full? This would give me hints for the sizing of the journal.

There is a patch for jbd2 (part of the ext4 patch queue, based on a patch for jbd from Lustre) that records transactions and journal stats.

> And I tried to increase the journal flush interval.
>
> ~# umount /data/
> ~# mount -o commit=30 /dev/vg0/lvol0 /data/
> ~# grep /data /proc/mounts
> /dev/vg0/lvol0 /data ext3 rw,data=ordered 0 0
> ~#
>
> Watching the disk activity LEDs makes me believe that this works, but
> I expected the mount option "commit=30" to be listed in
> /proc/mounts. Did I do something wrong, or is there another way to
> explain it?

No, /proc/mounts doesn't report all of the mount options correctly.

> As you can see above in /proc/mounts, I use data=ordered. The file server
> offers both NFS and Samba. "data=journal" might be better for NFS, but
> I believe that NFS is the smaller part of the file server load. Is
> there a way to measure or estimate how large the impact of NFS on the
> journal size and transfer rate is?
>
> If I used "data=journal" I would need a larger journal, and the journal
> data transfer rate would increase. I fear this might introduce a new
> bottleneck, but I have no idea how to measure this or how to estimate
> it in advance.

Increasing the journal size is a good idea for any metadata-heavy load. We use a journal size of 400MB for Lustre metadata servers.

> Currently I have an internal journal, and the filesystem resides on
> RAID6. I guess this is another potential performance problem.

For the journal this doesn't make much difference, since the IO is sequential writes. The RAID6 is bad for metadata performance because it has to do read-modify-write on the RAID stripes.

> When discussions on external journals appeared some years ago it was
> mentioned that the external journal code was quite new (see ).
>
> I think nowadays I have the option to use an external journal and
> place it on a dedicated RAID1. Has anyone seen performance advantages
> from doing this? Even while using "data=journal"?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
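[Since journal sizing comes up here: the journal of an existing ext3 filesystem is replaced rather than resized in place. A sketch of the usual sequence, assuming the filesystem is unmounted and clean; the 400 MB figure simply echoes the Lustre recommendation above, and /dev/md2 is only a placeholder for a dedicated journal device:

    umount /data
    tune2fs -O ^has_journal /dev/vg0/lvol0     # drop the current internal journal
    tune2fs -j -J size=400 /dev/vg0/lvol0      # recreate it, 400 MB this time
    mount -o commit=30 /dev/vg0/lvol0 /data

    # external journal variant:
    # mke2fs -O journal_dev /dev/md2
    # tune2fs -O ^has_journal /dev/vg0/lvol0
    # tune2fs -j -J device=/dev/md2 /dev/vg0/lvol0

Run e2fsck first if the filesystem was not cleanly unmounted, since tune2fs will refuse to remove a journal that still needs recovery.]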
From brice+ext3 at daysofwonder.com  Thu Dec 13 16:22:24 2007
From: brice+ext3 at daysofwonder.com (Brice Figureau)
Date: Thu, 13 Dec 2007 17:22:24 +0100
Subject: Ext3 Performance Tuning - the journal
In-Reply-To: 
References: 
Message-ID: <1197562944.10717.6.camel@localhost.localdomain>

Hi,

On Tue, 2007-12-11 at 13:29 +0100, Sven Rudolph wrote:
> I have some performance problems in a file server system. It is used
> as a Samba and NFS file server. I have some ideas about what might be
> causing the problems, and I want to try them step by step. First I
> have to learn more about these areas.
>
> First I have some questions about tuning/sizing the ext3 journal.
>
> The most extensive list I found on ext3 performance tuning is .
>
> I learned that the ext3 journal is flushed when either the journal is
> full or the commit interval is over (set by the mount option
> "commit="). So I started trying these settings.

Are your filesystems mounted with noatime? It makes a huge difference, especially if your workload is mostly reads rather than writes. Without noatime, each access to a file generates a metadata write, which will fill your journal. If you are not using noatime, it is worth trying.

See this thread for a thorough discussion of the topic:
http://thread.gmane.org/gmane.linux.kernel/565148

Hope that helps,
--
Brice Figureau
Days of Wonder
http://www.daysofwonder.com/

From tpo2 at sourcepole.ch  Tue Dec 18 11:43:13 2007
From: tpo2 at sourcepole.ch (Tomas Pospisek ML)
Date: Tue, 18 Dec 2007 11:43:13 +0000
Subject: Soliciting contract work for FS corruption analysis [was: Re: Second Block on Partition overwritten with 0xFF]
In-Reply-To: 
Message-ID: 

Hello Ext3 world,

my customer is still experiencing an unacceptably high rate of machine outages with corrupted superblocks - roughly 10% of our embedded PC population. Thus we are looking for expert contract work to resolve our FS problem.

The details of our problem should be roughly sketched in the previous thread (please look in the ML archives for the thread "Second Block on Partition overwritten with 0xFF"). We can provide images (as in dd if=/dev/hdx of=disk_image) of the corrupted file systems. We are also ready to organize travel to our site (Switzerland) if necessary and/or live access to our units.

Also, any pointers to experts on the subject of ext2/3, IDE and flash cards are highly appreciated. Once again, any and all pointers to sources of help for our problem are very, very appreciated - please do contact us!

Thanks in advance,
*t

On 9/9/2007, "Tomas Pospisek's Mailing Lists" wrote:

>On Thu, 6 Sep 2007, Andreas Dilger wrote:
>
>> On Sep 06, 2007 23:02 +0200, Tomas Pospisek's Mailing Lists wrote:
>>> On Thu, 6 Sep 2007, Christian Kujau wrote:
>>>> On Thu, 6 Sep 2007, Tomas Pospisek ML wrote:
>>>>> default) at 0x400. Thus as I understand it, it *would* be possible for
>>>>> the ext3 driver to physically write to those first sectors inside its
>>>>> partition.
              ^^^^^^
>>>>
>>>> Yes, ext3 will write *inside* its assigned partition, but not outside.
>>>
>>> Thanks, however it seems I cannot get across what I need to know -
>>> sorry for that. I *do* know that ext3 will write to its own partition
>>> only. But once mke2fs has run:
>>>
>>> * will ext2/3 *ever* write to the first 4 sectors on *its* partition?
>>>
>>> Same question restated: is it possible that ext2/3 will write into the
>>> space before the first block group [1]?
>>
>> The ext2/3/4 superblock is at offset 1024 bytes. It is written by marking
>> the buffer it is in dirty. If the filesystem blocksize is > 1024 bytes
>> then the whole block will be written to disk (including the first sectors).
>>
>> That said, the buffer cache is coherent when written by the filesystem and
>> when written via /dev/XXX, so any modifications made to the first sectors
>> should be rewritten each time the superblock is marked dirty. The ext3
>> code will never itself modify those sectors.
>
>I just remembered that the problem once occurred when there was very high
>memory pressure, i.e. the OOM killer went around and killed applications,
>the machine rebooted, at which point the FS was broken.
>
>So a naive ad hoc theory of mine for the FS corruption would be that the
>FS was unmounted at a moment when processes wouldn't receive any more
>memory from the OS (due to OOM), and thus umount would flush/write out the
>first block (I believe it has to clear the dirty FS flag at umount),
>which it failed to properly allocate before?!?
>*t
>
>--
>-----------------------------------------------------------
> Tomas Pospisek
> http://sourcepole.com - Linux & Open Source Solutions
>-----------------------------------------------------------

From tpo2 at sourcepole.ch  Tue Dec 18 12:53:40 2007
From: tpo2 at sourcepole.ch (Tomas Pospisek ML)
Date: Tue, 18 Dec 2007 12:53:40 +0000
Subject: Soliciting contractor for FS corruption analysis [was: Re: Second Block on Partition overwritten with 0xFF]
Message-ID: 

(this is a repost with the subject and slight text corrections)

Hello Ext3 world,

my customer is still experiencing an unacceptably high rate of machine outages with corrupted superblocks - roughly 10% of our embedded PC population. Thus we are looking to contract an expert to resolve our FS problem.

The details of our problem should be roughly sketched in the previous thread (please look in the ML archives for the thread "Second Block on Partition overwritten with 0xFF"). We can provide images (as in dd if=/dev/hdx of=disk_image) of the corrupted file systems. We are also ready to organize travel to our site (Switzerland) if necessary and/or live access to our units.

Also, any pointers to experts on the subject of ext2/3, IDE and flash cards are highly appreciated. Once again, any and all pointers to sources of help for our problem are very, very appreciated - please do contact us!

Thanks in advance,
*t

On 9/9/2007, "Tomas Pospisek's Mailing Lists" wrote:

>On Thu, 6 Sep 2007, Andreas Dilger wrote:
>
>> On Sep 06, 2007 23:02 +0200, Tomas Pospisek's Mailing Lists wrote:
>>> On Thu, 6 Sep 2007, Christian Kujau wrote:
>>>> On Thu, 6 Sep 2007, Tomas Pospisek ML wrote:
>>>>> default) at 0x400. Thus as I understand it, it *would* be possible for
>>>>> the ext3 driver to physically write to those first sectors inside its
>>>>> partition.
              ^^^^^^
>>>>
>>>> Yes, ext3 will write *inside* its assigned partition, but not outside.
>>>
>>> Thanks, however it seems I cannot get across what I need to know -
>>> sorry for that. I *do* know that ext3 will write to its own partition
>>> only. But once mke2fs has run:
>>>
>>> * will ext2/3 *ever* write to the first 4 sectors on *its* partition?
>>>
>>> Same question restated: is it possible that ext2/3 will write into the
>>> space before the first block group [1]?
>>
>> The ext2/3/4 superblock is at offset 1024 bytes. It is written by marking
>> the buffer it is in dirty. If the filesystem blocksize is > 1024 bytes
>> then the whole block will be written to disk (including the first sectors).
>>
>> That said, the buffer cache is coherent when written by the filesystem and
>> when written via /dev/XXX, so any modifications made to the first sectors
>> should be rewritten each time the superblock is marked dirty. The ext3
>> code will never itself modify those sectors.
>
>I just remembered that the problem once occurred when there was very high
>memory pressure, i.e. the OOM killer went around and killed applications,
>the machine rebooted, at which point the FS was broken.
>
>So a naive ad hoc theory of mine for the FS corruption would be that the
>FS was unmounted at a moment when processes wouldn't receive any more
>memory from the OS (due to OOM), and thus umount would flush/write out the
>first block (I believe it has to clear the dirty FS flag at umount),
>which it failed to properly allocate before?!?
>*t
>
>--
>-----------------------------------------------------------
> Tomas Pospisek
> http://sourcepole.com - Linux & Open Source Solutions
>-----------------------------------------------------------

From bart.bas at gmail.com  Tue Dec 18 21:11:19 2007
From: bart.bas at gmail.com (Bart)
Date: Tue, 18 Dec 2007 22:11:19 +0100
Subject: how ext3 works
Message-ID: <64dbfc980712181311haa9df53p3d123ae283ddc684@mail.gmail.com>

In the past few days, I've been reading about ext3/journalling/... In order to fully understand how it works, I have a few questions.

* Imagine the following situation: you opened a file in vi(m), you are editing, but haven't yet saved your work. The system crashes: what will be the result? Will the metadata be modified (assume both atime and noatime)? Will the data itself be corrupted? Or will there be no modification whatsoever because you hadn't saved yet (your work will simply be lost)?

* What happens when the system crashes during a write to the journal? Can the journal be corrupted?

* About ext3's ordered mode:

[quote]from Wikipedia:
Ordered
(medium speed, medium risk) Ordered is as with writeback, but forces file contents to be written before its associated metadata is marked as committed in the journal.[/quote]

What's the sequence of events here?

1. user issues a command to write his work to disk
2. metadata is recorded in the journal, but is marked as "not yet executed" (or something similar)
3. data (file contents) and metadata are written to disk
4. the metadata flag is set to "executed"

If a crash happens between steps 1 and 2, we are in the situation described above (first situation): nothing has been written yet. If a crash happens between steps 2 and 3, isn't this the same as writeback? Or is this impossible (I read something about a single transaction, but I forgot where)? A crash between steps 3 and 4 can be corrected by replaying the journal.

Is this a correct view of things?

From andreesje_werk at yahoo.com  Wed Dec 19 05:44:29 2007
From: andreesje_werk at yahoo.com (wienerschnitzel)
Date: Tue, 18 Dec 2007 21:44:29 -0800
Subject: ext3 journaling on flash disk
Message-ID: 

Hello folks,

I'm using a rather old kernel (2.4.27) that has been working quite well in an embedded system.

Currently, I am conducting some unclean shutdown tests with different flash disks and I'm running into fs corruption. I'm using the data=journal mode for the root and data partitions. I'm mounting with the 'noatime' option.

Would it make sense to go to the latest 2.4 kernel, or should I move on to 2.6?

I have three different flash disks and none of them seem to have write caching; however, one of them has the 'Mandatory FLUSH_CACHE' support - what exactly does that mean?
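[On the FLUSH_CACHE question: 'Mandatory FLUSH_CACHE' in an IDE identify dump just means the device claims to implement the ATA FLUSH CACHE command (mandatory since ATA-6), which the kernel can use to force data out of any volatile write cache. A quick way to inspect a drive, assuming hdparm is available and /dev/hda stands in for the flash device:

    hdparm -I /dev/hda | grep -i -e 'write cache' -e 'flush'   # advertised features
    hdparm -W /dev/hda                                         # current write-cache setting
    hdparm -W0 /dev/hda                                        # try to disable the write cache

Not every flash/IDE adapter honours -W0 or implements the flush correctly, which is often exactly where power-fail corruption comes from.]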
From lists at nerdbynature.de  Wed Dec 19 10:57:29 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Wed, 19 Dec 2007 11:57:29 +0100 (CET)
Subject: ext3 journaling on flash disk
In-Reply-To: 
References: 
Message-ID: <51009.62.180.231.196.1198061849.squirrel@housecafe.dyndns.org>

On Wed, December 19, 2007 06:44, wienerschnitzel wrote:
> Currently, I am conducting some unclean shutdown tests with different
> flash disks and I'm running into fs corruption.
> I'm using the data=journal mode for the root and data partitions.

Well, what kind of corruption do you get? 2.4 is still somewhat supported, I think, and if it turns out to be a bug, maybe someone will fix it.

> Would it make sense to go to the latest 2.4 kernel, or should I move on to
> 2.6?

If upgrading to 2.6 is feasible for you, it's worth a try.

C.
--
BOFH excuse #442: Trojan horse ran out of hay

From lists at nerdbynature.de  Wed Dec 19 11:53:34 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Wed, 19 Dec 2007 12:53:34 +0100 (CET)
Subject: how ext3 works
In-Reply-To: <64dbfc980712181311haa9df53p3d123ae283ddc684@mail.gmail.com>
References: <64dbfc980712181311haa9df53p3d123ae283ddc684@mail.gmail.com>
Message-ID: <37702.62.180.231.196.1198065214.squirrel@housecafe.dyndns.org>

On Tue, December 18, 2007 22:11, Bart wrote:
> * Imagine the following situation: you opened a file in vi(m), you are
> editing, but haven't yet saved your work. The system crashes: what will be
> the result?

If you haven't saved yet, nothing will happen. But since vi(m) creates a temporary file (.file.swp or something), this file could have made it to the disk already.

> Will the metadata be modified (assume both atime and noatime)? Will
> the data itself be corrupted?

The file itself should not be corrupt. If it were, it would have been replayed from the journal during bootup (fsck) to provide a non-corrupt filesystem.

> simply be lost)? * What happens when the system crashes during a write to
> the journal? Can the journal be corrupted?

It shouldn't be corrupted. If it were, fsck should be able to fix that; otherwise I'd consider it a bug.

> If a crash happens between steps 1 and 2, we are in the situation
> described above (first situation): nothing has been written yet. If a crash
> happens between steps 2 and 3, isn't this the same as writeback?

AFAIK, writes to the journal have to be atomic: either the journal is updated, or (when it crashes during this operation) it isn't. With data=ordered, the journal is updated after the data has made it to the disk. With data=journal, the journal is updated first.

C.
--
BOFH excuse #442: Trojan horse ran out of hay
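[A quick way to see which data journalling mode an ext3 filesystem actually ended up with, and to request a different one, assuming /dev/sdb1 and /mnt/test as example names:

    dmesg | grep 'EXT3-fs: mounted'            # e.g. "EXT3-fs: mounted filesystem with ordered data mode."
    grep ' /mnt/test ' /proc/mounts            # shows data=ordered / data=journal / data=writeback
    mount -o data=journal /dev/sdb1 /mnt/test  # full data journalling for this mount

Note that the data mode generally cannot be switched with "mount -o remount"; for the root filesystem it has to be passed on the kernel command line, e.g. rootflags=data=journal.]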
From bruno at wolff.to  Wed Dec 19 17:34:11 2007
From: bruno at wolff.to (Bruno Wolff III)
Date: Wed, 19 Dec 2007 11:34:11 -0600
Subject: how ext3 works
In-Reply-To: <64dbfc980712181311haa9df53p3d123ae283ddc684@mail.gmail.com>
References: <64dbfc980712181311haa9df53p3d123ae283ddc684@mail.gmail.com>
Message-ID: <20071219173411.GA27090@wolff.to>

On Tue, Dec 18, 2007 at 22:11:19 +0100, Bart wrote:
> In the past few days, I've been reading about ext3/journalling/... In order
> to fully understand how it works, I have a few questions.
>
> * Imagine the following situation: you opened a file in vi(m), you are
> editing, but haven't yet saved your work. The system crashes: what will be
> the result? Will the metadata be modified (assume both atime and noatime)?
> Will the data itself be corrupted? Or will there be no modification
> whatsoever because you hadn't saved yet (your work will simply be lost)?
>
> * What happens when the system crashes during a write to the journal? Can
> the journal be corrupted?

vi keeps data in a scratch file, so if you haven't forced a save, your original file will be intact. You should be able to use vi -r to recover at least some of the changes you were working on. This is mostly independent of what is going on in the file system.

What you are really looking for with journalling is that when you are told data is safe on disk, it is in fact safe on disk. You also need to worry about drive caching when dealing with this. You can either turn write caching off, have the cache backed by battery (common with real RAID controllers), use write barriers (a mount option) to force cache flushes when needed (not all drives support this), or use disk drives that can report back when commands have really completed (which is not available for PATA drives).

> * About ext3's ordered mode
>
> [quote]from Wikipedia:
> Ordered
> (medium speed, medium risk) Ordered is as with writeback, but forces
> file contents to be written before its associated metadata is marked as
> committed in the journal.[/quote]

Note that for some workloads data=journal can be as fast as data=ordered. Are you really having a throughput problem with your disk drives? If not, then you probably want to use data=journal (assuming that reliability is of high concern to you). If you are having throughput problems, there are other potential solutions besides reducing the effectiveness of journalling.

From tpo2 at sourcepole.ch  Thu Dec 20 10:27:17 2007
From: tpo2 at sourcepole.ch (Tomas Pospisek's Mailing Lists)
Date: Thu, 20 Dec 2007 11:27:17 +0100 (CET)
Subject: ext3 journaling on flash disk
In-Reply-To: 
References: 
Message-ID: 

On Tue, 18 Dec 2007, wienerschnitzel wrote:

> I'm using a rather old kernel (2.4.27) that has been working quite well in
> an embedded system.
>
> Currently, I am conducting some unclean shutdown tests with different flash
> disks and I'm running into fs corruption. I'm using the data=journal mode
> for the root and data partitions. I'm mounting with the 'noatime' option.

What does the corruption look like? I.e. what is corrupted?

> Would it make sense to go to the latest 2.4 kernel, or should I move on to
> 2.6?
>
> I have three different flash disks and none of them seem to have write
> caching; however, one of them has the 'Mandatory FLUSH_CACHE' support - what
> exactly does that mean?

--
-----------------------------------------------------------
 Tomas Pospisek
 http://sourcepole.com - Linux & Open Source Solutions
-----------------------------------------------------------

From liuyue at ncic.ac.cn  Thu Dec 27 06:30:46 2007
From: liuyue at ncic.ac.cn (liuyue)
Date: Thu, 27 Dec 2007 14:30:46 +0800
Subject: ext3 performance problem
Message-ID: <20071227062525.263C113687A@ncic.ac.cn>

Hello all,

I have been testing the ext3 file system recently and have found some problems. I am using GreatTurbo Enterprise Server 10 (Zuma) with a 2.6.20 kernel.

I conducted my test as follows:

mkfs.ext3 /dev/sdb1
mount /dev/sdb1 /mnt/test
cd /mnt/test
mkdir 0 1 2 3 4 5 6 7

I tested write and read performance under the different subdirectories; the results are below. I also used filefrag to see the file layout.
Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w
5242880  1024  72706  0  80474  0
/mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w
5242880  1024  49957  0  52899  0
/mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w
5242880  1024  60292  0  64664  0
/mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w
5242880  1024  70540  0  78644  0
/mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w
5242880  1024  61334  0  67778  0
/mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w
5242880  1024  66735  0  75114  0
/mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w
5242880  1024  65062  0  72686  0
/mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w
5242880  1024  69247  0  78563  0
/mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w
5242880  1024  77085  0  81696  0

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w
/mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents
5242880  1024  57776  0  64870  0

Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile3 -c -e -+n -w
5242880  1024  54799  0  59145  0
/mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents

(The result columns are file size in KB, record size in KB, then write, rewrite, read and re-read throughput in KB/s; rewrite and re-read are 0 because -+n disables the retest runs.)

My questions are:

1. Why does the performance under different subdirectories vary so much? In /mnt/test/0 the performance is 72/80 (write/read), while in /mnt/test/1 it is only 49/53.

2. I see that the extent counts of all the files are nearly the same, but their performance differs. Apart from the extents (fragmentation) of the file, what other factors influence the performance?

3. Is it true that the more files already exist in a directory, the lower the performance will be for new files written in that directory? In my test, the performance for /mnt/test/5/tmpfile is 66/75, while /mnt/test/5/tmpfile2 and tmpfile3 get 57/64 and 54/59.

Thanks very much

From ezk at cs.sunysb.edu  Mon Dec 24 23:02:24 2007
From: ezk at cs.sunysb.edu (Erez Zadok)
Date: Mon, 24 Dec 2007 18:02:24 -0500
Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
Message-ID: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu>

Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree. Kernel with SMP, preemption, and lockdep configured.

Cheers,
Erez.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.24-rc6 #83
-------------------------------------------------------
diotest1/2088 is trying to acquire lock:
 (&mm->mmap_sem){----}, at: [] dio_get_page+0x4e/0x15d

but task is already holding lock:
 (jbd_handle){--..}, at: [] journal_start+0xcb/0xf8

which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #1 (jbd_handle){--..}:
       [] __lock_acquire+0x9cc/0xb95
       [] lock_acquire+0x5f/0x78
       [] journal_start+0xee/0xf8
       [] ext3_journal_start_sb+0x48/0x4a
       [] ext3_dirty_inode+0x27/0x6c
       [] __mark_inode_dirty+0x29/0x144
       [] touch_atime+0xb7/0xbc
       [] generic_file_mmap+0x2d/0x42
       [] mmap_region+0x1e6/0x3b4
       [] do_mmap_pgoff+0x1fb/0x253
       [] sys_mmap2+0x9b/0xb5
       [] syscall_call+0x7/0xb
       [] 0xffffffff

-> #0 (&mm->mmap_sem){----}:
       [] __lock_acquire+0x8bc/0xb95
       [] lock_acquire+0x5f/0x78
       [] down_read+0x3a/0x4c
       [] dio_get_page+0x4e/0x15d
       [] __blockdev_direct_IO+0x431/0xa81
       [] ext3_direct_IO+0x10c/0x1a1
       [] generic_file_direct_IO+0x124/0x139
       [] generic_file_direct_write+0x56/0x11c
       [] __generic_file_aio_write_nolock+0x33d/0x489
       [] generic_file_aio_write+0x58/0xb6
       [] ext3_file_write+0x27/0x99
       [] do_sync_write+0xc5/0x102
       [] vfs_write+0x90/0x119
       [] sys_write+0x3d/0x61
       [] sysenter_past_esp+0x5f/0xa5
       [] 0xffffffff

other info that might help us debug this:

2 locks held by diotest1/2088:
 #0:  (&sb->s_type->i_mutex_key#6){--..}, at: [] generic_file_aio_write+0x45/0xb6
 #1:  (jbd_handle){--..}, at: [] journal_start+0xcb/0xf8

stack backtrace:
Pid: 2088, comm: diotest1 Not tainted 2.6.24-rc6 #83
 [] show_trace_log_lvl+0x1a/0x2f
 [] show_trace+0x12/0x14
 [] dump_stack+0x6c/0x72
 [] print_circular_bug_tail+0x5f/0x68
 [] __lock_acquire+0x8bc/0xb95
 [] lock_acquire+0x5f/0x78
 [] down_read+0x3a/0x4c
 [] dio_get_page+0x4e/0x15d
 [] __blockdev_direct_IO+0x431/0xa81
 [] ext3_direct_IO+0x10c/0x1a1
 [] generic_file_direct_IO+0x124/0x139
 [] generic_file_direct_write+0x56/0x11c
 [] __generic_file_aio_write_nolock+0x33d/0x489
 [] generic_file_aio_write+0x58/0xb6
 [] ext3_file_write+0x27/0x99
 [] do_sync_write+0xc5/0x102
 [] vfs_write+0x90/0x119
 [] sys_write+0x3d/0x61
 [] sysenter_past_esp+0x5f/0xa5
=======================