From fk at linuxburg.de Wed Feb 1 11:47:08 2006 From: fk at linuxburg.de (Felix E. Klee) Date: Wed, 1 Feb 2006 12:47:08 +0100 Subject: df reports false size In-Reply-To: <20060131175924.GA11642@schatzie.adilger.int> References: <200601301938.54145.fk@linuxburg.de> <200601311331.20929.fk@linuxburg.de> <20060131175924.GA11642@schatzie.adilger.int> Message-ID: <200602011247.08493.fk@linuxburg.de> Am Dienstag, 31. Januar 2006 18:59 schrieb Andreas Dilger: > You can use "mount -t bind / /mnt" and then "/mnt/nfsroot" will be the > underlying directory. That's what I eventually used in order to be able to remove the directory (I got the hint on another mailing list). Thanks for your help! -- Dipl.-Phys. Felix E. Klee Email: fk at linuxburg.de (work), felix.klee at inka.de (home) Tel: +49 721 8307937, Fax: +49 721 8307936 Linuxburg, Goethestr. 15A, 76135 Karlsruhe, Germany From pradeep.vincent at gmail.com Sat Feb 4 02:17:41 2006 From: pradeep.vincent at gmail.com (Pradeep Vincent) Date: Fri, 3 Feb 2006 18:17:41 -0800 Subject: Ext3 IO context Message-ID: <9fda5f510602031817l34dfcb50x7540bed68f9ea5c@mail.gmail.com> I am running a BDB based application on top of EXT3 on Linux (RH 7.2) - the application and the configuration are exactly the same. When the application writes, sometimes the IO happens in the context of the application thread while most of the time the IO happens in the context kjournald. What does it mean for IO to happen in initiating process context for non O_SYNC file I/O. I was thinking the dirty file cache thresholds determine if a write to a filesystem write will correspond to block I/O in the process context. Is that how ext3 works. I am not very familiar with kjournald functionality - is that thread meant to initiate I/O just for metadata updates or for data updates as well ? I figured out the process context for I/Os using sysctl -w vm.block_dump='1' which throws debug messages into kern.log. 
Please cc pradeep.vincent at gmail.com

Thanks,
Pradeep Vincent

From tibor.tarnai at sap.com Tue Feb 14 09:35:10 2006
From: tibor.tarnai at sap.com (Tarnai, Tibor)
Date: Tue, 14 Feb 2006 10:35:10 +0100
Subject: Ext3 problems
Message-ID:

Hi!

I was really stupid! I defragmented my ext3 partition with e2defrag. Although I have done that many times on Debian without problems, on my new Gentoo installation it had bad results. When I tried to boot from this partition I got serious e2fsck errors. It reported that (only) inode 8 had illegal blocks, so I ran e2fsck -fy /dev/hda2, which cleared the illegal blocks in inode 8. After this was done I copied the whole partition with cp -pr to a different location and created a new ext3 filesystem on /dev/hda2.

Is it possible that I was so lucky that only the journal inode was corrupted, and the rest of my system is intact? How can I make sure that this assumption is true?

Tibor Tarnai
Junior Developer
SAP Labs Hungary
1031 Budapest
Záhony u. 7.
Tel: +36 1 885 7237
Fax: +36 1 885 7575
mailto:tibor.tarnai at sap.com
http://www.sap.hu

From pegasus at nerv.eu.org Wed Feb 15 17:07:20 2006
From: pegasus at nerv.eu.org (Jure Pečar)
Date: Wed, 15 Feb 2006 18:07:20 +0100
Subject: max journal size
Message-ID: <20060215180720.6158531f.pegasus@nerv.eu.org>

Hi all,

The tune2fs man page says that the maximum journal size is 102,400 filesystem blocks, which translates to ~100MB with 1KB blocks or ~400MB with 4KB blocks. I wonder: why does this limitation exist?

Now that relatively cheap SSD devices exist (Gigabyte i-RAM) that offer up to 4GB of space, it would be extremely useful to use the whole capacity of such a device for full data journaling.
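The two figures quoted from the man page follow directly from multiplying the block limit by the block size. A minimal sketch of that arithmetic is below; the function name is made up for illustration, and the only input taken from the message is the 102,400-block figure.

```python
# Sketch: journal size implied by the 102,400-block limit quoted above.
# The constant comes from the message; everything else is illustrative.
MAX_JOURNAL_BLOCKS = 102_400


def max_journal_mib(block_size_bytes: int) -> float:
    """Largest journal, in MiB, that 102,400 blocks of this size can hold."""
    return MAX_JOURNAL_BLOCKS * block_size_bytes / (1024 * 1024)


for block_size in (1024, 2048, 4096):
    print(f"{block_size}-byte blocks: {max_journal_mib(block_size):.0f} MiB")
```

With 1 KiB blocks this gives 100 MiB and with 4 KiB blocks 400 MiB, which matches the approximate numbers in the message and is well short of a 4GB device.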
--
Jure Pečar
http://jure.pecar.org

From adilger at clusterfs.com Wed Feb 15 20:52:38 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 15 Feb 2006 13:52:38 -0700
Subject: max journal size
In-Reply-To: <20060215180720.6158531f.pegasus@nerv.eu.org>
References: <20060215180720.6158531f.pegasus@nerv.eu.org>
Message-ID: <20060215205238.GJ13382@schatzie.adilger.int>

On Feb 15, 2006 18:07 +0100, Jure Pečar wrote:
> Man page of tune2fs says that max journal size is 102,400 filesystem
> blocks, which translates to ~100MB with 1kb blocks or ~400MB with 4kb
> block. I wonder - why this limitation exists?

The limit exists to avoid users making the journal too large and consuming all of their RAM with pinned buffers while the journal is committing buffers to the journal and checkpointing them to disk. Under heavy load it is possible for jbd to have 3/4*journal_size of lowmem pinned.

> Now that relatively cheap ssd devices exist (gigabyte iRam) that offer
> up to 4GB of space, it would be extremely useful to use whole capacity
> of such device for full data journaling.

This limit does not apply when using an external journal.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From pegasus at nerv.eu.org Thu Feb 16 09:23:39 2006
From: pegasus at nerv.eu.org (Jure Pečar)
Date: Thu, 16 Feb 2006 10:23:39 +0100
Subject: max journal size
In-Reply-To: <20060215205238.GJ13382@schatzie.adilger.int>
References: <20060215180720.6158531f.pegasus@nerv.eu.org> <20060215205238.GJ13382@schatzie.adilger.int>
Message-ID: <20060216102339.70aa649a.pegasus@nerv.eu.org>

On Wed, 15 Feb 2006 13:52:38 -0700 Andreas Dilger wrote:

> The limit exists to avoid users making the journal too large and consuming
> all of their RAM with pinned buffers while the journal is committing buffers
> to the journal and checkpointing them to disk. Under heavy load it is
> possible for jbd to have 3/4*journal_size of lowmem pinned.
>
> This limit does not apply when using an external journal.

Excellent. So that means I can use any size for an external journal.

Next question: the point of having an SSD for the journal is to force as much I/O through it as possible, especially writes. The default journal commit interval is something like 5 seconds, yes? My application, a busy mail gateway, would IMHO benefit from much longer journal commit intervals. As I understand it, a journal commit is an atomic operation: nothing else can do I/O to that filesystem at the same time.

With large journals and a long time between commits, the commit itself takes a measurable amount of time. What happens if I pull the plug during such a commit? How well tested is this area?

--
Jure Pečar
http://jure.pecar.org

From adilger at clusterfs.com Thu Feb 16 21:30:53 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 16 Feb 2006 14:30:53 -0700
Subject: max journal size
In-Reply-To: <20060216102339.70aa649a.pegasus@nerv.eu.org>
References: <20060215180720.6158531f.pegasus@nerv.eu.org> <20060215205238.GJ13382@schatzie.adilger.int> <20060216102339.70aa649a.pegasus@nerv.eu.org>
Message-ID: <20060216213053.GY13382@schatzie.adilger.int>

On Feb 16, 2006 10:23 +0100, Jure Pečar wrote:
> Default journal commit is something like 5 seconds, yes? My application
> - busy mail gateway - would imho benefit from much larger journal commit
> times. As I understand, journal commit is atomic operation - nothing else
> can do io to that filesystem at the same time.

This is incorrect. While journal commit is atomic in the sense that it will either all complete or all not complete (in case of failure), ext3/jbd does not prevent new changes from being made while the transaction is committing, unless the journal becomes totally full.

> With large journals and long time between commits, the commit itself
> takes a measurable amount of time. What happens if I pull the plug
> during such commit? How well tested area is this?
If you interrupt a committing transaction then all operations in that transaction (which may be many for a large journal) will be lost (i.e. rollback). If your operations are synchronous then they will not return until the journal has finished the commit (assuming you do not have write-cache enabled on the disks).

This is fairly well tested, as it happens all the time.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From ariel.burbaickij at gmail.com Sat Feb 18 15:37:19 2006
From: ariel.burbaickij at gmail.com (Ariel Burbaickij)
Date: Sat, 18 Feb 2006 16:37:19 +0100
Subject: unplausible "no space left on deivce"
Message-ID: <3058f9b40602180737g11fb82b6we0cc93382aa7d187@mail.gmail.com>

Hello all,

I recently ran into the following issue (the description is, out of necessity, a bit lengthy):

I had a corrupted partition with an ext3 filesystem on it. Fortunately the partition was mirrored, so I was able to dd the mirrored partition to a file like this:

    dd bs=512 if=/dev/ of=some_file

and write the content to the primary partition like this:

    dd bs=512 if=some_file of=/dev/

The most crucial thing is the block size chosen: 512 bytes. Both operations went fine. Now to the trouble: I was able to create normal files of whatever size on the recreated primary partition, but I was not able to create any directories. I got "no space left on device" with disk/inode usage around 1%. Both the primary and the mirrored partition were formatted with block size 4096 and, indeed, when I re-executed the dd command everything worked fine. I would consider this a bug and would be glad to hear your opinions.

With Best Regards
Ariel Burbaickij

From mvolaski at aecom.yu.edu Sun Feb 19 19:09:51 2006
From: mvolaski at aecom.yu.edu (Maurice Volaski)
Date: Sun, 19 Feb 2006 14:09:51 -0500
Subject: ext3 involved in kernel panic in 2.6.13?
Message-ID:

Dual Opteron system running ext3 atop drbd (network RAID) devices, which, in turn, are atop LVM logical volumes.
The underlying device is hardware SCSI RAID via a LSILogic HBA. The kernel is vanilla 2.6.13 on a Gentoo-based system. A panic occurred, which contains references to ext3 code. I'm not sure how others manage to get these typed out, but I'm manually typing it from what's on the monitor: Call Trace: {i8042_interrupt+111} {commit_timeout+0} {run_timer_softirq+387} {__do_softirq+113} {call_softirq+31} {do_softirq+53} {apic_timer_interrupt+132} {do_get_write_access+118} {do_get_write_access+94} {__getblk+47} {filldir+0} {journal_get_write_access+41} {ext3_reserve_inode+write+76} {filldir+0} {ext3_mark_inode_dirty+56} {journal_start_229} {ext3_dirty_inode+113} {__mark_inode_dirty+52} {update_atime+123} {vfs_readdir+166} {syst_getdents+130} {sys_fcntl+830} {system_call+126} Code: 8b 40 18 48 c1 e0 07 48 8b 98 08 58 5b 80 4c 01 e3 48 89 df RIP {try_to_wake_up+57} RSP <0>Kernel panic - not syncing: Aiee, killing interrupt handler! -- Maurice Volaski, mvolaski at aecom.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University From mvolaski at aecom.yu.edu Mon Feb 20 17:06:35 2006 From: mvolaski at aecom.yu.edu (Maurice Volaski) Date: Mon, 20 Feb 2006 12:06:35 -0500 Subject: ext3 involved in kernel panic in 2.6.13? In-Reply-To: <20060220092546.GA12208@atrey.karlin.mff.cuni.cz> References: <20060220092546.GA12208@atrey.karlin.mff.cuni.cz> Message-ID: > > Dual Opteron system running ext3 atop drbd (network RAID) devices, >> which, in turn, are atop LVM logical volumes. The underlying device >> is hardware SCSI RAID via a LSILogic HBA. The kernel is vanilla >> 2.6.13 on a Gentoo-based system. >> >> A panic occurred, which contains references to ext3 code. >> >> I'm not sure how others manage to get these typed out, but I'm >> manually typing it from what's on the monitor: > > There should be more in the logs (just before the Call Trace:). Didn't >you capture also that information? 
Without it it is rather hard to find >out what was happening. Unfortunately, crash information never appears in regular system log, at least using the metalog logging program. I think I would have to configure the netconsole to do that. > > Call Trace: {i8042_interrupt+111} >> {commit_timeout+0} >> {run_timer_softirq+387} >> {__do_softirq+113} >> {call_softirq+31} {do_softirq+53} >> {apic_timer_interrupt+132} >> {do_get_write_access+118} >> {do_get_write_access+94} {__getblk+47} >> {filldir+0} >> {journal_get_write_access+41} >> {ext3_reserve_inode+write+76} >> {filldir+0} >> {ext3_mark_inode_dirty+56} >> {journal_start_229} >> {ext3_dirty_inode+113} >> {__mark_inode_dirty+52} >> {update_atime+123} {vfs_readdir+166} >> {syst_getdents+130} {sys_fcntl+830} >> {system_call+126} >> >> Code: 8b 40 18 48 c1 e0 07 48 8b 98 08 58 5b 80 4c 01 e3 48 89 df >> RIP {try_to_wake_up+57} RSP >> <0>Kernel panic - not syncing: Aiee, killing interrupt handler! > > Bye > Honza >-- >Jan Kara >SuSE CR Labs -- Maurice Volaski, mvolaski at aecom.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University From adilger at clusterfs.com Tue Feb 21 05:07:40 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Mon, 20 Feb 2006 22:07:40 -0700 Subject: ext3 involved in kernel panic in 2.6.13? In-Reply-To: References: Message-ID: <20060221050740.GB13382@schatzie.adilger.int> On Feb 19, 2006 14:09 -0500, Maurice Volaski wrote: > A panic occurred, which contains references to ext3 code. > > I'm not sure how others manage to get these typed out, Normally a serial console is best, and if you have at least 2 machines you can cross-connect the serial ports with a NULL-modem cable and run a terminal emulator (e.g. minicom) to log it to disk on the other system. Having netdump is also a good choice, though maybe not quite as reliable as a real serial console. 
> but I'm manually typing it from what's on the monitor: > > Call Trace: {i8042_interrupt+111} > {commit_timeout+0} > {run_timer_softirq+387} > {__do_softirq+113} > {call_softirq+31} {do_softirq+53} > {apic_timer_interrupt+132} At this point (and above) the process is in an IRQ handler, so it is likely that the problem exists somewhere at that level. However, the critical part of the oops is missing - what actually went wrong? It could be a BUG, which is a kernel assertion, or it could be a bad pointer dereference, or anything really. There is nothing here which indicates what the problem is. > {do_get_write_access+118} > {do_get_write_access+94} {__getblk+47} > {filldir+0} > {journal_get_write_access+41} > {ext3_reserve_inode+write+76} > {filldir+0} > {ext3_mark_inode_dirty+56} > {journal_start_229} > {ext3_dirty_inode+113} > {__mark_inode_dirty+52} > {update_atime+123} {vfs_readdir+166} > {syst_getdents+130} {sys_fcntl+830} > {system_call+126} Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From pradeep.vincent at gmail.com Thu Feb 23 09:26:34 2006 From: pradeep.vincent at gmail.com (Pradeep Vincent) Date: Thu, 23 Feb 2006 01:26:34 -0800 Subject: Ext3: Ordered : Fsync question Message-ID: <9fda5f510602230126r3606cc5j74289638602ccfbe@mail.gmail.com> Does Fsync of a file on a ext3 fs mounted with "ordered" option(the default) result in flush the dirty data buffers in the fs that correspond to previous transactions. In other words, if I keep writing to file1 (lots of data), log something to file2, keep fsyncing file2 after every write - does this mean file1 data would be committed by fsyncs on file2. Please copy me on your replies (pradeep.vincent at gmail.com) Thanks, Pradeep Vincent From hahaha_30k at yahoo.com Fri Feb 24 00:43:33 2006 From: hahaha_30k at yahoo.com (Robinson Tiemuqinke) Date: Thu, 23 Feb 2006 16:43:33 -0800 (PST) Subject: During FC1 to FC4 upgrade, Do I need to upgrade Ext3 file systems? 
In-Reply-To: <9fda5f510602230126r3606cc5j74289638602ccfbe@mail.gmail.com>
Message-ID: <20060224004333.65772.qmail@web36713.mail.mud.yahoo.com>

Hi,

I'm doing an FC1 to FC4 upgrade these days and I find that the ext3 file system features of FC1 and FC4 are different. On FC4, three more ext3 file system features are turned on: ext_attr, resize_inode, and dir_index.

FC4:
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

FC1:
Filesystem features: has_journal filetype needs_recovery sparse_super large_file

So how do I add these 3 ext3 features to untouched data partitions like /home after my server is upgraded to FC4? Do I have to do it manually, or will the upgrade do it for me automatically? I'm afraid of losing precious data but would still like to have the cool new features.

Any suggestions are greatly welcomed.

Thanks.

From adilger at clusterfs.com Sat Feb 25 01:25:20 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 24 Feb 2006 18:25:20 -0700
Subject: Linux performance bug: fsync() for files with zero links
In-Reply-To:
References:
Message-ID: <20060225012520.GZ26809@schatzie.adilger.int>

On Feb 25, 2006 05:32 +0500, Victor Porton wrote:
> Linux kernel (as of 2.6.15.4) has the following performance bug:
>
> Syncing (fsync() or fdatasync()) files with zero links (deleted files) is
> not a no-op, as it should be.
>
> See details, a test C program, and the rationale in the URL below:
>
> http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync
>
> In the article with the URL above it is also explained how to make much more
> efficient /tmp directory when this bug will be fixed.
>
> Somebody please make a patch.
Of course, for a cluster filesystem it does make sense that fsync flushes the data to disk even if the file has no links, because there may be other clients that are accessing the same file... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From sct at redhat.com Tue Feb 28 17:58:17 2006 From: sct at redhat.com (Stephen C. Tweedie) Date: Tue, 28 Feb 2006 12:58:17 -0500 Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with zero links In-Reply-To: <20060228163017.GC22017@harddisk-recovery.com> References: <20060228115308.GA22017@harddisk-recovery.com> <20060228163017.GC22017@harddisk-recovery.com> Message-ID: <1141149497.3863.7.camel@orbit.scot.redhat.com> Hi, On Tue, 2006-02-28 at 17:30 +0100, Erik Mouw wrote: > > From man write(2): > > > > write writes up to count bytes to the file referenced by the file > > descriptor fd from the buffer starting at buf. POSIX requires that a > > read() which can be proved to occur after a write() has returned > > returns the new data. Note that not all file systems are POSIX con- > > forming. > > AFAIK that's read() from the same process, not read() from another > process. No, it's read() from any process. fsync() has absolutely no effect in the scenario you describe. This is different from fflush() of buffered IO written by fwrite(): the fflush() *is* needed when using buffered IO if you want to make this guarantee. > Otherwise there would be no need for fsync()/fdatasync(). No -- f[data]sync() is there only to force the flush to disk. The effects of fsync are completely invisible to running processes (apart from some indirect effects, such as performance side-effects incurred due to the disk accesses.) But we still need fsync() to be able to guarantee that data is stable on disk, if we want to support applications that have guaranteed consistency properties over power failure (eg. 
a mail spooler should not tell a remote mail-sending host that an email has been accepted until an fsync() or similar syscall has guaranteed that it's on disk.) > But look at my example. tail(1) uses fstat64() to figure out if > /var/log/messages changed. Your proposal for a patch will break that. No, it won't. > Again: the number of links of an inode is not a reason to break > established semantics. Correct. And the semantics *will* change with this patch, but in a subtle way. Ext3 happens to guarantee that after fsync(), *all* metadata for a file --- including directory metadata --- are synchronised to disk. So if you unlink an open file and then fsync() it, you are guaranteed that the unlink has been committed to disk. This is not, strictly speaking, a behaviour required by POSIX; but it's still useful, and would be broken if we disabled fsync() for files with i_nlink==0. --Stephen From jack at suse.cz Mon Feb 20 09:25:46 2006 From: jack at suse.cz (Jan Kara) Date: Mon, 20 Feb 2006 10:25:46 +0100 Subject: ext3 involved in kernel panic in 2.6.13? In-Reply-To: References: Message-ID: <20060220092546.GA12208@atrey.karlin.mff.cuni.cz> > Dual Opteron system running ext3 atop drbd (network RAID) devices, > which, in turn, are atop LVM logical volumes. The underlying device > is hardware SCSI RAID via a LSILogic HBA. The kernel is vanilla > 2.6.13 on a Gentoo-based system. > > A panic occurred, which contains references to ext3 code. > > I'm not sure how others manage to get these typed out, but I'm > manually typing it from what's on the monitor: There should be more in the logs (just before the Call Trace:). Didn't you capture also that information? Without it it is rather hard to find out what was happening. 
> Call Trace: {i8042_interrupt+111}
> {commit_timeout+0}
> {run_timer_softirq+387}
> {__do_softirq+113}
> {call_softirq+31} {do_softirq+53}
> {apic_timer_interrupt+132}
> {do_get_write_access+118}
> {do_get_write_access+94} {__getblk+47}
> {filldir+0}
> {journal_get_write_access+41}
> {ext3_reserve_inode+write+76}
> {filldir+0}
> {ext3_mark_inode_dirty+56}
> {journal_start_229}
> {ext3_dirty_inode+113}
> {__mark_inode_dirty+52}
> {update_atime+123} {vfs_readdir+166}
> {syst_getdents+130} {sys_fcntl+830}
> {system_call+126}
>
> Code: 8b 40 18 48 c1 e0 07 48 8b 98 08 58 5b 80 4c 01 e3 48 89 df
> RIP {try_to_wake_up+57} RSP
> <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

Bye
Honza
--
Jan Kara
SuSE CR Labs

From rogel at ext.upr.edu.cu Fri Feb 24 10:15:33 2006
From: rogel at ext.upr.edu.cu (Rogel Miguez)
Date: Fri, 24 Feb 2006 05:15:33 -0500 (CST)
Subject: kernel panic
Message-ID: <2087.10.2.80.201.1140776133.squirrel@correo.upr.edu.cu>

What should I do?

I have problems with the compiled kernel: I compiled kernel 2.6.13.4 with make allnoconfig, with make allyesconfig, and with make defconfig. The LILO entry is generated automatically, and when I restart the computer it shows me the following error:

kernel panic - not syncing : VFS : Unable to mount root fs on unknown-block (0,0)

Rogel

From porton at ex-code.com Sat Feb 25 00:32:51 2006
From: porton at ex-code.com (Victor Porton)
Date: Sat, 25 Feb 2006 05:32:51 +0500 (YEKT)
Subject: Linux performance bug: fsync() for files with zero links
Message-ID:

Linux kernel (as of 2.6.15.4) has the following performance bug:

Syncing (fsync() or fdatasync()) files with zero links (deleted files) is not a no-op, as it should be.

See details, a test C program, and the rationale at the URL below:

http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync

The article at the URL above also explains how to make a much more efficient /tmp directory once this bug is fixed.

Somebody please make a patch.
--
Victor Porton (porton at ex-code.com) - http://porton.ex-code.com

From erik at harddisk-recovery.com Tue Feb 28 11:53:08 2006
From: erik at harddisk-recovery.com (Erik Mouw)
Date: Tue, 28 Feb 2006 12:53:08 +0100
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with zero links
In-Reply-To: <20060225012520.GZ26809@schatzie.adilger.int>
References: <20060225012520.GZ26809@schatzie.adilger.int>
Message-ID: <20060228115308.GA22017@harddisk-recovery.com>

On Fri, Feb 24, 2006 at 06:25:20PM -0700, Andreas Dilger wrote:
> On Feb 25, 2006 05:32 +0500, Victor Porton wrote:
> > Linux kernel (as of 2.6.15.4) has the following performance bug:
> >
> > Syncing (fsync() or fdatasync()) files with zero links (deleted files) is
> > not a no-op, as it should be.
> >
> > See details, a test C program, and the rationale in the URL below:
> >
> > http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync
> >
> > In the article with the URL above it is also explained how to make much more
> > efficient /tmp directory when this bug will be fixed.
> >
> > Somebody please make a patch.
>
> Of course, for a cluster filesystem it does make sense that fsync flushes
> the data to disk even if the file has no links, because there may be other
> clients that are accessing the same file...

It even makes sense on a single machine with multiple programs still accessing the same file. You want fsync() and fdatasync() to work regardless of the number of links. Not doing so could subtly break programs. For example:

time  tty0                        tty1                   syslogd
0     tail -f /var/log/messages
1                                                        write(messages, "blah");
2                                                        fsync(messages);
3     blah
4                                 rm /var/log/messages
5                                                        write(messages, "foobar");
6                                                        fsync(messages);
7     (nothing)

At step 7 you should immediately see the "foobar" from syslogd, but because of the OP's proposed optimisation, you will only see it some time in the future.
Erik

--
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

From porton at ex-code.com Tue Feb 28 15:50:53 2006
From: porton at ex-code.com (Victor Porton)
Date: Tue, 28 Feb 2006 20:50:53 +0500 (YEKT)
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with zero links
In-Reply-To: <20060228115308.GA22017@harddisk-recovery.com>
Message-ID:

On 28-Feb-2006 Erik Mouw wrote:
> On Fri, Feb 24, 2006 at 06:25:20PM -0700, Andreas Dilger wrote:
>> On Feb 25, 2006 05:32 +0500, Victor Porton wrote:
>> > Linux kernel (as of 2.6.15.4) has the following performance bug:
>> >
>> > Syncing (fsync() or fdatasync()) files with zero links (deleted files) is
>> > not a no-op, as it should be.
>> >
>> > See details, a test C program, and the rationale in the URL below:
>> >
>> > http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync

...

> It even makes sense on a single machine with multiple programs still
> accessing the same file. You want fsync() and fdatasync() to work
> regardless of the number of links. Not doing so could subtly break
> programs. For example:

Erik, what you said above is wrong.

There is no need to sync this file to disk (except when we are out of memory). It is enough to sync the buffers in MEMORY.

From man write(2):

    write writes up to count bytes to the file referenced by the file
    descriptor fd from the buffer starting at buf. POSIX requires that a
    read() which can be proved to occur after a write() has returned
    returns the new data. Note that not all file systems are POSIX
    conforming.

According to my understanding of the above paragraph, there is no need to do any kind of syncing after a write in order for other processes to read the updated data. POSIX already guarantees it and we do not need fsync() for this.

Somebody with kernel programming experience please update the Linux kernel CVS to not uselessly sync files with zero links.
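The reading of write(2) given above can be tried out with a small sketch. It is only an illustration under stated assumptions: it uses two independently opened descriptors in a single process rather than two separate processes, it assumes a POSIX system where an open file can be unlinked, and the function name is made up for the example.

```python
import os
import tempfile


def read_back_without_fsync(payload: bytes) -> bytes:
    """Write through one descriptor, read through another, never calling fsync()."""
    scratch_dir = tempfile.mkdtemp()
    path = os.path.join(scratch_dir, "scratch")
    writer = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    reader = os.open(path, os.O_RDONLY)  # second, independent open of the same file
    try:
        os.unlink(path)                  # link count is now zero
        os.write(writer, payload)        # no fsync() or fdatasync() follows
        return os.read(reader, len(payload))
    finally:
        os.close(writer)
        os.close(reader)
        os.rmdir(scratch_dir)


print(read_back_without_fsync(b"foobar"))
```

The reader sees the data as soon as write() has returned, even though the file has no links and nothing was forced to disk. The sketch says nothing about durability after a power failure, which is the separate job fsync() is meant for.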
I am right, this should be implemented. (However, this may be made an optional feature (either at config time or at run time), because my suggestion may sometimes (rarely) cause data loss in DELETED files, preventing their undeletion. Indeed, I think we could reasonably make this feature non-optional, as it would harm data safety only a little, but your mileage on whether to make it optional may vary.)

--
Victor Porton (porton at ex-code.com) - http://porton.ex-code.com

From erik at harddisk-recovery.com Tue Feb 28 16:30:17 2006
From: erik at harddisk-recovery.com (Erik Mouw)
Date: Tue, 28 Feb 2006 17:30:17 +0100
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with zero links
In-Reply-To:
References: <20060228115308.GA22017@harddisk-recovery.com>
Message-ID: <20060228163017.GC22017@harddisk-recovery.com>

On Tue, Feb 28, 2006 at 08:50:53PM +0500, Victor Porton wrote:
> On 28-Feb-2006 Erik Mouw wrote:
> > It even makes sense on a single machine with multiple programs still
> > accessing the same file. You want fsync() and fdatasync() to work
> > regardless of the number of links. Not doing so could subtly break
> > programs. For example:
>
> Erik, what you said above is wrong.
>
> There is no need to sync this file to disk (except when we are out of
> memory). It is enough to sync the buffers in MEMORY.
>
> From man write(2):
>
> write writes up to count bytes to the file referenced by the file
> descriptor fd from the buffer starting at buf. POSIX requires that a
> read() which can be proved to occur after a write() has returned
> returns the new data. Note that not all file systems are POSIX
> conforming.

AFAIK that's read() from the same process, not read() from another process. Otherwise there would be no need for fsync()/fdatasync().

But look at my example. tail(1) uses fstat64() to figure out if /var/log/messages changed. Your proposal for a patch will break that.

Again: the number of links of an inode is not a reason to break established semantics.
Erik

--
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

From porton at ex-code.com Tue Feb 28 20:57:51 2006
From: porton at ex-code.com (Victor Porton)
Date: Wed, 01 Mar 2006 01:57:51 +0500 (YEKT)
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with zero links
In-Reply-To: <1141149497.3863.7.camel@orbit.scot.redhat.com>
Message-ID:

On 28-Feb-2006 Stephen C. Tweedie wrote:
> On Tue, 2006-02-28 at 17:30 +0100, Erik Mouw wrote:
>
>> > From man write(2):
>> >
>> > write writes up to count bytes to the file referenced by the file
>> > descriptor fd from the buffer starting at buf. POSIX requires that a
>> > read() which can be proved to occur after a write() has returned
>> > returns the new data. Note that not all file systems are POSIX
>> > conforming.

Erik, Stephen Tweedie has already correctly answered your other concerns. I will add a note about the semantics:

>> Again: the number of links of an inode is not a reason to break
>> established semantics.
>
> Correct. And the semantics *will* change with this patch, but in a
> subtle way.
>
> Ext3 happens to guarantee that after fsync(), *all* metadata for a file
> --- including directory metadata --- are synchronised to disk. So if
> you unlink an open file and then fsync() it, you are guaranteed that the
> unlink has been committed to disk. This is not, strictly speaking, a
> behaviour required by POSIX; but it's still useful, and would be broken
> if we disabled fsync() for files with i_nlink==0.

OK, Stephen, you have pointed out where following my idea would really change the semantics significantly, which it should not do.

So fsync() (but not fdatasync()) should indeed have an effect on an inode with zero links, but _only the first time_. Precisely:

1. Every fd should have an associated boolean flag "no_links_committed" (to save a bit of memory it could instead be implemented e.g. as having -1 (minus one) as the count of links in the fd data structure instead of 0).

2. When a file is unlinked and the number of links becomes zero, no_links_committed should be reset (or zero written as the count of links in the fd data structure).

3. When fsync() (but not fdatasync(), which is simpler) is called on a file:

   - If the number of links is above 0, proceed as usual.

   - If the number of links is zero:

     * If no_links_committed is false, do the directory synchronization (as mentioned by Stephen) but no other synchronization, and then set no_links_committed to true (or the number of links to -1 for a slightly more efficient implementation).

     * If no_links_committed is true, do nothing.

--
Victor Porton (porton at ex-code.com) - http://porton.ex-code.com

From robe at amd.co.at Tue Feb 28 23:33:27 2006
From: robe at amd.co.at (Michael Renner)
Date: Tue, 28 Feb 2006 23:33:27 +0000 (UTC)
Subject: Status of fragment support, advantages of having fewer indoes
Message-ID:

Hi,

There hasn't been much information regarding fragment support in ext2/3 since 2003 [1], when Andreas stated that there were problems with the xattr implementation. Has this changed in the meantime?

My second question is regarding the bytes-per-inode ratio: what benefits would I gain from having fewer inodes? I reckon it's only disk space (if so, how much?).

best regards,
Michael Renner

[1] http://www.kerneltraffic.org/kernel-traffic/kt20030428_214.html#8
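As a rough way to put numbers on the disk-space part of the question, the sketch below estimates the space taken by inode tables for a few bytes-per-inode ratios. It rests on two assumptions that are not stated in the message: a 128-byte on-disk inode, and one inode created per bytes-per-inode bytes of filesystem, with rounding to block-group boundaries ignored. The 100 GiB filesystem size and the function name are purely illustrative.

```python
# Assumption: 128-byte on-disk inodes; not taken from the message above.
INODE_SIZE_BYTES = 128


def inode_table_bytes(fs_bytes: int, bytes_per_inode: int) -> int:
    """Approximate space used by inode tables, ignoring block-group rounding."""
    inode_count = fs_bytes // bytes_per_inode
    return inode_count * INODE_SIZE_BYTES


fs_bytes = 100 * 1024**3  # hypothetical 100 GiB filesystem
for ratio in (4096, 8192, 16384, 65536):
    mib = inode_table_bytes(fs_bytes, ratio) / 2**20
    print(f"bytes-per-inode {ratio}: about {mib:.0f} MiB of inode tables")
```

Under these assumptions each doubling of the ratio halves the inode-table space, so the saving is a fixed fraction of the filesystem: 128 / bytes-per-inode of its size.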