From fk at linuxburg.de  Wed Feb  1 11:47:08 2006
From: fk at linuxburg.de (Felix E. Klee)
Date: Wed, 1 Feb 2006 12:47:08 +0100
Subject: df reports false size
In-Reply-To: <20060131175924.GA11642@schatzie.adilger.int>
References: <200601301938.54145.fk@linuxburg.de>
	<200601311331.20929.fk@linuxburg.de>
	<20060131175924.GA11642@schatzie.adilger.int>
Message-ID: <200602011247.08493.fk@linuxburg.de>

Am Dienstag, 31. Januar 2006 18:59 schrieb Andreas Dilger:
> You can use "mount -t bind / /mnt" and then "/mnt/nfsroot" will be the
> underlying directory.

That's what I eventually used in order to be able to remove the directory (I 
got the hint on another mailing list).

Thanks for your help!

-- 
Dipl.-Phys. Felix E. Klee
Email: fk at linuxburg.de (work), felix.klee at inka.de (home)
Tel: +49 721 8307937, Fax: +49 721 8307936
Linuxburg, Goethestr. 15A, 76135 Karlsruhe, Germany


From pradeep.vincent at gmail.com  Sat Feb  4 02:17:41 2006
From: pradeep.vincent at gmail.com (Pradeep Vincent)
Date: Fri, 3 Feb 2006 18:17:41 -0800
Subject: Ext3 IO context
Message-ID: <9fda5f510602031817l34dfcb50x7540bed68f9ea5c@mail.gmail.com>

I am running a BDB-based application on top of ext3 on Linux (RH 7.2);
the application and the configuration are exactly the same in each case.
When the application writes, the I/O sometimes happens in the context of
the application thread, but most of the time it happens in the context
of kjournald.

What does it mean for I/O to happen in the context of the initiating
process for non-O_SYNC file I/O? I was thinking that the dirty file cache
thresholds determine whether a write to the filesystem turns into block
I/O in the process context. Is that how ext3 works? I am not very
familiar with what kjournald does: is that thread meant to initiate I/O
just for metadata updates, or for data updates as well?

I figured out the process context for I/Os using

sysctl -w vm.block_dump='1' which throws debug messages into kern.log.
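
For what it is worth, a minimal sketch of how such messages could be tallied per process. It assumes each message has the shape "comm(pid): READ|WRITE block N on dev"; that shape, the sample lines (made up for illustration) and the helper name tally_block_dump are assumptions, so check them against the real kern.log before relying on this.

```python
import re
from collections import Counter

# Assumed message shape: "<comm>(<pid>): READ|WRITE block <n> on <dev>".
# The exact format may differ between kernel versions.
LINE_RE = re.compile(
    r"(?P<comm>[^\s(]+)\((?P<pid>\d+)\): "
    r"(?P<op>READ|WRITE) block (?P<block>\d+) on (?P<dev>\S+)"
)


def tally_block_dump(lines):
    """Count block I/O requests per (process name, operation)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("comm"), match.group("op"))] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = [
    "kernel: kjournald(187): WRITE block 4120 on sda1",
    "kernel: kjournald(187): WRITE block 4128 on sda1",
    "kernel: myapp(2301): WRITE block 90112 on sda1",
    "kernel: myapp(2301): READ block 512 on sda1",
]

if __name__ == "__main__":
    for (comm, op), n in sorted(tally_block_dump(SAMPLE).items()):
        print(f"{comm:10s} {op:5s} {n}")
```

Grouping by process name rather than pid keeps all kjournald requests together, which is the comparison of interest here.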

Please cc pradeep.vincent at gmail.com

Thanks,

Pradeep Vincent


From tibor.tarnai at sap.com  Tue Feb 14 09:35:10 2006
From: tibor.tarnai at sap.com (Tarnai, Tibor)
Date: Tue, 14 Feb 2006 10:35:10 +0100
Subject: Ext3 problems
Message-ID: <A0F7FF6EDD799C4181A46A28561D514602709653@dewdfe21.wdf.sap.corp>

Hi!
 
I was really stupid! I defragmented my ext3 partition with e2defrag. Although I have done that many times on Debian without problems, on my new Gentoo installation it had bad results. When I tried to boot from this partition, I got serious e2fsck errors.
e2fsck reported that (only) inode 8 had illegal blocks, so I ran e2fsck -fy /dev/hda2, which cleared the illegal blocks in inode 8. After that I copied the whole partition with cp -pr to a different location and created a new ext3 filesystem on /dev/hda2.
Is it possible that I was lucky enough that only the journal inode was corrupted and the rest of my system is intact? How can I make sure that this assumption is true?
 
Tibor Tarnai
Junior Developer
SAP Labs Hungary
1031 Budapest Záhony u. 7.
Tel: +36 1 885 7237
Fax: +36 1 885 7575
mailto:tibor.tarnai at sap.com
http://www.sap.hu
 

From pegasus at nerv.eu.org  Wed Feb 15 17:07:20 2006
From: pegasus at nerv.eu.org (Jure Pečar)
Date: Wed, 15 Feb 2006 18:07:20 +0100
Subject: max journal size
Message-ID: <20060215180720.6158531f.pegasus@nerv.eu.org>


Hi all,

The tune2fs man page says that the maximum journal size is 102,400 filesystem blocks, which translates to ~100MB with 1KB blocks or ~400MB with 4KB blocks. I wonder: why does this limitation exist?

Now that relatively cheap SSD devices exist (Gigabyte i-RAM) that offer up to 4GB of space, it would be extremely useful to use the whole capacity of such a device for full data journaling.

-- 

Jure Pečar
http://jure.pecar.org


From adilger at clusterfs.com  Wed Feb 15 20:52:38 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 15 Feb 2006 13:52:38 -0700
Subject: max journal size
In-Reply-To: <20060215180720.6158531f.pegasus@nerv.eu.org>
References: <20060215180720.6158531f.pegasus@nerv.eu.org>
Message-ID: <20060215205238.GJ13382@schatzie.adilger.int>

On Feb 15, 2006  18:07 +0100, Jure Pečar wrote:
> The tune2fs man page says that the maximum journal size is 102,400 filesystem blocks, which translates to ~100MB with 1KB blocks or ~400MB with 4KB blocks. I wonder: why does this limitation exist?

The limit exists to avoid users making the journal too large and consuming
all of their RAM with pinned buffers while the journal is committing buffers
to the journal and checkpointing them to disk.  Under heavy load it is
possible for jbd to have 3/4*journal_size of lowmem pinned.
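
As a rough illustration of the numbers involved, here is a small sketch of the arithmetic. It takes the 102,400-block limit from the man page and the 3/4 worst case mentioned above; the function name journal_limits is made up for this example and is not part of e2fsprogs or jbd.

```python
MAX_JOURNAL_BLOCKS = 102_400  # limit quoted from the tune2fs man page


def journal_limits(block_size, blocks=MAX_JOURNAL_BLOCKS):
    """Return (journal size, worst-case pinned memory), both in bytes."""
    size = blocks * block_size
    pinned = size * 3 // 4  # the "3/4*journal_size" worst case from above
    return size, pinned


if __name__ == "__main__":
    for block_size in (1024, 4096):
        size, pinned = journal_limits(block_size)
        print(f"{block_size}-byte blocks: journal {size // 2**20} MiB, "
              f"up to {pinned // 2**20} MiB pinned")
```

With 4KB blocks this comes to a 400 MiB journal and up to 300 MiB of pinned memory, which is a large share of lowmem on a 32-bit machine.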

> Now that relatively cheap SSD devices exist (Gigabyte i-RAM) that offer up to 4GB of space, it would be extremely useful to use the whole capacity of such a device for full data journaling.

This limit does not apply when using an external journal.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From pegasus at nerv.eu.org  Thu Feb 16 09:23:39 2006
From: pegasus at nerv.eu.org (Jure Pečar)
Date: Thu, 16 Feb 2006 10:23:39 +0100
Subject: max journal size
In-Reply-To: <20060215205238.GJ13382@schatzie.adilger.int>
References: <20060215180720.6158531f.pegasus@nerv.eu.org>
	<20060215205238.GJ13382@schatzie.adilger.int>
Message-ID: <20060216102339.70aa649a.pegasus@nerv.eu.org>

On Wed, 15 Feb 2006 13:52:38 -0700
Andreas Dilger <adilger at clusterfs.com> wrote:

> The limit exists to avoid users making the journal too large and consuming
> all of their RAM with pinned buffers while the journal is committing buffers
> to the journal and checkpointing them to disk.  Under heavy load it is
> possible for jbd to have 3/4*journal_size of lowmem pinned.
> 
> This limit does not apply when using an external journal.

Excellent. So that means I can use any size for external journal.

Next question: the point of having an SSD for the journal is to force as much I/O through it as possible, especially writes. The default journal commit interval is something like 5 seconds, yes? My application, a busy mail gateway, would IMHO benefit from much longer journal commit intervals. As I understand it, a journal commit is an atomic operation: nothing else can do I/O to that filesystem at the same time. With large journals and a long time between commits, the commit itself takes a measurable amount of time. What happens if I pull the plug during such a commit? How well tested is this area?


-- 

Jure Pečar
http://jure.pecar.org


From adilger at clusterfs.com  Thu Feb 16 21:30:53 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 16 Feb 2006 14:30:53 -0700
Subject: max journal size
In-Reply-To: <20060216102339.70aa649a.pegasus@nerv.eu.org>
References: <20060215180720.6158531f.pegasus@nerv.eu.org>
	<20060215205238.GJ13382@schatzie.adilger.int>
	<20060216102339.70aa649a.pegasus@nerv.eu.org>
Message-ID: <20060216213053.GY13382@schatzie.adilger.int>

On Feb 16, 2006  10:23 +0100, Jure Pečar wrote:
> Default journal commit is something like 5 seconds, yes? My application
> - busy mail gateway - would imho benefit from much larger journal commit
> times. As I understand, journal commit is atomic operation - nothing else
> can do io to that filesystem at the same time.

This is incorrect.  While a journal commit is atomic in the sense that it
will either complete entirely or not at all (in case of failure), ext3/jbd
does not prevent new changes from being made while the transaction is
committing, unless the journal becomes totally full.

> With large journals and long time between commits, the commit itself
> takes a measurable amount of time. What happens if I pull the plug
> during such commit? How well tested area is this?

If you interrupt a committing transaction then all operations in that
transaction (which may be many for a large journal) will be lost (i.e.
rolled back).  If your operations are synchronous then they will not
return until the journal has finished the commit (assuming you do not
have write-cache enabled on the disks).  This is fairly well tested,
as it happens all the time.
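
To make the synchronous case concrete, here is a minimal sketch using O_SYNC from Python. The helper name sync_write is made up for this example. Whether the data has really reached the platter when the call returns still depends on the disk write cache, as noted above.

```python
import os
import tempfile


def sync_write(path, data):
    """Write data through a descriptor opened with O_SYNC.

    Each write() on such a descriptor returns only after the kernel has
    pushed the data (and the metadata needed to read it back) to the device.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        while written < len(data):
            # os.write() may write fewer bytes than requested, so loop.
            written += os.write(fd, data[written:])
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "queue-entry")
        print(sync_write(target, b"message body\n"), "bytes written synchronously")
```

An application that only needs durability at certain points can instead write normally and call fsync() at those points.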

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From ariel.burbaickij at gmail.com  Sat Feb 18 15:37:19 2006
From: ariel.burbaickij at gmail.com (Ariel Burbaickij)
Date: Sat, 18 Feb 2006 16:37:19 +0100
Subject: unplausible "no space left on deivce"
Message-ID: <3058f9b40602180737g11fb82b6we0cc93382aa7d187@mail.gmail.com>

Hello all,
I recently ran into the following issue (the description is, out of
necessity, a bit lengthy):
I had a corrupted partition with an ext3 filesystem on it. Fortunately,
the partition was mirrored, so I was able to dd the mirrored partition
to a file like this:

dd bs=512 if=/dev/<where_the_mirrored_partition_is> of=some_file

and write the content to the primary partition like this:

dd bs=512 if=some_file of=/dev/<where_the_primary_is>.


The most crucial thing is the block size chosen: 512 bytes.

Both operations went fine.

Now to the trouble: I was able to create normal files of whatever size
on the recreated primary partition, but I was not able to create any
directories. I got "no space left on device" with disk/inode usage
around 1%. Both the primary and the mirrored partition were formatted
with block size 4096 and, indeed, when I re-executed the dd command
everything worked fine. I would consider this a bug and would be glad
to hear your opinions.
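
One way to check whether the transfer block size alone can change the copied data is sketched below. It rests on the fact that dd's bs only sets how many bytes are moved per read/write call, so copies made with different sizes should be byte-identical. The sketch works on ordinary files rather than block devices, and the helper names are made up for this example.

```python
import hashlib
import os
import tempfile


def copy_in_chunks(src, dst, chunk_size):
    """Copy src to dst, moving chunk_size bytes per call (like dd bs=...)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(sizes=(512, 4096), length=10_000):
    """Copy one random image with each chunk size and compare all digests."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "image")
        with open(src, "wb") as f:
            # Length is deliberately not a multiple of either chunk size.
            f.write(os.urandom(length))
        digests = {sha256_of(src)}
        for size in sizes:
            dst = os.path.join(tmp, f"copy_{size}")
            copy_in_chunks(src, dst, size)
            digests.add(sha256_of(dst))
        return len(digests) == 1


if __name__ == "__main__":
    print("copies identical:", copies_match())
```

Comparing a checksum of the restored partition against the mirror in the same way would show whether the first dd run really produced different data.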

With Best Regards
Ariel Burbaickij


From mvolaski at aecom.yu.edu  Sun Feb 19 19:09:51 2006
From: mvolaski at aecom.yu.edu (Maurice Volaski)
Date: Sun, 19 Feb 2006 14:09:51 -0500
Subject: ext3 involved in kernel panic in 2.6.13?
Message-ID: <a06230908c01e6d2b77b3@[129.98.90.227]>

Dual Opteron system running ext3 atop drbd (network RAID) devices, 
which, in turn, are atop LVM logical volumes. The underlying device 
is hardware SCSI RAID via a LSILogic HBA. The kernel is vanilla 
2.6.13 on a Gentoo-based system.

A panic occurred, which contains references to ext3 code.

I'm not sure how others manage to get these typed out, but I'm 
manually typing it from what's on the monitor:

Call Trace: <IRQ> <ffffffff802820df>{i8042_interrupt+111} 
<ffffffff80200080>{commit_timeout+0}
<ffffffff8013f143>{run_timer_softirq+387} <ffffffff8013b111>{__do_softirq+113}
<ffffffff8010ee63>{call_softirq+31} <ffffffff80110a55>{do_softirq+53}
<ffffffff8010e5c8>{apic_timer_interrupt+132} <EOI> 
<ffffffff801fb8a6>{do_get_write_access+118}
<ffffffff801fb88e>{do_get_write_access+94} <ffffffff80185d1f>{__getblk+47}
<ffffffff80195170>{filldir+0} <ffffffff801fbf69>{journal_get_write_access+41}
<ffffffff801ec41c>{ext3_reserve_inode+write+76} <ffffffff80195170>{filldir+0}
<ffffffff801ec4d8>{ext3_mark_inode_dirty+56} 
<ffffffff801fa9e5>{journal_start_229}
<ffffffff801ee571>{ext3_dirty_inode+113} 
<ffffffff801a5604>{__mark_inode_dirty+52}
<ffffffff8019bd2b>{update_atime+123} <ffffffff80195016>{vfs_readdir+166}
<ffffffff801952e2>{syst_getdents+130} <ffffffff8019465e>{sys_fcntl+830}
<ffffffff8010dc46>{system_call+126}

Code: 8b 40 18 48 c1 e0 07 48 8b 98 08 58 5b 80 4c 01 e3 48 89 df
RIP <ffffffff8012f369>{try_to_wake_up+57} RSP <ffff810004827e88>
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University


From mvolaski at aecom.yu.edu  Mon Feb 20 17:06:35 2006
From: mvolaski at aecom.yu.edu (Maurice Volaski)
Date: Mon, 20 Feb 2006 12:06:35 -0500
Subject: ext3 involved in kernel panic in 2.6.13?
In-Reply-To: <20060220092546.GA12208@atrey.karlin.mff.cuni.cz>
References: <a06230908c01e6d2b77b3@[129.98.90.227]>
	<20060220092546.GA12208@atrey.karlin.mff.cuni.cz>
Message-ID: <a06230911c01fa4a5783b@[129.98.90.227]>

>  > Dual Opteron system running ext3 atop drbd (network RAID) devices,
>>  which, in turn, are atop LVM logical volumes. The underlying device
>>  is hardware SCSI RAID via a LSILogic HBA. The kernel is vanilla
>>  2.6.13 on a Gentoo-based system.
>>
>>  A panic occurred, which contains references to ext3 code.
>>
>>  I'm not sure how others manage to get these typed out, but I'm
>>  manually typing it from what's on the monitor:
>
>   There should be more in the logs (just before the Call Trace:). Didn't
>you capture also that information? Without it it is rather hard to find
>out what was happening.

Unfortunately, crash information never appears in the regular system log,
at least when using the metalog logging program. I think I would have to
configure netconsole to capture it.

>  > Call Trace: <IRQ> <ffffffff802820df>{i8042_interrupt+111}
>>  <ffffffff80200080>{commit_timeout+0}
>>  <ffffffff8013f143>{run_timer_softirq+387}
>>  <ffffffff8013b111>{__do_softirq+113}
>>  <ffffffff8010ee63>{call_softirq+31} <ffffffff80110a55>{do_softirq+53}
>>  <ffffffff8010e5c8>{apic_timer_interrupt+132} <EOI>
>>  <ffffffff801fb8a6>{do_get_write_access+118}
>>  <ffffffff801fb88e>{do_get_write_access+94} <ffffffff80185d1f>{__getblk+47}
>>  <ffffffff80195170>{filldir+0}
>>  <ffffffff801fbf69>{journal_get_write_access+41}
>>  <ffffffff801ec41c>{ext3_reserve_inode+write+76}
>>  <ffffffff80195170>{filldir+0}
>>  <ffffffff801ec4d8>{ext3_mark_inode_dirty+56}
>>  <ffffffff801fa9e5>{journal_start_229}
>>  <ffffffff801ee571>{ext3_dirty_inode+113}
>>  <ffffffff801a5604>{__mark_inode_dirty+52}
>>  <ffffffff8019bd2b>{update_atime+123} <ffffffff80195016>{vfs_readdir+166}
>>  <ffffffff801952e2>{syst_getdents+130} <ffffffff8019465e>{sys_fcntl+830}
>>  <ffffffff8010dc46>{system_call+126}
>>
>>  Code: 8b 40 18 48 c1 e0 07 48 8b 98 08 58 5b 80 4c 01 e3 48 89 df
>>  RIP <ffffffff8012f369>{try_to_wake_up+57} RSP <ffff810004827e88>
>>  <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
>
>								Bye
>									Honza
>--
>Jan Kara <jack at suse.cz>
>SuSE CR Labs


-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University


From adilger at clusterfs.com  Tue Feb 21 05:07:40 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 20 Feb 2006 22:07:40 -0700
Subject: ext3 involved in kernel panic in 2.6.13?
In-Reply-To: <a06230908c01e6d2b77b3@[129.98.90.227]>
References: <a06230908c01e6d2b77b3@[129.98.90.227]>
Message-ID: <20060221050740.GB13382@schatzie.adilger.int>

On Feb 19, 2006  14:09 -0500, Maurice Volaski wrote:
> A panic occurred, which contains references to ext3 code.
> 
> I'm not sure how others manage to get these typed out,

Normally a serial console is best, and if you have at least 2 machines
you can cross-connect the serial ports with a NULL-modem cable and run
a terminal emulator (e.g. minicom) to log it to disk on the other system.
Having netdump is also a good choice, though maybe not quite as reliable
as a real serial console.

> but I'm manually typing it from what's on the monitor:
> 
> Call Trace: <IRQ> <ffffffff802820df>{i8042_interrupt+111} 
> <ffffffff80200080>{commit_timeout+0}
> <ffffffff8013f143>{run_timer_softirq+387} 
> <ffffffff8013b111>{__do_softirq+113}
> <ffffffff8010ee63>{call_softirq+31} <ffffffff80110a55>{do_softirq+53}
> <ffffffff8010e5c8>{apic_timer_interrupt+132} <EOI> 

At this point (and above) the process is in an IRQ handler, so it is likely
that the problem exists somewhere at that level.  However, the critical
part of the oops is missing - what actually went wrong?  It could be a BUG,
which is a kernel assertion, or it could be a bad pointer dereference, or
anything really.  There is nothing here which indicates what the problem is.

> <ffffffff801fb8a6>{do_get_write_access+118}
> <ffffffff801fb88e>{do_get_write_access+94} <ffffffff80185d1f>{__getblk+47}
> <ffffffff80195170>{filldir+0} 
> <ffffffff801fbf69>{journal_get_write_access+41}
> <ffffffff801ec41c>{ext3_reserve_inode+write+76} 
> <ffffffff80195170>{filldir+0}
> <ffffffff801ec4d8>{ext3_mark_inode_dirty+56} 
> <ffffffff801fa9e5>{journal_start_229}
> <ffffffff801ee571>{ext3_dirty_inode+113} 
> <ffffffff801a5604>{__mark_inode_dirty+52}
> <ffffffff8019bd2b>{update_atime+123} <ffffffff80195016>{vfs_readdir+166}
> <ffffffff801952e2>{syst_getdents+130} <ffffffff8019465e>{sys_fcntl+830}
> <ffffffff8010dc46>{system_call+126}

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From pradeep.vincent at gmail.com  Thu Feb 23 09:26:34 2006
From: pradeep.vincent at gmail.com (Pradeep Vincent)
Date: Thu, 23 Feb 2006 01:26:34 -0800
Subject: Ext3: Ordered : Fsync question
Message-ID: <9fda5f510602230126r3606cc5j74289638602ccfbe@mail.gmail.com>

Does an fsync of a file on an ext3 fs mounted with the "ordered" option
(the default) also flush the dirty data buffers in the fs that
correspond to previous transactions? In other words, if I keep writing
to file1 (lots of data), log something to file2, and keep fsyncing file2
after every write, does this mean file1 data would be committed by
the fsyncs on file2?

Please copy me on your replies (pradeep.vincent at gmail.com)

Thanks,

Pradeep Vincent


From hahaha_30k at yahoo.com  Fri Feb 24 00:43:33 2006
From: hahaha_30k at yahoo.com (Robinson Tiemuqinke)
Date: Thu, 23 Feb 2006 16:43:33 -0800 (PST)
Subject: During FC1 to FC4 upgrade, Do I need to upgrade Ext3 file systems? 
In-Reply-To: <9fda5f510602230126r3606cc5j74289638602ccfbe@mail.gmail.com>
Message-ID: <20060224004333.65772.qmail@web36713.mail.mud.yahoo.com>

Hi,

 I'm doing an FC1 to FC4 upgrade these days and I find
that the ext3 file system features of FC1 and FC4 are
different.

For FC4, three more ext3 file system features are
turned on: ext_attr, resize_inode, and
dir_index.

FC4:

Filesystem features:      has_journal ext_attr
resize_inode dir_index filetype needs_recovery
sparse_super large_file

FC1:
Filesystem features:      has_journal filetype
needs_recovery sparse_super large_file
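
The difference between the two feature lists can be double-checked mechanically. A minimal sketch, using the two lists exactly as pasted above (the function name missing_features is made up for this example):

```python
FC4 = ("has_journal ext_attr resize_inode dir_index filetype "
       "needs_recovery sparse_super large_file")
FC1 = "has_journal filetype needs_recovery sparse_super large_file"


def missing_features(old, new):
    """Return the features listed in `new` but not in `old`, in `new`'s order."""
    old_set = set(old.split())
    return [feature for feature in new.split() if feature not in old_set]


if __name__ == "__main__":
    print(missing_features(FC1, FC4))  # ['ext_attr', 'resize_inode', 'dir_index']
```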

So how do I add these 3 ext3 features to untouched
data partitions like /home after my server is upgraded
to FC4? Do I have to do it manually, or will the upgrade
do it for me automatically? I'm afraid of losing
precious data but would still like to have the cool new
features.

 Any suggestions are greatly welcomed.

Thanks.

 


From adilger at clusterfs.com  Sat Feb 25 01:25:20 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 24 Feb 2006 18:25:20 -0700
Subject: Linux performance bug: fsync() for files with zero links
In-Reply-To: <E1FCnMh-0000tb-00@porton.narod.ru>
References: <E1FCnMh-0000tb-00@porton.narod.ru>
Message-ID: <20060225012520.GZ26809@schatzie.adilger.int>

On Feb 25, 2006  05:32 +0500, Victor Porton wrote:
> Linux kernel (as of 2.6.15.4) has the following performance bug:
> 
> Syncing (fsync() or fdatasync()) files with zero links (deleted files) is not
> a no-op, as it should be.
> 
> See details, a test C program, and the rationale in the URL below:
> 
> http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync
> 
> The article at the URL above also explains how to make a much more
> efficient /tmp directory once this bug is fixed.
> 
> Somebody please make a patch.

Of course, for a cluster filesystem it does make sense that fsync flushes
the data to disk even if the file has no links, because there may be other
clients that are accessing the same file...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From sct at redhat.com  Tue Feb 28 17:58:17 2006
From: sct at redhat.com (Stephen C. Tweedie)
Date: Tue, 28 Feb 2006 12:58:17 -0500
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with
	zero links
In-Reply-To: <20060228163017.GC22017@harddisk-recovery.com>
References: <20060228115308.GA22017@harddisk-recovery.com>
	<E1FE77l-0006Y0-00@porton.narod.ru>
	<20060228163017.GC22017@harddisk-recovery.com>
Message-ID: <1141149497.3863.7.camel@orbit.scot.redhat.com>

Hi,

On Tue, 2006-02-28 at 17:30 +0100, Erik Mouw wrote:

> > From man write(2):
> > 
> >        write  writes  up  to  count  bytes  to the file referenced by the file
> >        descriptor fd from the buffer starting at buf.  POSIX requires  that  a
> >        read()  which  can  be  proved  to  occur  after a write() has returned
> >        returns the new data.  Note that not all file systems  are  POSIX  con-
> >        forming.
> 
> AFAIK that's read() from the same process, not read() from another
> process.

No, it's read() from any process.  fsync() has absolutely no effect in
the scenario you describe.  This is different from fflush() of buffered
IO written by fwrite(): the fflush() *is* needed when using buffered IO
if you want to make this guarantee. 
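
The distinction can be seen in a short sketch. It is a simplification: the reader below is a second file descriptor in the same process standing in for another process, and the function name demo_visibility is made up for this example. No fsync() is issued anywhere.

```python
import os
import tempfile


def read_all(path):
    """Open the file afresh and return its whole contents."""
    with open(path, "rb") as reader:
        return reader.read()


def demo_visibility():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log")

        # Unbuffered: os.write() is a plain write(2) system call.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        os.write(fd, b"direct\n")
        after_write = read_all(path)      # new data is already visible
        os.close(fd)

        # Buffered: data stays in the user-space buffer until flush(),
        # which plays the role of fflush() for stdio.
        f = open(path, "ab")
        f.write(b"buffered\n")
        before_flush = read_all(path)     # buffered data not yet visible
        f.flush()
        after_flush = read_all(path)      # visible once flushed to the kernel
        f.close()

        return after_write, before_flush, after_flush


if __name__ == "__main__":
    for label, data in zip(("after write", "before flush", "after flush"),
                           demo_visibility()):
        print(f"{label:13s} {data!r}")
```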

>  Otherwise there would be no need for fsync()/fdatasync().

No -- f[data]sync() is there only to force the flush to disk.  The
effects of fsync are completely invisible to running processes (apart
from some indirect effects, such as performance side-effects incurred
due to the disk accesses.)  But we still need fsync() to be able to
guarantee that data is stable on disk, if we want to support
applications that have guaranteed consistency properties over power
failure (eg. a mail spooler should not tell a remote mail-sending host
that an email has been accepted until an fsync() or similar syscall has
guaranteed that it's on disk.)
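
A minimal sketch of that ordering for the mail spooler case follows. The function names, the send_reply callback and the "250 OK" reply string are placeholders for this example, not taken from any particular MTA.

```python
import os
import tempfile


def spool_message(path, payload):
    """Write payload and force it to disk before returning."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # user-space buffer -> kernel page cache
        os.fsync(f.fileno())  # kernel page cache -> stable storage
    return True


def accept(path, payload, send_reply):
    """Acknowledge the sender only after the message is stable on disk."""
    if spool_message(path, payload):
        send_reply("250 OK")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        accept(os.path.join(tmp, "msg-0001"), b"Subject: test\n\nhello\n", print)
```

If the power fails before fsync() returns, no acknowledgement has been sent, so the sending host will retry and nothing is silently lost.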

> But look at my example. tail(1) uses fstat64() to figure out if
> /var/log/messages changed. Your proposal for a patch will break that.

No, it won't.

> Again: the number of links of an inode is not a reason to break
> established semantics.

Correct.  And the semantics *will* change with this patch, but in a
subtle way.

Ext3 happens to guarantee that after fsync(), *all* metadata for a file
--- including directory metadata --- are synchronised to disk.  So if
you unlink an open file and then fsync() it, you are guaranteed that the
unlink has been committed to disk.  This is not, strictly speaking, a
behaviour required by POSIX; but it's still useful, and would be broken
if we disabled fsync() for files with i_nlink==0.
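
The sequence in question looks like this from user space. It is a minimal sketch with a made-up function name; it shows only that the calls succeed on an unlinked file, not what reaches the disk.

```python
import os
import tempfile


def unlink_then_fsync():
    """Unlink a file that is still open, then write to it and fsync() it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.unlink(path)                # the name is gone, the inode lives on
            os.write(fd, b"still writable")
            os.fsync(fd)                   # still a valid call on the open fd
            link_count = os.fstat(fd).st_nlink
            os.lseek(fd, 0, os.SEEK_SET)
            data = os.read(fd, 100)
        finally:
            os.close(fd)
        return link_count, data, os.path.exists(path)


if __name__ == "__main__":
    link_count, data, still_named = unlink_then_fsync()
    print("link count:", link_count)
    print("data:", data)
    print("name still exists:", still_named)
```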

--Stephen


From jack at suse.cz  Mon Feb 20 09:25:46 2006
From: jack at suse.cz (Jan Kara)
Date: Mon, 20 Feb 2006 10:25:46 +0100
Subject: ext3 involved in kernel panic in 2.6.13?
In-Reply-To: <a06230908c01e6d2b77b3@[129.98.90.227]>
References: <a06230908c01e6d2b77b3@[129.98.90.227]>
Message-ID: <20060220092546.GA12208@atrey.karlin.mff.cuni.cz>

> Dual Opteron system running ext3 atop drbd (network RAID) devices, 
> which, in turn, are atop LVM logical volumes. The underlying device 
> is hardware SCSI RAID via a LSILogic HBA. The kernel is vanilla 
> 2.6.13 on a Gentoo-based system.
> 
> A panic occurred, which contains references to ext3 code.
> 
> I'm not sure how others manage to get these typed out, but I'm 
> manually typing it from what's on the monitor:

  There should be more in the logs (just before the Call Trace:). Didn't
you capture also that information? Without it it is rather hard to find
out what was happening.

> Call Trace: <IRQ> <ffffffff802820df>{i8042_interrupt+111} 
> <ffffffff80200080>{commit_timeout+0}
> <ffffffff8013f143>{run_timer_softirq+387} 
> <ffffffff8013b111>{__do_softirq+113}
> <ffffffff8010ee63>{call_softirq+31} <ffffffff80110a55>{do_softirq+53}
> <ffffffff8010e5c8>{apic_timer_interrupt+132} <EOI> 
> <ffffffff801fb8a6>{do_get_write_access+118}
> <ffffffff801fb88e>{do_get_write_access+94} <ffffffff80185d1f>{__getblk+47}
> <ffffffff80195170>{filldir+0} 
> <ffffffff801fbf69>{journal_get_write_access+41}
> <ffffffff801ec41c>{ext3_reserve_inode+write+76} 
> <ffffffff80195170>{filldir+0}
> <ffffffff801ec4d8>{ext3_mark_inode_dirty+56} 
> <ffffffff801fa9e5>{journal_start_229}
> <ffffffff801ee571>{ext3_dirty_inode+113} 
> <ffffffff801a5604>{__mark_inode_dirty+52}
> <ffffffff8019bd2b>{update_atime+123} <ffffffff80195016>{vfs_readdir+166}
> <ffffffff801952e2>{syst_getdents+130} <ffffffff8019465e>{sys_fcntl+830}
> <ffffffff8010dc46>{system_call+126}
> 
> Code: 8b 40 18 48 c1 e0 07 48 8b 98 08 58 5b 80 4c 01 e3 48 89 df
> RIP <ffffffff8012f369>{try_to_wake_up+57} RSP <ffff810004827e88>
> <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

								Bye
									Honza
-- 
Jan Kara <jack at suse.cz>
SuSE CR Labs


From rogel at ext.upr.edu.cu  Fri Feb 24 10:15:33 2006
From: rogel at ext.upr.edu.cu (Rogel Miguez)
Date: Fri, 24 Feb 2006 05:15:33 -0500 (CST)
Subject: kernel panic
Message-ID: <2087.10.2.80.201.1140776133.squirrel@correo.upr.edu.cu>

What should I do?
I have problems with my compiled kernel.
I compiled kernel 2.6.13.4
with make allnoconfig,
with make allyesconfig,
and with make defconfig;
the LILO entry is generated automatically.

When I restart the computer, it shows me the following error:


kernel panic - not syncing : VFS : Unable to mount root fs on
unknown-block (0,0)


Rogel


From porton at ex-code.com  Sat Feb 25 00:32:51 2006
From: porton at ex-code.com (Victor Porton)
Date: Sat, 25 Feb 2006 05:32:51 +0500 (YEKT)
Subject: Linux performance bug: fsync() for files with zero links
Message-ID: <E1FCnMh-0000tb-00@porton.narod.ru>

Linux kernel (as of 2.6.15.4) has the following performance bug:

Syncing (fsync() or fdatasync()) files with zero links (deleted files) is not
a no-op, as it should be.

See details, a test C program, and the rationale in the URL below:

http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync

The article at the URL above also explains how to make a much more
efficient /tmp directory once this bug is fixed.

Somebody please make a patch.

-- 
Victor Porton (porton at ex-code.com) - http://porton.ex-code.com


From erik at harddisk-recovery.com  Tue Feb 28 11:53:08 2006
From: erik at harddisk-recovery.com (Erik Mouw)
Date: Tue, 28 Feb 2006 12:53:08 +0100
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with
	zero links
In-Reply-To: <20060225012520.GZ26809@schatzie.adilger.int>
References: <E1FCnMh-0000tb-00@porton.narod.ru>
	<20060225012520.GZ26809@schatzie.adilger.int>
Message-ID: <20060228115308.GA22017@harddisk-recovery.com>

On Fri, Feb 24, 2006 at 06:25:20PM -0700, Andreas Dilger wrote:
> On Feb 25, 2006  05:32 +0500, Victor Porton wrote:
> > Linux kernel (as of 2.6.15.4) has the following performance bug:
> > 
> > Syncing (fsync() or fdatasync()) files with zero links (deleted files) is not
> > a no-op, as it should be.
> > 
> > See details, a test C program, and the rationale in the URL below:
> > 
> > http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync
> > 
> > The article at the URL above also explains how to make a much more
> > efficient /tmp directory once this bug is fixed.
> > 
> > Somebody please make a patch.
> 
> Of course, for a cluster filesystem it does make sense that fsync flushes
> the data to disk even if the file has no links, because there may be other
> clients that are accessing the same file...

It even makes sense on a single machine with multiple programs still
accessing the same file. You want fsync() and fdatasync() to work
regardless of the amount of links. Not doing so could subtly break
programs. For example:

time  tty0                        tty1                  syslogd
0     tail -f /var/log/messages
1                                                       write(messages, "blah");
2                                                       fsync(messages);
3     blah
4                                 rm /var/log/messages
5                                                       write(messages, "foobar");
6                                                       fsync(messages);
7     (nothing)

At step 7 you should immediately see the "foobar" from syslogd, but
because of the OP's proposed optimisation, you will only see it some
time in the future.


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands


From porton at ex-code.com  Tue Feb 28 15:50:53 2006
From: porton at ex-code.com (Victor Porton)
Date: Tue, 28 Feb 2006 20:50:53 +0500 (YEKT)
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with
	zero links
In-Reply-To: <20060228115308.GA22017@harddisk-recovery.com>
Message-ID: <E1FE77l-0006Y0-00@porton.narod.ru>


On 28-Feb-2006 Erik Mouw wrote:
> On Fri, Feb 24, 2006 at 06:25:20PM -0700, Andreas Dilger wrote:
>> On Feb 25, 2006  05:32 +0500, Victor Porton wrote:
>> > Linux kernel (as of 2.6.15.4) has the following performance bug:
>> > 
>> > Syncing (fsync() or fdatasync()) files with zero links (deleted files) is not
>> > a no-op, as it should be.
>> > 
>> > See details, a test C program, and the rationale in the URL below:
>> > 
>> > http://b2e.ex-code.com/index.php/soft/2006/02/24/linux_performance_bug_zero_links_fsync
...
> It even makes sense on a single machine with multiple programs still
> accessing the same file. You want fsync() and fdatasync() to work
> regardless of the amount of links. Not doing so could subtly break
> programs. For example:

Erik, what you said above is wrong.

There is no need to sync this file to disk (except when we are out of
memory). It is enough to sync the buffers in MEMORY.

From man write(2):

       write  writes  up  to  count  bytes  to the file referenced by the file
       descriptor fd from the buffer starting at buf.  POSIX requires  that  a
       read()  which  can  be  proved  to  occur  after a write() has returned
       returns the new data.  Note that not all file systems  are  POSIX  con-
       forming.

According to my understanding of the above paragraph, there is no need to
do any kind of syncing after a write for other processes to read the
updated data. POSIX already guarantees it and we do not need fsync() for
this.

Somebody with kernel programming experience, please update the Linux kernel CVS
to not uselessly sync files with zero links. I am right; this should be
implemented.

(However, this could be made an optional feature (at either config time or
run time), because my suggestion may sometimes (rarely) cause data loss in
DELETED files, preventing their undeletion. I deem that we could reasonably
make this feature non-optional, as it would harm data safety only a little,
but your opinion on whether to make it optional may vary.)

-- 
Victor Porton (porton at ex-code.com) - http://porton.ex-code.com


From erik at harddisk-recovery.com  Tue Feb 28 16:30:17 2006
From: erik at harddisk-recovery.com (Erik Mouw)
Date: Tue, 28 Feb 2006 17:30:17 +0100
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with
	zero links
In-Reply-To: <E1FE77l-0006Y0-00@porton.narod.ru>
References: <20060228115308.GA22017@harddisk-recovery.com>
	<E1FE77l-0006Y0-00@porton.narod.ru>
Message-ID: <20060228163017.GC22017@harddisk-recovery.com>

On Tue, Feb 28, 2006 at 08:50:53PM +0500, Victor Porton wrote:
> On 28-Feb-2006 Erik Mouw wrote:
> > It even makes sense on a single machine with multiple programs still
> > accessing the same file. You want fsync() and fdatasync() to work
> > regardless of the amount of links. Not doing so could subtly break
> > programs. For example:
> 
> Erik, what you said above is wrong.
> 
> There is no need to sync this file to disk (except when we are out of
> memory). It is enough to sync the buffers in MEMORY.
> 
> From man write(2):
> 
>        write  writes  up  to  count  bytes  to the file referenced by the file
>        descriptor fd from the buffer starting at buf.  POSIX requires  that  a
>        read()  which  can  be  proved  to  occur  after a write() has returned
>        returns the new data.  Note that not all file systems  are  POSIX  con-
>        forming.

AFAIK that's read() from the same process, not read() from another
process. Otherwise there would be no need for fsync()/fdatasync().

But look at my example. tail(1) uses fstat64() to figure out if
/var/log/messages changed. Your proposal for a patch will break that.
Again: the number of links of an inode is not a reason to break
established semantics.
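
The kind of polling referred to above can be sketched as follows. This is
not tail(1)'s actual code; file_grew(), link_count() and open_unlinked_temp()
are hypothetical helpers, and the /tmp template is an assumption. The sketch
shows that fstat() keeps reporting size changes for an open file even after
its last link is gone, which is what a "tail -f" style reader relies on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a temporary file and unlink it at once, so that the returned
 * descriptor refers to a file with no remaining directory entry.
 * Returns the descriptor, or -1 on error. */
int open_unlinked_temp(void)
{
    char tmpl[] = "/tmp/nlink_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd >= 0 && unlink(tmpl) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Link count as reported by fstat(), or -1 on error. */
long link_count(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* Returns 1 if the file is larger than *last_size, 0 if it is not, and
 * -1 if fstat() fails. On success *last_size is updated to the current
 * size, so repeated calls report only new growth. */
int file_grew(int fd, off_t *last_size)
{
    struct stat st;

    assert(last_size != NULL);
    if (fstat(fd, &st) != 0)
        return -1;
    int grew = st.st_size > *last_size;
    *last_size = st.st_size;
    return grew;
}
```

A reader would call file_grew() in a loop and read the new bytes whenever it
returns 1; link_count() can be used to observe the link count of the same
descriptor.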


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands


From porton at ex-code.com  Tue Feb 28 20:57:51 2006
From: porton at ex-code.com (Victor Porton)
Date: Wed, 01 Mar 2006 01:57:51 +0500 (YEKT)
Subject: [Ext2-devel] Re: Linux performance bug: fsync() for files with
	zero links
In-Reply-To: <1141149497.3863.7.camel@orbit.scot.redhat.com>
Message-ID: <E1FEBuq-0000oB-00@porton.narod.ru>


On 28-Feb-2006 Stephen C. Tweedie wrote:
> On Tue, 2006-02-28 at 17:30 +0100, Erik Mouw wrote:
> 
>> > From man write(2):
>> > 
>> >        write  writes  up  to  count  bytes  to the file referenced by the file
>> >        descriptor fd from the buffer starting at buf.  POSIX requires  that  a
>> >        read()  which  can  be  proved  to  occur  after a write() has returned
>> >        returns the new data.  Note that not all file systems  are  POSIX  con-
>> >        forming.

Erik, Stephen Tweedie has already correctly answered your other concerns.

I will add about the semantics:

>> Again: the number of links of an inode is not a reason to break
>> established semantics.
> 
> Correct.  And the semantics *will* change with this patch, but in a
> subtle way.
> 
> Ext3 happens to guarantee that after fsync(), *all* metadata for a file
> --- including directory metadata --- are synchronised to disk.  So if
> you unlink an open file and then fsync() it, you are guaranteed that the
> unlink has been committed to disk.  This is not, strictly speaking, a
> behaviour required by POSIX; but it's still useful, and would be broken
> if we disabled fsync() for files with i_nlink==0.

OK, Stephen, you have pointed out where following my idea would really
significantly change the semantics, which it should not do.

So fsync() (but not fdatasync()) should indeed have an effect on an inode with
zero links, but _only the first time_. Precisely:

1. A boolean flag "no_links_committed" should be associated with every fd
(to save a bit of memory, it could instead be implemented by, e.g., storing -1
(minus one) rather than 0 as the count of links in the fd data structure).

2. When a file is unlinked and the number of links becomes zero,
no_links_committed should be in the reset (false) state (or zero is written
as the count of links in the fd data structure).

3. When fsync() (but not fdatasync() which is simpler) is called on a file:
   - If the number of links is above 0 proceed as usual.
   - If the number of links is zero:
     * If no_links_committed is false, do the directory synchronization
       (as mentioned by Stephen) but no other synchronization, and
       then set no_links_committed to true (or the number of links to -1
       for a slightly more efficient implementation).
     * If no_links_committed is true, do nothing.
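
The three steps above can be sketched as a small user-space model. This is
not kernel code: struct model_file, enum sync_action and the model_*()
functions are hypothetical names, the flag is kept per file rather than per
fd for simplicity, and treating fdatasync() on a zero-link file as a no-op
is an interpretation of "which is simpler".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a sync call would do in this model. */
enum sync_action { SYNC_FULL, SYNC_DIR_ONLY, SYNC_NONE };

/* Hypothetical stand-in for the per-file state in the proposal. */
struct model_file {
    int  nlink;               /* number of directory links */
    bool no_links_committed;  /* has the final unlink been synced? */
};

/* Step 2: dropping the last link resets the flag. */
void model_unlink(struct model_file *f)
{
    assert(f != NULL);
    if (f->nlink > 0 && --f->nlink == 0)
        f->no_links_committed = false;
}

/* Step 3: fsync() syncs fully while links remain, commits the unlink
 * once after the link count reaches zero, and does nothing afterwards. */
enum sync_action model_fsync(struct model_file *f)
{
    assert(f != NULL);
    if (f->nlink > 0)
        return SYNC_FULL;
    if (!f->no_links_committed) {
        f->no_links_committed = true;
        return SYNC_DIR_ONLY;
    }
    return SYNC_NONE;
}

/* Assumed behaviour: fdatasync() never needs the directory step. */
enum sync_action model_fdatasync(const struct model_file *f)
{
    assert(f != NULL);
    return f->nlink > 0 ? SYNC_FULL : SYNC_NONE;
}
```

In this model, a file with one link yields SYNC_FULL from model_fsync();
after model_unlink() the first model_fsync() yields SYNC_DIR_ONLY and every
later call yields SYNC_NONE.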

-- 
Victor Porton (porton at ex-code.com) - http://porton.ex-code.com


From robe at amd.co.at  Tue Feb 28 23:33:27 2006
From: robe at amd.co.at (Michael Renner)
Date: Tue, 28 Feb 2006 23:33:27 +0000 (UTC)
Subject: Status of fragment support, advantages of having fewer inodes
Message-ID: <loom.20060301T000806-758@post.gmane.org>

Hi,

There hasn't been much information regarding fragment support in ext2/3 since
2003 [1], when Andreas stated that there were problems with the xattr
implementation. Has this changed in the meantime?

My second question concerns the bytes-per-inode ratio: what benefits would I
gain from having fewer inodes? I reckon it's only disk space (if so, how much?).

best regards,
Michael Renner

[1] http://www.kerneltraffic.org/kernel-traffic/kt20030428_214.html#8