[PATCH] BUG on fsync/fdatasync with Ext3 data=journal

Andrew Morton akpm at osdl.org
Thu Sep 16 21:50:59 UTC 2004


Seiji Kihara <kihara.seiji at lab.ntt.co.jp> wrote:
>
> We found that fsync and fdatasync syscalls sometimes don't sync
> data in an ext3 file system under the following conditions.
> 
> 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
> 2. Ext3's journalling mode is "data=journal".
> 3. Create a file (whose size is 1Mbytes) and execute umount/mount.
> 4. lseek to a random position within the file, write 8192 bytes
>    data, and fsync or fdatasync.
> 
> We presume the data was not written to the corresponding disk
> before returning from fsync or fdatasync syscall on the evidence
> as follows:
> 
> 1. The response time of fsync() and fdatasync() was extremely
>    short.
> 
>    We use the "diskio" tool, which is downloadable from OSDL page
>    (http://developer.osdl.jp/projects/doubt/).  The program showed
>    that the response time was under 10 microseconds.  This time
>    cannot be achieved with data transfer on IDE and PCI bus!
> 
> 2. The IDE writing routine ide_start_dma() was not called under
>    DMA enabled.
> 
>    We inserted the print messages in the sys_write(), sys_fsync()
>    and ide_start_dma() by the attached patch.  Sometimes the
>    "ide_start_dma: ..." message was not shown between "write: in
>    ..." and "fsync: out ...".
> 
> The problem was occurred since 2.6.5-bk1, which includes the patch
> "[PATCH] ext3 fsync() and fdatasync() speedup".  We found that the
> problem was solved by deleting the part of the patch which
> modifies ext3_sync_file().  Maybe, i_state is not correctly set to
> I_DIRTY when the related page cache is dirty (is it true?)

I forgot about this one.

> Attached file is tarball (tar + bzip2) which contains following
> files.  The patches are for 2.6.8.1-kernel (applicable to
> 2.6.9-rc2), and the results were also produced with
> 2.6.8.1-kernel.

We really don't need a 100k tarball to communicate a three line patch :(

Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
write() time when the filesystem is mounted in data=journal mode.

However your patch will disable the above optimisation for data=writeback
and data-ordered modes as well.  I don't think that's necessary?

How about this?

--- 25/fs/ext3/fsync.c~ext3-journal-data-fsync-fix	Thu Sep 16 14:47:21 2004
+++ 25-akpm/fs/ext3/fsync.c	Thu Sep 16 14:47:33 2004
@@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s
 
 	J_ASSERT(ext3_journal_current_handle() == 0);
 
-	smp_mb();		/* prepare for lockless i_state read */
-	if (!(inode->i_state & I_DIRTY))
-		goto out;
-
 	/*
 	 * data=writeback:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -76,6 +72,10 @@ int ext3_sync_file(struct file * file, s
 		goto out;
 	}
 
+	smp_mb();		/* prepare for lockless i_state read */
+	if (!(inode->i_state & I_DIRTY))
+		goto out;
+
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
_





More information about the Ext3-users mailing list