From jidong.xiao at gmail.com  Sat Apr  2 04:01:00 2011
From: jidong.xiao at gmail.com (Jidong Xiao)
Date: Sat, 2 Apr 2011 00:01:00 -0400
Subject: Ext3: Why data=journal is better than data=ordered when data needs to be read from and written to disk at the same time
In-Reply-To: <19856.47784.421703.81840@tree.ty.sabi.co.UK>
References: <20110326235311.GB21075@thunk.org> <20110327024410.GC21075@thunk.org> <19856.47784.421703.81840@tree.ty.sabi.co.UK>
Message-ID: 

On Mon, Mar 28, 2011 at 12:43 PM, Peter Grandi wrote:
> [ ... ]
>
>>> When executing an fsync(), in data=ordered mode you have to
>>> write the data blocks out and wait for the data blocks to be
>>> written. This generally will require extra seeks. In
>>> data=journal mode, the data blocks can be written directly
>>> into the journal without needing to seek.
>
>>> Of course eventually the data and metadata blocks will need
>>> to be written to their permanent locations before the journal
>>> space can be reused. But for short bursty write patterns,
>>> the fsync() latency will be much smaller in data=journal
>>> mode.
>
>> [ ... ]
>
>> In this case, if we conduct the experiment in data=journal
>> mode and data=ordered mode respectively,
>
> That experiment is not necessarily demonstrative, it depends on
> RAM caching, elevator, ...
>
>> since write latency is much smaller in data=journal mode,
>
> Write latency is actually much longer: because it requires *two*
> writes instead of one. It is *fsync* latency as mentioned above
> that is smaller, because it depends only on the first write to
> what is in effect a small log based filesystem. This distinction
> matters a great deal, because it is the reason why "short bursty
> write patterns" is the qualification above. For long write
> patterns things are very different as the journal eventually
> fills up. For any given size it will also fill up a lot faster
> for 'data=journal'.
>
> Ahhh while writing that I have just realized that large journals
> can be a bad idea especially for metadata operations. Will have
> to think more about that.
>

Well, the experiment I described was actually taken from the following article:

http://www.ibm.com/developerworks/library/l-fs8.html?S_TACT=105AGX52&S_CMP=cn-a-l

The author claims that it was Andrew Morton who tested this and showed that "data=journal mode allowed the 16-meg-file to be read from 9 to over 13 times faster than other ext3 modes, ReiserFS, and even ext2 (which has no journaling overhead)". Although I cannot find Andrew Morton's original post on LKML, this article has been widely copied to many other websites. Furthermore, the in-kernel document Documentation/filesystems/ext3.txt says:

195 * journal mode
196 data=journal mode provides full data and metadata journaling. All new data is
197 written to the journal first, and then to its final location.
198 In the event of a crash, the journal can be replayed, bringing both data and
199 metadata into a consistent state. This mode is the slowest except when data
200 needs to be read from and written to disk at the same time where it
201 outperforms all other modes.

Although Ted and you both explained that the fsync latency is shorter in data=journal mode, my original question, as the title indicates, is why data=journal outperforms the other modes when data is read from and written to disk at the same time.
Or is this statement in the kernel doc not accurate? If so, then we should submit a patch to modify this document so that other people won't be misled, and it would be better to show some more demonstrative examples in which data=journal really outperforms the other modes.

In addition, I am not entirely clear on why you said that write() latency is longer while fsync() latency is shorter. Let me try to restate what you said; please point out if I am incorrect:

1. Normally we call the write() syscall first and then call fsync() to flush the data.

2. write() returns as soon as the data has been written into the page cache, while fsync() returns only once the data has been written to stable storage.

3. Although write() latency in data=journal mode is longer, because it requires two writes instead of one, write() only writes to the page cache, so its cost is not high compared to fsync(), which has to write to disk and may require disk seeks. So we can focus mainly on the fsync() system call.

4. Since the journal is stable storage, in data=journal mode fsync() can return as soon as the metadata and the real data have been written into the journal, and that is sequential access. In data=ordered mode, fsync() completes only once the data itself has been written to its final location on disk; since that is random access, it needs many disk seeks, which are expensive, so fsync() latency is much longer than in data=journal mode. And that is why data=journal wins for this bursty write case.

Are these correct?

Regards
Jidong

From Sean.D.McCauliff at nasa.gov  Thu Apr  7 23:08:29 2011
From: Sean.D.McCauliff at nasa.gov (Sean McCauliff)
Date: Thu, 7 Apr 2011 16:08:29 -0700
Subject: Resizing a file system that has been converted to ext4
Message-ID: <4D9E43ED.4060301@nasa.gov>

Hello,

I have an ext3 file system of about 8TiB in size. At the rate data is added to the file system it will fill up in a few months, so I'm weighing my options. One option would be to create a new, larger ext4 file system and copy everything over. Another option would be to modify the software that uses this file system so it can use multiple file systems. Finally, I could convert the existing file system to ext4 and then resize it. Is this advisable? Has anyone tried this?

Thanks,
Sean McCauliff

From sandeen at redhat.com  Thu Apr  7 23:40:20 2011
From: sandeen at redhat.com (Eric Sandeen)
Date: Thu, 07 Apr 2011 16:40:20 -0700
Subject: Resizing a file system that has been converted to ext4
In-Reply-To: <4D9E43ED.4060301@nasa.gov>
References: <4D9E43ED.4060301@nasa.gov>
Message-ID: <4D9E4B64.7070906@redhat.com>

On 4/7/11 4:08 PM, Sean McCauliff wrote:
> Hello,
>
> I have an ext3 file system of about 8TiB in size. At the rate data
> is added to the file system it will fill up in a few months, so I'm
> weighing my options. One option would be to create a new, larger
> ext4 file system and copy everything over. Another option would be
> to modify the software that uses this file system so it can use
> multiple file systems. Finally, I could convert the existing file
> system to ext4 and then resize it. Is this advisable? Has anyone
> tried this?
>
> Thanks, Sean McCauliff

Hi Sean -

Modern ext3 should have a 16T limit just as ext4 does, so converting to ext4 doesn't really change your maximum filesystem size.
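For reference, the in-place conversion itself is normally just a matter of turning on the ext4 feature flags and re-checking the filesystem; a rough sketch of the commonly documented sequence is below. The device name is only a placeholder, and a verified backup should come first.

    # unmount, then enable the ext4 on-disk features on the existing filesystem
    umount /dev/vgX/datalv
    tune2fs -O extents,uninit_bg,dir_index /dev/vgX/datalv
    # a full fsck pass is required after changing these feature flags
    e2fsck -fD /dev/vgX/datalv
    # from now on, mount it as ext4
    mount -t ext4 /dev/vgX/datalv /data
    # after enlarging the underlying device, grow the filesystem
    # (growing can also be done online while mounted)
    resize2fs /dev/vgX/datalv

Note that files written before the conversion stay block-mapped; only newly created files use extents.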
However, a fresh ext4 filesystem should in theory be a bit more e2fsck-able at that scale, and will have the best feature set. I always recommend migration rather than conversion when possible; you'll get the best features and performance, and run the most tested codepaths that way.

-Eric

From adilger at dilger.ca  Fri Apr  8 02:21:06 2011
From: adilger at dilger.ca (Andreas Dilger)
Date: Thu, 7 Apr 2011 20:21:06 -0600
Subject: Resizing a file system that has been converted to ext4
In-Reply-To: <4D9E43ED.4060301@nasa.gov>
References: <4D9E43ED.4060301@nasa.gov>
Message-ID: <49BD58D4-0129-4C55-8033-0FFE80FF4BA1@dilger.ca>

I'm pretty sure that I have filesystems with mixed extent- and block-mapped files that I've resized in the past.

That said, depending on your data's importance and your tolerance for risk, you should probably have a backup of your data anyway. At that point, starting with a fresh ext4 filesystem and restoring from backup is also attractive for the performance improvements of extents, as well as other format-time-only features.

My filesystems have a high turnover rate (PVR) and are not impossible to replace, so I have been resizing in place and letting the normal turnover of files migrate them to extents. I still don't have some of the newer filesystem features.

Cheers, Andreas

On 2011-04-07, at 5:08 PM, Sean McCauliff wrote:
> Hello,
>
> I have an ext3 file system of about 8TiB in size. At the rate data is added to the file system it will fill up in a few months, so I'm weighing my options. One option would be to create a new, larger ext4 file system and copy everything over. Another option would be to modify the software that uses this file system so it can use multiple file systems. Finally, I could convert the existing file system to ext4 and then resize it. Is this advisable? Has anyone tried this?
>
> Thanks,
> Sean McCauliff
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

From Martin_Zielinski at McAfee.com  Fri Apr  8 09:37:31 2011
From: Martin_Zielinski at McAfee.com (Martin_Zielinski at McAfee.com)
Date: Fri, 8 Apr 2011 04:37:31 -0500
Subject: assertion journal->j_running_transaction != NULL fails in commit
Message-ID: 

Hello!

We are using a 2.6.32.25 kernel on Dell R710 servers (16-core Xeon CPU, 12GB RAM). The servers have been running for about 4 weeks since updating from 2.6.27 and are using ext3 as the filesystem.

cat /proc/mounts | grep opt
/dev/mapper/vg00-opt /opt ext3 rw,nosuid,nodev,relatime,errors=remount-ro 0 0

The journaling mode is writeback.

Suddenly, within 3 days, 4 out of 40 machines stopped writing data to the /opt partition (all logfiles on this partition remain empty from the incident until reboot). /var/log/messages shows that the assertion

J_ASSERT(journal->j_running_transaction != NULL);

fails. Seemingly a phantom bug; I could not find any report about this assertion. It would be really great if anyone has an idea how this can happen, or what I could do to track it down.

------------[ cut here ]------------
kernel BUG at fs/jbd/commit.c:342!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.1/net/eth3/statistics/tx_bytes
CPU 1
Modules linked in: bridge stp llc iptable_filter ip_tables x_tables i2c_dev i2c_core ipv6 binfmt_misc sbs sbshc pci_slot fan container battery ac parport_pc lp parport sg ses enclosure button thermal processor
Pid: 2024, comm: kjournald Not tainted 2.6.32-46.r1-x86_64 #1 Appliance
RIP: 0010:[] [] journal_commit_transaction+0xb2/0x10f7
RSP: 0018:ffff88031da3fda0  EFLAGS: 00010246
RAX: ffff88031f04a000 RBX: ffff88031ea41424 RCX: 00000000000004af
RDX: 00000000000004af RSI: 0000000000000000 RDI: ffff88031ea41400
RBP: ffff88031da3fe60 R08: ffff88031ea41488 R09: 0000000000000009
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88031ea41400
R13: ffff88031ea41400 R14: ffff88031ea41590 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880033020000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f3289a6f6b0 CR3: 0000000231828000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kjournald (pid: 2024, threadinfo ffff88031da3e000, task ffff88031e6de140)
Stack:
 0000000000013680 000571ff7e347b40 ffff88031ff98400 ffff88031f04a000
<0> 0000000000000000 0000000000000286 ffff88031da3fe00 ffffffff810522c9
<0> 00000000ffffffff ffff88031ea41590 ffff88031ea414c8 0000000000000286
Call Trace:
 [] ? lock_timer_base+0x26/0x4a
 [] ? try_to_del_timer_sync+0xa5/0xb2
 [] kjournald+0x147/0x377
 [] ? autoremove_wake_function+0x0/0x38
 [] ? kjournald+0x0/0x377
 [] kthread+0x7d/0x86
 [] child_rip+0xa/0x20
 [] ? kthread+0x0/0x86
 [] ? child_rip+0x0/0x20
Code: 81 ba 53 01 00 00 48 c7 c6 2e b5 8b 81 31 c0 e8 b0 fe ee ff 48 c7 c7 53 b5 8b 81 31 c0 e8 a2 fe ee ff 49 8b 75 50 48 85 f6 75 04 <0f> 0b eb fe 49 83 7d 58 00 74 04 0f 0b eb fe 83 7e 0c 00 49 89
RIP [] journal_commit_transaction+0xb2/0x10f7
 RSP 
---[ end trace 7ef4aef5b1834556 ]---

Thanks & Cheers,
Martin

From sean.d.mccauliff at nasa.gov  Fri Apr  8 22:24:16 2011
From: sean.d.mccauliff at nasa.gov (Mccauliff, Sean D. (ARC-PX)[Lockheed Martin Space OPNS])
Date: Fri, 8 Apr 2011 17:24:16 -0500
Subject: Resizing a file system that has been converted to ext4
In-Reply-To: <49BD58D4-0129-4C55-8033-0FFE80FF4BA1@dilger.ca>
References: <4D9E43ED.4060301@nasa.gov>, <49BD58D4-0129-4C55-8033-0FFE80FF4BA1@dilger.ca>
Message-ID: <341DAA96EE3A8444B6E4657BE8A846EA38C13205DF@NDJSSCC06.ndc.nasa.gov>

Thanks for all the responses; this may change how I approach this issue!

Sean McCauliff

________________________________________
From: Andreas Dilger [adilger at dilger.ca]
Sent: Thursday, April 07, 2011 7:21 PM
To: Mccauliff, Sean D. (ARC-PX)[Lockheed Martin Space OPNS]
Cc: ext3-users at redhat.com
Subject: Re: Resizing a file system that has been converted to ext4

I'm pretty sure that I have filesystems with mixed extent- and block-mapped files that I've resized in the past.

That said, depending on your data's importance and your tolerance for risk, you should probably have a backup of your data anyway. At that point, starting with a fresh ext4 filesystem and restoring from backup is also attractive for the performance improvements of extents, as well as other format-time-only features.

My filesystems have a high turnover rate (PVR) and are not impossible to replace, so I have been resizing in place and letting the normal turnover of files migrate them to extents. I still don't have some of the newer filesystem features.
Cheers, Andreas

On 2011-04-07, at 5:08 PM, Sean McCauliff wrote:
> Hello,
>
> I have an ext3 file system of about 8TiB in size. At the rate data is
> added to the file system it will fill up in a few months, so I'm weighing
> my options. One option would be to create a new, larger ext4 file system
> and copy everything over. Another option would be to modify the software
> that uses this file system so it can use multiple file systems. Finally,
> I could convert the existing file system to ext4 and then resize it.
> Is this advisable? Has anyone tried this?
>
> Thanks,
> Sean McCauliff
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

From markbusheman at gmail.com  Fri Apr  8 22:51:29 2011
From: markbusheman at gmail.com (Mark Busheman)
Date: Fri, 8 Apr 2011 15:51:29 -0700
Subject: ext3 and forced unit access to flush disk cache
Message-ID: 

I plan to use the data=journal option with ext3 for a customer who is very specific about the integrity of the data. I would like to know if ext4 sends FUA (Forced Unit Access) to flush the disk cache.

Cheers
Mark

From ricwheeler at gmail.com  Wed Apr 13 19:34:08 2011
From: ricwheeler at gmail.com (Ric Wheeler)
Date: Wed, 13 Apr 2011 15:34:08 -0400
Subject: ext3 and forced unit access to flush disk cache
In-Reply-To: 
References: 
Message-ID: <4DA5FAB0.1090805@gmail.com>

On 04/08/2011 06:51 PM, Mark Busheman wrote:
> I plan to use the data=journal option with ext3 for a customer who is very
> specific about the integrity of the data. I would like to know if ext4
> sends FUA (Forced Unit Access) to flush the disk cache.
>
> Cheers
> Mark
>

ext4 by default uses "barrier" support and will issue the appropriate write barrier / cache flush commands (exactly what is sent depends on the storage type). You can "mount -o barrier" ext3 as well.

It will do this for its own metadata reasons and as part of an application-driven fsync() command, so the customer data should be safe if their application is coded properly.

Regards,
Ric

From Sean.D.McCauliff at nasa.gov  Mon Apr 25 22:07:21 2011
From: Sean.D.McCauliff at nasa.gov (Sean McCauliff)
Date: Mon, 25 Apr 2011 15:07:21 -0700
Subject: Allocation of Indirect Blocks
Message-ID: <4DB5F099.8010206@nasa.gov>

Does ext3 allocate indirect blocks as needed, or is there some fixed number of them, like inodes? Should I be concerned with running out of indirect blocks?

Thanks,
Sean

From sandeen at redhat.com  Mon Apr 25 22:13:11 2011
From: sandeen at redhat.com (Eric Sandeen)
Date: Mon, 25 Apr 2011 17:13:11 -0500
Subject: Allocation of Indirect Blocks
In-Reply-To: <4DB5F099.8010206@nasa.gov>
References: <4DB5F099.8010206@nasa.gov>
Message-ID: <4DB5F1F7.2010808@redhat.com>

On 4/25/11 5:07 PM, Sean McCauliff wrote:
> Does ext3 allocate indirect blocks as needed, or is there some fixed number of them, like inodes? Should I be concerned with running out of indirect blocks?

ext3 allocates them as needed.
In fact you will often see them allocated consecutively with the data blocks they refer to:

debugfs:  stat bigfile
Inode: 12   Type: regular   Mode: 0644   Flags: 0x0   Generation: 330185944
Version: 0x00000000
User: 0   Group: 0   Size: 8388608
File ACL: 0   Directory ACL: 0
Links: 1   Blockcount: 16450
Fragment:  Address: 0   Number: 0   Size: 0
ctime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
atime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
mtime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
BLOCKS:
(0-11):2561-2572, (IND):2573, (12-267):2574-2829, (DIND):2830, (IND):2831,
(268-523):2832-3087, (IND):3088, (524-779):3089-3344, (IND):3345,
(780-1035):3346-3601, (IND):3602, (1036-1291):3603-3858, (IND):3859,
(1292-1547):3860-4115, (IND):4116, (1548-1803):4117-4372, (IND):4373,
(1804-2059):4374-4629, (IND):4630, ...

... and so on (IND/DIND are indirect & double indirect blocks).

-Eric

> Thanks,
> Sean

From Sean.D.McCauliff at nasa.gov  Mon Apr 25 22:15:57 2011
From: Sean.D.McCauliff at nasa.gov (Sean McCauliff)
Date: Mon, 25 Apr 2011 15:15:57 -0700
Subject: Allocation of Indirect Blocks
In-Reply-To: <4DB5F1F7.2010808@redhat.com>
References: <4DB5F099.8010206@nasa.gov> <4DB5F1F7.2010808@redhat.com>
Message-ID: <4DB5F29D.7000506@nasa.gov>

Cool.

Thanks,
Sean

Eric Sandeen wrote:
> On 4/25/11 5:07 PM, Sean McCauliff wrote:
>
>> Does ext3 allocate indirect blocks as needed, or is there some fixed number of them, like inodes? Should I be concerned with running out of indirect blocks?
>>
>
> ext3 allocates them as needed.
>
> In fact you will often see them allocated consecutively with the data blocks they refer to:
>
> debugfs:  stat bigfile
> Inode: 12   Type: regular   Mode: 0644   Flags: 0x0   Generation: 330185944
> Version: 0x00000000
> User: 0   Group: 0   Size: 8388608
> File ACL: 0   Directory ACL: 0
> Links: 1   Blockcount: 16450
> Fragment:  Address: 0   Number: 0   Size: 0
> ctime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
> atime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
> mtime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
> BLOCKS:
> (0-11):2561-2572, (IND):2573, (12-267):2574-2829, (DIND):2830, (IND):2831,
> (268-523):2832-3087, (IND):3088, (524-779):3089-3344, (IND):3345,
> (780-1035):3346-3601, (IND):3602, (1036-1291):3603-3858, (IND):3859,
> (1292-1547):3860-4115, (IND):4116, (1548-1803):4117-4372, (IND):4373,
> (1804-2059):4374-4629, (IND):4630, ...
>
> ... and so on (IND/DIND are indirect & double indirect blocks).
>
> -Eric
>
>> Thanks,
>> Sean
>>
>
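For anyone who wants to reproduce the kind of block map Eric shows above without touching a real disk, something along the following lines works against a scratch image; the file names, sizes, and paths here are arbitrary placeholders, and the 1 KiB block size matches the output above.

    # build a small scratch ext3 image; -b 1024 gives 1 KiB blocks as in the output above
    dd if=/dev/zero of=/tmp/scratch.img bs=1M count=32
    mkfs.ext3 -q -F -b 1024 /tmp/scratch.img

    # create an 8 MiB test file and copy it into the image with debugfs (no mount needed)
    dd if=/dev/zero of=/tmp/bigfile bs=1M count=8
    debugfs -w -R "write /tmp/bigfile bigfile" /tmp/scratch.img

    # dump the inode, including the (IND)/(DIND) entries in the block map
    debugfs -R "stat bigfile" /tmp/scratch.img

With 1 KiB blocks, each indirect block holds 256 four-byte block pointers, which is why an (IND) entry appears after every 256 data blocks in the map above; the indirect blocks are allocated on demand right alongside the data they describe.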