From jidong.xiao at gmail.com  Sat Apr  2 04:01:00 2011
From: jidong.xiao at gmail.com (Jidong Xiao)
Date: Sat, 2 Apr 2011 00:01:00 -0400
Subject: Ext3: Why data=journal is better than data=ordered when data needs to be read from and written to disk at the same time
In-Reply-To: <19856.47784.421703.81840@tree.ty.sabi.co.UK>
References: <20110326235311.GB21075@thunk.org> <20110327024410.GC21075@thunk.org> <19856.47784.421703.81840@tree.ty.sabi.co.UK>
Message-ID: 

On Mon, Mar 28, 2011 at 12:43 PM, Peter Grandi wrote:
> [ ... ]
>
>>> When executing an fsync(), in data=ordered mode you have to
>>> write the data blocks out and wait for the data blocks to be
>>> written. This generally will require extra seeks. In
>>> data=journal mode, the data blocks can be written directly
>>> into the journal without needing to seek.
>
>>> Of course eventually the data and metadata blocks will need
>>> to be written to their permanent locations before the journal
>>> space can be reused. But for short bursty write patterns,
>>> the fsync() latency will be much smaller in data=journal
>>> mode.
>
>> [ ... ]
>
>> In this case, if we conduct the experiment in data=journal
>> mode and data=ordered mode respectively,
>
> That experiment is not necessarily demonstrative, it depends on
> RAM caching, elevator, ...
>
>> since write latency is much smaller in data=journal mode,
>
> Write latency is actually much longer: because it requires *two*
> writes instead of one. It is *fsync* latency as mentioned above
> that is smaller, because it depends only on the first write to
> what is in effect a small log based filesystem. This distinction
> matters a great deal, because it is the reason why "short bursty
> write patterns" is the qualification above. For long write
> patterns things are very different as the journal eventually
> fills up. For any given size it will also fill up a lot faster
> for 'data=journal'.
>
> Ahhh while writing that I have just realized that large journals
> can be a bad idea especially for metadata operations. Will have
> to think more about that.
>

Well, the experiment I described was actually taken from the following article:

http://www.ibm.com/developerworks/library/l-fs8.html?S_TACT=105AGX52&S_CMP=cn-a-l

The author claims that it was Andrew Morton who tested this and showed that "data=journal mode allowed the 16-meg-file to be read from 9 to over 13 times faster than other ext3 modes, ReiserFS, and even ext2 (which has no journaling overhead)". Although I cannot find Andrew Morton's original post on LKML, this article has been widely copied to many other websites. Furthermore, the in-kernel document Documentation/filesystems/ext3.txt says:

195 * journal mode
196 data=journal mode provides full data and metadata journaling. All new data is
197 written to the journal first, and then to its final location.
198 In the event of a crash, the journal can be replayed, bringing both data and
199 metadata into a consistent state. This mode is the slowest except when data
200 needs to be read from and written to disk at the same time where it
201 outperforms all other modes.

Although Ted and you both explained that the fsync latency is shorter in data=journal mode, my original question, as the title indicates, is why data=journal outperforms the other modes when data is read from and written to disk at the same time.
Or is this statement in the kernel doc not accurate? If so, then we should submit a patch to modify this document so that other people won't be misled, and it would be better to show some more demonstrative examples in which data=journal really outperforms the other modes.

In addition, I am not entirely clear on why you said that write() latency is longer while fsync() latency is shorter. Let me try to restate what you said; please point out if I am incorrect:

1. Normally we call the write() syscall first and then call fsync() to flush the data.

2. write() returns as soon as the data has been written into the page cache, while fsync() returns only once the data has been written to stable storage.

3. Although write() latency in data=journal mode is longer, because it requires two writes instead of one, write() only writes to the page cache, so its cost is not high compared to fsync(), which has to write to disk and may require disk seeks. So we can focus mainly on the fsync() system call.

4. Since the journal is stable storage, in data=journal mode fsync() can return as soon as the metadata and the real data have been written into the journal, and that is sequential access. In data=ordered mode, fsync() completes only once the data itself has been written to its final location on disk; since that is random access, it needs many disk seeks, which are expensive, so fsync() latency is much longer than in data=journal mode. And that is why data=journal wins for this bursty write case.

Are these correct?

Regards
Jidong

From Sean.D.McCauliff at nasa.gov  Thu Apr  7 23:08:29 2011
From: Sean.D.McCauliff at nasa.gov (Sean McCauliff)
Date: Thu, 7 Apr 2011 16:08:29 -0700
Subject: Resizing a file system that has been converted to ext4
Message-ID: <4D9E43ED.4060301@nasa.gov>

Hello,

I have an ext3 file system of about 8TiB in size. At the rate data is added to the file system it will fill up in a few months, so I'm weighing my options. One option would be to create a new, larger ext4 file system and copy everything over. Another option would be to modify the software that uses this file system so it can use multiple file systems. Finally, I could convert the existing file system to ext4 and then resize it. Is this advisable? Has anyone tried this?

Thanks,
Sean McCauliff

From sandeen at redhat.com  Thu Apr  7 23:40:20 2011
From: sandeen at redhat.com (Eric Sandeen)
Date: Thu, 07 Apr 2011 16:40:20 -0700
Subject: Resizing a file system that has been converted to ext4
In-Reply-To: <4D9E43ED.4060301@nasa.gov>
References: <4D9E43ED.4060301@nasa.gov>
Message-ID: <4D9E4B64.7070906@redhat.com>

On 4/7/11 4:08 PM, Sean McCauliff wrote:
> Hello,
>
> I have an ext3 file system of about 8TiB in size. At the rate data
> is added to the file system it will fill up in a few months, so I'm
> weighing my options. One option would be to create a new, larger
> ext4 file system and copy everything over. Another option would be
> to modify the software that uses this file system so it can use
> multiple file systems. Finally, I could convert the existing file
> system to ext4 and then resize it. Is this advisable? Has anyone
> tried this?
>
> Thanks, Sean McCauliff

Hi Sean -

Modern ext3 should have a 16T limit just as ext4 does, so converting to ext4 doesn't really change your maximum filesystem size.
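For reference, the in-place conversion itself is normally just a matter of turning on the ext4 feature flags and re-checking the filesystem; a rough sketch of the commonly documented sequence is below. The device name is only a placeholder, and a verified backup should come first.

    # unmount, then enable the ext4 on-disk features on the existing filesystem
    umount /dev/vgX/datalv
    tune2fs -O extents,uninit_bg,dir_index /dev/vgX/datalv
    # a full fsck pass is required after changing these feature flags
    e2fsck -fD /dev/vgX/datalv
    # from now on, mount it as ext4
    mount -t ext4 /dev/vgX/datalv /data
    # after enlarging the underlying device, grow the filesystem
    # (growing can also be done online while mounted)
    resize2fs /dev/vgX/datalv

Note that files written before the conversion stay block-mapped; only newly created files use extents.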
However, a fresh ext4 filesystem should in theory be a bit more e2fsck-able at that scale, and will have the best feature set. I always recommend migration rather than conversion when possible; you'll get the best features and performance, and run the most tested codepaths that way.

-Eric

From adilger at dilger.ca  Fri Apr  8 02:21:06 2011
From: adilger at dilger.ca (Andreas Dilger)
Date: Thu, 7 Apr 2011 20:21:06 -0600
Subject: Resizing a file system that has been converted to ext4
In-Reply-To: <4D9E43ED.4060301@nasa.gov>
References: <4D9E43ED.4060301@nasa.gov>
Message-ID: <49BD58D4-0129-4C55-8033-0FFE80FF4BA1@dilger.ca>

I'm pretty sure that I have filesystems with mixed extent- and block-mapped files that I've resized in the past.

That said, depending on your data's importance and your tolerance for risk, you should probably have a backup of your data anyway. At that point, starting with a fresh ext4 filesystem and restoring from backup is also attractive for the performance improvements of extents, as well as other format-time-only features.

My filesystems have a high turnover rate (PVR) and are not impossible to replace, so I have been resizing in place and letting the normal turnover of files migrate them to extents. I still don't have some of the newer filesystem features.

Cheers, Andreas

On 2011-04-07, at 5:08 PM, Sean McCauliff wrote:
> Hello,
>
> I have an ext3 file system of about 8TiB in size. At the rate data is added to the file system it will fill up in a few months, so I'm weighing my options. One option would be to create a new, larger ext4 file system and copy everything over. Another option would be to modify the software that uses this file system so it can use multiple file systems. Finally, I could convert the existing file system to ext4 and then resize it. Is this advisable? Has anyone tried this?
>
> Thanks,
> Sean McCauliff
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

From Martin_Zielinski at McAfee.com  Fri Apr  8 09:37:31 2011
From: Martin_Zielinski at McAfee.com (Martin_Zielinski at McAfee.com)
Date: Fri, 8 Apr 2011 04:37:31 -0500
Subject: assertion journal->j_running_transaction != NULL fails in commit
Message-ID: 

Hello!

We are using a 2.6.32.25 kernel on Dell R710 servers (16-core Xeon CPU, 12GB RAM). The servers have been running for about 4 weeks since updating from 2.6.27 and are using ext3 as the filesystem.

cat /proc/mounts | grep opt
/dev/mapper/vg00-opt /opt ext3 rw,nosuid,nodev,relatime,errors=remount-ro 0 0

The journaling mode is writeback.

Suddenly, within 3 days, 4 out of 40 machines stopped writing data to the /opt partition (all logfiles on this partition remain empty from the incident until reboot). /var/log/messages shows that the assertion

J_ASSERT(journal->j_running_transaction != NULL);

fails. Seemingly a phantom bug; I could not find any report about this assertion. It would be really great if anyone has an idea how this can happen, or what I could do to track it down.

------------[ cut here ]------------
kernel BUG at fs/jbd/commit.c:342!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.1/net/eth3/statistics/tx_bytes
CPU 1
Modules linked in: bridge stp llc iptable_filter ip_tables x_tables i2c_dev i2c_core ipv6 binfmt_misc sbs sbshc pci_slot fan container battery ac parport_pc lp parport sg ses enclosure button thermal processor
Pid: 2024, comm: kjournald Not tainted 2.6.32-46.r1-x86_64 #1 Appliance
RIP: 0010:[] [] journal_commit_transaction+0xb2/0x10f7
RSP: 0018:ffff88031da3fda0  EFLAGS: 00010246
RAX: ffff88031f04a000 RBX: ffff88031ea41424 RCX: 00000000000004af
RDX: 00000000000004af RSI: 0000000000000000 RDI: ffff88031ea41400
RBP: ffff88031da3fe60 R08: ffff88031ea41488 R09: 0000000000000009
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88031ea41400
R13: ffff88031ea41400 R14: ffff88031ea41590 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880033020000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f3289a6f6b0 CR3: 0000000231828000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kjournald (pid: 2024, threadinfo ffff88031da3e000, task ffff88031e6de140)
Stack:
 0000000000013680 000571ff7e347b40 ffff88031ff98400 ffff88031f04a000
<0> 0000000000000000 0000000000000286 ffff88031da3fe00 ffffffff810522c9
<0> 00000000ffffffff ffff88031ea41590 ffff88031ea414c8 0000000000000286
Call Trace:
 [] ? lock_timer_base+0x26/0x4a
 [] ? try_to_del_timer_sync+0xa5/0xb2
 [] kjournald+0x147/0x377
 [] ? autoremove_wake_function+0x0/0x38
 [] ? kjournald+0x0/0x377
 [] kthread+0x7d/0x86
 [] child_rip+0xa/0x20
 [] ? kthread+0x0/0x86
 [] ? child_rip+0x0/0x20
Code: 81 ba 53 01 00 00 48 c7 c6 2e b5 8b 81 31 c0 e8 b0 fe ee ff 48 c7 c7 53 b5 8b 81 31 c0 e8 a2 fe ee ff 49 8b 75 50 48 85 f6 75 04 <0f> 0b eb fe 49 83 7d 58 00 74 04 0f 0b eb fe 83 7e 0c 00 49 89
RIP [] journal_commit_transaction+0xb2/0x10f7
 RSP 
---[ end trace 7ef4aef5b1834556 ]---

Thanks & Cheers,
Martin

From sean.d.mccauliff at nasa.gov  Fri Apr  8 22:24:16 2011
From: sean.d.mccauliff at nasa.gov (Mccauliff, Sean D. (ARC-PX)[Lockheed Martin Space OPNS])
Date: Fri, 8 Apr 2011 17:24:16 -0500
Subject: Resizing a file system that has been converted to ext4
In-Reply-To: <49BD58D4-0129-4C55-8033-0FFE80FF4BA1@dilger.ca>
References: <4D9E43ED.4060301@nasa.gov>, <49BD58D4-0129-4C55-8033-0FFE80FF4BA1@dilger.ca>
Message-ID: <341DAA96EE3A8444B6E4657BE8A846EA38C13205DF@NDJSSCC06.ndc.nasa.gov>

Thanks for all the responses; this may change how I approach this issue!

Sean McCauliff

________________________________________
From: Andreas Dilger [adilger at dilger.ca]
Sent: Thursday, April 07, 2011 7:21 PM
To: Mccauliff, Sean D. (ARC-PX)[Lockheed Martin Space OPNS]
Cc: ext3-users at redhat.com
Subject: Re: Resizing a file system that has been converted to ext4

I'm pretty sure that I have filesystems with mixed extent- and block-mapped files that I've resized in the past.

That said, depending on your data's importance and your tolerance for risk, you should probably have a backup of your data anyway. At that point, starting with a fresh ext4 filesystem and restoring from backup is also attractive for the performance improvements of extents, as well as other format-time-only features.

My filesystems have a high turnover rate (PVR) and are not impossible to replace, so I have been resizing in place and letting the normal turnover of files migrate them to extents. I still don't have some of the newer filesystem features.
Cheers, Andreas

On 2011-04-07, at 5:08 PM, Sean McCauliff wrote:
> Hello,
>
> I have an ext3 file system of about 8TiB in size. At the rate data is
> added to the file system it will fill up in a few months, so I'm weighing
> my options. One option would be to create a new, larger ext4 file system
> and copy everything over. Another option would be to modify the software
> that uses this file system so it can use multiple file systems. Finally,
> I could convert the existing file system to ext4 and then resize it.
> Is this advisable? Has anyone tried this?
>
> Thanks,
> Sean McCauliff
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

From markbusheman at gmail.com  Fri Apr  8 22:51:29 2011
From: markbusheman at gmail.com (Mark Busheman)
Date: Fri, 8 Apr 2011 15:51:29 -0700
Subject: ext3 and forced unit access to flush disk cache
Message-ID: 

I plan to use the data=journal option with ext3 for a customer who is very specific about the integrity of the data. I would like to know if ext4 sends FUA (Forced Unit Access) to flush the disk cache.

Cheers
Mark

From ricwheeler at gmail.com  Wed Apr 13 19:34:08 2011
From: ricwheeler at gmail.com (Ric Wheeler)
Date: Wed, 13 Apr 2011 15:34:08 -0400
Subject: ext3 and forced unit access to flush disk cache
In-Reply-To: 
References: 
Message-ID: <4DA5FAB0.1090805@gmail.com>

On 04/08/2011 06:51 PM, Mark Busheman wrote:
> I plan to use the data=journal option with ext3 for a customer who is very
> specific about the integrity of the data. I would like to know if ext4
> sends FUA (Forced Unit Access) to flush the disk cache.
>
> Cheers
> Mark
>

ext4 by default uses "barrier" support and will issue the appropriate write barrier / cache flush commands (exactly what is sent depends on the storage type). You can "mount -o barrier" ext3 as well.

It will do this for its own metadata reasons and as part of an application-driven fsync() command, so the customer data should be safe if their application is coded properly.

Regards,
Ric

From Sean.D.McCauliff at nasa.gov  Mon Apr 25 22:07:21 2011
From: Sean.D.McCauliff at nasa.gov (Sean McCauliff)
Date: Mon, 25 Apr 2011 15:07:21 -0700
Subject: Allocation of Indirect Blocks
Message-ID: <4DB5F099.8010206@nasa.gov>

Does ext3 allocate indirect blocks as needed, or is there some fixed number of them, like inodes? Should I be concerned with running out of indirect blocks?

Thanks,
Sean

From sandeen at redhat.com  Mon Apr 25 22:13:11 2011
From: sandeen at redhat.com (Eric Sandeen)
Date: Mon, 25 Apr 2011 17:13:11 -0500
Subject: Allocation of Indirect Blocks
In-Reply-To: <4DB5F099.8010206@nasa.gov>
References: <4DB5F099.8010206@nasa.gov>
Message-ID: <4DB5F1F7.2010808@redhat.com>

On 4/25/11 5:07 PM, Sean McCauliff wrote:
> Does ext3 allocate indirect blocks as needed, or is there some fixed number of them, like inodes? Should I be concerned with running out of indirect blocks?

ext3 allocates them as needed.
In fact you will often see them allocated consecutively with the data blocks they refer to:

debugfs:  stat bigfile
Inode: 12   Type: regular   Mode: 0644   Flags: 0x0   Generation: 330185944
Version: 0x00000000
User: 0   Group: 0   Size: 8388608
File ACL: 0   Directory ACL: 0
Links: 1   Blockcount: 16450
Fragment:  Address: 0   Number: 0   Size: 0
ctime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
atime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
mtime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
BLOCKS:
(0-11):2561-2572, (IND):2573, (12-267):2574-2829, (DIND):2830, (IND):2831,
(268-523):2832-3087, (IND):3088, (524-779):3089-3344, (IND):3345,
(780-1035):3346-3601, (IND):3602, (1036-1291):3603-3858, (IND):3859,
(1292-1547):3860-4115, (IND):4116, (1548-1803):4117-4372, (IND):4373,
(1804-2059):4374-4629, (IND):4630, ...

... and so on (IND/DIND are indirect & double indirect blocks).

-Eric

> Thanks,
> Sean

From Sean.D.McCauliff at nasa.gov  Mon Apr 25 22:15:57 2011
From: Sean.D.McCauliff at nasa.gov (Sean McCauliff)
Date: Mon, 25 Apr 2011 15:15:57 -0700
Subject: Allocation of Indirect Blocks
In-Reply-To: <4DB5F1F7.2010808@redhat.com>
References: <4DB5F099.8010206@nasa.gov> <4DB5F1F7.2010808@redhat.com>
Message-ID: <4DB5F29D.7000506@nasa.gov>

Cool.

Thanks,
Sean

Eric Sandeen wrote:
> On 4/25/11 5:07 PM, Sean McCauliff wrote:
>
>> Does ext3 allocate indirect blocks as needed, or is there some fixed number of them, like inodes? Should I be concerned with running out of indirect blocks?
>>
>
> ext3 allocates them as needed.
>
> In fact you will often see them allocated consecutively with the data blocks they refer to:
>
> debugfs:  stat bigfile
> Inode: 12   Type: regular   Mode: 0644   Flags: 0x0   Generation: 330185944
> Version: 0x00000000
> User: 0   Group: 0   Size: 8388608
> File ACL: 0   Directory ACL: 0
> Links: 1   Blockcount: 16450
> Fragment:  Address: 0   Number: 0   Size: 0
> ctime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
> atime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
> mtime: 0x4db5f1c8 -- Mon Apr 25 17:12:24 2011
> BLOCKS:
> (0-11):2561-2572, (IND):2573, (12-267):2574-2829, (DIND):2830, (IND):2831,
> (268-523):2832-3087, (IND):3088, (524-779):3089-3344, (IND):3345,
> (780-1035):3346-3601, (IND):3602, (1036-1291):3603-3858, (IND):3859,
> (1292-1547):3860-4115, (IND):4116, (1548-1803):4117-4372, (IND):4373,
> (1804-2059):4374-4629, (IND):4630, ...
>
> ... and so on (IND/DIND are indirect & double indirect blocks).
>
> -Eric
>
>> Thanks,
>> Sean
>>
>
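For anyone who wants to reproduce the kind of block map Eric shows above without touching a real disk, something along the following lines works against a scratch image; the file names, sizes, and paths here are arbitrary placeholders, and the 1 KiB block size matches the output above.

    # build a small scratch ext3 image; -b 1024 gives 1 KiB blocks as in the output above
    dd if=/dev/zero of=/tmp/scratch.img bs=1M count=32
    mkfs.ext3 -q -F -b 1024 /tmp/scratch.img

    # create an 8 MiB test file and copy it into the image with debugfs (no mount needed)
    dd if=/dev/zero of=/tmp/bigfile bs=1M count=8
    debugfs -w -R "write /tmp/bigfile bigfile" /tmp/scratch.img

    # dump the inode, including the (IND)/(DIND) entries in the block map
    debugfs -R "stat bigfile" /tmp/scratch.img

With 1 KiB blocks, each indirect block holds 256 four-byte block pointers, which is why an (IND) entry appears after every 256 data blocks in the map above; the indirect blocks are allocated on demand right alongside the data they describe.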