From sergey.shyman at gmail.com Fri Jan 2 20:18:55 2009 From: sergey.shyman at gmail.com (Sergey Shyman) Date: Fri, 02 Jan 2009 22:18:55 +0200 Subject: Big problem with huge number of files Message-ID: <495E76AF.8080702@gmail.com> Hi all, I have an issue when I can't get directory listing for maildir with huge number of files inside. Neither ls, du or any other command finished successfully, it just running for hours without any success. Does anybody know how I could get directory listing and copies of my files? Any pointing would be great and greatly appreciated. Thanks in advance! Here is info about this partition: Filesystem volume name: Last mounted on: Filesystem UUID: 3395b7eb-746c-4fc1-a52e-76547ca7454d Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file Default mount options: (none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 30507008 Block count: 61008816 Reserved block count: 3050440 Free blocks: 36021498 Free inodes: 20268094 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 1024 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 16384 Inode blocks per group: 512 Filesystem created: Thu Apr 27 23:40:04 2006 Last mount time: Fri Jan 2 15:11:02 2009 Last write time: Fri Jan 2 15:52:25 2009 Mount count: 37 Maximum mount count: -1 Last checked: Thu Apr 27 23:40:04 2006 Check interval: 0 () Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 First orphan inode: 28213259 Default directory hash: tea Directory Hash Seed: 04e82a5e-98ca-4893-b03f-44d5f7227e8d Journal backup: inode blocks This partition have noatime enabled. From pegasus at nerv.eu.org Fri Jan 2 21:38:26 2009 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Fri, 2 Jan 2009 22:38:26 +0100 Subject: Big problem with huge number of files In-Reply-To: <495E76AF.8080702@gmail.com> References: <495E76AF.8080702@gmail.com> Message-ID: <20090102223826.774c1942.pegasus@nerv.eu.org> On Fri, 02 Jan 2009 22:18:55 +0200 Sergey Shyman wrote: > Hi all, > > I have an issue when I can't get directory listing for maildir with huge > number of files inside. Neither ls, du or any other command finished > successfully, it just running for hours without any success. Does > anybody know how I could get directory listing and copies of my files? > Any pointing would be great and greatly appreciated. Thanks in advance! Have you tried ls -U so that ls doesn't do internal sorting? Have you tried find? -- Jure Pe?ar http://jure.pecar.org/ From Curtis at GreenKey.net Mon Jan 5 17:21:56 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Mon, 5 Jan 2009 09:21:56 -0800 (PST) Subject: 16TiB ext4 Message-ID: <20090105172156.AC5036F064@alopias.GreenKey.net> I'm horsing around with ext4 again. This time on Fedora 10. Is there any sane reason why I cannot use the *full* 16TiB volume? ----8<---- # vgcreate foo /dev/mapper/mpath* Volume group "foo" successfully created # lvcreate -L16T -nbar foo Logical volume "bar" created # mkfs.ext4 -Tlargefile4 /dev/foo/bar mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: Size of device /dev/foo/bar too big to be expressed in 32 bits using a blocksize of 4096. ----8<---- But it appears to *really* allow up to one PE less than the full 16TiB, why? 
----8<---- # vgdisplay foo --- Volume group --- VG Name foo System ID Format lvm2 Metadata Areas 2 Metadata Sequence No 2 VG Access read/write VG Status resizable MAX LV 0 Cur LV 1 Open LV 0 Max PV 0 Cur PV 2 Act PV 2 VG Size 18.19 TB PE Size 4.00 MB Total PE 4769266 Alloc PE / Size 4194304 / 16.00 TB Free PE / Size 574962 / 2.19 TB VG UUID tPk8uJ-gIYZ-GJSU-ssob-IoYu-8AUp-pHKALO # lvremove -f foo/bar Logical volume "bar" successfully removed # lvcreate -l4194303 -nbar foo Logical volume "bar" created # mkfs.ext4 -Tlargefile4 /dev/foo/bar mke2fs 1.41.3 (12-Oct-2008) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 1073741824 inodes, 4294966272 blocks 214748313 blocks (5.00%) reserved for the super user First data block=0 131072 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432 Writing inode tables: done Creating journal (32768 blocks): done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 35 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. ----8<---- In my use case, I'm using much larger PEs, so the loss if just one is significant. Is this a bug in my thinking? Or in the userland tools? ../C From sandeen at redhat.com Mon Jan 5 18:16:08 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 05 Jan 2009 12:16:08 -0600 Subject: 16TiB ext4 In-Reply-To: <20090105172156.AC5036F064@alopias.GreenKey.net> References: <20090105172156.AC5036F064@alopias.GreenKey.net> Message-ID: <49624E68.8050804@redhat.com> Curtis Doty wrote: > I'm horsing around with ext4 again. This time on Fedora 10. Is there any > sane reason why I cannot use the *full* 16TiB volume? > > ----8<---- > # vgcreate foo /dev/mapper/mpath* > Volume group "foo" successfully created > # lvcreate -L16T -nbar foo > Logical volume "bar" created > # mkfs.ext4 -Tlargefile4 /dev/foo/bar > mke2fs 1.41.3 (12-Oct-2008) > mkfs.ext4: Size of device /dev/foo/bar too big to be expressed in 32 bits > using a blocksize of 4096. > ----8<---- > > But it appears to *really* allow up to one PE less than the full 16TiB, > why? The real limit, IIRC, is (2^32 - 1) blocks, or 4k shy of 16T for 4k blocks. This is a little unfortunate since "lvcreate -L16T" is so handy, but it won't mkfs properly. (ext3 should have the same limitation). We should probably make mkfs just silently lop off one block if it encounters a boundary condition like this ... -Eric From Curtis at GreenKey.net Mon Jan 5 20:23:35 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Mon, 5 Jan 2009 12:23:35 -0800 (PST) Subject: 16TiB ext4 In-Reply-To: <49624E68.8050804@redhat.com> References: <20090105172156.AC5036F064@alopias.GreenKey.net> <49624E68.8050804@redhat.com> Message-ID: <20090105202335.4A68E6F064@alopias.GreenKey.net> 12:16pm Eric Sandeen said: > The real limit, IIRC, is (2^32 - 1) blocks, or 4k shy of 16T for 4k blocks. > > This is a little unfortunate since "lvcreate -L16T" is so handy, but it > won't mkfs properly. (ext3 should have the same limitation). > > We should probably make mkfs just silently lop off one block if it > encounters a boundary condition like this ... > Ah, thanks Eric! That would be smart. I'm trying to workaround, but... 
----8<---- # mkfs.ext4 /dev/foo/bar $[2**32-1] mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: Size of device /dev/phd/dc1a too big to be expressed in 32 bits using a blocksize of 4096. # mkfs.ext4 /dev/foo/bar 42 # mkfs.ext4 Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-f fragment-size] [-i bytes-per-inode] [-I inode-size] [-J journal-options] [-G meta group size] [-N number-of-inodes] [-m reserved-blocks-percentage] [-o creator-os] [-g blocks-per-group] [-L volume-label] [-M last-mounted-directory] [-O feature[,...]] [-r fs-revision] [-E extended-option[,...]] [-T fs-type] [-jnqvFSV] device [blocks-count] ----8<---- It doesn't appear to support the blocks-count option anymore. :-( Or did it ever? ../C From Curtis at GreenKey.net Mon Jan 5 20:31:43 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Mon, 5 Jan 2009 12:31:43 -0800 (PST) Subject: 16TiB ext4 In-Reply-To: <20090105202335.4A68E6F064@alopias.GreenKey.net> References: <20090105172156.AC5036F064@alopias.GreenKey.net> <49624E68.8050804@redhat.com> <20090105202335.4A68E6F064@alopias.GreenKey.net> Message-ID: <20090105203144.3C3A86F064@alopias.GreenKey.net> Ah whoops...forgot to paste entire example. 12:23pm Curtis Doty said: > # mkfs.ext4 /dev/foo/bar 42 mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: Size of device /dev/phd/dc1a too big to be expressed in 32 bits using a blocksize of 4096. > It doesn't appear to support the blocks-count option anymore. :-( Or did it > ever? > From sandeen at redhat.com Mon Jan 5 20:41:43 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 05 Jan 2009 14:41:43 -0600 Subject: 16TiB ext4 In-Reply-To: <20090105202335.4A68E6F064@alopias.GreenKey.net> References: <20090105172156.AC5036F064@alopias.GreenKey.net> <49624E68.8050804@redhat.com> <20090105202335.4A68E6F064@alopias.GreenKey.net> Message-ID: <49627087.6050000@redhat.com> Curtis Doty wrote: > 12:16pm Eric Sandeen said: > >> The real limit, IIRC, is (2^32 - 1) blocks, or 4k shy of 16T for 4k blocks. >> >> This is a little unfortunate since "lvcreate -L16T" is so handy, but it >> won't mkfs properly. (ext3 should have the same limitation). >> >> We should probably make mkfs just silently lop off one block if it >> encounters a boundary condition like this ... >> > > Ah, thanks Eric! That would be smart. > > I'm trying to workaround, but... > > ----8<---- > # mkfs.ext4 /dev/foo/bar $[2**32-1] > mke2fs 1.41.3 (12-Oct-2008) > mkfs.ext4: Size of device /dev/phd/dc1a too big to be expressed in 32 bits > using a blocksize of 4096. > # mkfs.ext4 /dev/foo/bar 42 > # mkfs.ext4 > Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-f fragment-size] > [-i bytes-per-inode] [-I inode-size] [-J journal-options] > [-G meta group size] [-N number-of-inodes] > [-m reserved-blocks-percentage] [-o creator-os] > [-g blocks-per-group] [-L volume-label] [-M last-mounted-directory] > [-O feature[,...]] [-r fs-revision] [-E extended-option[,...]] > [-T fs-type] [-jnqvFSV] device [blocks-count] > ----8<---- > > It doesn't appear to support the blocks-count option anymore. :-( Or did > it ever? it does, and did... but it's checking the device size and erroring before it looks at the value you passed in, sigh: # ls -lh fsfile -rw-r--r-- 1 root root 16T 2009-01-05 14:30 fsfile [root at inode test]# mkfs.ext4 -b 4096 fsfile 4294967295 mke2fs 1.41.3 (12-Oct-2008) fsfile is not a block special device. Proceed anyway? (y,n) y mkfs.ext4: Size of device fsfile too big to be expressed in 32 bits using a blocksize of 4096. Unless you specify -n, not that that actually gets you anywhere! 
[root at inode test]# mkfs.ext4 -n -b 4096 fsfile 4294967295 mke2fs 1.41.3 (12-Oct-2008) fsfile is not a block special device. Proceed anyway? (y,n) y Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 1073741824 inodes, 4294967295 blocks 214748364 blocks (5.00%) reserved for the super user First data block=0 131072 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432 and one more block really does fail, though with a less-than-helpful message: [root at inode test]# mkfs.ext4 -n -b 4096 fsfile 4294967296 mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: invalid blocks count - 4294967296 I'll look into this, it should all be smarter... -Eric From adilger at sun.com Tue Jan 6 09:35:40 2009 From: adilger at sun.com (Andreas Dilger) Date: Tue, 06 Jan 2009 02:35:40 -0700 Subject: Big problem with huge number of files In-Reply-To: <20090102223826.774c1942.pegasus@nerv.eu.org> References: <495E76AF.8080702@gmail.com> <20090102223826.774c1942.pegasus@nerv.eu.org> Message-ID: <20090106093540.GL3932@webber.adilger.int> On Jan 02, 2009 22:38 +0100, Jure Pe?ar wrote: > On Fri, 02 Jan 2009 22:18:55 +0200 > Sergey Shyman wrote: > > I have an issue when I can't get directory listing for maildir with huge > > number of files inside. Neither ls, du or any other command finished > > successfully, it just running for hours without any success. Does > > anybody know how I could get directory listing and copies of my files? > > Any pointing would be great and greatly appreciated. Thanks in advance! > > Have you tried ls -U so that ls doesn't do internal sorting? > Have you tried find? GNU ls is useless in this regard, because even the "-U" option will wait until it has read all of the files before it starts printing anything. It must wait until all the data is available before deciding whether to sort or not. Using "find" will probably work very quickly. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. 
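For reference, a minimal sketch of the find-based approach suggested above (the maildir path and the destination are hypothetical and need adjusting):

find /var/mail/user/Maildir/cur -maxdepth 1 -type f > /tmp/filelist.txt
find /var/mail/user/Maildir/cur -maxdepth 1 -type f -print0 | xargs -0 cp -a -t /backup/cur/

The first command streams the listing to a file as readdir() returns entries, without the sort-and-buffer step GNU ls performs; the second copies the files out in batches, preserving timestamps (-t is GNU cp's target-directory option).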
From shirishag75 at gmail.com Wed Jan 7 14:40:23 2009 From: shirishag75 at gmail.com (shirish) Date: Wed, 7 Jan 2009 20:10:23 +0530 Subject: Big problem with huge number of files In-Reply-To: <495E76AF.8080702@gmail.com> References: <495E76AF.8080702@gmail.com> Message-ID: <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> Reply On Sat, Jan 3, 2009 at 01:48, Sergey Shyman wrote: > Hi all, Hi, > Here is info about this partition: > Filesystem volume name: > Last mounted on: > Filesystem UUID: 3395b7eb-746c-4fc1-a52e-76547ca7454d > Filesystem magic number: 0xEF53 > Filesystem revision #: 1 (dynamic) > Filesystem features: has_journal ext_attr resize_inode dir_index > filetype needs_recovery sparse_super large_file > Default mount options: (none) > Filesystem state: clean > Errors behavior: Continue > Filesystem OS type: Linux > Inode count: 30507008 > Block count: 61008816 > Reserved block count: 3050440 > Free blocks: 36021498 > Free inodes: 20268094 > First block: 0 > Block size: 4096 > Fragment size: 4096 > Reserved GDT blocks: 1024 > Blocks per group: 32768 > Fragments per group: 32768 > Inodes per group: 16384 > Inode blocks per group: 512 > Filesystem created: Thu Apr 27 23:40:04 2006 > Last mount time: Fri Jan 2 15:11:02 2009 > Last write time: Fri Jan 2 15:52:25 2009 > Mount count: 37 > Maximum mount count: -1 > Last checked: Thu Apr 27 23:40:04 2006 > Check interval: 0 () > Reserved blocks uid: 0 (user root) > Reserved blocks gid: 0 (group root) > First inode: 11 > Inode size: 128 > Journal inode: 8 > First orphan inode: 28213259 > Default directory hash: tea > Directory Hash Seed: 04e82a5e-98ca-4893-b03f-44d5f7227e8d > Journal backup: inode blocks > > This partition have noatime enabled. probably off-topic to the thread but how were u able to get the above info. Which command/tool did you use to get the above? > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Regards, Shirish Agarwal This email is licensed under http://creativecommons.org/licenses/by-nc/3.0/ http://flossexperiences.wordpress.com 065C 6D79 A68C E7EA 52B3 8D70 950D 53FB 729A 8B17 From ulf at openlane.com Wed Jan 7 15:02:31 2009 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 7 Jan 2009 07:02:31 -0800 Subject: Big problem with huge number of files In-Reply-To: <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> References: <495E76AF.8080702@gmail.com> <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A8D4@msmpk01.corp.autc.com> > -----Original Message----- > To: Sergey Shyman > Subject: Re: Big problem with huge number of files > > Reply > > On Sat, Jan 3, 2009 at 01:48, Sergey Shyman > wrote: > > Hi all, > > Hi, > > > probably off-topic to the thread but how were u able to get the above > info. Which command/tool did you use to get the above? 
tune2fs -l From shirishag75 at gmail.com Wed Jan 7 15:46:53 2009 From: shirishag75 at gmail.com (shirish) Date: Wed, 7 Jan 2009 21:16:53 +0530 Subject: Big problem with huge number of files In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A8D4@msmpk01.corp.autc.com> References: <495E76AF.8080702@gmail.com> <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> <5DE4B7D3E79067418154C49A739C125104C4A8D4@msmpk01.corp.autc.com> Message-ID: <511f47f50901070746k5ba95d27u52c7f12bbe5444bf@mail.gmail.com> On Wed, Jan 7, 2009 at 20:32, Ulf Zimmermann wrote: Hi Ulf Zimmermann, > tune2fs -l Cool. Thank you for telling me about this tool. -- Regards, Shirish Agarwal This email is licensed under http://creativecommons.org/licenses/by-nc/3.0/ http://flossexperiences.wordpress.com 065C 6D79 A68C E7EA 52B3 8D70 950D 53FB 729A 8B17 From ulf at openlane.com Wed Jan 7 17:56:03 2009 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 7 Jan 2009 09:56:03 -0800 Subject: OT: mailing list to talk about multipath under Linux? Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A8D8@msmpk01.corp.autc.com> Not directly related to EXT FS but can anyone point me a mailing list to talk about things like device-mapper-multipath? Specific I am looking to see if anyone has maybe written a script to take SCSI devices offline for a path, to do clean shutdown of a fabric or SAN controller for maintance? Ulf Zimmermann | Senior System Architect OPENLANE 4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025 O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929 Email: ulf at openlane.com | Web: www.openlane.com From pegasus at nerv.eu.org Wed Jan 7 20:18:00 2009 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Wed, 7 Jan 2009 21:18:00 +0100 Subject: OT: mailing list to talk about multipath under Linux? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A8D8@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A8D8@msmpk01.corp.autc.com> Message-ID: <20090107211800.58bab800.pegasus@nerv.eu.org> On Wed, 7 Jan 2009 09:56:03 -0800 "Ulf Zimmermann" wrote: > Not directly related to EXT FS but can anyone point me a mailing list to > talk about things like device-mapper-multipath? Specific I am looking to > see if anyone has maybe written a script to take SCSI devices offline > for a path, to do clean shutdown of a fabric or SAN controller for > maintance? https://www.redhat.com/mailman/listinfo/dm-devel most probably? -- Jure Pe?ar http://jure.pecar.org/ From bruno at wolff.to Wed Jan 7 21:18:55 2009 From: bruno at wolff.to (Bruno Wolff III) Date: Wed, 7 Jan 2009 15:18:55 -0600 Subject: Incorrect disk usage size In-Reply-To: References: Message-ID: <20090107211855.GA5451@wolff.to> On Sat, Dec 20, 2008 at 18:37:41 -0600, Adam Flott wrote: > After an aptitude safe-upgrade of Debian's testing (as of today) my root file > system (ext3) seems to have "filled up" and I'm not sure how to get Linux to > correctly report the used size. Are you aware that there is space in file systems reserved for use only by root? That may explain your confusion. The purpose of the reserve is to allow a sysadm to allow some things to keep working even if a normal user fills up a file system. The size of the reserve on ext2/3 file systems can be changed with tune2fs. 
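For reference, a short sketch of inspecting and adjusting that reserve with tune2fs (the device name /dev/sda1 is only an example):

tune2fs -l /dev/sda1 | grep -i 'reserved block count'
tune2fs -m 1 /dev/sda1

The first command shows the current reserved block count; the second lowers the reserve from the default 5% to 1% on an existing filesystem. The same percentage can be set at mkfs time with mke2fs -m.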
From ulf at openlane.com Wed Jan 7 21:22:40 2009 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 7 Jan 2009 13:22:40 -0800 Subject: Incorrect disk usage size In-Reply-To: <20090107211855.GA5451@wolff.to> References: <20090107211855.GA5451@wolff.to> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A8E0@msmpk01.corp.autc.com> > -----Original Message----- > From: ext3-users-bounces at redhat.com [mailto:ext3-users- > bounces at redhat.com] On Behalf Of Bruno Wolff III > Sent: 01/07/2009 13:19 > To: Adam Flott > Cc: ext3-users at redhat.com > Subject: Re: Incorrect disk usage size > > On Sat, Dec 20, 2008 at 18:37:41 -0600, > Adam Flott wrote: > > After an aptitude safe-upgrade of Debian's testing (as of today) my > root file > > system (ext3) seems to have "filled up" and I'm not sure how to get > Linux to > > correctly report the used size. > > Are you aware that there is space in file systems reserved for use only > by > root? That may explain your confusion. > > The purpose of the reserve is to allow a sysadm to allow some things to > keep > working even if a normal user fills up a file system. > > The size of the reserve on ext2/3 file systems can be changed with > tune2fs. Your problem is probably files in /var, not necessary over 1GB in size. I don't know where Debian saves packages downloaded via apt, but yum for example has a /var/cache/yum and you can run "yum clean packages". I would expect apt to have something similar. From lists at nerdbynature.de Thu Jan 8 01:49:52 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 8 Jan 2009 02:49:52 +0100 (CET) Subject: Incorrect disk usage size In-Reply-To: References: Message-ID: On Sat, 20 Dec 2008, Adam Flott wrote: > $ df > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/sda1 48062440 46976212 0 100% / So, "/" is really ~45 GB in total, but: > $ du -sh -x / > 5.6G / du(1) counts only 5,6 GB? Hm, first thing that comes to mind are of course (stale) open files, which cannot be found with find(1) any more and are not freed to the fs, so df(1) does not know about it. I usually use "lsof -ln | grep deleted", but that'd be a *lot* of large, open files. > Block count: 12207384 > Reserved block count: 610369 This reserve would sum up to ~2,3 GB, but this still does not explain the difference to 45 GB. Hm. > I've looked for large files/directories via find (-type d/f -size +1G) and > fsck'ing the partition multiple times with various options, but no luck. And you unmounted or at least remounted r/o the partition for the fsck, so the open files should not even be an issue here. Strange indeed...sorry to be of no help here... C. -- BOFH excuse #39: terrorist activities From folkert at vanheusden.com Fri Jan 16 12:01:19 2009 From: folkert at vanheusden.com (Folkert van Heusden) Date: Fri, 16 Jan 2009 13:01:19 +0100 Subject: something odd with the order of files in a directory Message-ID: <20090116120119.GB29002@vanheusden.com> Hi, I noticed something odd with the order of files in a directory. When I put files in a directory in a certain order on an ext3-filesystem, the order is not kept. On fat-filesystem it does. E.g.: rm -rf t ; mkdir t touch a.a a.b a.c mv a.b t/ ; mv a.c t/ ; mv a.a t/ ls -Ula t/ I then would expect: a.b a.c a.a but instead I get drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . 
I tried adding sync between each mv but that didn't help. Folkert van Heusden -- ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com From davidlandy at clara.co.uk Fri Jan 16 12:40:18 2009 From: davidlandy at clara.co.uk (D Landy) Date: Fri, 16 Jan 2009 12:40:18 +0000 Subject: Fw: 32k Blocksize Support Message-ID: Hi again, First of all, thanks to Eric Sandeen for his offline support. I'm coming back here at his suggestion as we haven't managed to resolve it. So far, we've established that it *is* an ext2 filesystem (using file -s), and that resize2fs reports that it has an invalid superblock. Eric wrote: > I'd probably dig into why resize2fs says it's corrupt; large block > should not mean corrupt, AFAIK, even if the running kernel can't > actually mount it. > > You might get this back on-list, too, so future generations can benefit > from your pain (and in case someone else knows these answers). Does anyone know if a 32k blocksize would cause resize2fs to report an invalid superblock? I've downloaded the source code and from what I can see the maximum block size is 64k, so I wouldn't have thought so - but I'm not a C programmer and have trouble following the source sometimes. I'd appreciate another set of eyes going over the code... Any help greatly appreciated. David From sandeen at redhat.com Fri Jan 16 15:32:24 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 16 Jan 2009 09:32:24 -0600 Subject: Fw: 32k Blocksize Support In-Reply-To: References: Message-ID: <4970A888.7070701@redhat.com> D Landy wrote: > Hi again, > > First of all, thanks to Eric Sandeen for his offline support. > > I'm coming back here at his suggestion as we haven't managed to resolve it. > > So far, we've established that it *is* an ext2 filesystem (using file -s), > and that resize2fs reports that it has an invalid superblock. > > Eric wrote: > >> I'd probably dig into why resize2fs says it's corrupt; large block >> should not mean corrupt, AFAIK, even if the running kernel can't >> actually mount it. >> >> You might get this back on-list, too, so future generations can benefit >> from your pain (and in case someone else knows these answers). > > Does anyone know if a 32k blocksize would cause resize2fs to report an > invalid superblock? 
I've downloaded the source code and from what I can see > the maximum block size is 64k, so I wouldn't have thought so - but I'm not a > C programmer and have trouble following the source sometimes. > > I'd appreciate another set of eyes going over the code... > > Any help greatly appreciated. I don't know if they're using a standard ext3 fs or not; perhaps it is adultrated in some way for their needs that makes it incompatible w/ the upstream tools. You could go through the code to find where that message is printed, then work backwards to why (either via gdb, or printf insertions, or whatever you're comfortable with...) -Eric From sandeen at redhat.com Fri Jan 16 15:35:24 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 16 Jan 2009 09:35:24 -0600 Subject: something odd with the order of files in a directory In-Reply-To: <20090116120119.GB29002@vanheusden.com> References: <20090116120119.GB29002@vanheusden.com> Message-ID: <4970A93C.5010709@redhat.com> Folkert van Heusden wrote: > Hi, > > I noticed something odd with the order of files in a directory. > When I put files in a directory in a certain order on an > ext3-filesystem, the order is not kept. On fat-filesystem it does. > E.g.: > rm -rf t ; mkdir t > touch a.a a.b a.c > mv a.b t/ ; mv a.c t/ ; mv a.a t/ > ls -Ula t/ > > I then would expect: > a.b > a.c > a.a > > but instead I get > drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a > drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . > > I tried adding sync between each mv but that didn't help. This is due to the dir_index feature; you're getting them back in hash (read: random) order. If you turn it off: [root at inode mnt]# tune2fs -O ^dir_index /dev/sdb4 you'll get what you expect: [root at inode test]# rm -rf t ; mkdir t [root at inode test]# touch a.a a.b a.c [root at inode test]# mv a.b t/ ; mv a.c t/ ; mv a.a t/ [root at inode test]# ls -Ula t/ total 8 drwxr-xr-x 2 root root 4096 2009-01-16 15:30 . drwxr-xr-x 4 root root 4096 2009-01-16 15:30 .. -rw-r--r-- 1 root root 0 2009-01-16 15:30 a.b -rw-r--r-- 1 root root 0 2009-01-16 15:30 a.c -rw-r--r-- 1 root root 0 2009-01-16 15:30 a.a but you'll lose the other efficiencies of the dir_index feature. -Eric From folkert at vanheusden.com Fri Jan 16 15:44:24 2009 From: folkert at vanheusden.com (Folkert van Heusden) Date: Fri, 16 Jan 2009 16:44:24 +0100 Subject: something odd with the order of files in a directory In-Reply-To: <4970A93C.5010709@redhat.com> References: <20090116120119.GB29002@vanheusden.com> <4970A93C.5010709@redhat.com> Message-ID: <20090116154424.GH29002@vanheusden.com> > > When I put files in a directory in a certain order on an > > ext3-filesystem, the order is not kept. On fat-filesystem it does. > > This is due to the dir_index feature; you're getting them back in hash > (read: random) order. If you turn it off: Ah ok, thanks! Folkert van Heusden -- MultiTail er et flexible tool for ? kontrolere Logfiles og commandoer. Med filtrer, farger, sammenf?ringer, forskeliger ansikter etc. 
http://www.vanheusden.com/multitail/ ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com From davidlandy at clara.co.uk Sun Jan 18 09:33:11 2009 From: davidlandy at clara.co.uk (D Landy) Date: Sun, 18 Jan 2009 09:33:11 +0000 Subject: Fw: 32k Blocksize Support Message-ID: Eric Sandeen wrote: > I don't know if they're using a standard ext3 fs or not; perhaps it is > adultrated in some way for their needs that makes it incompatible w/ the > upstream tools. > > You could go through the code to find where that message is printed, > then work backwards to why (either via gdb, or printf insertions, or > whatever you're comfortable with...) Thanks, Eric, that's exactly what I've done. :-) Unfortunately there are many different error conditions that could result in an "invalid superblock" message and it seems like it would be a hard job (at least for me!) to work out which one it was as I don't know how to compile a package or even how to get the right source code for Puppy Linux (which I think is almost Debian compatible). I guess this is going off-topic now and I should ask on other lists for help with that? Any assistance appreciated. David From sandeen at redhat.com Mon Jan 19 17:10:10 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 19 Jan 2009 11:10:10 -0600 Subject: Fw: 32k Blocksize Support In-Reply-To: References: Message-ID: <4974B3F2.2070009@redhat.com> D Landy wrote: > Eric Sandeen wrote: > >> I don't know if they're using a standard ext3 fs or not; perhaps it is >> adultrated in some way for their needs that makes it incompatible w/ the >> upstream tools. >> >> You could go through the code to find where that message is printed, >> then work backwards to why (either via gdb, or printf insertions, or >> whatever you're comfortable with...) > > Thanks, Eric, that's exactly what I've done. > > :-) > > Unfortunately there are many different error conditions that could result in > an "invalid superblock" message and it seems like it would be a hard job (at > least for me!) to work out which one it was as I don't know how to compile a > package or even how to get the right source code for Puppy Linux (which I > think is almost Debian compatible). > > I guess this is going off-topic now and I should ask on other lists for help > with that? > > Any assistance appreciated. > > David You could make an e2image and hope someone has enough spare time (I'm afraid I don't at the moment) to take a look. (assuming e2image will touch it....) -Eric From lists at nerdbynature.de Fri Jan 23 09:19:12 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 23 Jan 2009 10:19:12 +0100 (CET) Subject: something odd with the order of files in a directory (fwd) Message-ID: On Fri, 16 Jan 2009, Folkert van Heusden wrote: > I then would expect: > a.b > a.c > a.a > > but instead I get > drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a > drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . Hm, is this reproducible? Which kernel, mount-options, arch? Here on 2.6.24/amd64 the "directory order" (GNU/ls -U resp. 
BSD/ls -f) seems to work as expected: $ touch 1 2 3 $ mv 2 t/ ; mv 3 t/; mv 1 t/ $ ls -Ugo --time-style=full-iso t/ -rw-r----- 1 0 2009-01-22 15:50:01.414115303 +0100 2 -rw-r----- 1 0 2009-01-22 15:50:01.414115303 +0100 3 -rw-r----- 1 0 2009-01-22 15:50:01.414115303 +0100 1 Christian. -- BOFH excuse #175: OS swapped to disk From alexfler at msn.com Fri Jan 23 11:10:47 2009 From: alexfler at msn.com (Alex Fler) Date: Fri, 23 Jan 2009 06:10:47 -0500 Subject: Reserved block count for Large Filesystem Message-ID: Hi All, On large FS like 100gb default value of "Reserved block count" takes 5% of usable disk, can this value be safely changed to 1% and not affect a performance ? Is a reservation size of 1gb enough for 100gb disk ? And when we have even larger filesystem like 1Tb default "Reserved block count" is 50GB, is it an absolutely minimum must have reserved number of space for disk performance, or it's just a legacy concept which can be adjusted? Thanks in advance Alex Fler _________________________________________________________________ Windows Live? Hotmail??more than just e-mail. http://windowslive.com/howitworks?ocid=TXT_TAGLM_WL_t2_hm_justgotbetter_howitworks_012009 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pegasus at nerv.eu.org Fri Jan 23 11:26:10 2009 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Fri, 23 Jan 2009 12:26:10 +0100 Subject: Reserved block count for Large Filesystem In-Reply-To: References: Message-ID: <20090123122610.882548d3.pegasus@nerv.eu.org> On Fri, 23 Jan 2009 06:10:47 -0500 Alex Fler wrote: > > Hi All, > > On large FS like 100gb default value of "Reserved block count" takes 5% > of usable disk, can this value be safely changed to 1% and not affect a > performance ? Is a reservation size of 1gb enough for 100gb disk ? And > when we have even larger filesystem like 1Tb default "Reserved block > count" is 50GB, is it an absolutely minimum must have reserved number of > space for disk performance, or it's just a legacy concept which can be > adjusted? These days I simply mkfs all my large non-root and non-var filesystems with -m 0, setting reserved block count to 0%. -- Jure Pe?ar http://jure.pecar.org http://f5j.eu From tytso at mit.edu Fri Jan 23 16:58:24 2009 From: tytso at mit.edu (Theodore Tso) Date: Fri, 23 Jan 2009 11:58:24 -0500 Subject: Reserved block count for Large Filesystem In-Reply-To: References: Message-ID: <20090123165824.GO14966@mit.edu> On Fri, Jan 23, 2009 at 06:10:47AM -0500, Alex Fler wrote: > > On large FS like 100gb default value of "Reserved block count" takes > 5% of usable disk, can this value be safely changed to 1% and not > affect a performance ? Is a reservation size of 1gb enough for 100gb > disk ? And when we have even larger filesystem like 1Tb default > "Reserved block count" is 50GB, is it an absolutely minimum must > have reserved number of space for disk performance, or it's just a > legacy concept which can be adjusted? If you set the reserved block count to zero, it won't affect performance much except if you run for long periods of time (with lots of file creates and deletes) while the filesystem is almost full (i.e., say above 95%), at which point you'll be subject to fragmentation problems. 
Ext4's multi-block allocator is much more fragmentation resistant, because it tries much harder to find contiguous blocks, so even if you don't enable the other ext4 features, you'll see better results simply mounting an ext3 filesystem using ext4 before the filesystem gets completely full. If you are just using the filesystem for long-term archive, where files aren't changing very often (i.e., a huge mp3 or video store), it obviously won't matter. - Ted From adilger at sun.com Fri Jan 23 22:03:18 2009 From: adilger at sun.com (Andreas Dilger) Date: Fri, 23 Jan 2009 15:03:18 -0700 Subject: something odd with the order of files in a directory (fwd) In-Reply-To: References: Message-ID: <20090123220318.GU3652@webber.adilger.int> On Jan 23, 2009 10:19 +0100, Christian Kujau wrote: > On Fri, 16 Jan 2009, Folkert van Heusden wrote: >> I then would expect: >> a.b >> a.c >> a.a >> >> but instead I get >> drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. >> -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c >> -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b >> -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a >> drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . > > Hm, is this reproducible? Which kernel, mount-options, arch? > Here on 2.6.24/amd64 the "directory order" (GNU/ls -U resp. BSD/ls -f) > seems to work as expected: There is no such thing as "directory order" in Unix. It can change at any time, with the caveat that a single process doing a single readdir() will get each entry existing at the start and end of readdir exactly once. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From cmiyata at lycos.com Tue Jan 27 17:57:38 2009 From: cmiyata at lycos.com (Cristina Miyata) Date: Tue, 27 Jan 2009 12:57:38 -0500 (EST) Subject: ext3_journal_start_sb: Detected aborted journal Message-ID: <20090127125738.HM.0000000000002Gx@cmiyata.mail-wwl23.bo3.lycos.com.lycos.com> Dear Ext3 Users, We are running RHEL 4 AS (2.6.9-67.ELsmp) on a Sun X4200 M2 machine with 2 146GB disks in RAID1. For no apparent reason, an ext3 filesystem got an error and was remounted read-only. => /var/log/messages Jan 21 22:34:32 SPJAG01-SM02 kernel: EXT3-fs error (device sda8): ext3_journal_start_sb: Detected aborted journal Jan 21 22:34:32 SPJAG01-SM02 kernel: Remounting filesystem read-only I've checked the RedHat bug 323921 (https://bugzilla.redhat.com/show_bug.cgi?id=213921) and saw that it could cause this problem and that it was fixed in kernel versions 2.6.9-42.0.7.EL and later. Does anyone know if there is another RedHat bug that could cause such a problem? Or another reason that is not a hardware problem (Sun tech support said that there is no hardware problem)? Thank you for your attention. 
Regards, Cristina Miyata From adilger at sun.com Tue Jan 27 22:03:38 2009 From: adilger at sun.com (Andreas Dilger) Date: Tue, 27 Jan 2009 15:03:38 -0700 Subject: ext3_journal_start_sb: Detected aborted journal In-Reply-To: <20090127125738.HM.0000000000002Gx@cmiyata.mail-wwl23.bo3.lycos.com.lycos.com> References: <20090127125738.HM.0000000000002Gx@cmiyata.mail-wwl23.bo3.lycos.com.lycos.com> Message-ID: <20090127220338.GV3652@webber.adilger.int> On Jan 27, 2009 12:57 -0500, Cristina Miyata wrote: > => /var/log/messages > > Jan 21 22:34:32 SPJAG01-SM02 kernel: EXT3-fs error (device sda8): ext3_journal_start_sb: Detected aborted journal > Jan 21 22:34:32 SPJAG01-SM02 kernel: Remounting filesystem read-only Are there messages that mention "JBD" or "journal" or your disk that indicate why the journal was aborted? Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From nicolas.kowalski at gmail.com Fri Jan 30 13:53:29 2009 From: nicolas.kowalski at gmail.com (Nicolas KOWALSKI) Date: Fri, 30 Jan 2009 14:53:29 +0100 Subject: barrier and commit options? Message-ID: <20090130135329.GW20896@petole.demisel.net> Hello, On my home server (Debian etch, custom 2.6.28.2 kernel), I am using ext3 for both root and /home filesystems, with barriers enabled to prevent corruption caused by my PATA disk write cache. Looking for a better performance, I have also set the commit=nr option as described in linux-2.6.28.2/Documentation/filesystems/ext3.txt, so that I now have: niko at petole:~$ mount -t ext3 /dev/sda1 on / type ext3 (rw,noatime,commit=30,barrier=1) /dev/sda3 on /home type ext3 (rw,noatime,commit=30,barrier=1) I know I may loose the last 30 seconds of "work" (it's just a home server), but is the filesystem at risk (corruption, whatever, ...) with these mount options ? Thanks, -- Nicolas From lists at nerdbynature.de Fri Jan 30 15:17:54 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 30 Jan 2009 16:17:54 +0100 (CET) Subject: barrier and commit options? In-Reply-To: <20090130135329.GW20896@petole.demisel.net> References: <20090130135329.GW20896@petole.demisel.net> Message-ID: On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: > I know I may loose the last 30 seconds of "work" (it's just a home > server), but is the filesystem at risk (corruption, whatever, ...) with > these mount options ? No, why would it? If certain mount options would make a filesystem prone to corruption I'd consider this a bug. So apart from losing a few more seconds of work in case of an error, the fs should be fine. C. -- BOFH excuse #199: the curls in your keyboard cord are losing electricity. From sandeen at redhat.com Fri Jan 30 15:22:46 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 30 Jan 2009 10:22:46 -0500 Subject: barrier and commit options? In-Reply-To: References: <20090130135329.GW20896@petole.demisel.net> Message-ID: <49831B46.5080202@redhat.com> Christian Kujau wrote: > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: >> I know I may loose the last 30 seconds of "work" (it's just a home >> server), but is the filesystem at risk (corruption, whatever, ...) with >> these mount options ? > > No, why would it? If certain mount options would make a filesystem prone > to corruption I'd consider this a bug. Well, that's not exactly true. Turning off barriers, depending on your storage, could lead to corruption in some cases. Mounting with data=writeback can expose stale data, which could even be a security issue. 
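As an illustration of the options under discussion, a hypothetical /etc/fstab line for an ext3 /home (the device and mount point are examples only):

/dev/sda3  /home  ext3  noatime,barrier=1,commit=30  0  2

barrier=1 and commit=30 match the mount output quoted earlier in the thread; data=writeback would go in the same options field, with the trade-offs noted above.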
But as long as you make these decisions consciously, they may fit your needs. > So apart from losing a few more > seconds of work in case of an error, the fs should be fine. This part is correct, barriers on and longer commit time should not affect filesystem consistency / integrity. -Eric > C. From nicolas.kowalski at gmail.com Fri Jan 30 15:25:47 2009 From: nicolas.kowalski at gmail.com (Nicolas KOWALSKI) Date: Fri, 30 Jan 2009 16:25:47 +0100 Subject: barrier and commit options? In-Reply-To: References: <20090130135329.GW20896@petole.demisel.net> Message-ID: <20090130152547.GA2068@petole.demisel.net> On Fri, Jan 30, 2009 at 04:17:54PM +0100, Christian Kujau wrote: > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: >> I know I may loose the last 30 seconds of "work" (it's just a home >> server), but is the filesystem at risk (corruption, whatever, ...) with >> these mount options ? > > No, why would it? If certain mount options would make a filesystem prone > to corruption I'd consider this a bug. Well, not using barrier=1 with disk write cache enabled may cause corruption apparently... > So apart from losing a few more seconds of work in case of an error, > the fs should be fine. Fine. :) Thanks for your reply, -- Nicolas From nicolas.kowalski at gmail.com Fri Jan 30 15:30:21 2009 From: nicolas.kowalski at gmail.com (Nicolas KOWALSKI) Date: Fri, 30 Jan 2009 16:30:21 +0100 Subject: barrier and commit options? In-Reply-To: <49831B46.5080202@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> Message-ID: <20090130153021.GB2068@petole.demisel.net> On Fri, Jan 30, 2009 at 10:22:46AM -0500, Eric Sandeen wrote: > Christian Kujau wrote: > > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: > >> I know I may loose the last 30 seconds of "work" (it's just a home > >> server), but is the filesystem at risk (corruption, whatever, ...) with > >> these mount options ? > > > > No, why would it? If certain mount options would make a filesystem prone > > to corruption I'd consider this a bug. > > Well, that's not exactly true. Turning off barriers, depending on your > storage, could lead to corruption in some cases. Mounting with > data=writeback can expose stale data, which could even be a security issue. > > But as long as you make these decisions consciously, they may fit your > needs. > > > So apart from losing a few more > > seconds of work in case of an error, the fs should be fine. > > This part is correct, barriers on and longer commit time should not > affect filesystem consistency / integrity. Ok, I'm more relaxed about my data then. :) Thanks for your reply, -- Nicolas From Mike.Miller at hp.com Fri Jan 30 15:34:14 2009 From: Mike.Miller at hp.com (Miller, Mike (OS Dev)) Date: Fri, 30 Jan 2009 15:34:14 +0000 Subject: barrier and commit options? In-Reply-To: <49831B46.5080202@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> Message-ID: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> Eric wrote: > > Christian Kujau wrote: > > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: > >> I know I may loose the last 30 seconds of "work" (it's just a home > >> server), but is the filesystem at risk (corruption, whatever, ...) > >> with these mount options ? > > > > No, why would it? If certain mount options would make a filesystem > > prone to corruption I'd consider this a bug. > > Well, that's not exactly true. 
Turning off barriers, > depending on your storage, could lead to corruption in some I hope this a proper forum for this inquiry. I'm the maintainer of the HP Smart Array driver, cciss. We've had requests and now a bug report to support write barriers. It seems that write barriers are primarily intended to ensure the proper ordering of data from the disks write cache to the medium. Is this accurate? Thanks, -- mikem > cases. Mounting with data=writeback can expose stale data, > which could even be a security issue. > > But as long as you make these decisions consciously, they may > fit your needs. > > > So apart from losing a few more > > seconds of work in case of an error, the fs should be fine. > > This part is correct, barriers on and longer commit time > should not affect filesystem consistency / integrity. > > -Eric > > > C. > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From rwheeler at redhat.com Fri Jan 30 15:40:14 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Fri, 30 Jan 2009 10:40:14 -0500 Subject: barrier and commit options? In-Reply-To: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> Message-ID: <49831F5E.6000506@redhat.com> Miller, Mike (OS Dev) wrote: > Eric wrote: > >> Christian Kujau wrote: >> >>> On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: >>> >>>> I know I may loose the last 30 seconds of "work" (it's just a home >>>> server), but is the filesystem at risk (corruption, whatever, ...) >>>> with these mount options ? >>>> >>> No, why would it? If certain mount options would make a filesystem >>> prone to corruption I'd consider this a bug. >>> >> Well, that's not exactly true. Turning off barriers, >> depending on your storage, could lead to corruption in some >> > > I hope this a proper forum for this inquiry. I'm the maintainer of the HP Smart Array driver, cciss. We've had requests and now a bug report to support write barriers. > It seems that write barriers are primarily intended to ensure the proper ordering of data from the disks write cache to the medium. Is this accurate? > > Thanks, > -- mikem > > Hi Mike, Without working barriers, you are especially open to metadata corruption - If I remember the details correctly, Chris Mason has demonstrated a 50% chance of corruption directory entries in ext3 for example. In addition, barriers allows fsync to have real meaning since the target storage will flush its write cache & the user will have that fsync() data after a power outage. If you have a battery backed write cache (say, in a high end array) barriers can be ignored since the storage can effectively make that write cache non-volatile, but otherwise, this is pretty key for anyone wanting to maintain data integrity, Regards, Ric From Mike.Miller at hp.com Fri Jan 30 15:56:33 2009 From: Mike.Miller at hp.com (Miller, Mike (OS Dev)) Date: Fri, 30 Jan 2009 15:56:33 +0000 Subject: barrier and commit options? 
In-Reply-To: <49831F5E.6000506@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> Message-ID: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> Ric Wheeler wrote: > > I hope this a proper forum for this inquiry. I'm the > maintainer of the HP Smart Array driver, cciss. We've had > requests and now a bug report to support write barriers. > > It seems that write barriers are primarily intended to > ensure the proper ordering of data from the disks write cache > to the medium. Is this accurate? > > > > Thanks, > > -- mikem > > > > > Hi Mike, > > Without working barriers, you are especially open to metadata > corruption > - If I remember the details correctly, Chris Mason has > demonstrated a 50% chance of corruption directory entries in > ext3 for example. > > In addition, barriers allows fsync to have real meaning since > the target storage will flush its write cache & the user will > have that fsync() data after a power outage. > > If you have a battery backed write cache (say, in a high end > array) barriers can be ignored since the storage can > effectively make that write cache non-volatile, but > otherwise, this is pretty key for anyone wanting to maintain > data integrity, > Hi Ric, That's what I getting at, array controllers with a battery backed write cache (BBWC). We disable the write cache on the physical disks and provide no mechanism to re-enable the cache except in some SATA configurations. So my real question is this: Given the fact that many Smart Array controllers ship with a BBWC, will write barriers offer any benefit? I think fsync does nothing on SA since it doesn't know how to flush the controller cache. If a user has no BBWC then all writes are completed all the way down to the disk medium before the command is completed back up to the driver. Thanks, -- mikem > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From rwheeler at redhat.com Fri Jan 30 16:03:51 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Fri, 30 Jan 2009 11:03:51 -0500 Subject: barrier and commit options? In-Reply-To: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> Message-ID: <498324E7.3000705@redhat.com> Miller, Mike (OS Dev) wrote: > Ric Wheeler wrote: > > >>> I hope this a proper forum for this inquiry. I'm the >>> >> maintainer of the HP Smart Array driver, cciss. We've had >> requests and now a bug report to support write barriers. >> >>> It seems that write barriers are primarily intended to >>> >> ensure the proper ordering of data from the disks write cache >> to the medium. Is this accurate? >> >>> Thanks, >>> -- mikem >>> >>> >>> >> Hi Mike, >> >> Without working barriers, you are especially open to metadata >> corruption >> - If I remember the details correctly, Chris Mason has >> demonstrated a 50% chance of corruption directory entries in >> ext3 for example. 
>> >> In addition, barriers allows fsync to have real meaning since >> the target storage will flush its write cache & the user will >> have that fsync() data after a power outage. >> >> If you have a battery backed write cache (say, in a high end >> array) barriers can be ignored since the storage can >> effectively make that write cache non-volatile, but >> otherwise, this is pretty key for anyone wanting to maintain >> data integrity, >> >> > Hi Ric, > That's what I getting at, array controllers with a battery backed write cache (BBWC). We disable the write cache on the physical disks and provide no mechanism to re-enable the cache except in some SATA configurations. > > So my real question is this: Given the fact that many Smart Array controllers ship with a BBWC, will write barriers offer any benefit? I think fsync does nothing on SA since it doesn't know how to flush the controller cache. > > If a user has no BBWC then all writes are completed all the way down to the disk medium before the command is completed back up to the driver. > > Thanks, > -- mikem > In this case (or whenever the write cache is disabled on the disk) the barrier ops don't do anything for us... Some devices simply ignore the flush commands (imagine flushing the gigabytes in an enterprise array on each transaction commit), others might return an error on the flush command itself (which should be handled correctly). I don't think that you need to add support if the HBA has a battery backed cache and the target drives have disabled write caches... Ric > >> _______________________________________________ >> Ext3-users mailing list >> Ext3-users at redhat.com >> https://www.redhat.com/mailman/listinfo/ext3-users >> From tytso at mit.edu Fri Jan 30 22:02:45 2009 From: tytso at mit.edu (Theodore Tso) Date: Fri, 30 Jan 2009 17:02:45 -0500 Subject: barrier and commit options? In-Reply-To: <498324E7.3000705@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> <498324E7.3000705@redhat.com> Message-ID: <20090130220245.GA27950@mit.edu> >>> - If I remember the details correctly, Chris Mason has demonstrated a >>> 50% chance of corruption directory entries in ext3 for example. Chris Mason has a script which forces the system to be under a lot of memory pressure, and in that scenario, it is highly likely that without barriers, there will be filesystem corruptions if the system is abruptly turned off while his script is running. Andrew Monrton has been resistant in making barriers=1 be the default for ext3 because (as I understand it) he disbelieves that this is an adequate real-world example, and there is a real performance hit to running without barriers. >>> If you have a battery backed write cache (say, in a high end array) >>> barriers can be ignored since the storage can effectively make that >>> write cache non-volatile, but otherwise, this is pretty key for >>> anyone wanting to maintain data integrity, >>> >> That's what I getting at, array controllers with a battery backed >> write cache (BBWC). We disable the write cache on the physical >> disks and provide no mechanism to re-enable the cache except in >> some SATA configurations. Well, we still need the barrier on the block I/O elevantor side to make sure that requests don't get reordered in the block layer. 
But what you're saying is that once the write is posted to the array, it is guaranteed that it is on "stable storage" (even if it is BBWC) such that if someone hits the Big Red Switch at the exit to the data center, and power is forcibly cut from the entire data center in case of a fire, the battery will still keep the cache alive, at least until the sprinklers go off, anyway, right? :-) In that case, I suspect the right thing for the cciss array to do is to ignore the barrier, but not to return an error. If you return an error, and refuse the write with barrier operation (which is what the cciss driver seems to be doing starting in 2.6.29-rcX), ext4 will retry the write without the barrier, at which point we are vulnerable to the block layer reordering things at the I/O scheduler layer. In effect, you're claiming that every single write to cciss is implicitly a "barrier write" in that once it is received by the device, it is guaranteed not to be lost even if the power to the entire system is forcibly removed. - Ted From rwheeler at redhat.com Sat Jan 31 12:45:06 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Sat, 31 Jan 2009 07:45:06 -0500 Subject: barrier and commit options? In-Reply-To: <20090130220245.GA27950@mit.edu> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> <498324E7.3000705@redhat.com> <20090130220245.GA27950@mit.edu> Message-ID: <498447D2.1030106@redhat.com> Theodore Tso wrote: >>>> - If I remember the details correctly, Chris Mason has demonstrated a >>>> 50% chance of corruption directory entries in ext3 for example. >>>> > > Chris Mason has a script which forces the system to be under a lot of > memory pressure, and in that scenario, it is highly likely that > without barriers, there will be filesystem corruptions if the system > is abruptly turned off while his script is running. > > Andrew Monrton has been resistant in making barriers=1 be the default > for ext3 because (as I understand it) he disbelieves that this is an > adequate real-world example, and there is a real performance hit to > running without barriers. > > >>>> If you have a battery backed write cache (say, in a high end array) >>>> barriers can be ignored since the storage can effectively make that >>>> write cache non-volatile, but otherwise, this is pretty key for >>>> anyone wanting to maintain data integrity, >>>> >>>> >>> That's what I getting at, array controllers with a battery backed >>> write cache (BBWC). We disable the write cache on the physical >>> disks and provide no mechanism to re-enable the cache except in >>> some SATA configurations. >>> > > Well, we still need the barrier on the block I/O elevantor side to > make sure that requests don't get reordered in the block layer. But > what you're saying is that once the write is posted to the array, it > is guaranteed that it is on "stable storage" (even if it is BBWC) such > that if someone hits the Big Red Switch at the exit to the data > center, and power is forcibly cut from the entire data center in case > of a fire, the battery will still keep the cache alive, at least until > the sprinklers go off, anyway, right? :-) > Yes, true.... > In that case, I suspect the right thing for the cciss array to do is > to ignore the barrier, but not to return an error. 
If you return an > error, and refuse the write with barrier operation (which is what the > cciss driver seems to be doing starting in 2.6.29-rcX), ext4 will > retry the write without the barrier, at which point we are vulnerable > to the block layer reordering things at the I/O scheduler layer. In > effect, you're claiming that every single write to cciss is implicitly > a "barrier write" in that once it is received by the device, it is > guaranteed not to be lost even if the power to the entire system is > forcibly removed. > > - Ted > > > Aren't barriers still tied to the state of the write cache on the target drive? In other words, if the write cache is off, we disable barriers automatically. I think that this happens for scsi in sd_revalidate_disk(). In this case, it sounds like we have tangled the need to flush a drive's write cache with the need to not re-order IO in the elevator code. Ric
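For reference, a small sketch of how one might check whether a drive's volatile write cache is enabled, which is what makes barriers matter in the first place (device names are examples only):

hdparm -W /dev/sda
hdparm -W0 /dev/sda

The first command queries the current write-caching setting on an ATA disk; the second disables the cache, trading performance for safety when barriers are unavailable. For SCSI disks the equivalent is the WCE bit in the caching mode page, e.g. sdparm --get=WCE /dev/sdb.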