From sergey.shyman at gmail.com Fri Jan 2 20:18:55 2009 From: sergey.shyman at gmail.com (Sergey Shyman) Date: Fri, 02 Jan 2009 22:18:55 +0200 Subject: Big problem with huge number of files Message-ID: <495E76AF.8080702@gmail.com> Hi all, I have an issue when I can't get directory listing for maildir with huge number of files inside. Neither ls, du or any other command finished successfully, it just running for hours without any success. Does anybody know how I could get directory listing and copies of my files? Any pointing would be great and greatly appreciated. Thanks in advance! Here is info about this partition: Filesystem volume name: Last mounted on: Filesystem UUID: 3395b7eb-746c-4fc1-a52e-76547ca7454d Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file Default mount options: (none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 30507008 Block count: 61008816 Reserved block count: 3050440 Free blocks: 36021498 Free inodes: 20268094 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 1024 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 16384 Inode blocks per group: 512 Filesystem created: Thu Apr 27 23:40:04 2006 Last mount time: Fri Jan 2 15:11:02 2009 Last write time: Fri Jan 2 15:52:25 2009 Mount count: 37 Maximum mount count: -1 Last checked: Thu Apr 27 23:40:04 2006 Check interval: 0 () Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 First orphan inode: 28213259 Default directory hash: tea Directory Hash Seed: 04e82a5e-98ca-4893-b03f-44d5f7227e8d Journal backup: inode blocks This partition have noatime enabled. From pegasus at nerv.eu.org Fri Jan 2 21:38:26 2009 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Fri, 2 Jan 2009 22:38:26 +0100 Subject: Big problem with huge number of files In-Reply-To: <495E76AF.8080702@gmail.com> References: <495E76AF.8080702@gmail.com> Message-ID: <20090102223826.774c1942.pegasus@nerv.eu.org> On Fri, 02 Jan 2009 22:18:55 +0200 Sergey Shyman wrote: > Hi all, > > I have an issue when I can't get directory listing for maildir with huge > number of files inside. Neither ls, du or any other command finished > successfully, it just running for hours without any success. Does > anybody know how I could get directory listing and copies of my files? > Any pointing would be great and greatly appreciated. Thanks in advance! Have you tried ls -U so that ls doesn't do internal sorting? Have you tried find? -- Jure Pe?ar http://jure.pecar.org/ From Curtis at GreenKey.net Mon Jan 5 17:21:56 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Mon, 5 Jan 2009 09:21:56 -0800 (PST) Subject: 16TiB ext4 Message-ID: <20090105172156.AC5036F064@alopias.GreenKey.net> I'm horsing around with ext4 again. This time on Fedora 10. Is there any sane reason why I cannot use the *full* 16TiB volume? ----8<---- # vgcreate foo /dev/mapper/mpath* Volume group "foo" successfully created # lvcreate -L16T -nbar foo Logical volume "bar" created # mkfs.ext4 -Tlargefile4 /dev/foo/bar mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: Size of device /dev/foo/bar too big to be expressed in 32 bits using a blocksize of 4096. ----8<---- But it appears to *really* allow up to one PE less than the full 16TiB, why? 
----8<---- # vgdisplay foo --- Volume group --- VG Name foo System ID Format lvm2 Metadata Areas 2 Metadata Sequence No 2 VG Access read/write VG Status resizable MAX LV 0 Cur LV 1 Open LV 0 Max PV 0 Cur PV 2 Act PV 2 VG Size 18.19 TB PE Size 4.00 MB Total PE 4769266 Alloc PE / Size 4194304 / 16.00 TB Free PE / Size 574962 / 2.19 TB VG UUID tPk8uJ-gIYZ-GJSU-ssob-IoYu-8AUp-pHKALO # lvremove -f foo/bar Logical volume "bar" successfully removed # lvcreate -l4194303 -nbar foo Logical volume "bar" created # mkfs.ext4 -Tlargefile4 /dev/foo/bar mke2fs 1.41.3 (12-Oct-2008) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 1073741824 inodes, 4294966272 blocks 214748313 blocks (5.00%) reserved for the super user First data block=0 131072 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432 Writing inode tables: done Creating journal (32768 blocks): done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 35 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. ----8<---- In my use case, I'm using much larger PEs, so the loss if just one is significant. Is this a bug in my thinking? Or in the userland tools? ../C From sandeen at redhat.com Mon Jan 5 18:16:08 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 05 Jan 2009 12:16:08 -0600 Subject: 16TiB ext4 In-Reply-To: <20090105172156.AC5036F064@alopias.GreenKey.net> References: <20090105172156.AC5036F064@alopias.GreenKey.net> Message-ID: <49624E68.8050804@redhat.com> Curtis Doty wrote: > I'm horsing around with ext4 again. This time on Fedora 10. Is there any > sane reason why I cannot use the *full* 16TiB volume? > > ----8<---- > # vgcreate foo /dev/mapper/mpath* > Volume group "foo" successfully created > # lvcreate -L16T -nbar foo > Logical volume "bar" created > # mkfs.ext4 -Tlargefile4 /dev/foo/bar > mke2fs 1.41.3 (12-Oct-2008) > mkfs.ext4: Size of device /dev/foo/bar too big to be expressed in 32 bits > using a blocksize of 4096. > ----8<---- > > But it appears to *really* allow up to one PE less than the full 16TiB, > why? The real limit, IIRC, is (2^32 - 1) blocks, or 4k shy of 16T for 4k blocks. This is a little unfortunate since "lvcreate -L16T" is so handy, but it won't mkfs properly. (ext3 should have the same limitation). We should probably make mkfs just silently lop off one block if it encounters a boundary condition like this ... -Eric From Curtis at GreenKey.net Mon Jan 5 20:23:35 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Mon, 5 Jan 2009 12:23:35 -0800 (PST) Subject: 16TiB ext4 In-Reply-To: <49624E68.8050804@redhat.com> References: <20090105172156.AC5036F064@alopias.GreenKey.net> <49624E68.8050804@redhat.com> Message-ID: <20090105202335.4A68E6F064@alopias.GreenKey.net> 12:16pm Eric Sandeen said: > The real limit, IIRC, is (2^32 - 1) blocks, or 4k shy of 16T for 4k blocks. > > This is a little unfortunate since "lvcreate -L16T" is so handy, but it > won't mkfs properly. (ext3 should have the same limitation). > > We should probably make mkfs just silently lop off one block if it > encounters a boundary condition like this ... > Ah, thanks Eric! That would be smart. I'm trying to workaround, but... 
----8<---- # mkfs.ext4 /dev/foo/bar $[2**32-1] mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: Size of device /dev/phd/dc1a too big to be expressed in 32 bits using a blocksize of 4096. # mkfs.ext4 /dev/foo/bar 42 # mkfs.ext4 Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-f fragment-size] [-i bytes-per-inode] [-I inode-size] [-J journal-options] [-G meta group size] [-N number-of-inodes] [-m reserved-blocks-percentage] [-o creator-os] [-g blocks-per-group] [-L volume-label] [-M last-mounted-directory] [-O feature[,...]] [-r fs-revision] [-E extended-option[,...]] [-T fs-type] [-jnqvFSV] device [blocks-count] ----8<---- It doesn't appear to support the blocks-count option anymore. :-( Or did it ever? ../C From Curtis at GreenKey.net Mon Jan 5 20:31:43 2009 From: Curtis at GreenKey.net (Curtis Doty) Date: Mon, 5 Jan 2009 12:31:43 -0800 (PST) Subject: 16TiB ext4 In-Reply-To: <20090105202335.4A68E6F064@alopias.GreenKey.net> References: <20090105172156.AC5036F064@alopias.GreenKey.net> <49624E68.8050804@redhat.com> <20090105202335.4A68E6F064@alopias.GreenKey.net> Message-ID: <20090105203144.3C3A86F064@alopias.GreenKey.net> Ah whoops...forgot to paste entire example. 12:23pm Curtis Doty said: > # mkfs.ext4 /dev/foo/bar 42 mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: Size of device /dev/phd/dc1a too big to be expressed in 32 bits using a blocksize of 4096. > It doesn't appear to support the blocks-count option anymore. :-( Or did it > ever? > From sandeen at redhat.com Mon Jan 5 20:41:43 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 05 Jan 2009 14:41:43 -0600 Subject: 16TiB ext4 In-Reply-To: <20090105202335.4A68E6F064@alopias.GreenKey.net> References: <20090105172156.AC5036F064@alopias.GreenKey.net> <49624E68.8050804@redhat.com> <20090105202335.4A68E6F064@alopias.GreenKey.net> Message-ID: <49627087.6050000@redhat.com> Curtis Doty wrote: > 12:16pm Eric Sandeen said: > >> The real limit, IIRC, is (2^32 - 1) blocks, or 4k shy of 16T for 4k blocks. >> >> This is a little unfortunate since "lvcreate -L16T" is so handy, but it >> won't mkfs properly. (ext3 should have the same limitation). >> >> We should probably make mkfs just silently lop off one block if it >> encounters a boundary condition like this ... >> > > Ah, thanks Eric! That would be smart. > > I'm trying to workaround, but... > > ----8<---- > # mkfs.ext4 /dev/foo/bar $[2**32-1] > mke2fs 1.41.3 (12-Oct-2008) > mkfs.ext4: Size of device /dev/phd/dc1a too big to be expressed in 32 bits > using a blocksize of 4096. > # mkfs.ext4 /dev/foo/bar 42 > # mkfs.ext4 > Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-f fragment-size] > [-i bytes-per-inode] [-I inode-size] [-J journal-options] > [-G meta group size] [-N number-of-inodes] > [-m reserved-blocks-percentage] [-o creator-os] > [-g blocks-per-group] [-L volume-label] [-M last-mounted-directory] > [-O feature[,...]] [-r fs-revision] [-E extended-option[,...]] > [-T fs-type] [-jnqvFSV] device [blocks-count] > ----8<---- > > It doesn't appear to support the blocks-count option anymore. :-( Or did > it ever? it does, and did... but it's checking the device size and erroring before it looks at the value you passed in, sigh: # ls -lh fsfile -rw-r--r-- 1 root root 16T 2009-01-05 14:30 fsfile [root at inode test]# mkfs.ext4 -b 4096 fsfile 4294967295 mke2fs 1.41.3 (12-Oct-2008) fsfile is not a block special device. Proceed anyway? (y,n) y mkfs.ext4: Size of device fsfile too big to be expressed in 32 bits using a blocksize of 4096. Unless you specify -n, not that that actually gets you anywhere! 
[root at inode test]# mkfs.ext4 -n -b 4096 fsfile 4294967295 mke2fs 1.41.3 (12-Oct-2008) fsfile is not a block special device. Proceed anyway? (y,n) y Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 1073741824 inodes, 4294967295 blocks 214748364 blocks (5.00%) reserved for the super user First data block=0 131072 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432 and one more block really does fail, though with a less-than-helpful message: [root at inode test]# mkfs.ext4 -n -b 4096 fsfile 4294967296 mke2fs 1.41.3 (12-Oct-2008) mkfs.ext4: invalid blocks count - 4294967296 I'll look into this, it should all be smarter... -Eric From adilger at sun.com Tue Jan 6 09:35:40 2009 From: adilger at sun.com (Andreas Dilger) Date: Tue, 06 Jan 2009 02:35:40 -0700 Subject: Big problem with huge number of files In-Reply-To: <20090102223826.774c1942.pegasus@nerv.eu.org> References: <495E76AF.8080702@gmail.com> <20090102223826.774c1942.pegasus@nerv.eu.org> Message-ID: <20090106093540.GL3932@webber.adilger.int> On Jan 02, 2009 22:38 +0100, Jure Pe?ar wrote: > On Fri, 02 Jan 2009 22:18:55 +0200 > Sergey Shyman wrote: > > I have an issue when I can't get directory listing for maildir with huge > > number of files inside. Neither ls, du or any other command finished > > successfully, it just running for hours without any success. Does > > anybody know how I could get directory listing and copies of my files? > > Any pointing would be great and greatly appreciated. Thanks in advance! > > Have you tried ls -U so that ls doesn't do internal sorting? > Have you tried find? GNU ls is useless in this regard, because even the "-U" option will wait until it has read all of the files before it starts printing anything. It must wait until all the data is available before deciding whether to sort or not. Using "find" will probably work very quickly. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. 
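For reference, a minimal sketch of the find-based approach suggested above (the maildir path and the destination are hypothetical and need adjusting):

find /var/mail/user/Maildir/cur -maxdepth 1 -type f > /tmp/filelist.txt
find /var/mail/user/Maildir/cur -maxdepth 1 -type f -print0 | xargs -0 cp -a -t /backup/cur/

The first command streams the listing to a file as readdir() returns entries, without the sort-and-buffer step GNU ls performs; the second copies the files out in batches, preserving timestamps (-t is GNU cp's target-directory option).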
From shirishag75 at gmail.com Wed Jan 7 14:40:23 2009 From: shirishag75 at gmail.com (shirish) Date: Wed, 7 Jan 2009 20:10:23 +0530 Subject: Big problem with huge number of files In-Reply-To: <495E76AF.8080702@gmail.com> References: <495E76AF.8080702@gmail.com> Message-ID: <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> Reply On Sat, Jan 3, 2009 at 01:48, Sergey Shyman wrote: > Hi all, Hi, > Here is info about this partition: > Filesystem volume name: > Last mounted on: > Filesystem UUID: 3395b7eb-746c-4fc1-a52e-76547ca7454d > Filesystem magic number: 0xEF53 > Filesystem revision #: 1 (dynamic) > Filesystem features: has_journal ext_attr resize_inode dir_index > filetype needs_recovery sparse_super large_file > Default mount options: (none) > Filesystem state: clean > Errors behavior: Continue > Filesystem OS type: Linux > Inode count: 30507008 > Block count: 61008816 > Reserved block count: 3050440 > Free blocks: 36021498 > Free inodes: 20268094 > First block: 0 > Block size: 4096 > Fragment size: 4096 > Reserved GDT blocks: 1024 > Blocks per group: 32768 > Fragments per group: 32768 > Inodes per group: 16384 > Inode blocks per group: 512 > Filesystem created: Thu Apr 27 23:40:04 2006 > Last mount time: Fri Jan 2 15:11:02 2009 > Last write time: Fri Jan 2 15:52:25 2009 > Mount count: 37 > Maximum mount count: -1 > Last checked: Thu Apr 27 23:40:04 2006 > Check interval: 0 () > Reserved blocks uid: 0 (user root) > Reserved blocks gid: 0 (group root) > First inode: 11 > Inode size: 128 > Journal inode: 8 > First orphan inode: 28213259 > Default directory hash: tea > Directory Hash Seed: 04e82a5e-98ca-4893-b03f-44d5f7227e8d > Journal backup: inode blocks > > This partition have noatime enabled. probably off-topic to the thread but how were u able to get the above info. Which command/tool did you use to get the above? > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Regards, Shirish Agarwal This email is licensed under http://creativecommons.org/licenses/by-nc/3.0/ http://flossexperiences.wordpress.com 065C 6D79 A68C E7EA 52B3 8D70 950D 53FB 729A 8B17 From ulf at openlane.com Wed Jan 7 15:02:31 2009 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 7 Jan 2009 07:02:31 -0800 Subject: Big problem with huge number of files In-Reply-To: <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> References: <495E76AF.8080702@gmail.com> <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A8D4@msmpk01.corp.autc.com> > -----Original Message----- > To: Sergey Shyman > Subject: Re: Big problem with huge number of files > > Reply > > On Sat, Jan 3, 2009 at 01:48, Sergey Shyman > wrote: > > Hi all, > > Hi, > > > probably off-topic to the thread but how were u able to get the above > info. Which command/tool did you use to get the above? 
tune2fs -l From shirishag75 at gmail.com Wed Jan 7 15:46:53 2009 From: shirishag75 at gmail.com (shirish) Date: Wed, 7 Jan 2009 21:16:53 +0530 Subject: Big problem with huge number of files In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A8D4@msmpk01.corp.autc.com> References: <495E76AF.8080702@gmail.com> <511f47f50901070640vd7af70lf313cc7495146d8a@mail.gmail.com> <5DE4B7D3E79067418154C49A739C125104C4A8D4@msmpk01.corp.autc.com> Message-ID: <511f47f50901070746k5ba95d27u52c7f12bbe5444bf@mail.gmail.com> On Wed, Jan 7, 2009 at 20:32, Ulf Zimmermann wrote: Hi Ulf Zimmermann, > tune2fs -l Cool. Thank you for telling me about this tool. -- Regards, Shirish Agarwal This email is licensed under http://creativecommons.org/licenses/by-nc/3.0/ http://flossexperiences.wordpress.com 065C 6D79 A68C E7EA 52B3 8D70 950D 53FB 729A 8B17 From ulf at openlane.com Wed Jan 7 17:56:03 2009 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 7 Jan 2009 09:56:03 -0800 Subject: OT: mailing list to talk about multipath under Linux? Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A8D8@msmpk01.corp.autc.com> Not directly related to EXT FS but can anyone point me a mailing list to talk about things like device-mapper-multipath? Specific I am looking to see if anyone has maybe written a script to take SCSI devices offline for a path, to do clean shutdown of a fabric or SAN controller for maintance? Ulf Zimmermann | Senior System Architect OPENLANE 4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025 O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929 Email: ulf at openlane.com | Web: www.openlane.com From pegasus at nerv.eu.org Wed Jan 7 20:18:00 2009 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Wed, 7 Jan 2009 21:18:00 +0100 Subject: OT: mailing list to talk about multipath under Linux? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A8D8@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A8D8@msmpk01.corp.autc.com> Message-ID: <20090107211800.58bab800.pegasus@nerv.eu.org> On Wed, 7 Jan 2009 09:56:03 -0800 "Ulf Zimmermann" wrote: > Not directly related to EXT FS but can anyone point me a mailing list to > talk about things like device-mapper-multipath? Specific I am looking to > see if anyone has maybe written a script to take SCSI devices offline > for a path, to do clean shutdown of a fabric or SAN controller for > maintance? https://www.redhat.com/mailman/listinfo/dm-devel most probably? -- Jure Pe?ar http://jure.pecar.org/ From bruno at wolff.to Wed Jan 7 21:18:55 2009 From: bruno at wolff.to (Bruno Wolff III) Date: Wed, 7 Jan 2009 15:18:55 -0600 Subject: Incorrect disk usage size In-Reply-To: References: Message-ID: <20090107211855.GA5451@wolff.to> On Sat, Dec 20, 2008 at 18:37:41 -0600, Adam Flott wrote: > After an aptitude safe-upgrade of Debian's testing (as of today) my root file > system (ext3) seems to have "filled up" and I'm not sure how to get Linux to > correctly report the used size. Are you aware that there is space in file systems reserved for use only by root? That may explain your confusion. The purpose of the reserve is to allow a sysadm to allow some things to keep working even if a normal user fills up a file system. The size of the reserve on ext2/3 file systems can be changed with tune2fs. 
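For reference, a short sketch of inspecting and adjusting that reserve with tune2fs (the device name /dev/sda1 is only an example):

tune2fs -l /dev/sda1 | grep -i 'reserved block count'
tune2fs -m 1 /dev/sda1

The first command shows the current reserved block count; the second lowers the reserve from the default 5% to 1% on an existing filesystem. The same percentage can be set at mkfs time with mke2fs -m.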
From ulf at openlane.com Wed Jan 7 21:22:40 2009 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 7 Jan 2009 13:22:40 -0800 Subject: Incorrect disk usage size In-Reply-To: <20090107211855.GA5451@wolff.to> References: <20090107211855.GA5451@wolff.to> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A8E0@msmpk01.corp.autc.com> > -----Original Message----- > From: ext3-users-bounces at redhat.com [mailto:ext3-users- > bounces at redhat.com] On Behalf Of Bruno Wolff III > Sent: 01/07/2009 13:19 > To: Adam Flott > Cc: ext3-users at redhat.com > Subject: Re: Incorrect disk usage size > > On Sat, Dec 20, 2008 at 18:37:41 -0600, > Adam Flott wrote: > > After an aptitude safe-upgrade of Debian's testing (as of today) my > root file > > system (ext3) seems to have "filled up" and I'm not sure how to get > Linux to > > correctly report the used size. > > Are you aware that there is space in file systems reserved for use only > by > root? That may explain your confusion. > > The purpose of the reserve is to allow a sysadm to allow some things to > keep > working even if a normal user fills up a file system. > > The size of the reserve on ext2/3 file systems can be changed with > tune2fs. Your problem is probably files in /var, not necessary over 1GB in size. I don't know where Debian saves packages downloaded via apt, but yum for example has a /var/cache/yum and you can run "yum clean packages". I would expect apt to have something similar. From lists at nerdbynature.de Thu Jan 8 01:49:52 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 8 Jan 2009 02:49:52 +0100 (CET) Subject: Incorrect disk usage size In-Reply-To: References: Message-ID: On Sat, 20 Dec 2008, Adam Flott wrote: > $ df > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/sda1 48062440 46976212 0 100% / So, "/" is really ~45 GB in total, but: > $ du -sh -x / > 5.6G / du(1) counts only 5,6 GB? Hm, first thing that comes to mind are of course (stale) open files, which cannot be found with find(1) any more and are not freed to the fs, so df(1) does not know about it. I usually use "lsof -ln | grep deleted", but that'd be a *lot* of large, open files. > Block count: 12207384 > Reserved block count: 610369 This reserve would sum up to ~2,3 GB, but this still does not explain the difference to 45 GB. Hm. > I've looked for large files/directories via find (-type d/f -size +1G) and > fsck'ing the partition multiple times with various options, but no luck. And you unmounted or at least remounted r/o the partition for the fsck, so the open files should not even be an issue here. Strange indeed...sorry to be of no help here... C. -- BOFH excuse #39: terrorist activities From folkert at vanheusden.com Fri Jan 16 12:01:19 2009 From: folkert at vanheusden.com (Folkert van Heusden) Date: Fri, 16 Jan 2009 13:01:19 +0100 Subject: something odd with the order of files in a directory Message-ID: <20090116120119.GB29002@vanheusden.com> Hi, I noticed something odd with the order of files in a directory. When I put files in a directory in a certain order on an ext3-filesystem, the order is not kept. On fat-filesystem it does. E.g.: rm -rf t ; mkdir t touch a.a a.b a.c mv a.b t/ ; mv a.c t/ ; mv a.a t/ ls -Ula t/ I then would expect: a.b a.c a.a but instead I get drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . 
I tried adding sync between each mv but that didn't help. Folkert van Heusden -- ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com From davidlandy at clara.co.uk Fri Jan 16 12:40:18 2009 From: davidlandy at clara.co.uk (D Landy) Date: Fri, 16 Jan 2009 12:40:18 +0000 Subject: Fw: 32k Blocksize Support Message-ID: Hi again, First of all, thanks to Eric Sandeen for his offline support. I'm coming back here at his suggestion as we haven't managed to resolve it. So far, we've established that it *is* an ext2 filesystem (using file -s), and that resize2fs reports that it has an invalid superblock. Eric wrote: > I'd probably dig into why resize2fs says it's corrupt; large block > should not mean corrupt, AFAIK, even if the running kernel can't > actually mount it. > > You might get this back on-list, too, so future generations can benefit > from your pain (and in case someone else knows these answers). Does anyone know if a 32k blocksize would cause resize2fs to report an invalid superblock? I've downloaded the source code and from what I can see the maximum block size is 64k, so I wouldn't have thought so - but I'm not a C programmer and have trouble following the source sometimes. I'd appreciate another set of eyes going over the code... Any help greatly appreciated. David From sandeen at redhat.com Fri Jan 16 15:32:24 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 16 Jan 2009 09:32:24 -0600 Subject: Fw: 32k Blocksize Support In-Reply-To: References: Message-ID: <4970A888.7070701@redhat.com> D Landy wrote: > Hi again, > > First of all, thanks to Eric Sandeen for his offline support. > > I'm coming back here at his suggestion as we haven't managed to resolve it. > > So far, we've established that it *is* an ext2 filesystem (using file -s), > and that resize2fs reports that it has an invalid superblock. > > Eric wrote: > >> I'd probably dig into why resize2fs says it's corrupt; large block >> should not mean corrupt, AFAIK, even if the running kernel can't >> actually mount it. >> >> You might get this back on-list, too, so future generations can benefit >> from your pain (and in case someone else knows these answers). > > Does anyone know if a 32k blocksize would cause resize2fs to report an > invalid superblock? 
I've downloaded the source code and from what I can see > the maximum block size is 64k, so I wouldn't have thought so - but I'm not a > C programmer and have trouble following the source sometimes. > > I'd appreciate another set of eyes going over the code... > > Any help greatly appreciated. I don't know if they're using a standard ext3 fs or not; perhaps it is adultrated in some way for their needs that makes it incompatible w/ the upstream tools. You could go through the code to find where that message is printed, then work backwards to why (either via gdb, or printf insertions, or whatever you're comfortable with...) -Eric From sandeen at redhat.com Fri Jan 16 15:35:24 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 16 Jan 2009 09:35:24 -0600 Subject: something odd with the order of files in a directory In-Reply-To: <20090116120119.GB29002@vanheusden.com> References: <20090116120119.GB29002@vanheusden.com> Message-ID: <4970A93C.5010709@redhat.com> Folkert van Heusden wrote: > Hi, > > I noticed something odd with the order of files in a directory. > When I put files in a directory in a certain order on an > ext3-filesystem, the order is not kept. On fat-filesystem it does. > E.g.: > rm -rf t ; mkdir t > touch a.a a.b a.c > mv a.b t/ ; mv a.c t/ ; mv a.a t/ > ls -Ula t/ > > I then would expect: > a.b > a.c > a.a > > but instead I get > drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a > drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . > > I tried adding sync between each mv but that didn't help. This is due to the dir_index feature; you're getting them back in hash (read: random) order. If you turn it off: [root at inode mnt]# tune2fs -O ^dir_index /dev/sdb4 you'll get what you expect: [root at inode test]# rm -rf t ; mkdir t [root at inode test]# touch a.a a.b a.c [root at inode test]# mv a.b t/ ; mv a.c t/ ; mv a.a t/ [root at inode test]# ls -Ula t/ total 8 drwxr-xr-x 2 root root 4096 2009-01-16 15:30 . drwxr-xr-x 4 root root 4096 2009-01-16 15:30 .. -rw-r--r-- 1 root root 0 2009-01-16 15:30 a.b -rw-r--r-- 1 root root 0 2009-01-16 15:30 a.c -rw-r--r-- 1 root root 0 2009-01-16 15:30 a.a but you'll lose the other efficiencies of the dir_index feature. -Eric From folkert at vanheusden.com Fri Jan 16 15:44:24 2009 From: folkert at vanheusden.com (Folkert van Heusden) Date: Fri, 16 Jan 2009 16:44:24 +0100 Subject: something odd with the order of files in a directory In-Reply-To: <4970A93C.5010709@redhat.com> References: <20090116120119.GB29002@vanheusden.com> <4970A93C.5010709@redhat.com> Message-ID: <20090116154424.GH29002@vanheusden.com> > > When I put files in a directory in a certain order on an > > ext3-filesystem, the order is not kept. On fat-filesystem it does. > > This is due to the dir_index feature; you're getting them back in hash > (read: random) order. If you turn it off: Ah ok, thanks! Folkert van Heusden -- MultiTail er et flexible tool for ? kontrolere Logfiles og commandoer. Med filtrer, farger, sammenf?ringer, forskeliger ansikter etc. 
http://www.vanheusden.com/multitail/ ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com From davidlandy at clara.co.uk Sun Jan 18 09:33:11 2009 From: davidlandy at clara.co.uk (D Landy) Date: Sun, 18 Jan 2009 09:33:11 +0000 Subject: Fw: 32k Blocksize Support Message-ID: Eric Sandeen wrote: > I don't know if they're using a standard ext3 fs or not; perhaps it is > adultrated in some way for their needs that makes it incompatible w/ the > upstream tools. > > You could go through the code to find where that message is printed, > then work backwards to why (either via gdb, or printf insertions, or > whatever you're comfortable with...) Thanks, Eric, that's exactly what I've done. :-) Unfortunately there are many different error conditions that could result in an "invalid superblock" message and it seems like it would be a hard job (at least for me!) to work out which one it was as I don't know how to compile a package or even how to get the right source code for Puppy Linux (which I think is almost Debian compatible). I guess this is going off-topic now and I should ask on other lists for help with that? Any assistance appreciated. David From sandeen at redhat.com Mon Jan 19 17:10:10 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 19 Jan 2009 11:10:10 -0600 Subject: Fw: 32k Blocksize Support In-Reply-To: References: Message-ID: <4974B3F2.2070009@redhat.com> D Landy wrote: > Eric Sandeen wrote: > >> I don't know if they're using a standard ext3 fs or not; perhaps it is >> adultrated in some way for their needs that makes it incompatible w/ the >> upstream tools. >> >> You could go through the code to find where that message is printed, >> then work backwards to why (either via gdb, or printf insertions, or >> whatever you're comfortable with...) > > Thanks, Eric, that's exactly what I've done. > > :-) > > Unfortunately there are many different error conditions that could result in > an "invalid superblock" message and it seems like it would be a hard job (at > least for me!) to work out which one it was as I don't know how to compile a > package or even how to get the right source code for Puppy Linux (which I > think is almost Debian compatible). > > I guess this is going off-topic now and I should ask on other lists for help > with that? > > Any assistance appreciated. > > David You could make an e2image and hope someone has enough spare time (I'm afraid I don't at the moment) to take a look. (assuming e2image will touch it....) -Eric From lists at nerdbynature.de Fri Jan 23 09:19:12 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 23 Jan 2009 10:19:12 +0100 (CET) Subject: something odd with the order of files in a directory (fwd) Message-ID: On Fri, 16 Jan 2009, Folkert van Heusden wrote: > I then would expect: > a.b > a.c > a.a > > but instead I get > drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b > -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a > drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . Hm, is this reproducible? Which kernel, mount-options, arch? Here on 2.6.24/amd64 the "directory order" (GNU/ls -U resp. 
BSD/ls -f) seems to work as expected: $ touch 1 2 3 $ mv 2 t/ ; mv 3 t/; mv 1 t/ $ ls -Ugo --time-style=full-iso t/ -rw-r----- 1 0 2009-01-22 15:50:01.414115303 +0100 2 -rw-r----- 1 0 2009-01-22 15:50:01.414115303 +0100 3 -rw-r----- 1 0 2009-01-22 15:50:01.414115303 +0100 1 Christian. -- BOFH excuse #175: OS swapped to disk From alexfler at msn.com Fri Jan 23 11:10:47 2009 From: alexfler at msn.com (Alex Fler) Date: Fri, 23 Jan 2009 06:10:47 -0500 Subject: Reserved block count for Large Filesystem Message-ID: Hi All, On large FS like 100gb default value of "Reserved block count" takes 5% of usable disk, can this value be safely changed to 1% and not affect a performance ? Is a reservation size of 1gb enough for 100gb disk ? And when we have even larger filesystem like 1Tb default "Reserved block count" is 50GB, is it an absolutely minimum must have reserved number of space for disk performance, or it's just a legacy concept which can be adjusted? Thanks in advance Alex Fler _________________________________________________________________ Windows Live? Hotmail??more than just e-mail. http://windowslive.com/howitworks?ocid=TXT_TAGLM_WL_t2_hm_justgotbetter_howitworks_012009 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pegasus at nerv.eu.org Fri Jan 23 11:26:10 2009 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Fri, 23 Jan 2009 12:26:10 +0100 Subject: Reserved block count for Large Filesystem In-Reply-To: References: Message-ID: <20090123122610.882548d3.pegasus@nerv.eu.org> On Fri, 23 Jan 2009 06:10:47 -0500 Alex Fler wrote: > > Hi All, > > On large FS like 100gb default value of "Reserved block count" takes 5% > of usable disk, can this value be safely changed to 1% and not affect a > performance ? Is a reservation size of 1gb enough for 100gb disk ? And > when we have even larger filesystem like 1Tb default "Reserved block > count" is 50GB, is it an absolutely minimum must have reserved number of > space for disk performance, or it's just a legacy concept which can be > adjusted? These days I simply mkfs all my large non-root and non-var filesystems with -m 0, setting reserved block count to 0%. -- Jure Pe?ar http://jure.pecar.org http://f5j.eu From tytso at mit.edu Fri Jan 23 16:58:24 2009 From: tytso at mit.edu (Theodore Tso) Date: Fri, 23 Jan 2009 11:58:24 -0500 Subject: Reserved block count for Large Filesystem In-Reply-To: References: Message-ID: <20090123165824.GO14966@mit.edu> On Fri, Jan 23, 2009 at 06:10:47AM -0500, Alex Fler wrote: > > On large FS like 100gb default value of "Reserved block count" takes > 5% of usable disk, can this value be safely changed to 1% and not > affect a performance ? Is a reservation size of 1gb enough for 100gb > disk ? And when we have even larger filesystem like 1Tb default > "Reserved block count" is 50GB, is it an absolutely minimum must > have reserved number of space for disk performance, or it's just a > legacy concept which can be adjusted? If you set the reserved block count to zero, it won't affect performance much except if you run for long periods of time (with lots of file creates and deletes) while the filesystem is almost full (i.e., say above 95%), at which point you'll be subject to fragmentation problems. 
Ext4's multi-block allocator is much more fragmentation resistant, because it tries much harder to find contiguous blocks, so even if you don't enable the other ext4 features, you'll see better results simply mounting an ext3 filesystem using ext4 before the filesystem gets completely full. If you are just using the filesystem for long-term archive, where files aren't changing very often (i.e., a huge mp3 or video store), it obviously won't matter. - Ted From adilger at sun.com Fri Jan 23 22:03:18 2009 From: adilger at sun.com (Andreas Dilger) Date: Fri, 23 Jan 2009 15:03:18 -0700 Subject: something odd with the order of files in a directory (fwd) In-Reply-To: References: Message-ID: <20090123220318.GU3652@webber.adilger.int> On Jan 23, 2009 10:19 +0100, Christian Kujau wrote: > On Fri, 16 Jan 2009, Folkert van Heusden wrote: >> I then would expect: >> a.b >> a.c >> a.a >> >> but instead I get >> drwxr-xr-x 3 root root 4096 2009-01-16 12:59 .. >> -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.c >> -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.b >> -rw-r--r-- 1 root root 0 2009-01-16 12:59 a.a >> drwxr-xr-x 2 root root 4096 2009-01-16 12:59 . > > Hm, is this reproducible? Which kernel, mount-options, arch? > Here on 2.6.24/amd64 the "directory order" (GNU/ls -U resp. BSD/ls -f) > seems to work as expected: There is no such thing as "directory order" in Unix. It can change at any time, with the caveat that a single process doing a single readdir() will get each entry existing at the start and end of readdir exactly once. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From cmiyata at lycos.com Tue Jan 27 17:57:38 2009 From: cmiyata at lycos.com (Cristina Miyata) Date: Tue, 27 Jan 2009 12:57:38 -0500 (EST) Subject: ext3_journal_start_sb: Detected aborted journal Message-ID: <20090127125738.HM.0000000000002Gx@cmiyata.mail-wwl23.bo3.lycos.com.lycos.com> Dear Ext3 Users, We are running RHEL 4 AS (2.6.9-67.ELsmp) on a Sun X4200 M2 machine with 2 146GB disks in RAID1. For no apparent reason, an ext3 filesystem got an error and was remounted read-only. => /var/log/messages Jan 21 22:34:32 SPJAG01-SM02 kernel: EXT3-fs error (device sda8): ext3_journal_start_sb: Detected aborted journal Jan 21 22:34:32 SPJAG01-SM02 kernel: Remounting filesystem read-only I've checked the RedHat bug 323921 (https://bugzilla.redhat.com/show_bug.cgi?id=213921) and saw that it could cause this problem and that it was fixed in kernel versions 2.6.9-42.0.7.EL and later. Does anyone know if there is another RedHat bug that could cause such a problem? Or another reason that is not a hardware problem (Sun tech support said that there is no hardware problem)? Thank you for your attention. 
Regards, Cristina Miyata From adilger at sun.com Tue Jan 27 22:03:38 2009 From: adilger at sun.com (Andreas Dilger) Date: Tue, 27 Jan 2009 15:03:38 -0700 Subject: ext3_journal_start_sb: Detected aborted journal In-Reply-To: <20090127125738.HM.0000000000002Gx@cmiyata.mail-wwl23.bo3.lycos.com.lycos.com> References: <20090127125738.HM.0000000000002Gx@cmiyata.mail-wwl23.bo3.lycos.com.lycos.com> Message-ID: <20090127220338.GV3652@webber.adilger.int> On Jan 27, 2009 12:57 -0500, Cristina Miyata wrote: > => /var/log/messages > > Jan 21 22:34:32 SPJAG01-SM02 kernel: EXT3-fs error (device sda8): ext3_journal_start_sb: Detected aborted journal > Jan 21 22:34:32 SPJAG01-SM02 kernel: Remounting filesystem read-only Are there messages that mention "JBD" or "journal" or your disk that indicate why the journal was aborted? Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From nicolas.kowalski at gmail.com Fri Jan 30 13:53:29 2009 From: nicolas.kowalski at gmail.com (Nicolas KOWALSKI) Date: Fri, 30 Jan 2009 14:53:29 +0100 Subject: barrier and commit options? Message-ID: <20090130135329.GW20896@petole.demisel.net> Hello, On my home server (Debian etch, custom 2.6.28.2 kernel), I am using ext3 for both root and /home filesystems, with barriers enabled to prevent corruption caused by my PATA disk write cache. Looking for a better performance, I have also set the commit=nr option as described in linux-2.6.28.2/Documentation/filesystems/ext3.txt, so that I now have: niko at petole:~$ mount -t ext3 /dev/sda1 on / type ext3 (rw,noatime,commit=30,barrier=1) /dev/sda3 on /home type ext3 (rw,noatime,commit=30,barrier=1) I know I may loose the last 30 seconds of "work" (it's just a home server), but is the filesystem at risk (corruption, whatever, ...) with these mount options ? Thanks, -- Nicolas From lists at nerdbynature.de Fri Jan 30 15:17:54 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 30 Jan 2009 16:17:54 +0100 (CET) Subject: barrier and commit options? In-Reply-To: <20090130135329.GW20896@petole.demisel.net> References: <20090130135329.GW20896@petole.demisel.net> Message-ID: On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: > I know I may loose the last 30 seconds of "work" (it's just a home > server), but is the filesystem at risk (corruption, whatever, ...) with > these mount options ? No, why would it? If certain mount options would make a filesystem prone to corruption I'd consider this a bug. So apart from losing a few more seconds of work in case of an error, the fs should be fine. C. -- BOFH excuse #199: the curls in your keyboard cord are losing electricity. From sandeen at redhat.com Fri Jan 30 15:22:46 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 30 Jan 2009 10:22:46 -0500 Subject: barrier and commit options? In-Reply-To: References: <20090130135329.GW20896@petole.demisel.net> Message-ID: <49831B46.5080202@redhat.com> Christian Kujau wrote: > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: >> I know I may loose the last 30 seconds of "work" (it's just a home >> server), but is the filesystem at risk (corruption, whatever, ...) with >> these mount options ? > > No, why would it? If certain mount options would make a filesystem prone > to corruption I'd consider this a bug. Well, that's not exactly true. Turning off barriers, depending on your storage, could lead to corruption in some cases. Mounting with data=writeback can expose stale data, which could even be a security issue. 
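As an illustration of the options under discussion, a hypothetical /etc/fstab line for an ext3 /home (the device and mount point are examples only):

/dev/sda3  /home  ext3  noatime,barrier=1,commit=30  0  2

barrier=1 and commit=30 match the mount output quoted earlier in the thread; data=writeback would go in the same options field, with the trade-offs noted above.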
But as long as you make these decisions consciously, they may fit your needs. > So apart from losing a few more > seconds of work in case of an error, the fs should be fine. This part is correct, barriers on and longer commit time should not affect filesystem consistency / integrity. -Eric > C. From nicolas.kowalski at gmail.com Fri Jan 30 15:25:47 2009 From: nicolas.kowalski at gmail.com (Nicolas KOWALSKI) Date: Fri, 30 Jan 2009 16:25:47 +0100 Subject: barrier and commit options? In-Reply-To: References: <20090130135329.GW20896@petole.demisel.net> Message-ID: <20090130152547.GA2068@petole.demisel.net> On Fri, Jan 30, 2009 at 04:17:54PM +0100, Christian Kujau wrote: > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: >> I know I may loose the last 30 seconds of "work" (it's just a home >> server), but is the filesystem at risk (corruption, whatever, ...) with >> these mount options ? > > No, why would it? If certain mount options would make a filesystem prone > to corruption I'd consider this a bug. Well, not using barrier=1 with disk write cache enabled may cause corruption apparently... > So apart from losing a few more seconds of work in case of an error, > the fs should be fine. Fine. :) Thanks for your reply, -- Nicolas From nicolas.kowalski at gmail.com Fri Jan 30 15:30:21 2009 From: nicolas.kowalski at gmail.com (Nicolas KOWALSKI) Date: Fri, 30 Jan 2009 16:30:21 +0100 Subject: barrier and commit options? In-Reply-To: <49831B46.5080202@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> Message-ID: <20090130153021.GB2068@petole.demisel.net> On Fri, Jan 30, 2009 at 10:22:46AM -0500, Eric Sandeen wrote: > Christian Kujau wrote: > > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: > >> I know I may loose the last 30 seconds of "work" (it's just a home > >> server), but is the filesystem at risk (corruption, whatever, ...) with > >> these mount options ? > > > > No, why would it? If certain mount options would make a filesystem prone > > to corruption I'd consider this a bug. > > Well, that's not exactly true. Turning off barriers, depending on your > storage, could lead to corruption in some cases. Mounting with > data=writeback can expose stale data, which could even be a security issue. > > But as long as you make these decisions consciously, they may fit your > needs. > > > So apart from losing a few more > > seconds of work in case of an error, the fs should be fine. > > This part is correct, barriers on and longer commit time should not > affect filesystem consistency / integrity. Ok, I'm more relaxed about my data then. :) Thanks for your reply, -- Nicolas From Mike.Miller at hp.com Fri Jan 30 15:34:14 2009 From: Mike.Miller at hp.com (Miller, Mike (OS Dev)) Date: Fri, 30 Jan 2009 15:34:14 +0000 Subject: barrier and commit options? In-Reply-To: <49831B46.5080202@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> Message-ID: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> Eric wrote: > > Christian Kujau wrote: > > On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: > >> I know I may loose the last 30 seconds of "work" (it's just a home > >> server), but is the filesystem at risk (corruption, whatever, ...) > >> with these mount options ? > > > > No, why would it? If certain mount options would make a filesystem > > prone to corruption I'd consider this a bug. > > Well, that's not exactly true. 
Turning off barriers, > depending on your storage, could lead to corruption in some I hope this a proper forum for this inquiry. I'm the maintainer of the HP Smart Array driver, cciss. We've had requests and now a bug report to support write barriers. It seems that write barriers are primarily intended to ensure the proper ordering of data from the disks write cache to the medium. Is this accurate? Thanks, -- mikem > cases. Mounting with data=writeback can expose stale data, > which could even be a security issue. > > But as long as you make these decisions consciously, they may > fit your needs. > > > So apart from losing a few more > > seconds of work in case of an error, the fs should be fine. > > This part is correct, barriers on and longer commit time > should not affect filesystem consistency / integrity. > > -Eric > > > C. > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From rwheeler at redhat.com Fri Jan 30 15:40:14 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Fri, 30 Jan 2009 10:40:14 -0500 Subject: barrier and commit options? In-Reply-To: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> Message-ID: <49831F5E.6000506@redhat.com> Miller, Mike (OS Dev) wrote: > Eric wrote: > >> Christian Kujau wrote: >> >>> On Fri, 30 Jan 2009, Nicolas KOWALSKI wrote: >>> >>>> I know I may loose the last 30 seconds of "work" (it's just a home >>>> server), but is the filesystem at risk (corruption, whatever, ...) >>>> with these mount options ? >>>> >>> No, why would it? If certain mount options would make a filesystem >>> prone to corruption I'd consider this a bug. >>> >> Well, that's not exactly true. Turning off barriers, >> depending on your storage, could lead to corruption in some >> > > I hope this a proper forum for this inquiry. I'm the maintainer of the HP Smart Array driver, cciss. We've had requests and now a bug report to support write barriers. > It seems that write barriers are primarily intended to ensure the proper ordering of data from the disks write cache to the medium. Is this accurate? > > Thanks, > -- mikem > > Hi Mike, Without working barriers, you are especially open to metadata corruption - If I remember the details correctly, Chris Mason has demonstrated a 50% chance of corruption directory entries in ext3 for example. In addition, barriers allows fsync to have real meaning since the target storage will flush its write cache & the user will have that fsync() data after a power outage. If you have a battery backed write cache (say, in a high end array) barriers can be ignored since the storage can effectively make that write cache non-volatile, but otherwise, this is pretty key for anyone wanting to maintain data integrity, Regards, Ric From Mike.Miller at hp.com Fri Jan 30 15:56:33 2009 From: Mike.Miller at hp.com (Miller, Mike (OS Dev)) Date: Fri, 30 Jan 2009 15:56:33 +0000 Subject: barrier and commit options? 
In-Reply-To: <49831F5E.6000506@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> Message-ID: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> Ric Wheeler wrote: > > I hope this a proper forum for this inquiry. I'm the > maintainer of the HP Smart Array driver, cciss. We've had > requests and now a bug report to support write barriers. > > It seems that write barriers are primarily intended to > ensure the proper ordering of data from the disks write cache > to the medium. Is this accurate? > > > > Thanks, > > -- mikem > > > > > Hi Mike, > > Without working barriers, you are especially open to metadata > corruption > - If I remember the details correctly, Chris Mason has > demonstrated a 50% chance of corruption directory entries in > ext3 for example. > > In addition, barriers allows fsync to have real meaning since > the target storage will flush its write cache & the user will > have that fsync() data after a power outage. > > If you have a battery backed write cache (say, in a high end > array) barriers can be ignored since the storage can > effectively make that write cache non-volatile, but > otherwise, this is pretty key for anyone wanting to maintain > data integrity, > Hi Ric, That's what I getting at, array controllers with a battery backed write cache (BBWC). We disable the write cache on the physical disks and provide no mechanism to re-enable the cache except in some SATA configurations. So my real question is this: Given the fact that many Smart Array controllers ship with a BBWC, will write barriers offer any benefit? I think fsync does nothing on SA since it doesn't know how to flush the controller cache. If a user has no BBWC then all writes are completed all the way down to the disk medium before the command is completed back up to the driver. Thanks, -- mikem > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From rwheeler at redhat.com Fri Jan 30 16:03:51 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Fri, 30 Jan 2009 11:03:51 -0500 Subject: barrier and commit options? In-Reply-To: <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> Message-ID: <498324E7.3000705@redhat.com> Miller, Mike (OS Dev) wrote: > Ric Wheeler wrote: > > >>> I hope this a proper forum for this inquiry. I'm the >>> >> maintainer of the HP Smart Array driver, cciss. We've had >> requests and now a bug report to support write barriers. >> >>> It seems that write barriers are primarily intended to >>> >> ensure the proper ordering of data from the disks write cache >> to the medium. Is this accurate? >> >>> Thanks, >>> -- mikem >>> >>> >>> >> Hi Mike, >> >> Without working barriers, you are especially open to metadata >> corruption >> - If I remember the details correctly, Chris Mason has >> demonstrated a 50% chance of corruption directory entries in >> ext3 for example. 
>> >> In addition, barriers allows fsync to have real meaning since >> the target storage will flush its write cache & the user will >> have that fsync() data after a power outage. >> >> If you have a battery backed write cache (say, in a high end >> array) barriers can be ignored since the storage can >> effectively make that write cache non-volatile, but >> otherwise, this is pretty key for anyone wanting to maintain >> data integrity, >> >> > Hi Ric, > That's what I getting at, array controllers with a battery backed write cache (BBWC). We disable the write cache on the physical disks and provide no mechanism to re-enable the cache except in some SATA configurations. > > So my real question is this: Given the fact that many Smart Array controllers ship with a BBWC, will write barriers offer any benefit? I think fsync does nothing on SA since it doesn't know how to flush the controller cache. > > If a user has no BBWC then all writes are completed all the way down to the disk medium before the command is completed back up to the driver. > > Thanks, > -- mikem > In this case (or whenever the write cache is disabled on the disk) the barrier ops don't do anything for us... Some devices simply ignore the flush commands (imagine flushing the gigabytes in an enterprise array on each transaction commit), others might return an error on the flush command itself (which should be handled correctly). I don't think that you need to add support if the HBA has a battery backed cache and the target drives have disabled write caches... Ric > >> _______________________________________________ >> Ext3-users mailing list >> Ext3-users at redhat.com >> https://www.redhat.com/mailman/listinfo/ext3-users >> From tytso at mit.edu Fri Jan 30 22:02:45 2009 From: tytso at mit.edu (Theodore Tso) Date: Fri, 30 Jan 2009 17:02:45 -0500 Subject: barrier and commit options? In-Reply-To: <498324E7.3000705@redhat.com> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> <498324E7.3000705@redhat.com> Message-ID: <20090130220245.GA27950@mit.edu> >>> - If I remember the details correctly, Chris Mason has demonstrated a >>> 50% chance of corruption directory entries in ext3 for example. Chris Mason has a script which forces the system to be under a lot of memory pressure, and in that scenario, it is highly likely that without barriers, there will be filesystem corruptions if the system is abruptly turned off while his script is running. Andrew Monrton has been resistant in making barriers=1 be the default for ext3 because (as I understand it) he disbelieves that this is an adequate real-world example, and there is a real performance hit to running without barriers. >>> If you have a battery backed write cache (say, in a high end array) >>> barriers can be ignored since the storage can effectively make that >>> write cache non-volatile, but otherwise, this is pretty key for >>> anyone wanting to maintain data integrity, >>> >> That's what I getting at, array controllers with a battery backed >> write cache (BBWC). We disable the write cache on the physical >> disks and provide no mechanism to re-enable the cache except in >> some SATA configurations. Well, we still need the barrier on the block I/O elevantor side to make sure that requests don't get reordered in the block layer. 
But what you're saying is that once the write is posted to the array, it is guaranteed that it is on "stable storage" (even if it is BBWC) such that if someone hits the Big Red Switch at the exit to the data center, and power is forcibly cut from the entire data center in case of a fire, the battery will still keep the cache alive, at least until the sprinklers go off, anyway, right? :-) In that case, I suspect the right thing for the cciss array to do is to ignore the barrier, but not to return an error. If you return an error, and refuse the write with barrier operation (which is what the cciss driver seems to be doing starting in 2.6.29-rcX), ext4 will retry the write without the barrier, at which point we are vulnerable to the block layer reordering things at the I/O scheduler layer. In effect, you're claiming that every single write to cciss is implicitly a "barrier write" in that once it is received by the device, it is guaranteed not to be lost even if the power to the entire system is forcibly removed. - Ted From rwheeler at redhat.com Sat Jan 31 12:45:06 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Sat, 31 Jan 2009 07:45:06 -0500 Subject: barrier and commit options? In-Reply-To: <20090130220245.GA27950@mit.edu> References: <20090130135329.GW20896@petole.demisel.net> <49831B46.5080202@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0800@GVW0547EXC.americas.hpqcorp.net> <49831F5E.6000506@redhat.com> <0F5B06BAB751E047AB5C87D1F77A778859F9DD0835@GVW0547EXC.americas.hpqcorp.net> <498324E7.3000705@redhat.com> <20090130220245.GA27950@mit.edu> Message-ID: <498447D2.1030106@redhat.com> Theodore Tso wrote: >>>> - If I remember the details correctly, Chris Mason has demonstrated a >>>> 50% chance of corruption directory entries in ext3 for example. >>>> > > Chris Mason has a script which forces the system to be under a lot of > memory pressure, and in that scenario, it is highly likely that > without barriers, there will be filesystem corruptions if the system > is abruptly turned off while his script is running. > > Andrew Monrton has been resistant in making barriers=1 be the default > for ext3 because (as I understand it) he disbelieves that this is an > adequate real-world example, and there is a real performance hit to > running without barriers. > > >>>> If you have a battery backed write cache (say, in a high end array) >>>> barriers can be ignored since the storage can effectively make that >>>> write cache non-volatile, but otherwise, this is pretty key for >>>> anyone wanting to maintain data integrity, >>>> >>>> >>> That's what I getting at, array controllers with a battery backed >>> write cache (BBWC). We disable the write cache on the physical >>> disks and provide no mechanism to re-enable the cache except in >>> some SATA configurations. >>> > > Well, we still need the barrier on the block I/O elevantor side to > make sure that requests don't get reordered in the block layer. But > what you're saying is that once the write is posted to the array, it > is guaranteed that it is on "stable storage" (even if it is BBWC) such > that if someone hits the Big Red Switch at the exit to the data > center, and power is forcibly cut from the entire data center in case > of a fire, the battery will still keep the cache alive, at least until > the sprinklers go off, anyway, right? :-) > Yes, true.... > In that case, I suspect the right thing for the cciss array to do is > to ignore the barrier, but not to return an error. 
If you return an > error, and refuse the write with barrier operation (which is what the > cciss driver seems to be doing starting in 2.6.29-rcX), ext4 will > retry the write without the barrier, at which point we are vulnerable > to the block layer reordering things at the I/O scheduler layer. In > effect, you're claiming that every single write to cciss is implicitly > a "barrier write" in that once it is received by the device, it is > guaranteed not to be lost even if the power to the entire system is > forcibly removed. > > - Ted > > > Aren't barriers still tied to the state of the write cache on the target drive? In other words, if the write cache is off, we disable barriers automatically. I think that this happens for scsi in sd_revalidate_disk(). In this case, it sounds like we have tangled the need to flush a drive's write cache with the need to not re-order IO in the elevator code. Ric
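For reference, a small sketch of how one might check whether a drive's volatile write cache is enabled, which is what makes barriers matter in the first place (device names are examples only):

hdparm -W /dev/sda
hdparm -W0 /dev/sda

The first command queries the current write-caching setting on an ATA disk; the second disables the cache, trading performance for safety when barriers are unavailable. For SCSI disks the equivalent is the WCE bit in the caching mode page, e.g. sdparm --get=WCE /dev/sdb.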