From chris at cjx.com Tue Oct 3 22:30:38 2006
From: chris at cjx.com (Chris Allen)
Date: Tue, 03 Oct 2006 23:30:38 +0100
Subject: 16TB ext3 mainstream - when?
Message-ID: <4522E48E.9040905@cjx.com>

Are we likely to see patches to allow 16TB ext3 in the mainstream kernel any time soon?

I am working with a storage box that has 16x750GB drives RAID5-ed together to create a potential 10.5TB of storage. But because ext3 is limited to 8TB, I am forced to split into two smaller ext3 filesystems, which is really cumbersome for my app.

Any ideas anybody?

From adilger at clusterfs.com Wed Oct 4 00:06:11 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 3 Oct 2006 18:06:11 -0600
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <4522E48E.9040905@cjx.com>
References: <4522E48E.9040905@cjx.com>
Message-ID: <20061004000611.GX22010@schatzie.adilger.int>

On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
> Are we likely to see patches to allow 16TB ext3 in the mainstream
> kernel any time soon?

I think the patches are going into -mm (if not already), so start testing... If not, they have been posted here several times, along with a URL for download.

> I am working with a storage box that has 16x750GB drives RAID5-ed together
> to create a potential 10.5TB of storage. But because ext3 is
> limited to 8TB I am forced to split into two smaller ext3 filesystems
> which is really cumbersome for my app.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From menscher at uiuc.edu Wed Oct 4 00:20:20 2006
From: menscher at uiuc.edu (Damian Menscher)
Date: Tue, 3 Oct 2006 19:20:20 -0500 (CDT)
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <20061004000611.GX22010@schatzie.adilger.int>
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID:

On Tue, 3 Oct 2006, Andreas Dilger wrote:
> On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
>> Are we likely to see patches to allow 16TB ext3 in the mainstream
>> kernel any time soon?
>
> I think the patches are going into -mm (if not already), so start testing...
> If not, they have been posted here several times, along with a URL for
> download.

Will those patches work to grow an existing ext3 filesystem to >8TB, or do they only work on new filesystems (created with those patches or other special options)? I ask because we need to create an <8TB filesystem now, but with the option to grow it to >8TB in the future.

Damian Menscher
--
-=#| www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-

From adilger at clusterfs.com Wed Oct 4 05:50:03 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 3 Oct 2006 23:50:03 -0600
Subject: 16TB ext3 mainstream - when?
In-Reply-To:
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID: <20061004055003.GA22010@schatzie.adilger.int>

On Oct 03, 2006 19:20 -0500, Damian Menscher wrote:
> On Tue, 3 Oct 2006, Andreas Dilger wrote:
> >On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
> >>Are we likely to see patches to allow 16TB ext3 in the mainstream
> >>kernel any time soon?
> >
> >I think the patches are going into -mm (if not already), so start
> >testing...
> >If not, they have been posted here several times, along with a URL for
> >download.
>
> Will those patches work to grow an existing ext3 filesystem to >8TB, or
> do they only work on new filesystems (created with those patches or
> other special options)?

There are no special options or features needed to use >8TB filesystems, just bug fixes in the kernel.
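For concreteness, the 8TB wall being discussed is pure block-count arithmetic: ext3 addresses blocks with a 32-bit number, and older kernels effectively treat it as signed (the details are spelled out later in this thread). A quick Python sketch of the arithmetic:

```python
def max_fs_bytes(block_size, block_bits):
    """Largest filesystem addressable with 2**block_bits blocks of
    block_size bytes each."""
    return (2 ** block_bits) * block_size

# Signed 32-bit block numbers (2^31 blocks) with 4kB blocks: the 8TB wall.
assert max_fs_bytes(4096, 31) == 2 ** 43
# A full unsigned 32-bit block number lifts that to 16TB.
assert max_fs_bytes(4096, 32) == 2 ** 44
# ext4's 48-bit block numbers with 4kB blocks reach 2^60 bytes.
assert max_fs_bytes(4096, 48) == 2 ** 60
```

So the fix being discussed (going from 2^31 to 2^32 usable blocks) exactly doubles the ceiling for a given block size.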
> I ask because we need to create an <8TB filesystem now, but with the
> option to grow it to >8TB in the future.

I have never tested that, and I don't know anyone else who has. That said, I'm not aware of any inherent limitations on growing the filesystem up to 16TB. I haven't looked at that code for a long time, and never really with an eye toward scalability to 16TB. It definitely will NOT work to grow past 16TB at all.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From evilninja at gmx.net Wed Oct 4 15:40:31 2006
From: evilninja at gmx.net (Christian)
Date: Wed, 4 Oct 2006 16:40:31 +0100 (BST)
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <20061004000611.GX22010@schatzie.adilger.int>
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID:

On Tue, 3 Oct 2006, Andreas Dilger wrote:
> On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
>> Are we likely to see patches to allow 16TB ext3 in the mainstream
>> kernel any time soon?
>
> I think the patches are going into -mm (if not already), so start testing...
> If not, they have been posted here several times, along with a URL for
> download.

I don't get it: I thought ext2/3 filesystems (volumes) can be 32TiB in size? At least that's what [0] says. If this is wrong, someone should correct this information. Although I must admit that 16 TiB per fs makes more sense, given that with a max blocksize of 4K and a max of 2^32 blocks we have 16TiB of data.... Where does this 2^32 limitation come from anyway?

Thanks, Christian.

[0] http://en.wikipedia.org/wiki/Comparison_of_file_systems
--
BOFH excuse #247:
Due to Federal Budget problems we have been forced to cut back on the number of users able to access the system at one time. (namely none allowed....)

From adilger at clusterfs.com Wed Oct 4 17:11:33 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 4 Oct 2006 11:11:33 -0600
Subject: 16TB ext3 mainstream - when?
In-Reply-To:
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID: <20061004171133.GE22010@schatzie.adilger.int>

On Oct 04, 2006 16:40 +0100, Christian wrote:
> I don't get it: I thought ext2/3 filesystems (volumes) can be 32TiB in
> size? At least that's what [0] says. If this is wrong, someone should
> correct this information. Although I must admit that 16 TiB per fs makes
> more sense, given that with a max blocksize of 4K and a max of 2^32
> blocks we have 16TiB of data.... Where does this 2^32 limitation come
> from anyway?

The 2^32 limit is a 32-bit integer number of blocks. In older kernels (i.e. anything except the latest -mm) there is a signed-int problem, so the effective limit is 2^31 blocks. With 1kB blocks this limit is 2TB (2^41 bytes), with 4kB blocks (most common) it is 8TB (2^43 bytes), and with 64kB blocks (PPC64, ia64, other large PAGE_SIZE systems) this limit is 32TB (2^45 bytes). The ext4 filesystem allows up to 2^48 blocks in the filesystem, so the limit is 2^60 bytes for 4kB blocks, and 2^64 bytes for 64kB blocks.

The major problem at this point is e2fsck time, which is about 1h/TB for fast disks, at minimum (i.e. no major corruption found). One of the goals for future ext4 development is to include checksums into the fs to allow online sanity checking, and also to speed up e2fsck in various ways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From evilninja at gmx.net Wed Oct 4 18:30:50 2006
From: evilninja at gmx.net (Christian)
Date: Wed, 4 Oct 2006 19:30:50 +0100 (BST)
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <20061004171133.GE22010@schatzie.adilger.int>
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int> <20061004171133.GE22010@schatzie.adilger.int>
Message-ID:

On Wed, 4 Oct 2006, Andreas Dilger wrote:
> 2TB (2^41 bytes), with 4kB blocks (most common) it is 8TB (2^43 bytes)
> with 64kB blocks (PPC64, ia64, other large PAGE_SIZE systems) this limit
> is 32TB (2^45 bytes).

Ah, although a max of 4kB is documented in my e2fsprogs-1.39 manpage, I can override it (e.g. -b 65536, but I cannot mount it then). OK.

> The major problem at this point is e2fsck time, which is about 1h/TB for
> fast disks, at minimum (i.e. no major corruption found). One of the
> goals for future ext4 development is to include checksums into the fs
> to allow online sanity checking, and also speed up e2fsck in various ways.

I'm tracking -mm, hopefully ext4 will be included in it anytime soon...

Thanks, Christian.
--
BOFH excuse #140:
LBNC (luser brain not connected)

From Matt_Dodson at messageone.com Wed Oct 4 21:33:52 2006
From: Matt_Dodson at messageone.com (Matt Dodson)
Date: Wed, 4 Oct 2006 16:33:52 -0500
Subject: EXT3 and large directories
Message-ID: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>

I have an ext3 filesystem that has several directories and each directory gets a large number of files inserted and then deleted over time. The filesystem is basically used as a temp store before files are processed. The issue is over time the directory scans get extremely slow even if the directories are empty. I have noticed the directories can range in size from 4k - 100M even when they are empty. Is there a way to fix this without recreating the directories or bringing the filesystem offline?
File system Info:

tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:
Last mounted on:
Filesystem UUID:          7cbda7aa-e8e7-4da1-9c7c-de45668e98f3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              98304000
Block count:              196608000
Reserved block count:     9830400
Free blocks:              31795332
Free inodes:              83024519
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Thu Aug 10 11:10:59 2006
Last mount time:          Tue Oct 3 00:10:48 2006
Last write time:          Tue Oct 3 00:10:48 2006
Mount count:              4
Maximum mount count:      21
Last checked:             Thu Aug 10 11:10:59 2006
Check interval:           15552000 (6 months)
Next check after:         Tue Feb 6 10:10:59 2007
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      59fd108a-7ec7-45f9-8967-b9f3aaec3edf
Journal backup:           inode blocks

Matt D.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From evilninja at gmx.net Wed Oct 4 22:07:36 2006
From: evilninja at gmx.net (Christian)
Date: Wed, 4 Oct 2006 23:07:36 +0100 (BST)
Subject: EXT3 and large directories
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
Message-ID:

(please refrain from sending HTML mails)

On Wed, 4 Oct 2006, Matt Dodson wrote:
> I have an ext3 filesystem that has several directories and each
> directory gets a large number of files inserted and then deleted over
> time.

Can you specify "large number"? What does "ls large-directory | wc -l" say?
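Counting the entries of a directory this large is itself expensive if the tool sorts the listing first (plain ls does). A streaming count sidesteps that; a minimal Python sketch, where the path is whatever directory is suspect:

```python
import os

def count_entries(path):
    """Stream over a directory and count entries without sorting or
    stat()ing them, which matters when the directory holds hundreds of
    thousands of files."""
    n = 0
    with os.scandir(path) as it:
        for _ in it:
            n += 1
    return n
```

From the shell, `ls -f large-directory | wc -l` (-f disables sorting) gets close to the same cost.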
> The filesystem is basically used as a temp store before files are
> processed. The issue is over time the directory scans get extremely slow
> even if the directories are empty. I have noticed the directories can
> range in size from 4k - 100M even when they are empty.

Probably deleted-but-still-open files. When lsof(8) is installed, you can find out with: "lsof -ln | grep large-but-empty-directory"

Can you specify "slow" as well? You also might want to strace(1) an "ls" on your large directory to see what is taking so long.

> Is there a way to fix this without recreating the directories or
> bringing the filesystem offline?

You have enabled htree (dir_index) already:

> Filesystem features: has_journal resize_inode dir_index filetype
> needs_recovery sparse_super large_file

If you've enabled dir_index after the directories have been created, you might want to "e2fsck -D" (see the manpage for details) the filesystem. For partitions with temporary files you could play with the "noatime", "async" and "data" mount options (please read the manpage, really!).

Which kernel do you use? Which arch?

C.
--
BOFH excuse #83:
Support staff hung over, send aspirin and come back LATER.

From evilninja at gmx.net Wed Oct 4 23:18:07 2006
From: evilninja at gmx.net (Christian)
Date: Thu, 5 Oct 2006 00:18:07 +0100 (BST)
Subject: EXT3 and large directories (fwd)
Message-ID:

(please reply on-list, so everybody can comment/help)

Matt, thanks for the details, but apart from mount-option tuning and dir_index (which you've already enabled), I don't know why ls(1) would take *hours* to stat ~1M files... Out of curiosity: are you able to try a newer kernel? Does it change anything?

Christian.

---------- Forwarded message ----------

The dir_index was enabled during the filesystem creation.
The directories can have from 0 - 1,000,000 files at any given time. The slowness is on an open of the directory when it is empty. The size of the directory refers to the directory file itself, not the size of the directory's contents. There are no open files in the directory when it is being statted. I will add that when we do have 50,000 files or more in the directories, a listing is also very slow and can take hours. Slowness can be 5-10 minutes on a directory open:

open("/ems/bigdisk/132", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=79298560, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, /* 3 entries */, 4096) = 72
getdents64(3, /* 0 entries */, 4096) = 0
close(3) = 0

Kernel is 2.6.9-34.0.2.ELsmp

I have tried noatime, which does speed up reads when there are lots of files but does not fix the directory issue.

These directories are empty except for two files:

drwxr-xr-x 3 vbox132 root  76M Oct  2 10:04 132
drwxr-xr-x 3 vbox151 root 226M Oct  4 17:00 151
drwxr-xr-x 3 vbox229 root  33M Oct  2 10:16 229
drwxr-xr-x 3 vbox235 root 7.5M Oct  2 10:14 235
drwxr-xr-x 3 vbox246 root  52M Sep 30 20:59 246
drwxr-xr-x 3 vbox249 root 1.1M Oct  2 10:04 249

--
BOFH excuse #83:
Support staff hung over, send aspirin and come back LATER.
_______________________________________________
Ext3-users mailing list
Ext3-users at redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users

From adilger at clusterfs.com Thu Oct 5 00:43:22 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 4 Oct 2006 18:43:22 -0600
Subject: EXT3 and large directories
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
Message-ID: <20061005004322.GQ22010@schatzie.adilger.int>

On Oct 04, 2006 16:33 -0500, Matt Dodson wrote:
> I have an ext3 filesystem that has several directories and each
> directory gets a large number of files inserted and then deleted over
> time. The filesystem is basically used as a temp store before files are
> processed. The issue is over time the directory scans get extremely slow
> even if the directories are empty. I have noticed the directories can
> range in size from 4k - 100M even when they are empty. Is there a way
> to fix this without recreating the directories or bringing the
> filesystem offline?

No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink directories when deleting files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From Matt_Dodson at messageone.com Thu Oct 5 02:12:24 2006
From: Matt_Dodson at messageone.com (Matt Dodson)
Date: Wed, 4 Oct 2006 21:12:24 -0500
Subject: EXT3 and large directories
In-Reply-To: <20061005004322.GQ22010@schatzie.adilger.int>
Message-ID: <44B5599C8B5B1347AFF903FDCEC00307A05945@auscorpex-1.austin.messageone.com>

Is this a bug or by design? Would there be a better filesystem to use for my situation?

Matt D.
-----Original Message-----
From: Andreas Dilger [mailto:adilger at clusterfs.com]
Sent: Wednesday, October 04, 2006 7:43 PM
To: Matt Dodson
Cc: ext3-users at redhat.com
Subject: Re: EXT3 and large directories

On Oct 04, 2006 16:33 -0500, Matt Dodson wrote:
> I have an ext3 filesystem that has several directories and each
> directory gets a large number of files inserted and then deleted over
> time. The filesystem is basically used as a temp store before files are
> processed. The issue is over time the directory scans get extremely slow
> even if the directories are empty. I have noticed the directories can
> range in size from 4k - 100M even when they are empty. Is there a way
> to fix this without recreating the directories or bringing the
> filesystem offline?

No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink directories when deleting files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From bjacke at sernet.de Thu Oct 5 13:57:57 2006
From: bjacke at sernet.de (=?iso-8859-1?Q?Bj=F6rn?= JACKE)
Date: Thu, 5 Oct 2006 15:57:57 +0200
Subject: creation time stamps for ext4 ?
Message-ID:

Hi,

I would like to know if there are any plans to introduce a creation timestamp in future ext3/4 versions. Having a 4th timestamp saving the creation time would be very good for projects like Samba, for example. It would be important that the creation time can also be set manually later on by some system call. Systems like FreeBSD's UFS and Solaris' ZFS already support creation times. Unfortunately Linux doesn't have such a thing standardized anywhere, but it would be great if it did.

Are there any plans to add this?

Bjoern

From adilger at clusterfs.com Thu Oct 5 15:19:37 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 5 Oct 2006 09:19:37 -0600
Subject: creation time stamps for ext4 ?
In-Reply-To:
References:
Message-ID: <20061005151937.GV22010@schatzie.adilger.int>

On Oct 05, 2006 15:57 +0200, Björn JACKE wrote:
> I would like to know if there are any plans to introduce a creation
> timestamp in future ext3/4 versions. Having a 4th timestamp saving the
> creation time would be very good for projects like Samba for example.
> It would be important that creation time can also be set manually
> later on by some system call. Systems like FreeBSD's UFS and Solaris'
> ZFS already support creation times. Unfortunately Linux doesn't have
> such a thing standardized anywhere but it would be great if it did.
>
> Are there any plans to add this?

I've given this some thought for adding as part of the nsec timestamp patch. That is more feasible if we move the nsec ctime into the main inode to double as the version field.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From lists at nerdbynature.de Thu Oct 5 15:41:23 2006
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 5 Oct 2006 16:41:23 +0100 (BST)
Subject: EXT3 and large directories
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC00307A05945@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC00307A05945@auscorpex-1.austin.messageone.com>
Message-ID:

On Wed, 4 Oct 2006, Matt Dodson wrote:
> Is this a bug or by design? Would there be a better filesystem to use
> for my situation?

I think it's a design issue; 1M files wasn't common when ext3 came out (1999). ReiserFS is said to be "fast with lots of small files", but as always: evaluate the fs before putting applications on it.

FWIW, I did a little test with ext3 and 0.1M/1M files on an already existing fs (rootfs of an existing FC6 installation): http://nerdbynature.de/bits/2.6.18-mm3/

cheers, Christian.
--
BOFH excuse #143:
had to use hammer to free stuck disk drive heads.
From tytso at mit.edu Thu Oct 5 16:55:04 2006
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 5 Oct 2006 12:55:04 -0400
Subject: creation time stamps for ext4 ?
In-Reply-To: <20061005151937.GV22010@schatzie.adilger.int>
References: <20061005151937.GV22010@schatzie.adilger.int>
Message-ID: <20061005165504.GA23727@thunk.org>

On Thu, Oct 05, 2006 at 09:19:37AM -0600, Andreas Dilger wrote:
> On Oct 05, 2006 15:57 +0200, Björn JACKE wrote:
> > I would like to know if there are any plans to introduce a creation
> > timestamp in future ext3/4 versions. Having a 4th timestamp saving the
> > creation time would be very good for projects like Samba for example.
> > It would be important that creation time can also be set manually
> > later on by some system call. Systems like FreeBSD's UFS and Solaris'
> > ZFS already support creation times. Unfortunately Linux doesn't have
> > such a thing standardized anywhere but it would be great if it did.
> >
> > Are there any plans to add this?
>
> I've given this some thought for adding as part of the nsec timestamp
> patch. That is more feasible if we move the nsec ctime into the main
> inode to double as the version field.

Shoehorning an extra creation time field into the inode is relatively easy, but it's also necessary to have system calls to get and set the creation time. The stat structure doesn't have room for the creation time, so that means a new version of the stat structure exported by the kernel, and a new version of the stat structure exported by glibc.

So there are VFS and glibc changes necessary to make this be useful. But that doesn't prevent us from reserving space in the inode and starting to fill it in with the creation time, although it may be quite a while before it will be easily available to user programs like Samba.
- Ted

From tytso at mit.edu Thu Oct 5 17:02:29 2006
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 5 Oct 2006 13:02:29 -0400
Subject: EXT3 and large directories
In-Reply-To: <20061005004322.GQ22010@schatzie.adilger.int>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int>
Message-ID: <20061005170229.GB23727@thunk.org>

On Wed, Oct 04, 2006 at 06:43:22PM -0600, Andreas Dilger wrote:
> On Oct 04, 2006 16:33 -0500, Matt Dodson wrote:
> > I have an ext3 filesystem that has several directories and each
> > directory gets a large number of files inserted and then deleted over
> > time. The filesystem is basically used as a temp store before files are
> > processed. The issue is over time the directory scans get extremely slow
> > even if the directories are empty. I have noticed the directories can
> > range in size from 4k - 100M even when they are empty. Is there a way
> > to fix this without recreating the directories or bringing the
> > filesystem offline?
>
> No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink directories
> when deleting files.

Well, if there isn't anyone else using the directory, you can also do the following:

mkdir foo.new
mv foo/* foo.new
rmdir foo
mv foo.new foo

And of course, if you know the directory is empty, just do:

rmdir foo
mkdir foo

Historically this is a pretty common restriction in Unix filesystems. If someone cared enough, it would be possible to change ext3/4 to release directory blocks when they are empty, but no one has found it important enough to create such a patch.
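The rebuild sequence above is easy to script; a minimal Python sketch (as noted, nothing else may be using the directory while it runs, and the swap is not atomic, so readers can briefly see the path missing):

```python
import os

def rebuild_directory(path):
    """Shrink a bloated directory by recreating it: move every entry
    into a fresh directory, remove the old (now empty) one, and rename
    the new one into place -- the mkdir/mv/rmdir/mv sequence."""
    tmp = path + ".new"
    os.mkdir(tmp)
    for name in os.listdir(path):
        os.rename(os.path.join(path, name), os.path.join(tmp, name))
    os.rmdir(path)        # the old directory's bloated blocks are freed
    os.rename(tmp, path)  # fresh, compact directory takes its place
```

Unlike the shell version with `mv foo/*`, listing the entries explicitly also catches dotfiles, which would otherwise make the rmdir fail.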
- Ted

From alex at alex.org.uk Thu Oct 5 18:10:30 2006
From: alex at alex.org.uk (Alex Bligh)
Date: Thu, 05 Oct 2006 19:10:30 +0100
Subject: EXT3 and large directories
In-Reply-To: <20061005170229.GB23727@thunk.org>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int> <20061005170229.GB23727@thunk.org>
Message-ID: <1BB7D14D25835639B3770CDB@[192.168.0.101]>

--On 05 October 2006 13:02 -0400 Theodore Tso wrote:
>> >The issue is over time the directory scans get extremely slow
>> > even if the directories are empty. I have noticed the directories can
>> > range in size from 4k - 100M even when they are empty.
...
>> No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink
>> directories when deleting files.
...
> Historically this is a pretty common restriction in Unix filesystems.
> If someone cared enough, it would be possible to change ext3/4 to
> release directory blocks when they are empty, but no one has found it
> important enough to create such a patch.

I had sort of assumed this wouldn't be a problem after htree was incorporated as far as speed, as opposed to size, is concerned - and speed was the original poster's problem, not size on disk. Does that imply there is still some linear searching going on, or that htree is not "enough" to speed up the searches?

How do the deleted entries get reused? E.g. if I have a mail spool application, where a given directory has around 100,000 files in it at any time, and they are periodically deleted by age in batches of (say) 10,000 such that the number in the directory never exceeds 100,000, does the size of the directory just keep growing for ever? Or do newly created directory entries take up the space in the directory of old ones (assume all the filenames are unique)?
Alex

From tytso at mit.edu Thu Oct 5 18:58:18 2006
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 5 Oct 2006 14:58:18 -0400
Subject: EXT3 and large directories
In-Reply-To: <1BB7D14D25835639B3770CDB@[192.168.0.101]>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int> <20061005170229.GB23727@thunk.org> <1BB7D14D25835639B3770CDB@[192.168.0.101]>
Message-ID: <20061005185818.GA7621@thunk.org>

On Thu, Oct 05, 2006 at 07:10:30PM +0100, Alex Bligh wrote:
> I had sort of assumed this wouldn't be a problem after htree was
> incorporated as far as speed, as opposed to size, is concerned - and speed
> was the original poster's problem, not size on disk. Does that imply there
> is still some linear searching going on, or that htree is not "enough" to
> speed up the searches?

The current implementation of htree doesn't shrink leaf nodes when they are empty, so if you create a really, really big directory, and then delete all of the files, the leaf nodes remain in the htree, empty. So htree will speed up the lookup of *specific* files, but it won't speed up readdir() scanning a large, empty directory.

- Ted

From bjacke at sernet.de Thu Oct 5 19:23:12 2006
From: bjacke at sernet.de (=?iso-8859-1?Q?Bj=F6rn?= JACKE)
Date: Thu, 5 Oct 2006 21:23:12 +0200
Subject: creation time stamps for ext4 ?
In-Reply-To: <20061005165504.GA23727@thunk.org>
References: <20061005151937.GV22010@schatzie.adilger.int> <20061005165504.GA23727@thunk.org>
Message-ID:

On 2006-10-05 at 12:55 -0400 Theodore Tso sent off:
> > I've given this some thought for adding as part of the nsec timestamp
> > patch. That is more feasible if we move the nsec ctime into the main
> > inode to double as the version field.
>
> Shoehorning an extra creation time field into the inode is relatively
> easy, but it's also necessary to have system calls to get and set the creation time.
> The stat structure doesn't have room for the creation time, so that
> means a new version of the stat structure exported by the kernel, and
> a new version of the stat structure exported by glibc.
>
> So there are VFS and glibc changes necessary to make this be useful.
> But that doesn't prevent us from reserving space in the inode and
> starting to fill it in with the creation time, although it may be
> quite a while before it will be easily available to user programs like
> Samba.

Yes, probably. But it's a reasonable effort to start that at some time. It's good if the ext3 developers have it in mind already now. Should I open a feature request at bugzilla.kernel.org for the needed VFS changes?

Bjoern

From adilger at clusterfs.com Thu Oct 5 20:07:26 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 5 Oct 2006 14:07:26 -0600
Subject: creation time stamps for ext4 ?
In-Reply-To: <20061005165504.GA23727@thunk.org>
References: <20061005151937.GV22010@schatzie.adilger.int> <20061005165504.GA23727@thunk.org>
Message-ID: <20061005200726.GW22010@schatzie.adilger.int>

On Oct 05, 2006 12:55 -0400, Theodore Tso wrote:
> > I've given this some thought for adding creation time as part of the nsec
> > timestamp patch. That is more feasible if we move the nsec ctime into
> > the main inode to double as the version field.
>
> Shoehorning an extra creation time field into the inode is relatively
> easy, but it's also necessary to have system calls to get and set the
> creation time. The stat structure doesn't have room for the creation
> time, so that means a new version of the stat structure exported by the
> kernel, and a new version of the stat structure exported by glibc.

For Lustre and NFSv4, an in-kernel interface is sufficient. I was thinking that as a preliminary userspace interface we can use getxattr with a standard name like user.crtime.
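That interface can be mocked up today with ordinary user EAs; a minimal Python sketch (user.crtime is the proposed name above, not an implemented kernel attribute, so this simply stores the stamp as a regular xattr and needs a filesystem with user xattr support):

```python
import os
import time

CRTIME_EA = "user.crtime"   # attribute name from the proposal above

def stamp_creation(path, ts=None):
    """Record a creation time for path in a user EA, the way Samba
    could in the absence of large inodes."""
    ts = time.time() if ts is None else ts
    os.setxattr(path, CRTIME_EA, repr(ts).encode())
    return ts

def read_creation(path):
    """Return the stored creation time, or None if the EA is missing
    or the filesystem doesn't support user xattrs."""
    try:
        return float(os.getxattr(path, CRTIME_EA))
    except OSError:
        return None
```

Unlike a real inode field, an EA set this way can be copied or forged by anyone with write access, which is one reason an in-inode field plus a dedicated interface is still preferable.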
Storing the crtime directly in the inode is more efficient than a separate EA, but it would also be compatible if Samba wanted to use real EAs to store this in the absence of large inodes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From adilger at clusterfs.com Thu Oct 5 21:30:36 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 5 Oct 2006 15:30:36 -0600
Subject: EXT3 and large directories
In-Reply-To: <1BB7D14D25835639B3770CDB@[192.168.0.101]>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int> <20061005170229.GB23727@thunk.org> <1BB7D14D25835639B3770CDB@[192.168.0.101]>
Message-ID: <20061005213036.GZ22010@schatzie.adilger.int>

On Oct 05, 2006 19:10 +0100, Alex Bligh wrote:
> How do the deleted entries get reused? EG if I have a mail spool
> application, where a given directory has around 100,000 files in at any
> time, and they are periodically deleted by age in batches of (say) 10,000
> such that the number in the directory never exceeds 100,000, does the size
> of the directory just keep growing for ever? Or do newly created directory
> entries take up the space in the directory of old ones (assume all the
> filenames are unique).

It depends on the hash function, and the nature of the filenames being used. The hash function should be good at randomizing the hashes, and in the above case I would expect a very uniform hash distribution. That means the empty entries would be filled relatively uniformly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
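The uniform-spread point is easy to illustrate with a stand-in hash (MD5 here; ext3's actual htree hashes are TEA or half-MD4, selected by the "Default directory hash" superblock field shown in the earlier tune2fs output):

```python
import hashlib

def bucket_counts(names, buckets=16):
    """Spread filenames across hash buckets, the way htree spreads
    directory entries across leaf blocks.  MD5 is only a stand-in for
    ext3's real directory hashes."""
    counts = [0] * buckets
    for name in names:
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
        counts[h % buckets] += 1
    return counts

# Sequential mail-spool style names still spread almost evenly,
# so slots freed by batch deletions get refilled across all buckets.
counts = bucket_counts("msg%07d" % i for i in range(20000))
assert all(abs(c - 1250) < 250 for c in counts)  # within 20% of 20000/16
```

A pathological hash (say, one keyed only on the numeric suffix modulo bucket count) would instead concentrate deletions and insertions in a few leaf blocks.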
From bothie at gmx.de Sun Oct 8 16:45:19 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 18:45:19 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20060925154818.GC22010@schatzie.adilger.int>
References: <4516C67E.10609@bcgreen.com> <20060925154818.GC22010@schatzie.adilger.int>
Message-ID: <20061008184519.6082fe00@30_bodo.rupinet>

Andreas Dilger wrote:
> On Sep 24, 2006 10:55 -0700, Stephen Samuel wrote:
> > Having just spent a day trying to recover a deleted ext3 file
> > for a friend, I'm wondering about this way of maintaining
> > undelete information in ext3, like is done for ext2:
> >
> > The last step in the deletion process would be to put back
> > the (previously zeroed) block pointers. Since it gets logged
> > to the journal, I _think_ that this should be safe. The worst
> > that would happen is that, if the plug gets pulled in the
> > middle of a file delete, the old block pointers would be
> > unavailable -- I don't see this as a killer issue, since
> > editing the filesystem to do an undelete should be considered an
> > emergency operation anyways.
>
> I've written a couple of times the best way to do this,

Your solution works only for small files. Big files must be managed another way, as I wrote on Sun, 1 Feb 2004 07:00:58 +0100 in the thread "Ext3 and undeletion - A way how it could work." But it seems that the problem is not ideas on how to implement it, but somebody actually doing it ... I don't have the knowledge currently, else I would have done it already.
Regards, Bodo

From bothie at gmx.de Sun Oct 8 16:52:14 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 18:52:14 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20060924195319.GC11083@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org>
Message-ID: <20061008185214.04ed1b8f@30_bodo.rupinet>

Theodore Tso wrote:
> The other caveat is that
> storing all of the previously zeroed block pointers temporarily in
> memory could take quite a bit of memory, especially if what is being
> deleted is really big.

Even Stephen's idea doesn't need MBs of space. After freeing all blocks pointed to by an ind, that ind is unlinked in its dind or in the inode, whichever applies. At that moment, we can already restore its contents. So the worst case for 8k blocks is to remember two ind blocks, two dind blocks and one tind block and the inode. That makes 41088 bytes. I don't consider this a problem ;)

> Of course, storing the information as a series of extents would be an
> obvious optimization, which would work on all but a very badly
> fragmented file (for example, if said DVD .iso image was created when
> the filesystem was close to 100% full).

Or just read my mail from Sun, 1 Feb 2004 07:00:58 +0100 (Ext3 and undeletion - A way how it could work.)

> There are some other ways it could be done that would be more optimized,
> but the bottom line is that the main reason why it hasn't been done is
> because the people who could do it haven't had the time to implement
> it. We've been working on other features that are higher priority,
> either for ourselves or for our employers. :(

But as I said: ideas are not the problem. Time is the problem.
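The 41088-byte worst case above checks out arithmetically (8kB blocks, classic 128-byte ext3 on-disk inode):

```python
BLOCK_SIZE = 8192   # 8k blocks, as in Bodo's worst case
INODE_SIZE = 128    # classic ext3 on-disk inode

# While freeing a triply-indirect file, at most this much must be
# remembered at once: the current and previous single-indirect (ind)
# blocks, the current and previous double-indirect (dind) blocks, the
# one triple-indirect (tind) block, and the inode itself.
worst_case = (2 + 2 + 1) * BLOCK_SIZE + INODE_SIZE
assert worst_case == 41088
```

With the more common 4kB blocks the same bound would be 5 * 4096 + 128 = 20608 bytes.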
Regards, Bodo

From tytso at mit.edu Sun Oct 8 17:03:29 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 8 Oct 2006 13:03:29 -0400
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008185214.04ed1b8f@30_bodo.rupinet>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet>
Message-ID: <20061008170329.GA30816@thunk.org>

On Sun, Oct 08, 2006 at 06:52:14PM +0200, Bodo Thiesen wrote:
> Theodore Tso wrote:
>
> > The other caveat is that storing all of the previously zeroed block
> > pointers temporarily in memory could take quite a bit of memory,
> > especially if what is being deleted is really big.
>
> Even Stephen's idea doesn't need MBs of space. After freeing all blocks
> pointed to by an ind block, that ind block is unlinked from its dind
> block or from the inode, whichever applies. At that moment, we can
> already restore its contents. So the worst case for 8k blocks is to
> remember two ind blocks, two dind blocks, one tind block, and the inode.
> That makes 41088 bytes. I don't consider that a problem ;)

Actually, you can't --- that's the problem. Until the changes are
committed, which means that the changes represented in the filesystem are
self-consistent and in a transaction which has been committed to the
journal, you can't start restoring the information in the indirect block.
You could if you forced transaction boundaries between every single
indirect block, but that would seriously degrade ext3's unlink performance,
and slow down any other filesystem activity that might be happening at the
same time.

This is what makes the undelete problem so subtle. Doing it in a way that
is optimal for performance, preserves the journalling guarantees, and yet
still allows the undelete is more complicated than it first appears.

> But as I said: ideas are not the problem. Time is the problem.

Yep, exactly.
- Ted

From bothie at gmx.de Sun Oct 8 17:40:12 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 19:40:12 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008170329.GA30816@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org>
Message-ID: <20061008194012.35e49669@30_bodo.rupinet>

Theodore Tso wrote:

> Actually, you can't --- that's the problem. Until the changes are
> committed, which means that the changes represented in the filesystem
> are self-consistent and in a transaction which has been committed to
> the journal, you can't start restoring the information in the indirect
> block.

I don't see the problem here. Ok, I must admit I don't know the code very
well, especially the journalling part; I only know the non-journalling
on-disk structures. But just consider: we are talking about committing
transactions or not committing transactions. Assume we have a big file, and
ind block I1, dind block D1 and tind block T must be changed to be
self-consistent. Ok, no problem: we store the original contents of these
three blocks in memory, and then update (i.e. zero out) some parts. In the
next transaction, we need to change I2, D2 and T. If I1 != I2, we restore
I1 in this transaction - it's no longer needed - remember the old content
of I2, and log the changes for I2 in the journal. The same applies to D2
vs. D1. If I1 and I2 (or D1 and D2, respectively) are the same, they are
just updated, leaving the in-memory copy of the original data alone.

So I don't see the point why we would need to force the data to disk. If
the system crashes, I1 will just be written several times instead of just
once. But after the whole log has been replayed, the file system is
consistent again.
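The bookkeeping just described can be sketched as a toy in a few lines
(purely illustrative Python, not ext3 code; the block names, contents, and
transaction layout are all invented for the sketch):

```python
# Toy sketch of the scheme above (NOT ext3 code): while zeroing indirect
# block I_n, keep its original content in memory; the transaction that moves
# on to I_(n+1) also journals the restore of I_n's old pointers.
def plan_transactions(indirect_blocks):
    """indirect_blocks: list of (name, original_content) pairs, in deletion
    order. Returns a list of transactions, each a list of journalled writes."""
    transactions = []
    pending_restore = None
    for name, original in indirect_blocks:
        writes = []
        if pending_restore is not None:
            # the previous ind block is fully freed: restore its old pointers
            writes.append(("restore",) + pending_restore)
        writes.append(("zero", name))
        pending_restore = (name, original)
        transactions.append(writes)
    if pending_restore is not None:
        transactions.append([("restore",) + pending_restore])
    return transactions

for t in plan_transactions([("I1", "old-ptrs-1"), ("I2", "old-ptrs-2")]):
    print(t)
```

The point of the sketch is only the shape: the restore of I1 rides in the
same transaction that zeroes I2, so no extra commit is forced between them.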
The only misfeature now is that the in-memory copy of the original versions
of the blocks will be lost, but my proposal from long ago fixes that as
well, by storing the updates in places other than the original version,
which remains unmodified.

Regards, Bodo

From tytso at mit.edu Sun Oct 8 19:40:20 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 8 Oct 2006 15:40:20 -0400
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008194012.35e49669@30_bodo.rupinet>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet>
Message-ID: <20061008194020.GA26726@thunk.org>

On Sun, Oct 08, 2006 at 07:40:12PM +0200, Bodo Thiesen wrote:
> We are talking about committing transactions or not committing
> transactions. Assume we have a big file, and ind block I1, dind
> block D1 and tind block T must be changed to be self-consistent. Ok,
> no problem: we store the original contents of these three blocks in
> memory, and then update (i.e. zero out) some parts. In the next
> transaction, we need to change I2, D2 and T. If I1 != I2, we restore
> I1 in this transaction - it's no longer needed - remember the old
> content of I2, and log the changes for I2 in the journal. Same

"In the next transaction" --- that's exactly the problem, as I said in my
earlier comment:

   You could if you forced transaction boundaries between every single
   indirect block, but that would seriously degrade ext3's unlink
   performance, and slow down any other filesystem activity that might be
   happening at the same time.

The way ext3 works is that we batch multiple operations into a single
transaction. This is because committing transactions is expensive, so we
amortize the cost over a potentially large number of filesystem operations
that might be happening very close together.
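The amortization effect can be put in rough numbers with a toy cost model
(the constants are invented purely for illustration; real commit costs
depend on the device and workload):

```python
# Toy cost model for transaction batching: each journal commit pays a fixed
# cost, each metadata update a small cost, so batching many operations into
# one transaction amortizes the commit overhead. Constants are made up.
COMMIT_COST = 100  # hypothetical fixed cost per journal commit
OP_COST = 1        # hypothetical cost of logging one metadata update

def total_cost(n_ops, ops_per_transaction):
    # ceil division: number of transactions needed to cover all operations
    transactions = -(-n_ops // ops_per_transaction)
    return n_ops * OP_COST + transactions * COMMIT_COST

print(total_cost(1000, 1))    # a commit per operation: 101000
print(total_cost(1000, 100))  # batched, 100 ops/txn:     2000
```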
So your "trick" would force a single unlink system call into multiple ext3
transactions, each of which would have to be written to the disk, and each
of which would have to stall until all journal blocks have been written to
the disk before the journal commit block is written. The resulting
performance degradation would be disastrous.

- Ted

From bothie at gmx.de Sun Oct 8 21:38:22 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 23:38:22 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008194020.GA26726@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet> <20061008194020.GA26726@thunk.org>
Message-ID: <20061008233822.481e8647@30_bodo.rupinet>

Theodore Tso wrote:

> "In the next transaction" --- that's exactly the problem, as I said
> in my earlier comment:
>
>    You could if you forced transaction boundaries between every single
>    indirect block, but that would seriously degrade ext3's unlink
>    performance, and slow down any other filesystem activity that might
>    be happening at the same time.
>
> The way ext3 works is that we batch multiple operations into a single
> transaction. This is because committing transactions is expensive, so
> we amortize the cost over a potentially large number of filesystem
> operations that might be happening very close together.

What does the journalling code do if a block x which was already written to
in a transaction gets written to again? Say we delete a small file from a
directory and immediately recreate it, so the same directory data block
needs to be updated again. Will this require a new transaction as well? If
not, my approach doesn't either.

BTW: When I talked about a transaction, I obviously meant something
different than you did; on the other hand, that was my fault. What I meant
by a transaction is something like an atom.
Moving a file from directory A to directory B needs (at least) four
updates: the inodes of the two directories and the directory data blocks. I
would say that this update is one transaction. But you would say that it is
only part of a transaction, as you would put the deletion of another file,
writing some data to an iso image, and whatever else into the same
transaction. So just replace my "transactions" by "transaction atoms", and
then read again what I wrote; maybe that makes my idea clearer.

As soon as I1 is completely zeroed, it will be unlinked from D1, and thus
I1 doesn't need to be written as containing zeros. So if no update to I1
was already committed to disk, there is no need to do it at all (something
like forget should be available in the journalling code as well). If it was
already committed, its original content needs to be committed in the next
transaction, but there is no need to force a commit at this place at all.

Regards, Bodo

From bothie at gmx.de Sun Oct 8 21:59:03 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 23:59:03 +0200
Subject: Root filesystem on ext2
In-Reply-To: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202>
References: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202>
Message-ID: <20061008235903.19ecc8a2@30_bodo.rupinet>

"Jayjitkumar Lobhe" wrote:

> - My initrd image

... is irrelevant ...

> - /etc/fstab

... which is irrelevant as well, as the *kernel* doesn't look there.

> - I don't mount the real root during [...] linuxrc [...] the kernel will
> mount it after linuxrc is finished.

Right, using the general autoprobe order. You have to know: ext2 and ext3
(and ext4) are the same file system. There are three different drivers for
the same file system: ext2, which supports the "normal" filesystem
including many extensions but NOT the extension "journalling"; that
extension is only supported by the ext3 *driver*.
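That shared on-disk format is easy to see directly: whether a filesystem
"is ext3" is just a compat-feature bit in the common superblock. Here is a
sketch, assuming the standard ext2 superblock layout (superblock at byte
1024, s_feature_compat as a little-endian u32 at offset 92, and
EXT3_FEATURE_COMPAT_HAS_JOURNAL == 0x0004); the device path in the usage
comment is only an example:

```python
# ext2 and ext3 share one superblock format; "has a journal" is one compat
# feature bit. Offsets below follow the standard ext2 superblock layout.
import struct

EXT3_FEATURE_COMPAT_HAS_JOURNAL = 0x0004
S_FEATURE_COMPAT_OFFSET = 92  # byte offset of s_feature_compat in the superblock

def has_journal(superblock: bytes) -> bool:
    """True if the given raw superblock carries the HAS_JOURNAL compat flag."""
    (s_feature_compat,) = struct.unpack_from("<I", superblock,
                                             S_FEATURE_COMPAT_OFFSET)
    return bool(s_feature_compat & EXT3_FEATURE_COMPAT_HAS_JOURNAL)

# Example usage against a real device (path illustrative, needs read access):
# with open("/dev/hda1", "rb") as dev:
#     dev.seek(1024)              # superblock starts at byte 1024
#     print(has_journal(dev.read(1024)))
```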
But again: it's the same file system (thus calling it ext3 and ext4 is a
very, very bad misnomer, but that is another story). The kernel must
autoprobe for the driver to use when mounting its root file system, as it
doesn't get any hints. So it tries to mount as iso9660 ... and fails; it
tries to mount as vfat ... and fails; and so on. The order in which the
file system *drivers* are tried is controlled by the order in which they
are registered inside the kernel. Your ext3 driver will be registered later
than the ext2 driver. So ext2 will be tried first, and ext2 recognizes the
file system as ext2 as long as the file system was unmounted correctly. So
the ext2 driver can successfully mount the root file system, and you are
stuck without journalling forever [i.e. until you reboot].

Solutions:

a) Make the ext3 file system driver a part of the static kernel (i.e. NOT
   a module). In this case the kernel code makes sure ext3 gets registered
   before ext2. ext3 only works with file systems containing a journal,
   and thus leaves those file systems which don't have a journal alone.

b) Don't unmount the root file system before rebooting *scnr* [Of course
   you would need it to be mounted as ext3 at some point for this to work
   (a Knoppix boot would suffice), but you wouldn't consider that anyways,
   I hope ;)]

c) Change the file system of the initrd to ANYTHING other than ext2 AND
   make ext2 a module, like ext3. Then make sure to modprobe ext3 BEFORE
   ext2.

> - The system boots up successfully, mount command shows / partition
> mounted as ext3 but /proc/mounts shows it as ext2.

That's another thing. When you (the kernel) mount(s) your root file system,
1.) that file system isn't accessible, nor is it writable yet (it becomes
so only AFTER being mounted), and 2.) there is no mount utility doing it. I
guess your /etc is on your root partition. In /etc you will find a file
called mtab. That file contains the (wrong and very old) information that
your root was mounted using the ext3 file system driver.
mount just cats this file and thus shows the same wrong information.
/proc/mounts contains the current and correct information known by the
kernel. Some people have even deleted /etc/mtab already and replaced it by
a symlink to /proc/mounts. Other people said that can be a problem; I don't
see the point, but just to warn you ;). For me, it worked fine.

Regards, Bodo

From bryan at kadzban.is-a-geek.net Mon Oct 9 02:13:14 2006
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Sun, 08 Oct 2006 22:13:14 -0400
Subject: Root filesystem on ext2
In-Reply-To: <20061008235903.19ecc8a2@30_bodo.rupinet>
References: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202> <20061008235903.19ecc8a2@30_bodo.rupinet>
Message-ID: <4529B03A.5080508@kadzban.is-a-geek.net>

Bodo Thiesen wrote:

> Solutions:
>
> a) Make the ext3 file system driver a part of the static kernel
> b) Don't unmount the root file system before rebooting
> c) Change the file system of the initrd to ANYTHING other than ext2
>    AND make ext2 a module, like ext3. Then make sure to modprobe
>    ext3 BEFORE ext2.

d) Change your initramfs to manually mount the root filesystem. You will be
able to completely specify what you want mount to do, including using a
different FS than it normally would. (Er, wait, you still use an initrd? I
suppose that doesn't really matter, but initramfs is newer.)

> Some people have even deleted /etc/mtab already and replaced it by a
> symlink to /proc/mounts. Other people said that can be a problem; I
> don't see the point, but just to warn you ;). For me, it worked fine.

You must not be using any mount options that require keeping state between
mount and umount, then. That's not the case in general. One such option is
"user" -- with "user", any user can mount the FS, but only that same user
(or root) is allowed to umount it. To enforce this, mount has to keep track
of which user did the mount -- it does so in /etc/mtab.
The kernel doesn't care (this restriction is enforced by the setuid-root
mount and umount programs, not the kernel), so that information does not
appear in /proc/mounts at all. If you have a "user" FS in fstab, then I'd
be willing to bet that in your symlink setup any user can mount it, and any
other user can umount it. If you want to try it, mount one of them as root,
then see if you can umount it as a user. I can't when mtab is not a
symlink; this is correct behavior.

From tytso at mit.edu Mon Oct 9 03:12:09 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 8 Oct 2006 23:12:09 -0400
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008233822.481e8647@30_bodo.rupinet>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet> <20061008194020.GA26726@thunk.org> <20061008233822.481e8647@30_bodo.rupinet>
Message-ID: <20061009031209.GA24190@thunk.org>

On Sun, Oct 08, 2006 at 11:38:22PM +0200, Bodo Thiesen wrote:
> BTW: When I talked about a transaction, I obviously meant something
> different than you did; on the other hand, that was my fault. What I
> meant by a transaction is something like an atom. Moving a file from
> directory A to directory B needs (at least) four updates: the inodes of
> the two directories and the directory data blocks. I would say that
> this update is one transaction. But you would say that it is only part
> of a transaction, as you would put the deletion of another file,
> writing some data to an iso image, and whatever else into the same
> transaction. So just replace my "transactions" by "transaction atoms",
> and then read again what I wrote; maybe that makes my idea clearer.
Ah, but that brings up the other problem, which is that for a really big
file, your "transaction atom" might not fit in a single "transaction".
Remember, it's not just about keeping the inode, indirect block, double
indirect, and triple indirect blocks up to date; it's also about all of
those block allocation bitmaps, and for a big file, the number of block
bitmaps you might have to touch can grow very large indeed.

If the number of blocks that have to be touched during the unlink is larger
than the space left in the journal, then we have to write a consistent
snapshot of the inode, indirect, double indirect, and triple indirect
blocks, plus all of the block bitmaps. And if you try to "restore" the
blocks afterwards, that's potentially an extra block that needs to be
journaled in the new transaction, and getting that all right is more than a
little bit tricky.

Now, the good news is that we are using bforget in journal_forget now, and
that at least some of the time, restoring the i_blocks[] pointers will
allow the inode to be recovered --- although if the unlink operation takes
multiple transactions, you won't get the entire inode recovered that way.

The bottom line is that the interaction of truncate and journalling gets
tricky if you want it to be 100% reliable. If you're willing to settle for
"mostly working", it's probably not that hard.
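To put a rough number on "very large indeed", here is a back-of-the-envelope
sketch. The 4 KiB block size and 1 TiB file size are assumptions chosen for
illustration; the layout facts (one block bitmap per block group, each group
covering block_size * 8 blocks) are standard ext2/ext3:

```python
# Rough scale of how many block bitmaps a single unlink can touch.
# Assumptions: 4 KiB blocks; one bitmap block per block group; each group
# covers block_size * 8 blocks (standard ext2/ext3 layout); a 1 TiB file.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = BLOCK_SIZE * 8          # 32768 blocks per group

file_blocks = (1 << 40) // BLOCK_SIZE      # 268435456 data blocks in 1 TiB
groups_touched = file_blocks // BLOCKS_PER_GROUP
print(groups_touched)  # 8192 block bitmaps, at minimum, for this file
```

Since a group holds at most 32768 blocks, a file this size necessarily
spans thousands of groups, so freeing it dirties thousands of bitmap blocks.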
- Ted

From adilger at clusterfs.com Wed Oct 11 21:01:20 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 11 Oct 2006 15:01:20 -0600
Subject: Retaining undelete data on ext3
In-Reply-To: <20061009031209.GA24190@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet> <20061008194020.GA26726@thunk.org> <20061008233822.481e8647@30_bodo.rupinet> <20061009031209.GA24190@thunk.org>
Message-ID: <20061011210120.GS22010@schatzie.adilger.int>

On Oct 08, 2006 23:12 -0400, Theodore Tso wrote:
> The bottom line is that the interaction of truncate and journalling
> gets tricky if you want it to be 100% reliable. If you're willing to
> settle for "mostly working", it's probably not that hard.

You can't be 100% reliable with undelete anyway, because there is no
guarantee that the blocks won't be reallocated right away. Having a 95%
undelete solution in a few lines of code would be worthwhile, IMHO, since
this topic comes up a lot and I've lamented on a few occasions the fact
that you can't ever salvage deleted files from ext3.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From jeffg at ahpcrc.org Mon Oct 16 17:04:56 2006
From: jeffg at ahpcrc.org (Jeff Garlough)
Date: Mon, 16 Oct 2006 12:04:56 -0500
Subject: dual-ported raid
Message-ID: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>

Hi,

I have a dual-ported raid controller which allows two computers to connect
to the same ext3 filesystem. I never mount both systems read-write at the
same time. What I would like to do is use one normally, and mount the
second system read-only to perform backups and to rsync the filesystem to
another filesystem. When it's mounted read-write from another system, will
mounting the same filesystem read-only cause the journal to be committed at
the time it's mounted?
If so, is that a bad thing, that is, will it corrupt the filesystem? Are
journal events handled similarly to databases with regard to transaction
processing, or could playing "partial" journal events (if there is such a
thing) cause corruption? Is mounting the read-only instance as an ext2
filesystem the best solution, or does it matter whether it's mounted ext2
or ext3 as long as it's read-only?

-- Jeff Garlough

From adilger at clusterfs.com Mon Oct 16 18:53:44 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 16 Oct 2006 12:53:44 -0600
Subject: dual-ported raid
In-Reply-To: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>
References: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>
Message-ID: <20061016185344.GL6221@schatzie.adilger.int>

On Oct 16, 2006 12:04 -0500, Jeff Garlough wrote:
> What I would like to do is use one normally, and mount the second
> system read-only to perform backups and to rsync the filesystem to
> another filesystem. When it's mounted read-write from another system,
> will mounting the same filesystem read-only cause the journal to be
> committed at the time it's mounted?

Yes, that is very bad.

> If so, is that a bad thing, that is, will it corrupt the filesystem?

Yes, it can corrupt the filesystem.

> Are journal events handled similarly to databases with regard to
> transaction processing, or could playing "partial" journal events (if
> there is such a thing) cause corruption? Is mounting the read-only
> instance as an ext2 filesystem the best solution, or does it matter
> whether it's mounted ext2 or ext3 as long as it's read-only?

You can't mount it as ext2.

I would instead use a block-device level backup, like "dump", if you really
need to do it this way. You are probably better off just doing the backup
from the primary node.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
From worleys at gmail.com Mon Oct 16 19:43:26 2006
From: worleys at gmail.com (Chris Worley)
Date: Mon, 16 Oct 2006 13:43:26 -0600
Subject: dual-ported raid
In-Reply-To: <20061016185344.GL6221@schatzie.adilger.int>
References: <20061016170456.0418B4D291@que.ncs.ahpcrc.org> <20061016185344.GL6221@schatzie.adilger.int>
Message-ID:

You can do it if the two systems use different LUNs for their
ext/reiser/xfs file systems, or if you use GFS as the file system (then you
can mount the same FS read/write).

On 10/16/06, Andreas Dilger wrote:
> On Oct 16, 2006 12:04 -0500, Jeff Garlough wrote:
> > What I would like to do is use one normally, and mount the second
> > system read-only to perform backups and to rsync the filesystem to
> > another filesystem. When it's mounted read-write from another system,
> > will mounting the same filesystem read-only cause the journal to be
> > committed at the time it's mounted?
>
> Yes, that is very bad.
>
> > If so, is that a bad thing, that is, will it corrupt the filesystem?
>
> Yes, it can corrupt the filesystem.
>
> > Are journal events handled similar to databases, with regard to
> > transaction processing of journal events, or could playing "partial"
> > journal events (if there is such a thing) cause corruption? Is
> > mounting the read-only instance as a ext2 filesystem the best
> > solution, or does it matter if it's mounted ext2 or ext3 as long as
> > it's read-only?
>
> You can't mount it as ext2.
>
> I would instead use a block-device level backup, like "dump" if you
> really need to do it this way. You are probably better off just doing
> the backup from the primary node.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

From daniel at rimspace.net Mon Oct 16 23:20:22 2006
From: daniel at rimspace.net (Daniel Pittman)
Date: Tue, 17 Oct 2006 09:20:22 +1000
Subject: dual-ported raid
References: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>
Message-ID: <87vemjrgp5.fsf@rimspace.net>

Jeff Garlough writes:

> I have a dual-ported raid controller which allows two computers to
> connect to the same ext3 filesystem. I never mount both systems
> read-write at the same time. What I would like to do is use one
> normally, and mount the second system read-only to perform backups and
> to rsync the filesystem to another filesystem.

That will not work, full stop, ever, with ext3. Find another solution.

If you did do this, envision: on the master node, where read/write
activities are going on, we have a bunch of on-disk data and a bunch of
meta-data in memory -- things like inode allocation tables, etc. These get
written out to disk every now and then, through the journal and for other
reasons, on whatever schedule the master node feels is worthwhile.

Meanwhile, over on the slave node you mount the file system. It reads some
meta-data into memory and keeps it there, for convenience. You start
working on data -- and, meanwhile, over on the master we update some of the
meta-data that the slave has in memory.

Now, the slave doesn't know that was updated, so it keeps using that
in-memory data happily. Except, then it needs to load some fresh data from
disk and, pow, huge inconsistency in the file system.

ext3 alone cannot do what you want. You might get away with it if you can
take a snapshot of the (consistent) state on the master, then mount that on
the slave, but that probably isn't a great plan either. I strongly suggest
you investigate some other solution like, say, simply running your backups
on the master.
You will have the same resource use in both cases, pretty much, unless your
rsync process is very checksum-intensive...

Regards, Daniel
--
Digital Infrastructure Solutions -- making IT simple, stable and secure
Phone: 0401 155 707
email: contact at digital-infrastructure.com.au
http://digital-infrastructure.com.au/

From neotericgnosis at yahoo.com Tue Oct 17 23:04:53 2006
From: neotericgnosis at yahoo.com (Jeff Garlough)
Date: Tue, 17 Oct 2006 16:04:53 -0700 (PDT)
Subject: Subject: Re: dual-ported raid
Message-ID: <20061017230453.81243.qmail@web52202.mail.yahoo.com>

>> What I would like to do is use one normally, and mount the second
>> system read-only to perform backups and to rsync the filesystem to
>> another filesystem. When it's mounted read-write from another system,
>> will mounting the same filesystem read-only cause the journal to be
>> committed at the time it's mounted?
>
> Yes, that is very bad.

Can you elaborate on why mounting a filesystem read-only is "dangerous"?

>> If so, is that a bad thing, that is, will it corrupt the filesystem?
>
> Yes, it can corrupt the filesystem.

I assume, then, that mounting the filesystem read-only flushes the journal.
Why does flushing it "early" corrupt the filesystem?

>> Are journal events handled similar to databases, with regard to
>> transaction processing of journal events, or could playing "partial"
>> journal events (if there is such a thing) cause corruption? Is
>> mounting the read-only instance as a ext2 filesystem the best
>> solution, or does it matter if it's mounted ext2 or ext3 as long as
>> it's read-only?
>
> You can't mount it as ext2.

Why? It seemed to work, although I'm not sure, from the comments I've been
getting, that it's safe. The ext3-faq says:

   How do I convert my ext3 partition back to ext2? Actually there is
   little need to do so, because in most cases it is sufficient to mount
   the partition explicitly as ext2.
> I would instead use a block-device level backup, like "dump", if you
> really need to do it this way. You are probably better off just doing
> the backup from the primary node.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

-- Jeff Garlough

From davids at webmaster.com Wed Oct 18 02:12:40 2006
From: davids at webmaster.com (David Schwartz)
Date: Tue, 17 Oct 2006 19:12:40 -0700
Subject: Subject: Re: dual-ported raid
In-Reply-To: <20061017230453.81243.qmail@web52202.mail.yahoo.com>
Message-ID:

> Can you elaborate on why mounting a filesystem read-only
> is "dangerous"?

You will be interpreting the filesystem based on a mix of current and stale
metadata. There is no way you can be sure what will happen in this case.
Metadata may be read as data or vice versa; pieces of one file may be read
as pieces of another one.

DS

From pengchengzou at gmail.com Fri Oct 20 23:02:53 2006
From: pengchengzou at gmail.com (Pengcheng Zou)
Date: Sat, 21 Oct 2006 07:02:53 +0800
Subject: the worst scenario of ext3 after abnormal powerdown
Message-ID: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com>

Hi,

I have seen and heard of many cases of ext3 corruption after an abnormal
powerdown (e.g. missing all the files in one directory). Yes, a UPS should
help, but I wonder what kind of worst-case scenario ext3 will present after
a powerdown.

Messed-up metadata has been seen in many cases: for example, the indirect
block of one inode contains garbage, which causes the automatic fsck to
fail, and the user has to repair the file system manually (which always
results in some missing files). Should I blame ext3 for it? Or should I
just turn off the disk write cache?
It seems Windows NTFS has fewer such problems than ext3, and whether it's a
problem in ext3 or misconfigured hardware, this behavior really causes lots
of people to doubt the stability of Linux file systems.

thanks,
-- Pengcheng

From mnalis-ml at voyager.hr Sat Oct 21 11:43:26 2006
From: mnalis-ml at voyager.hr (Matija Nalis)
Date: Sat, 21 Oct 2006 13:43:26 +0200
Subject: the worst scenario of ext3 after abnormal powerdown
In-Reply-To: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com>
References: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com>
Message-ID: <20061021114326.GA3149@eagle102.home.lan>

On Sat, Oct 21, 2006 at 07:02:53AM +0800, Pengcheng Zou wrote:
> Messed-up metadata has been seen in many cases: for example, the
> indirect block of one inode contains garbage, which causes the
> automatic fsck to fail, and the user has to repair the file system
> manually (which always results in some missing files). Should I blame
> ext3 for it? Or should I just turn off the disk write cache?

In recent 2.6.x you can mount ext3 with "-o barrier=1", and you should then
be able to safely use disks with the write cache on (if the disks support
it -- watch dmesg for "JBD: barrier-based sync failed" errors if not
supported). Read Documentation/block/barrier.txt for more info.

> It seems Windows NTFS has fewer such problems than ext3, and whether
> it's a problem in ext3 or misconfigured hardware, this behavior really
> causes lots of people to doubt the stability of Linux file systems.

It would be nice to know why "barrier=1" is not the default on ext3 (to be
safe by default, like with data=ordered instead of data=writeback)? (It is
on by default on XFS, for example.)

Also an interesting question on http://lkml.org/lkml/2005/12/18/99: "...
But if you want a different raid level you should ask the ext3 developers
if there is a reason they don't call blkdev_issue_flush if barriers aren't
supported."

-- Opinions above are GNU-copylefted.
From mnalis-ml at voyager.hr Mon Oct 23 17:15:31 2006
From: mnalis-ml at voyager.hr (Matija Nalis)
Date: Mon, 23 Oct 2006 19:15:31 +0200
Subject: the worst scenario of ext3 after abnormal powerdown
In-Reply-To: <24a313060610230727t5e2aa501wcb2258410fcdd1db@mail.gmail.com>
References: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com> <20061021114326.GA3149@eagle102.home.lan> <24a313060610230727t5e2aa501wcb2258410fcdd1db@mail.gmail.com>
Message-ID: <20061023171531.GA3240@eagle102.home.lan>

On Mon, Oct 23, 2006 at 10:27:20PM +0800, Pengcheng Zou wrote:
> Thanks a lot for the explanation. So if I understand it correctly, to
> get reliable data storage, I need to turn off the write cache or enable
> barriers. Both methods depend on the hardware. So how do I know whether
> a disk or drive supports a write cache? How do I turn off the write
> cache (I know hdparm -W0 for IDE, but how do I turn off the write cache
> of a SCSI drive)?

Maybe http://scsirastools.sourceforge.net/ ?

Also see:
http://www-dt.e-technik.uni-dortmund.de/~ma/linux/kernel/safe-write-caches.html

-- Opinions above are GNU-copylefted.

From ramanara at cse.psu.edu Wed Oct 25 23:43:24 2006
From: ramanara at cse.psu.edu (Rajaraman Ramanarayanan)
Date: Wed, 25 Oct 2006 19:43:24 -0400 (EDT)
Subject: FS corruption? bogus i_mode
Message-ID:

Hello,

I am doing some testing on a PXA270-based processor on a single board
computer, which makes the processor vulnerable to bit flips. One such bit
flip seems to have corrupted the file system. The debug port on the board
had the following messages when I think the FS corruption occurred:

<7>init_special_inode: bogus i_mode (33061)
init_special_inode: bogus i_mode (30071)
init_special_inode: bogus i_mode (34065)
init_special_inode: bogus i_mode (30061)
init_special_inode: bogus i_mode (33061)
init_special_inode: bogus i_mode (30071)

After this happened, directories like bin etc.
were corrupted ( I am pasting the screen shot of ll commands that i did) which meant that i could not start the board again using the same FS (I had to re install the root file system on the hard drive). My question is what error could have caused a file system corruption like this. Is it possible to trace and analyze if i have the whole FS backed up? The OS was debian linux. I hope the question is clear and the given information is useful enough to make some comments. Here is the screen shot of the ll commands for 2 of the directories: (The total space in the partition was 4GB) ************************************************************************* segrith.cse.psu.edu 66% du -khs bin 426G bin segrith.cse.psu.edu 67% ll total 446404348 cr-Sr-S--- 8240 959265076 876099129 32, 50 Oct 2 1997 bin drwxr-xr-x 2 root root 4096 Sep 30 2005 boot drwxr-xr-x 6 root root 24576 Oct 10 15:29 dev drwxr-xr-x 61 root root 4096 Oct 10 15:32 etc drwxr-xr-x 2 root root 4096 Sep 30 2005 home drwxr-xr-x 2 root root 4096 Dec 31 1969 initrd drwxr-xr-x 9 root root 4096 Jan 12 2006 lib drwxr-xr-x 2 root root 16384 Dec 31 1969 lost+found drwxr-xr-x 4 root root 4096 Dec 19 2005 media drwxr-xr-x 8 root root 4096 Apr 26 15:39 mnt drwxr-xr-x 3 root root 4096 Dec 19 2005 opt dr-xr-xr-x 2 root root 4096 Dec 31 1969 proc drwxr-xr-x 4 root root 4096 Oct 9 23:10 root drwxr-xr-x 2 root root 4096 May 1 11:38 sbin drwxr-xr-x 2 root root 4096 Jan 12 2006 selinux drwxr-xr-x 2 root root 4096 Dec 31 1969 srv drwxr-xr-x 2 root root 4096 Dec 31 1969 sys drwxrwxrwt 4 root root 4096 Oct 10 15:29 tmp drwxr-xr-x 12 root root 4096 Dec 19 2005 usr drwxr-xr-x 13 root root 4096 Dec 19 2005 var segrith.cse.psu.edu 68% cd root/samplecodes/test7/ segrith.cse.psu.edu 69% du -khs * 426G a.out 0 err.out 434G matrix_a 458G matrix_b 394G matrix_c 426G matrix_d 434G matrix_e 394G matrix_f segrith.cse.psu.edu 70% ll total 3107434033 ?---rw---x 11552 892546336 959789109 943207220 Dec 28 1993 a.out -rw-r--r-- 1 root root 0 Oct 
10 15:09 err.out ?---rwS--t 13869 909522483 540549173 926166304 Dec 28 1993 matrix_a ?--Srw-r-x 11552 943140128 757084720 808726580 Dec 28 1993 matrix_b ?---rwx--x 8246 842276912 540030005 859124013 Feb 11 1987 matrix_c ?---rw---x 11552 892546336 959789109 943207220 Dec 28 1993 matrix_d ?---rwS--t 13869 909522483 540549173 926166304 Dec 28 1993 matrix_e ?---rwx--x 8246 842276912 540030005 859124013 Feb 11 1987 matrix_f segrith.cse.psu.edu 71% ************************************************************************ Thank you! Sincerely, Rajaraman From lists at nerdbynature.de Thu Oct 26 14:57:17 2006 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 26 Oct 2006 15:57:17 +0100 (BST) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Wed, 25 Oct 2006, Rajaraman Ramanarayanan wrote: > I am doing some testing on a PXA270 based processor (on a single board > computer) which makes the processor vulnerable to bit flips. One > such bit flip seems to have corrupted the file system. I don't know these PXA270 processors, but your comment reads as if the processor is "prone to bit-flips by design", which I can't believe... so, I guess the CPU broke somehow, was overheated or something? If so, that's like having faulty memory or faulty data paths in general (bus errors, bad cabling, too-hot processors, etc.). All kinds of errors can be caused by this, and the fs can't do much about it, because the code in the fs-driver (any fs) isn't executed in the way it is meant to be. > segrith.cse.psu.edu 66% du -khs bin > 426G bin > segrith.cse.psu.edu 67% ll > total 446404348 > cr-Sr-S--- 8240 959265076 876099129 32, 50 Oct 2 1997 bin So, the system thinks /bin is a 426 GB character device on a 4 GB filesystem? You could run a recent version of e2fsck and see what can be repaired, but I'd suggest getting a stable hardware platform and playing back your backups :( Christian.
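One hedged way to follow that e2fsck suggestion offline, assuming a raw `dd` image of the damaged partition exists (the image file names here are placeholders):

```shell
# Work on a copy so the original image stays intact for later analysis
cp fs.img fs-work.img

# Read-only pass first: report what e2fsck *would* fix, changing nothing
e2fsck -fn fs-work.img

# Then an actual repair attempt, still only on the copy
e2fsck -fy fs-work.img

# Inspect a suspect inode by hand, e.g. the corrupted /bin entry,
# against the pristine image
debugfs -R "stat /bin" fs.img
```

Both e2fsck and debugfs operate happily on plain image files, so none of this needs the flaky hardware attached.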
-- BOFH excuse #54: Evil dogs hypnotised the night shift From ramanara at cse.psu.edu Thu Oct 26 15:46:11 2006 From: ramanara at cse.psu.edu (Rajaraman Ramanarayanan) Date: Thu, 26 Oct 2006 11:46:11 -0400 (EDT) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: Thanks for the response. I am actually exposing the processor to neutron radiation, which makes it vulnerable. Otherwise the processor and the system work fine once they are taken out of the radiation. But this one time when the FS was corrupted I had to re-install the full root file system, as it had corrupted the bin directory itself. But I have backed up the data (using the dd command) to find out what exactly happened. And it looks like the FS is corrupted such that many of the fields are corrupted (including size, file type, owner etc). Thanks again! Sincerely, Rajaraman On Thu, 26 Oct 2006, Christian Kujau wrote: > On Wed, 25 Oct 2006, Rajaraman Ramanarayanan wrote: >> I am doing some testing on a PXA270 based processor (on a single board >> computer) which makes the processor vulnerable to bit flips. One >> such bit flip seems to have corrupted the file system. > > I don't know these PXA270 processors, but your comment reads as if the > processor is "prone to bit-flips by design", which I can't believe... so, I > guess the CPU broke somehow, was overheated or something? > > If so, that's like having faulty memory or faulty data paths in general (bus > errors, bad cabling, too-hot processors, etc.). All kinds of errors can be > caused by this, and the fs can't do much about it, because the code in the > fs-driver (any fs) isn't executed in the way it is meant to be. > >> segrith.cse.psu.edu 66% du -khs bin >> 426G bin >> segrith.cse.psu.edu 67% ll >> total 446404348 >> cr-Sr-S--- 8240 959265076 876099129 32, 50 Oct 2 1997 bin > > So, the system thinks /bin is a 426 GB character device on a 4GB filesystem?
> > You could run a recent version of e2fsck and see what can be repaired, but I'd > suggest getting a stable hardware platform and playing back your backups :( > > Christian. > -- > BOFH excuse #54: > > Evil dogs hypnotised the night shift > From lists at nerdbynature.de Thu Oct 26 16:28:54 2006 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 26 Oct 2006 17:28:54 +0100 (BST) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Thu, 26 Oct 2006, Rajaraman Ramanarayanan wrote: > I am actually exposing the processor to neutron > radiation, which makes it vulnerable. Otherwise the processor and the system > work fine once they are taken out of the radiation. ROFL, this really is the best setup I've read about on ext3-users ;) > But this one time when the > FS was corrupted I had to re-install the full root file system as it had > corrupted the bin directory itself. But I have backed up the data (using the dd > command) to find out what exactly happened. So, if this were reproducible, one could activate the in-kernel debug flags, or more specifically JBD_DEBUG, or even try kdb[0] to see what's going on. Oh, and if we can see corruption patterns while the system is exposed to your special environment, I'd love to test the patch introducing CONFIG_EXT3_NEUTRON ;) Christian. [0] ftp://oss.sgi.com/www/projects/kdb/download/latest/ -- BOFH excuse #113: Root nameservers are out of sync From ramanara at cse.psu.edu Thu Oct 26 18:35:52 2006 From: ramanara at cse.psu.edu (Rajaraman Ramanarayanan) Date: Thu, 26 Oct 2006 14:35:52 -0400 (EDT) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Thu, 26 Oct 2006, Christian Kujau wrote: > > ROFL, this really is the best setup I've read about on ext3-users ;) > Thanks! ;) That's what my research is about: to test the effect of neutron-induced errors on memories, processors etc.
> So, if this were reproducible, one could activate the in-kernel debug > flags, or more specifically JBD_DEBUG, or even try kdb[0] to see what's going > on. Oh, and if we can see corruption patterns while > the system is exposed to your special environment, I'd love to test the patch > introducing CONFIG_EXT3_NEUTRON ;) > I have seen this only once, so as of now it is not reproducible, and I definitely cannot predict if and when it can occur. Also, I am not familiar with activating debug flags. Is there any document that I can refer to for these, or is it something I have to figure out myself? Thanks! Rajaraman From thomas_chris_666 at yahoo.co.in Fri Oct 27 05:09:50 2006 From: thomas_chris_666 at yahoo.co.in (Thomas chris) Date: Fri, 27 Oct 2006 06:09:50 +0100 (BST) Subject: Test Message-ID: <20061027050951.32461.qmail@web7704.mail.in.yahoo.com> This is a test -- Thomas Chris http://www.youbanking.com http://www.youbanking.com/email_page.html From lists at nerdbynature.de Fri Oct 27 16:24:04 2006 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 27 Oct 2006 17:24:04 +0100 (BST) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Thu, 26 Oct 2006, Rajaraman Ramanarayanan wrote: > I have seen this only once, so as of now it is not reproducible, and I > definitely cannot predict if and when it can occur. Also, I am not familiar > with activating debug flags. Is there any document that I can refer to for > these..
I'm not a filesystem wizard and use debug flags only when things go wrong, and this question should really be answered by the e2fs crew, but for starters: when configuring your kernel (make menuconfig?), enabling "JBD (ext3) debugging support" (under "File systems") should make the ext3 fs-driver more verbose, especially when something goes wrong. Then there are the numerous "kernel debugging" options (under "Kernel hacking")... but I find it hard to propose a specific option here, because we don't know which part of the kernel would generate certain errors when exposed to the radiation. In general, these options make the various code paths more verbose. But I doubt that anything apart from this (being more chatty when something goes wrong) will actually help to debug, let alone code workarounds for, hardware-going-crazy-under-certain-conditions. But then again, the satellites in space have lots of chips inside too and are exposed to radiation as well... hm, dunno how this is done. Christian. -- BOFH excuse #253: We've run out of licenses From magnusm at massive.se Fri Oct 13 12:14:09 2006 From: magnusm at massive.se (Magnus Månsson) Date: Fri, 13 Oct 2006 12:14:09 -0000 Subject: e2defrag - Unable to allocate buffer for inode priorities Message-ID: Hi, first of all, apologies if this isn't the right mailing list, but it was the best I could find. If you know a better mailing list, please tell me. Today I tried to defrag one of my filesystems. It's a 3.5T filesystem that has 6 software RAIDs at the bottom, merged together using LVM. I was running ext3 but removed the journal flag with thor:~# tune2fs -O ^has_journal /dev/vgraid/data After that I fsck'd just to be sure I wouldn't meet any unexpected problems.
So now it was time to defrag; I used this command: thor:~# e2defrag -r /dev/vgraid/data After about 15 seconds (after it ate all my 1.5G of RAM) I got this answer: e2defrag (/dev/vgraid/data): Unable to allocate buffer for inode priorities I am using Debian unstable, and here is the version information from e2defrag: thor:~# e2defrag -V e2defrag 0.73pjm1 RCS version $Id: defrag.c,v 1.4 1997/08/17 14:23:57 linux Exp $ I also tried -p 256, -p 128 and -p 64 to see if it used less memory then; it didn't seem like that to me, it took the same time for the program to abort. Is there any way to get around this problem? The answer might be to get 10G of RAM, but that's not very realistic; 2G sure, but I think that's the limit of my motherboard. A huge number of swapfiles might solve it, and that's probably doable, but it will be enormously slow, I guess? Why do I want to defrag? Well, fsck gives this nice info to me: /dev/vgraid/data: 227652/475987968 files (41.2% non-contiguous), 847539147/951975936 blocks 41% sounds like a lot in my ears, and I have a constant read of files on the drives; it's too slow already. Very thankful for ideas or others' experiences; maybe it's just not possible with such a large partition with today's tools, hey, ext[23] only supports 4T. Let's hope ext4 comes within a year in the mainstream kernels. PS! Please CC me since I am not on the list, so I don't have to wait for marc's archive to get the mails. -- Magnus Månsson Systems administrator Massive Entertainment AB Malmö, Sweden Office: +46-40-6001000 From magnusm at massive.se Fri Oct 13 14:44:04 2006 From: magnusm at massive.se (Magnus Månsson) Date: Fri, 13 Oct 2006 14:44:04 -0000 Subject: FW: e2defrag - Unable to allocate buffer for inode priorities Message-ID: I have made some more research and found out the following ..
thor:~# df -i Filesystem Inodes IUsed IFree IUse% Mounted on -[cut]- /dev/mapper/vgraid-data 475987968 227652 475760316 1% /data thor:~# strace e2defrag -r /dev/vgraid/data -[cut]- mmap2(NULL, 1903955968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x46512000 (delay 15 seconds while allocating memory) mmap2(NULL, 475992064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) -[cut]- The first allocation seems to be 4 bytes per available inode on my filesystem. I wish now that I had created the FS with fewer inodes, and there is another question: what's the gain of having fewer available inodes? If I recreated my filesystem, would it be an idea to make one inode per hundred blocks or something, since that is still way more than I need? Would I gain speed from it? From magnusm at massive.se Fri Oct 13 16:55:13 2006 From: magnusm at massive.se (Magnus Månsson) Date: Fri, 13 Oct 2006 16:55:13 -0000 Subject: FW: e2defrag - Unable to allocate buffer for inode priorities Message-ID: I have now upgraded my server from 1.5G of RAM to 4G of RAM.
It gets a bit longer; it now looks like this with strace: mmap2(NULL, 1903955968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x464a7000 (15 second delay) mmap2(NULL, 475992064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x29eb6000 (this I didn't have memory enough for before) mmap2(NULL, 1903955968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) (here it wants another 2G of RAM; sorry, I don't have 2G modules ..) So if no one has any idea, I am stuck until I can find 4 pieces of 2G DDR400 modules. :( -- Magnus Månsson
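The 4-bytes-per-inode guess matches the strace numbers almost exactly; a quick check, using the inode count from the `df -i` output above:

```shell
# e2defrag appears to allocate one 32-bit priority entry per inode:
inodes=475987968                 # total inodes reported by "df -i"
bytes=$((inodes * 4))
echo "$bytes"                    # prints 1903951872
echo $((1903955968 - bytes))     # prints 4096: the mmap2 size is exactly one 4 KiB page larger
```

So the table scales with the inode *count*, not with inodes in use. A filesystem created with a larger bytes-per-inode ratio (e.g. `mke2fs -j -i 65536`, an illustrative value, against the 8 KiB-ish default) would shrink that table proportionally, at the cost of capping how many files can ever be created.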
From adilger at clusterfs.com Tue Oct 31 17:10:50 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Wed, 1 Nov 2006 01:10:50 +0800 Subject: e2defrag - Unable to allocate buffer for inode priorities In-Reply-To: References: Message-ID: <20061031171050.GG5655@schatzie.adilger.int> On Oct 13, 2006 14:13 +0200, Magnus Månsson wrote: > Today I tried to defrag one of my filesystems. It's a 3.5T large > filesystem that has 6 software-raids in the bottom and then merged > together using lvm. I was running ext3 but removed the journal flag with > Why do I want to defrag? Well, fsck gives this nice info to me: > /dev/vgraid/data: 227652/475987968 files (41.2% non-contiguous), 847539147/951975936 blocks > > 41% sounds like a lot in my ears and I am having a constant read of files > on the drives, it's to slow already. The 41% isn't necessarily bad if the files are very large. For large files it is inevitable that there will be some fragmentation after 125MB or so. A bigger problem is if the filesystem is constantly very nearly full, or if your applications append a lot (e.g. a mailspool). > So now it was time to defrag, I used this command: > thor:~# e2defrag -r /dev/vgraid/data This program is dangerous to use and any attempts to use it should be stopped. It hasn't been updated in such a long time that it doesn't even KNOW that it is dangerous (i.e. it doesn't check the filesystem version number or feature flags). What I would suggest in the meantime is to make as much free space in the filesystem as you can, find files that are very fragmented (via the filefrag program), then copy these files to a new temp file and rename it over the old file. It should help for files that are very fragmented. There is also a discussion about implementing online defragmentation, but that is still a ways away. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
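A sketch of that copy-and-rename approach; the path and the 50-extent threshold are invented for illustration, and nothing else should be writing to the files while it runs:

```shell
# Rewrite heavily fragmented files in place by copying and renaming.
# /data and the threshold of 50 extents are placeholders.
threshold=50
find /data -type f -print | while read -r f; do
    # filefrag prints e.g. "/data/somefile: 87 extents found";
    # field 2 is the extent count
    extents=$(filefrag "$f" | awk '{print $2}')
    [ "$extents" -gt "$threshold" ] 2>/dev/null || continue
    # The fresh copy gets newly allocated (hopefully contiguous) blocks
    cp -p "$f" "$f.defrag.tmp" && mv "$f.defrag.tmp" "$f"
done
```

This only helps if enough contiguous free space exists for the allocator to place the copy well, which is why Andreas suggests freeing space first.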
From tytso at mit.edu Tue Oct 31 19:29:48 2006 From: tytso at mit.edu (Theodore Tso) Date: Tue, 31 Oct 2006 14:29:48 -0500 Subject: e2defrag - Unable to allocate buffer for inode priorities In-Reply-To: <20061031171050.GG5655@schatzie.adilger.int> References: <20061031171050.GG5655@schatzie.adilger.int> Message-ID: <20061031192947.GA12277@thunk.org> Package: defrag Version: 0.73pjm1-8 Severity: grave On Wed, Nov 01, 2006 at 01:10:50AM +0800, Andreas Dilger wrote: > > So now it was time to defrag, I used this command: > > thor:~# e2defrag -r /dev/vgraid/data > > This program is dangerous to use and any attempts to use it should be > stopped. It hasn't been updated in such a long time that it doesn't > even KNOW that it is dangerous (i.e. it doesn't check the filesystem > version number or feature flags). In fact we need to create a Debian bug report indicating that this package should *NOT* be included when the Debian etch distribution releases. Goswin, I am setting the severity to grave (a release-critical severity) because defrag right now is almost guaranteed to corrupt the filesystem if used with modern ext3 filesystems, leading to data loss, and this satisfies the definition of grave. I believe the correct answer is either (a) to make defrag refuse to run if any filesystem features are enabled (at the very least resize_inode, but some of the other newer ext3 filesystem features make me nervous with respect to e2defrag), or (b), since (a) would make e2defrag mostly useless, especially since filesystems with resize inodes are created by default in etch, and as far as I know upstream abandoned defrag a long time ago, that we should simply remove e2defrag from etch and probably from Debian altogether.
If you are interested in doing a huge amount of auditing and testing of e2defrag with modern ext3 (and soon ext4) filesystems, that's great, but I suspect that will not at all be trivial, and even making sure e2defrag won't scramble users' data probably can't be achieved before etch releases. Regards, - Ted From brederlo at informatik.uni-tuebingen.de Tue Oct 31 21:44:03 2006 From: brederlo at informatik.uni-tuebingen.de (Goswin von Brederlow) Date: Tue, 31 Oct 2006 22:44:03 +0100 Subject: e2defrag - Unable to allocate buffer for inode priorities In-Reply-To: <20061031192947.GA12277@thunk.org> (Theodore Tso's message of "Tue, 31 Oct 2006 14:29:48 -0500") References: <20061031171050.GG5655@schatzie.adilger.int> <20061031192947.GA12277@thunk.org> Message-ID: <87iri0ma8s.fsf@informatik.uni-tuebingen.de> Theodore Tso writes: > Package: defrag > Version: 0.73pjm1-8 > Severity: grave > > On Wed, Nov 01, 2006 at 01:10:50AM +0800, Andreas Dilger wrote: >> > So now it was time to defrag, I used this command: >> > thor:~# e2defrag -r /dev/vgraid/data >> >> This program is dangerous to use and any attempts to use it should be >> stopped. It hasn't been updated in such a long time that it doesn't >> even KNOW that it is dangerous (i.e. it doesn't check the filesystem >> version number or feature flags). It should be doing that (checking for ext3, I can confirm) as of defrag (0.73pjm1-8) unstable; urgency=low * ext3-notwork.dpatch: reverse testcase (Closes: #310800) It doesn't handle ext3 right and knows it: # mke2fs -j /dev/ram0 # e2defrag -r /dev/ram0 e2defrag (/dev/ram0): ext3 filesystems not (yet) supported It happily defrags a filesystem with resize_inode, though. Is it destroying the resize capability or directly destroying data? > In fact we need to create a Debian bug report indicating that this > package should *NOT* be included when the Debian etch distribution > releases. Yes, please do so, and preferably with a script to reproduce this without resorting to a big image file.
Something in the form of "mke2fs; mount; unpack kernel source; umount; defrag; mount fails" would be perfect. (Well, not for defrag, but to debug it. :) > Goswin, I am setting the severity to grave (a release-critical You should have used debbugs-CC so I get to see the bug number directly and can reply to the bug. :) > severity) because defrag right now is almost guaranteed to corrupt the > filesystem if used with modern ext3 filesystems leading to data loss, > and this satisfies the definition of grave. I believe the correct > answer is either to (a) make defrag refuse to run if any filesystem > features are enabled (at the very least, resize_inode, but some of the > other newer ext3 filesystem features make me nervous with respect to > e2defrag), or (b) since (a) would make e2defrag mostly useless > especially since filesystems with resize inodes are created by default > in etch, and as far as I know upstream abandoned defrag a long time > ago, that we should simply remove e2defrag from etch and probably from > Debian altogether. > > If you are interested in doing a huge amount of auditing and testing > of e2defrag with modern ext3 (and soon ext4) filesystems, that's > great, but I suspect that will not at all be trivial, and even making > sure e2defrag won't scramble users' data probably can't be achievable > before etch releases. There is '#235498: defrag: ext3 support would be nice :-)' for this issue, but I need some serious help there to add all the new features. Preferably a new active upstream. Maybe some people working on ext4 would be willing to help? But that won't happen before etch, I'm certain of that. I'm also confident that I can patch in checks to keep e2defrag from running on a filesystem with incompatible features (like has_journal from ext3). Checking those is just an extension of the ext3 check. But people that still have ext2, or can disable the extra features (e.g. delete the journal, e2defrag, re-create the journal), can still use e2defrag.
I would prefer keeping it in. > Regards, > > - Ted MfG Goswin
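The requested reproduction recipe (mke2fs, mount, unpack, umount, defrag, mount fails) might be sketched like this; the image size, paths, and loop-device use are all assumptions, it needs root, and the final mount is *expected* to fail if e2defrag corrupts the filesystem:

```shell
#!/bin/sh
# Hypothetical reproduction script for the e2defrag/resize_inode bug.
set -e
dd if=/dev/zero of=/tmp/defragtest.img bs=1M count=64
mke2fs -q -F -j -O resize_inode /tmp/defragtest.img
mkdir -p /mnt/defragtest
mount -o loop /tmp/defragtest.img /mnt/defragtest
cp -r /usr/src/linux /mnt/defragtest/ || true   # any tree of small files will do
umount /mnt/defragtest
tune2fs -O ^has_journal /tmp/defragtest.img     # e2defrag refuses ext3 outright
e2fsck -fy /tmp/defragtest.img
e2defrag -r /tmp/defragtest.img
mount -o loop /tmp/defragtest.img /mnt/defragtest   # does it still mount cleanly?
```

A run of e2fsck after the defrag step would show directly whether the resize inode or real data got scrambled.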