From magawake at gmail.com Mon Sep 1 17:18:31 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 13:18:31 -0400 Subject: dynamic inode allocation Message-ID: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> This may be a newbie question, but how come other file systems such as ReiserFS and Veritas' VxFS dynamically allocate inodes, while for filesystems such as ext2/ext3 and JFS we need to allocate them when creating the filesystem? Is there a performance or maintenance gain when pre-allocating? TIA From tytso at mit.edu Mon Sep 1 18:37:44 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 1 Sep 2008 14:37:44 -0400 Subject: dynamic inode allocation In-Reply-To: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> Message-ID: <20080901183744.GD13069@mit.edu> On Mon, Sep 01, 2008 at 01:18:31PM -0400, Mag Gam wrote: > This maybe a newbie question but how come other file systems such as > ReiserFS and Veritas' Vxfs dynamically allocate inodes and filesystems > such as ext2/ext3 and JFS we need to allocate them when creating the > filesystem? Is there a performance or maintenance gain when pre > allocating? Having a static inode table is definitely much simpler than a dynamic inode table, and that's why ext2 originally used a static inode allocation system. Ext2 drew much of its initial design inspiration from the BSD Fast Filesystem, and it (along with most traditional Unix filesystems) used a static inode table. One of the advantages of having a static inode table is that you can always reliably find it. With a dynamic inode table, it can often be much more difficult to find it in the face of filesystem corruption, caused by either hardware or software failure. For example, with Reiserfs, the inodes are stored in a B-Tree. If the root node, or a relatively high-level node of the B-tree, is lost, the only way to recover all of the inodes is by looking at each block, and trying to determine if it "looks" like part of the filesystem B-tree or not. This is what reiserfs's fsck program will do if the filesystem is sufficiently damaged. Unfortunately, this means that if you store a reiserfs filesystem image (for example, for use by vmware, or qemu, or kvm, or xen) in a reiserfs filesystem, and the filesystem gets damaged, the recovery procedure will take every single block that looks like it could have been part of a Reiserfs B-tree, and stitch them together into a new B-tree. The result, if you have Reiserfs filesystem images, is that those blocks will get treated as if they were part of the containing filesystem, and the outcome is not pretty. These problems can be solved (although they were not for Reiserfs), but it means a lot more complexity. - Ted From magawake at gmail.com Mon Sep 1 20:29:06 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 16:29:06 -0400 Subject: dynamic inode allocation In-Reply-To: <20080901183744.GD13069@mit.edu> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> Message-ID: <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> On Mon, Sep 1, 2008 at 2:37 PM, Theodore Tso wrote: > On Mon, Sep 01, 2008 at 01:18:31PM -0400, Mag Gam wrote: >> This maybe a newbie question but how come other file systems such as >> ReiserFS and Veritas' Vxfs dynamically allocate inodes and filesystems >> such as ext2/ext3 and JFS we need to allocate them when creating the >> filesystem?
Is there a performance or maintenance gain when pre >> allocating? > > Having a static inode table is definitely much simpler than a dynamic > inode table, and that's why ext2 originally used a static inode > allocation system. Ext2 drew much of its initial design inspiration > from the BSD Fast Filesystem, and it (along with most traditional Unix > filesystems) used a static inode table. > > One of the advantages of having a static inode table is you can always > reliably find it. With a dynamic inode table, it can often be much > more difficult to find it in the face of filesystem corruption, caused > by either hardware or software failure. For example, with Reiserfs, > the inodes are stored in a B-Tree. If the root node, or a relatively > high-level node of the B-tree is lost, the only way to recover all of > the inodes is by looking at each block, and trying to determine if it > "looks" like part of the filesystem B-tree or not. This is what the > reiserfs's fsck program will do if the filesystem is sufficiently > damaged. Unfortuntaely, this means that if you store reiserfs > filesystem image (for example, for use by vmware, or qemu, or kvm, or > xen) in a reiserfs filesystem, and the filesystem gets damaged, the > recovery procedure will take every single block that looks like it > could have been part Reiserfs B-tree, and stich them together into a > new-btree. The result, if you have Reiserfs filesystem images is > those blocks will get treated as if they were part of the containing > filesystem, and the result is not pretty. > > These problems can be solved (although they were not for Reiserfs), > but it means a lot more complexity. > > - Ted > Ted, Thanks for the explanation and dumb-ing it down for me :-) So, if a reiserFs filesystem is damaged and it naturally do a fsck. The fsck basically recreated the b-tree by scanning from 1 to end of the filesystem? From tytso at mit.edu Mon Sep 1 20:39:13 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 1 Sep 2008 16:39:13 -0400 Subject: dynamic inode allocation In-Reply-To: <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> Message-ID: <20080901203913.GF13069@mit.edu> On Mon, Sep 01, 2008 at 04:29:06PM -0400, Mag Gam wrote: > > So, if a reiserFs filesystem is damaged and it naturally do a fsck. > The fsck basically recreated the b-tree by scanning from 1 to end of > the filesystem? If the filesystem is sufficiently damaged such that portions of the b-tree can't be found, then yes. Otherwise, the data would be totally lost. As you can imagine, scaning every single block on the disk to see if it looks like filesystem metadata is quite slow, so naturally the reiserfs's fsck will avoid doing it if at all possible. But if the root or top-level nodes of the B-tree is damaged, it doesn't have much choice. 
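To make the contrast concrete: with ext2/ext3's static table, an inode's on-disk location follows from a couple of superblock fields, so fsck never has to go searching for it. A minimal sketch of the arithmetic, with made-up numbers standing in for what "dumpe2fs -h" would actually report:

    # sketch only: where does inode 635113 live in a static inode table?
    ino=635113
    inodes_per_group=16384                    # "Inodes per group" from dumpe2fs -h
    echo $(( (ino - 1) / inodes_per_group ))  # block group whose inode table holds it
    echo $(( (ino - 1) % inodes_per_group ))  # slot within that group's table

Nothing comparable exists for a filesystem whose inodes live in a floating B-tree, which is exactly why the recovery path described above has to fall back to scanning every block.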
- Ted From magawake at gmail.com Mon Sep 1 21:16:01 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 17:16:01 -0400 Subject: dynamic inode allocation In-Reply-To: <20080901203913.GF13069@mit.edu> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> <20080901203913.GF13069@mit.edu> Message-ID: <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> On Mon, Sep 1, 2008 at 4:39 PM, Theodore Tso wrote: > On Mon, Sep 01, 2008 at 04:29:06PM -0400, Mag Gam wrote: >> >> So, if a reiserFs filesystem is damaged and it naturally do a fsck. >> The fsck basically recreated the b-tree by scanning from 1 to end of >> the filesystem? > > If the filesystem is sufficiently damaged such that portions of the > b-tree can't be found, then yes. Otherwise, the data would be totally > lost. As you can imagine, scaning every single block on the disk to > see if it looks like filesystem metadata is quite slow, so naturally > the reiserfs's fsck will avoid doing it if at all possible. But if > the root or top-level nodes of the B-tree is damaged, it doesn't have > much choice. > > - Ted > > But, if thats the last and worst case scenario why don't they do the full scan? Sure its going to take a long time if its a big filesystem (there should be no changes since it would be unmounted), but its better than not having any data at all... From tytso at mit.edu Mon Sep 1 21:23:04 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 1 Sep 2008 17:23:04 -0400 Subject: dynamic inode allocation In-Reply-To: <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> <20080901203913.GF13069@mit.edu> <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> Message-ID: <20080901212304.GI13069@mit.edu> On Mon, Sep 01, 2008 at 05:16:01PM -0400, Mag Gam wrote: > > If the filesystem is sufficiently damaged such that portions of the > > b-tree can't be found, then yes. Otherwise, the data would be totally > > lost. As you can imagine, scaning every single block on the disk to > > see if it looks like filesystem metadata is quite slow, so naturally > > the reiserfs's fsck will avoid doing it if at all possible. But if > > the root or top-level nodes of the B-tree is damaged, it doesn't have > > much choice. > > > > But, if thats the last and worst case scenario why don't they do the > full scan? Sure its going to take a long time if its a big filesystem > (there should be no changes since it would be unmounted), but its > better than not having any data at all... As I said, in the worst case, it will do a full scan. But (a) it takes a long time, and (b) if the filesystem has any files that contain images of reiserfs filesystem, it will be totally scrambled. So it makes sense that the reiserfs fsck would try to avoid this if it can (i.e., if the b-tree is only mildly corrupted). With that said, this is really going out of scope of this mailing list. And I am not an expert on reiserfs's filesystem checker, although I have had people confirm to me that indeed, you can lose really big if your reiserfs filesystem contains files that have are images of other reiserfs filesystems for things like Virtualization. This problem is apparently solved in reiser4, it is NOT solved in reiserfs (i.e., version 3). 
As far as I am concerned, that's ample reason not to use reiserfs, but obviously I'm biased. :-) - Ted From magawake at gmail.com Mon Sep 1 21:47:26 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 17:47:26 -0400 Subject: dynamic inode allocation In-Reply-To: <20080901212304.GI13069@mit.edu> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> <20080901203913.GF13069@mit.edu> <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> <20080901212304.GI13069@mit.edu> Message-ID: <1cbd6f830809011447j7d48467cmb732ce4b5b1082b9@mail.gmail.com> Thanks! This has satisfied my curiosity (for now...) On Mon, Sep 1, 2008 at 5:23 PM, Theodore Tso wrote: > On Mon, Sep 01, 2008 at 05:16:01PM -0400, Mag Gam wrote: >> > If the filesystem is sufficiently damaged such that portions of the >> > b-tree can't be found, then yes. Otherwise, the data would be totally >> > lost. As you can imagine, scaning every single block on the disk to >> > see if it looks like filesystem metadata is quite slow, so naturally >> > the reiserfs's fsck will avoid doing it if at all possible. But if >> > the root or top-level nodes of the B-tree is damaged, it doesn't have >> > much choice. >> > >> >> But, if thats the last and worst case scenario why don't they do the >> full scan? Sure its going to take a long time if its a big filesystem >> (there should be no changes since it would be unmounted), but its >> better than not having any data at all... > > As I said, in the worst case, it will do a full scan. But (a) it > takes a long time, and (b) if the filesystem has any files that > contain images of reiserfs filesystem, it will be totally scrambled. > So it makes sense that the reiserfs fsck would try to avoid this if it > can (i.e., if the b-tree is only mildly corrupted). > > With that said, this is really going out of scope of this mailing > list. And I am not an expert on reiserfs's filesystem checker, > although I have had people confirm to me that indeed, you can lose > really big if your reiserfs filesystem contains files that have are > images of other reiserfs filesystems for things like Virtualization. > This problem is apparently solved in reiser4, it is NOT solved in > reiserfs (i.e., version 3). As far as I am concerned, that's ample > reason not to use reiserfs, but obviously I'm basied. :-) > > - Ted > > > From thorsten.henrici at gfd.de Tue Sep 2 20:03:36 2008 From: thorsten.henrici at gfd.de (thorsten.henrici at gfd.de) Date: Tue, 2 Sep 2008 22:03:36 +0200 Subject: Thorsten Henrici is out of the office. Message-ID: I will not be in the office from 27.08.2008 and will return on 22.09.2008. I will answer your message after my return. In urgent cases please contact Mr. Stöver. I'm out of the office until the 22nd of September. In urgent cases please contact Mr. Karl-Heinz Stöver. -- IMPORTANT NOTICE: This email is confidential, may be legally privileged, and is for the intended recipient only. Access, disclosure, copying, distribution, or reliance on any of it by anyone else is prohibited and may be a criminal offence. Please delete if obtained in error and email confirmation to the sender.
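Since the original question in this thread was about allocating inodes at filesystem creation time, it is worth spelling out how that is done on ext2/ext3: the inode count is fixed by mke2fs and cannot be grown afterwards, so the knobs are the bytes-per-inode ratio and the explicit inode count. The invocations below are only illustrative; the device name and numbers are placeholders:

    # one inode per 16 KB of space (the default ratio comes from /etc/mke2fs.conf)
    mke2fs -j -i 16384 /dev/sdb1
    # or ask for an explicit number of inodes
    mke2fs -j -N 4000000 /dev/sdb1
    # -T selects a canned usage profile (e.g. -T largefile) with its own ratio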
From tytso at mit.edu Wed Sep 3 13:45:36 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Sep 2008 09:45:36 -0400 Subject: spd_readdir.c and readdir_r [real new version] In-Reply-To: <1213587981.8578.189.camel@corn.betterworld.us> References: <1212903039.7158.31.camel@corn.betterworld.us> <1212985588.32113.13.camel@corn.betterworld.us> <1213587981.8578.189.camel@corn.betterworld.us> Message-ID: <20080903134536.GD8360@mit.edu> Hey Ross, Sorry for not responding early; I was travelling a lot over the summer, and I never got around to responding to your e-mail. Many thanks for adding support for readdir_r and readdir64_r! As it turns out, I was doing some updates to spd_readdir.c to support fdopendir (which rm uses). Also, it looks like you based your changes off of an older version of spd_readdir.c that didn't support the dirfd() call. I probably will try to package this up into its own package, since I suspect it would be useful to a larger set of people. In any case here's the merged version I have. Please let me know if this works for you, and if you have any other suggested improvements! - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: spd_readdir.c Type: text/x-csrc Size: 10396 bytes Desc: not available URL: From tytso at mit.edu Wed Sep 3 16:09:52 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Sep 2008 12:09:52 -0400 Subject: Problem in HTREE directory node In-Reply-To: <1219689606.12088.50.camel@corn.betterworld.us> References: <1219689606.12088.50.camel@corn.betterworld.us> Message-ID: <20080903160952.GE8360@mit.edu> On Mon, Aug 25, 2008 at 11:40:06AM -0700, Ross Boylan wrote: > Short version: > > fsck said > "invalid HTREE directory inode 635113 > (mail/r/user/ross/comp/admin-wheat) clear HTREE index?" To which I > replied Yes. > > What exactly does this mean was corrupted? In particular, does it mean > the list of files in the directory .../comp/admin-wheat was damaged? Or > is the trouble in the comp directory? > > Is fsck likely to have fixed up things as good as new, or might > something be lost or corrupted? I don't know what clearing the HTREE > index does. That just means that the interior nodes in the HTREE were corrupt. If you give permission to clear the htree index, e2fsck put the inode on the list of directories that need to have their HTREE indexes rebuilt, and a "Pass 3A" will rebuild the directory's (or directories') HTREE indexes. This is similar to what "e2fsck -fD" does, except it only rebuilds directories whose HTREE indexes were corrupted, instead of rebuilding and optimize all of the directories in the system. So if that was the only message you received, and there were no other reports of damage to the directory, you wouldn't have lost any directory names. It's in all likelihood "good as new". Regards, - Ted From l.allegrucci at gmail.com Mon Sep 8 19:27:32 2008 From: l.allegrucci at gmail.com (Lorenzo Allegrucci) Date: Mon, 8 Sep 2008 21:27:32 +0200 Subject: tune2fs Message-ID: <4dcf7d360809081227y3a536642saca35f4ecad3f2b3@mail.gmail.com> Hi all, I was wondering if it's safe to run tune2fs with the -c or -i option on a rw mounted filesystem. Should I remount read only first? My man page doesn't mention it. 
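For reference, the options being asked about are tune2fs's maximal-mount-count (-c) and check-interval (-i) settings; a typical invocation, with a placeholder device name, looks like this:

    # check after 50 mounts or 6 months, whichever comes first
    tune2fs -c 50 -i 6m /dev/sda1
    # confirm the new values
    tune2fs -l /dev/sda1 | grep -Ei 'mount count|check'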
Thanks -- Lorenzo From tytso at MIT.EDU Mon Sep 8 21:02:25 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Mon, 8 Sep 2008 17:02:25 -0400 Subject: tune2fs In-Reply-To: <4dcf7d360809081227y3a536642saca35f4ecad3f2b3@mail.gmail.com> References: <4dcf7d360809081227y3a536642saca35f4ecad3f2b3@mail.gmail.com> Message-ID: <20080908210225.GM8161@mit.edu> On Mon, Sep 08, 2008 at 09:27:32PM +0200, Lorenzo Allegrucci wrote: > Hi all, I was wondering if it's safe to run tune2fs with the -c or -i option > on a rw mounted filesystem. > Should I remount read only first? My man page doesn't mention it. It is safe to use tune2fs on an rw-mounted filesystem; tune2fs is very careful about how it modifies the superblock in order to make it safe. - Ted From tobi at oetiker.ch Wed Sep 10 11:30:45 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Wed, 10 Sep 2008 13:30:45 +0200 (CEST) Subject: journal on an ssd Message-ID: Experts, What happens if the disk hosting an external journal of a filesystem running with data=journal goes bust? The Backstory ... I have been battling with filesystem performance for some time now. Our setup is a HW Raid(6) with LVM on top and ext3 filesystems. Recently we added an SSD to our setup and have moved all the journals to this ssd. This has dramatically improved performance and especially reduced the interdependence between performance of different partitions hosted on the same RAID. http://insights.oetiker.ch/linux/external-journal-on-ssd.html I really like the performance of this new setup, but I am not all that sure about the data security aspects of it. Especially after reading http://www.cs.wisc.edu/adsl/Publications/sfa-dsn05.pdf which suggests that damaged journals are the worst that can happen to ext3. Any insights on this? cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From holger at wizards.de Wed Sep 10 13:41:45 2008 From: holger at wizards.de (Holger Hoffstaette) Date: Wed, 10 Sep 2008 15:41:45 +0200 Subject: journal on an ssd References: Message-ID: On Wed, 10 Sep 2008 13:30:45 +0200, Tobias Oetiker wrote: > What happens if the disk hosting an external journal of a filesytem > running with data=journal goes bust. Probably the same as if the journal was on the same disk, going bust. :-) Or rather :-( as this can indeed get pretty ugly. With ext3 you can always fall back to mounting as ext2 and at least try to recover as much as possible. > Recently we added an SSD to our setup and have moved all the journals to > this ssd. This has dramatically improved performance and especially > reduced the interdependence between performance of different partitions > hosted on the same RAID. That is one of the great SSD uses, yes. > http://insights.oetiker.ch/linux/external-journal-on-ssd.html Very interesting, thanks! I was planning to do the same but was waiting for the Intel SSDs to come to market or the large OCZs to come down in price, whichever happened first.. > I realy like the performance of this new setup, but I am not all that sure > about the data security aspects of it. Especially after reading > > http://www.cs.wisc.edu/adsl/Publications/sfa-dsn05.pdf > > which suggests that damaged journals are the worst that can happen to > ext3.
True, a borked journal is bad but with the SSD you should actually have *less* chance of corruption (of the type mentioned in the paper), since the wear-leveling should keep the journal blocks alive without the file system/block layer noticing. At least in theory.. :-D You may also find this interesting: http://labs.google.com/papers/disk_failures.html Holger From sandeen at redhat.com Wed Sep 10 15:27:25 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 10 Sep 2008 10:27:25 -0500 Subject: journal on an ssd In-Reply-To: References: Message-ID: <48C7E75D.8040909@redhat.com> Tobias Oetiker wrote: > Experts, > > What happens if the disk hosting an external journal of a filesytem > running with data=journal goes bust. > > The Backstory ... > > I have been batteling with filesystem performance for some time > now. Our setup is a HW Raid(6) with LVM on top and ext3 filesytems. > > Recently we added an SSD to our setup and have moved all the journals > to this ssd. This has dramatically improved performance and > especially reduced the interdependence between performance of > different partitions hosted on the same RAID. > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html How does this compare to putting journals on a separate non-ssd device? -Eric From tobi at oetiker.ch Wed Sep 10 16:05:00 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Wed, 10 Sep 2008 18:05:00 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <48C7E75D.8040909@redhat.com> References: <48C7E75D.8040909@redhat.com> Message-ID: Hi Eric, I have not tested this, but since we are putting about 16 different journals on this one ssd, I would assume that the loss through seeking between the journals would be pretty bad, and again bring back that inter-filesystem-dependency we were trying to loose with this measure. cheers tobi Today Eric Sandeen wrote: > Tobias Oetiker wrote: > > Experts, > > > > What happens if the disk hosting an external journal of a filesytem > > running with data=journal goes bust. > > > > The Backstory ... > > > > I have been batteling with filesystem performance for some time > > now. Our setup is a HW Raid(6) with LVM on top and ext3 filesytems. > > > > Recently we added an SSD to our setup and have moved all the journals > > to this ssd. This has dramatically improved performance and > > especially reduced the interdependence between performance of > > different partitions hosted on the same RAID. > > > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html > > How does this compare to putting journals on a separate non-ssd device? > > -Eric > > -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From holger at wizards.de Wed Sep 10 15:31:53 2008 From: holger at wizards.de (Holger Hoffstaette) Date: Wed, 10 Sep 2008 17:31:53 +0200 Subject: journal on an ssd References: Message-ID: Another followup.. On Wed, 10 Sep 2008 13:30:45 +0200, Tobias Oetiker wrote: > Recently we added an SSD to our setup and have moved all the journals to > this ssd. This has dramatically improved performance and especially > reduced the interdependence between performance of different partitions > hosted on the same RAID. > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html You mention that you chose data=journal, i.e. full journaling. Have you tried ordered mode as well? This should still yield a significant performance win because of reduced head movement and faster metadata writes. 
It may or may not be faster depending on the size of the written data itself..I'm just curious if you tested this. thanks Holger From sandeen at redhat.com Wed Sep 10 16:21:32 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 10 Sep 2008 11:21:32 -0500 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: <48C7F40C.2040006@redhat.com> Tobias Oetiker wrote: > Hi Eric, > > I have not tested this, but since we are putting about 16 different > journals on this one ssd, I would assume that the loss through > seeking between the journals would be pretty bad, and again bring > back that inter-filesystem-dependency we were trying to loose with > this measure. Ah, ok - I missed that you had several journals on one device. Thanks, -Eric From worleys at gmail.com Wed Sep 10 17:23:31 2008 From: worleys at gmail.com (Chris Worley) Date: Wed, 10 Sep 2008 11:23:31 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: Look at: http://www.fusionio.com/Products.aspx At 120K IOPS @1K blocks, it should make for a very good journaling device. It's not an SSD per se; it bypasses old disk controllers altogether (very innovative block device design). The block device layer and hardware are tailored for NAND failure idiosyncrasies... which results in their data loss is less than any available SSD or rotating disk. Put two together in a RAID1 configuration to compensate for device failures (assure you have 2 PCIe x8 slots available). Chris On Wed, Sep 10, 2008 at 10:05 AM, Tobias Oetiker wrote: > Hi Eric, > > I have not tested this, but since we are putting about 16 different > journals on this one ssd, I would assume that the loss through > seeking between the journals would be pretty bad, and again bring > back that inter-filesystem-dependency we were trying to loose with > this measure. > > cheers > tobi > > Today Eric Sandeen wrote: > > > Tobias Oetiker wrote: > > > Experts, > > > > > > What happens if the disk hosting an external journal of a filesytem > > > running with data=journal goes bust. > > > > > > The Backstory ... > > > > > > I have been batteling with filesystem performance for some time > > > now. Our setup is a HW Raid(6) with LVM on top and ext3 filesytems. > > > > > > Recently we added an SSD to our setup and have moved all the journals > > > to this ssd. This has dramatically improved performance and > > > especially reduced the interdependence between performance of > > > different partitions hosted on the same RAID. > > > > > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html > > > > How does this compare to putting journals on a separate non-ssd device? > > > > -Eric > > > > > > -- > Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland > http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tobi at oetiker.ch Wed Sep 10 22:58:28 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 00:58:28 +0200 (CEST) Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: Hi Chris, Yesterday Chris Worley wrote: > Note that I do have one to experiment with. > What's a good way to measure journal performance, and/or in what cases do > you need a faster journal (i.e. 
an EXT3 atop an MD device with big block > stripes)? > > Chris Well, the 'problem' we had to solve was the following: setup: - large HW raid6 array - lvm on top - many ext3 partitions when there was a lot of write or metadata update activity on one partition, performance on all other partitions went to 0. (processes hanging for 10-20 seconds as soon as they accessed the filesystem). I am sure that there is a bad-bad bug in the linux kernel somewhere which is causing this, but all the upgrading and patching did not help, the condition remained. Until we moved the journals off to that external ssd. Now I can copy partition A over to partition B and the server remains nicely responsive. I am attributing that to the external journal. Obviously I would like to know how badly off we are going to be when the ssd dies. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From adilger at sun.com Thu Sep 11 04:10:53 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 10 Sep 2008 22:10:53 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: <20080911041053.GT3086@webber.adilger.int> On Sep 10, 2008 18:05 +0200, Tobias Oetiker wrote: > I have not tested this, but since we are putting about 16 different > journals on this one ssd, I would assume that the loss through > seeking between the journals would be pretty bad, and again bring > back that inter-filesystem-dependency we were trying to loose with > this measure. The cost of putting the journals on 16 separate, relatively small disk devices would probably be comparable to the cost of the SSD and not have a single point of failure. The journal does mostly linear IO, so performance is probably equal or better. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From tobi at oetiker.ch Thu Sep 11 05:43:18 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 07:43:18 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <20080911041053.GT3086@webber.adilger.int> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: Folks, Yesterday Andreas Dilger wrote: > On Sep 10, 2008 18:05 +0200, Tobias Oetiker wrote: > > I have not tested this, but since we are putting about 16 different > > journals on this one ssd, I would assume that the loss through > > seeking between the journals would be pretty bad, and again bring > > back that inter-filesystem-dependency we were trying to loose with > > this measure. > > The cost of putting the journals on 16 separate, relatively small > disk devices would probably be comparable to the cost of the SSD > and not have a single point of failure. The journal does mostly > linear IO, so performance is probably equal or better. You are telling me things that I am aware of. The reason I wrote to this group is to figure out what would happen to an ext3 fs when the external journal was lost, especially what happens when it is lost on a filesystem where 'data=journal' is set. Because if it is catastrophic, then it basically means that the journal has to reside on a device that is as secure as the rest of the data, meaning that if the data is on RAID6 then the journal should be on RAID6 too.
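(For readers who want to reproduce the arrangement under discussion: an external ext3 journal is created and attached roughly as below. The device names are placeholders, the journal device's block size must match the filesystem's, and the filesystem has to be unmounted while the journal is switched over.)

    # create the journal device on the SSD partition
    mke2fs -O journal_dev -b 4096 /dev/sdc1
    # drop the internal journal and attach the external one
    tune2fs -O ^has_journal /dev/vg0/data
    tune2fs -J device=/dev/sdc1 /dev/vg0/data
    # then mount with data=journal as before, e.g. in /etc/fstab:
    #   /dev/vg0/data  /srv/data  ext3  data=journal  0 2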
What I am hoping for, is that someone tells me, that in the case of 'data=journal' the loss would only be the material that is still in the journal (eg 30 seconds worth of data) and the rest of the fs would have a fair chance of being recoverd with fsck. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From chris at harvington.org.uk Thu Sep 11 08:13:21 2008 From: chris at harvington.org.uk (Chris Haynes) Date: Thu, 11 Sep 2008 09:13:21 +0100 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: <106136964.20080911091321@harvington.org.uk> Just a random thought, and anticipating that the experts will say that if an entire journal is lost (not present) the main data is still accessible / recoverable (in its previous state). Is it perhaps the case that, to maximize the integrity of the main data, one would *want* the journal to have a different failure pattern? That, if there were any doubt about journal integrity, it would be better (for the integrity of the main file system) to discard the journal entirely? This would suggest the use of a robust hash / cryptographic digest of the journal contents, stored with it and checked each time the journal is about to be used. These are quite quick to compute nowadays. Any potential in this speculation? Chris Haynes On Thursday, September 11, 2008 at 6:43:18 AM, Tobias Oetiker wrote: > Folks, > Yesterday Andreas Dilger wrote: >> On Sep 10, 2008 18:05 +0200, Tobias Oetiker wrote: >> > I have not tested this, but since we are putting about 16 different >> > journals on this one ssd, I would assume that the loss through >> > seeking between the journals would be pretty bad, and again bring >> > back that inter-filesystem-dependency we were trying to loose with >> > this measure. >> The cost of putting the journals on 16 separate, relatively small >> disk devices would probably be comparable to the cost of the SSD >> and not have a single point of failure. The journal does mostly >> linear IO, so performance is probably equal or better. > You are telling me things that I am aware of. The reason I wrote to > this group is to figure what would happen to an ext3 fs when the > external journal was lost, especially what happens when it is lost > on a filesystem where 'data=journal' is set. > Because if it is catastrophic, then it basically means that the > journal has to reside on a device that is as secure as to rest of > the data, meaning that if the data is on RAID6 then the journal > should be on RAID6 too. > What I am hoping for, is that someone tells me, that in the case of > 'data=journal' the loss would only be the material that is still in > the journal (eg 30 seconds worth of data) and the rest of the fs > would have a fair chance of being recoverd with fsck. > cheers > tobi From rwheeler at redhat.com Thu Sep 11 11:06:07 2008 From: rwheeler at redhat.com (Ric Wheeler) Date: Thu, 11 Sep 2008 07:06:07 -0400 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: <48C8FB9F.4030904@redhat.com> Tobias Oetiker wrote: > Hi Chris, > > Yesterday Chris Worley wrote: > > >> Note that I do have one to experiment with. >> What's a good way to measure journal performance, and/or in what cases do >> you need a faster journal (i.e. an EXT3 atop an MD device with big block >> stripes)? 
>> >> Chris >> > > Well, the 'problem' we had to solve was the following: > > setup: > > - large HW raid6 array > - lvm on top > - many ext3 partitions > > when there was a lot of write or meta data update activity on one > partition, performance on all other partitions went to 0. > (processes hanging for 10-20 seconds as soon as they accessed the > filesystem). I am sure that there is a bad-bad bug in the linux > kernel somewhere which is causing this, but all the upgrading and > patching did not help, the condition remained. > I assume that you have a hardware RAID card, not an external array? If you do have an array (IBM Shark, EMC box, etc) with battery backed internal cache, then you should get better than SSD speeds from one of its LUNs assuming your cache is large enough ;-) ric > Until we moved the journals off to that external ssd. > > Now I can copy partition A over to partition B and the server > remains nicely responsive. I am atributing that to the external > journal. > > Obviously I would like to know how bad we are going to be had when > the ssd dies. > > cheers > tobi > > From tobi at oetiker.ch Thu Sep 11 11:45:33 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 13:45:33 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <48C8FB9F.4030904@redhat.com> References: <48C7E75D.8040909@redhat.com> <48C8FB9F.4030904@redhat.com> Message-ID: Hi Ric, Today Ric Wheeler wrote: [...] > I assume that you have a hardware RAID card, not an external array? If you do > have an array (IBM Shark, EMC box, etc) with battery backed internal cache, > then you should get better than SSD speeds from one of its LUNs assuming your > cache is large enough ;-) It is a hwraid card with battery backed cache (areca). I think the problem with the built-in cache is that the raid manages it without knowing about the structure of the filesystem. At the bottom of it, it is a strong argument for zfs :-) still wondering what happens when ext3 loses a journal. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From tytso at MIT.EDU Thu Sep 11 13:07:15 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Thu, 11 Sep 2008 09:07:15 -0400 Subject: journal on an ssd In-Reply-To: <106136964.20080911091321@harvington.org.uk> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <106136964.20080911091321@harvington.org.uk> <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: <20080911130715.GA4759@mit.edu> On Thu, Sep 11, 2008 at 07:43:18AM +0200, Tobias Oetiker wrote: > > What I am hoping for, is that someone tells me, that in the case of > 'data=journal' the loss would only be the material that is still in > the journal (eg 30 seconds worth of data) and the rest of the fs > would have a fair chance of being recoverd with fsck. > The paper you quoted essentially indicated that ext3's JBD layer wasn't checking for error cases sufficiently. It has improved since then, but when I did a quick audit of the code paths, I was still able to find a few places where we aren't checking the error returns when calling sync_dirty_buffer(), for example. In general, though, if there is a failure to write to the SSD, it should get detected fairly quickly, at which point the journal will get aborted, which will suspend writes to the filesystem.
It may not happen as quickly as we might like, and if you get really unlucky and a singleton write fails and it's one where the error return doesn't get checked, you could end up writing garbage to the filesystem on a journal replay. In that worst-case scenario, you might end up losing a full inode table block's worth of inodes, but in general, the loss should be the last few minutes' worth of data. Fsck has a better than normal chance of recovering from a busted journal. That being said, it would be wise to monitor the health of the SSD via S.M.A.R.T., since I would suspect that failures of the SSD should be easily predicted by the firmware. On Thu, Sep 11, 2008 at 09:13:21AM +0100, Chris Haynes wrote: > > Is it perhaps the case that, to maximize the integrity of the main > data, one would *want* the journal to have a different failure > pattern? > > That, if there were any doubt about journal integrity, it would be > better (for the integrity of the main file system) to discard the > journal entirely? > > This would suggest the use of a robust hash / cryptographic digest > of the journal contents, stored with it and checked each time the > journal is about to be used. These are quite quick to compute > nowadays. Indeed, this is what ext4 does; there is a checksum (you don't need a cryptographic digest since contrary to most sysadmins' fears, hard drives are *not* malicious sentient beings :-), in each commit record to detect these problems, and if a problem is found, we abort running the journal right then and there. It is possible this change can mean that you will lose more data, not less. If there is a singleton failure writing a single block, early in the journal, aborting the journal means that we don't replay any of the later journal commits, and it could very well be that the corrupted data block was later rewritten successfully to the journal in a later commit, and in fact, continuing the journal recovery is the right thing to do. On the other hand, if the corrupted data block was a journal descriptor, aborting the journal commit is the best thing you could do. But this could mean that in theory you might end up losing more than just the last 30 seconds, but more like the last couple of minutes' worth of data. (Even data which was fsync'ed, since fsync only guarantees that the data was written to some stable storage; fsync makes no guarantees about what might happen if your stable storage, including the journal, fails to store data correctly.) We've talked about changing the journalling code to write a separate checksum for each block, which would allow us to more intelligently recover from a failed checksum in the journal block. It wouldn't be a trivial thing to add, so we haven't added that to date. And this is a relatively unlikely case, which involves an (undetected) single write failure, followed by a crash at just the wrong time, before the journal has a chance to wrap. Also, ext4 is even better than ext3 in terms of checking error returns (although to be honest when I did a quick audit just now I still did find a few places where we should add some error checks; I'll work on getting fixes submitted for both ext3 and ext4).
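The per-commit checksum described here is exposed, on the ext4 side, as a pair of mount options; the option names below are from memory of the ext4 documentation of that era and should be checked against your kernel, and the device and mount point are placeholders:

    # ext4 (ext4dev on older kernels): verify the checksum in each commit
    # record during journal replay
    mount -t ext4 -o journal_checksum /dev/vg0/data /srv/data
    # journal_async_commit implies journal_checksum and lets the commit block
    # be written without waiting for the preceding journal blocks
    mount -t ext4 -o journal_async_commit /dev/vg0/data /srv/data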
- Ted From tobi at oetiker.ch Thu Sep 11 14:38:15 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 16:38:15 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <20080911130715.GA4759@mit.edu> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <106136964.20080911091321@harvington.org.uk> <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911130715.GA4759@mit.edu> Message-ID: Teo, Today Theodore Tso wrote: [...] > In that worst case scenario, you might end up losing a full inode > table block's worth of inodes, but in general, the loss should be the > last few minutes worth of data. Fsck has a better than normal chance > of recoverying from a busted journal. That being said, it would be > wise to monitor the health of the SSD via S.M.A.R.T., since I would > suspect that failures of the SSD should be easily predicted by the > firmware. you are the man, thanks ... that was the kind of answer I was looking for :-) I have started to smart mon my journal disk ... it has interesting properties in smart, a whole lot of which my version of smartmontools not seems to know about ... do you have any insight in this ? is there a list of relevant smart properties ? I have also set errors=panic as a mount option, or is this unwise in this context ? -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From adilger at sun.com Thu Sep 11 21:07:01 2008 From: adilger at sun.com (Andreas Dilger) Date: Thu, 11 Sep 2008 15:07:01 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: <20080911210701.GD3086@webber.adilger.int> On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > You are telling me things that I am aware of. The reason I wrote to > this group is to figure what would happen to an ext3 fs when the > external journal was lost, especially what happens when it is lost > on a filesystem where 'data=journal' is set. Losing a journal will, in 99% of the cases, mean the loss of only a few seconds of data. In some rare cases it may be that an inconsistency from a partially-updated commit will cause e2fsck to become confused and possibly clean up a small number more files than it would have otherwise. > Because if it is catastrophic, then it basically means that the > journal has to reside on a device that is as secure as to rest of > the data, meaning that if the data is on RAID6 then the journal > should be on RAID6 too. No, because RAID6 is terribly sucky for performance. If you need this kind of reliability triple-mirrored RAID 1 would be better. Much less CPU overhead, and no extra IO. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From tobi at oetiker.ch Thu Sep 11 21:10:36 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 23:10:36 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <20080911210701.GD3086@webber.adilger.int> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> Message-ID: Hi Andreas, Today Andreas Dilger wrote: > On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > > You are telling me things that I am aware of. 
The reason I wrote to > > this group is to figure what would happen to an ext3 fs when the > > external journal was lost, especially what happens when it is lost > > on a filesystem where 'data=journal' is set. > > Losing a journal will, in 99% of the cases, mean the loss of only a > few seconds of data. In some rare cases it may be that an inconsistency > from a partially-updated commit will cause e2fsck to become confused > and possibly clean up a small number more files than it would have > otherwise. glad to hear > > Because if it is catastrophic, then it basically means that the > > journal has to reside on a device that is as secure as to rest of > > the data, meaning that if the data is on RAID6 then the journal > > should be on RAID6 too. > > No, because RAID6 is terribly sucky for performance. If you need this > kind of reliability triple-mirrored RAID 1 would be better. Much less > CPU overhead, and no extra IO. true ... do you happen to know how zfs handles it when the intent log is on an ssd ? cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From adilger at sun.com Thu Sep 11 21:17:41 2008 From: adilger at sun.com (Andreas Dilger) Date: Thu, 11 Sep 2008 15:17:41 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> Message-ID: <20080911211741.GG3086@webber.adilger.int> On Sep 11, 2008 23:10 +0200, Tobias Oetiker wrote: > Today Andreas Dilger wrote: > > No, because RAID6 is terribly sucky for performance. If you need this > > kind of reliability triple-mirrored RAID 1 would be better. Much less > > CPU overhead, and no extra IO. > > do you happen to know how zfs handles it when the intent log is on > an ssd ? My (limited) understanding is that it will also mirror the intent log. I'm not really a ZFS guru, and Lustre's use of the DMU doesn't (yet) include use of the intent log. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From carlo at alinoe.com Thu Sep 11 21:58:22 2008 From: carlo at alinoe.com (Carlo Wood) Date: Thu, 11 Sep 2008 23:58:22 +0200 Subject: pthread? Message-ID: <20080911215822.GA5731@alinoe.com> A user of ext3grep had a configuration problem that I tracked down to the fact that pkg-config --cflags ext2fs returns -pthread Why does it return -pthread ? That seems a bug to me. Please keep this user in the CC. Note on my (debian) system `pkg-config --cflags ext2fs` returns nothing. I don't know why his returns -pthread. Siegward, any ideas? -- Carlo Wood From tytso at MIT.EDU Thu Sep 11 21:57:23 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Thu, 11 Sep 2008 17:57:23 -0400 Subject: journal on an ssd In-Reply-To: References: <20080911041053.GT3086@webber.adilger.int> <106136964.20080911091321@harvington.org.uk> <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911130715.GA4759@mit.edu> Message-ID: <20080911215723.GP5082@mit.edu> On Thu, Sep 11, 2008 at 04:38:15PM +0200, Tobias Oetiker wrote: > > you are the man, thanks ... that was the kind of answer I was > looking for :-) I have started to smart mon my journal disk > ... it has interesting properties in smart, a whole lot of which my > version of smartmontools not seems to know about ... do you have > any insight in this ? is there a list of relevant smart properties ? Sorry, I don't. 
You might try upgrading to a newer version of smartmontools, since as people figure out what some of the properties mean (especially the ones with the high numbers that tend to be hard drive specific, and not standardized) they get added to the smartmontools program. Fortunately, it's not necessary to know what the properties mean in order for smartmontools to know if the hard drive is about to fail. > I have also set errors=panic as a mount option, or is this unwise > in this context ? It's a good thing. I would recommend using some kind of serial console logger though, so that if there are failures, you can see what the system emitted as its last gasp before panicking and rebooting (since if the filesystem containing /var/log is set with errors=panic, you won't find that information in /var/log/messages). In general, for any production machine, I recommend serial console loggers, since if you have attackers who have broken into your machine with a rootkit, and attempt to hide their tracks by editing the logs, presumably they won't have access to whatever machine you have dedicated to capturing and storing the logs from the serial console for all of your servers. - Ted From tytso at MIT.EDU Thu Sep 11 22:39:37 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Thu, 11 Sep 2008 18:39:37 -0400 Subject: pthread? In-Reply-To: <20080911215822.GA5731@alinoe.com> References: <20080911215822.GA5731@alinoe.com> Message-ID: <20080911223937.GQ5082@mit.edu> On Thu, Sep 11, 2008 at 11:58:22PM +0200, Carlo Wood wrote: > A user of ext3grep had a configuration problem > that I tracked down to the fact that > > pkg-config --cflags ext2fs > > returns > > -pthread > > Why does it return -pthread ? What distribution and what version of e2fsprogs is this user using? I'm going to guess that he is using SuSE or some OpenSuSE derivative, and it's because SuSE bludgeoned in a pthreads mutex into the internals of libcom_err. Since libext2fs can call libcom_err, it follows that a program that links with libext2fs needs to also be compiled and linked with -pthread. It's for this reason I've resisted including SuSE's change, because the race they are concerned about is largely theoretical, and it causes problems for people who want to link against libcom_err. What I probably should do is add locking using sem_wait/sem_post, which doesn't require any Posix pthread nonsense. - Ted From keld at dkuug.dk Fri Sep 12 08:17:12 2008 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Fri, 12 Sep 2008 10:17:12 +0200 Subject: journal on an ssd In-Reply-To: <20080911210701.GD3086@webber.adilger.int> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> Message-ID: <20080912081712.GB21798@rap.rap.dk> On Thu, Sep 11, 2008 at 03:07:01PM -0600, Andreas Dilger wrote: > On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > > Because if it is catastrophic, then it basically means that the > > journal has to reside on a device that is as secure as to rest of > > the data, meaning that if the data is on RAID6 then the journal > > should be on RAID6 too. > > No, because RAID6 is terribly sucky for performance. If you need this > kind of reliability triple-mirrored RAID 1 would be better. Much less > CPU overhead, and no extra IO. RAID6 performs nicely for reads, but has quite bad performance for some writes (non-sequential). Raid6 is actually surprisingly fast for sequential reads.
Best regards Keld From adilger at sun.com Fri Sep 12 09:12:33 2008 From: adilger at sun.com (Andreas Dilger) Date: Fri, 12 Sep 2008 03:12:33 -0600 Subject: journal on an ssd In-Reply-To: <20080912081712.GB21798@rap.rap.dk> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> <20080912081712.GB21798@rap.rap.dk> Message-ID: <20080912091233.GX3086@webber.adilger.int> On Sep 12, 2008 10:17 +0200, Keld J?rn Simonsen wrote: > On Thu, Sep 11, 2008 at 03:07:01PM -0600, Andreas Dilger wrote: > > On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > > > Because if it is catastrophic, then it basically means that the > > > journal has to reside on a device that is as secure as to rest of > > > the data, meaning that if the data is on RAID6 then the journal > > > should be on RAID6 too. > > > > No, because RAID6 is terribly sucky for performance. If you need this > > kind of reliability triple-mirrored RAID 1 would be better. Much less > > CPU overhead, and no extra IO. > > RAID6 performs nicely for reads, but has quite bad performance for some > writes (non-sequential). Raid6 is actually surprisingly fast for > sequential reads. The journal is NEVER read during normal operation, only once during journal recovery after a crash. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From worleys at gmail.com Tue Sep 16 20:10:05 2008 From: worleys at gmail.com (Chris Worley) Date: Tue, 16 Sep 2008 14:10:05 -0600 Subject: When is a block free? Message-ID: Where in the ext2/3 code does it know that a block on the disk is now free to reuse? Thanks, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From rwheeler at redhat.com Tue Sep 16 20:17:12 2008 From: rwheeler at redhat.com (Ric Wheeler) Date: Tue, 16 Sep 2008 16:17:12 -0400 Subject: When is a block free? In-Reply-To: References: Message-ID: <48D01448.4050107@redhat.com> Chris Worley wrote: > Where in the ext2/3 code does it know that a block on the disk is now > free to reuse? > > Thanks, > > Chris Hi Chris, File systems track which blocks are free from the file system creation time (mkfs), creation of new files and deletion. Ext2/3 is the gatekeeper for all deletions, so it knows when file system blocks transition from the used state to the free state. Ext file system use bitmaps to track the blocks that are allocated or not. Regards, Ric From articpenguin3800 at gmail.com Fri Sep 19 02:19:46 2008 From: articpenguin3800 at gmail.com (John Nelson) Date: Thu, 18 Sep 2008 22:19:46 -0400 Subject: directorys Message-ID: Does ext3 journal directory changes? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tytso at mit.edu Fri Sep 19 15:01:48 2008 From: tytso at mit.edu (Theodore Tso) Date: Fri, 19 Sep 2008 11:01:48 -0400 Subject: directorys In-Reply-To: References: Message-ID: <20080919150148.GB13113@mit.edu> On Thu, Sep 18, 2008 at 10:19:46PM -0400, John Nelson wrote: > Does ext3 journal directory changes? Yes, it does; it has to, if you want the filesystem to be recoverable across an unclean shutdown. - Ted From rmichael-ext3 at edgeofthenet.org Mon Sep 22 00:44:57 2008 From: rmichael-ext3 at edgeofthenet.org (Richard Michael) Date: Sun, 21 Sep 2008 20:44:57 -0400 Subject: Rsync --link-dest and ext3: can I increase the number of inodes? 
Message-ID: <20080922004457.GC17339@nexus.edgeofthenet.org> Hello list, (I run rsync --link-dest backups onto ext3 and am anticipating running out of inodes.) Is there a tool I can use to increase the number of inodes on an ext3 filesystem? Also, are there any other implications I should be aware of when using rsync in this way on ext3? Specifically, what became of this discussion related to e2fsck and memory use? https://www.redhat.com/archives/ext3-users/2007-April/msg00017.html Thanks, Richard From tytso at mit.edu Mon Sep 22 02:27:24 2008 From: tytso at mit.edu (Theodore Tso) Date: Sun, 21 Sep 2008 22:27:24 -0400 Subject: Rsync --link-dest and ext3: can I increase the number of inodes? In-Reply-To: <20080922004457.GC17339@nexus.edgeofthenet.org> References: <20080922004457.GC17339@nexus.edgeofthenet.org> Message-ID: <20080922022724.GA9914@mit.edu> On Sun, Sep 21, 2008 at 08:44:57PM -0400, Richard Michael wrote: > (I run rsync --link-dest backups onto ext3 and am anticipating running > out of inodes.) > > Is there a tool I can use to increase the number of inodes on an ext3 > filesystem? Not without backing up your data to tape/DVD/whatever, reformatting the filesystem, and restoring from backups, sorry. > Also, are there any other implications I should be aware of when using > rsync in this way on ext3? Specifically, what became of this discussion > related to e2fsck and memory use? > > https://www.redhat.com/archives/ext3-users/2007-April/msg00017.html This is still a problem, and it's pretty fundamental to how e2fsck works. Calculating the number of hard links so we can make sure that i_links_count is correct requires a large amount of memory; there's no getting around that. E2fsck has a short-cut optimization that works for the common case where i_links_count=1, but that's not true if you are using backup strategies such as rsync --link-dest. The solution described above is present in mainline e2fsprogs, as an emergency method of allowing e2fsck to fix broken filesystems, but if you have to resort to it, it's *S*L*O*W*. I haven't gotten enough feedback to know whether it would be faster to use a 64-bit system and then enable swap; obviously the best way would be to use a 64-bit system and then have gobs and gobs of memory installed on your system. If you have a 32-bit system, and e2fsck needs more than 3-GB of user address space, you can try using a statically linked e2fsck to try to use the 3GB of address space most efficiently, but in the long run you will probably have to use the workaround described in the above link, and resign yourself to a very long fsck process. Alternatively, you could try using a backup program which uses a real database to keep track of reused files, instead of trying to use directory inodes and hard links as a bad substitute for the same. - Ted From cs at zip.com.au Mon Sep 22 04:12:57 2008 From: cs at zip.com.au (Cameron Simpson) Date: Mon, 22 Sep 2008 14:12:57 +1000 Subject: Rsync --link-dest and ext3: can I increase the number of inodes? In-Reply-To: <20080922022724.GA9914@mit.edu> Message-ID: <20080922041257.GA4867@cskk.homeip.net> On 21Sep2008 22:27, Theodore Tso wrote: | On Sun, Sep 21, 2008 at 08:44:57PM -0400, Richard Michael wrote: | > (I run rsync --link-dest backups onto ext3 and am anticipating running | > out of inodes.) [...] Hmm. While I take the point that each link tree consumes inodes for the directories, in a tree that changes little the use of new inodes for new/changed files should be quite slow. 
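One way to put numbers on that worry is to compare the filesystem's remaining inode headroom against what a single snapshot costs; every directory is re-created for each --link-dest snapshot, while unchanged files are only hard-linked. A rough check, with hypothetical paths:

    # how much inode headroom is left on the backup filesystem?
    df -i /backups
    # roughly what one more snapshot will cost in inodes
    find /backups/2008-09-21 -type d | wc -l      # directories: always new inodes
    find /backups/2008-09-21 \! -type d | wc -l   # files: new inodes only if changed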
[...snip e2fsck memory requirements...]
| Alternatively, you could try using a backup program which uses a real
| database to keep track of reused files, instead of trying to use
| directory inodes and hard links as a bad substitute for the same.

But a database is... more complicated and then requires special db-aware
tools for a real recovery. The hard link thing is very simple and very
direct. It has its drawbacks (chmod/chown history being the main one
that comes to my mind) but for many scenarios it works quite well.

For Richard's benefit, I can report that I've used the hard link backup
tree approach extensively on ext3 filesystems made with default mke2fs
options (i.e. no special inode count size) and have never run out of
inodes. Have you actually done some figuring to decide that running out
of inodes is probable?

Cheers,
--
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

Peeve: Going to our favorite breakfast place, only to find that they were
hit by a car...AND WE MISSED IT. - Don Baldwin,

From jelledejong at powercraft.nl Mon Sep 22 08:01:37 2008
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Mon, 22 Sep 2008 10:01:37 +0200
Subject: badblocks output format question
Message-ID: <48D750E1.5090905@powercraft.nl>

Hello List,

I was testing a hard disk with badblocks, but I can't find what the exact
output means. I have attached my logfile. Is the drive bad or ok :-p.

Reading and comparing: 1556108 done, 339455:18:25 elapsed
3656908 done, 339455:20:00 elapsed
10566092done, 339455:25:11 elapsed

Package: e2fsprogs
Architecture: i386
Version: 1.41.1-3

Best regards,

Jelle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: badblocks.log
Type: text/x-log
Size: 4932 bytes
Desc: not available
URL:

From tytso at mit.edu Mon Sep 22 13:51:56 2008
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 22 Sep 2008 09:51:56 -0400
Subject: Rsync --link-dest and ext3: can I increase the number of inodes?
In-Reply-To: <20080922041257.GA4867@cskk.homeip.net>
References: <20080922022724.GA9914@mit.edu> <20080922041257.GA4867@cskk.homeip.net>
Message-ID: <20080922135156.GD9914@mit.edu>

On Mon, Sep 22, 2008 at 02:12:57PM +1000, Cameron Simpson wrote:
> On 21Sep2008 22:27, Theodore Tso wrote:
> | On Sun, Sep 21, 2008 at 08:44:57PM -0400, Richard Michael wrote:
> | > (I run rsync --link-dest backups onto ext3 and am anticipating running
> | > out of inodes.) [...]
>
> Hmm. While I take the point that each link tree consumes inodes for the
> directories, in a tree that changes little the use of new inodes for
> new/changed files should be quite slow.

There are two problems. The first is that the number of inodes you can
consume with directories will increase with each incremental backup.
If you don't eventually delete some of your older backups, then you
will eventually run out of inodes. There's no getting around that.

The second problem is that each inode which has multiple hard links
takes up a small amount of memory per inode. If you are backing up a
very large number of files, this number may consume more address space
than you have on a 32-bit system. I have a workaround that uses tdb,
but it is quite slow. (I have another idea that might be faster, but
I'll have to try it to see how well or poorly it works.)

> But a database is... more complicated and then requires special db-aware
> tools for a real recovery. The hard link thing is very simple and very
> direct.
It has its drawbacks (chmod/chown history being the main one
> that comes to my mind) but for many scenarios it works quite well.

Sure, but the solution may not scale so well for folks who are backing
up 50+ machines and backing up all of /usr, including all of the
distribution maintained files, or for folks who never delete any of
their past incremental backups.

> For Richard's benefit, I can report that I've used the hard link backup
> tree approach extensively on ext3 filesystems made with default mke2fs
> options (i.e. no special inode count size) and have never run out of
> inodes. Have you actually done some figuring to decide that running out
> of inodes is probable?

Sure, but how many machines are you backing up this way, and how many
days of backups are you keeping? And have you ever tried running
"e2fsck -nftt /dev/hdXX" (you can do this on a live system if you
want; the -n means you won't write anything to disk, and the goal is
to see how much memory e2fsck needs) to make sure you can fix the
filesystem if you need it?

- Ted

From jelledejong at powercraft.nl Mon Sep 22 20:55:36 2008
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Mon, 22 Sep 2008 22:55:36 +0200
Subject: badblocks output format question
In-Reply-To: <48D750E1.5090905@powercraft.nl>
References: <48D750E1.5090905@powercraft.nl>
Message-ID: <48D80648.40703@powercraft.nl>

Jelle de Jong wrote:
> Hello List,
>
> I was testing a hard disk with badblocks, but I can't find what the exact
> output means. I have attached my logfile. Is the drive bad or ok :-p.
>
> Reading and comparing: 1556108 done, 339455:18:25 elapsed
> 3656908 done, 339455:20:00 elapsed
> 10566092done, 339455:25:11 elapsed
>
> Package: e2fsprogs
> Architecture: i386
> Version: 1.41.1-3
>
> Best regards,
>
> Jelle
>
It seems the log I sent is of a broken device; I ran badblocks on another
disk and there were no sectors in the output. However, the time indicator
is a bit awkward (339455:20:00 elapsed), which seems to me like an integer
overflow or initialization bug.

Best regards,

Jelle

From ulf at openlane.com Mon Sep 22 23:10:34 2008
From: ulf at openlane.com (Ulf Zimmermann)
Date: Mon, 22 Sep 2008 16:10:34 -0700
Subject: ext3 zerofree option and RedHat back port?
Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com>

Can anyone tell me if the zerofree option for ext3 has been back ported
to RedHat EL4 or EL5?

Regards, Ulf.

---------------------------------------------------------------------
OPENLANE Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
---------------------------------------------------------------------

From cs at zip.com.au Tue Sep 23 00:00:54 2008
From: cs at zip.com.au (Cameron Simpson)
Date: Tue, 23 Sep 2008 10:00:54 +1000
Subject: Rsync --link-dest and ext3: can I increase the number of inodes?
In-Reply-To: <20080922135156.GD9914@mit.edu>
Message-ID: <20080923000054.GA8244@cskk.homeip.net>

On 22Sep2008 09:51, Theodore Tso wrote:
[...snip a lot of remarks I entirely agree with...]
| > But a database is... more complicated [...]
|
| Sure, but the solution may not scale so well for folks who are backing
| up 50+ machines and backing up all of /usr, including all of the
| distribution maintained files, or for folks who never delete any of
| their past incremental backups.

Sure. There's plenty of stuff I wouldn't back up this way.
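For concreteness, the pattern under discussion is the usual --link-dest
one; the names below are invented, but something like:

  rsync -a --link-dest=/backup/daily.1 /data/ /backup/daily.0/

where files unchanged since daily.1 show up in daily.0 as hard links to
the existing copies, so only changed files and the directory skeleton
cost new inodes.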
| > For Richard's benefit, I can report that I've used the hard link backup
| > tree approach extensively on ext3 filesystems made with default mke2fs
| > options (i.e. no special inode count size) and have never run out of
| > inodes. Have you actually done some figuring to decide that running out
| > of inodes is probable?
|
| Sure, but how many machines are you backing up this way, and how many
| days of backups are you keeping?

My own current use case is pretty small, and they're not machines but
data trees (eg static web site trees, configuration files etc - they
have well defined and simple permissions and usually low change rates
so I don't need "machine image" quality, just data integrity). Some
10s of GB and 4 months of dailies; I do prune old trees, but for
overall disc space reasons, not lack of inodes. Only half of this is
on ext3; the other is on xfs which I think has dynamic inode
allocation.

Probably we need to know more about Richard's plans.

| And have you ever tried running
| "e2fsck -nftt /dev/hdXX" (you can do this on a live system if you
| want; the -n means you won't write anything to disk, and the goal is
| to see how much memory e2fsck needs) to make sure you can fix the
| filesystem if you need it?

I'll queue this up as something to try, though the backups themselves
are replicated to elsewhere anyway.

Cheers,
--
Cameron Simpson DoD#743

From tytso at mit.edu Tue Sep 23 05:07:30 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 23 Sep 2008 01:07:30 -0400
Subject: badblocks output format question
In-Reply-To: <48D750E1.5090905@powercraft.nl>
References: <48D750E1.5090905@powercraft.nl>
Message-ID: <20080923050730.GA8920@mit.edu>

On Mon, Sep 22, 2008 at 10:01:37AM +0200, Jelle de Jong wrote:
> Hello List,
>
> I was testing a hard disk with badblocks, but I can't find what the exact
> output means. I have attached my logfile. Is the drive bad or ok :-p.

Your drive is fine. This was a bug in the badblocks program which was
introduced in e2fsprogs 1.41.1. This caused the percentage and elapsed
time to be incorrectly displayed when the badblocks options -w and -s
were given.

Thanks for mentioning it. I'll fix it for the next release.

- Ted

From sandeen at redhat.com Tue Sep 23 15:13:20 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 23 Sep 2008 10:13:20 -0500
Subject: ext3 zerofree option and RedHat back port?
In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com>
References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com>
Message-ID: <48D90790.2020202@redhat.com>

Ulf Zimmermann wrote:
> Can anyone tell me if the zerofree option for ext3 has been back ported
> to RedHat EL4 or EL5?

there appears to be no backporting to do; it's a single .c file that
makes simple use (I assume...) of libext2...

But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 box.

If you wanted to, you could be the maintainer for Fedora, and put it
into EPEL, which would make it available for RHEL :)

-Eric

From tytso at mit.edu Tue Sep 23 16:49:26 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 23 Sep 2008 12:49:26 -0400
Subject: ext3 zerofree option and RedHat back port?
In-Reply-To: <48D90790.2020202@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> Message-ID: <20080923164926.GC12889@mit.edu> On Tue, Sep 23, 2008 at 10:13:20AM -0500, Eric Sandeen wrote: > Ulf Zimmermann wrote: > > Can anyone tell me if the zerofree option for ext3 has been back ported > > to RedHat EL4 or EL5? > > there appears to be no backporting to do; it's a single .c file that > makes simple use (I assume...) of libext2... > > But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 box. > > If you wanted to, you could be the maintainer for Fedora, and put it > into EPEL, which would make it available for RHEL :) Or it would be roughly a 5 line change to e2image (3 for option parsing, 1 for the usage line, and 1 to the if statement in write_raw_image_file() :-) to add an option to extend the "raw dump" functionality to also dump the data blocks of files, at which point it would create a sparse file containing only the used blocks in the filesystem for you, automatically. - Ted From sandeen at redhat.com Tue Sep 23 17:01:47 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 23 Sep 2008 12:01:47 -0500 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <20080923164926.GC12889@mit.edu> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> Message-ID: <48D920FB.6030206@redhat.com> Theodore Tso wrote: > On Tue, Sep 23, 2008 at 10:13:20AM -0500, Eric Sandeen wrote: >> Ulf Zimmermann wrote: >>> Can anyone tell me if the zerofree option for ext3 has been back ported >>> to RedHat EL4 or EL5? >> there appears to be no backporting to do; it's a single .c file that >> makes simple use (I assume...) of libext2... >> >> But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 box. >> >> If you wanted to, you could be the maintainer for Fedora, and put it >> into EPEL, which would make it available for RHEL :) > > Or it would be roughly a 5 line change to e2image (3 for option > parsing, 1 for the usage line, and 1 to the if statement in > write_raw_image_file() :-) to add an option to extend the "raw dump" > functionality to also dump the data blocks of files, at which point it > would create a sparse file containing only the used blocks in the > filesystem for you, automatically. > > - Ted hey that sounds even better than a random collection of single-purpose utilities! ;) (But I suppose the original util had the other useful purpose of scrubbing free blocks even if you don't intend to compress the fs image...) -Eric From ulf at openlane.com Wed Sep 24 03:22:09 2008 From: ulf at openlane.com (Ulf Zimmermann) Date: Tue, 23 Sep 2008 20:22:09 -0700 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <48D920FB.6030206@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> > -----Original Message----- > From: Eric Sandeen [mailto:sandeen at redhat.com] > Sent: 09/23/2008 10:02 > To: Theodore Tso > Cc: Ulf Zimmermann; ext3-users at redhat.com > Subject: Re: ext3 zerofree option and RedHat back port? 
> > Theodore Tso wrote: > > On Tue, Sep 23, 2008 at 10:13:20AM -0500, Eric Sandeen wrote: > >> Ulf Zimmermann wrote: > >>> Can anyone tell me if the zerofree option for ext3 has been back > ported > >>> to RedHat EL4 or EL5? > >> there appears to be no backporting to do; it's a single .c file that > >> makes simple use (I assume...) of libext2... > >> > >> But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 > box. > >> > >> If you wanted to, you could be the maintainer for Fedora, and put it > >> into EPEL, which would make it available for RHEL :) > > > > Or it would be roughly a 5 line change to e2image (3 for option > > parsing, 1 for the usage line, and 1 to the if statement in > > write_raw_image_file() :-) to add an option to extend the "raw dump" > > functionality to also dump the data blocks of files, at which point > it > > would create a sparse file containing only the used blocks in the > > filesystem for you, automatically. > > > > - Ted > > hey that sounds even better than a random collection of single-purpose > utilities! ;) > > (But I suppose the original util had the other useful purpose of > scrubbing free blocks even if you don't intend to compress the fs > image...) > > -Eric Reason I asked is this. We use currently 3Par S400 and E200 as SAN arrays. The new T400 and T800 has a built in chip to do more intelligent thin provisioning but I believe even the S400 and E200 we have will free on the SAN level a block of a thin provisioned volume if it gets zero'ed out. Haven't gotten around yet to test it, but I am planning on. We are currently using 3 different file system types, one is a propriety from Onstor for their Bobcats (NFS/CIFS heads) where I believe I have observed just freeing of SAN level blocks. The two other are EXT3 and OCFS2. Ulf Zimmermann From sandeen at redhat.com Wed Sep 24 03:30:19 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 23 Sep 2008 22:30:19 -0500 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> Message-ID: <48D9B44B.9000707@redhat.com> Ulf Zimmermann wrote: > Reason I asked is this. We use currently 3Par S400 and E200 as SAN > arrays. The new T400 and T800 has a built in chip to do more intelligent > thin provisioning but I believe even the S400 and E200 we have will free > on the SAN level a block of a thin provisioned volume if it gets zero'ed > out. Haven't gotten around yet to test it, but I am planning on. We are > currently using 3 different file system types, one is a propriety from > Onstor for their Bobcats (NFS/CIFS heads) where I believe I have > observed just freeing of SAN level blocks. The two other are EXT3 and > OCFS2. Ok, so you really want to zero the unused blocks in-place, and e2image writing out a new sparsified image isn't a ton of help. The tool does that, I guess - but only on an unmounted or RO-mounted filesystem, right? (plus I'd triple-check that it's doing things correctly, opening a block device and splatting zeros around, one hopes that it is!) But in any case the util itself is simple enough that building (or even packaging) for fedora/EPEL should be trivial. 
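From the description I'd expect the workflow to look roughly like this
(mount point and device names are made up, and I haven't run this build
myself):

  mount -o remount,ro /thinvol       # or umount it entirely
  zerofree /dev/mapper/vg0-thinvol   # walk the block bitmap, zero unallocated blocks
  mount -o remount,rw /thinvol

i.e. the filesystem's own free-space bitmap drives the writes, rather
than userspace guessing which blocks are unused.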
(FWIW, there is work upstream for filesystems to actually communicate freed blocks to the underlying storage, just for this purpose...) -Eric From ulf at openlane.com Wed Sep 24 04:17:26 2008 From: ulf at openlane.com (Ulf Zimmermann) Date: Tue, 23 Sep 2008 21:17:26 -0700 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <48D9B44B.9000707@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A66F@msmpk01.corp.autc.com> > -----Original Message----- > From: Eric Sandeen [mailto:sandeen at redhat.com] > Sent: 09/23/2008 20:30 > To: Ulf Zimmermann > Cc: Theodore Tso; ext3-users at redhat.com > Subject: Re: ext3 zerofree option and RedHat back port? > > Ulf Zimmermann wrote: > > > Reason I asked is this. We use currently 3Par S400 and E200 as SAN > > arrays. The new T400 and T800 has a built in chip to do more > intelligent > > thin provisioning but I believe even the S400 and E200 we have will > free > > on the SAN level a block of a thin provisioned volume if it gets > zero'ed > > out. Haven't gotten around yet to test it, but I am planning on. We > are > > currently using 3 different file system types, one is a propriety > from > > Onstor for their Bobcats (NFS/CIFS heads) where I believe I have > > observed just freeing of SAN level blocks. The two other are EXT3 and > > OCFS2. > > Ok, so you really want to zero the unused blocks in-place, and e2image > writing out a new sparsified image isn't a ton of help. > > The tool does that, I guess - but only on an unmounted or RO-mounted > filesystem, right? (plus I'd triple-check that it's doing things > correctly, opening a block device and splatting zeros around, one hopes > that it is!) > > But in any case the util itself is simple enough that building (or even > packaging) for fedora/EPEL should be trivial. > > (FWIW, there is work upstream for filesystems to actually communicate > freed blocks to the underlying storage, just for this purpose...) > > -Eric I am going to try it out by hand. Create a thin provisioned volume, write random crap to it, then zero the blocks. See if that shrinks the physical allocated space. Ulf. From adilger at sun.com Wed Sep 24 06:35:11 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 24 Sep 2008 00:35:11 -0600 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <48D9B44B.9000707@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> Message-ID: <20080924063511.GX10950@webber.adilger.int> On Sep 23, 2008 22:30 -0500, Eric Sandeen wrote: > Ulf Zimmermann wrote: > Ok, so you really want to zero the unused blocks in-place, and e2image > writing out a new sparsified image isn't a ton of help. > > The tool does that, I guess - but only on an unmounted or RO-mounted > filesystem, right? (plus I'd triple-check that it's doing things > correctly, opening a block device and splatting zeros around, one hopes > that it is!) That is WAY to scary for me on a mounted filesystem. It is racy if the blocks become allocated. 
Instead, what I always do when creating a sparse image for e2fsck test cases is just "dd if=/dev/zero of=/mnt/fs/zeroes bs=64k; rm /mnt/fs/zeroes" until the filesystem is full, then the file is deleted. This will leave blocks "empty" for the free space in the filesystem without any special tools. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From rmy at tigress.co.uk Wed Sep 24 08:12:37 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 09:12:37 +0100 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> Message-ID: <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> "Ulf Zimmermann" wrote: >Can anyone tell me if the zerofree option for ext3 has been back ported >to RedHat EL4 or EL5? I used to maintain backports of zerofree (the kernel patch, not the utility) to EL4 and EL5, but since I wasn't actually using them I gave up. The last RPMs I have are from December of last year. Contact me directly if you want them. I don't recommend the ext3 patch as it hasn't seen much use. I regularly use the ext2 version (on Fedora 9), but be warned that Ted has expressed concerns about it. Ron From rmy at tigress.co.uk Wed Sep 24 08:19:53 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 09:19:53 +0100 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> Message-ID: <200809240819.m8O8JrfC010279@tiffany.internal.tigress.co.uk> "Ulf Zimmermann" wrote: >Reason I asked is this. We use currently 3Par S400 and E200 as SAN >arrays. The new T400 and T800 has a built in chip to do more intelligent >thin provisioning but I believe even the S400 and E200 we have will free >on the SAN level a block of a thin provisioned volume if it gets zero'ed >out. Haven't gotten around yet to test it, but I am planning on. We are >currently using 3 different file system types, one is a propriety from >Onstor for their Bobcats (NFS/CIFS heads) where I believe I have >observed just freeing of SAN level blocks. The two other are EXT3 and >OCFS2. Interesting. A similar case I've seen recently is s3backer, a FUSE filesystem that keeps its blocks as objects in Amazon S3: http://code.google.com/p/s3backer/ Blocks of zeroes aren't actually stored, so they suggest using zerofree to get rid of non-zero deleted blocks and avoid being charged for them. Ron From rmy at tigress.co.uk Wed Sep 24 08:23:20 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 09:23:20 +0100 Subject: ext3 zerofree option and RedHat back port? 
In-Reply-To: <20080924063511.GX10950@webber.adilger.int> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> <20080924063511.GX10950@webber.adilger.int> Message-ID: <200809240823.m8O8NK55010286@tiffany.internal.tigress.co.uk> Andreas Dilger wrote: >> Ulf Zimmermann wrote: >> Ok, so you really want to zero the unused blocks in-place, and e2image >> writing out a new sparsified image isn't a ton of help. >> >> The tool does that, I guess - but only on an unmounted or RO-mounted >> filesystem, right? (plus I'd triple-check that it's doing things >> correctly, opening a block device and splatting zeros around, one hopes >> that it is!) > >That is WAY to scary for me on a mounted filesystem. It is racy if the >blocks become allocated. The 1.0.0 version of the zerofree utility only worked on unmounted filesystems, but then someone suggested that it should be safe on a read-only mount. Is that not so? Ron From rwheeler at redhat.com Wed Sep 24 11:19:02 2008 From: rwheeler at redhat.com (Ric Wheeler) Date: Wed, 24 Sep 2008 07:19:02 -0400 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A66F@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66F@msmpk01.corp.autc.com> Message-ID: <48DA2226.70509@redhat.com> Ulf Zimmermann wrote: >> -----Original Message----- >> From: Eric Sandeen [mailto:sandeen at redhat.com] >> Sent: 09/23/2008 20:30 >> To: Ulf Zimmermann >> Cc: Theodore Tso; ext3-users at redhat.com >> Subject: Re: ext3 zerofree option and RedHat back port? >> >> Ulf Zimmermann wrote: >> >> >>> Reason I asked is this. We use currently 3Par S400 and E200 as SAN >>> arrays. The new T400 and T800 has a built in chip to do more >>> >> intelligent >> >>> thin provisioning but I believe even the S400 and E200 we have will >>> >> free >> >>> on the SAN level a block of a thin provisioned volume if it gets >>> >> zero'ed >> >>> out. Haven't gotten around yet to test it, but I am planning on. We >>> >> are >> >>> currently using 3 different file system types, one is a propriety >>> >> from >> >>> Onstor for their Bobcats (NFS/CIFS heads) where I believe I have >>> observed just freeing of SAN level blocks. The two other are EXT3 >>> > and > >>> OCFS2. >>> >> Ok, so you really want to zero the unused blocks in-place, and e2image >> writing out a new sparsified image isn't a ton of help. >> >> The tool does that, I guess - but only on an unmounted or RO-mounted >> filesystem, right? (plus I'd triple-check that it's doing things >> correctly, opening a block device and splatting zeros around, one >> > hopes > >> that it is!) >> >> But in any case the util itself is simple enough that building (or >> > even > >> packaging) for fedora/EPEL should be trivial. >> >> (FWIW, there is work upstream for filesystems to actually communicate >> freed blocks to the underlying storage, just for this purpose...) >> >> -Eric >> > > I am going to try it out by hand. Create a thin provisioned volume, > write random crap to it, then zero the blocks. 
See if that shrinks the > physical allocated space. > > Ulf. > > > Note that there is work on getting file systems to use the new TRIM (for S-ATA drives) and its equivalent proposed standard in T10 SCSI for arrays which will give you this automatically. David Woodhouse was pushing patches for TRIM, we are still thinking about the SCSI versions... ric From tytso at mit.edu Wed Sep 24 13:31:47 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 24 Sep 2008 09:31:47 -0400 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> Message-ID: <20080924133147.GD9929@mit.edu> On Wed, Sep 24, 2008 at 09:12:37AM +0100, Ron Yorston wrote: > "Ulf Zimmermann" wrote: > >Can anyone tell me if the zerofree option for ext3 has been back ported > >to RedHat EL4 or EL5? > > I used to maintain backports of zerofree (the kernel patch, not the > utility) to EL4 and EL5, but since I wasn't actually using them I gave > up. The last RPMs I have are from December of last year. Contact me > directly if you want them. > > I don't recommend the ext3 patch as it hasn't seen much use. I regularly > use the ext2 version (on Fedora 9), but be warned that Ted has expressed > concerns about it. I just searched my sent-mail archives for the last 5 years, and I can't find any references to "zerofree" previous to this mail thread. Maybe I commented about them under some other name. Having quickly looked at the ext3 patch here: http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00026.html ...the big thing I will note is that if you crash after a file is deleted, but before the journal transaction is committed, the file may end up being cleared but not deleted. This may or may not be problematic for your appication; in particular, if the file deletion was implied with the intent of doing an atomic replacement of some critical file, i.e. such as a vipw script which does: cp /etc/passwd /etc/passwd.vipw vi /etc/passwd.vipw # atomically update /etc/passwd mv /etc/passwd.vipw /etc/passwd ... and you crash before the transaction is commited but after the "mv" command has run, you could end up with a partially or completely zero'ed /etc/passwd file. Some might call that unfortunate. :-) I will admit that the chances of this happening are somewhat remote, but in terms of potential issues that would have to be fixed before such a patch could be included in mainline, or before (I suspect) Red Hat would feel comfortable taking responsibility for their customers' data after such a patch were committed, that would probably be a real issue. The code for supporting the "trim" command could also be used to implement a proper zero-free command, but it gets tricky, since the blocks in question would have to be remembered until the commit block is written out, and then only zero'ed (or trimmed) right after the commit has happened, but before the pinned block bitmaps are released (which would allow the block allocator to allocate to the blocks that had just been released). - Ted From rmy at tigress.co.uk Wed Sep 24 14:35:32 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 15:35:32 +0100 Subject: ext3 zerofree option and RedHat back port? 
In-Reply-To: <20080924133147.GD9929@mit.edu> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> <20080924133147.GD9929@mit.edu> Message-ID: <200809241435.m8OEZY7Y010555@tiffany.internal.tigress.co.uk> Theodore Tso wrote: >I just searched my sent-mail archives for the last 5 years, and I >can't find any references to "zerofree" previous to this mail thread. >Maybe I commented about them under some other name. > >Having quickly looked at the ext3 patch here: > > http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00026.html Your response is in the same thread: http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00031.html Unless that was some other Theodore Tso. >..the big thing I will note is that if you crash after a file is >deleted, but before the journal transaction is committed, the file may >end up being cleared but not deleted. Indeed, that was the concern last time. The ext3 patch hasn't changed significantly since then because, truth be told, I don't entirely understand journalling and was unable to fix it up. The ext2 patch now writes out the zeroed blocks immediately, which may or may not help. The latest versions of the patches are available on my website: http://intgat.tigress.co.uk/rmy/uml/sparsify.html Ron From sandeen at redhat.com Wed Sep 24 15:04:56 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 24 Sep 2008 10:04:56 -0500 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> Message-ID: <48DA5718.60403@redhat.com> Ron Yorston wrote: > "Ulf Zimmermann" wrote: >> Can anyone tell me if the zerofree option for ext3 has been back ported >> to RedHat EL4 or EL5? > > I used to maintain backports of zerofree (the kernel patch, not the > utility) to EL4 and EL5, but since I wasn't actually using them I gave > up. The last RPMs I have are from December of last year. Contact me > directly if you want them. > > I don't recommend the ext3 patch as it hasn't seen much use. I regularly > use the ext2 version (on Fedora 9), but be warned that Ted has expressed > concerns about it. oh, whoops - I guess my google-fu is weak, I searched for zerofree and assumed we were talking about the userspace utility I found ... /me runs off to look at that patch... -Eric From tytso at mit.edu Wed Sep 24 15:19:20 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 24 Sep 2008 11:19:20 -0400 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <200809241435.m8OEZY7Y010555@tiffany.internal.tigress.co.uk> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> <20080924133147.GD9929@mit.edu> <200809241435.m8OEZY7Y010555@tiffany.internal.tigress.co.uk> Message-ID: <20080924151919.GG9929@mit.edu> On Wed, Sep 24, 2008 at 03:35:32PM +0100, Ron Yorston wrote: > Your response is in the same thread: > > http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00031.html > > Unless that was some other Theodore Tso. Hmm, I must have sent that from a non-primary computer, so it wasn't in my sent-mail archive. My bad. :-) - Ted From ulf at openlane.com Wed Sep 24 16:23:56 2008 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 24 Sep 2008 09:23:56 -0700 Subject: ext3 zerofree option and RedHat back port? 
In-Reply-To: <48DA5718.60403@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> <48DA5718.60403@redhat.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A671@msmpk01.corp.autc.com> > -----Original Message----- > From: Eric Sandeen [mailto:sandeen at redhat.com] > Sent: 09/24/2008 08:05 > To: Ron Yorston > Cc: Ulf Zimmermann; ext3-users at redhat.com > Subject: Re: ext3 zerofree option and RedHat back port? > > Ron Yorston wrote: > > "Ulf Zimmermann" wrote: > >> Can anyone tell me if the zerofree option for ext3 has been back > ported > >> to RedHat EL4 or EL5? > > > > I used to maintain backports of zerofree (the kernel patch, not the > > utility) to EL4 and EL5, but since I wasn't actually using them I > gave > > up. The last RPMs I have are from December of last year. Contact me > > directly if you want them. > > > > I don't recommend the ext3 patch as it hasn't seen much use. I > regularly > > use the ext2 version (on Fedora 9), but be warned that Ted has > expressed > > concerns about it. > > oh, whoops - I guess my google-fu is weak, I searched for zerofree and > assumed we were talking about the userspace utility I found ... > > /me runs off to look at that patch... > > -Eric Sorry, I meant the mount option for zero'ing blocks which are getting freed. Ulf. From lakshmipathi.g at gmail.com Mon Sep 29 07:43:44 2008 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Mon, 29 Sep 2008 13:13:44 +0530 Subject: giis file undelete tool-new features Message-ID: Hi I have released giis4.4.It includes following features *Deleted files are recovered and restored into their original directories, if the path exists. *Dropped database tables are recovered. *Several Bug fixes. :) Limitation: If directory size greater than block_size,it giis won't work. -This will be fixed in next release. Homepage: www.giis.co.in Cheers, Lakshmipathi.G From worleys at gmail.com Mon Sep 29 15:24:33 2008 From: worleys at gmail.com (Chris Worley) Date: Mon, 29 Sep 2008 09:24:33 -0600 Subject: When is a block free? In-Reply-To: References: <48D01448.4050107@redhat.com> Message-ID: On Tue, Sep 16, 2008 at 3:32 PM, Chris Worley wrote: > For example, in balloc.c I'm seeing ext3_free_blocks_sb > calls ext3_clear_bit_atomic at the bottom... is that when the block is > freed? Are all blocks freed here? David Woodhouse, in an article at http://lwn.net/Articles/293658/, is implementing the T10/T13 committees "Trim" request in 2.6.28 kernels. Would it be appropriate to call "blkdev_issue_discard" at the bottom of ext3_free_blocks_sb where ext3_clear_bit_atomic is being called? Chris > > On Tue, Sep 16, 2008 at 3:03 PM, Chris Worley wrote: >> >> On Tue, Sep 16, 2008 at 2:17 PM, Ric Wheeler wrote: >>> >>> Chris Worley wrote: >>>> >>>> Where in the ext2/3 code does it know that a block on the disk is now >>>> free to reuse? >>>> >>>> Thanks, >>>> >>>> Chris >>> >>> Hi Chris, >>> >>> File systems track which blocks are free from the file system creation >>> time (mkfs), creation of new files and deletion. Ext2/3 is the gatekeeper >>> for all deletions, so it knows when file system blocks transition from the >>> used state to the free state. Ext file system use bitmaps to track the >>> blocks that are allocated or not. >> >> Where (in the code... what routine... or what's the name of the bitmap) is >> the "free" bit set? I've been looking through the code and don't see >> exactly where the block is marked as free. 
>> Thanks, >> Chris >>> >>> Regards, >>> >>> Ric >>> >> > > From tytso at mit.edu Mon Sep 29 16:39:17 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 29 Sep 2008 12:39:17 -0400 Subject: When is a block free? In-Reply-To: References: <48D01448.4050107@redhat.com> Message-ID: <20080929163917.GB10831@mit.edu> On Mon, Sep 29, 2008 at 09:24:33AM -0600, Chris Worley wrote: > On Tue, Sep 16, 2008 at 3:32 PM, Chris Worley wrote: > > For example, in balloc.c I'm seeing ext3_free_blocks_sb > > calls ext3_clear_bit_atomic at the bottom... is that when the block is > > freed? Are all blocks freed here? > > David Woodhouse, in an article at http://lwn.net/Articles/293658/, is > implementing the T10/T13 committees "Trim" request in 2.6.28 kernels. > > Would it be appropriate to call "blkdev_issue_discard" at the bottom > of ext3_free_blocks_sb where ext3_clear_bit_atomic is being called? Unfortunately, it's not as simple as that. The problem is that as soon as you call trim, the drive is allowed to discard the contents of that block so that future attempts to read from that block returns all zeros. Therefore we can't call Trim until after the transaction has committed. That means we have to keep a linked list of block extents that are to be trimmed attached to the commit object, and only send the trim requests once the commit block has been written to disk. It's on the ext4 developer's TODO list to add Trim support to ext3 and ext4. - Ted From whats at wekk.net Wed Sep 24 21:10:17 2008 From: whats at wekk.net (Albert =?ISO-8859-1?Q?Sellar=E8s?=) Date: Wed, 24 Sep 2008 23:10:17 +0200 Subject: init_special_inode: bogus i_mode Message-ID: <1222290617.6307.33.camel@x61s> Hi everyone, I have a server running Redhat 5 that have attached a SAN of 5TB. The SAN filesystem is formated with ext3. One month ago, the kernel was started to send this error messages: init_special_inode: bogus i_mode (56333) init_special_inode: bogus i_mode (111367) init_special_inode: bogus i_mode (114022) init_special_inode: bogus i_mode (34016) init_special_inode: bogus i_mode (7170) init_special_inode: bogus i_mode (117576) init_special_inode: bogus i_mode (74600) init_special_inode: bogus i_mode (111237) init_special_inode: bogus i_mode (151624) init_special_inode: bogus i_mode (132565) init_special_inode: bogus i_mode (175003) init_special_inode: bogus i_mode (54343) init_special_inode: bogus i_mode (161626) init_special_inode: bogus i_mode (114644) init_special_inode: bogus i_mode (53215) init_special_inode: bogus i_mode (54563) init_special_inode: bogus i_mode (110115) init_special_inode: bogus i_mode (160572) init_special_inode: bogus i_mode (35607) init_special_inode: bogus i_mode (156516) init_special_inode: bogus i_mode (50005) init_special_inode: bogus i_mode (5362) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) Every day, kernel prints out one or two lines like these. I haven't found nothing in the list archives, and searching in google I found that it could be that the filesystem is corrupted. On my last step to know what has happening in the filesystem, I have look the kernel source code. Now I really think that it is an error, but I'm not sure what it means. Can anybody tell me what exactly means this message? Thanks you very much. 
--
Albert Sellarès GPG id: 0x13053FFE
http://www.wekk.net
whats_up at jabber.org
Linux User: 324456
Catalunya
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL:

From sandeen at redhat.com Tue Sep 30 14:53:58 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 30 Sep 2008 09:53:58 -0500
Subject: init_special_inode: bogus i_mode
In-Reply-To: <1222290617.6307.33.camel@x61s>
References: <1222290617.6307.33.camel@x61s>
Message-ID: <48E23D86.9070300@redhat.com>

Albert Sellarès wrote:
> Hi everyone,
>
> I have a server running Redhat 5 that have attached a SAN of 5TB. The
> SAN filesystem is formated with ext3.

I suppose you mean RHEL5?

> One month ago, the kernel was started to send this error messages:
>
> init_special_inode: bogus i_mode (56333)
> init_special_inode: bogus i_mode (136237)
>
> Every day, kernel prints out one or two lines like these.
>
> I haven't found nothing in the list archives, and searching in google I
> found that it could be that the filesystem is corrupted.
>
> On my last step to know what has happening in the filesystem, I have
> look the kernel source code. Now I really think that it is an error, but
> I'm not sure what it means.
>
> Can anybody tell me what exactly means this message?

For an inode which is not recognized as a regular file, directory, or
link when it is read, init_special_inode is called. At that point, if
it's not a char, block, fifo, or socket, you get this error. Basically
it doesn't know what this thing is.

It'd be nice if it printed the inode number as well, to make it easier
to find.

I'd probably suggest fsck at this point, run it with -n first if you
want to see what it *would* do, just to be safe.

-Eric
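Concretely, something like this; the device name is only an example, so
substitute whichever SAN LUN holds the filesystem:

  e2fsck -fn /dev/sdb1   # forced check, read-only: reports problems, changes nothing
  umount /san            # if the report looks sane, take it offline...
  e2fsck -fy /dev/sdb1   # ...and repair for real

The -n pass can be run while the filesystem is mounted (expect a few
spurious complaints, since it is changing underneath the check), but the
actual repair should only be done unmounted.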