From evil at g-house.de Sun Sep 3 13:20:03 2006 From: evil at g-house.de (Christian Kujau) Date: Sun, 3 Sep 2006 14:20:03 +0100 (BST) Subject: [OT] Re: Partitioning for ext3fs In-Reply-To: <52103.216.160.118.81.1157059001.squirrel@216.160.118.81> References: <4301.216.160.118.81.1156937128.squirrel@216.160.118.81> <52103.216.160.118.81.1157059001.squirrel@216.160.118.81> Message-ID: [please reply on-list, so that other ppl can help too] On Thu, 31 Aug 2006, david cooke wrote: > By hang, I mean the boot process will not go any further if I > turn on the USB during boot. Whatever boot happens to be Hm, too bad :( But I'd suggest to discuss this issue on some FC forum, usb-list or even linux-kernel. >> I don't know if I understand you correctly: you've upgraded to FC5 and >> the external (USB? SATA?) drive still "does not work"? > Typing fdisk /dev/sdc gives us > Unable to open /dev/sdc so, your OS (linux, FC) does not seem to be aware of your usb-disk (sdc) or the driver crashed. try to check dmesg/messages for related information and pass it on to one of the above mentioned lists. > There is a light on in the USB so I think it's on...Yes, it is on. That's good ;) > Typing fdisk /dev/sda results in > The number of cylinders for this disk is set to 19457. [...] OK, so sda (sata disk?) is doing well. this is good ;) > Now, I keep thinking this is my primary so don't to mess around here. > Is this the USB? typing "dmesg | grep sd" after booting should reveal which disk got initialized with which "sdX" name. > A different subject and we can cross that bridge later, but > See above where the cylinders are set to nineteen thousand plus and not > (I guess the usual) 1024? Can it be fixed? This is what the help-messages above is in place: "The number of cylinders for this disk is set to 19457. 
There is nothing wrong with that, but this is larger than 1024, and could in certain setups cause problems with: 1) software that runs at boot time (e.g., old versions of LILO) 2) booting and partitioning software from other OSs (e.g., DOS FDISK, OS/2 FDISK)" So, why would you change anything? do you have DOS, OS/2 fdisk? do you have old version of lilo? (FC uses GRUB anyway, IIRC). greetings, Christian. -- BOFH excuse #115: your keyboard's space bar is generating spurious keycodes. From evilninja at gmx.net Sun Sep 3 18:25:43 2006 From: evilninja at gmx.net (Christian) Date: Sun, 3 Sep 2006 19:25:43 +0100 (BST) Subject: Stress testing for ext3? In-Reply-To: References: Message-ID: On Thu, 31 Aug 2006, Kieft, Brian wrote: > Does anyone know of a good method for exercising > an ext3 file system? I'm not aware of such a "torture" tool, but any long run of your real-world-application of choice, some benchmarks or heavy operation on a big source tree or so should do no harm to any in-kernel rw-filesystem. > Perhaps something that involves power removal in between commits > or in the midst of a write, start any of the things mentioned above and pull the plug ;) maybe "reboot -f" could simulate this: -f Force halt or reboot, don't call shutdown(8). but I've never tried that and don't know if it will KILL running processes before rebooting. > and then checks for corrupt data. Do any utilities exist for this? fsck.ext[23] will do that for the fs structure. you could use diff(1) against a known-to-be-good filesystem to verify that all data is in place. Christian. -- BOFH excuse #325: Your processor does not develop enough heat. From evilninja at gmx.net Sun Sep 3 18:29:44 2006 From: evilninja at gmx.net (Christian) Date: Sun, 3 Sep 2006 19:29:44 +0100 (BST) Subject: Ext3 emergency recovery In-Reply-To: References: Message-ID: On Tue, 29 Aug 2006, Adam Atlas wrote: > I have a damaged Ext3 filesystem which fsck has not been able to recover. 
maybe the information *how* the fs went corrupt could help. posting a fsck log is also nice...

> Up to group 95. Some say "SEVERE DATA LOSS POSSIBLE."

are you using the latest e2fsprogs? latest kernel? i386 or something more exotic?

> filesystem and tried answering yes to all of them; it ended up just erasing
> the whole thing.

is there nothing in lost+found?
--
BOFH excuse #325:

Your processor does not develop enough heat.

From mr._x at shaw.ca Sun Sep 3 20:39:15 2006
From: mr._x at shaw.ca (..:::BeOS Mr. X:::..)
Date: Sun, 03 Sep 2006 13:39:15 -0700
Subject: Stress testing for ext3?
In-Reply-To:
References:
Message-ID: <44FB3D73.30408@shaw.ca>

I know of a method to continuously execute a command; maybe doing a full listing of the drive's contents will heat the drives up, but I am not sure about the error checking part. Here is what I would do:

while [ 1 -eq 1 ]; do ls -shw9 -R; done

Hope this helps!
Mr. X

Kieft, Brian wrote:
> Does anyone know of a good method for exercising an ext3 file system?
> Perhaps something that involves power removal in between commits or in
> the midst of a write, and then checks for corrupt data. Do any utilities
> exist for this?
>
>
>
> Thanks!
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

From neilb at cse.unsw.edu.au Mon Sep 4 10:17:55 2006
From: neilb at cse.unsw.edu.au (Neil Brown)
Date: Mon, 4 Sep 2006 20:17:55 +1000
Subject: debian unstable & ext3
In-Reply-To: message from Larry McVoy on Thursday August 31
References: <20060831175939.GF27660@bitmover.com>
Message-ID: <17659.64851.51385.204395@cse.unsw.edu.au>

(posting again from my subscribed address as it is a members' only list - grumble)

On Thursday August 31, lm at bitmover.com wrote:
> I'm running
>
> Linux travis 2.6.15-1-686 #2 Mon Mar 6 15:27:08 UTC 2006 i686 GNU/Linux
>
> on a laptop with ext3 on /
>
> Some time ago things started getting weird in the following way: I do a
> fairly normal hack, ^Z, make, test loop when developing and it seems
> that vim is calling fsync or sync and that is then flushing everything
> to disk. My tests create maybe 10 dozen files in ~30MB and for some
> reason this is taking 4 seconds to flush.
>
> I'm not sure if ext3, the kernel, or vim is the problem. I already
> googled and set
>
> set swapsync=sync
> set nofsync
>
> in my .exrc but that hasn't helped.
>
> Has anyone else seen this and do they have a work around? I'm about to
> switch to reiserfs and that's a lot of fuss for what should be a simple
> problem (I hope).

I've noticed this sort of problem, but it hasn't yet been enough to make me explore very far....

One thing worth a try is to mount with data=writeback.

Though I suspect that "set swapsync=" might be fastest, it might not be what you want.
NeilBrown

From evilninja at gmx.net Tue Sep 5 09:48:31 2006
From: evilninja at gmx.net (Christian)
Date: Tue, 5 Sep 2006 10:48:31 +0100 (BST)
Subject: debian unstable & ext3
In-Reply-To: <20060831175939.GF27660@bitmover.com>
References: <20060831175939.GF27660@bitmover.com>
Message-ID:

[resent to ext3-users at redhat.com]

On Thu, 31 Aug 2006, Larry McVoy wrote:
> Some time ago things started getting weird in the following way: I do a
> fairly normal hack, ^Z, make, test loop when developing and it seems
----------------------^
this would STOP your editor (vi), but do you :w before you do this?

> that vim is calling fsync or sync

you can start vim via strace(1) to find out which one is called.

> and that is then flushing everything to disk. My tests create maybe 10
> dozen files in ~30MB and for some reason this is taking 4 seconds to
> flush.

How full is the fs? Maybe fragmentation is bad, or the 4 sec are even I/O-bound? What mount-options are used? It'd be interesting to reproduce this behaviour on a fresh filesystem.

> I'm about to switch to reiserfs and that's a lot of fuss for what should

Let us know if this solved the problem ;)

Christian.
--
BOFH excuse #277:

Your Flux Capacitor has gone bad.

From evilninja at gmx.net Tue Sep 5 10:50:51 2006
From: evilninja at gmx.net (Christian)
Date: Tue, 5 Sep 2006 11:50:51 +0100 (BST)
Subject: Stress testing for ext3?
In-Reply-To: <44FB3D73.30408@shaw.ca>
References: <44FB3D73.30408@shaw.ca>
Message-ID:

On Sun, 3 Sep 2006, ..:::BeOS Mr. X:::.. wrote:
> I know of a method to continuously execute a command, maybe doing a full
> listing of the drive's contents will heat the drives up, but I am not sure
> about the error checking part. Here is what I would do:
> while [ 1 -eq 1 ]; do ls -shw9 -R; done

The directory listing will be cached; after the first run the disk should not be touched any more (try it out...). Also, when you're not redirecting the output to somewhere else (e.g.
/dev/null), the terminal displaying the output will be the bottleneck and not the fs or the disk...
--
BOFH excuse #34:

(l)user error

From herta.vandeneynde at cc.kuleuven.be Tue Sep 5 13:09:38 2006
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Tue, 05 Sep 2006 15:09:38 +0200
Subject: Stress testing for ext3?
In-Reply-To:
References: <44FB3D73.30408@shaw.ca>
Message-ID: <44FD7712.7040509@cc.kuleuven.be>

Christian wrote:
> On Sun, 3 Sep 2006, ..:::BeOS Mr. X:::.. wrote:
>
>> I know of a method to continuously execute a command, maybe doing a
>> full listing of the drive's contents will heat the drives up, but I am
>> not sure about the error checking part. Here is what I would do:
>> while [ 1 -eq 1 ]; do ls -shw9 -R; done
>
>
> The directory listing will be cached; after the first run the disk should
> not be touched any more (try it out...). Also, when you're not
> redirecting the output to somewhere else (e.g. /dev/null), the terminal
> displaying the output will be the bottleneck and not the fs or the disk...
>

A colleague of mine reported he got ext3 to bail out while repeatedly recompiling the kernel. He enabled all kernel modules, and then ran:

# while true; do make clean; make -j18; done

The filesystem ended up being mounted ro. The fsck at reboot moved some files to lost+found, after which the filesystem could be used again.

Kind regards,

Herta

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm

From tweeks at rackspace.com Tue Sep 5 20:53:52 2006
From: tweeks at rackspace.com (tweeks)
Date: Tue, 5 Sep 2006 15:53:52 -0500
Subject: IO lockups and ext3 readonly filecorruption on RHEL4 (pre and post U4)
Message-ID: <200609051553.52276.tweeks@rackspace.com>

Has anyone been seeing IO lockup problems on EL4?

I've tried multiple IO scheduler options (elevator=) in the boot... I'm seeing the same behavior regardless. Independent of hardware. Whitebox ATA, HA enclosure with dedicated SCSI, megaraid RAID hardware, Dell 2850s...
same behavior: A semi-busy system will suddenly go into some kind of IO la-la land where nothing can be written to disk for >1hour. Of course when this happens, the ext3 kernel module freaks out and remounts all the filesystems as readonly. Then when the system is rebooted, if the system is allowed to fsck, the journal is hosed and the filesystem eats itself. Moving them off the RH kernel all together seems to fix the problem, but I have not found a way to reproduce the problem yet (burning and stress testing doesn't seem to make it appear), so real re-testing is difficult at best. It's become so big of a problem that we're moving some customers that require rock solid systems either over to RHEL3, or off RH and over to SLES or other distro with a non-RH kernel. Just the ext3 problem (minus the IO lockup part) can be seen in other BZ tickets: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175877 (when the filesystem fills up) Has anyone seen these type of IO lockups + ext3 corruption on RHEL4? Can you reproduce it? Tweeks From richard.c.wolber at boeing.com Tue Sep 5 21:19:04 2006 From: richard.c.wolber at boeing.com (Wolber, Richard C) Date: Tue, 5 Sep 2006 14:19:04 -0700 Subject: IO lockups and ext3 readonly filecorruption on RHEL4 (pre and postU4) In-Reply-To: <200609051553.52276.tweeks@rackspace.com> Message-ID: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324161@XCH-NW-5V2.nw.nos.boeing.com> We're using the same systems with the same OS (well okay, actually CentOS 4) and aren't seeing the same thing. 2.6.14.3 #1 SMP PREEMPT Thu Dec 8 10:34:08 PST 2005 i686 i686 i386 GNU/Linux ..Chuck.. > -----Original Message----- > From: tweeks [mailto:tweeks at rackspace.com] > Sent: Tuesday, September 05, 2006 1:54 PM > To: ext3-users at redhat.com > Subject: IO lockups and ext3 readonly filecorruption on RHEL4 > (pre and postU4) > > Has anyone been seeing IO lockup problems on EL4? > > I've tried multiple IO scheduler options (elevator=) in the > boot... 
I'm seeing the same behavior regardless. Independent > of hardware. Whitebox ATA, HA enclosure with dedicated SCSI, > megaraid RAID hardware, Dell 2850s... same > behavior: > > A semi-busy system will suddenly go into some kind of IO > la-la land where nothing can be written to disk for >1hour. > Of course when this happens, the > ext3 kernel module freaks out and remounts all the > filesystems as readonly. > Then when the system is rebooted, if the system is allowed to > fsck, the journal is hosed and the filesystem eats itself. > Moving them off the RH kernel all together seems to fix the > problem, but I have not found a way to reproduce the problem > yet (burning and stress testing doesn't seem to make it > appear), so real re-testing is difficult at best. > > It's become so big of a problem that we're moving some > customers that require rock solid systems either over to > RHEL3, or off RH and over to SLES or other distro with a > non-RH kernel. > > Just the ext3 problem (minus the IO lockup part) can be seen > in other BZ > tickets: > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175877 > (when the filesystem fills up) > > Has anyone seen these type of IO lockups + ext3 corruption on RHEL4? > Can you reproduce it? > > Tweeks > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > From evilninja at gmx.net Wed Sep 6 00:34:34 2006 From: evilninja at gmx.net (Christian) Date: Wed, 6 Sep 2006 01:34:34 +0100 (BST) Subject: IO lockups and ext3 readonly filecorruption on RHEL4 (pre a U4) In-Reply-To: <200609051553.52276.tweeks@rackspace.com> References: <200609051553.52276.tweeks@rackspace.com> Message-ID: On Tue, 5 Sep 2006, tweeks wrote: > Has anyone been seeing IO lockup problems on EL4? not using RHEL here, but... > A semi-busy system will suddenly go into some kind of IO la-la land where > nothing can be written to disk for >1hour. 
ok, so ext3 will remount the fs to RO. this would happen if a panic() occurs? is there anything related in the logs? (if /var is RO too, try to set up a loghost).

> Then when the system is rebooted, if the system is allowed to fsck, the
> journal is hosed and the filesystem eats itself.

could you be more specific? what does fsck.ext3 say? is there something in lost+found? remember to use the latest version of e2fsprogs. have you tried a vanilla kernel yet?
--
BOFH excuse #289:

Interference between the keyboard and the chair.

From tweeks at rackspace.com Tue Sep 5 23:29:03 2006
From: tweeks at rackspace.com (tweeks)
Date: Tue, 5 Sep 2006 18:29:03 -0500
Subject: IO lockups and ext3 readonly filecorruption on RHEL4 (pre and postU4)
In-Reply-To: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324161@XCH-NW-5V2.nw.nos.boeing.com>
References: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324161@XCH-NW-5V2.nw.nos.boeing.com>
Message-ID: <200609051829.04031.tweeks@rackspace.com>

On Tuesday 05 September 2006 04:19 pm, Wolber, Richard C wrote:
> We're using the same systems with the same OS (well okay, actually
> CentOS 4) and aren't seeing the same thing.
>
> 2.6.14.3 #1 SMP PREEMPT Thu Dec 8 10:34:08 PST 2005 i686 i686 i386
> GNU/Linux

On how many servers, though? We have several thousand.

Tweeks

From tytso at mit.edu Wed Sep 6 05:45:43 2006
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 6 Sep 2006 01:45:43 -0400
Subject: debian unstable & ext3
In-Reply-To: <17659.64851.51385.204395@cse.unsw.edu.au>
References: <20060831175939.GF27660@bitmover.com> <17659.64851.51385.204395@cse.unsw.edu.au>
Message-ID: <20060906054543.GA20892@thunk.org>

[Sorry, in Germany and so my e-mail latency is slow...
]

On Mon, Sep 04, 2006 at 08:17:55PM +1000, Neil Brown wrote:
> On Thursday August 31, lm at bitmover.com wrote:
> > Some time ago things started getting weird in the following way: I do a
> > fairly normal hack, ^Z, make, test loop when developing and it seems
> > that vim is calling fsync or sync and that is then flushing everything
> > to disk. My tests create maybe 10 dozen files in ~30MB and for some
> > reason this is taking 4 seconds to flush.
> >
>
> One thing worth a try is to mount with data=writeback.

Or data=ordered. What does "cat /proc/mounts" say?

The fsync() operation results in a journal commit operation, and if you're using "data=ordered" or "data=journal", the data blocks will be flushed to either their final location on disk or to the journal before the journal is allowed to commit.

- Ted

From tweeks at rackspace.com Wed Sep 6 14:23:25 2006
From: tweeks at rackspace.com (tweeks)
Date: Wed, 6 Sep 2006 09:23:25 -0500
Subject: IO lockups and ext3 readonly filecorruption on RHEL4 (pre a U4)
In-Reply-To:
References: <200609051553.52276.tweeks@rackspace.com>
Message-ID: <200609060923.25597.tweeks@rackspace.com>

On Tuesday 05 September 2006 07:34 pm, Christian wrote:
> ok, so ext3 will remount the fs to RO. this would happen if a panic()
> occurs?

These boxes are not panicking. IO (or O actually) seems to come to a complete stop, the system can't sync.. the journal becomes out of sync.. ext3 freaks and re-mounts RO, and eventually the system becomes mostly unresponsive (as no new processes can be properly started). Graceful rebooting becomes a problem, and eventual reboots find the unsync'd disc very hard to fsck successfully.

> is there anything related in the logs?

No.. they're read only.

> (if /var is RO too, try
> to setup a loghost).

We may try that as we already have a shared NetDump server set up. Can I do syslog to BOTH the local machine AND a network syslog server? If the local logs are locked, will my writing to a remote host still work?
> coud you be more specific? what does fsck.ext3 say? It shows thousands of de-linked files being found. But I have not witnessed this first hand, as I am not in front of the console on these machines. But I'll ask. > is there something > in lost+found? I'm assuming yes. > remember to use latest version of e2fsprogs. have you > tried a vanilla kernel yet? Well, yes. But since it is thus far not able to be reliably reproduced, it's hard to tell what works and what doesn't. If anyone who understands the nature of this problem has any suggestions for reliably triggering it, then please speak up. Tim: You mentioned some type of forced buffer flush patch last month... any ETA on this? Tweeks -- Thomas Weeks, Lead Sys. Engineer The Managed Hosting Specialist(TM) Rackspace Managed Hosting http://www.rackspace.com/ Managed Service Innovation Team Email: "We Fanatically Support Fanatical Support!" (w)210.447.4451 (f)210.447.4041 From mind at bi.lt Thu Sep 7 07:25:40 2006 From: mind at bi.lt (Mindaugas) Date: Thu, 7 Sep 2006 10:25:40 +0300 Subject: wiping of unused space on ext3 Message-ID: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> Hello, I was asked if it is possible to zero unused space in ext3 partition? Users write to the server via Samba and are far from computer geeks so teaching them to use some safedelete utility is quite impossible. Is there some way or utility to wipe out all the data from unused space? Thanks, Mindaugas From bryan at kadzban.is-a-geek.net Thu Sep 7 10:58:25 2006 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Thu, 07 Sep 2006 06:58:25 -0400 Subject: wiping of unused space on ext3 In-Reply-To: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> Message-ID: <44FFFB51.7050809@kadzban.is-a-geek.net> Mindaugas wrote: > Hello, > > I was asked if it is possible to zero unused space in ext3 partition? 
Easiest way I can think of is:

cat /dev/zero >/fsmountpoint/temp-file

Then, after you get the inevitable error that the disk is full:

rm /fsmountpoint/temp-file

Of course this should probably be done while nobody else is trying to create or enlarge a file, otherwise they could get errors too...

From richard.c.wolber at boeing.com Thu Sep 7 14:53:07 2006
From: richard.c.wolber at boeing.com (Wolber, Richard C)
Date: Thu, 7 Sep 2006 07:53:07 -0700
Subject: wiping of unused space on ext3
In-Reply-To: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt>
Message-ID: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324173@XCH-NW-5V2.nw.nos.boeing.com>

> From: Mindaugas [mailto:mind at bi.lt]
> Sent: Thursday, September 07, 2006 12:26 AM
> To: ext3-users at redhat.com
> Cc: CentOS mailing list
> Subject: wiping of unused space on ext3
>
>
> Hello,
>
> I was asked if it is possible to zero unused space in ext3
> partition?
>
> Users write to the server via Samba and are far from
> computer geeks so teaching them to use some safedelete
> utility is quite impossible.
>
> Is there some way or utility to wipe out all the data from
> unused space?

I believe you can use the "chattr -s" command to mark all of the files so that when they are deleted, their blocks are wiped with zeros.

I believe that you'd need to set up some sort of cron job to make sure all of the files have this attribute set on a regular basis, unless this works as a directory level attribute.

..Chuck..
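A minimal sketch of the fill-and-delete approach Bryan Kadzban describes earlier in this thread, written as a POSIX shell function. The function name zero_free_space, the temp-file name and the optional size cap are assumptions made for illustration and do not come from the thread; the cap is there only so the sketch can be tried out without actually filling a disk.

```shell
#!/bin/sh
# zero_free_space DIR [MAX_MB]   (hypothetical helper, not from the thread)
# Write a file of zeros into DIR until the filesystem is full (or until
# MAX_MB mebibytes have been written), flush it to disk, then delete it.
zero_free_space() {
    dir=$1
    max_mb=${2:-}                 # optional cap in MiB; empty means "until full"
    tmp="$dir/zero-fill.$$"       # assumed temp-file name

    if [ -n "$max_mb" ]; then
        dd if=/dev/zero of="$tmp" bs=1048576 count="$max_mb" 2>/dev/null || true
    else
        # dd is expected to fail here with "No space left on device".
        dd if=/dev/zero of="$tmp" bs=1048576 2>/dev/null || true
    fi

    sync                          # make sure the zeros reach the disk first
    rm -f "$tmp"
}
```

Calling zero_free_space /fsmountpoint with no second argument corresponds to the cat/rm pair above. As far as I can tell this only overwrites data blocks that are free at the moment it runs and that the calling user is allowed to allocate; blocks reserved for root, the journal and inode metadata are presumably left untouched.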
From adilger at clusterfs.com Thu Sep 7 21:15:06 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Thu, 7 Sep 2006 15:15:06 -0600 Subject: wiping of unused space on ext3 In-Reply-To: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324173@XCH-NW-5V2.nw.nos.boeing.com> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <8C7C41A176AC0B468BEFB2EFD9BDAB9901324173@XCH-NW-5V2.nw.nos.boeing.com> Message-ID: <20060907211506.GR6441@schatzie.adilger.int> On Sep 07, 2006 07:53 -0700, Wolber, Richard C wrote: > I believe you can use the "chattr -s" command to mark all of the files > so that when they are deleted, their blocks are wiped with zeros. In theory yes, but this has never been implemented. > I believe that you'd need to set up some sort of cron job to make sure > all of the files have this attribute set on a regular basis, unless > this works as a directory level attribute. It should be inherited from the parent. It is not currently functional and is unlikely to ever make it into the kernel. Instead, write a shared library that hooks "unlink" and have it wipe your files from userspace. I think there is already a "libtrashcan" or similar that will allow undelete in the same manner. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From richard.c.wolber at boeing.com Thu Sep 7 21:21:33 2006 From: richard.c.wolber at boeing.com (Wolber, Richard C) Date: Thu, 7 Sep 2006 14:21:33 -0700 Subject: wiping of unused space on ext3 In-Reply-To: <20060907211506.GR6441@schatzie.adilger.int> Message-ID: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324180@XCH-NW-5V2.nw.nos.boeing.com> > On Sep 07, 2006 07:53 -0700, Wolber, Richard C wrote: > > I believe you can use the "chattr -s" command to mark all > > of the files so that when they are deleted, their blocks > > are wiped with zeros. > > In theory yes, but this has never been implemented. *BLINK* So let me get this straight. This feature is documented in the man page and works within the chattr command. 
It is also noted when you do a "chattr -v". And yet it still has no effect? I seriously wonder how many people are using this "feature" without realizing that it has absolutely no effect? Is it worth my time to patch the documentation? Or is this the forgotten stepchild of a development dispute that the parties would ignore any sane input on? ..Chuck.. From matts at ksu.edu Thu Sep 7 22:40:26 2006 From: matts at ksu.edu (Matt Stegman) Date: Thu, 7 Sep 2006 17:40:26 -0500 (CDT) Subject: wiping of unused space on ext3 In-Reply-To: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324180@XCH-NW-5V2.nw.nos.boeing.com> Message-ID: Well, the manpage does say: BUGS AND LIMITATIONS The `c', 's', and `u' attributes are not honored by the ext2 and ext3 filesystems as implemented in the current mainline Linux kernels. These attributes may be implemented in future versions ext2 and ext3. If I remember right, it was dropped from the kernel because it was incomplete - inode data isn't wiped, the journal isn't wiped, and if a file is truncated, old data blocks weren't wiped. I think these were decided to be too difficult or too slow to implement, and so the feature was dropped "for the time being." It's been a while since I read the mail thread on this though, and my memory may be faulty. -- Matt Stegman On Thu, 7 Sep 2006, Wolber, Richard C wrote: > > On Sep 07, 2006 07:53 -0700, Wolber, Richard C wrote: > > > I believe you can use the "chattr -s" command to mark all > > > of the files so that when they are deleted, their blocks > > > are wiped with zeros. > > > > In theory yes, but this has never been implemented. > > *BLINK* > > So let me get this straight. This feature is documented > in the man page and works within the chattr command. It is > also noted when you do a "chattr -v". And yet it still has no > effect? I seriously wonder how many people are using this > "feature" without realizing that it has absolutely no > effect? > > Is it worth my time to patch the documentation? 
Or is this the > forgotten stepchild of a development dispute that the parties > would ignore any sane input on? > > ..Chuck.. > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > > From adilger at clusterfs.com Thu Sep 7 22:53:50 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Thu, 7 Sep 2006 16:53:50 -0600 Subject: wiping of unused space on ext3 In-Reply-To: References: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324180@XCH-NW-5V2.nw.nos.boeing.com> Message-ID: <20060907225350.GA6441@schatzie.adilger.int> On Sep 07, 2006 17:40 -0500, Matt Stegman wrote: > Well, the manpage does say: Also, right when 's' is defined, see Note: When a file with the ?s? attribute set is deleted, its blocks are zeroed and written back to the disk. Note: please make sure to read the bugs and limitations section at the end of this document. > BUGS AND LIMITATIONS > The `c', 's', and `u' attributes are not honored by the ext2 and > ext3 filesystems as implemented in the current mainline Linux > kernels. These attributes may be implemented in future versions > ext2 and ext3. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From evilninja at gmx.net Fri Sep 8 06:37:32 2006 From: evilninja at gmx.net (Christian) Date: Fri, 8 Sep 2006 07:37:32 +0100 (BST) Subject: IO lockups and ext3 readonly filecorruption on RHEL4 (pre a U4) In-Reply-To: <200609060923.25597.tweeks@rackspace.com> References: <200609051553.52276.tweeks@rackspace.com> <200609060923.25597.tweeks@rackspace.com> Message-ID: sorry for my late reply.... On Wed, 6 Sep 2006, tweeks wrote: > We may try that as we already have a shared NetDump server set up. > Can i do syslog to BOTH the local machine AND a network syslog server. If the > local logs are locked, will my writing to a remote host still work? yes, if syslogd/syslog-ng is still running, logging to the loghost should do. 
the network and the system has to be working of course. > hard to tell what works and what doesn't. If anyone who understands the > nature of this problem has any suggestions for reliably triggering it, then > please speak up. without more details, it's hard to conclude anything, one can only *guess* :( Christian. -- BOFH excuse #377: Someone hooked the twisted pair wires into the answering machine. From rmy at tigress.co.uk Fri Sep 8 09:01:27 2006 From: rmy at tigress.co.uk (Ron Yorston) Date: Fri, 08 Sep 2006 10:01:27 +0100 Subject: wiping of unused space on ext3 In-Reply-To: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> Message-ID: <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> "Mindaugas" wrote: > I was asked if it is possible to zero unused space in ext3 partition? I have a couple of patches that add a zerofree mount option to ext2 and ext3 filesystems. The ext2 version is much better tested and more complete: it zeros all file data blocks, directory blocks and extended attributes (though not inode data). The ext3 patch only handles file data, not metadata. I've been meaning to submit these to LKML, but since you ask let's give them an airing here first. Since this is being copied to the CentOS mailing list I should point out that I also have versions of the patches that apply cleanly to the RHEL 4 kernel. I don't have them to hand at the moment but if there's any interest I can provide them later. 
Some background information and other tools are on my website:

http://intgat.tigress.co.uk/rmy/uml/index.html

Ron

From rmy at tigress.co.uk Fri Sep 8 09:04:05 2006
From: rmy at tigress.co.uk (Ron Yorston)
Date: Fri, 08 Sep 2006 10:04:05 +0100
Subject: [PATCH] ext2: zero freed blocks
In-Reply-To: <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
Message-ID: <200609080904.k88945DE015112@tiffany.internal.tigress.co.uk>

Add a zerofree mount option to the ext2 filesystem. This causes freed blocks to be filled with zeros.

ext2_free_blocks has an additional argument to specify whether or not zeroing is required: there's no point in zeroing blocks that have just come from the free list.

Some rearrangement of code in xattr.c is required to ensure that ext2_zero_blocks is never called with a locked buffer.

Signed-off-by: Ron Yorston
---

--- linux-2.6.17/Documentation/filesystems/ext2.txt.zerofree2	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/Documentation/filesystems/ext2.txt	2006-08-25 20:08:06.000000000 +0100
@@ -58,6 +58,8 @@ nobh				Do not attach buffer_heads to fi
 
 xip				Use execute in place (no caching) if possible
 
+zerofree			Zero data blocks when they are freed.
+
 grpquota,noquota,quota,usrquota	Quota options are silently ignored by ext2.
 
--- linux-2.6.17/fs/ext2/balloc.c.zerofree2	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/fs/ext2/balloc.c	2006-08-25 20:08:26.000000000 +0100
@@ -174,9 +174,28 @@ static void group_release_blocks(struct
 	}
 }
 
+static void ext2_zero_blocks(struct super_block *sb, unsigned long block,
+						unsigned long count)
+{
+	unsigned long i;
+	struct buffer_head * bh;
+
+	for (i = 0; i < count; i++) {
+		bh = sb_getblk(sb, block+i);
+		if (!bh)
+			continue;
+
+		lock_buffer(bh);
+		memset(bh->b_data, 0, bh->b_size);
+		mark_buffer_dirty(bh);
+		unlock_buffer(bh);
+		brelse(bh);
+	}
+}
+
 /* Free given blocks, update quota and i_blocks field */
 void ext2_free_blocks (struct inode * inode, unsigned long block,
-		       unsigned long count)
+		       unsigned long count, int zero)
 {
 	struct buffer_head *bitmap_bh = NULL;
 	struct buffer_head * bh2;
@@ -201,6 +220,9 @@ void ext2_free_blocks (struct inode * in
 
 	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
 
+	if (test_opt(sb, ZEROFREE) && zero)
+		ext2_zero_blocks(sb, block, count);
+
 do_more:
 	overflow = 0;
 	block_group = (block - le32_to_cpu(es->s_first_data_block)) /
--- linux-2.6.17/fs/ext2/super.c.zerofree2	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/fs/ext2/super.c	2006-08-25 20:08:06.000000000 +0100
@@ -287,7 +287,7 @@ enum {
 	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
 	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
-	Opt_usrquota, Opt_grpquota
+	Opt_usrquota, Opt_grpquota, Opt_zerofree
 };
 
 static match_table_t tokens = {
@@ -310,6 +310,7 @@ static match_table_t tokens = {
 	{Opt_oldalloc, "oldalloc"},
 	{Opt_orlov, "orlov"},
 	{Opt_nobh, "nobh"},
+	{Opt_zerofree, "zerofree"},
 	{Opt_user_xattr, "user_xattr"},
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
@@ -393,6 +394,9 @@ static int parse_options (char * options
 		case Opt_nobh:
 			set_opt (sbi->s_mount_opt, NOBH);
 			break;
+		case Opt_zerofree:
+			set_opt (sbi->s_mount_opt, ZEROFREE);
+			break;
 #ifdef CONFIG_EXT2_FS_XATTR
 		case Opt_user_xattr:
 			set_opt (sbi->s_mount_opt, XATTR_USER);
--- linux-2.6.17/fs/ext2/xattr.c.zerofree2	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/fs/ext2/xattr.c	2006-08-25 20:08:06.000000000 +0100
@@ -676,7 +676,7 @@ ext2_xattr_set2(struct inode *inode, str
 
 			new_bh = sb_getblk(sb, block);
 			if (!new_bh) {
-				ext2_free_blocks(inode, block, 1);
+				ext2_free_blocks(inode, block, 1, 0);
 				error = -EIO;
 				goto cleanup;
 			}
@@ -715,25 +715,26 @@ ext2_xattr_set2(struct inode *inode, str
 
 	error = 0;
 	if (old_bh && old_bh != new_bh) {
+		unsigned long block = old_bh->b_blocknr;
 		struct mb_cache_entry *ce;
 
 		/*
 		 * If there was an old block and we are no longer using it,
 		 * release the old block.
 		 */
-		ce = mb_cache_entry_get(ext2_xattr_cache, old_bh->b_bdev,
-					old_bh->b_blocknr);
+		ce = mb_cache_entry_get(ext2_xattr_cache, old_bh->b_bdev, block);
 		lock_buffer(old_bh);
 		if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
 			/* Free the old block. */
 			if (ce)
 				mb_cache_entry_free(ce);
 			ea_bdebug(old_bh, "freeing");
-			ext2_free_blocks(inode, old_bh->b_blocknr, 1);
+			unlock_buffer(old_bh);
 			/* We let our caller release old_bh, so we
 			 * need to duplicate the buffer before. */
 			get_bh(old_bh);
 			bforget(old_bh);
+			ext2_free_blocks(inode, block, 1, 1);
 		} else {
 			/* Decrement the refcount only.
*/ HDR(old_bh)->h_refcount = cpu_to_le32( @@ -744,8 +745,8 @@ ext2_xattr_set2(struct inode *inode, str mark_buffer_dirty(old_bh); ea_bdebug(old_bh, "refcount now=%d", le32_to_cpu(HDR(old_bh)->h_refcount)); + unlock_buffer(old_bh); } - unlock_buffer(old_bh); } cleanup: @@ -789,10 +790,10 @@ ext2_xattr_delete_inode(struct inode *in if (HDR(bh)->h_refcount == cpu_to_le32(1)) { if (ce) mb_cache_entry_free(ce); - ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1); + unlock_buffer(bh); get_bh(bh); bforget(bh); - unlock_buffer(bh); + ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1, 1); } else { HDR(bh)->h_refcount = cpu_to_le32( le32_to_cpu(HDR(bh)->h_refcount) - 1); --- linux-2.6.17/fs/ext2/inode.c.zerofree2 2006-06-18 02:49:35.000000000 +0100 +++ linux-2.6.17/fs/ext2/inode.c 2006-08-25 20:08:06.000000000 +0100 @@ -100,7 +100,7 @@ void ext2_discard_prealloc (struct inode ei->i_prealloc_count = 0; ei->i_prealloc_block = 0; write_unlock(&ei->i_meta_lock); - ext2_free_blocks (inode, block, total); + ext2_free_blocks (inode, block, total, 0); return; } else write_unlock(&ei->i_meta_lock); @@ -467,7 +467,7 @@ static int ext2_alloc_branch(struct inod for (i = 1; i < n; i++) bforget(branch[i].bh); for (i = 0; i < n; i++) - ext2_free_blocks(inode, le32_to_cpu(branch[i].key), 1); + ext2_free_blocks(inode, le32_to_cpu(branch[i].key), 1, 0); return err; } @@ -527,7 +527,7 @@ changed: for (i = 1; i < num; i++) bforget(where[i].bh); for (i = 0; i < num; i++) - ext2_free_blocks(inode, le32_to_cpu(where[i].key), 1); + ext2_free_blocks(inode, le32_to_cpu(where[i].key), 1, 1); return -EAGAIN; } @@ -837,7 +837,7 @@ static inline void ext2_free_data(struct count++; else { mark_inode_dirty(inode); - ext2_free_blocks (inode, block_to_free, count); + ext2_free_blocks (inode, block_to_free, count, 1); free_this: block_to_free = nr; count = 1; @@ -846,7 +846,7 @@ static inline void ext2_free_data(struct } if (count > 0) { mark_inode_dirty(inode); - ext2_free_blocks (inode, 
block_to_free, count); + ext2_free_blocks (inode, block_to_free, count, 1); } } @@ -889,7 +889,7 @@ static void ext2_free_branches(struct in (__le32*)bh->b_data + addr_per_block, depth); bforget(bh); - ext2_free_blocks(inode, nr, 1); + ext2_free_blocks(inode, nr, 1, 1); mark_inode_dirty(inode); } } else --- linux-2.6.17/fs/ext2/ext2.h.zerofree2 2006-06-18 02:49:35.000000000 +0100 +++ linux-2.6.17/fs/ext2/ext2.h 2006-08-25 20:08:06.000000000 +0100 @@ -94,7 +94,7 @@ extern unsigned long ext2_bg_num_gdb(str extern int ext2_new_block (struct inode *, unsigned long, __u32 *, __u32 *, int *); extern void ext2_free_blocks (struct inode *, unsigned long, - unsigned long); + unsigned long, int); extern unsigned long ext2_count_free_blocks (struct super_block *); extern unsigned long ext2_count_dirs (struct super_block *); extern void ext2_check_blocks_bitmap (struct super_block *); --- linux-2.6.17/include/linux/ext2_fs.h.zerofree2 2006-06-18 02:49:35.000000000 +0100 +++ linux-2.6.17/include/linux/ext2_fs.h 2006-08-25 20:08:06.000000000 +0100 @@ -310,6 +310,7 @@ struct ext2_inode { #define EXT2_MOUNT_MINIX_DF 0x000080 /* Mimics the Minix statfs */ #define EXT2_MOUNT_NOBH 0x000100 /* No buffer_heads */ #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ +#define EXT2_MOUNT_ZEROFREE 0x000400 /* Zero freed blocks */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ From rmy at tigress.co.uk Fri Sep 8 09:04:53 2006 From: rmy at tigress.co.uk (Ron Yorston) Date: Fri, 08 Sep 2006 10:04:53 +0100 Subject: [PATCH] ext3: zero freed blocks In-Reply-To: <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> Message-ID: <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk> Add a zerofree mount option to the 
ext3 filesystem. This causes freed blocks to be filled with zeros. Zeroing is only applied to data blocks, not metadata. This means that directory blocks and extended attributes are not zeroed. Signed-off-by; Ron Yorston --- --- linux-2.6.17/Documentation/filesystems/ext3.txt.zerofree3 2006-06-18 02:49:35.000000000 +0100 +++ linux-2.6.17/Documentation/filesystems/ext3.txt 2006-08-26 19:07:34.000000000 +0100 @@ -113,6 +113,8 @@ noquota grpquota usrquota +zerofree Zero data blocks when they are freed. + Specification ============= --- linux-2.6.17/fs/ext3/super.c.zerofree3 2006-08-26 19:06:57.000000000 +0100 +++ linux-2.6.17/fs/ext3/super.c 2006-08-26 19:07:34.000000000 +0100 @@ -675,7 +675,7 @@ enum { Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, - Opt_grpquota + Opt_grpquota, Opt_zerofree }; static match_table_t tokens = { @@ -720,6 +720,7 @@ static match_table_t tokens = { {Opt_jqfmt_vfsold, "jqfmt=vfsold"}, {Opt_jqfmt_vfsv0, "jqfmt=vfsv0"}, {Opt_grpquota, "grpquota"}, + {Opt_zerofree, "zerofree"}, {Opt_noquota, "noquota"}, {Opt_quota, "quota"}, {Opt_usrquota, "usrquota"}, @@ -1052,6 +1053,9 @@ clear_qf_name: case Opt_nobh: set_opt(sbi->s_mount_opt, NOBH); break; + case Opt_zerofree: + set_opt(sbi->s_mount_opt, ZEROFREE); + break; default: printk (KERN_ERR "EXT3-fs: Unrecognized mount option \"%s\" " --- linux-2.6.17/fs/ext3/balloc.c.zerofree3 2006-06-18 02:49:35.000000000 +0100 +++ linux-2.6.17/fs/ext3/balloc.c 2006-08-26 19:08:14.000000000 +0100 @@ -491,9 +491,28 @@ error_return: return; } +static void ext3_zero_blocks(struct super_block *sb, unsigned long block, + unsigned long count) +{ + unsigned long i; + struct buffer_head *bh; + + for (i = 0; i < count; i++) { + bh = sb_getblk(sb, block+i); + if (!bh) + continue; + + lock_buffer(bh) ; + memset(bh->b_data, 0, bh->b_size); + mark_buffer_dirty(bh); + unlock_buffer(bh) ; + 
brelse(bh); + } +} + /* Free given blocks, update quota and i_blocks field */ void ext3_free_blocks(handle_t *handle, struct inode *inode, - unsigned long block, unsigned long count) + unsigned long block, unsigned long count, int zero) { struct super_block * sb; int dquot_freed_blocks; @@ -503,6 +522,8 @@ void ext3_free_blocks(handle_t *handle, printk ("ext3_free_blocks: nonexistent device"); return; } + if (test_opt(sb, ZEROFREE) && zero && !ext3_should_journal_data(inode)) + ext3_zero_blocks(sb, block, count); ext3_free_blocks_sb(handle, sb, block, count, &dquot_freed_blocks); if (dquot_freed_blocks) DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); --- linux-2.6.17/fs/ext3/inode.c.zerofree3 2006-08-26 19:06:57.000000000 +0100 +++ linux-2.6.17/fs/ext3/inode.c 2006-08-26 19:07:34.000000000 +0100 @@ -562,7 +562,7 @@ static int ext3_alloc_blocks(handle_t *h return ret; failed_out: for (i = 0; i b_blocknr, 1); + ext3_free_blocks(handle, inode, bh->b_blocknr, 1, 0); get_bh(bh); ext3_forget(handle, 1, inode, bh, bh->b_blocknr); } else { @@ -804,7 +804,7 @@ inserted: new_bh = sb_getblk(sb, block); if (!new_bh) { getblk_failed: - ext3_free_blocks(handle, inode, block, 1); + ext3_free_blocks(handle, inode, block, 1, 0); error = -EIO; goto cleanup; } --- linux-2.6.17/include/linux/ext3_fs.h.zerofree3 2006-08-26 19:06:57.000000000 +0100 +++ linux-2.6.17/include/linux/ext3_fs.h 2006-08-26 19:07:34.000000000 +0100 @@ -376,6 +376,7 @@ struct ext3_inode { #define EXT3_MOUNT_QUOTA 0x80000 /* Some quota option set */ #define EXT3_MOUNT_USRQUOTA 0x100000 /* "old" user quota */ #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */ +#define EXT3_MOUNT_ZEROFREE 0x400000 /* Zero freed blocks */ /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H @@ -745,7 +746,7 @@ extern int ext3_new_block (handle_t *, s extern int ext3_new_blocks (handle_t *, struct inode *, unsigned long, unsigned long *, int *); extern void ext3_free_blocks 
 (handle_t *, struct inode *, unsigned long,
- unsigned long);
+ unsigned long, int);
 extern void ext3_free_blocks_sb (handle_t *, struct super_block *,
 unsigned long, unsigned long, int *);
 extern unsigned long ext3_count_free_blocks (struct super_block *);

From mind at bi.lt Fri Sep 8 10:51:08 2006
From: mind at bi.lt (Mindaugas)
Date: Fri, 8 Sep 2006 13:51:08 +0300
Subject: wiping of unused space on ext3
References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt>
 <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
Message-ID: <044401c6d334$b1108d90$f20214ac@bite.lt>

> "Mindaugas" wrote:
>> I was asked if it is possible to zero unused space in ext3 partition?
>
> I've been meaning to submit these to LKML, but since you ask let's give
> them an airing here first.
>
> Since this is being copied to the CentOS mailing list I should point
> out that I also have versions of the patches that apply cleanly to the
> RHEL 4 kernel. I don't have them to hand at the moment but if there's
> any interest I can provide them later.

Thank you for the answer. Just this request is suspended now so I don't
know if I will need those patches anymore. In case I will need them I will
ask you for the RHEL4 version. :)

Mindaugas

From richard.c.wolber at boeing.com Fri Sep 8 13:53:26 2006
From: richard.c.wolber at boeing.com (Wolber, Richard C)
Date: Fri, 8 Sep 2006 06:53:26 -0700
Subject: wiping of unused space on ext3
In-Reply-To:
Message-ID: <8C7C41A176AC0B468BEFB2EFD9BDAB9901324182@XCH-NW-5V2.nw.nos.boeing.com>

> -----Original Message-----
> From: Matt Stegman [mailto:matts at ksu.edu]
> Sent: Thursday, September 07, 2006 3:40 PM
> To: Wolber, Richard C
> Cc: Andreas Dilger; Mindaugas; ext3-users at redhat.com; CentOS
> mailing list
> Subject: RE: wiping of unused space on ext3
>
> Well, the manpage does say:
>
> BUGS AND LIMITATIONS
> The `c', 's', and `u' attributes are not honored by
> the ext2 and ext3 filesystems as implemented in the current
> mainline Linux kernels.
These attributes may be implemented in > future versions ext2 and ext3. Doh! Thanks for the cluestick! ..Chuck.. From rmy at tigress.co.uk Fri Sep 8 18:13:33 2006 From: rmy at tigress.co.uk (Ron Yorston) Date: Fri, 08 Sep 2006 19:13:33 +0100 Subject: wiping of unused space on ext3 In-Reply-To: <044401c6d334$b1108d90$f20214ac@bite.lt> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> <044401c6d334$b1108d90$f20214ac@bite.lt> Message-ID: <200609081813.k88IDX3W015821@tiffany.internal.tigress.co.uk> Here we are: RHEL 4 versions of my zerofree patches. I add these to the kernel spec file at about Patch6000. Ron -------------- next part -------------- --- linux-2.6.9/include/linux/ext2_fs.h.zerofree2 2004-10-18 22:53:21.000000000 +0100 +++ linux-2.6.9/include/linux/ext2_fs.h 2006-08-29 19:49:10.000000000 +0100 @@ -310,6 +310,7 @@ struct ext2_inode { #define EXT2_MOUNT_MINIX_DF 0x0080 /* Mimics the Minix statfs */ #define EXT2_MOUNT_NOBH 0x0100 /* No buffer_heads */ #define EXT2_MOUNT_NO_UID32 0x0200 /* Disable 32-bit UIDs */ +#define EXT2_MOUNT_ZEROFREE 0x0400 /* Zero freed blocks */ #define EXT2_MOUNT_XATTR_USER 0x4000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x8000 /* POSIX Access Control Lists */ --- linux-2.6.9/fs/ext2/balloc.c.zerofree2 2004-10-18 22:53:51.000000000 +0100 +++ linux-2.6.9/fs/ext2/balloc.c 2006-08-29 19:46:35.000000000 +0100 @@ -173,9 +173,28 @@ static void group_release_blocks(struct } } +static void ext2_zero_blocks(struct super_block *sb, unsigned long block, + unsigned long count) +{ + unsigned long i; + struct buffer_head * bh; + + for (i = 0; i < count; i++) { + bh = sb_getblk(sb, block+i); + if (!bh) + continue; + + lock_buffer(bh); + memset(bh->b_data, 0, bh->b_size); + mark_buffer_dirty(bh); + unlock_buffer(bh); + brelse(bh); + } +} + /* Free given blocks, update quota and i_blocks field */ void ext2_free_blocks (struct inode * inode, unsigned long block, - 
unsigned long count) + unsigned long count, int zero) { struct buffer_head *bitmap_bh = NULL; struct buffer_head * bh2; @@ -200,6 +219,9 @@ void ext2_free_blocks (struct inode * in ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1); + if (test_opt(sb, ZEROFREE) && zero) + ext2_zero_blocks(sb, block, count); + do_more: overflow = 0; block_group = (block - le32_to_cpu(es->s_first_data_block)) / --- linux-2.6.9/fs/ext2/super.c.zerofree2 2006-08-29 19:44:53.000000000 +0100 +++ linux-2.6.9/fs/ext2/super.c 2006-08-29 19:54:05.000000000 +0100 @@ -293,7 +293,7 @@ enum { Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid, Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro, Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov, Opt_nobh, - Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, + Opt_zerofree, Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_ignore, Opt_err, }; @@ -318,6 +318,7 @@ static match_table_t tokens = { {Opt_oldalloc, "oldalloc"}, {Opt_orlov, "orlov"}, {Opt_nobh, "nobh"}, + {Opt_zerofree, "zerofree"}, {Opt_user_xattr, "user_xattr"}, {Opt_nouser_xattr, "nouser_xattr"}, {Opt_acl, "acl"}, @@ -407,6 +408,9 @@ static int parse_options (char * options case Opt_nobh: set_opt (sbi->s_mount_opt, NOBH); break; + case Opt_zerofree: + set_opt (sbi->s_mount_opt, ZEROFREE); + break; #ifdef CONFIG_EXT2_FS_XATTR case Opt_user_xattr: set_opt (sbi->s_mount_opt, XATTR_USER); --- linux-2.6.9/fs/ext2/xattr.c.zerofree2 2006-08-29 19:40:46.000000000 +0100 +++ linux-2.6.9/fs/ext2/xattr.c 2006-08-29 19:55:25.000000000 +0100 @@ -679,7 +679,7 @@ ext2_xattr_set2(struct inode *inode, str new_bh = sb_getblk(sb, block); if (!new_bh) { - ext2_free_blocks(inode, block, 1); + ext2_free_blocks(inode, block, 1, 0); error = -EIO; goto cleanup; } @@ -712,24 +712,25 @@ ext2_xattr_set2(struct inode *inode, str error = 0; if (old_bh && old_bh != new_bh) { + unsigned long block = old_bh->b_blocknr; struct mb_cache_entry *ce; /* * 
If there was an old block and we are no longer using it, * release the old block. */ - ce = mb_cache_entry_get(ext2_xattr_cache, old_bh->b_bdev, - old_bh->b_blocknr); + ce = mb_cache_entry_get(ext2_xattr_cache, old_bh->b_bdev, block); lock_buffer(old_bh); if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) { /* Free the old block. */ if (ce) mb_cache_entry_free(ce); ea_bdebug(old_bh, "freeing"); - ext2_free_blocks(inode, old_bh->b_blocknr, 1); + unlock_buffer(old_bh); /* We let our caller release old_bh, so we * need to duplicate the buffer before. */ get_bh(old_bh); bforget(old_bh); + ext2_free_blocks(inode, block, 1, 1); } else { /* Decrement the refcount only. */ if (ce) @@ -740,8 +741,8 @@ ext2_xattr_set2(struct inode *inode, str mark_buffer_dirty(old_bh); ea_bdebug(old_bh, "refcount now=%d", le32_to_cpu(HDR(old_bh)->h_refcount)); + unlock_buffer(old_bh); } - unlock_buffer(old_bh); } cleanup: @@ -786,10 +787,10 @@ ext2_xattr_delete_inode(struct inode *in if (HDR(bh)->h_refcount == cpu_to_le32(1)) { if (ce) mb_cache_entry_free(ce); - ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1); + unlock_buffer(bh); get_bh(bh); bforget(bh); - unlock_buffer(bh); + ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1, 1); } else { if (ce) mb_cache_entry_release(ce); --- linux-2.6.9/fs/ext2/inode.c.zerofree2 2006-08-29 19:44:53.000000000 +0100 +++ linux-2.6.9/fs/ext2/inode.c 2006-08-29 19:46:35.000000000 +0100 @@ -99,7 +99,7 @@ void ext2_discard_prealloc (struct inode ei->i_prealloc_count = 0; ei->i_prealloc_block = 0; write_unlock(&ei->i_meta_lock); - ext2_free_blocks (inode, block, total); + ext2_free_blocks (inode, block, total, 0); return; } else write_unlock(&ei->i_meta_lock); @@ -462,7 +462,7 @@ static int ext2_alloc_branch(struct inod for (i = 1; i < n; i++) bforget(branch[i].bh); for (i = 0; i < n; i++) - ext2_free_blocks(inode, le32_to_cpu(branch[i].key), 1); + ext2_free_blocks(inode, le32_to_cpu(branch[i].key), 1, 0); return err; } @@ -522,7 +522,7 @@ changed: for (i 
= 1; i < num; i++) bforget(where[i].bh); for (i = 0; i < num; i++) - ext2_free_blocks(inode, le32_to_cpu(where[i].key), 1); + ext2_free_blocks(inode, le32_to_cpu(where[i].key), 1, 1); return -EAGAIN; } @@ -821,7 +821,7 @@ static inline void ext2_free_data(struct count++; else { mark_inode_dirty(inode); - ext2_free_blocks (inode, block_to_free, count); + ext2_free_blocks (inode, block_to_free, count, 1); free_this: block_to_free = nr; count = 1; @@ -830,7 +830,7 @@ static inline void ext2_free_data(struct } if (count > 0) { mark_inode_dirty(inode); - ext2_free_blocks (inode, block_to_free, count); + ext2_free_blocks (inode, block_to_free, count, 1); } } @@ -873,7 +873,7 @@ static void ext2_free_branches(struct in (__le32*)bh->b_data + addr_per_block, depth); bforget(bh); - ext2_free_blocks(inode, nr, 1); + ext2_free_blocks(inode, nr, 1, 1); mark_inode_dirty(inode); } } else --- linux-2.6.9/fs/ext2/ext2.h.zerofree2 2006-08-29 19:44:53.000000000 +0100 +++ linux-2.6.9/fs/ext2/ext2.h 2006-08-29 19:46:35.000000000 +0100 @@ -85,7 +85,7 @@ extern unsigned long ext2_bg_num_gdb(str extern int ext2_new_block (struct inode *, unsigned long, __u32 *, __u32 *, int *); extern void ext2_free_blocks (struct inode *, unsigned long, - unsigned long); + unsigned long, int); extern unsigned long ext2_count_free_blocks (struct super_block *); extern unsigned long ext2_count_dirs (struct super_block *); extern void ext2_check_blocks_bitmap (struct super_block *); --- linux-2.6.9/Documentation/filesystems/ext2.txt.zerofree2 2004-10-18 22:53:43.000000000 +0100 +++ linux-2.6.9/Documentation/filesystems/ext2.txt 2006-08-29 19:46:35.000000000 +0100 @@ -62,6 +62,8 @@ resgid=n The group ID which may use th sb=n Use alternate superblock at this location. +zerofree Zero data blocks when they are freed. + grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2. 
-------------- next part -------------- --- linux-2.6.9/include/linux/ext3_fs.h.zerofree3 2006-08-30 20:44:40.000000000 +0100 +++ linux-2.6.9/include/linux/ext3_fs.h 2006-08-30 20:47:19.000000000 +0100 @@ -355,6 +355,7 @@ struct ext3_inode { #define EXT3_MOUNT_POSIX_ACL 0x08000 /* POSIX Access Control Lists */ #define EXT3_MOUNT_BARRIER 0x10000 /* Use block barriers */ #define EXT3_MOUNT_RESERVATION 0x20000 /* Preallocation */ +#define EXT3_MOUNT_ZEROFREE 0x40000 /* Zero freed blocks */ /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H @@ -713,7 +714,7 @@ extern int ext3_bg_has_super(struct supe extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group); extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *); extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long, - unsigned long); + unsigned long, int); extern void ext3_free_blocks_sb (handle_t *, struct super_block *, unsigned long, unsigned long, int *); extern unsigned long ext3_count_free_blocks (struct super_block *); --- linux-2.6.9/fs/ext3/super.c.zerofree3 2006-08-30 20:45:30.000000000 +0100 +++ linux-2.6.9/fs/ext3/super.c 2006-08-30 20:47:19.000000000 +0100 @@ -631,7 +631,7 @@ enum { Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, - Opt_ignore, Opt_barrier, Opt_err, Opt_resize, + Opt_zerofree, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, }; static match_table_t tokens = { @@ -674,6 +674,7 @@ static match_table_t tokens = { {Opt_grpjquota, "grpjquota=%s"}, {Opt_jqfmt_vfsold, "jqfmt=vfsold"}, {Opt_jqfmt_vfsv0, "jqfmt=vfsv0"}, + {Opt_zerofree, "zerofree"}, {Opt_ignore, "grpquota"}, {Opt_ignore, "noquota"}, {Opt_ignore, "quota"}, @@ -970,6 +971,9 @@ clear_qf_name: match_int(&args[0], &option); *n_blocks_count = option; break; + case Opt_zerofree: + set_opt(sbi->s_mount_opt, ZEROFREE); + 
break; default: printk (KERN_ERR "EXT3-fs: Unrecognized mount option \"%s\" " --- linux-2.6.9/fs/ext3/balloc.c.zerofree3 2006-08-30 20:44:40.000000000 +0100 +++ linux-2.6.9/fs/ext3/balloc.c 2006-08-30 20:47:19.000000000 +0100 @@ -451,9 +451,28 @@ error_return: return; } +static void ext3_zero_blocks(struct super_block *sb, unsigned long block, + unsigned long count) +{ + unsigned long i; + struct buffer_head *bh; + + for (i = 0; i < count; i++) { + bh = sb_getblk(sb, block+i); + if (!bh) + continue; + + lock_buffer(bh) ; + memset(bh->b_data, 0, bh->b_size); + mark_buffer_dirty(bh); + unlock_buffer(bh) ; + brelse(bh); + } +} + /* Free given blocks, update quota and i_blocks field */ void ext3_free_blocks(handle_t *handle, struct inode *inode, - unsigned long block, unsigned long count) + unsigned long block, unsigned long count, int zero) { struct super_block * sb; int dquot_freed_blocks; @@ -463,6 +482,8 @@ void ext3_free_blocks(handle_t *handle, printk ("ext3_free_blocks: nonexistent device"); return; } + if (test_opt(sb, ZEROFREE) && zero && !ext3_should_journal_data(inode)) + ext3_zero_blocks(sb, block, count); ext3_free_blocks_sb(handle, sb, block, count, &dquot_freed_blocks); if (dquot_freed_blocks) DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); --- linux-2.6.9/fs/ext3/inode.c.zerofree3 2006-08-30 20:44:40.000000000 +0100 +++ linux-2.6.9/fs/ext3/inode.c 2006-08-30 20:47:19.000000000 +0100 @@ -571,7 +571,7 @@ static int ext3_alloc_branch(handle_t *h ext3_journal_forget(handle, branch[i].bh); } for (i = 0; i < keys; i++) - ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1); + ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1, 0); return err; } @@ -672,7 +672,7 @@ err_out: if (err == -EAGAIN) for (i = 0; i < num; i++) ext3_free_blocks(handle, inode, - le32_to_cpu(where[i].key), 1); + le32_to_cpu(where[i].key), 1, 0); return err; } @@ -1819,7 +1819,7 @@ ext3_clear_blocks(handle_t *handle, stru } } - ext3_free_blocks(handle, inode, 
block_to_free, count); + ext3_free_blocks(handle, inode, block_to_free, count, 1); } /** @@ -1992,7 +1992,7 @@ static void ext3_free_branches(handle_t ext3_journal_test_restart(handle, inode); } - ext3_free_blocks(handle, inode, nr, 1); + ext3_free_blocks(handle, inode, nr, 1, 0); if (parent_bh) { /* --- linux-2.6.9/fs/ext3/xattr.c.zerofree3 2006-08-30 20:45:00.000000000 +0100 +++ linux-2.6.9/fs/ext3/xattr.c 2006-08-30 20:48:06.000000000 +0100 @@ -699,7 +699,7 @@ ext3_xattr_set_handle2(handle_t *handle, new_bh = sb_getblk(sb, block); if (!new_bh) { getblk_failed: - ext3_free_blocks(handle, inode, block, 1); + ext3_free_blocks(handle, inode, block, 1, 0); error = -EIO; goto cleanup; } @@ -746,7 +746,7 @@ getblk_failed: if (ce) mb_cache_entry_free(ce); ea_bdebug(old_bh, "freeing"); - ext3_free_blocks(handle, inode, old_bh->b_blocknr, 1); + ext3_free_blocks(handle, inode, old_bh->b_blocknr, 1, 0); /* ext3_forget() calls bforget() for us, but we let our caller release old_bh, so we need to @@ -845,7 +845,7 @@ ext3_xattr_delete_inode(handle_t *handle if (HDR(bh)->h_refcount == cpu_to_le32(1)) { if (ce) mb_cache_entry_free(ce); - ext3_free_blocks(handle, inode, EXT3_I(inode)->i_file_acl, 1); + ext3_free_blocks(handle, inode, EXT3_I(inode)->i_file_acl, 1, 0); get_bh(bh); ext3_forget(handle, 1, inode, bh, EXT3_I(inode)->i_file_acl); } else { --- linux-2.6.9/Documentation/filesystems/ext3.txt.zerofree3 2004-10-18 22:53:51.000000000 +0100 +++ linux-2.6.9/Documentation/filesystems/ext3.txt 2006-08-30 20:47:19.000000000 +0100 @@ -108,6 +108,8 @@ noquota (see fs/ext3/super.c, line 594 grpquota usrquota +zerofree Zero data blocks when they are freed. 
+
 Specification
 =============

From tytso at mit.edu Fri Sep 8 20:10:34 2006
From: tytso at mit.edu (Theodore Tso)
Date: Fri, 8 Sep 2006 16:10:34 -0400
Subject: [PATCH] ext3: zero freed blocks
In-Reply-To: <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk>
References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt>
 <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
 <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk>
Message-ID: <20060908201034.GA7542@thunk.org>

On Fri, Sep 08, 2006 at 10:04:53AM +0100, Ron Yorston wrote:
> Add a zerofree mount option to the ext3 filesystem. This causes freed
> blocks to be filled with zeros.
>
> Zeroing is only applied to data blocks, not metadata. This means that
> directory blocks and extended attributes are not zeroed.
>
> Signed-off-by; Ron Yorston
               ^
Should be a ':' character. :-)

Ideally, this wouldn't be done as a mount-time option, but rather only
if the secure_delete flag is set on the file. That way you don't do
it for all files, but just those that need to be zeroed.

The patch also has the potential danger that the data blocks are
getting zeroed before the transaction which contains the unlink has
committed. There is therefore the risk that the system might crash
after the blocks have been zero'ed, but before transaction has
committed. In that case, the file will still be there, but some or
all of its contents will be zero'ed.

The other thing which worries me about this patch is that if the
blocks which you have zero'ed out get reallocated and used for some
other file, and then data is written into the page cache and the page
gets written to disk before the zero'ized buffers hit the disk, the
new contents of the data blocks could get written. The reason for
this is that there is no cache coherency enforced between the page
cache and buffer cache, and so it is necessary to be very careful when
a particular block transitions between from being modified via buffer
cache versus the page cache.

Anyway, there's a reason why secure delete is more than a little bit
tricky, and why it's never been implemented up until now. Not that
it's impossible to do, just that it's a lot more subtle than it looks. :-)

Regards,

 - Ted

From adilger at clusterfs.com Sat Sep 9 00:26:28 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 8 Sep 2006 18:26:28 -0600
Subject: [PATCH] ext3: zero freed blocks
In-Reply-To: <20060908201034.GA7542@thunk.org>
References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt>
 <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
 <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk>
 <20060908201034.GA7542@thunk.org>
Message-ID: <20060909002628.GO6441@schatzie.adilger.int>

On Sep 08, 2006 16:10 -0400, Theodore Tso wrote:
> Ideally, this wouldn't be done as a mount-time option, but rather only
> if the secure_delete flag is set on the file. That way you don't do
> it for all files, but just those that need to be zeroed.

Agreed.

> The patch also has the potential danger that the data blocks are
> getting zeroed before the transaction which contains the unlink has
> committed. There is therefore the risk that the system might crash
> after the blocks have been zero'ed, but before transaction has
> committed. In that case, the file will still be there, but some or
> all of its contents will be zero'ed.

That might be considered a feature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
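
The crash window Ted describes earlier in the thread (data blocks zeroed
before the unlink transaction commits) can be illustrated with a toy
userspace model. This is only a sketch under stated assumptions, not
kernel code: struct toy_fs, toy_init, toy_unlink and toy_is_zeroed are
hypothetical names, the "journal" is reduced to a single directory-entry
flag, and zeroed data is assumed to reach the disk immediately.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 4   /* data blocks owned by the one file in the model */
#define BLKSZ   8   /* bytes per block */

/* Hypothetical toy filesystem: what would be found on disk after recovery. */
struct toy_fs {
    char disk[NBLOCKS][BLKSZ]; /* on-disk file data */
    int  file_exists;          /* directory entry still present? */
};

/* Create one file whose blocks hold non-zero data. */
void toy_init(struct toy_fs *fs)
{
    assert(fs != NULL);
    for (int i = 0; i < NBLOCKS; i++)
        memset(fs->disk[i], 'A' + i, BLKSZ);
    fs->file_exists = 1;
}

/* Return 1 if every data byte is zero, 0 otherwise. */
int toy_is_zeroed(const struct toy_fs *fs)
{
    for (int i = 0; i < NBLOCKS; i++)
        for (int j = 0; j < BLKSZ; j++)
            if (fs->disk[i][j] != 0)
                return 0;
    return 1;
}

/*
 * Unlink the file.
 * zero_early:          wipe the data before the unlink commits.
 * crash_before_commit: simulate a crash before the transaction commits,
 *                      so recovery finds no record of the unlink.
 */
void toy_unlink(struct toy_fs *fs, int zero_early, int crash_before_commit)
{
    if (zero_early)
        memset(fs->disk, 0, sizeof(fs->disk)); /* data gone already */
    if (crash_before_commit)
        return;                                /* unlink never happened */
    fs->file_exists = 0;                       /* transaction committed */
    if (!zero_early)
        memset(fs->disk, 0, sizeof(fs->disk)); /* wipe only after commit */
}
```

In this model, toy_unlink(&fs, 1, 1) leaves file_exists set while
toy_is_zeroed() reports the data gone, which is the "file still there,
contents zeroed" state under discussion. With zero_early == 0 the same
crash leaves the file fully intact.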

From tytso at mit.edu Sat Sep 9 05:09:20 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sat, 9 Sep 2006 01:09:20 -0400
Subject: [PATCH] ext3: zero freed blocks
In-Reply-To: <20060909002628.GO6441@schatzie.adilger.int>
References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt>
 <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
 <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk>
 <20060908201034.GA7542@thunk.org>
 <20060909002628.GO6441@schatzie.adilger.int>
Message-ID: <20060909050920.GA10849@thunk.org>

On Fri, Sep 08, 2006 at 06:26:28PM -0600, Andreas Dilger wrote:
> > The patch also has the potential danger that the data blocks are
> > getting zeroed before the transaction which contains the unlink has
> > committed. There is therefore the risk that the system might crash
> > after the blocks have been zero'ed, but before transaction has
> > committed. In that case, the file will still be there, but some or
> > all of its contents will be zero'ed.
>
> That might be considered a feature.

I don't think so. Deletes should be atomic. I could certainly see
programs where a file should either be deleted, or not deleted. For a
file to be partially corrupted but not deleted could ruin an
application's consistency assumptions.

 - Ted

From rmy at tigress.co.uk Sat Sep 9 10:36:27 2006
From: rmy at tigress.co.uk (Ron Yorston)
Date: Sat, 09 Sep 2006 11:36:27 +0100
Subject: [PATCH] ext3: zero freed blocks
In-Reply-To: <20060908201034.GA7542@thunk.org>
References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt>
 <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk>
 <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk>
 <20060908201034.GA7542@thunk.org>
Message-ID: <200609091036.k89AaRXA016086@tiffany.internal.tigress.co.uk>

I've removed Cc: centos at centos.org, as I think we've probably
outstayed our welcome on that list.

Theodore Tso wrote:
>Ideally, this wouldn't be done as a mount-time option, but rather only
>if the secure_delete flag is set on the file.
That way you don't do >it for all files, but just those that need to be zeroed. We can use both a mount option and the secure delete flag. As Ted wrote on a previous occasion this came up on LKML: >The obvious thing to do would be to make it a mount option, so that >(a) recompilation is not necessary in order to use the feature, and >(b) the feature can be turned on or off on a per-filesystem feature. >In 2.6, it's possible to specify certain mount options to be specified >by default on a per-filesystem basis (via a new field in the >superblock). > >So if you do things that way, then secure deletion would take place >either if the secure deletion flag is set (so it can be enabled on a >per-file basis), or if the filesystem is mounted with the >secure-deletion mount option. Personally I find a mount option much more useful than a per-file flag. >The patch also has the potential danger that the data blocks are >getting zeroed before the transaction which contains the unlink has >committed. There is therefore the risk that the system might crash >after the blocks have been zero'ed, but before transaction has >committed. In that case, the file will still be there, but some or >all of its contents will be zero'ed. Indeed, I was aware of this possibility, and breaking guarantees about the atomicity of delete is a bad thing. (Does ext2 provide any such guarantee?) The original patch (http://lwn.net/Articles/171924/) by Nikolai Joukov had code to call ext3_journal_dirty_data on the data blocks, which may have been intended to address this issue. But I ripped it out because it failed horribly when I tried to delete a file bigger than physical RAM. >The other thing which worries me about this patch is that if the >blocks which you have zero'ed out get reallocated and used for some >other file, and then data is written into the page cache and the page >gets written to disk before the zero'ized buffers hit the disk, the >new contents of the data blocks could get written. 
The reason for >this is that there is no cache coherency enforced between the page >cache and buffer cache, and so it is necessary to be very careful when >a particular block transitions from being modified via buffer >cache versus the page cache. What are the consequences of this? Is there any danger of the other file being corrupted? If not, and if our purpose is just to ensure that the original contents of the freed blocks are destroyed, does it matter if they're overwritten with something other than the zeroes we intended? Ron From tytso at mit.edu Sat Sep 9 13:21:54 2006 From: tytso at mit.edu (Theodore Tso) Date: Sat, 9 Sep 2006 09:21:54 -0400 Subject: [PATCH] ext3: zero freed blocks In-Reply-To: <200609091036.k89AaRXA016086@tiffany.internal.tigress.co.uk> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk> <20060908201034.GA7542@thunk.org> <200609091036.k89AaRXA016086@tiffany.internal.tigress.co.uk> Message-ID: <20060909132154.GB24906@thunk.org> On Sat, Sep 09, 2006 at 11:36:27AM +0100, Ron Yorston wrote: > >The other thing which worries me about this patch is that if the > >blocks which you have zero'ed out get reallocated and used for some > >other file, and then data is written into the page cache and the page > >gets written to disk before the zero'ized buffers hit the disk, the > >new contents of the data blocks could get written. The reason for > >this is that there is no cache coherency enforced between the page > >cache and buffer cache, and so it is necessary to be very careful when > >a particular block transitions from being modified via buffer > >cache versus the page cache. > > What are the consequences of this? Is there any danger of the other > file being corrupted? 
If not, and if our purpose is just to ensure that > the original contents of the freed blocks are destroyed, does it matter if > they're overwritten with something other than the zeroes we intended? > Yes, that's precisely what I'm worried about. Specifically, if you have this sequence of events: 1) File gets deleted; the file contents get zero'ed out via the buffer cache. Since the process of zeroing the file happens in the background, for a large file, this could continue for a long time... 2) In the meantime, one or more of the disk blocks that was used by the old file are reallocated for a new file. The application writes data to the new file, which is stored in the page cache. 3) The application calls fsync() and the contents of the new file are flushed from the page cache and written to disk. 4) The dirty buffers containing the zero'ed out contents of the block are written to disk, overwriting the contents of the new file. 5) Data is lost. One way of solving this problem is to zero the blocks in the foreground, and not allow the unlink to proceed until the data blocks are overwritten. Another way of solving the problem would be to not allow those data blocks to be allocated until the zeroization buffers have been written out. Yet another way would be to try to determine if there is an outstanding buffer cache write from an attempt to zero the free blocks, and abort the buffer cache write before doing the page writeout. That last would not be trivial, and would require violating a number of abstraction boundaries... Another question to ask is whether or not you care that the freed blocks might not be zero'ed if the system crashes before the buffer cache is written out. Currently, there is a chance that after a system crash some deleted file blocks won't be zero'ed. Depending on your requirements, that might or might not be fatal, though. 
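As a rough illustration of the ordering problem in steps 1) to 5) above, here is a toy Python model. It is only a sketch and not kernel code: the block number, the two dictionaries standing in for the buffer cache and page cache, and the simulate() function are all made up for the example. The only thing it shows is that whichever cache is written back last decides what ends up on disk.

```python
# Toy model of the writeback race; not kernel code.
# BLOCK, the cache dictionaries and simulate() are hypothetical names.

BLOCK = 7  # arbitrary block number, freed by the unlink and then reallocated


def simulate(zero_reaches_disk_first: bool) -> bytes:
    """Return the final on-disk contents of BLOCK for a given writeback order."""
    disk = {BLOCK: b"old-data"}

    # Step 1: the unlink queues a zeroed buffer in the buffer cache (still dirty).
    buffer_cache = {BLOCK: bytes(8)}

    # Step 2: the block is reallocated; the new file's data sits in the page cache.
    page_cache = {BLOCK: b"new-data"}

    def write_zero_buffers() -> None:
        disk.update(buffer_cache)

    def fsync_new_file() -> None:
        disk.update(page_cache)

    if zero_reaches_disk_first:
        # Safe order: the zeroes land first, then the new file's data.
        write_zero_buffers()
        fsync_new_file()
    else:
        fsync_new_file()      # step 3: new data is flushed by fsync()
        write_zero_buffers()  # step 4: stale zero buffer overwrites it

    return disk[BLOCK]        # all zeroes here corresponds to step 5


if __name__ == "__main__":
    print(simulate(zero_reaches_disk_first=True))
    print(simulate(zero_reaches_disk_first=False))
```

In this model the first two fixes suggested above (zeroing in the foreground, or holding the blocks back from the allocator until the zero buffers are written) both amount to forcing the zero_reaches_disk_first=True ordering.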
- Ted From tytso at mit.edu Mon Sep 11 04:48:29 2006 From: tytso at mit.edu (Theodore Tso) Date: Mon, 11 Sep 2006 00:48:29 -0400 Subject: how does ext3 handle no communication to storage In-Reply-To: <44F5ACF8.2000705@bnl.gov> References: <44F33E3A.8020805@bnl.gov> <20060828205822.GB4944@thunk.org> <44F37285.8000104@bnl.gov> <20060829082003.GM20105@schatzie.adilger.int> <44F458AF.7040506@bnl.gov> <20060829170351.GA30599@thunk.org> <44F5ACF8.2000705@bnl.gov> Message-ID: <20060911044829.GC24653@thunk.org> [ Apologies for the delayed response, I've been travelling in Germany and Japan over the past week and a half... ] On Wed, Aug 30, 2006 at 11:21:28AM -0400, Sev Binello wrote: > What's the best way to keep informed as to when the patch > to the kernel is made and released ? Probably the best way is to subscribe to the linux-ext4 at vger.kernel.org mailing list... - Ted From mail-lists at karan.org Fri Sep 8 13:52:23 2006 From: mail-lists at karan.org (Karanbir Singh) Date: Fri, 08 Sep 2006 14:52:23 +0100 Subject: [CentOS] Re: wiping of unused space on ext3 In-Reply-To: <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> Message-ID: <45017597.7060608@karan.org> Ron Yorston wrote: > "Mindaugas" wrote: >> I was asked if it is possible to zero unused space in ext3 partition? > > I have a couple of patches that add a zerofree mount option to ext2 and > ext3 filesystems. The ext2 version is much better tested and more > complete: it zeros all file data blocks, directory blocks and extended > attributes (though not inode data). The ext3 patch only handles file > data, not metadata. > > I've been meaning to submit these to LKML, but since you ask let's give > them an airing here first. 
Ron, thanks for these patches - I don't think we can have them included in any official centos-repository hosted kernel, but it's good to know that people, should they need this, can get to them here. - K -- Karanbir Singh : http://www.karan.org/ : 2522219 at icq From rmy at tigress.co.uk Tue Sep 12 20:10:56 2006 From: rmy at tigress.co.uk (Ron Yorston) Date: Tue, 12 Sep 2006 21:10:56 +0100 Subject: [PATCH] ext3: zero freed blocks In-Reply-To: <20060909132154.GB24906@thunk.org> References: <00dd01c6d24e$d29d15f0$f20214ac@bite.lt> <200609080901.k8891RHs015100@tiffany.internal.tigress.co.uk> <200609080904.k8894raH015117@tiffany.internal.tigress.co.uk> <20060908201034.GA7542@thunk.org> <200609091036.k89AaRXA016086@tiffany.internal.tigress.co.uk> <20060909132154.GB24906@thunk.org> Message-ID: <200609122010.k8CKAvdN018778@tiffany.internal.tigress.co.uk> Theodore Tso wrote: >1) File gets deleted; the file contents get zero'ed out via the >buffer cache. Since the process of zeroing the file happens in the >background, for a large file, this could continue for a long time... > >2) In the meantime, one or more of the disk blocks that was used by >the old file are reallocated for a new file. The application writes >data to the new file, which is stored in the page cache. > >3) The application calls fsync() and the contents of the new file are >flushed from the page cache and written to disk. > >4) The dirty buffers containing the zero'ed out contents of the block >are written to disk, overwriting the contents of the new file. > >5) Data is lost. I'm having no luck in generating any data loss with this sequence of events. Any suggestions as to how it might be possible to force it to happen? 
Ron From jayjitkumar.lobhe at patni.com Fri Sep 15 02:34:06 2006 From: jayjitkumar.lobhe at patni.com (Jayjitkumar Lobhe) Date: Thu, 14 Sep 2006 22:34:06 -0400 (EDT) Subject: Root filesystem on ext2 Message-ID: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202> Dear All, I have the following query: - My initrd image is created using ext2 filesystem. - The filesystem type of / is specified as ext3 in /etc/fstab file. - I don't mount the real root during execution of linuxrc because I referred to some documents saying that if you don't mount real root from linuxrc the kernel will mount it after linuxrc is finished. - The system boots up successfully, mount command shows / partition mounted as ext3 but /proc/mounts shows it as ext2. Is this because in the kernel ext3 is built as module? Or is this because my image is created using ext2 filesystem? When is the exact point when kernel mounts real root? (This seems not to be fitting in this mailing list.) Thanks in advance. It will be a great help if my queries get answered. Regards, Jayjit From samuel at bcgreen.com Sun Sep 24 17:55:10 2006 From: samuel at bcgreen.com (Stephen Samuel) Date: Sun, 24 Sep 2006 10:55:10 -0700 Subject: Retaining undelete data on ext3 In-Reply-To: References: Message-ID: <4516C67E.10609@bcgreen.com> Having just spent a day trying to recover a deleted ext3 file for a friend, I'm wondering about this way of maintaining undelete information in ext3, like is done for ext2: The last step in the deletion process would be to put back the (previously zeroed) block pointers. Since it gets logged to the journal, I _think_ that this should be safe. The worst that would happen is that, if the plug gets pulled in the middle of a file delete, the old block pointers would be unavailable -- I don't see this as a killer issue, since editing the filesystem to do an undelete should be considered an emergency operation anyways. 
From keld at dkuug.dk Sun Sep 24 19:00:00 2006 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Sun, 24 Sep 2006 21:00:00 +0200 Subject: Retaining undelete data on ext3 In-Reply-To: <4516C67E.10609@bcgreen.com> References: <4516C67E.10609@bcgreen.com> Message-ID: <20060924190000.GB4263@rap.rap.dk> On Sun, Sep 24, 2006 at 10:55:10AM -0700, Stephen Samuel wrote: > Having just spent a day trying to recover a deleted ext3 file > for a friend, I'm wondering about this way of maintaining > undelete information in ext3, like is done for ext2: > > The last step in the deletion process would be to put back > the (previously zeroed) block pointers. Since it gets logged > to the journal, I _think_ that this should be safe. The worst > that would happen is that, if the plug gets pulled in the > middle of a file delete, the old block pointers would be > unavailable -- I don't see this as a killer issue, since > editing the filesystem to do an undelete should be considered an > emergency operation anyways. I have a design to improve ext3 so that one could salvage all files, even if you accidentally reformatted the partition, available at http://std.dkuug.dk/keld/lazy3.txt This design has been reviewed by Ted. 
I also have some patches for debugfs to undelete files in ext3, available at http://std.dkuug.dk/keld/readme-salvage.html best regards keld From tytso at mit.edu Sun Sep 24 19:53:19 2006 From: tytso at mit.edu (Theodore Tso) Date: Sun, 24 Sep 2006 15:53:19 -0400 Subject: Retaining undelete data on ext3 In-Reply-To: <4516C67E.10609@bcgreen.com> References: <4516C67E.10609@bcgreen.com> Message-ID: <20060924195319.GC11083@thunk.org> On Sun, Sep 24, 2006 at 10:55:10AM -0700, Stephen Samuel wrote: > Having just spent a day trying to recover a deleted ext3 file > for a friend, I'm wondering about this way of maintaining > undelete information in ext3, like is done for ext2: > > The last step in the deletion process would be to put back > the (previously zeroed) block pointers. Since it gets logged > to the journal, I _think_ that this should be safe. The worst > that would happen is that, if the plug gets pulled in the > middle of a file delete, the old block pointers would be > unavailable -- I don't see this as a killer issue, since > editing the filesystem to do an undelete should be considered an > emergency operation anyways. Yep, that's what would have to be done. The other caveat is that storing all of the previously zeroed block pointers temporarily in memory could take quite a bit of memory, especially if what is being deleted is really big. Consider that if a DVD iso image file is being deleted, between 4 and 5 megabytes of non-swappable (and on x86, it would have to be lowmem/ZONE_NORMAL) kernel memory would be required! Of course, storing the information as a series of extents would be an obvious optimization, which would work on all but a very badly fragmented file (for example, if said DVD .iso image was created when the filesystem was close to 100% full). 
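The "between 4 and 5 megabytes" figure can be checked with a little arithmetic. The sketch below is only an estimate under stated assumptions: a single-layer DVD image of about 4.7e9 bytes, a 4 KiB filesystem block size, and 4-byte (32-bit) block pointers. The function name pointer_bytes() is made up for the example, and the count covers only pointers to data blocks, not the indirect blocks themselves.

```python
# Back-of-the-envelope estimate; all sizes below are assumptions.

DVD_IMAGE_BYTES = 4_700_000_000  # assumed size of a single-layer DVD image
BLOCK_SIZE = 4096                # assumed filesystem block size in bytes
POINTER_SIZE = 4                 # assumed 32-bit block numbers


def pointer_bytes(file_size: int,
                  block_size: int = BLOCK_SIZE,
                  pointer_size: int = POINTER_SIZE) -> int:
    """Memory needed to hold one block pointer per data block of the file."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * pointer_size


if __name__ == "__main__":
    needed = pointer_bytes(DVD_IMAGE_BYTES)
    print(f"{needed} bytes, about {needed / 2**20:.2f} MiB")
```

By contrast, a file laid out as a handful of contiguous runs could be described by a handful of (start, length) pairs, which is why the extent representation mentioned above helps for everything except a badly fragmented file.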
There are some other ways it could be done that would be more optimized, but the bottom line is that the main reason why it hasn't been done is because the people who could do it haven't had the time to implement it. We've been working on other features that are higher priority, either for ourselves or for our employers. Regards, - Ted From tytso at mit.edu Sun Sep 24 20:45:13 2006 From: tytso at mit.edu (Theodore Tso) Date: Sun, 24 Sep 2006 16:45:13 -0400 Subject: Retaining undelete data on ext3 In-Reply-To: <20060924190000.GB4263@rap.rap.dk> References: <4516C67E.10609@bcgreen.com> <20060924190000.GB4263@rap.rap.dk> Message-ID: <20060924204512.GA25658@thunk.org> On Sun, Sep 24, 2006 at 09:00:00PM +0200, Keld Jørn Simonsen wrote: > I have a design to improve ext3 so that one could salvage all files, > even if you accidentally reformatted the partition, available at > http://std.dkuug.dk/keld/lazy3.txt > This design has been reviewed by Ted. To be fair, reviewed != to "approve of all aspects of the design". We exchanged e-mails for a while on the subject, yes. Note that the design has a number of holes in it --- for example, simply saying, "don't blank the inode when deleting it" is not so trivial if you also want to maintain ext3's consistency guarantees. So when the design says things like "My idea is to not clear the inodes, when they are marked as free", that's roughly equivalent to saying, "My idea is to purify Uranium by using some really big centrifuges". It is both simultaneously true and not useful. The hard part is all in the engineering. :-) > I also have some patches for debugfs to undelete files in ext3, > available at http://std.dkuug.dk/keld/readme-salvage.html This should probably be turned into its own standalone program, since it's far more than the scope of debugfs is intended to be. So I don't intend to merge them into debugfs. 
Regards, - Ted From adilger at clusterfs.com Mon Sep 25 15:48:18 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Mon, 25 Sep 2006 09:48:18 -0600 Subject: Retaining undelete data on ext3 In-Reply-To: <4516C67E.10609@bcgreen.com> References: <4516C67E.10609@bcgreen.com> Message-ID: <20060925154818.GC22010@schatzie.adilger.int> On Sep 24, 2006 10:55 -0700, Stephen Samuel wrote: > Having just spent a day trying to recover a deleted ext3 file > for a friend, I'm wondering about this way of maintaining > undelete information in ext3, like is done for ext2: > > The last step in the deletion process would be to put back > the (previously zeroed) block pointers. Since it gets logged > to the journal, I _think_ that this should be safe. The worst > that would happen is that, if the plug gets pulled in the > middle of a file delete, the old block pointers would be > unavailable -- I don't see this as a killer issue, since > editing the filesystem to do an undelete should be considered an > emergency operation anyways. I've written a couple of times about the best way to do this, while improving unlink/truncate performance at the same time (see last sentence): "It would be possible to walk the inode and precompute the number of bitmaps and group descriptors that would be modified by the operation and try to start a single transaction of that size. If this transaction can be started (true in most cases), then we are no longer required to zero out all of the [dt]indirect blocks (as we do not have to worry about restarting the operation) and we only have to update the block bitmaps and their group summaries, reducing the amount of IO considerably for block-mapped files. Also, the walking of the file metadata blocks can be done in forward order and also asynchronous readahead can be started for indirect blocks to make more efficient use of the disk. 
As an added benefit we would regain the ability to undelete files in ext3 because we no longer have to zero out all of the metadata blocks." The only issue is that nobody has worked on implementing this yet, and I don't have time. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From samuel at bcgreen.com Mon Sep 25 22:23:32 2006 From: samuel at bcgreen.com (Stephen Samuel) Date: Mon, 25 Sep 2006 15:23:32 -0700 Subject: Retaining undelete data on ext3 In-Reply-To: <20060924195319.GC11083@thunk.org> References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> Message-ID: <451856E4.8060507@bcgreen.com> As far as I can tell, the only thing that gets zeroed is the block pointers in the inode (i.e. 12 direct pointers and one each of single, double and triple indirects). So I'm presuming that all that should need to be regenerated (and saved), above and beyond what is already done, is the pointers in the inode itself, which should take slightly less core than the whole inode entry. I just did a restore of a 1.5GB tar file from ext3, and the only information that I had to recover was the pointers that were in the inode. Identifying the triple indirect block (real easy) meant that I was only missing 1MB+ of the file, and finding the double indirect (only slightly harder) meant that I was only missing another 48K. Hunting that last 48K (12 blocks) out of the universe of unallocated blocks was the real bitch of the recovery process. If I had those 12 direct block pointers, I could have probably recovered the entire tar file in under an hour. With the extra two pointers (single and double indirect) my time would have been down to 15 minutes (mostly loading software and reading directions). Theodore Tso wrote: > On Sun, Sep 24, 2006 at 10:55:10AM -0700, Stephen Samuel wrote: > >> ..... >> The last step in the deletion process would be to put back >> the (previously zeroed) block pointers. 
Since it gets logged >> to the journal, I _think_ that this should be safe. The worst >> > > Yep, that's what would have to be done. The other caveat is that > storing all of the previously zeroed block pointers temporarily in > memory could take quite a bit of memory, especially if what is being > deleted is really big. Consider that if a DVD iso image file is being > From guolin at alexa.com Tue Sep 26 01:27:08 2006 From: guolin at alexa.com (Guolin Cheng) Date: Tue, 26 Sep 2006 01:27:08 -0000 Subject: Strange Fedora Booting problem: can not mount "LABEL=*" partitions Message-ID: <41089CB27BD8D24E8385C8003EDAF7ABBA487B@karl.alexa.com> Hi, Sorry, NPTL instead of NTPL, typo. too embarrassed. :( --Guolin -----Original Message----- From: Guolin Cheng Sent: Thursday, April 01, 2004 10:37 PM To: Fedora (E-mail); Redhat Ext3 (E-mail); jgarzik at redhat.com Subject: Strange Fedora Booting problem: can not mount "LABEL=*" partitions Hi, Just got Fedora FC1 vanilla 2.4.25kernel+libata8patch booting problems, FC1 complains that it can not automatically find partitions specified with "LABEL=" in /etc/fstab, and then drops me into repair mode. In the repair mode I can mount it manually without any problems. More interesting are: 1) I have several partitions specified with "LABEL=*" in /etc/fstab, but FC1 always can not identify same partition even on different machines; 2) the default&upgraded ntpl kernel boots up without problems. 
My fstab is attached below:

LABEL=/        /            ext3    defaults             1 1
LABEL=/0       /0           ext3    defaults             1 2
/dev/hdc1      /1           ext3    defaults             1 2
LABEL=/alexa   /alexa       ext3    defaults             1 2
none           /dev/pts     devpts  gid=5,mode=620       0 0
none           /proc        proc    defaults             0 0
none           /dev/shm     tmpfs   defaults             0 0
LABEL=/usr     /usr         ext3    defaults             1 2
LABEL=/var     /var         ext3    defaults             1 2
/dev/hda7      swap         swap    defaults             0 0
/dev/hda6      swap         swap    defaults             0 0
/dev/hda8      swap         swap    defaults             0 0
/dev/fd0       /mnt/floppy  auto    noauto,owner,kudzu   0 0
ops-test1.alexa.com guolin 134%

FC1 stops on partitions "LABEL=/var" on two machines, stops on partition "LABEL=/" on the 3rd machine. While the default|upgraded NTPL kernel (with SMP problem) boots without a glitch, my vanilla 2.4.25 kernel plus libata patch 2.4.25-libata8 fails with the symptoms described above. The solution to fix it is: manually run "e2fsck -y -f /dev/hd?, tune2fs -j /dev/hd?; e2label /dev/hd?