From a_lindeman at hotmail.com Thu Mar 1 21:21:38 2007 From: a_lindeman at hotmail.com (Andy Lindeman) Date: Thu, 01 Mar 2007 16:21:38 -0500 Subject: whoops, corrupted my filesystem Message-ID: Hi all- I corrupted my filesystem by not doing a RTFM first... I got an automated email that the process monitoring the SMART data from my hard drive detected a bad sector. Not thinking (or RTFMing), I did a fsck on my partition- which is the main partition. Now it appears that I've ruined the superblock. I am running Fedora Core 6. I am booting off the Fedora Core 6 Rescue CD in order to try to fix things (my system isn't bootable.) Doing an e2fsck /dev/hda2 tells me that the superblock is corrupt. When I do a mke2fs -n /dev/hda2, it tells me that other backups are stored on 32768, 98304, 16840, 229376, 294912, 819200, 884736, 1605632, 265???? (cut off), 4096000, 7962624, 11239424, 20480000, 23887872. When I try doing an e2fsck -b xxx /dev/hda2 on any of the superblocks <= 4096000, I get the message that it's corrupted. When I do >= 7962625, I get "Invalid argument while trying to open /dev/hda2." By the way, there's some sort of weird Logical Volume thing going on with this partition. On an old (out of date unfortunately) backup, the mtab file has it listed as /dev/mapper/VolGroup00-LogVol00. Perhaps this partition can't be addressed as /dev/hda2 and it should be addressed differently?? Should I try a mke2fs -S on this drive or is there something else I should try first? Everything I've read says to back up before mke2fs -S ing. I have an external ext3 drive with enough space to hold this mangled partition on it, although it currently has a single ext3 partition. Is there a way to copy the contents of the mangled partition to the external ext3 partition w/o deleting what's already on it or resizing it and creating a 2nd partition? If it is suggested that I try a mke2fs -S, how does that work?
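The backup locations quoted above follow ext3's sparse_super rule: backup superblocks sit at the start of block group 1 and of every block group whose number is a power of 3, 5, or 7. A short sketch of that rule (illustrative Python, not e2fsprogs code; the group size and block count are taken from the mke2fs -n output quoted in this thread):

```python
# sparse_super: backup superblocks live in block group 1 and in every
# block group numbered by a power of 3, 5, or 7 that fits in the fs.
blocks_per_group = 32768     # from the mke2fs -n output
total_blocks = 61022902      # likewise

groups = {1}
for base in (3, 5, 7):
    g = base
    while g * blocks_per_group < total_blocks:
        groups.add(g)
        g *= base

backups = sorted(g * blocks_per_group for g in groups)
print(backups)
# [32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
#  2654208, 4096000, 7962624, 11239424, 20480000, 23887872]
```

Any of these block numbers can be handed to e2fsck -b (together with -B 4096, since the block size here is 4096), provided it is pointed at the right device.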
mke2fs -n tells me that:

Block size=4096 (log=2)
Fragment size=4096 (log=2)
30523392 inodes, 61022902 blocks
3051145 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
1863 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group

Thanks much for any help! I'd love to recover this instead of having to rebuild my linux PC! Andrew ps- This is a 250 GB Parallel ATA drive. From lakshmipathi.g at gmail.com Fri Mar 2 07:53:14 2007 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Fri, 2 Mar 2007 13:23:14 +0530 Subject: Hi all In-Reply-To: <20070226150818.26821.qmail@webmail89.rediffmail.com> References: <20070226150818.26821.qmail@webmail89.rediffmail.com> Message-ID: Hi, basically i would like to know: is it possible to include this package in Red Hat? Is there any review panel to submit your tools to, so that they can be released in the distribution... what's the procedure to be followed? Thanks in advance. On 26 Feb 2007 15:08:18 -0000, bimal pandit wrote: > > > Dear Laxmi, > > > On Mon, 26 Feb 2007 laksmi pathi wrote : > > >Hi Beos, > >It's true you can't recover files from ext3 since file addresses are > >zeroed out while deleting. > >This tool is a crash proof recovery tool. > >You can then recover the files which are deleted only after its > >installation. The concept is, once you install the tool, it makes a backup > >copy of your files' addresses. When you delete a file, its address in the > >inode is deleted... but we can access the file from the address which we > >copied earlier - provided the content is not overwritten - so it's like a > >crash proof tool.
> >Hi Bruno Wolff , > >Yes it's always better to take regular backup- > >and fellow developers in freshmeat tested and rated this tool, > >i assume they are quite satisfied with the tool. > >Please check out : > >http://freshmeat.net/projects/giis/ > > > >Warm Regards, > >Lakshmipathi.G > > > > > > > > > >On 2/25/07, Bruno Wolff III wrote: > >>On Sat, Feb 24, 2007 at 22:19:02 -0800, > >> "..:::BeOS Mr. X:::.." wrote: > >> > Yes, but I always here that recover from ext3 is not possible... > >> > possibly explain some of the technology ? I have interest in using the > >> > program if I can in fact figure out how to use it. I accidently > recently > >> > deleted a music folder with many mp3 files in it. > >> > >>You are probably better off regularly making backups rather than beta > testing > >>This software. > >> > > > > great job, will test it and would be keen to help and support you to the > extent and the way I could be ... > > regards, > > Bimal Pandit > > > From adilger at clusterfs.com Fri Mar 2 10:51:28 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Fri, 2 Mar 2007 18:51:28 +0800 Subject: whoops, corrupted my filesystem In-Reply-To: References: Message-ID: <20070302105128.GU6573@schatzie.adilger.int> On Mar 01, 2007 16:21 -0500, Andy Lindeman wrote: > Doing a e2fsck /dev/hda2 tells me that the superblock is corrupt. When I > do a mke2fs -n /dev/hda2, it tells me that other backups are stored on > 32768, 98304, 16840, 229376, 294912, 819200, 884736, 1605632, 265???? (cut > off), 4096000, 7962624, 11239424, 20480000, 23887872. > > When I try doing a e2fsck -b xxx /dev/hda2, on any of the superblocks <= > 4096000 I get the message that it's corrupted. When I do >= 7962625, I get > "Invalid argument while trying to open /dev/hda2." > > By the way, there's some sort of weird Logical Volume thing going on with > this partition. On an old (out of date unfortunately) backup, the mtab > file has it listed as /dev/mapper/VolGroup00-LogVol00. 
Perhaps this > partition can't be addressed as /dev/hda2 and it should be addressed > differently?? Correct. You should be running e2fsck /dev/mapper/VolGroup00-LogVol00 instead of /dev/hda2. That's likely why your filesystem is "corrupted"... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From a_lindeman at hotmail.com Fri Mar 2 11:43:34 2007 From: a_lindeman at hotmail.com (Andy Lindeman) Date: Fri, 02 Mar 2007 06:43:34 -0500 Subject: whoops, corrupted my filesystem In-Reply-To: <20070302105128.GU6573@schatzie.adilger.int> Message-ID: Hi Andreas- Is it known what happens when e2fsck is run on /dev/hda2 instead of the volume device? I've run e2fsck on /dev/mapper/VolGroup00-LogVol00 and it gives me multiple "Block bitmap for group 0 is not in group. (block XXXXXX) Relocate?". I select y (actually, I ran with automatic mode.) This doesn't seem to help matters. When I rerun e2fsck, I get the same errors on the same blocks. Thanks for your help! Andy ----Original Message Follows---- From: Andreas Dilger To: Andy Lindeman Correct. You should be running e2fsck /dev/mapper/VolGroup00-LogVol00 instead of /dev/hda2. That's likely why your filesystem is "corrupted"... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
From Matt_Dodson at messageone.com Mon Mar 5 22:18:23 2007 From: Matt_Dodson at messageone.com (Matt Dodson) Date: Mon, 5 Mar 2007 16:18:23 -0600 Subject: Missing blocks Message-ID: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com> Hopefully this is a simple issue or just my ignorance of the results returned by "df -k", but can anyone explain why the available blocks are 0 if total 1K-blocks - Used is greater than 0?

#df -k /ems/bigdisk/
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg0-bigdisk 397367512 383562960 0 100% /

Filesystem volume name:
Last mounted on:
Filesystem UUID: de2b600f-120d-41d2-ba23-b48b50705432
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 50462720
Block count: 100925440
Reserved block count: 5046160
Free blocks: 3174088
Free inodes: 45030587
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1021
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16384
Inode blocks per group: 512
Filesystem created: Fri May 12 08:43:41 2006
Last mount time: Sun Mar 4 23:26:08 2007
Last write time: Sun Mar 4 23:37:03 2007
Mount count: 5
Maximum mount count: 28
Last checked: Fri May 12 08:43:41 2006
Check interval: 15552000 (6 months)
Next check after: Wed Nov 8 07:43:41 2006
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: f127e09e-0c0b-4f18-9e81-d822f8eadf4a
Journal backup: inode blocks

Kernel 2.6.9-34.0.2.ELsmp
e2fsprogs-1.35-12.3.EL4
-------------- next part --------------
An HTML attachment was scrubbed... URL: From jburgess777 at googlemail.com Mon Mar 5 22:39:34 2007 From: jburgess777 at googlemail.com (Jon Burgess) Date: Mon, 05 Mar 2007 22:39:34 +0000 Subject: Missing blocks In-Reply-To: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com> References: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com> Message-ID: <1173134374.29303.6.camel@localhost.localdomain> On Mon, 2007-03-05 at 16:18 -0600, Matt Dodson wrote: > Hopefully this is a simple issue or just my ignorance on the results > returned by "df -k" but can anyone explain why the available block is > 0 if total 1K-blocks - Used is greater than 0? > > You have 5% reserved for root use only, this is 20GB on your current filesystem. See the -m option in 'man mke2fs' for details. tune2fs can adjust this if the filesystem is unmounted. > #df -k /ems/bigdisk/ > > Filesystem 1K-blocks Used > Available Use% Mounted on > > /dev/mapper/vg0-bigdisk 397367512 383562960 > 0 100% / > > Block count: 100925440 > > Reserved block count: 5046160 > > Free blocks: 3174088 > Block size: 4096 Above we see 5046160 x 4096 bytes are reserved. Jon From Matt_Dodson at messageone.com Mon Mar 5 22:45:27 2007 From: Matt_Dodson at messageone.com (Matt Dodson) Date: Mon, 5 Mar 2007 16:45:27 -0600 Subject: Missing blocks In-Reply-To: <1173134374.29303.6.camel@localhost.localdomain> References: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com> <1173134374.29303.6.camel@localhost.localdomain> Message-ID: <44B5599C8B5B1347AFF903FDCEC003070174EAB7@auscorpex-1.austin.messageone.com> Thanks for explaining this to me.
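The reserved-block arithmetic above checks out against the numbers quoted from Matt's mail (a quick sketch in Python; every figure comes from the df -k output and filesystem dump earlier in the thread):

```python
# Figures quoted from the df -k output and filesystem dump above.
block_size      = 4096
reserved_blocks = 5046160    # "Reserved block count"
free_blocks     = 3174088    # "Free blocks"

# The root-only reserve is reserved_blocks * block_size bytes.
print(reserved_blocks * block_size)            # 20669071360, i.e. ~20.7 GB

# df's "Available" column is free-minus-reserved, floored at zero for
# non-root users, so the filesystem shows 0 available / 100% in use
# even though ~13 GB of blocks are genuinely free.
print(max(free_blocks - reserved_blocks, 0))   # 0
print(free_blocks * block_size)                # 13001064448, i.e. ~13 GB
```

As Jon notes, tune2fs -m can shrink that reserve if the space is needed for ordinary users.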
-------------------- Matt Dodson Infrastructure Engineer matt_dodson at messageone.com http://www.messageone.com MessageOne 11044 Research Blvd Building C, Fifth Floor Austin, Tx 78759 (512) 652-4500 (office) -----Original Message----- From: Jon Burgess [mailto:jburgess777 at googlemail.com] Sent: Monday, March 05, 2007 4:40 PM To: Matt Dodson Cc: ext3-users at redhat.com Subject: Re: Missing blocks On Mon, 2007-03-05 at 16:18 -0600, Matt Dodson wrote: > Hopefully this is a simple issue or just my ignorance on the results > returned by "df -k" but can anyone explain why the available block is > 0 if total 1k-blocks - Used is greater than 0? > > You have 5% reserved for root use only, this is 20GB on your current filesystem. See the -m option in 'man mke2fs' for details. tune2fs can adjust this if the filesystem is unmounted. > #df -k /ems/bigdisk/ > > Filesystem 1K-blocks Used > Available Use% Mounted on > > /dev/mapper/vg0-bigdisk 397367512 383562960 > 0 100% / > > Block count: 100925440 > > Reserved block count: 5046160 > > Free blocks: 3174088 > Block size: 4096 Above we see 5046160 x 4096 bytes are reserved. Jon From aj at dungeon.inka.de Tue Mar 6 06:28:44 2007 From: aj at dungeon.inka.de (Andreas Jellinghaus) Date: Tue, 06 Mar 2007 07:28:44 +0100 Subject: resume from swap files Message-ID: <20070306062847.4F25E22A910@dungeon.inka.de> Hi, the latest kernel supports swap files, so I guess the resume code also works with those. So I wonder: is this still a good idea with ext3? As far as I know there is no such thing as a "mount read-only" with journalling filesystems - ext3 when mounted will always detect that it is not clean and replay the journal etc. So what do you think? Is it ok to use ext3 with swap files and suspend/resume? Or is that a combination asking for trouble? (The filesystem would only be used as in "mount /; resume", i.e. no other write operations, but it needs to be mounted for resume to work.) Thanks for your advice.
Regards, Andreas From jlforrest at berkeley.edu Mon Mar 12 15:29:07 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 12 Mar 2007 08:29:07 -0700 Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition Table? Message-ID: <45F571C3.9090303@berkeley.edu> (I've already sent this message to Ted Ts'o directly. I should have sent it to this list first but I didn't know about it until today. My apologies to Ted.) Last Friday a system that I just inherited refused to mount a file system that had been working fine for about 6 months. This is on a Scientific Linux 4.3 system using a 2.6.9 kernel. This is another Linux distribution based on RHEL 4. I don't think the actual hardware is relevant here so I won't mention it. If there's more information you'd like to see I'd be happy to provide it. It turns out that this 4.2TB file system was created in an msdos partition table, as shown below:

----
GNU Parted 1.6.19
Using /dev/sdb
(parted) p
Disk geometry for /dev/sdb: 0.000-4291443.000 megabytes
Disk label type: msdos
Minor Start End Type Filesystem Flags
1 0.031 97137.567 primary ext3
----

Running fsck fails as shown below:

----
e2fsck 1.35 (28-Feb-2004)
The filesystem size (according to the superblock) is 1098609033 blocks
The physical size of the device is 24867209 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort? yes

Error reading block 24870914 (Invalid argument) while doing inode scan.
----

I have 2 questions: 1) How did this system run just fine for ~6 months using this file system as a /home? I'm suspecting that the problem actually occurred long ago when the file system allocated meta or user data in blocks that are somehow unreachable by fsck but exactly how this could have happened isn't clear. Although it's too late now, I'd really like to know what happened. 2) Given that this happened, how can I recover as many files as possible from this file system?
The professor who owns this system had put his faith in hardware RAID so he had never backed it up. He's very nervous right now. Any information or help you can provide would be very much appreciated. Cordially, Jon Forrest Unix Computing Support College of Chemistry Univ. of Cal. Berkeley 173 Tan Hall Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From ling at fnal.gov Mon Mar 12 20:05:21 2007 From: ling at fnal.gov (Ling C. Ho) Date: Mon, 12 Mar 2007 15:05:21 -0500 Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition Table? In-Reply-To: <45F571C3.9090303@berkeley.edu> References: <45F571C3.9090303@berkeley.edu> Message-ID: <45F5B281.5060403@fnal.gov> Can you recreate your sdb1 using parted, but specifying a different end size, or just use "-1"? And maybe try changing the label to "gpt"? Then run e2fsck -n and see what it does. I wonder how you were able to create a 4TB ext3 filesystem with the msdos label under SL4.3. It never worked for me without labelling it gpt. Jon Forrest wrote: > (I've already sent this message to Ted Ts'o directly. I should > have sent it to this list first but I didn't know about it > until today. My apologies to Ted.) > > Last Friday a system that I just inherited refused to mount > a file system that had been working fine for about 6 months. > This is on a Scientific Linux 4.3 system using a 2.6.9 > kernel. This is another Linux distribution based on RHEL 4. > I don't think the actual hardware is relevant > here so I won't mention it. If there's more information you'd > like to see I'd be happy to provide it.
> > It turns out that this 4.2TB file system was created in an > msdos partition table, as shown below: > > ---- > GNU Parted 1.6.19 > Using /dev/sdb > (parted) p > Disk geometry for /dev/sdb: 0.000-4291443.000 megabytes > Disk label type: msdos > Minor Start End Type Filesystem Flags > 1 0.031 97137.567 primary ext3 > ---- > > Running fsck fails as shown below: > > ---- > e2fsck 1.35 (28-Feb-2004) > The filesystem size (according to the superblock) is 1098609033 blocks > The physical size of the device is 24867209 blocks > Either the superblock or the partition table is likely to be corrupt! > Abort? yes > > Error reading block 24870914 (Invalid argument) while doing inode scan. > ---- > > I have 2 questions: > > 1) How did this system run just file for ~6 months using this > file system as a /home? I'm suspecting that the problem > actually occurred long ago when the file system allocated > meta or user data in blocks that are somehow unreachable > by fsck but exactly how this could have happened isn't > clear. Although it's too late now, I'd really like > to know what happened. > > 2) Given that this happened, how can I recover as many > files as possible from this file system? The professor > who owns this system had put his faith in hardware > RAID so he had never backed it up. He's very nervous > right now. > > Any information or help you can provide would be > very much appreciated. > > Cordially, > Jon Forrest > Unix Computing Support > College of Chemistry > Univ. of Cal. Berkeley > 173 Tan Hall > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users From jlforrest at berkeley.edu Mon Mar 12 21:00:26 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 12 Mar 2007 14:00:26 -0700 Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition Table? 
In-Reply-To: <45F5B281.5060403@fnal.gov> References: <45F571C3.9090303@berkeley.edu> <45F5B281.5060403@fnal.gov> Message-ID: <45F5BF6A.8000701@berkeley.edu> Ling C. Ho wrote: > Can u recreate your sdb1 using parted, but specifying a different end > size, or just use "-1" ? And maybe try changing the label to "gpt" ? > Then run e2fsck -n and see what it does. I'll add this to the small collection of suggestions. I clearly have to be very careful in what I do to restore this because I'll probably only have one chance. > I wonder how you were able to > create a 4TB ext3 filesystem with the msdos label under SL4.3. Never > worked for me without the labelling it gpt. There are two mysteries in my mind - 1) how the file system was allowed to be created, and 2) what was the exact scenario that caused the corruption, i.e. what is it about an msdos partition table that causes problems when a file system is >2TB. As for #1, I didn't create the file system. This is on a cluster that I recently took over managing. The file system was created before I started here. However, the person who did it is quite knowledgeable. Since it was done on a system running Scientific Linux 4.3, which is based on a fairly old kernel and tools, I'm wondering if the tools didn't recognize the dangerous configuration. Ted Ts'o was surprised to hear about this himself. Regarding #2, there are a number of places where very knowledgeable people describe the danger in creating >2TB file systems on msdos partition tables but I haven't seen an explanation of the fundamental problem. I would love to learn this (I'm not doubting that it's true).
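For what it's worth, both numbers in the e2fsck output quoted above are consistent with the usual explanation: an msdos (MBR) partition table records a partition's length as a 32-bit count of 512-byte sectors, which caps a partition at 2 TiB, and a larger size wraps around modulo 2^32 sectors. A sketch of the arithmetic (illustrative Python; the wrap-around reading of Jon's numbers is an inference from the quoted figures, not something stated in the thread):

```python
SECTOR = 512
BLOCK = 4096          # ext3 block size on this filesystem
WRAP = 2**32          # an MBR partition entry holds a 32-bit sector count

# The hard limit: 2^32 sectors of 512 bytes is exactly 2 TiB.
print(WRAP * SECTOR // 2**40)    # 2 (TiB)

# Figures from the e2fsck output quoted earlier in the thread.
sb_blocks = 1098609033           # size according to the superblock
dev_blocks = 24867209            # size e2fsck now sees for the device

# The visible device size is exactly the real size modulo 2^32 sectors.
sectors_per_block = BLOCK // SECTOR
wrapped = (sb_blocks * sectors_per_block) % WRAP
print(wrapped // sectors_per_block)   # 24867209 -- matches dev_blocks
```

That exact match (down to the block) is what makes the 32-bit sector-count explanation convincing, and also explains why parted reports the partition as ~97 GB.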
Cordially, -- Jon Forrest Unix Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From jlb17 at duke.edu Mon Mar 12 21:25:03 2007 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Mon, 12 Mar 2007 17:25:03 -0400 (EDT) Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition Table? In-Reply-To: <45F5BF6A.8000701@berkeley.edu> References: <45F571C3.9090303@berkeley.edu> <45F5B281.5060403@fnal.gov> <45F5BF6A.8000701@berkeley.edu> Message-ID: On Mon, 12 Mar 2007 at 2:00pm, Jon Forrest wrote > Regarding #2, there are a number of places where very knowledgeable > people describe the danger in creating >2TB file systems on msdos > partition tables but I haven't seen an explanation of the fundemental > problem. I would love to learn this (I'm not doubting that it's true). AIUI, msdos disk labels use a 32-bit integer to describe the length of a partition. 2^32 * 512-byte blocks = 2TiB. -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From bdavids1 at gmu.edu Mon Mar 12 21:40:48 2007 From: bdavids1 at gmu.edu (Brian Davidson) Date: Mon, 12 Mar 2007 17:40:48 -0400 Subject: e2fsck hanging Message-ID: I'm trying to run e2fsck on a ~6TB filesystem which is about 90% full. We're doing backups to disk to this filesystem, and have a number of hard links (link counts up to 90).
strace shows: write(1, "Pass 2: Checking ", 17) = 17 write(1, "directory", 9) = 9 write(1, " structure\n", 11) = 11 mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b4299dbd000 mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b429f512000 mmap(NULL, 506724352, PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = 0x2b42a4c67000 mmap(NULL, 596029440, PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) brk(0x23e56000) = 0x5eb000 mmap(NULL, 596164608, PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS| MAP_NORESERVE, -1, 0) = 0x2b430a09e000 munmap(0x2b430a09e000, 401408) = 0 munmap(0x2b430a200000, 647168) = 0 mprotect(0x2b430a100000, 135168, PROT_READ|PROT_WRITE) = 0 mmap(NULL, 596029440, PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) lseek(3, 6303744, SEEK_SET) = 6303744 read(3, "\2\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\f\0\2\2..\0\0\v\0\0\0 \24"..., 4096) = 4096 lseek(3, 6307840, SEEK_SET) = 6307840 read(3, "\v\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\364\17\2\2..\0\0\0\0\0"..., 4096) = 4096 lseek(3, 6311936, SEEK_SET) = 6311936 read(3, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(3, 6316032, SEEK_SET) = 6316032 read(3, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(3, 6320128, SEEK_SET) = 6320128 read(3, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(3, 41709568, SEEK_SET) = 41709568 read(3, "\323\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\324"..., 4096) = 4096 lseek(3, 41713664, SEEK_SET) = 41713664 read(3, "\324\0\0\0\f\0\1\2.\0\0\0\323\0\0\0\f\0\2\2..\0\0\214 \300"..., 4096) = 4096 lseek(3, 41717760, SEEK_SET) = 41717760 read(3, "\325\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\326"..., 4096) = 4096 And, that's it. 
No more output. A backtrace from gdb shows:

(gdb) bt
#0 0x0000000000418aa5 in get_icount_el (icount=0x5cf170, ino=732562070, create=1) at icount.c:251
#1 0x0000000000418dd7 in ext2fs_icount_increment (icount=0x5cf170, ino=732562070, ret=0x7fffffa79a96) at icount.c:339
#2 0x000000000040a3cf in check_dir_block (fs=0x5af560, db=0x2b7070cc6064, priv_data=0x7fffffa79c90) at pass2.c:1021
#3 0x0000000000416c69 in ext2fs_dblist_iterate (dblist=0x5c3f20, func=0x409980 , priv_data=0x7fffffa79c90) at dblist.c:234
#4 0x0000000000408d9d in e2fsck_pass2 (ctx=0x5ae700) at pass2.c:149
#5 0x0000000000403102 in e2fsck_run (ctx=0x5ae700) at e2fsck.c:193
#6 0x0000000000401e50 in main (argc=Variable "argc" is not available. ) at unix.c:1075

It's stuck inside the while loop in get_icount_el() (line 251). I've added more memory to the server (up to 6 GB now), and am re-running e2fsck. Additionally, I upped /proc/sys/vm/max_map_count to 20,000,000 (just pulled that number out of the air). It takes 6 or 7 hours to get to the part where it locks up, so I'm not sure if this is going to help or not. I figured while it's running I would post here to see if anyone has any additional insights. Thanks! Brian Davidson George Mason University From maxi.belino at gmail.com Mon Mar 12 22:44:56 2007 From: maxi.belino at gmail.com (Maxi Belino) Date: Mon, 12 Mar 2007 19:44:56 -0300 Subject: Error mounting Message-ID: <29180abb0703121544s627df8dev5495bf165a10cf90@mail.gmail.com> Hi all, i'm new on the list so i'm sorry if what i'm posting is off-topic or was already answered before. I'm having this problem: i've got an ext3 8GB partition and it doesn't mount. The cause of this: a user (yes, me!) running fsck.ext3 with the filesystem mounted, oops! (snif, forgive me!!, totally newbie and mad) Errors while booting:

EXT3-fs error (device hda4): ext3_check_descriptors: Block bitmap for group 0 not in group (block 41471)!
EXT3-fs: group descriptors corrupted
mount: error 22 mounting ext3 flags defaults

Well, retrying without the options flags it repeats this again twice; then:

pivotroot: pivot_root (/sysroot, /sysroot/initrd) failed: 2
umount /initrd/sys failed: 2
umount /initrd/proc failed: 2
Initrd finished
Freeing unused kernel memory: 240 K freed
Kernel panic - not syncing: No init found. Try passing init= option to kernel

and it freezes. Booting with Knoppix 3.2 it mounts all partitions but hda4; it gives this error:

mount: wrong fs type, bad option, bad superblock on /dev/hda4, or too many mounted file systems

I've already tested running dd_rhelp and it grabs an 8GB file without problems, but then i can't mount it (using mount -o loop ...). If there's a solution or any chance i can get data from this partition i would love to hear how; if i'm really fried i'm already prepared. regards, Maxi -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdavids1 at gmu.edu Tue Mar 13 04:04:47 2007 From: bdavids1 at gmu.edu (Brian Davidson) Date: Tue, 13 Mar 2007 00:04:47 -0400 Subject: e2fsck hanging In-Reply-To: References: Message-ID: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> Here's strace when running w/ 6GB of memory & with max_map_count set to 20000000. It looks like that got rid of the ENOMEM's from mmap, but it's still hanging in the same place...
write(1, "Pass 2: Checking ", 17) = 17 write(1, "directory", 9) = 9 write(1, " structure\n", 11) = 11 mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b1078c55000 mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b107e3aa000 mmap(NULL, 501645312, PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = 0x2b1083aff000 mmap(NULL, 588230656, PROT_READ|PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = 0x2b10a1967000 munmap(0x2b10a1967000, 588230656) = 0 lseek(5, 6303744, SEEK_SET) = 6303744 read(5, "\2\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\f\0\2\2..\0\0\v\0\0\0 \24"..., 4096) = 4096 lseek(5, 6307840, SEEK_SET) = 6307840 read(5, "\v\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\364\17\2\2..\0\0\0\0\0"..., 4096) = 4096 lseek(5, 6311936, SEEK_SET) = 6311936 read(5, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(5, 6316032, SEEK_SET) = 6316032 read(5, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(5, 6320128, SEEK_SET) = 6320128 read(5, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(5, 41709568, SEEK_SET) = 41709568 read(5, "\323\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\324"..., 4096) = 4096 lseek(5, 41713664, SEEK_SET) = 41713664 read(5, "\324\0\0\0\f\0\1\2.\0\0\0\323\0\0\0\f\0\2\2..\0\0\214 \300"..., 4096) = 4096 lseek(5, 41717760, SEEK_SET) = 41717760 read(5, "\325\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\326"..., 4096) = 4096 The backtrace seems to be essentially the same: (gdb) bt #0 0x0000000000418aa5 in get_icount_el (icount=0x5cf170, ino=732562070, create=1) at icount.c:251 #1 0x0000000000418dd7 in ext2fs_icount_increment (icount=0x5cf170, ino=732562070, ret=0x7fffffad6e06) at icount.c:339 #2 0x000000000040a3cf in check_dir_block (fs=0x5af560, db=0x2b1011a88064, priv_data=0x7fffffad7000) at pass2.c:1021 #3 0x0000000000416c69 in ext2fs_dblist_iterate (dblist=0x5c3f20, func=0x409980 , priv_data=0x7fffffad7000) at 
dblist.c:234 #4 0x0000000000408d9d in e2fsck_pass2 (ctx=0x5ae700) at pass2.c:149 #5 0x0000000000403102 in e2fsck_run (ctx=0x5ae700) at e2fsck.c:193 #6 0x0000000000401e50 in main (argc=Variable "argc" is not available. ) at unix.c:1075 #7 0x0000000000421161 in __libc_start_main () #8 0x000000000040018a in _start () #9 0x00007fffffad7508 in ?? () #10 0x0000000000000000 in ?? () Additional info: $ cat /etc/redhat-release Red Hat Enterprise Linux AS release 4 (Nahant Update 4) $ uname -a Linux XXXXX.gmu.edu 2.6.16 #1 SMP Mon Mar 27 16:56:51 EST 2006 x86_64 x86_64 x86_64 GNU/Linux $ e2fsck -V e2fsck 1.35 (28-Feb-2004) Using EXT2FS Library version 1.35, 28-Feb-2004 $ rpm -q e2fsprogs e2fsprogs-1.35-12.4.EL4 Brian Davidson George Mason University From adilger at clusterfs.com Tue Mar 13 07:04:33 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 13 Mar 2007 03:04:33 -0400 Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition Table? In-Reply-To: <45F571C3.9090303@berkeley.edu> References: <45F571C3.9090303@berkeley.edu> Message-ID: <20070313070433.GL5266@schatzie.adilger.int> On Mar 12, 2007 08:29 -0700, Jon Forrest wrote: > Last Friday a system that I just inherited refused to mount > a file system that had been working fine for about 6 months. > This is on a Scientific Linux 4.3 system using a 2.6.9 > kernel. This is another Linux distribution based on RHEL 4. > I don't think the actual hardware is relevant > here so I won't mention it. If there's more information you'd > like to see I'd be happy to provide it. > > ---- > e2fsck 1.35 (28-Feb-2004) > The filesystem size (according to the superblock) is 1098609033 blocks > The physical size of the device is 24867209 blocks > Either the superblock or the partition table is likely to be corrupt! > Abort? yes > > Error reading block 24870914 (Invalid argument) while doing inode scan. Did you recently update your kernel? Is your kernel using CONFIG_LBD? 
If CONFIG_LBD is not set, then any use of > 2TB is completely unsafe. It will silently and fatally corrupt your filesystem. I'd pointed this out previously, but the patch I submitted wasn't accepted. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From adilger at clusterfs.com Tue Mar 13 07:27:32 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 13 Mar 2007 03:27:32 -0400 Subject: e2fsck hanging In-Reply-To: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> Message-ID: <20070313072732.GP5266@schatzie.adilger.int> On Mar 13, 2007 00:04 -0400, Brian Davidson wrote: > Here's strace when running w/ 6GB of memory & with max_map_count set > to 20000000. It looks like that got rid of the ENOMEM's from mmap, > but it's still hanging in the same place... > > The backtrace seems to be essentially the same: > > (gdb) bt > #0 0x0000000000418aa5 in get_icount_el (icount=0x5cf170, > ino=732562070, create=1) at icount.c:251 > #1 0x0000000000418dd7 in ext2fs_icount_increment (icount=0x5cf170, > ino=732562070, ret=0x7fffffad6e06) > at icount.c:339 > #2 0x000000000040a3cf in check_dir_block (fs=0x5af560, > db=0x2b1011a88064, priv_data=0x7fffffad7000) at pass2.c:1021 > #3 0x0000000000416c69 in ext2fs_dblist_iterate (dblist=0x5c3f20, > func=0x409980 , > priv_data=0x7fffffad7000) at dblist.c:234 > #4 0x0000000000408d9d in e2fsck_pass2 (ctx=0x5ae700) at pass2.c:149 > #5 0x0000000000403102 in e2fsck_run (ctx=0x5ae700) at e2fsck.c:193 > #6 0x0000000000401e50 in main (argc=Variable "argc" is not available. The icount implementation assumes that the number of hard-linked files is very low in comparison to the number of singly-linked files. It uses a linear list to look up the hard-linked inodes. I suspect it needs some algorithm lovin' to make it into a hash table (possibly multi-level) if the number of links becomes too large in a given bucket. 
We could consider the common case to be a single hash bucket if that makes the code simpler and more efficient. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From adilger at clusterfs.com Tue Mar 13 07:38:09 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 13 Mar 2007 03:38:09 -0400 Subject: Error mounting In-Reply-To: <29180abb0703121544s627df8dev5495bf165a10cf90@mail.gmail.com> References: <29180abb0703121544s627df8dev5495bf165a10cf90@mail.gmail.com> Message-ID: <20070313073809.GR5266@schatzie.adilger.int> On Mar 12, 2007 19:44 -0300, Maxi Belino wrote: > I'm having this problem; i've got an ext3 8GB partition and it doesn't > mount, the cause of this: a user (yes me!) running fsck.ext3 with the > filesystem mounted, ups! (snif, forgive me!!, totally newbie and mad) e2fsprogs should not allow you to run e2fsck while the filesystem is mounted. > If there's a solution or any chance i can get data from this partition i > would love to hear how, if i'm really fried i'm already prepared. Try e2fsck with a backup superblock (-b), not sure what else to try. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From tytso at mit.edu Tue Mar 13 13:53:27 2007 From: tytso at mit.edu (Theodore Tso) Date: Tue, 13 Mar 2007 09:53:27 -0400 Subject: e2fsck hanging In-Reply-To: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> Message-ID: <20070313135326.GA7362@thunk.org> At a first glance your report looks vaguely like this bugreport: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838 I've been crazy busy the last few weeks so I haven't had a chance to look at it yet. There is a suggested fix in the above bug report, but not a patch, and I haven't had time to validate it yet. 
Regards, - Ted From bdavids1 at gmu.edu Tue Mar 13 14:59:43 2007 From: bdavids1 at gmu.edu (Brian Davidson) Date: Tue, 13 Mar 2007 10:59:43 -0400 Subject: e2fsck hanging In-Reply-To: <20070313135326.GA7362@thunk.org> References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> <20070313135326.GA7362@thunk.org> Message-ID: <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu> On Mar 13, 2007, at 9:53 AM, Theodore Tso wrote: > At a first glance your report looks vaguely like this bugreport: > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838 > > I've been crazy busy the last few weeks so I haven't had a chance to > look at it yet. There is a suggested fix in the above bug report, but > not a patch, and I haven't had time to validate it yet. > > Regards, > > - Ted Yes, that's the same issue. We reduced it to a floating-point precision issue too:

#include <stdio.h>

int main(void)
{
	float range;
	unsigned int ino = 732562070, lowval = 2, highval = 732562081;
	int high = 57402135, low = 0;	/* search bounds, for context */

	range = ((float) (ino - lowval)) / (highval - lowval);
	printf("range=%f\n", range);
	return 0;
}

It outputs 1.0, rather than .99999... We're trying the suggested fix from the bug report. It'll take about 6 hours or so to get to that point. Here's specifically what we're doing:

--- e2fsprogs-1.39/lib/ext2fs/icount.c	2005-09-06 05:40:14.000000000 -0400
+++ e2fsprogs-1.39-test/lib/ext2fs/icount.c	2007-03-13 10:56:19.000000000 -0400
@@ -251,6 +251,10 @@
 			range = ((float) (ino - lowval)) / (highval - lowval);
 			mid = low + ((int) (range * (high-low)));
+			if (mid > high)
+				mid = high;
+			if (mid < low)
+				mid = low;
 		}
 #endif
 		if (ino == icount->list[mid].ino) {

From jlforrest at berkeley.edu Tue Mar 13 15:43:43 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 13 Mar 2007 08:43:43 -0700 Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition Table?
In-Reply-To: <20070313070433.GL5266@schatzie.adilger.int> References: <45F571C3.9090303@berkeley.edu> <20070313070433.GL5266@schatzie.adilger.int> Message-ID: <45F6C6AF.6080709@berkeley.edu> Andreas Dilger wrote: > Did you recently update your kernel? No. The system had been running for months. > Is your kernel using CONFIG_LBD? Yes. Jon From bdavids1 at gmu.edu Wed Mar 14 00:32:44 2007 From: bdavids1 at gmu.edu (Brian Davidson) Date: Tue, 13 Mar 2007 20:32:44 -0400 Subject: e2fsck hanging In-Reply-To: <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu> References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> <20070313135326.GA7362@thunk.org> <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu> Message-ID: <65B0B3F4-4231-473B-9594-6BF8BCEFB6DA@gmu.edu> This patch does the trick.

> --- e2fsprogs-1.39/lib/ext2fs/icount.c	2005-09-06 05:40:14.000000000 -0400
> +++ e2fsprogs-1.39-test/lib/ext2fs/icount.c	2007-03-13 10:56:19.000000000 -0400
> @@ -251,6 +251,10 @@
>  			range = ((float) (ino - lowval)) / (highval - lowval);
>  			mid = low + ((int) (range * (high-low)));
> +			if (mid > high)
> +				mid = high;
> +			if (mid < low)
> +				mid = low;
>  		}
>  #endif
>  		if (ino == icount->list[mid].ino) {

Our inode count is 732,577,792 on a 5.4 TB filesystem with 5.0 TB in use (94% use). It took about 9 hours to run, and used over 4GB of memory.
From jss at ast.cam.ac.uk Wed Mar 14 09:17:16 2007 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Wed, 14 Mar 2007 09:17:16 +0000 Subject: e2fsck hanging References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> <20070313135326.GA7362@thunk.org> <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu> Message-ID: Brian Davidson wrote: > --- e2fsprogs-1.39/lib/ext2fs/icount.c 2005-09-06 05:40:14.000000000 > -0400 > +++ e2fsprogs-1.39-test/lib/ext2fs/icount.c 2007-03-13 > 10:56:19.000000000 -0400 > @@ -251,6 +251,10 @@ > range = ((float) (ino - lowval)) / > (highval - lowval); > mid = low + ((int) (range * (high-low))); > + if (mid > high) > + mid = high; > + if (mid < low) > + mid = low; > } > #endif > if (ino == icount->list[mid].ino) { I'm happy to report this patch solved the fsck hanging problem I reported a few weeks ago. Jeremy -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From jlforrest at berkeley.edu Wed Mar 14 21:07:55 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Wed, 14 Mar 2007 14:07:55 -0700 Subject: Solution to Corrupt >2TB Filesystem in MSDOS Partition Table In-Reply-To: <20070313070433.GL5266@schatzie.adilger.int> References: <45F571C3.9090303@berkeley.edu> <20070313070433.GL5266@schatzie.adilger.int> Message-ID: <45F8642B.5080908@berkeley.edu> Thanks to Ted and several others, I was able to recover 100% of the corrupted file system that I posted about last week. (This was an >2TB ext3 file system that had been created in a MSDOS partition which had worked until the server was rebooted, at which time it wouldn't mount and fsck wouldn't fix the problem.) Based on the suggestions of various people here's what I did: 1) Upgraded to the latest version of GNU parted. The server is running Scientific Linux 4.3, a RHEL4 derived distribution with a 2.6.9 kernel. This distribution contained parted 1.6.19 whereas the latest release was 1.8.2. 
2) Using parted 1.8.2, I removed the partition containing the corrupt file system. This was the only partition on the disk. 3) I then used the parted "rescue" command to recreate the partition. I gave it the original starting point as the start value and "-1s" as the ending value. After this, I was able to mount the file system as before, and all the files were there. The first thing I did was to copy the whole file system to another disk, which completed without any errors. I have to admit that I don't fully understand why this worked. Clearly the combination of removing the partition and then rescuing it reset something that was fouling up the works before. Anyway, we're all very happy about this and we all appreciate the help we received from this list and elsewhere. I hope we'll be able to help you one day. Cordially, -- Jon Forrest Unix Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From adilger at clusterfs.com Wed Mar 14 14:57:31 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Wed, 14 Mar 2007 10:57:31 -0400 Subject: e2fsck hanging In-Reply-To: References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> <20070313135326.GA7362@thunk.org> <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu> Message-ID: <20070314145731.GB5513@schatzie.adilger.int> On Mar 14, 2007 09:17 +0000, Jeremy Sanders wrote:

> > --- e2fsprogs-1.39/lib/ext2fs/icount.c	2005-09-06 05:40:14.000000000 -0400
> > +++ e2fsprogs-1.39-test/lib/ext2fs/icount.c	2007-03-13 10:56:19.000000000 -0400
> > @@ -251,6 +251,10 @@
> >  			range = ((float) (ino - lowval)) / (highval - lowval);
> >  			mid = low + ((int) (range * (high-low)));
> > +			if (mid > high)
> > +				mid = high;
> > +			if (mid < low)
> > +				mid = low;
> >  		}
> >  #endif
> >  		if (ino == icount->list[mid].ino) {
>
> I'm happy to report this patch solved the fsck hanging problem I reported a
> few weeks ago.
Any real reason we don't change this to a double instead of a float? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From jss at ast.cam.ac.uk Thu Mar 15 09:36:36 2007 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Thu, 15 Mar 2007 09:36:36 +0000 Subject: e2fsck hanging References: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu> <20070313135326.GA7362@thunk.org> <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu> <20070314145731.GB5513@schatzie.adilger.int> Message-ID: Andreas Dilger wrote: > Any real reason we don't change this to a double instead of a float? Presumably that would make it less likely to happen, not get rid of the problem completely, although on a real filesystem the issue may never happen with a double. It's probably a reasonable idea to change to a double, but also check for the bounding issues. Jeremy -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From lakshmipathi.g at gmail.com Thu Mar 15 14:25:47 2007 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Thu, 15 Mar 2007 19:55:47 +0530 Subject: How to name it? Message-ID: hi all, The reason I'm writing this mail is that I don't know how to name a tool I've written :-) Following is the functionality of the file system tool: When you install the tool, it acts as protection for your files. The tool copies the addresses of files. If you accidentally delete a file - and its contents have not been modified - the tool retrieves the contents of the file. What should I call this tool? Saying "file recovery" is somewhat misleading (I got criticised for calling it a recovery tool) because it doesn't recover files deleted before the tool's installation. It can't be a backup tool, since the tool backs up only the address of a file and not the file itself. Is there any other similar tool out there? Thanks. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From samuel at bcgreen.com Fri Mar 16 05:31:08 2007 From: samuel at bcgreen.com (Stephen Samuel) Date: Thu, 15 Mar 2007 22:31:08 -0700 Subject: How to name it? In-Reply-To: References: Message-ID: <6cd50f9f0703152231v4af0f0e8v84bd34b4eb9fef3c@mail.gmail.com> It's an undelete tool... Although it only allows you to undelete files deleted since its installation, it still allows undeletion of files deleted while it is working. On 3/15/07, lakshmi pathi wrote: > hi all, > The reason why writting this mail--i don't know how to name a tool written > by myself :-) From mats_a at MIT.EDU Sun Mar 18 01:42:17 2007 From: mats_a at MIT.EDU (Mats Ahlgren) Date: Sat, 17 Mar 2007 21:42:17 -0400 Subject: Frequent metadata corruption with ext3 + hard power-off Message-ID: <200703172142.17868.mats_a@mit.edu> Hello. I'm having serious issues with ext3; any insight would be greatly appreciated: _____ Overview: I believe ext3 is supposed to be recoverable in the case of a power failure by replaying the log. However, on two separate computers (running different operating systems too), this has been anything but the case. _____ Specifics: Sometimes, my kernel will hard-freeze and I'll have to do a hard reboot. When this happens, sometimes fsck will insist on running and find some orphaned inodes, which it will proceed to put in the /lost+found directory. This is unacceptable: The last time this happened, random files in my operating system were plucked from the file system and stuffed in lost+found, corrupting the OS and forcing a reinstall. Another time, files I had recently moved (a final project) a minute before the crash were orphaned and put in the lost+found, effectively destroying it. Why should a lost+found folder even be necessary when the file hierarchy is guaranteed to be consistent? In response to these problems, I changed the ext3 journaling mode to "journal" rather than "ordered" (frankly it seems deeply disturbing that "ordered" is the default).
Since then, I've once had to hard-reboot and yet again found files in the /lost+found folder. Might anyone know why ext3 is not fulfilling its promise of an always-consistent file system? _____ Other interacting issues: I'm running RAID1 (mirroring) on one computer, but I've had the same issues on another computer without RAID. (In response to "you shouldn't hard-reboot your computer": I realize that most computers are not meant to be hard-rebooted, but I don't have a sysrq key and xmodmapping it has been difficult. I also realize that kernels shouldn't crash, but what's a person to do if the computer doesn't respond to ctrl-alt-f1 and doesn't leave any messages in the logs...) (In response to "maybe your drive is defective": This is not a problem with a defective drive; I've tried multiple drives.) (In response to "you should backup your data": Periodic backups clearly help, but it's ridiculous to restore a system from backup every week because a hard-freeze corrupted your filesystem...) Any insight would be greatly appreciated. These problems have been making me look for other file systems (such as zfs, which unfortunately I can't use to boot; or reiser4, which also makes a filesystem-is-always-consistent guarantee); I would prefer to use ext3, but I've never had these sorts of problems with old Mac OS, OS X, or Windows. Thank you, Mats From tytso at mit.edu Sun Mar 18 13:33:59 2007 From: tytso at mit.edu (Theodore Tso) Date: Sun, 18 Mar 2007 09:33:59 -0400 Subject: Frequent metadata corruption with ext3 + hard power-off In-Reply-To: <200703172142.17868.mats_a@mit.edu> References: <200703172142.17868.mats_a@mit.edu> Message-ID: <20070318133359.GA31914@thunk.org> It sounds like you have a disk which is doing very aggressive write caching. If you are using a new enough kernel (2.6.9 or greater should have this), adding "barrier=1" to your mount options should help. We should probably make this the default at this point... 
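[For readers hitting the same symptom: the change Ted suggests is a one-line mount-option tweak. A sketch — device name and mount point here are illustrative, and remounting requires root:]

```shell
# Remount a running ext3 filesystem with write barriers enabled:
mount -o remount,barrier=1 /

# Or persistently, via the options field in /etc/fstab, e.g.:
#   /dev/hda2  /  ext3  defaults,barrier=1  1 1
```

Barriers force the drive's write cache to be flushed in the right order relative to journal commits, which is exactly what an aggressively caching disk otherwise defeats.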
- Ted From ahlist at gmail.com Mon Mar 19 21:15:59 2007 From: ahlist at gmail.com (ahlist) Date: Mon, 19 Mar 2007 17:15:59 -0400 Subject: rebooting more often to stop fsck problems and total disk loss Message-ID: Hi, I run several hundred servers that are used heavily (webhosting, etc.) all day long. Quite often we'll have a server that either needs a really long fsck (10 hours - 200 gig drive) or an fsck that eventually results in everything going to lost+found (pretty much a total loss). Would rebooting these servers monthly (or some other frequency) stop this? Is it correct to visualize this as small errors compounding over time, and thus more frequent reboots would allow quick fsck's to fix the errors before they become huge? (OS is redhat 7.3 and el3) Thanks for any input! From adilger at clusterfs.com Mon Mar 19 21:27:19 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Mon, 19 Mar 2007 15:27:19 -0600 Subject: rebooting more often to stop fsck problems and total disk loss In-Reply-To: References: Message-ID: <20070319212719.GF5967@schatzie.adilger.int> On Mar 19, 2007 17:15 -0400, ahlist wrote: > Quite often we'll have a server that either needs a really long fsck > (10 hours - 200 gig drive) or an fsck that evntually results in > everything going to lost+found (pretty much a total loss). Strange. We get 1TB/hr fscks these days unless the filesystem is completely corrupted and has a lot of duplicate blocks. > Would rebooting these servers monthly (or some other frequency) stop this? What's also important is that when you do an fsck you run it with "-f" to actually check the filesystem instead of just the superblock. e2fsck will only do a full e2fsck if the kernel detected disk corruption, OR if the "last checked" time is > 6 months or {20 < X < 40} mounts have happened since the last check time. See tune2fs(8) for details.
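[The check policy Andreas describes can be inspected and adjusted with tune2fs. A hedged sketch — the device name is illustrative and the commands need root:]

```shell
# Show the current mount count, maximum mount count, and check interval:
tune2fs -l /dev/sda1 | grep -iE 'mount count|check'

# Force a full check every 20 mounts or every month, whichever comes first:
tune2fs -c 20 -i 1m /dev/sda1

# One-off full check at the next reboot (RHEL-era init scripts):
touch /forcefsck
```

With a schedule like this, corruption gets caught while it is still a quick fix rather than after it has compounded.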
> Is it correct to visualize this as small errors compounding over time > thus more frequent reboots would allow quick fsck's to fix the errors > before they become huge? That is definitely true. If the bitmaps get corrupted, then this will spread corruption throughout the filesystem. > (OS is redhat 7.3 and el3) I would instead suggest updating to a newer kernel (e.g. RHEL4 2.6.9) as this has fixed a LOT of bugs in ext3. Also, make sure you are using the newest e2fsck available, as some bugs have been fixed there also. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From rjackson at mason.gmu.edu Tue Mar 20 13:44:07 2007 From: rjackson at mason.gmu.edu (Richard Jackson) Date: Tue, 20 Mar 2007 09:44:07 -0400 (EDT) Subject: e2fsck hanging Message-ID: <200703201344.l2KDi8u5017035@mason.gmu.edu> There are a few issues with the get_icount_el() code. First, a simple binary search may be sufficient. Also, we now know the float type is not sufficient to handle the large or small values handled by this code. One problem with using float is that it does not have the precision to divide two sufficiently large numbers with a small enough difference. The other issue is the float value approximation that causes 'mid' to be larger than 'high'. The approximation is due to float's single-precision 23-bit mantissa. Values up to integer 16,777,216 are handled as expected, but starting at 16,777,217 the least significant bits are truncated, producing an approximation. The approximation could be more or less than what is expected. This is a feature of using float. The double type (IEEE 754 double-precision, 64 bit) provides a 52-bit mantissa to play with. That is a large number. Since the e2fsck code must handle large numbers, the float type should be used with caution.
Reference http://steve.hollasch.net/cgindex/coding/ieeefloat.html http://en.wikipedia.org/wiki/IEEE_754 From tytso at mit.edu Tue Mar 20 22:59:20 2007 From: tytso at mit.edu (Theodore Tso) Date: Tue, 20 Mar 2007 18:59:20 -0400 Subject: e2fsck hanging In-Reply-To: <200703201344.l2KDi8u5017035@mason.gmu.edu> References: <200703201344.l2KDi8u5017035@mason.gmu.edu> Message-ID: <20070320225920.GA10134@thunk.org> On Tue, Mar 20, 2007 at 09:44:07AM -0400, Richard Jackson wrote: > There are are few issues with the get_icount_el() code. First a simple > binary search may be sufficient. Also, We now know the float type is > not sufficient to handle the large or small values handled by this > code. One problem with using float is it does not have the precision > to divide two sufficently large numbers with a small enough > difference. The other issue is with float value approximation that > causes 'mid' to be larger than 'high'. The approximation is due to > float single-precision 23 bit mantissa. Values up to integer > 16,777,215 are handled as expected but starting at 16,777,216 the least > significant bits are truncated producing an approximation. The > approximation could be more or less than what is expected. This is a > feature of using float. Double type for IEEE 754 double-precision 64 > bit provides a 52 bit mantissa to play with. That is a large number. Well, keep in mind that the float is just an optimization on top of a simple binary search. So it doesn't have to be precise; an approximation is fine, except when mid ends up being larger than high. But it's simple enough to catch that particular case where the division goes to 1 instead of 0.99999 as we might expect. Catching that should be enough, I expect.
- Ted From bdavids1 at gmu.edu Tue Mar 20 23:53:24 2007 From: bdavids1 at gmu.edu (Brian Davidson) Date: Tue, 20 Mar 2007 19:53:24 -0400 Subject: e2fsck hanging In-Reply-To: <20070320225920.GA10134@thunk.org> References: <200703201344.l2KDi8u5017035@mason.gmu.edu> <20070320225920.GA10134@thunk.org> Message-ID: <9409CCD0-3AB9-48BF-A3D7-7CA353E70CA6@gmu.edu> On Mar 20, 2007, at 6:59 PM, Theodore Tso wrote: > Well, keep in mind that the float is just as an optimization to doing > a simple binary search. So it doesn't have to be precise; an > approximation is fine, except when mid ends up being larger than high. > But it's simple enough to catch that particular case where the > division going to 1 instead of 0.99999 as we might expect. Catching > that should be enough, I expect. > > - Ted With a float, you're still trying to cram 32 bits into a 24 bit mantissa (23 bits + implicit bit). If nothing else, the float should get changed to a double which has a 53 bit mantissa (52 + implicit bit). Just catching the case where division goes to one causes it to do a linear search. Given that this only occurs on really big filesystems, that's probably not what you want to do... Brian From armangau_philippe at emc.com Wed Mar 21 17:18:10 2007 From: armangau_philippe at emc.com (armangau_philippe at emc.com) Date: Wed, 21 Mar 2007 13:18:10 -0400 Subject: Ext3 behavior on power failure Message-ID: Hi all, We are building a new system which is going to use ext3 FS. We would like to know more about the behavior of ext3 in the case of failure. But before I proceed, I would like to share more information about our future system. * Our application always does an fsync on files * When symbolic links (more specifically, fast symlinks) are created, the host directory is also fsync'ed. * Our application is also going to front an EMC disk array configured using RAID5 or RAID6. * We will be using multipathing so that we can assume that no disk errors will be reported.
In this context, we would like to know the following for recovery after a power outage: 1. When will an fsck have to be run (not counting the scheduled fsck every N-mounts)? 2. In the case of a crash, are the fsync-ed file contents and symbolic links safe no matter what? Thanks, Philippe Armangau Centera Software Group Consultant Software Engineer EMC² Where Information Lives * Office: 508-249-5575 (toll free 877-362-2887 x45475) * Cell: 978-760-0485 * Fax: 508-249-5495 * E-mail: armangau_philippe at emc.com From skye0507 at yahoo.com Wed Mar 21 23:51:56 2007 From: skye0507 at yahoo.com (brian stone) Date: Wed, 21 Mar 2007 16:51:56 -0700 (PDT) Subject: EXT2 vs. EXT3: mount w/sync or fdatasync Message-ID: <221628.39405.qm@web59005.mail.re1.yahoo.com> My application always needs to sync file data after writing. I don't want anything hanging around in the kernel buffers. I am wondering what is the best method to accomplish this. 1. Do I use EXT2 and use fdatasync() or fsync()? 2. Do I use EXT2 and mount with the "sync" option? 3. Do I use EXT2 and use the O_DIRECT flag on open()? 4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers? 5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary. Thanks in advance --------------------------------- The fish are biting. Get more visitors on your site using Yahoo! Search Marketing. -------------- next part -------------- An HTML attachment was scrubbed... URL: From adilger at clusterfs.com Thu Mar 22 04:14:24 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Wed, 21 Mar 2007 22:14:24 -0600 Subject: EXT2 vs.
EXT3: mount w/sync or fdatasync In-Reply-To: <221628.39405.qm@web59005.mail.re1.yahoo.com> References: <221628.39405.qm@web59005.mail.re1.yahoo.com> Message-ID: <20070322041424.GM5967@schatzie.adilger.int> On Mar 21, 2007 16:51 -0700, brian stone wrote: > My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this. > 4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers? > > 5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary. In theory, ext3 + data=journal will give you the best performance, because sync IO will always be linear IO to the journal. Unless your filesystem is constantly busy, then the writes to the filesystem can happen asynchronously after being committed to the journal without danger of being lost. That said, nothing better than benchmarking your app with different filesystem options to see which one is best. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From skye0507 at yahoo.com Thu Mar 22 11:51:00 2007 From: skye0507 at yahoo.com (brian stone) Date: Thu, 22 Mar 2007 04:51:00 -0700 (PDT) Subject: EXT2 vs. EXT3: mount w/sync or fdatasync In-Reply-To: <20070322041424.GM5967@schatzie.adilger.int> Message-ID: <823230.44351.qm@web59009.mail.re1.yahoo.com> >>nothing better than benchmarking your app with different IO performance is always a consideration, but for this application reliability is much more important. I am looking for the most reliable way of dumping files to disk. When I call close(), I need to know that the data is on disk. It doesn't need to be the highest performance method, just the most reliable.
>>Unless your filesystem is constantly busy It is constantly busy. Each file system manages around 10 million files across a TB. Each day, an average of 500,000 files totaling 100G are thrown away while the same amount is generated. It's a constant cycle. The point is, these are very active file systems. I have already seen EXT3 corrupt its superblock(s) after a disk failure, using data=ordered. Trying different superblocks didn't work, maybe -O sparse_super isn't the best idea. No merit in EXT2 with fdatasync calls? thanks for the response. Andreas Dilger wrote: On Mar 21, 2007 16:51 -0700, brian stone wrote: > My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this. > 4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers? > > 5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary. In theory, ext3 + data=journal will give you the best performance, because sync IO will always be linear IO to the journal. Unless your filesystem is constantly busy, then the writes to the filesystem can happen asynchronously after being committed to the journal without danger of being lost. That said, nothing better than benchmarking your app with different filesystem options to see which one is best. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. --------------------------------- Don't be flakey. Get Yahoo! Mail for Mobile and always stay connected to friends. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From skye0507 at yahoo.com Thu Mar 22 11:58:50 2007 From: skye0507 at yahoo.com (brian stone) Date: Thu, 22 Mar 2007 04:58:50 -0700 (PDT) Subject: EXT2 vs. EXT3: mount w/sync or fdatasync In-Reply-To: <823230.44351.qm@web59009.mail.re1.yahoo.com> Message-ID: <957664.74539.qm@web59015.mail.re1.yahoo.com> >>I have already seen EXT3 corrupt its superblock(s) after a disk failure, using data=ordered. Not sure why this post printed data="". I was using ordered mode, the default. thanks brian stone wrote: >>nothing better than benchmarking your app with different IO performance is always a consideration, but for this application reliability is much more important. I am looking for the most reliable way of dumping files to disk. We I call close(), I need to know that the data is one disk. It doesn't need to be the highest performance method, just the most reliable. >>Unless your filesystem is constantly busy It is constantly busy. Each file system manages around 10 millions files across a TB. Each day, an average of 500,000 files totaling 100G are throw away while the same amount is generated. Its a constant cycle. The point is, these are very active file systems. I have already seen EXT3 corrupt its superblock(s) after a disk failure, using data=ordered. Trying different superblocks didn't work, maybe -O sparse_super isn't the best idea. No merit in EXT2 with fdatasync calls? thanks for the response. Andreas Dilger wrote: On Mar 21, 2007 16:51 -0700, brian stone wrote: > My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this. > 4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers? > > 5. Since I will always be syncing the data, does it make any sense to use EXT3? 
It feels like the EXT3 journal would be unnecessary. In theory, ext3 + data=journal will give you the best performance, because sync IO will always be linear IO to the journal. Unless your filesystem is constantly busy, then the writes to the filesystem can happen asynchronously after being committed to the journal without danger of being lost. That said, nothing better than benchmarking your app with different filesystem options to see which one is best. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. --------------------------------- Don't be flakey. Get Yahoo! Mail for Mobile and always stay connected to friends._______________________________________________ Ext3-users mailing list Ext3-users at redhat.com https://www.redhat.com/mailman/listinfo/ext3-users --------------------------------- The fish are biting. Get more visitors on your site using Yahoo! Search Marketing. -------------- next part -------------- An HTML attachment was scrubbed... URL: From skye0507 at yahoo.com Fri Mar 23 03:44:40 2007 From: skye0507 at yahoo.com (brian stone) Date: Thu, 22 Mar 2007 20:44:40 -0700 (PDT) Subject: EXT2 vs. EXT3: mount w/sync or fdatasync In-Reply-To: <20070322041424.GM5967@schatzie.adilger.int> Message-ID: <810328.85867.qm@web59007.mail.re1.yahoo.com> Ran some performance tests as suggested. Machine A connects to machine B on a gigabit LAN. Machine A sends 1024 1MB chunks of data; 1 GB in total. Machine B, the server, reads in the MB and writes it to a file. NOTE: server and client are little test programs written in C. Machine B (Server) hardware: - Single (no raid) Seagate Cheetah 70G Ultra320 15K - Quad Opteron 870 - 16G DDR400 - Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8) Sync methods include: 1. mount with sync option - tried sync,dirsync which added no additional overhead 2. use O_SYNC open() flag 3.
use fdatasync() just before closing the file - fsync() and fdatasync() produced the same results

EXT2 tests
==========================================
No sync         12.3 seconds (83 MB/Sec)
mount=sync      44.3 seconds (23 MB/Sec)
O_SYNC          31.7 seconds (32 MB/Sec)
fdatasync()     31.3 seconds (32 MB/Sec)

EXT3 tests
===========================================
No sync  data=writeback    14.5 seconds (70 MB/Sec)
No sync  data=ordered      17 seconds   (60 MB/Sec)
No sync  data=journal      65 seconds   (15 MB/Sec)
data=ordered O_SYNC        49 seconds   (20 MB/Sec)
data=ordered,sync          52 seconds   (19 MB/Sec)
data=ordered fdatasync()   45.5 seconds (22 MB/Sec)
data=journal O_SYNC        72.5 seconds (14 MB/Sec)
data=journal,sync          81 seconds   (12 MB/Sec)
data=journal fdatasync()   60.5 seconds (17 MB/Sec)

thanks Andreas Dilger wrote: On Mar 21, 2007 16:51 -0700, brian stone wrote: > My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this. > 4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers? > > 5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary. In theory, ext3 + data=journal will give you the best performance, because sync IO will always be linear IO to the journal. Unless your filesystem is constantly busy, then the writes to the filesystem can happen asynchronously after being committed to the journal without danger of being lost. That said, nothing better than benchmarking your app with different filesystem options to see which one is best. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. --------------------------------- No need to miss a message. Get email on-the-go with Yahoo! Mail for Mobile.
From skye0507 at yahoo.com Fri Mar 23 03:50:38 2007
From: skye0507 at yahoo.com (brian stone)
Date: Thu, 22 Mar 2007 20:50:38 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <810328.85867.qm@web59007.mail.re1.yahoo.com>
Message-ID: <546593.25102.qm@web59015.mail.re1.yahoo.com>

Why does this forum convert the right side of an equal sign to ""???

Test results reformatted:

EXT2 tests
==========================================
No sync        12.3 seconds (83 MB/Sec)
sync           44.3 seconds (23 MB/Sec)
O_SYNC         31.7 seconds (32 MB/Sec)
fdatasync()    31.3 seconds (32 MB/Sec)

EXT3 tests
===========================================
No sync writeback    14.5 seconds (70 MB/Sec)
No sync ordered      17 seconds (60 MB/Sec)
No sync journal      65 seconds (15 MB/Sec)
ordered O_SYNC       49 seconds (20 MB/Sec)
ordered,sync         52 seconds (19 MB/Sec)
ordered fdatasync()  45.5 seconds (22 MB/Sec)
journal O_SYNC       72.5 seconds (14 MB/Sec)
journal,sync         81 seconds (12 MB/Sec)
journal fdatasync()  60.5 seconds (17 MB/Sec)

From adilger at clusterfs.com Fri Mar 23 06:18:40 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 23 Mar 2007 00:18:40 -0600
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <810328.85867.qm@web59007.mail.re1.yahoo.com>
References: <20070322041424.GM5967@schatzie.adilger.int> <810328.85867.qm@web59007.mail.re1.yahoo.com>
Message-ID: <20070323061840.GC5967@schatzie.adilger.int>

On Mar 22, 2007 20:44 -0700, brian stone wrote:
> Machine A connects to machine B on a gigabit LAN. Machine A sends
> 1024 1MB chunks of data; 1 GB in total. Machine B, the server, reads
> in each MB and writes it to a file.
>
> NOTE: server and client are little test programs written in C.
>
> Machine B (Server) hardware:
> - Single (no raid) Seagate Cheetah 70G Ultra320 15K
> - Quad Opteron 870
> - 16G DDR400
> - Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8)
>
> Sync methods include:
> 1. mount with sync option
>    - tried sync,dirsync which added no additional overhead
> 2. use O_SYNC open() flag
> 3. use fdatasync() just before closing the file
>    - fsync() and fdatasync() produced the same results
>
> EXT2 tests
> ==========================================
> No sync        12.3 seconds (83 MB/Sec)
> mount=sync     44.3 seconds (23 MB/Sec)
> O_SYNC         31.7 seconds (32 MB/Sec)
> fdatasync()    31.3 seconds (32 MB/Sec)
>
> EXT3 tests
> ===========================================
> No sync data=writeback    14.5 seconds (70 MB/Sec)
> No sync data=ordered      17 seconds (60 MB/Sec)
> No sync data=journal      65 seconds (15 MB/Sec)
> data=ordered O_SYNC       49 seconds (20 MB/Sec)
> data=ordered,sync         52 seconds (19 MB/Sec)
> data=ordered fdatasync()  45.5 seconds (22 MB/Sec)
> data=journal O_SYNC       72.5 seconds (14 MB/Sec)
> data=journal,sync         81 seconds (12 MB/Sec)
> data=journal fdatasync()  60.5 seconds (17 MB/Sec)

If you are doing a large number of 1MB writes then I agree that data=journal is probably not the way to go, because it means you can get at most 1/2 of the bandwidth of the disk (unless you create the journal on a separate disk). data=journal is good for small writes and lots of transactions, like mail servers that need lots of sync operations.

For large writes, I'd suggest you put the journal on a separate device, and make it 1 or 2 GB (your server has plenty of RAM, so that isn't a problem).

Are you using EAs, like selinux or similar? If yes, then you should also format your filesystem with large inodes (-I 256).

You may also want to try out ext4dev with the mballoc and delalloc patches from Alex Tomas, as this code has been optimized for doing large power-of-two allocations in the filesystem.
They've been posted to the ext4-devel lists a couple of times.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From ric at emc.com Fri Mar 23 10:47:26 2007
From: ric at emc.com (Ric Wheeler)
Date: Fri, 23 Mar 2007 06:47:26 -0400
Subject: Ext3 behavior on power failure
In-Reply-To:
References:
Message-ID: <4603B03E.7080302@emc.com>

armangau_philippe at emc.com wrote:
> Hi all,
>
> We are building a new system which is going to use ext3 FS. We would like to know more about the behavior of ext3 in the case of failure. But before I proceed, I would like to share more information about our future system.
>
> * Our application always does an fsync on files
> * When symbolic links (more specifically fast symlinks) are created, the host directory is also fsync'ed.
> * Our application is also going to front an EMC disk array configured using RAID5 or RAID6.
> * We will be using multipathing, so that we can assume that no disk errors will be reported.
>
> In this context, we would like to know the following for recovery after a power outage:
>
> 1. When will an fsck have to be run (not counting the scheduled fsck every N mounts)?
> 2. In the case of a crash, are the fsync-ed file contents and symbolic links safe no matter what?
>
> Thanks,

This is an interesting twist on some of the discussion that we have had at the recent workshop and in other forums on hardening file systems in order to prevent the need to fsck.

The twist is that we have a disk that will not lose power without being able to write to platter all of the data that has been sent - this is the case for most mid-range or higher disk arrays.

If the application can precisely use fsync() on files, directories and symlinks, it wants to know that all objects are safe on disk that have completed a successful fsync.
It also wants to know that the file system will not need any recovery beyond replaying transactions after a power outage/reboot - simply mount, let the transactions get replayed, and you should be good to go without the fsck.

The hard part of the question is to understand when and how often we will fail to deliver this easy case. Also, does any of the hardening in ext4 help here? Maybe the Stanford eXplode work/analysis sheds some light on this behavior?

ric

From skye0507 at yahoo.com Fri Mar 23 13:17:06 2007
From: skye0507 at yahoo.com (brian stone)
Date: Fri, 23 Mar 2007 06:17:06 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <20070323061840.GC5967@schatzie.adilger.int>
Message-ID: <663779.93645.qm@web59009.mail.re1.yahoo.com>

I am currently leaning towards: mount in ordered mode with the dirsync option, and use fsync(). That seemed to be the most consistent in performance tests. Some of the config tests would fart in the middle, hesitating for a second or two. The ordered mode with fsync() was rock solid. Also, I think journaling the data when you are syncing it is more than one needs.

Without going into unneeded details, I will give you a glimpse of what this application is doing. Machine A, which I will call an app server, generates binary chunks/blocks of data ranging from 28 bytes to a maximum of 1MB. There are multiple app servers. The app servers need to quickly store these blocks on one of several Machine Bs, which I will call volume servers. When a block is transferred from an app server to a volume server, it must be done reliably ... thus the need to sync. If the volume server says, "I got that block", then it really must have it ... on disk.

>> Are you using EAs, like selinux or similar?

File system permissions and security attributes are meaningless in this system. selinux is disabled. These blocks are not browsed by users. I actually mount using "noatime,nodiratime,noacl,nouser_xattr".
Only the app servers have any idea what these blocks mean. The volume server is nothing more than a dumping ground out on the network. We even toyed with writing raw: opening a device directly with no fs and using O_DIRECT. Not a bad idea, just a heck of a lot of work! Easier to fiddle with the correct config for ext3.

So, maybe the volume servers need two fs configs: one for blocks less than 128KB and one for blocks over 128KB. I tested with 1MB blocks because that would be the worst case; I wanted to know how it would perform. The average block size is currently around 100KB.

thanks so much for your thoughts
From skye0507 at yahoo.com Sat Mar 24 15:19:58 2007
From: skye0507 at yahoo.com (brian stone)
Date: Sat, 24 Mar 2007 08:19:58 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <20070323061840.GC5967@schatzie.adilger.int>
Message-ID: <361965.45536.qm@web59008.mail.re1.yahoo.com>

Final configuration and performance results.

Changed machines (for a RAID test):
- 3ware 9550SX with BBU
- Pentium D 940
- 2G DDR2 667
- (4) 750G Seagate SATAII drives (AS series)

RAID levels:
- machine was configured for RAID5 but that was horribly slow, 12 MB/Sec
- created a (2) drive RAID0, then sliced out a 100G partition
- journal was on a separate JBOD disk
- write caching was enabled for the RAID0 and journal disk
- 64K stripes were used on RAID0 and JBOD journal

File system configuration:
- 100G ext3 file system
- used a 32M journal on a physically separate device
- used "ordered" mode for the journal
- mounted with "noatime,nodiratime,noauto,noacl,nouser_xattr,dirsync"
- used the mkfs.ext3 -E option to set stripes to 16; RAID0 was using 64K stripes
- fs was using 4K blocks
- each file transaction did: open(), write(), fsync(), close()
- slammed 1024 1MB chunks at it

I got 36 MB/Sec consistently. A good sign, because with the proper hardware this would perform really well. In production, I would probably use a RAID10 with at least 12 15K SAS/FC drives with dual controllers in Active-Active mode: failover + load balancing. Either fiber or SAS connected. That should scream!

Fortunately, this config needs very little space ... maybe 500G in total. So the hardware cost is not terrible. This config is for a queue directory that is crawled by a background process. That process moves the data from this queue to mass "slow" storage, fiber attached SATAII 7200RPM RAID5. The queue needs to be as fast as possible and must sync the data. Tricky problem :)

thanks.
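The per-file transaction listed above - open(), write(), fsync(), close() - looks roughly like this in C, with the error checking that an "if the volume server says it has the block, it really must have it on disk" guarantee requires. A sketch only: the real volume-server code was not posted, and the function name and flags are illustrative:

```c
/* Store one block durably. The caller may only acknowledge the block
 * back to the app server if this returns 0. */
#include <fcntl.h>
#include <unistd.h>

int store_block(const char *path, const void *block, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    const char *p = block;
    while (len > 0) {                   /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n < 0) { close(fd); return -1; }
        p += n;
        len -= (size_t)n;
    }
    if (fsync(fd) != 0) {               /* push data to stable storage */
        close(fd);
        return -1;
    }
    return close(fd);                   /* close() can fail too - check it */
}
```

The fsync() return value is the whole point here: in ordered mode, a successful fsync() is what lets the server send its acknowledgement.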
From adilger at clusterfs.com Sat Mar 24 21:25:02 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Sat, 24 Mar 2007 15:25:02 -0600
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <361965.45536.qm@web59008.mail.re1.yahoo.com>
References: <20070323061840.GC5967@schatzie.adilger.int> <361965.45536.qm@web59008.mail.re1.yahoo.com>
Message-ID: <20070324212502.GJ5967@schatzie.adilger.int>

On Mar 24, 2007 08:19 -0700, brian stone wrote:
> File system configuration:
> - 100G ext3 file system
> - Used a 32M journal on a physically separate device

We normally run our servers with at least 256MB journals - under metadata intensive loads (including truncates) this can really help.

> - used "ordered" mode for the journal
> - mounted with "noatime,nodiratime,noauto,noacl,nouser_xattr,dirsync"
> - used the mkfs.ext3 -E option to set stripes to 16
> - RAID0 was using 64K stripes.
> - fs was using 4K blocks
> - each file transaction did: open(),write(),fsync(),close()
> - slammed 1024 1MB chunks at it
>
> I got 36 MB/Sec consistently. A good sign because with the proper hardware, this would perform really well.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
From jack at suse.cz Wed Mar 28 12:40:16 2007
From: jack at suse.cz (Jan Kara)
Date: Wed, 28 Mar 2007 14:40:16 +0200
Subject: Ext3 behavior on power failure
In-Reply-To: <4603B03E.7080302@emc.com>
References: <4603B03E.7080302@emc.com>
Message-ID: <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>

> armangau_philippe at emc.com wrote:
> > Hi all,
> >
> > We are building a new system which is going to use ext3 FS. We would like to know more about the behavior of ext3 in the case of failure. But before I proceed, I would like to share more information about our future system.
> > * Our application always does an fsync on files
> > * When symbolic links (more specifically fast symlinks) are created, the host directory is also fsync'ed.
> > * Our application is also going to front an EMC disk array configured using RAID5 or RAID6.
> > * We will be using multipathing, so that we can assume that no disk errors will be reported.
> >
> > In this context, we would like to know the following for recovery after a power outage:
> >
> > 1. When will an fsck have to be run (not counting the scheduled fsck every N mounts)?
> > 2. In the case of a crash, are the fsync-ed file contents and symbolic links safe no matter what?
> >
> > Thanks,
>
> This is an interesting twist on some of the discussion that we have had at the recent workshop and in other forums on hardening file systems in order to prevent the need to fsck.
>
> The twist is that we have a disk that will not lose power without being able to write to platter all of the data that has been sent - this is the case for most mid-range or higher disk arrays.
>
> If the application can precisely use fsync() on files, directories and symlinks, it wants to know that all objects are safe on disk that have completed a successful fsync. It also wants to know that the file system will not need any recovery beyond replaying transactions after a power outage/reboot - simply mount, let the transactions get replayed, and you should be good to go without the fsck.
>
> The hard part of the question is to understand when and how often we will fail to deliver this easy case. Also, does any of the hardening in ext4 help here?

I'm probably misunderstanding something because the answer seems to be too obvious to me :) But anyway I'll write it so that you can correct me:

Due to journalling guarantees, you should get a consistent FS whenever you replay the log (unless there are some software bugs or hardware problems, which is why fsck is run once per several mounts anyway). If you fsync() your data, you are guaranteed that your data is also safely on disk when fsync returns. So what is the question here?

Honza
--
Jan Kara
SuSE CR Labs

From jack at suse.cz Wed Mar 28 13:29:04 2007
From: jack at suse.cz (Jan Kara)
Date: Wed, 28 Mar 2007 15:29:04 +0200
Subject: Ext3 behavior on power failure
In-Reply-To:
References: <4603B03E.7080302@emc.com> <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID: <20070328132903.GI14935@atrey.karlin.mff.cuni.cz>

> > If you fsync() your data, you are guaranteed that also your data are safely on disk when fsync returns. So what is the question here?
>
> Pardon a newbie's intrusion, but I do know this isn't true. There is a window of possible loss because of the multitude of layers of caching, especially within the drive itself. Unless there is a super_duper_fsync() that is able to actually poll the hardware and get a confirmation that the internal buffers are purged?

OK :), to correct myself: After fsync() returns, all the data is acked from the disk (or at least it should be like that, unless there's a bug somewhere). So if there are some caches in the hardware which the hardware is not able to flush on power failure, that's bad luck...
That's why you should turn off write caching on cheaper disks if you really care about data integrity.

Honza
--
Jan Kara
SuSE CR Labs

From armangau_philippe at emc.com Wed Mar 28 14:17:33 2007
From: armangau_philippe at emc.com (armangau_philippe at emc.com)
Date: Wed, 28 Mar 2007 10:17:33 -0400
Subject: Ext3 behavior on power failure
In-Reply-To:
References: <4603B03E.7080302@emc.com> <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID:

In my case the disk cache is not a problem - we use an EMC disk array, and the write cache is protected. Once the data has made it over to the disk array, we can assume it is safe.

Thx
Philippe

-----Original Message-----
From: John Anthony Kazos Jr. [mailto:jakj at j-a-k-j.com]
Sent: Wednesday, March 28, 2007 9:17 AM
To: Jan Kara
Cc: wheeler, richard; armangau, philippe; ext3-users at redhat.com; linux-ext4 at vger.kernel.org; csar at stanford.edu
Subject: Re: Ext3 behavior on power failure

> If you fsync() your data, you are guaranteed that also your data are safely on disk when fsync returns. So what is the question here?

Pardon a newbie's intrusion, but I do know this isn't true. There is a window of possible loss because of the multitude of layers of caching, especially within the drive itself. Unless there is a super_duper_fsync() that is able to actually poll the hardware and get a confirmation that the internal buffers are purged?
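The practice Philippe's application follows - fsync() the host directory after creating a fast symlink, so the new directory entry itself is durable and not just the link target - can be sketched in C as below. This is an illustrative sketch, not code from the thread; all names and paths are made up:

```c
/* Create a symlink and make it durable: symlink() only queues the new
 * directory entry in memory, so we open the containing directory and
 * fsync() that fd to force the entry to stable storage. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int durable_symlink(const char *target, const char *linkpath,
                    const char *dirpath)
{
    if (symlink(target, linkpath) != 0)
        return -1;
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);                /* flush the new directory entry */
    close(dfd);
    return rc;
}
```

Without the directory fsync, a power cut can leave the link's data reachable by inode but the name itself missing after replay - which is exactly the symlink question raised at the start of this thread.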
From jack at suse.cz Wed Mar 28 15:00:03 2007
From: jack at suse.cz (Jan Kara)
Date: Wed, 28 Mar 2007 17:00:03 +0200
Subject: Ext3 behavior on power failure
In-Reply-To:
References: <4603B03E.7080302@emc.com> <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID: <20070328150003.GE29587@duck.suse.cz>

On Wed 28-03-07 10:17:33, armangau_philippe at emc.com wrote:
> In my case the disk cache is not a problem - We use an emc disk array
> the write cache is protected -
> Once the data has made over the disk array we can assume it is safe -

Then if you are able to reproduce a situation where not all data is written after fsync(); poweroff; that is a bug worth reporting.

Honza
--
Jan Kara
SuSE CR Labs

From tsh at mrc-lmb.cam.ac.uk Wed Mar 28 17:47:32 2007
From: tsh at mrc-lmb.cam.ac.uk (T. Horsnell)
Date: Wed, 28 Mar 2007 18:47:32 +0100
Subject: ext3 usage guidance
Message-ID: <20070328174732.GA31129@ls1.lmb.internal>

Is there a document anywhere offering guidance on the optimum use of ext3 filesystems? Googling shows nothing useful, and the Linux ext3 FAQ is not very forthcoming. I'm particularly interested in:

1. The effect on performance of large numbers of (generally) small files. One of my ext3 filesystems has 750K files on a 36GB disk, and backup with tar takes forever. Even 'find /fs -type f -ls' to establish ownership of the various files takes some hours. Are there thresholds for #files-per-directory or #total-files-per-filesystem beyond which performance degrades rapidly?

2. I have a number of filesystems on SCSI disks which I would like to fsck on demand, rather than have an unscheduled fsck at reboot because some mount-count has expired. I use 'tune2fs -c 0 and -t 0' to do this, and would like to use 'shutdown -F -r' at a chosen time to force fsck on reboot, and I'd then like fsck to do things in parallel. What are the resources (memory etc.) required for parallel fsck'ing? Can I reasonably expect to be able to fsck, say, 50 300GB filesystems in parallel, or should I group them into smaller groups? How small?

Thanks,
Terry.

From ric at emc.com Wed Mar 28 23:00:54 2007
From: ric at emc.com (Ric Wheeler)
Date: Wed, 28 Mar 2007 19:00:54 -0400
Subject: Ext3 behavior on power failure
In-Reply-To: <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
References: <4603B03E.7080302@emc.com> <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID: <460AF3A6.403@emc.com>

Jan Kara wrote:
> I'm probably misunderstanding something because the answer seems to be too obvious to me :) But anyway I'll write it so that you can correct me:
> Due to journalling guarantees you should get consistent FS whenever you replay the log (unless there are some software bugs or hardware problems which is why fsck is run once per several mounts anyway).
> If you fsync() your data, you are guaranteed that also your data are safely on disk when fsync returns. So what is the question here?
>
> Honza

I think that the real question here is in practice, how often does this really hold to be true? When it fails, how long does it take to recover the file system?
There are a lot of odd errors that can happen when you monitor a large enough number of file systems. In my experience, I would guess that disk errors are clearly the leading cause of issues, followed by software bugs (file system, firmware, etc.) and then a group of errors caused by various occasional things (bad DRAM in the server/HBA/disk, bad cables, etc.). Note that using a high end array does not eliminate errors, it just reduces the rate (hopefully by a large amount).

What is really hard to predict is the rate of the failures that require fsck with our current file system (say, for a specific hardware setup) and how changes like the checksumming in ext4 can help us ride through these errors without needing a full fsck. This rate has a direct impact on how much pain an fsck will inflict and how important redundancy is to avoid having the file system be a single point of failure.

ric

From jack at suse.cz Thu Mar 29 08:00:59 2007
From: jack at suse.cz (Jan Kara)
Date: Thu, 29 Mar 2007 10:00:59 +0200
Subject: Ext3 behavior on power failure
In-Reply-To: <460AF3A6.403@emc.com>
References: <4603B03E.7080302@emc.com> <20070328124015.GG14935@atrey.karlin.mff.cuni.cz> <460AF3A6.403@emc.com>
Message-ID: <20070329080059.GA7698@duck.suse.cz>

On Wed 28-03-07 19:00:54, Ric Wheeler wrote:
> I think that the real question here is in practice, how often does this really hold to be true? When it fails, how long does it take to recover the file system?

I see, thanks for the explanation :)

> There are a lot of odd errors that can happen when you monitor a large enough number of file systems. In my experience, I would guess that disk errors are clearly the leading cause of issues, followed by software bugs (file system, firmware, etc) and then a group of errors caused by various occasional things (bad DRAM in the server/HBA/disk, bad cables/etc). Note that using a high end array does not eliminate errors, it just reduces the rate (hopefully by a large amount).
>
> What is really hard to predict is the rate of the failures that require fsck with our current file system (say for a specific hardware setup) and how changes like the checksumming in ext4 can help us ride through these errors without needing a full fsck.

OK. All the features I've seen so far were aiming more at detecting that such an unexpected problem happened, rather than trying to fix it or make fixing it faster. So currently it seems to me that any such unexpected failure requires fsck...

Honza
--
Jan Kara
SuSE CR Labs

From adilger at clusterfs.com Thu Mar 29 09:16:44 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 29 Mar 2007 03:16:44 -0600
Subject: ext3 usage guidance
In-Reply-To: <20070328174732.GA31129@ls1.lmb.internal>
References: <20070328174732.GA31129@ls1.lmb.internal>
Message-ID: <20070329091644.GC5967@schatzie.adilger.int>

On Mar 28, 2007 18:47 +0100, T. Horsnell wrote:
> 1. The effect on performance of large numbers of (generally) small files. One of my ext3 filesystems has 750K files on a 36GB disk, and backup with tar takes forever. Even 'find /fs -type f -ls' to establish ownership of the various files takes some hours. Are there thresholds for #files-per-directory or #total-files-per-filesystem beyond which performance degrades rapidly?
You should enable directory indexing if you have > 5000 file directories,
then index the directories:

"tune2fs -O dir_index /dev/XXX; e2fsck -fD /dev/XXX"

> 2. I have a number of filesystems on SCSI disks which I would
> like to fsck on demand, rather than have an unscheduled
> fsck at reboot because some mount-count has expired.
> I use 'tune2fs -c 0 and -t 0' to do this, and would like
> to use 'shutdown -F -r' at a chosen time to force fsck on
> reboot, and I'd then like fsck to do things in parallel.
> What are the resources (memory etc.) required for parallel
> fsck'ing? Can I reasonably expect to be able to fsck, say,
> 50 300GB filesystems in parallel, or should I group them into
> smaller groups? How small?

I think it was at least "(inodes_count * 7 + blocks_count * 3) / 8" per
filesystem when I last checked, but I don't recall exactly anymore.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From tsh at mrc-lmb.cam.ac.uk  Thu Mar 29 10:09:07 2007
From: tsh at mrc-lmb.cam.ac.uk (T. Horsnell)
Date: Thu, 29 Mar 2007 11:09:07 +0100
Subject: ext3 usage guidance
In-Reply-To: <20070329091644.GC5967@schatzie.adilger.int>
References: <20070328174732.GA31129@ls1.lmb.internal>
	<20070329091644.GC5967@schatzie.adilger.int>
Message-ID: <20070329100907.GA7238@ls1.lmb.internal>

On Thu, Mar 29, 2007 at 03:16:44AM -0600, Andreas Dilger wrote:
> On Mar 28, 2007 18:47 +0100, T. Horsnell wrote:
> > 1. The effect on performance of large numbers of (generally) small files
> > One of my ext3 filesystems has 750K files on a 36GB disk, and
> > backup with tar takes forever. Even 'find /fs -type f -ls'
> > to establish ownership of the various files takes some hours.
> > Are there thresholds for #files-per-directory or #total-files-per-filesystem
> > beyond which performance degrades rapidly?
>
> You should enable directory indexing if you have > 5000 file directories,
> then index the directories.
> "tune2fs -O dir_index /dev/XXX; e2fsck -fD /dev/XXX"

Thanks very much. Do you mean '> 5000 directories-per-filesystem'
or '> 5000 files-per-directory'? tune2fs refers to 'large directories',
which implies to me that it's files-per-directory.

Cheers,
Terry.

> > 2. I have a number of filesystems on SCSI disks which I would
> > like to fsck on demand, rather than have an unscheduled
> > fsck at reboot because some mount-count has expired.
> > I use 'tune2fs -c 0 and -t 0' to do this, and would like
> > to use 'shutdown -F -r' at a chosen time to force fsck on
> > reboot, and I'd then like fsck to do things in parallel.
> > What are the resources (memory etc.) required for parallel
> > fsck'ing? Can I reasonably expect to be able to fsck, say,
> > 50 300GB filesystems in parallel, or should I group them into
> > smaller groups? How small?
>
> I think it was at least "(inodes_count * 7 + blocks_count * 3) / 8" per
> filesystem when I last checked, but I don't recall exactly anymore.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>

--

From vcaron at bearstech.com  Thu Mar 29 12:17:56 2007
From: vcaron at bearstech.com (Vincent Caron)
Date: Thu, 29 Mar 2007 14:17:56 +0200
Subject: tune2fs -l stale info
Message-ID: <1175170676.5185.42.camel@localhost>

Hello,

I just noticed that 'tune2fs -l' did not return "lively" updated
information regarding the free inode count (it looks like it's always
correct after unmounting). It became surprising after an online resizing
operation, where the total inode count was immediately updated (grown in
my case) but the free inode count stayed the same: one could deduce that
suddenly a lot of inodes were in use.

Is this normal/expected behaviour? Stale info is okay (as long as it is
advertised as such), but partially updated info makes it look incoherent
to me.

I'm using ext3 on a 2.6.18 (Debian's "vanilla") kernel, x86_64 platform
and tune2fs 1.40-WIP (14-Nov-2006).
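[Editorial note on the ext3-usage thread above: Andreas's rule of thumb, "(inodes_count * 7 + blocks_count * 3) / 8" per filesystem, reads naturally as 7 bits per inode plus 3 bits per block, i.e. bytes after the division by 8; the unit is not stated in the thread, so treat that as an interpretation. A quick sketch applying it to Terry's 50 x 300GB scenario, assuming 4096-byte blocks and the mke2fs default of one inode per 16384 bytes (both assumptions, not figures from the thread):]

```python
def fsck_mem_bytes(inodes_count: int, blocks_count: int) -> int:
    """Andreas's rough e2fsck memory estimate: 7 bits per inode plus
    3 bits per block, divided by 8 to get bytes (interpretation)."""
    return (inodes_count * 7 + blocks_count * 3) // 8

# One 300 GB filesystem (assumed: 4096-byte blocks, 16384 bytes-per-inode)
size_bytes = 300 * 10**9
blocks = size_bytes // 4096    # ~73.2 million blocks
inodes = size_bytes // 16384   # ~18.3 million inodes

per_fs = fsck_mem_bytes(inodes, blocks)
print(f"per filesystem : {per_fs / 2**20:.0f} MiB")       # ~41 MiB
print(f"50 in parallel : {50 * per_fs / 2**30:.1f} GiB")  # ~2.0 GiB
```

By this estimate a single 300GB filesystem needs roughly 40 MiB during fsck and all 50 in parallel about 2 GiB, so whether to group them is mainly a question of available RAM (and disk bandwidth, which the estimate says nothing about).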
From tytso at mit.edu  Thu Mar 29 18:59:30 2007
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 29 Mar 2007 14:59:30 -0400
Subject: tune2fs -l stale info
In-Reply-To: <1175170676.5185.42.camel@localhost>
References: <1175170676.5185.42.camel@localhost>
Message-ID: <20070329185930.GA30858@thunk.org>

On Thu, Mar 29, 2007 at 02:17:56PM +0200, Vincent Caron wrote:
> Hello,
>
> I just noticed that 'tune2fs -l' did not return "lively" updated
> information regarding the free inode count (it looks like it's always
> correct after unmounting). It became surprising after an online resizing
> operation, where the total inode count was immediately updated (grown in
> my case) but the free inode count stayed the same: one could deduce that
> suddenly a lot of inodes were in use.

Yes, this is expected. Don't use tune2fs -l for this. Use df -i
instead. It is accurate while the filesystem is mounted, and it's
even portable, which is important if you ever need to use other legacy
Unix systems, such as Solaris. :-)

You can use tune2fs -l or dumpe2fs to obtain the free block/inode
counts for unmounted filesystems, assuming they were cleanly
unmounted. If the system had crashed and you haven't yet run the
journal using e2fsck, then dumpe2fs/tune2fs -l may print stale
information until you run the journal, either by running e2fsck or by
mounting and unmounting the ext3 filesystem.

- Ted

From adilger at clusterfs.com  Thu Mar 29 19:59:39 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 29 Mar 2007 13:59:39 -0600
Subject: tune2fs -l stale info
In-Reply-To: <1175170676.5185.42.camel@localhost>
References: <1175170676.5185.42.camel@localhost>
Message-ID: <20070329195939.GI5967@schatzie.adilger.int>

On Mar 29, 2007  14:17 +0200, Vincent Caron wrote:
> I just noticed that 'tune2fs -l' did not return "lively" updated
> information regarding the free inode count (it looks like it's always
> correct after unmounting).

This is a bit of a defect in all 2.6 kernels.
They never update the on-disk superblock free blocks/inodes information,
to avoid lock contention, even when this info is available.

Can you please give the following patch a try? It fixes this issue, and
also makes statfs MUCH more efficient for large filesystems, because the
filesystem overhead is constant unless the filesystem size changes, and
computing it for 16k groups is slow (hence the hack of adding
cond_resched() instead of fixing the problem correctly). It has not been
tested much, but is very straightforward.

Only the last part is strictly necessary to fix your particular problem
(the setting of es->s_free_inodes_count and es->s_free_blocks_count).
This is lazy, in the sense that you need a "statfs" to update the count,
and then a truncate or unlink or rmdir in order to dirty the superblock
so that it is flushed to disk. However, it will be correct in the buffer
cache, and it is a lot better than what we have now. We don't want a
non-lazy version anyways, because of performance.

Signed-off-by: Andreas Dilger 

======================= ext3-statfs-2.6.20.diff ==========================
Index: linux-stage/fs/ext3/super.c
===================================================================
--- linux-stage.orig/fs/ext3/super.c	2007-03-22 17:29:30.000000000 -0600
+++ linux-stage/fs/ext3/super.c	2007-03-23 01:48:41.000000000 -0600
@@ -2389,19 +2389,22 @@ restore_opts:
 	struct super_block *sb = dentry->d_sb;
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	struct ext3_super_block *es = sbi->s_es;
-	ext3_fsblk_t overhead;
-	int i;
+	static ext3_fsblk_t overhead_last;
+	static __le32 blocks_last;
 	u64 fsid;
 
-	if (test_opt (sb, MINIX_DF))
-		overhead = 0;
-	else {
-		unsigned long ngroups;
-		ngroups = EXT3_SB(sb)->s_groups_count;
+	if (test_opt (sb, MINIX_DF)) {
+		overhead_last = 0;
+	} else if (blocks_last != es->s_blocks_count) {
+		unsigned long ngroups = sbi->s_groups_count, group, metabg = ~0;
+		unsigned three = 1, five = 5, seven = 7;
+		ext3_fsblk_t overhead = 0;
 		smp_rmb();
 
 		/*
-		 * Compute the overhead (FS structures)
+		 * Compute the overhead (FS structures). This is constant
+		 * for a given filesystem unless the number of block groups
+		 * changes so we cache the previous value until it does.
 		 */
 
 		/*
@@ -2419,28 +2422,43 @@ static int ext3_statfs (struct super_blo
 		 * block group descriptors. If the sparse superblocks
 		 * feature is turned on, then not all groups have this.
 		 */
-		for (i = 0; i < ngroups; i++) {
-			overhead += ext3_bg_has_super(sb, i) +
-				ext3_bg_num_gdb(sb, i);
-			cond_resched();
-		}
+		overhead += 1 + sbi->s_gdb_count +
+			le16_to_cpu(es->s_reserved_gdt_blocks); /* group 0 */
+		if (EXT3_HAS_INCOMPAT_FEATURE(sb,
+					      EXT3_FEATURE_INCOMPAT_META_BG)) {
+			metabg = le32_to_cpu(es->s_first_meta_bg) *
+				sbi->s_desc_per_block;
+			group = ngroups - metabg;
+			overhead += (group + 1) / sbi->s_desc_per_block * 3 +
+				((group % sbi->s_desc_per_block) >= 2 ? 2 : (group % 2));
+		}
+
+		while ((group = ext3_list_backups(sb, &three, &five, &seven)) <
+		       ngroups) /* sb + group descriptors backups */
+			overhead += 1 + (group >= metabg ? 0 : sbi->s_gdb_count +
+				le16_to_cpu(es->s_reserved_gdt_blocks));
 
 		/*
 		 * Every block group has an inode bitmap, a block
 		 * bitmap, and an inode table.
 		 */
-		overhead += (ngroups * (2 + EXT3_SB(sb)->s_itb_per_group));
+		overhead += ngroups * (2 + sbi->s_itb_per_group);
+		overhead_last = overhead;
+		smp_wmb();
+		blocks_last = es->s_blocks_count;
 	}
 
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
-	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
+	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead_last;
 	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
 	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
 	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT3_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
 		le64_to_cpup((void *)es->s_uuid + sizeof(u64));
Index: linux-stage/fs/ext3/resize.c
===================================================================
--- linux-stage.orig/fs/ext3/resize.c	2007-03-22 17:29:30.000000000 -0600
+++ linux-stage/fs/ext3/resize.c	2007-03-23 01:16:38.000000000 -0600
@@ -292,8 +292,8 @@ exit_journal:
  * sequence of powers of 3, 5, and 7: 1, 3, 5, 7, 9, 25, 27, 49, 81, ...
  * For a non-sparse filesystem it will be every group: 1, 2, 3, 4, ...
  */
-static unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
-				  unsigned *five, unsigned *seven)
+unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
+			   unsigned *five, unsigned *seven)
 {
 	unsigned *min = three;
 	int mult = 3;
Index: linux-stage/include/linux/ext3_fs.h
===================================================================
--- linux-stage.orig/include/linux/ext3_fs.h	2007-03-22 17:29:30.000000000 -0600
+++ linux-stage/include/linux/ext3_fs.h	2007-03-23 00:41:22.000000000 -0600
@@ -846,6 +846,8 @@ extern int ext3_group_add(struct super_b
 extern int ext3_group_extend(struct super_block *sb,
 			     struct ext3_super_block *es,
 			     ext3_fsblk_t n_blocks_count);
+extern unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
+				  unsigned *five, unsigned *seven);
 
 /* super.c */
 extern void ext3_error (struct super_block *, const char *, const char *, ...)

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From vcaron at bearstech.com  Thu Mar 29 20:12:13 2007
From: vcaron at bearstech.com (Vincent Caron)
Date: Thu, 29 Mar 2007 22:12:13 +0200
Subject: tune2fs -l stale info
In-Reply-To: <20070329185930.GA30858@thunk.org>
References: <1175170676.5185.42.camel@localhost>
	<20070329185930.GA30858@thunk.org>
Message-ID: <1175199133.5185.60.camel@localhost>

On Thu, 2007-03-29 at 14:59 -0400, Theodore Tso wrote:
> On Thu, Mar 29, 2007 at 02:17:56PM +0200, Vincent Caron wrote:
> > Hello,
> >
> > I just noticed that 'tune2fs -l' did not return "lively" updated
> > information regarding the free inode count (it looks like it's always
> > correct after unmounting). It became surprising after an online resizing
> > operation, where the total inode count was immediately updated (grown in
> > my case) but the free inode count stayed the same: one could deduce that
> > suddenly a lot of inodes were in use.
>
> Yes, this is expected. Don't use tune2fs -l for this. Use df -i
> instead.
> It is accurate while the filesystem is mounted, and it's
> even portable, which is important if you ever need to use other legacy
> Unix systems, such as Solaris. :-)

Thanks for the tip, the figures look much better now...

From jakj at j-a-k-j.com  Wed Mar 28 13:17:42 2007
From: jakj at j-a-k-j.com (John Anthony Kazos Jr.)
Date: Wed, 28 Mar 2007 13:17:42 -0000
Subject: Ext3 behavior on power failure
In-Reply-To: <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
References: <4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID: 

> If you fsync() your data, you are guaranteed that also your data are
> safely on disk when fsync returns. So what is the question here?

Pardon a newbie's intrusion, but I do know this isn't true. There is a
window of possible loss because of the multitude of layers of caching,
especially within the drive itself. Unless there is a super_duper_fsync()
that is able to actually poll the hardware and get a confirmation that
the internal buffers are purged?