From tytso at mit.edu Sat Apr 1 02:41:47 2006
From: tytso at mit.edu (Theodore Ts'o)
Date: Fri, 31 Mar 2006 21:41:47 -0500
Subject: [RFC] mke2fs with DIR_INDEX, RESIZE_INODE by default
In-Reply-To: <1142891937.21593.47.camel@orbit.scot.redhat.com>
References: <20060317075312.GG30801@schatzie.adilger.int>
	<1142634418.3641.62.camel@orbit.scot.redhat.com>
	<20060317143630.300d82f8.akpm@osdl.org>
	<20060318084302.GX30801@schatzie.adilger.int>
	<1142878786.3414.27.camel@orbit.scot.redhat.com>
	<20060320211401.GG6199@schatzie.adilger.int>
	<1142891937.21593.47.camel@orbit.scot.redhat.com>
Message-ID: <20060401024147.GA24163@thunk.org>

On Mon, Mar 20, 2006 at 04:58:57PM -0500, Stephen C. Tweedie wrote:
>
> I think we're probably at the right point to do so.  Most people who are
> most likely to be affected have a reasonably recent e2fsprogs now.  On
> the Fedora side I'm seeing very few reports of people bitten by
> e2fsprogs incompatibility, and more and more instances of people bitten
> the other way by filesystems not performing as well as expected due to
> missing dir_index flags.
>

In case some people haven't noticed, a few days ago I released a
pre-release of e2fsprogs 1.39.  New in this release is a way for
distributions and system administrators to control the default
filesystem features via the /etc/mke2fs.conf file.

In the pre-release version, mke2fs is still using the same behaviour
as before, but my plan is to change mke2fs to create filesystems with
the dir_index and resize_inode features by default.  People who don't
like this default can always edit mke2fs.conf and change things back.

						- Ted

From HuntressGB at Npt.NUWC.Navy.Mil Sat Apr 1 14:53:11 2006
From: HuntressGB at Npt.NUWC.Navy.Mil (Huntress Gary B NPRI)
Date: Sat, 01 Apr 2006 09:53:11 -0500
Subject: Tuning for large number of directory entries?
Message-ID: <7F93C0D0C6D8454B9B05720F713A09F138D941@ldap.npt.nuwc.navy.mil>

I have been running a public MySQL server for over 4 years.
The system was a 1GHz box with 256MB of RAM.  MySQL puts each database
into a separate directory in a single "data" directory.  Once the old
system reached about 10K databases the connection times increased 10X
or more.  I attributed the speed problems to a possible filesystem
limitation with a large number of files.  That system ran RH9 and had
either ext2 or ext3 with no htree patch.

Recently I bought a new 2.4GHz server with 1GB RAM and much faster
drives.  I installed FC4 and am running a 2.6.14 kernel.  I thought
that even if my problems were not completely solved, I would at least
not see connection times increase until I had many more directory
entries (wild guess - 20K).

Since I am writing, you can guess that is not the case.  At 13K
directory entries, I am still seeing significantly slower connection
times even with the faster hardware and newer software.  I'm limited in
what I can move because MySQL expects everything in the "data"
directory.

My questions are:

1) I don't know much about htree, but I recall that it is supposed to
   help in this situation.  How do I tell if it is in use on my system?
   Is it a kernel module or is it compiled into ext3?

2) Are there other options for tuning ext3 performance?

Thanks everyone,

Gary Huntress

From adilger at clusterfs.com Sat Apr 1 18:11:40 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Sat, 1 Apr 2006 11:11:40 -0700
Subject: Tuning for large number of directory entries?
In-Reply-To: <7F93C0D0C6D8454B9B05720F713A09F138D941@ldap.npt.nuwc.navy.mil>
References: <7F93C0D0C6D8454B9B05720F713A09F138D941@ldap.npt.nuwc.navy.mil>
Message-ID: <20060401181140.GR17364@schatzie.adilger.int>

On Apr 01, 2006 09:53 -0500, Huntress Gary B NPRI wrote:
> Since I am writing, you can guess that is not the case.  At 13K
> directory entries, I am still seeing significantly slower connection
> times even with the faster hardware and newer software.
You need to run "e2fsck -f -D" when the filesystem is unmounted, to
build htree indices for your large directories.

> My questions are: 1) I don't know much about htree, but I recall
> that it is supposed to help in this situation.  How do I tell if it is
> in use on my system.  Is it a kernel module or is it compiled into ext3?

"dumpe2fs -h | grep -i features", it's part of stock 2.6 ext3.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From hahaha_30k at yahoo.com Sun Apr 2 06:39:27 2006
From: hahaha_30k at yahoo.com (Robinson Tiemuqinke)
Date: Sat, 1 Apr 2006 22:39:27 -0800 (PST)
Subject: [RFC] mke2fs with DIR_INDEX, RESIZE_INODE by default
In-Reply-To: <20060401024147.GA24163@thunk.org>
Message-ID: <20060402063927.26510.qmail@web36709.mail.mud.yahoo.com>

Hi,

A stupid question to ask:

How do I turn on the "resize_inode" feature for an ext3 file system
created with an old mke2fs?

In Fedora Core 4 and Fedora Core 5 the new mke2fs program creates file
systems with the "resize_inode" feature on by default, but old file
systems created with RH9 and Fedora Core 1 don't have it.

I cannot run "tune2fs -O resize_inode" to give the old file systems the
new feature after upgrading from Fedora Core 1 to Fedora Core 5, nor can
I re-create the file systems directly, since I have important data on
them.

Is there a tool to upgrade old ext3 file systems so that they will also
have the "resize_inode" feature?

Thanks.

From rmy at tigress.co.uk Sun Apr 2 17:07:21 2006
From: rmy at tigress.co.uk (Ron Yorston)
Date: Sun, 2 Apr 2006 18:07:21 +0100 (BST)
Subject: Zeroing freed blocks
Message-ID: <200604021707.k32H7LpJ026632@tiffany.internal.tigress.co.uk>

A couple of years ago there was a discussion on lkml under the thread
'PATCH - ext2fs privacy (i.e. secure deletion) patch' about zapping
deleted data in the filesystem as a security mechanism.  The discussion
wandered off into how 'chattr +s' could be implemented and whether
encrypting filesystems wouldn't be a better solution to the problem.

I've been maintaining a simplified version of the patch for a different
reason:  to keep filesystems in files sparse.  Filesystem images for use
by things like user-mode Linux and Xen are often created as sparse files.
After they've been in use for a while their sparseness is reduced even
though they may have lots of free space.  Having the guest kernel fill
deleted blocks with zeros doesn't make the underlying file sparse,
but it does help.  I've got a page with more details:

   http://intgat.tigress.co.uk/rmy/uml/sparsify.html

Anyway, a couple of things:

1. The patch (see below) is pretty simple.  I've been using it for some
   time in UML build systems for old versions of software (rh62, anyone?),
   and today I even tried it for several seconds in a Xen domU kernel.
   It seems to do what I want, but is it any good?

2. The patch is now for ext2 only, the original ext3 version having
   succumbed to bitrot.  What would it take to implement something
   similar for ext3 these days?

Ron

--- linux-2.6.16/Documentation/filesystems/ext2.txt.zerofree	2006-03-20 05:53:29.000000000 +0000
+++ linux-2.6.16/Documentation/filesystems/ext2.txt	2006-04-02 09:21:52.000000000 +0100
@@ -58,6 +58,8 @@ nobh				Do not attach buffer_heads to fi
 
 xip				Use execute in place (no caching) if possible
 
+zerofree			Zero data blocks when they are freed.
+
 grpquota,noquota,quota,usrquota	Quota options are silently ignored by ext2.
 
 
--- linux-2.6.16/fs/ext2/balloc.c.zerofree	2006-03-20 05:53:29.000000000 +0000
+++ linux-2.6.16/fs/ext2/balloc.c	2006-04-02 09:21:52.000000000 +0100
@@ -174,6 +174,16 @@ static void group_release_blocks(struct
 	}
 }
 
+static inline void zero_block(struct super_block *sb, unsigned long block)
+{
+	struct buffer_head * bh;
+
+	bh = sb_getblk(sb, block);
+	memset(bh->b_data, 0, bh->b_size);
+	mark_buffer_dirty(bh);
+	brelse(bh);
+}
+
 /* Free given blocks, update quota and i_blocks field */
 void ext2_free_blocks (struct inode * inode, unsigned long block,
 		       unsigned long count)
@@ -242,6 +252,9 @@ do_more:
 				"bit already cleared for block %lu", block + i);
 		} else {
 			group_freed++;
+			if ( test_opt(sb, ZEROFREE) ) {
+				zero_block(sb, block+i);
+			}
 		}
 	}
 
--- linux-2.6.16/fs/ext2/super.c.zerofree	2006-03-20 05:53:29.000000000 +0000
+++ linux-2.6.16/fs/ext2/super.c	2006-04-02 09:21:52.000000000 +0100
@@ -289,7 +289,7 @@ enum {
 	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
 	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
-	Opt_usrquota, Opt_grpquota
+	Opt_usrquota, Opt_grpquota, Opt_zerofree
 };
 
 static match_table_t tokens = {
@@ -312,6 +312,7 @@ static match_table_t tokens = {
 	{Opt_oldalloc, "oldalloc"},
 	{Opt_orlov, "orlov"},
 	{Opt_nobh, "nobh"},
+	{Opt_zerofree, "zerofree"},
 	{Opt_user_xattr, "user_xattr"},
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
@@ -395,6 +396,9 @@ static int parse_options (char * options
 		case Opt_nobh:
 			set_opt (sbi->s_mount_opt, NOBH);
 			break;
+		case Opt_zerofree:
+			set_opt (sbi->s_mount_opt, ZEROFREE);
+			break;
 #ifdef CONFIG_EXT2_FS_XATTR
 		case Opt_user_xattr:
 			set_opt (sbi->s_mount_opt, XATTR_USER);
--- linux-2.6.16/include/linux/ext2_fs.h.zerofree	2006-03-20 05:53:29.000000000 +0000
+++ linux-2.6.16/include/linux/ext2_fs.h	2006-04-02 09:21:52.000000000 +0100
@@ -310,6 +310,7 @@ struct ext2_inode {
 #define EXT2_MOUNT_MINIX_DF		0x000080  /* Mimics the Minix statfs */
 #define EXT2_MOUNT_NOBH			0x000100  /* No buffer_heads */
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
+#define EXT2_MOUNT_ZEROFREE		0x000400  /* Zero freed blocks */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */

From keld at dkuug.dk Sun Apr 2 20:37:01 2006
From: keld at dkuug.dk (Keld Jørn Simonsen)
Date: Sun, 2 Apr 2006 22:37:01 +0200
Subject: Zeroing freed blocks
In-Reply-To: <200604021707.k32H7LpJ026632@tiffany.internal.tigress.co.uk>
References: <200604021707.k32H7LpJ026632@tiffany.internal.tigress.co.uk>
Message-ID: <20060402203701.GB14104@rap.rap.dk>

On Sun, Apr 02, 2006 at 06:07:21PM +0100, Ron Yorston wrote:
> A couple of years ago there was a discussion on lkml under the thread
> 'PATCH - ext2fs privacy (i.e. secure deletion) patch' about zapping
> deleted data in the filesystem as a security mechanism.  The discussion
> wandered off into how 'chattr +s' could be implemented and whether
> encrypting filesystems wouldn't be a better solution to the problem.
>
> I've been maintaining a simplified version of the patch for a different
> reason:  to keep filesystems in files sparse.  Filesystem images for use
> by things like user-mode Linux and Xen are often created as sparse files.
> After they've been in use for a while their sparseness is reduced even
> though they may have lots of free space.  Having the guest kernel fill
> deleted blocks with zeros doesn't make the underlying file sparse,
> but it does help.  I've got a page with more details:
>
>    http://intgat.tigress.co.uk/rmy/uml/sparsify.html
>
> Anyway, a couple of things:
>
> 1. The patch (see below) is pretty simple.  I've been using it for some
>    time in UML build systems for old versions of software (rh62, anyone?),
>    and today I even tried it for several seconds in a Xen domU kernel.
>    It seems to do what I want, but is it any good?
>
> 2. The patch is now for ext2 only, the original ext3 version having
>    succumbed to bitrot.  What would it take to implement something
>    similar for ext3 these days?

Well, I think this should be optional, if included.
It directly counteracts the patch I recently sent to salvage files
from their data blocks in ext2/ext3.

Best regards
keld

From tytso at mit.edu Sun Apr 2 14:14:14 2006
From: tytso at mit.edu (Theodore Ts'o)
Date: Sun, 2 Apr 2006 10:14:14 -0400
Subject: [RFC] mke2fs with DIR_INDEX, RESIZE_INODE by default
In-Reply-To: <20060402063927.26510.qmail@web36709.mail.mud.yahoo.com>
References: <20060401024147.GA24163@thunk.org>
	<20060402063927.26510.qmail@web36709.mail.mud.yahoo.com>
Message-ID: <20060402141414.GA7745@thunk.org>

On Sat, Apr 01, 2006 at 10:39:27PM -0800, Robinson Tiemuqinke wrote:
> A stupid question to ask:
>
> How to turn on "resize_inode" feature for ext3 file
> system created with old mke2fs?
>
> Is there a tool to upgrade old ext3 file systems
> so that they will also have "resize_inode" feature?

If you download the ext2resize package from SourceForge, it has a
program called "ext2prepare" which will do this.  Neither Stephen, when
he was integrating on-line resizing for Fedora/Red Hat Enterprise
Linux, nor I, when considering how to integrate this functionality into
e2fsprogs, was comfortable enough with the code base to accept
responsibility for maintaining it in its current form, which is why
that functionality has not yet appeared in either RHEL4 or e2fsprogs.

That being said, I'm not aware of anyone who has lost data, or of any
other serious bugs in the ext2prepare program in ext2resize, aside from
the fact that it has portability problems on big-endian systems.
(One of ext2resize's problems is that it doesn't use libext2fs, but
rather rolled its own library functions, which clearly were never tested
on big-endian systems and which also have application-level
functionality folded into the library routines, making a port to
libext2fs more difficult than it ought to have been.)

In any case, since requesting on-line resizing is now integrated into
resize2fs, the only functionality still found only in ext2resize is the
ext2prepare program, which reserves space on an already-created ext3
filesystem.  Adding this support is on my todo list, but to be honest
other development items for e2fsprogs are higher priority at the
moment.  If someone wants to try writing an ext2prepare-like program
using libext2fs, let me know, and I can give you an outline of what
needs to be done.

						- Ted

From fk at linuxburg.de Mon Apr 3 09:40:36 2006
From: fk at linuxburg.de (Felix E. Klee)
Date: Mon, 3 Apr 2006 11:40:36 +0200
Subject: Can copying a file damage the original?
Message-ID: <200604031140.37018.fk@linuxburg.de>

Consider the following scenario:

* A database is accessing a large file $a on an Ext3FS, writing to it,
  reading from it.

* While testing a backup script, the file $a is copied with rsync
  without prior shutdown of the database software.

Here's what just happened under this scenario: $a got damaged.

I'm certain that this is just a coincidence.  However, my employer
recalls hearing other people state that copying files around while
they are in use may damage the original.  I doubt that these other
people have a clue, but perhaps it's me who doesn't have a clue: Are
there any circumstances under which a source file in a copy operation
can be damaged?

--
Dipl.-Phys. Felix E. Klee
Email: fk at linuxburg.de (work), felix.klee at inka.de (home)
Tel: +49 721 8307937, Fax: +49 721 8307936
Linuxburg, Goethestr. 15a, 76135 Karlsruhe, Germany

From sct at redhat.com Mon Apr 3 20:06:58 2006
From: sct at redhat.com (Stephen C. Tweedie)
Date: Mon, 03 Apr 2006 16:06:58 -0400
Subject: FC5: "ext_attr" and "large_file" features for ext3 file systems ???
In-Reply-To: <20060328215257.28237.qmail@web36702.mail.mud.yahoo.com>
References: <20060328215257.28237.qmail@web36702.mail.mud.yahoo.com>
Message-ID: <1144094818.9387.7.camel@orbit.scot.redhat.com>

Hi,

On Tue, 2006-03-28 at 13:52 -0800, Robinson Tiemuqinke wrote:

> First, what's the "large_file" feature REALLY means?
> Then, what's the size of "large file" to light this
> feature on? 2GB, or 2TB?

2GB.

> Second, the "ext_attr" feature seems another
> automatic one: it only appears after the first
> "setfacl" command runs on the file system and then the
> feature will keep on there forever even ACL is
> removed. What's the indication of "ext_attr" feature
> and what are the reasons behind to have this feature?

They are there simply to indicate that a given feature is present on the
filesystem.  They prevent old versions of the kernel and/or e2fsck tools
from mistakenly operating on a filesystem with newer features,
potentially corrupting things on disk or returning incorrect file
contents.

All remotely recent kernels have large file support for ext3, and all
2.6 ones (and many vendor-supplied 2.4 ones) have ext_attr, so you have
to be running something pretty old to run into compatibility problems
with either of those features.

--Stephen

From maillists at hosttuls.com Tue Apr 4 00:16:28 2006
From: maillists at hosttuls.com (Brandon Evans)
Date: Mon, 03 Apr 2006 17:16:28 -0700
Subject: Filesystem too large...
Message-ID: <4431BADC.8000403@hosttuls.com>

I need to set up a 3.27TB ext3 filesystem using -i 1024 and -b 1024.

When I try to format this partition I get the "Filesystem too large."
error.  Are there any plans to update these limits?  Are there any
patches already available that I can try out?  Or am I just SOL here?
(vzbu2 ~)# fdisk -l /dev/etherd/e1.1

Disk /dev/etherd/e1.1: 3600.7 GB, 3600795892224 bytes
255 heads, 63 sectors/track, 437771 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/etherd/e1.1 doesn't contain a valid partition table

(vzbu2 ~)# lvdisplay
  --- Logical volume ---
  LV Name                /dev/lvg01/vz
  VG Name                lvg01
  LV UUID                CH5TEA-WC61-oSMX-olxz-sBTf-L1Ho-E1740u
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                3.27 TB
  Current LE             858496
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:0

mkfs.ext3 -i 1024 -b 1024 /dev/lvg01/vz
mke2fs 1.35 (28-Feb-2004)
mkfs.ext3: Filesystem too large.  No more than 2**31-1 blocks
         (8TB using a blocksize of 4k) are currently supported.

--
Brandon Evans

"I have a theory that the truth is never told during the nine-to-five
hours."
-Hunter S. Thompson

From hahaha_30k at yahoo.com Tue Apr 4 02:03:21 2006
From: hahaha_30k at yahoo.com (Robinson Tiemuqinke)
Date: Mon, 3 Apr 2006 19:03:21 -0700 (PDT)
Subject: FC5: "ext_attr" and "large_file" features for ext3 file systems ???
In-Reply-To: <1144094818.9387.7.camel@orbit.scot.redhat.com>
Message-ID: <20060404020321.31102.qmail@web36701.mail.mud.yahoo.com>

Thanks a lot. Another question is:

Do I have to run "e2fsck -y -D" on a file system to activate the
"dir_index" feature?

I have a bunch of old ext3 file systems created with old versions of
mkfs.ext3.  After upgrading to Fedora Core 5, I ran
"tune2fs -O dir_index" to turn on the feature, but it is rumored that I
have to run "e2fsck -y -D" after unmounting the old ext3 file systems so
that new file and directory creations will use the hashed B-tree.

Is that correct?  If I don't run "e2fsck -y -D", will the original
linear directory structure still be in effect even though I turned on
the "dir_index" feature with tune2fs?  In that case, what are the
potential effects on the underlying old ext3 file systems?

Thanks a lot.

From adilger at clusterfs.com Tue Apr 4 07:01:21 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 4 Apr 2006 01:01:21 -0600
Subject: FC5: "ext_attr" and "large_file" features for ext3 file systems ???
In-Reply-To: <20060404020321.31102.qmail@web36701.mail.mud.yahoo.com>
References: <1144094818.9387.7.camel@orbit.scot.redhat.com>
	<20060404020321.31102.qmail@web36701.mail.mud.yahoo.com>
Message-ID: <20060404070121.GK17364@schatzie.adilger.int>

On Apr 03, 2006 19:03 -0700, Robinson Tiemuqinke wrote:
> Do I have to run "e2fsck -y -D" on a file system to
> activate "dir_index" feature?
You do not HAVE to run this, as new directories and existing directories
that grow larger than one block (normally 4kB) will start to use the
directory indexing feature.  However, to use dir indexing on existing
large directories you do need to use "e2fsck -f -D".  This will also
"pack" large directories that have had most of their files deleted,
AFAIR.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From adilger at clusterfs.com Tue Apr 4 06:56:21 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 4 Apr 2006 00:56:21 -0600
Subject: Filesystem too large...
In-Reply-To: <4431BADC.8000403@hosttuls.com>
References: <4431BADC.8000403@hosttuls.com>
Message-ID: <20060404065621.GJ17364@schatzie.adilger.int>

On Apr 03, 2006 17:16 -0700, Brandon Evans wrote:
> I need to setup a 3.27TB ext3 filesystem using -i 1024 and -b 1024.
>
> When I try to format this partition I get the "Filesystem too large."
> error. Are there any plans to update these limits? are there any
> patches already available that I can try out? Or am I just SOL here?

The same patches that have been posted here (or maybe ext2-devel?)
to increase the fs size to 16TB are applicable in your case.  They
are experimental at this stage, but as always, testing is welcome.

The other question is why you want to have a 3TB filesystem with 1kB
blocks, unless you are consistently creating very small files...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From sct at redhat.com Tue Apr 4 18:10:10 2006
From: sct at redhat.com (Stephen C. Tweedie)
Date: Tue, 04 Apr 2006 14:10:10 -0400
Subject: Filesystem too large...
In-Reply-To: <20060404065621.GJ17364@schatzie.adilger.int>
References: <4431BADC.8000403@hosttuls.com>
	<20060404065621.GJ17364@schatzie.adilger.int>
Message-ID: <1144174210.3411.24.camel@orbit.scot.redhat.com>

Hi,

On Tue, 2006-04-04 at 00:56 -0600, Andreas Dilger wrote:
> On Apr 03, 2006 17:16 -0700, Brandon Evans wrote:
> > I need to setup a 3.27TB ext3 filesystem using -i 1024 and -b 1024.
> >
> > When I try to format this partition I get the "Filesystem too large."
> > error. Are there any plans to update these limits? are there any
> > patches already available that I can try out? Or am I just SOL here?
>
> The same patches that have been posted here (or maybe ext2-devel?)
> to increase the fs size to 16TB are applicable in your case.

Yes; just note that with a 1k blocksize, 2^32 blocks will only get you
as far as 4TB, not 16TB.  But yes, it should work.

However, 1k blocksize is usually a bad idea unless you really need the
very best space efficiency on the filesystem: it usually performs
worse than 4k blocksize, and it imposes other limits such as a maximum
file size of a bit over 16GB.  With 4k blocksize, a 3.27TB filesystem
should just work.

--Stephen

From maillists at hosttuls.com Tue Apr 4 21:22:20 2006
From: maillists at hosttuls.com (Brandon Evans)
Date: Tue, 04 Apr 2006 14:22:20 -0700
Subject: Filesystem too large...
In-Reply-To: <20060404065621.GJ17364@schatzie.adilger.int>
References: <4431BADC.8000403@hosttuls.com>
	<20060404065621.GJ17364@schatzie.adilger.int>
Message-ID: <4432E38C.1070506@hosttuls.com>

Andreas Dilger wrote:
> On Apr 03, 2006 17:16 -0700, Brandon Evans wrote:
>> I need to setup a 3.27TB ext3 filesystem using -i 1024 and -b 1024.
>>
>> When I try to format this partition I get the "Filesystem too large."
>> error. Are there any plans to update these limits? are there any
>> patches already available that I can try out? Or am I just SOL here?

> The other question is why you want to have a 3TB filesystem with 1kB
> blocks, unless you are consistently creating very small files...

The server I am preparing is a sw-soft virtuozzo backup server which
requires the 1kB blocks.  The small blocks are needed for the magic
links it uses in the virtual environment.

--
Brandon Evans

"I have a theory that the truth is never told during the nine-to-five
hours."
-Hunter S. Thompson

From maillists at hosttuls.com Tue Apr 4 22:44:13 2006
From: maillists at hosttuls.com (Brandon Evans)
Date: Tue, 04 Apr 2006 15:44:13 -0700
Subject: Filesystem too large...
In-Reply-To: <1144174210.3411.24.camel@orbit.scot.redhat.com>
References: <4431BADC.8000403@hosttuls.com>
	<20060404065621.GJ17364@schatzie.adilger.int>
	<1144174210.3411.24.camel@orbit.scot.redhat.com>
Message-ID: <4432F6BD.7080201@hosttuls.com>

Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2006-04-04 at 00:56 -0600, Andreas Dilger wrote:
>> On Apr 03, 2006 17:16 -0700, Brandon Evans wrote:
>>> I need to setup a 3.27TB ext3 filesystem using -i 1024 and -b 1024.
>>>
>>> When I try to format this partition I get the "Filesystem too large."
>>> error. Are there any plans to update these limits? are there any
>>> patches already available that I can try out? Or am I just SOL here?
>> The same patches that have been posted here (or maybe ext2-devel?)
>> to increase the fs size to 16TB are applicable in your case.
>
> Yes; just note that with a 1k blocksize, 2^32 blocks will only get you
> as far as 4TB, not 16TB.  But yes, it should work.
>
> However, 1k blocksize is usually a bad idea unless you really need the
> very very best space efficiency on the filesystem: it usually performs
> worse than 4k blocksize, and it imposes other limits such as a maximum
> file size of a bit over 16GB.  With 4k blocksize, a 3.27TB filesystem
> should just work.

I should mention I have tried this on 2.6.14 and 2.6.8.
From what I have found, it seems these kernels should already have the
16TB file system support.  Perhaps I am looking in the wrong place.
Any help finding this patch would be appreciated.

--
Brandon Evans

"I have a theory that the truth is never told during the nine-to-five
hours."
-Hunter S. Thompson

From talk2sumit at gmail.com Thu Apr 6 06:37:33 2006
From: talk2sumit at gmail.com (Sumit Narayan)
Date: Thu, 6 Apr 2006 14:37:33 +0800
Subject: deleting partition does not effect superblock?
Message-ID: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com>

Hi,

I am using kernel 2.6.15.4.

On my system, I first created a partition with EXT3 and put some data
on it.  Later, I deleted the partition, and re-created another
partition with the same starting block number and a higher ending
block number.  I intended to format it with another filesystem, but
surprisingly (or maybe just to me), the superblock of the partition
had not changed.  I could still mount the new partition as the same old
filesystem.  I could see all the files which were present earlier.
Doing 'df' showed me the older partition details (size, % used etc.).

Shouldn't the superblock be changed/deleted once the partition is
deleted?  I tried a reboot, but the output remained the same.

-- Sumit

From menscher at uiuc.edu Thu Apr 6 07:31:31 2006
From: menscher at uiuc.edu (Damian Menscher)
Date: Thu, 6 Apr 2006 02:31:31 -0500 (CDT)
Subject: deleting partition does not effect superblock?
In-Reply-To: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com>
References: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com>
Message-ID:

On Thu, 6 Apr 2006, Sumit Narayan wrote:

> On my system, I first created a partition with EXT3 and put some data
> on it. Later, I deleted the partition, and re-created another
> partition with the same starting block number and a higher ending
> block number.
> I intended to format it with another filesystem, but
> surprisingly (or maybe just to me), the superblock of the partition
> had not changed. I could still mount the new partition as the same old
> filesystem. I could see all the files which was present earlier. Doing
> 'df' showed me the older partition details (size, % used etc.).
>
> Shouldn't the superblock be changed/deleted once the partition is
> deleted? I tried a reboot, but the output remained the same.

This is the expected behavior.  A filesystem is created within the
partition.  If you grow the partition, the filesystem doesn't
automatically grow (use resize2fs for that).  In fact, you should
probably read the resize2fs manpage, as it might give you a starting
clue about what's going on.

Damian Menscher
--
-=#| www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-

From maillists at hosttuls.com Thu Apr 6 21:12:06 2006
From: maillists at hosttuls.com (Brandon Evans)
Date: Thu, 06 Apr 2006 14:12:06 -0700
Subject: Filesystem too large...
In-Reply-To: <4432E38C.1070506@hosttuls.com>
References: <4431BADC.8000403@hosttuls.com>
	<20060404065621.GJ17364@schatzie.adilger.int>
	<4432E38C.1070506@hosttuls.com>
Message-ID: <44358426.2030500@hosttuls.com>

Brandon Evans wrote:
> Andreas Dilger wrote:
> The server I am preparing is a sw-soft virtuozzo backup server which
> requires the 1kB blocks. The small blocks are need for the magic links
> it uses in the virtual environment.
>
>

It turns out the 1kB block size is not 100% necessary for virtuozzo, so
I just formatted with 4kB blocks and moved on.

--
Brandon Evans

"I have a theory that the truth is never told during the nine-to-five
hours."
-Hunter S. Thompson

From jbglaw at lug-owl.de Thu Apr 6 06:58:32 2006
From: jbglaw at lug-owl.de (Jan-Benedict Glaw)
Date: Thu, 6 Apr 2006 08:58:32 +0200
Subject: deleting partition does not effect superblock?
In-Reply-To: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com> References: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com> Message-ID: <20060406065832.GK13324@lug-owl.de> On Thu, 2006-04-06 14:37:33 +0800, Sumit Narayan wrote: > Shouldn't the superblock be changed/deleted once the partition is > deleted? I tried a reboot, but the output remained the same. No, everything you see is "works as expected." A partition is only a container (as are "disks", "volume groups", "RAID arrays", "logical volumes", "image files" etc.). Whenever you destroy such a container, its contents aren't deleted or otherwise modified. So it's perfectly okay to delete such a container (e.g. remove start and end from the partition table) and recreate it at some time later (by adding those values back to the partition table). As long as the new container starts at the same location, a filesystem driver will be able to find the old information. If you start a block later, it won't find its superblocks. Finally, you have several choices for how to keep old data from coming back. Most probably, you'd just zero it out before deleting the partition with something like: # cat /dev/zero > /dev/hda3 (of course with the correct device name!) MfG, JBG -- Jan-Benedict Glaw jbglaw at lug-owl.de . +49-172-7608481 _ O _ "Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg _ _ O für einen Freien Staat voll Freier Bürger" | im Internet! | im Irak! O O O ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA)); From hahaha_30k at yahoo.com Fri Apr 7 07:20:48 2006 From: hahaha_30k at yahoo.com (Robinson Tiemuqinke) Date: Fri, 7 Apr 2006 00:20:48 -0700 (PDT) Subject: How to interpret the output of 'iostat -x /dev/sdb1 20 100' ??
Message-ID: <20060407072048.17474.qmail@web36709.mail.mud.yahoo.com> Hi, I'm a newbie to tool 'iostat' and I've read the manual for iostat several times. But it doesn't help. I still get confused with the output of 'iostat', the manual seems too abstract, or high-level, for me. Let's post the output first: avg-cpu: %user %nice %sys %idle 5.70 0.00 3.15 91.15 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util /dev/sdb1 0.60 4.70 12.60 1.50 105.60 49.60 52.80 24.80 11.01 1.54 10.92 8.65 12.20 I'll ask about the rrqm/s, r/s, rsec/s, avgrq-sz, avgqu-sz, await, svctm and %util in the above output. First question: How many physical disk I/O read requests are sent to hard drive by kernel driver? is it the subtract of (r/s - rrqm/s), or just r/s? if it is r/s, then it means user&sys applications send (r/s+rrqm/s) read requests to kernel per second? Second question: ( r/s * avgrq-sz ) is 30% bigger than rsec/s, why? they should be equal or little difference related to calculation omission. Third Question: What's the UNIT of avgqu-sz, is it NONE, or sector, or something else? If it is NONE, then does it mean that the unit is 'read request'? 4th question: (await + svctm) is the time span for a read request from being dispatched (by kernel driver) to being served? If so, could we use this number as a criteria for (disk + file_system) performance ? 5th question: %util is which percentage of CPU time? it looks too abstract in manual, does it means (disk I/O opertions time) divided by (%user + %nice + %sys)? Or it is (%user + %nice + %sys) divided by all the system time lots (%user + %nice + %sys +%idle)? I got lost completely here, Please help. Thanks a lot. __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com
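A note on the second and fifth questions, offered as a sketch rather than an authoritative reading of the tool: as far as I know, the sysstat iostat of this era computes avgrq-sz over reads and writes together, (rsec/s + wsec/s) / (r/s + w/s), so multiplying it by r/s alone will not give rsec/s. Likewise %util is the share of wall-clock time the device was busy, not a share of CPU time, and it is tied to svctm by %util = (r/s + w/s) * svctm / 10. The Python below only redoes that arithmetic with the figures posted above; the dictionary and function names are made up for the illustration.

```python
# Figures copied from the posted 'iostat -x /dev/sdb1' sample.
sample = {
    "r/s": 12.60, "w/s": 1.50,
    "rsec/s": 105.60, "wsec/s": 49.60,
    "avgrq-sz": 11.01, "svctm": 8.65, "%util": 12.20,
}


def derived(s):
    """Recompute some iostat columns from the raw rates (assumed formulas)."""
    ios = s["r/s"] + s["w/s"]            # requests per second, reads + writes
    sectors = s["rsec/s"] + s["wsec/s"]  # sectors per second, reads + writes
    return {
        # Average request size in sectors, over reads AND writes.
        "avgrq-sz": sectors / ios,
        # Average size of a read request alone, for comparison.
        "read-only avg size": s["rsec/s"] / s["r/s"],
        # svctm is in ms; ios * svctm is busy ms per 1000 ms, so /10 gives %.
        "%util": ios * s["svctm"] / 10.0,
    }


d = derived(sample)
for name, value in d.items():
    print(f"{name}: {value:.2f}")
```

The recomputed avgrq-sz and %util agree with the reported 11.01 and 12.20, while the read-only average comes out near 8.4 sectors, which accounts for the roughly 30% gap between r/s * avgrq-sz and rsec/s.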
From jerume at assiniemafia.com Sun Apr 9 03:01:23 2006 From: jerume at assiniemafia.com (jerume) Date: Sun, 09 Apr 2006 05:01:23 +0200 Subject: Table creation failed Message-ID: <44387903.5090207@assiniemafia.com> Hello, I come to you beacause i have something that i dont understand : i m using udev on a debian sid with 2.6.15.1 kernel.
I have created an deprecated raid at /dev/md0 when i tried doing mkfs.ext3 /dev/md0 i have got : mke2fs 1.39-WIP (29-Mar-2006) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 4643968 inodes, 9277344 blocks 463867 blocks (5.00%) reserved for the super user First data block=0 284 block groups 32768 blocks per group, 32768 fragments per group 16352 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624 Writing inode tables: done Creating journal (32768 blocks): mkfs.ext3: Device or resource busy while trying to create journal zsh: exit 1 mkfs.ext3 /dev/md0 Could you help me please ? Thanks for open source. J?r?me. ;) From coywolf at sosdg.org Sun Apr 9 03:50:25 2006 From: coywolf at sosdg.org (Coywolf Qi Hunt) Date: Sat, 8 Apr 2006 23:50:25 -0400 Subject: Table creation failed In-Reply-To: <44387903.5090207@assiniemafia.com> References: <44387903.5090207@assiniemafia.com> Message-ID: <20060409035025.GA28159@everest.sosdg.org> On Sun, Apr 09, 2006 at 05:01:23AM +0200, jerume wrote: > Hello, > > I come to you beacause i have something that i dont understand : > > i m using udev on a debian sid with 2.6.15.1 kernel. 
> > I have created an deprecated raid at /dev/md0 > when i tried doing mkfs.ext3 /dev/md0 i have got : > > mke2fs 1.39-WIP (29-Mar-2006) > Filesystem label= > OS type: Linux > Block size=4096 (log=2) > Fragment size=4096 (log=2) > 4643968 inodes, 9277344 blocks > 463867 blocks (5.00%) reserved for the super user > First data block=0 > 284 block groups > 32768 blocks per group, 32768 fragments per group > 16352 inodes per group > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, > 2654208, > 4096000, 7962624 > > Writing inode tables: done > Creating journal (32768 blocks): mkfs.ext3: Device or resource busy > while trying to create journal > > zsh: exit 1 mkfs.ext3 /dev/md0 > > Could you help me please ? Look at http://thunk.org/hg/e2fsprogs/?cs=1bfd437f2f61 Coywolf From jerume at assiniemafia.com Mon Apr 10 12:28:22 2006 From: jerume at assiniemafia.com (jerume) Date: Mon, 10 Apr 2006 14:28:22 +0200 Subject: Table creation failed In-Reply-To: <20060409035025.GA28159@everest.sosdg.org> References: <44387903.5090207@assiniemafia.com> <20060409035025.GA28159@everest.sosdg.org> Message-ID: <443A4F66.9030602@assiniemafia.com> Coywolf Qi Hunt wrote: > On Sun, Apr 09, 2006 at 05:01:23AM +0200, jerume wrote: > >> Hello, >> >> I come to you beacause i have something that i dont understand : >> >> i m using udev on a debian sid with 2.6.15.1 kernel. 
>> >> I have created an deprecated raid at /dev/md0 >> when i tried doing mkfs.ext3 /dev/md0 i have got : >> >> mke2fs 1.39-WIP (29-Mar-2006) >> Filesystem label= >> OS type: Linux >> Block size=4096 (log=2) >> Fragment size=4096 (log=2) >> 4643968 inodes, 9277344 blocks >> 463867 blocks (5.00%) reserved for the super user >> First data block=0 >> 284 block groups >> 32768 blocks per group, 32768 fragments per group >> 16352 inodes per group >> Superblock backups stored on blocks: >> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, >> 2654208, >> 4096000, 7962624 >> >> Writing inode tables: done >> Creating journal (32768 blocks): mkfs.ext3: Device or resource busy >> while trying to create journal >> >> zsh: exit 1 mkfs.ext3 /dev/md0 >> >> Could you help me please ? >> > > Look at http://thunk.org/hg/e2fsprogs/?cs=1bfd437f2f61 > > Coywolf > Thanks you :) I would rather wait for the update package unless it'll take toolong. Let me know what you think of it ;) bye From jbglaw at lug-owl.de Mon Apr 10 16:00:52 2006 From: jbglaw at lug-owl.de (Jan-Benedict Glaw) Date: Mon, 10 Apr 2006 18:00:52 +0200 Subject: deleting partition does not effect superblock? In-Reply-To: References: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com> <20060406065832.GK13324@lug-owl.de> Message-ID: <20060410160052.GO13324@lug-owl.de> On Mon, 2006-04-10 17:28:18 +0200, Jan Engelhardt wrote: > >deleted) or otherwise modified. So it's perfectly okay to delete such > >a container (eg. remove start and end from the partition table) and > >recreate it at some time later (by adding those values back to the > >partition table.) As long as the new container starts at the same > >location, a filesystem driver will be able to find the old > >information. If you start a block later, it won't find it's > >superblocks. > > > If using a filesystem with replicated superblocks (ext*, xfs), then ...? > [Includes expecting weird breakage.] 
I'll possibly test if this works in another life... MfG, JBG -- Jan-Benedict Glaw jbglaw at lug-owl.de . +49-172-7608481 _ O _ "Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg _ _ O für einen Freien Staat voll Freier Bürger" | im Internet! | im Irak! O O O ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA)); From tytso at mit.edu Mon Apr 10 16:08:37 2006 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 10 Apr 2006 12:08:37 -0400 Subject: Table creation failed In-Reply-To: <443A4F66.9030602@assiniemafia.com> References: <44387903.5090207@assiniemafia.com> <20060409035025.GA28159@everest.sosdg.org> <443A4F66.9030602@assiniemafia.com> Message-ID: <20060410160837.GB24654@thunk.org> On Mon, Apr 10, 2006 at 02:28:22PM +0200, jerume wrote: > >>Writing inode tables: done > >>Creating journal (32768 blocks): mkfs.ext3: Device or resource busy > >> while trying to create journal > >> > > > >Look at http://thunk.org/hg/e2fsprogs/?cs=1bfd437f2f61 > > > I would rather wait for the update package unless it'll take too long. > Let me know what you think of it ;) I just put out a new WIP release (09-Apr-2006) last night/this morning which has fixes for this and for the AMD64 build bug, both of which were biting folks with the last WIP release. It can be found at: https://sourceforge.net/project/showfiles.php?group_id=2406 - Ted From sev at bnl.gov Tue Apr 11 16:34:12 2006 From: sev at bnl.gov (Sev Binello) Date: Tue, 11 Apr 2006 12:34:12 -0400 Subject: ext3 filesystem corruption Message-ID: <443BDA84.7010102@bnl.gov> Hi - We have had 3 rather major occurrences of ext3 filesystem corruption lately, i.e. so bad we couldn't even mount, and fsck didn't help. I am looking for pointers that could help us investigate the root cause. In general...
We are running RedHat WS 3 Update 6, 2.4.21-40.2.ELsmp or 2.4.21-37.ELsmp We have a small SAN system that looks like this 3 NFS servers each containing 2 Qlocic hba's connected to 2 qlogic switches connected to an nstor (now xyratex) 6TB raid system containing 2 (active-active) controllers. On the first 2 occasions one of the controllers was failed over. On a 3rd occasion both SAN switches lost power, and the hosts and raid lost communication. On all occasions the qlocic failover driver tried to start up on the alternate HBA. On the first 2 instances we sort of tried to blame the controller. On the 3rd, that was harder to do since the raid system and the hosts stayed up but lost communication. I can provide more detail if anyone as any info on how to proceed. Thanks -Sev -- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 sev at bnl.gov From adilger at clusterfs.com Tue Apr 11 17:28:56 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 11 Apr 2006 11:28:56 -0600 Subject: ext3 filesystem corruption In-Reply-To: <443BDA84.7010102@bnl.gov> References: <443BDA84.7010102@bnl.gov> Message-ID: <20060411172856.GA17364@schatzie.adilger.int> On Apr 11, 2006 12:34 -0400, Sev Binello wrote: > We are running RedHat WS 3 Update 6, 2.4.21-40.2.ELsmp or > 2.4.21-37.ELsmp > > We have a small SAN system that looks like this > > 3 NFS servers each containing 2 Qlocic hba's connected to 2 > qlogic switches > connected to an nstor (now xyratex) 6TB raid system containing > 2 (active-active) controllers. Does this imply you have a 6TB ext3 filesystem? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From sev at bnl.gov Wed Apr 12 23:28:40 2006 From: sev at bnl.gov (Sev Binello) Date: Wed, 12 Apr 2006 19:28:40 -0400 Subject: ext3 filesystem corruption - more info In-Reply-To: <443BDA84.7010102@bnl.gov> References: <443BDA84.7010102@bnl.gov> Message-ID: <443D8D28.3090202@bnl.gov> An HTML attachment was scrubbed... 
URL: From menscher at uiuc.edu Thu Apr 13 00:06:11 2006 From: menscher at uiuc.edu (Damian Menscher) Date: Wed, 12 Apr 2006 19:06:11 -0500 (CDT) Subject: ext3 filesystem corruption - more info In-Reply-To: <443D8D28.3090202@bnl.gov> References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> Message-ID: I've seen similar errors when attempting to have a >2TB filesystem on a 32-bit RHEL3 machine. We have since implemented a 3.5TB filesystem on a 64-bit RHEL4 machine. It would help if you could answer the question Andreas Dilger posed: "Does this imply you have a 6TB ext3 filesystem?" Damian On Wed, 12 Apr 2006, Sev Binello wrote: > > Hi - > > In case this helps, > we got the following messages from EXT3 before the filesystem went > Does anyone recognize these..... > > //seems to mount okay > Mar 25 17:52:30 acnlin82 kernel: EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,33), > internal journal > Mar 25 17:52:30 acnlin82 kernel: EXT3-fs: recovery complete. > Mar 26 00:04:01 acnlin82 kernel: EXT3-fs: mounted filesystem with ordered data > mode. 
> > //soon as nfs clients start get a TON of errors like this > Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: > Freeing blocks not in datazone - block = 3443589120, count = 1 > Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: > Freeing blocks not in datazone - block = 2113834232, count = 1 > Mar 26 00:07:22 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: > bit already cleared for block 49125 > > //interspersed with some of these > Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device > Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1891463980, limit=1722264358 > Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device > Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1824250576, limit=1722264358 > Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device > > Then we had to reboot and basically filesystem is shot > > Thanks > -Sev > > Sev Binello wrote: > Hi - > > We have had 3 rather major occurances of ext3 filesystem corruption > lately, > i.e. so bad we couldn't event mount, and fsck didn't help. > > I am looking for pointers, that could help us investigate the root > cause. > > In general... > We are running RedHat WS 3 Update 6, 2.4.21-40.2.ELsmp or > 2.4.21-37.ELsmp > > We have a small SAN system that looks like this > 3 NFS servers each containing 2 Qlocic hba's connected to 2 > qlogic switches > connected to an nstor (now xyratex) 6TB raid system containing 2 > (active-active) controllers. > > On the first 2 occasions one of the controllers was failed over. > On a 3rd occasion both SAN switches lost power, and the hosts and raid > lost communication. > > > On all occasions the qlocic failover driver tried to start up on the > alternate HBA. > > On the first 2 instances we sort of tried to blame the controller. 
> On the 3rd, that was harder to do since the raid system and the hosts > stayed up > but lost communication. > > I can provide more detail if anyone as any info on how to proceed. > > Thanks > -Sev > > > > -- > > Sev Binello > Brookhaven National Laboratory > Upton, New York > 631-344-5647 > sev at bnl.gov > > Damian Menscher -- -=#| www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=- -=#| The above opinions are not necessarily those of my employers. |#=- From sev at bnl.gov Thu Apr 13 00:20:48 2006 From: sev at bnl.gov (Sev Binello) Date: Wed, 12 Apr 2006 20:20:48 -0400 Subject: ext3 filesystem corruption - more info In-Reply-To: References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> Message-ID: <443D9960.20402@bnl.gov> An HTML attachment was scrubbed... URL: From adilger at clusterfs.com Thu Apr 13 05:40:56 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Wed, 12 Apr 2006 23:40:56 -0600 Subject: ext3 filesystem corruption - more info In-Reply-To: <443D8D28.3090202@bnl.gov> References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> Message-ID: <20060413054056.GP17364@schatzie.adilger.int> On Apr 12, 2006 19:28 -0400, Sev Binello wrote: [HTML-only email] - it would be preferred if you used plain text, or at least multipart/mixed for your email to this list... 
> //soon as nfs clients start get a TON of errors like this > Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): > ext3_free_blocks: Freeing blocks not in datazone - block = 3443589120, count = 1 > Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): > ext3_free_blocks: Freeing blocks not in datazone - block = 2113834232, count = 1 > Mar 26 00:07:22 acnlin82 kernel: EXT3-fs error (device sd(8,49)): > ext3_free_blocks: bit already cleared for block 49125 > //interspersed with some of these > Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device > Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1891463980, limit=1722264358 > Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device > Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1824250576, limit=1722264358 > Mar 26 00:10:56 acnlin82 kernel: attempt to access beyond end of device These indicate that the kernel ext3 code detected serious corruption of the metadata on the filesystem. In cases like this, if the filesystem doesn't remount readonly (i.e. mounted with "-o errors=remount-ro") then it just makes the corruption progressively worse. It doesn't point to a root cause, however. > Would it be a problem if the two 1.8TB systems appeared on one host? No, some of our customers have hundreds of systems with two ext3 filesystems of about this size, running on 2.4.21-RHEL3 kernels. The LUNs exported from the RAID storage are all under 2TB. They have never reported similar problems over several years of usage. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
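For anyone wanting to see why these messages count as serious metadata corruption, here is a rough Python sketch that pulls the numbers out of the quoted log lines. It rests on two assumptions not confirmed in the thread: that the 2.4 kernel prints want= and limit= in 1 KiB units, and that the filesystem uses 4 KiB blocks. The names KERNEL_LOG, FS_BLOCK_KIB and analyse() are invented for the illustration.

```python
import re

# Lines copied from the log excerpt quoted above.
KERNEL_LOG = """\
Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: Freeing blocks not in datazone - block = 3443589120, count = 1
Mar 26 00:07:19 acnlin82 kernel: EXT3-fs error (device sd(8,49)): ext3_free_blocks: Freeing blocks not in datazone - block = 2113834232, count = 1
Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1891463980, limit=1722264358
Mar 26 00:10:56 acnlin82 kernel: 08:31: rw=0, want=1824250576, limit=1722264358
"""

ACCESS_RE = re.compile(
    r"(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}): rw=\d+, "
    r"want=(?P<want>\d+), limit=(?P<limit>\d+)"
)
BLOCK_RE = re.compile(r"block = (?P<block>\d+), count")

FS_BLOCK_KIB = 4  # ASSUMPTION: 4 KiB filesystem blocks, not stated in the thread


def analyse(log):
    """Return (device size in KiB, overruns per request, out-of-range blocks)."""
    limit_kib = None
    overruns = []
    for m in ACCESS_RE.finditer(log):
        # The device is printed as hex major:minor, so 08:31 is sd(8,49).
        major, minor = (int(part, 16) for part in m["dev"].split(":"))
        limit_kib = int(m["limit"])
        overruns.append(((major, minor), int(m["want"]) - limit_kib))
    bad_blocks = []
    if limit_kib is not None:
        max_block = limit_kib // FS_BLOCK_KIB
        bad_blocks = [
            int(m["block"])
            for m in BLOCK_RE.finditer(log)
            if int(m["block"]) >= max_block
        ]
    return limit_kib, overruns, bad_blocks


limit_kib, overruns, bad_blocks = analyse(KERNEL_LOG)
print(f"device size: {limit_kib * 1024 / 1e12:.2f} TB")
for dev, over in overruns:
    print(f"device {dev}: request ends {over} KiB past the end")
print("block numbers outside the device:", bad_blocks)
```

Under those assumptions the device is about 1.76 TB, which fits the "1.8TB" LUNs mentioned, and both block numbers passed to ext3_free_blocks lie far beyond anything the device could hold. That is consistent with block pointers read from damaged metadata, though it says nothing about what damaged them.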
From sev at bnl.gov Thu Apr 13 14:40:25 2006 From: sev at bnl.gov (Sev Binello) Date: Thu, 13 Apr 2006 10:40:25 -0400 Subject: ext3 filesystem corruption - more info In-Reply-To: <20060413054056.GP17364@schatzie.adilger.int> References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> <20060413054056.GP17364@schatzie.adilger.int> Message-ID: <443E62D9.4060404@bnl.gov> An HTML attachment was scrubbed... URL: From sev at bnl.gov Thu Apr 13 19:54:50 2006 From: sev at bnl.gov (Sev Binello) Date: Thu, 13 Apr 2006 15:54:50 -0400 Subject: ext3 filesystem corruption - more info In-Reply-To: <20060413192909.GV17364@schatzie.adilger.int> References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> <20060413054056.GP17364@schatzie.adilger.int> <443E62D9.4060404@bnl.gov> <20060413192909.GV17364@schatzie.adilger.int> Message-ID: <443EAC8A.9020209@bnl.gov> An HTML attachment was scrubbed... URL: From sev at bnl.gov Thu Apr 13 20:40:40 2006 From: sev at bnl.gov (Sev Binello) Date: Thu, 13 Apr 2006 16:40:40 -0400 Subject: ext3 filesystem corruption - more info (in text) In-Reply-To: <443EAC8A.9020209@bnl.gov> References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> <20060413054056.GP17364@schatzie.adilger.int> <443E62D9.4060404@bnl.gov> <20060413192909.GV17364@schatzie.adilger.int> <443EAC8A.9020209@bnl.gov> Message-ID: <443EB748.20606@bnl.gov> Sorry about all the html Resending last message in text Sev Binello wrote: > Andreas Dilger wrote: > >>On Apr 13, 2006 10:40 -0400, Sev Binello wrote: >>[ still HTML-only email, extracting text from HTML is getting dull ] >> >> >>>Since it seemed to mount okay only 3mins earlier,
>>>can we assume that it was initially uncorrupted ?
>>>Or is that not a valid assumption?
>>> >>> >> >>No, at mount time there is only very cursory checking done of the group >>descriptors and superblock. The corruption reported appears to be from >>bad indirect blocks. >> >> >> >>>Is there anything that we can check, test etc...
>>>any advice, action at this point is better than waiting for the next >>>filesystem disaster to occur.
>>> >>> >> >>Do you run with write cache enabled on your device? That can potentially >>cause filesystem corruption even in the face of ext3 journaling, because >>the journal atomicity guarantees are lost when the device reports a write >>is complete on disk when it really isn't. >> >> > The raid system does run with write back cache enabled. > I don't believe the actual drives have this enabled, but I'd have to check. > > But we didn't actually lose power on the raid or hosts > just the connecting switches, so we lost all communication. > Presumably, in this situation the controller cache should have been emptied > Is my reasoning correct here ? > > Either way, you are saying is best to avoid write cacheing in the future. > > Also, in looking and comparing error msgs in the log files > I noticed that on the host where the corruption occurred, > the call to abort the journal didn't seem to actually happen for an hour > Does that have any significance ? > > Mar 25 14:38:52 acnlin83 kernel: Error (-5) on journal on device 08:21 > Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33). > > 1hr gap > Mar 25 15:39:19 acnlin83 kernel: ext3_abort called. > Mar 25 15:39:19 acnlin83 kernel: EXT3-fs abort (device sd(8,33)): > ext3_journal_start: Detected aborted journal > Mar 25 15:39:19 acnlin83 kernel: Remounting filesystem read-only > Mar 25 15:39:19 acnlin83 kernel: EXT3-fs error (device sd(8,33)) > in start_transaction: Journal has aborted > > Thanks again > -Sev > >>Cheers, Andreas >>-- >>Andreas Dilger >>Principal Software Engineer >>Cluster File Systems, Inc. 
>> >> >> > > > -- > > Sev Binello > Brookhaven National Laboratory > Upton, New York > 631-344-5647 > sev at bnl.gov > > > ------------------------------------------------------------------------ > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users -- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 sev at bnl.gov From adilger at clusterfs.com Thu Apr 13 22:12:10 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Thu, 13 Apr 2006 16:12:10 -0600 Subject: ext3 filesystem corruption - more info (in text) In-Reply-To: <443EB748.20606@bnl.gov> References: <443BDA84.7010102@bnl.gov> <443D8D28.3090202@bnl.gov> <20060413054056.GP17364@schatzie.adilger.int> <443E62D9.4060404@bnl.gov> <20060413192909.GV17364@schatzie.adilger.int> <443EAC8A.9020209@bnl.gov> <443EB748.20606@bnl.gov> Message-ID: <20060413221210.GA17364@schatzie.adilger.int> On Apr 13, 2006 16:40 -0400, Sev Binello wrote: >Andreas Dilger wrote: >>Do you run with write cache enabled on your device? That can potentially >>cause filesystem corruption even in the face of ext3 journaling, because >>the journal atomicity guarantees are lost when the device reports a write >>is complete on disk when it really isn't. >> >> >The raid system does run with write back cache enabled. >I don't believe the actual drives have this enabled, but I'd have to >check. > >But we didn't actually lose power on the raid or hosts >just the connecting switches, so we lost all communication. >Presumably, in this situation the controller cache should have been >emptied Is my reasoning correct here ? Correct. If your RAID has w/b cache enabled, but is battery backed, you should be OK. Beyond this, I'm not sure what else you can look at. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
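The write cache point can be illustrated with a toy model in Python. This is only a sketch under loud assumptions: it is not how ext3, jbd or any real RAID controller is implemented, and CachedDisk, journal_transaction() and replay() are hypothetical names. It shows a device that acknowledges writes from a volatile cache and may destage them in any order, so a commit record can reach the media without the journal data it is supposed to vouch for.

```python
class CachedDisk:
    """Toy device that acknowledges writes as soon as they are cached."""

    def __init__(self):
        self.media = {}   # what has really reached stable storage
        self.cache = []   # acknowledged writes still in volatile cache

    def write(self, block, data):
        # The host is told the write is complete at this point.
        self.cache.append((block, data))

    def power_loss(self, persisted):
        # Only the cached writes whose indices are listed made it to the
        # media; a cache is free to destage in any order it likes.
        for index in persisted:
            block, data = self.cache[index]
            self.media[block] = data
        self.cache.clear()


def journal_transaction(disk):
    # The journal writes its data first and the commit record second,
    # relying on "complete" meaning "on stable storage".
    disk.write("journal:data", "new metadata")
    disk.write("journal:commit", "commit")


def replay(disk):
    # Recovery trusts the commit record: if it is there, the journal
    # data is assumed to be valid and gets applied.
    if disk.media.get("journal:commit") == "commit":
        return disk.media.get("journal:data", "<garbage>")
    return "old metadata"


def outcome(persisted):
    disk = CachedDisk()
    journal_transaction(disk)
    disk.power_loss(persisted)
    return replay(disk)


for persisted in ([], [0], [0, 1], [1]):
    print(persisted, "->", outcome(persisted))
```

Only the last case, where the commit record is destaged but the journal data is not, replays garbage; the other three end up with either the old or the new metadata, which is the all-or-nothing behaviour the journal is meant to give. A cache that survives power loss, such as a battery-backed one, removes that case in this model because every acknowledged write eventually reaches the media.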
From martial at server101.com Thu Apr 13 22:30:19 2006 From: martial at server101.com (Martial Herbaut) Date: Fri, 14 Apr 2006 08:30:19 +1000 (EST) Subject: ext3 filesystem corruption - more info (in text) In-Reply-To: <20060413221210.GA17364@schatzie.adilger.int> Message-ID: > >But we didn't actually lose power on the raid or hosts > >just the connecting switches, so we lost all communication. > >Presumably, in this situation the controller cache should have been > >emptied Is my reasoning correct here ? > > Correct. If your RAID has w/b cache enabled, but is battery backed, you > should be OK. > > Beyond this, I'm not sure what else you can look at. > don't mean to barge in, however I have seen similar corruption happen in the past where the fabric went away momentarily, like unplugging and replugging a fibre cable on a non-dualpath/failover setup but the host was not killed/rebooted. From memory the corruption was not immediately apparent and became so later. I think the best thing to do in that case scenario is force a reboot of the host and then force fsck as opposed to continuing on and hope for the best. Martial Herbaut --------------- Server101.com From sev at bnl.gov Fri Apr 14 14:21:56 2006 From: sev at bnl.gov (Sev Binello) Date: Fri, 14 Apr 2006 10:21:56 -0400 Subject: ext3 filesystem corruption - more info (in text) In-Reply-To: References: Message-ID: <443FB004.2040809@bnl.gov> Thanks for the suggestion, seems reasonable unfortunately on a operational system it means a lot of down time, but we end up there anyway. Thanks -Sev Martial Herbaut wrote: > >>>But we didn't actually lose power on the raid or hosts >>>just the connecting switches, so we lost all communication. >>>Presumably, in this situation the controller cache should have been >>>emptied Is my reasoning correct here ? >> >>Correct. If your RAID has w/b cache enabled, but is battery backed, you >>should be OK. >> >>Beyond this, I'm not sure what else you can look at. 
>> > > > don't mean to barge in, however I have seen similar corruption happen in > the past where the fabric went away momentarily, like unplugging and > replugging a fibre cable on a non-dualpath/failover setup but the host > was not killed/rebooted. From memory the corruption was not immediately > apparent and became so later. > > I think the best thing to do in that case scenario is force a reboot of > the host and then force fsck as opposed to continuing on and hope for the > best. > > Martial Herbaut > --------------- > Server101.com > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users -- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 sev at bnl.gov From jlb17 at duke.edu Fri Apr 14 22:31:27 2006 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Fri, 14 Apr 2006 18:31:27 -0400 (EDT) Subject: Ext3 and 3ware RAID5 Message-ID: I run a decent amount of 3ware hardware, all under centos-4. There seems to be some sort of fundamental disagreement between ext3 and 3ware's hardware RAID5 mode that trashes write performance. As a representative example, one current setup is 2 9550SX-12 boards in hardware RAID5 mode (256KB stripe size) with a software RAID0 stripe on top (also 256KB chunks). bonnie++ results look like this: mount -t ext3 175 MB/s writes, 352 MB/s reads mount -t ext3 -o data=writeback 185 MB/s writes, 254 MB/s reads mount -t ext2 340 MB writes, 266 MB/s reads XFS on this hardware gets (untuned) about 300 MB/s writes and 400 MB/s reads. The hardware itself is capable of more, and those results are representative of several different configs of hardware and software RAID options. Any ideas as to what leads to ext3's performance hit? I've tested *lots* of configurations. It's not the md layer -- 1 card setups see the same performance hit. Thanks. 
-- Joshua Baker-LePain Department of Biomedical Engineering Duke University From alex at alex.org.uk Sat Apr 15 11:18:00 2006 From: alex at alex.org.uk (Alex Bligh) Date: Sat, 15 Apr 2006 12:18:00 +0100 Subject: Ext3 and 3ware RAID5 In-Reply-To: References: Message-ID: --On 14 April 2006 18:31 -0400 Joshua Baker-LePain wrote: > Any ideas as to what leads to ext3's performance hit? I've tested *lots* > of configurations. It's not the md layer -- 1 card setups see the same > performance hit. No idea, but I suffer the same problem with a 9550SX-4 with SATA3 drives. I don't think ext3 is the problem as dd gives much the same behaviour (you might want to try it). My reluctant conclusion is that the 9550SX is just dog slow, which is why the drives are currently sitting idle. I would love someone from 3ware to disprove this. Alex From jlb17 at duke.edu Sat Apr 15 11:33:48 2006 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Sat, 15 Apr 2006 07:33:48 -0400 (EDT) Subject: Ext3 and 3ware RAID5 In-Reply-To: References: Message-ID: On Sat, 15 Apr 2006 at 12:18pm, Alex Bligh wrote > > > --On 14 April 2006 18:31 -0400 Joshua Baker-LePain wrote: > >> Any ideas as to what leads to ext3's performance hit? I've tested *lots* >> of configurations. It's not the md layer -- 1 card setups see the same >> performance hit. > > No idea, but I suffer the same problem with a 9550SX-4 with SATA3 > drives. I don't think ext3 is the problem as dd gives much the same > behaviour (you might want to try it). My reluctant conclusion is > that the 9550SX is just dog slow, which is why the drives are currently > sitting idle. I would love someone from 3ware to disprove this. When ext2 is almost 2X faster than ext3 at writing, it points pretty firmly at something in the journaling code (IMO). And a journaling FS can do decently with this hardware, as XFS also gets ~300MB/s writing. 
-- Joshua Baker-LePain Department of Biomedical Engineering Duke University From alex at alex.org.uk Sat Apr 15 11:44:58 2006 From: alex at alex.org.uk (Alex Bligh) Date: Sat, 15 Apr 2006 12:44:58 +0100 Subject: Ext3 and 3ware RAID5 In-Reply-To: References: Message-ID: <9EDC1C2D4A1EBDB4CC54DF2B@[192.168.100.25]> --On 15 April 2006 07:33 -0400 Joshua Baker-LePain wrote: >> No idea, but I suffer the same problem with a 9550SX-4 with SATA3 >> drives. I don't think ext3 is the problem as dd gives much the same >> behaviour (you might want to try it). My reluctant conclusion is >> that the 9550SX is just dog slow, which is why the drives are currently >> sitting idle. I would love someone from 3ware to disprove this. > > When ext2 is almost 2X faster than ext3 at writing, it points pretty > firmly at something in the journaling code (IMO). And a journaling FS > can do decently with this hardware, as XFS also gets ~300MB/s writing. Sorry forgot to address that point. Writes seem to be especially slow (there is/was some stuff on the 3-ware site saying slow writes under Linux were a known problem). I am presuming that ext3 simply does more writing (even if the extra writes are small and discontiguous but numerous) than ext2 due to journalling, and this shows up the poor performance. You might run some benchmarks on the raw partition and see just how slow writes are. You might also take a look at the 3ware site as I think there might have been some tuning options they suggested which allegedly improved things (I gave up) - stripe size comes to mind. Anyway, do let me know if you find the answer... 
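A minimal sketch of the raw sequential-write check suggested above, assuming GNU dd (for conv=fsync) and a scratch target you can afford to overwrite. The function names are hypothetical, and pointing the target at a raw partition destroys whatever is on it.

```shell
#!/bin/sh
# seq_write_test TARGET MIB
# Write MIB mebibytes of zeros to TARGET in 1 MiB blocks. conv=fsync (GNU dd)
# flushes the data before dd exits, so the rate dd reports includes the flush.
# WARNING: if TARGET is a block device, its contents are overwritten.
seq_write_test() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync
}

# mb_per_s BYTES SECONDS
# Integer throughput in MB/s (1 MB = 10^6 bytes), for comparing runs by hand.
mb_per_s() {
    echo $(( $1 / $2 / 1000000 ))
}
```

Running the same test against a file on ext2, ext3 and XFS, and then (only if the data is expendable) against the block device itself, should show whether the write penalty follows the filesystem or the controller.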
Alex From julius.junghans at gmx.de Sat Apr 15 11:54:05 2006 From: julius.junghans at gmx.de (julius Junghans) Date: Sat, 15 Apr 2006 13:54:05 +0200 Subject: Partition not recognized by mount Message-ID: <4440DEDD.1030706@gmx.de> Hi, somehow after a power failure i can't mount my ext3 partition :( mount /dev/hdd2 /mnt/gentoo/ mount: you must specify the filesystem type fdisk /dev/hdd The number of cylinders for this disk is set to 484521. There is nothing wrong with that, but this is larger than 1024, and could in certain setups cause problems with: 1) software that runs at boot time (e.g., old versions of LILO) 2) booting and partitioning software from other OSs (e.g., DOS FDISK, OS/2 FDISK) Command (m for help): p Disk /dev/hdd: 250.0 GB, 250059350016 bytes 16 heads, 63 sectors/track, 484521 cylinders Units = cylinders of 1008 * 512 = 516096 bytes Device Boot Start End Blocks Id System /dev/hdd1 1 970 488848+ 83 Linux /dev/hdd2 971 155114 77688576 83 Linux mount -t ext3 /dev/hdd2 /mnt/gentoo/ mount: wrong fs type, bad option, bad superblock on /dev/hdd2, missing codepage or other error In some cases useful info is found in syslog - try dmesg | tail or so dmesg: VFS: Can't find ext3 filesystem on dev hdd2. VFS: Can't find ext3 filesystem on dev hdd2. What can i do to get my data back? Julius From seanos at seanos.net Sat Apr 15 11:57:20 2006 From: seanos at seanos.net (Sean O Sullivan) Date: Sat, 15 Apr 2006 12:57:20 +0100 Subject: Ext3 and 3ware RAID5 In-Reply-To: <9EDC1C2D4A1EBDB4CC54DF2B@[192.168.100.25]> References: <9EDC1C2D4A1EBDB4CC54DF2B@[192.168.100.25]> Message-ID: <4440DFA0.5070703@seanos.net> Alex Bligh wrote: > > > --On 15 April 2006 07:33 -0400 Joshua Baker-LePain wrote: > >>> No idea, but I suffer the same problem with a 9550SX-4 with SATA3 >>> drives. I don't think ext3 is the problem as dd gives much the same >>> behaviour (you might want to try it). 
My reluctant conclusion is >>> that the 9550SX is just dog slow, which is why the drives are currently >>> sitting idle. I would love someone from 3ware to disprove this. >> >> When ext2 is almost 2X faster than ext3 at writing, it points pretty >> firmly at something in the journaling code (IMO). And a journaling FS >> can do decently with this hardware, as XFS also gets ~300MB/s writing. > I had similar problems with 9500S-8, and searched about, and eventually found some useful information. Try mounting the volume with the 'noreservation' option. Also, it is well worth your time setting 'blockdev', for example: blockdev --setra 20000 /dev/sde and put this in /etc/rc.local. Note blockdev is really something you have to just mess/experiment with. I went from 12000 to 26000, and some of the differences between steps of 1000, or even 500 at times, were amazing. My volume is used for storage only and write speed is not too important, so I really only did this out of curiosity. There is also a fair bit of interesting information in 3ware's knowledge base. Regards, Sean From jlb17 at duke.edu Sat Apr 15 12:23:38 2006 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Sat, 15 Apr 2006 08:23:38 -0400 (EDT) Subject: Ext3 and 3ware RAID5 In-Reply-To: <4440DFA0.5070703@seanos.net> References: <9EDC1C2D4A1EBDB4CC54DF2B@[192.168.100.25]> <4440DFA0.5070703@seanos.net> Message-ID: On Sat, 15 Apr 2006 at 12:57pm, Sean O Sullivan wrote > Alex Bligh wrote: >> >> --On 15 April 2006 07:33 -0400 Joshua Baker-LePain wrote: >> >>>> No idea, but I suffer the same problem with a 9550SX-4 with SATA3 >>>> drives. I don't think ext3 is the problem as dd gives much the same >>>> behaviour (you might want to try it). My reluctant conclusion is >>>> that the 9550SX is just dog slow, which is why the drives are currently >>>> sitting idle. I would love someone from 3ware to disprove this.
>>> >>> When ext2 is almost 2X faster than ext3 at writing, it points pretty >>> firmly at something in the journaling code (IMO). And a journaling FS >>> can do decently with this hardware, as XFS also gets ~300MB/s writing. >> > I had similar problems with 9500S-8, and searched about, and eventually found > some useful information. > Try mounting the volume with the 'noreservation' option. AFAIK, that bug was fixed in the most recent centos/RHEL kernel. In any case, noreservation made no difference. > Also, it is well worth your time setting 'blockdev' > for example : blockdev --setra 20000 /dev/sde > and this put this in /etc/rc.local Yeah, I've already played around with blockdev a lot. It made some difference, but nothing extraordinary. -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From keld at dkuug.dk Sun Apr 16 12:30:29 2006 From: keld at dkuug.dk (Keld =?iso-8859-1?Q?J=F8rn?= Simonsen) Date: Sun, 16 Apr 2006 14:30:29 +0200 Subject: e2fsck dies with signal 11 Message-ID: <20060416123029.GA11999@rap.rap.dk> Hi I got a strange error, happening on two of my ext3 partitions. What can be wrong? And why does e2fsck error out, instead of displaying an error message? Best regards keld fsck /dev/hda6 fsck 1.38 (30-Jun-2005) e2fsck 1.38 (30-Jun-2005) Warning... fsck.ext3 for device /dev/hda6 exited with signal 11. 
also From my dmesg: <1>general protection fault: e7a8 [#3] Modules linked in: i915 drm snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd_page_alloc snd soundcore lp parport_pc ppdev parport ipt_REJECT ipt_LOG ipt_state ipt_pkttype ipt_set ipt_CONNMARK ipt_MARK ipt_ROUTE ipt_connmark ipt_owner ipt_recent ipt_iprange ipt_physdev ipt_multiport ipt_conntrack iptable_mangle ip_set_portmap ip_set_macipmap ip_set_ipmap ip_set_iphash ip_set ip_nat_irc ip_nat_tftp ip_nat_ftp iptable_nat ip_conntrack_irc ip_conntrack_tftp ip_conntrack_ftp ip_conntrack iptable_filter ip_tables 8139too mii af_packet ide_cd loop ext3 jbd nls_iso8859_1 nls_cp850 vfat fat intel_agp nvram amd64_agp agpgart evdev bttv video_buf firmware_class i2c_algo_bit v4l2_common btcx_risc tveeprom i2c_core videodev dm_mod sata_vsc sata_via sata_svw sata_sil sata_promise sata_nv sx8 sata_uli sata_sx4 sata_sis sata_qstor pata_pdc2027x ahci BusLogic aic7xxx scsi_transport_spi sg sr_mod cdrom ata_piix libata reiserfs usb_storage sd_mod scsi_mod usbhid ohci_hcd ehci_hcd uhci_hcd usbcore CPU: 0 EIP: 00c0:[<000023c1>] Not tainted VLI EFLAGS: 00210046 (2.6.12-oci6.mdk-i586-up-1GB) EIP is at 0x23c1 eax: 00000292 ebx: 00000001 ecx: 00000000 edx: 00000000 esi: ffffffff edi: 00200014 ebp: bc569e5c esp: bc569e54 ds: 00c8 es: 0000 ss: 0068 Process fsck.ext3 (pid: 4220, threadinfo=bc568000 task=b2f13020) Stack: 462c44b1 00009e5c 000000c8 ffff0292 9e7000c0 00000001 530a0000 00200016 00b8467c 00000000 bc569ebc b0111311 00000060 bc569ebc 00200292 b11e007b 0020007b 00000000 b1292d98 00000000 00000000 a7df0000 bc560000 bc569f1a Call Trace: [] show_stack+0x9b/0xb0 [] show_registers+0x11b/0x190 [] die+0xb5/0x130 [] do_general_protection+0x13a/0x160 [] error_code+0x4f/0x60 [] 0xffff0292 Code: Bad EIP value. 
From coywolf at sosdg.org Sun Apr 16 13:55:54 2006 From: coywolf at sosdg.org (Coywolf Qi Hunt) Date: Sun, 16 Apr 2006 09:55:54 -0400 Subject: Partition not recognized by mount In-Reply-To: <4440DEDD.1030706@gmx.de> References: <4440DEDD.1030706@gmx.de> Message-ID: <20060416135554.GA30746@everest.sosdg.org> On Sat, Apr 15, 2006 at 01:54:05PM +0200, julius Junghans wrote: > Hi, > > somehow after a power failure i can't mount my ext3 partition :( > > mount /dev/hdd2 /mnt/gentoo/ > mount: you must specify the filesystem type > > fdisk /dev/hdd > > The number of cylinders for this disk is set to 484521. > There is nothing wrong with that, but this is larger than 1024, > and could in certain setups cause problems with: > 1) software that runs at boot time (e.g., old versions of LILO) > 2) booting and partitioning software from other OSs > (e.g., DOS FDISK, OS/2 FDISK) > > Command (m for help): p > > Disk /dev/hdd: 250.0 GB, 250059350016 bytes > 16 heads, 63 sectors/track, 484521 cylinders > Units = cylinders of 1008 * 512 = 516096 bytes > > Device Boot Start End Blocks Id System > /dev/hdd1 1 970 488848+ 83 Linux > /dev/hdd2 971 155114 77688576 83 Linux > > > mount -t ext3 /dev/hdd2 /mnt/gentoo/ > mount: wrong fs type, bad option, bad superblock on /dev/hdd2, > missing codepage or other error > In some cases useful info is found in syslog - try > dmesg | tail or so > > > dmesg: > VFS: Can't find ext3 filesystem on dev hdd2. > VFS: Can't find ext3 filesystem on dev hdd2. > > > What can i do to get my data back? > > Julius What were you doing before the power failure? I have lost my superblock before too, and I did get my filesystem back. To get your filesystem back, you need to locate your backup superblocks. I wrote a simple program to find my superblock last time. There is also one in the e2fsprogs source package. Then you could use dd(1) to copy your backup sb onto your primary sb. Or you could try mount with the sb=n option. Good luck.
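As a rough illustration of locating backup superblocks: on ext2/ext3 filesystems created with the sparse_super feature, backups live in block group 1 and in groups whose number is a power of 3, 5 or 7. The sketch below lists those groups and converts them to the 1 KiB units that mount's sb= option expects. The function names are made up for this example, and the 4 KiB block size and 32768 blocks per group are assumptions (common mke2fs defaults), not values read from /dev/hdd2.

```shell
#!/bin/sh
# Assumed geometry (NOT read from any real device).
BLOCK_SIZE=4096
BLOCKS_PER_GROUP=32768

# is_power_of BASE N: succeeds if N equals BASE^k for some k >= 0.
is_power_of() {
    _base=$1
    _n=$2
    while [ "$_n" -gt 1 ]; do
        [ $((_n % _base)) -eq 0 ] || return 1
        _n=$((_n / _base))
    done
    [ "$_n" -eq 1 ]
}

# backup_sb_groups GROUP_COUNT
# Print the block groups below GROUP_COUNT that carry a backup superblock
# under sparse_super: group 1 and every power of 3, 5 or 7.
backup_sb_groups() {
    _g=1
    while [ "$_g" -lt "$1" ]; do
        if is_power_of 3 "$_g" || is_power_of 5 "$_g" || is_power_of 7 "$_g"; then
            echo "$_g"
        fi
        _g=$((_g + 1))
    done
}

# sb_mount_values GROUP_COUNT
# Print, for each backup group, the value to pass to "mount -o sb=".
# mount counts in 1 KiB units, so the filesystem block number is scaled
# by BLOCK_SIZE / 1024.
sb_mount_values() {
    for _grp in $(backup_sb_groups "$1"); do
        echo $((_grp * BLOCKS_PER_GROUP * (BLOCK_SIZE / 1024)))
    done
}
```

With those assumptions the first backup sits at filesystem block 32768, so something like "e2fsck -b 32768 /dev/hdd2" or "mount -t ext3 -o ro,sb=131072 /dev/hdd2 /mnt/gentoo" would be the first thing to try. If the filesystem was made with 1 KiB blocks the numbers differ (the first backup is then usually at block 8193); "mke2fs -n" on the device prints where the superblocks would be without writing anything, provided it is run with the same parameters as the original mkfs.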
Coywolf From coywolf at sosdg.org Sun Apr 16 14:18:38 2006 From: coywolf at sosdg.org (Coywolf Qi Hunt) Date: Sun, 16 Apr 2006 10:18:38 -0400 Subject: e2fsck dies with signal 11 In-Reply-To: <20060416123029.GA11999@rap.rap.dk> References: <20060416123029.GA11999@rap.rap.dk> Message-ID: <20060416141838.GB30746@everest.sosdg.org> On Sun, Apr 16, 2006 at 02:30:29PM +0200, Keld Jørn Simonsen wrote: > Hi > > I got a strange error, happening on two of my ext3 partitions. > What can be wrong? And why does e2fsck error out, instead of displaying > an error message? > > Best regards > keld > > fsck /dev/hda6 > fsck 1.38 (30-Jun-2005) > e2fsck 1.38 (30-Jun-2005) > Warning... fsck.ext3 for device /dev/hda6 exited with signal 11. Please try with gdb to trace the problem. Coywolf From tytso at mit.edu Mon Apr 17 08:41:25 2006 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 17 Apr 2006 04:41:25 -0400 Subject: e2fsck dies with signal 11 In-Reply-To: <20060416123029.GA11999@rap.rap.dk> References: <20060416123029.GA11999@rap.rap.dk> Message-ID: <20060417084125.GC13985@thunk.org> The dmesg indicates that the kernel trapped a general protection fault (GPF) in kernel space. So this looks like some kind of kernel bug which was triggered by e2fsck. Unfortunately the EIP is invalid, so it's hard to track down what might have caused it. If this is repeatable, I'd suggest using strace so we can see what e2fsck was requesting of the kernel right before it triggered the kernel GPF which killed the process.
- Ted From keld at dkuug.dk Mon Apr 17 10:30:23 2006 From: keld at dkuug.dk (Keld =?iso-8859-1?Q?J=F8rn?= Simonsen) Date: Mon, 17 Apr 2006 12:30:23 +0200 Subject: e2fsck dies with signal 11 In-Reply-To: <20060417084125.GC13985@thunk.org> References: <20060416123029.GA11999@rap.rap.dk> <20060417084125.GC13985@thunk.org> Message-ID: <20060417103023.GA6782@rap.rap.dk> On Mon, Apr 17, 2006 at 04:41:25AM -0400, Theodore Ts'o wrote: > The dmesg indicates that the kernel trapped a general protection fault > (GPF) in kernel space. So this looks like some kind of kernel bug > which was triggered by e2fsck. Unfortunately the EIP is invalid, so > it's hard to track down what might have caused it. If this is > repeatable, I'd suggest using strace so we can see what e2fsck was > requesting of the kernel right before it triggered the kernel GPF > which killed the process. OK, here are the last words of an strace: open("/etc/mtab", O_RDONLY) = 3 stat64("/dev/hda6", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 6), ...}) = 0 fstat64(3, {st_mode=S_IFREG|0644, st_size=524, ...}) = 0 mmap2(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7d5c000 read(3, "/dev/hda9 / reiserfs rw,noatime,"..., 131072) = 524 stat64("/dev/hda9", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 9), ...}) = 0 stat64("none", 0xafa32710) = -1 ENOENT (No such file or directory) stat64("none", 0xafa32710) = -1 ENOENT (No such file or directory) stat64("none", 0xafa32710) = -1 ENOENT (No such file or directory) stat64("/dev/hda1", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 1), ...}) = 0 stat64("/dev/hda10", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 10), ...}) = 0 stat64("/dev/hda11", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 11), ...}) = 0 stat64("/dev/hda2", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 2), ...}) = 0 stat64("/dev/hda3", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 3), ...}) = 0 stat64("/dev/hda5", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 5), ...}) = 0 read(3, "", 131072) = 0 
stat64("/", {st_mode=S_IFDIR|0755, st_size=520, ...}) = 0 close(3) = 0 munmap(0xa7d5c000, 131072) = 0 stat64("/dev/hda6", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 6), ...}) = 0 open("/dev/hda6", O_RDONLY|O_EXCL) = 3 close(3) = 0 open("/dev/hda6", O_RDWR|O_LARGEFILE) = 3 uname({sys="Linux", node="localhost", ...}) = 0 lseek(3, 1024, SEEK_SET) = 1024 read(3, "\0\326\6\0\177\252\r\0\354\256\0\0002\v\1\0\350,\4\0\0"..., 1024) = 1024 lseek(3, 4096, SEEK_SET) = 4096 read(3, "\2\0\0\0\3\0\0\0\4\0\0\0\0\0|;=\1\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(3, 16384, SEEK_SET) = 16384 read(3, "\0\0\0\0\0\0\0\0\0\17.C\0\17.C\0\17.C\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 lseek(3, 2084864, SEEK_SET) = 2084864 read(3, "\300;9\230\0\0\0\4\0\0\0\0\0\0\20\0\0\0@\0\0\0\0\1\0\2"..., 4096) = 4096 open("/dev/hda6", O_RDONLY|O_LARGEFILE) = 4 uname({sys="Linux", node="localhost", ...}) = 0 ioctl(4, 0x80041272, 0xafa32698) = 0 close(4) = 0 open("/proc/apm", O_RDONLY) = 4 fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7d7b000 read(4, +++ killed by SIGSEGV +++ best regards keld From tytso at mit.edu Mon Apr 17 11:15:32 2006 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 17 Apr 2006 07:15:32 -0400 Subject: e2fsck dies with signal 11 In-Reply-To: <20060417103023.GA6782@rap.rap.dk> References: <20060416123029.GA11999@rap.rap.dk> <20060417084125.GC13985@thunk.org> <20060417103023.GA6782@rap.rap.dk> Message-ID: <20060417111532.GA23376@thunk.org> On Mon, Apr 17, 2006 at 12:30:23PM +0200, Keld J?rn Simonsen wrote: > open("/proc/apm", O_RDONLY) = 4 ... > read(4, This was caused by e2fsck trying to read from /proc/apm to see whether or not your system was running on batteries or not. /proc/apm exists (or the open would have returned an error), but reading from it apparently causes a kernel oops. This is definitely a kernel bug, and I suspect can be reproduced by the shell command "cat /proc/apm". 
Recompiling the kernel with CONFIG_APM disabled is probably the most expedient answer, since for most systems ACPI is more functional (and in some cases, required). Indeed, the APM code has been suffering progressive bitrot, which probably explains the kernel oops. You could try sending a complaint to LKML if you really need APM functionality for your laptop, and for some reason ACPI is not sufficient for your needs. Regards, - Ted From sev at bnl.gov Mon Apr 17 19:22:25 2006 From: sev at bnl.gov (Sev Binello) Date: Mon, 17 Apr 2006 15:22:25 -0400 Subject: EXT3-fs unexpected failure msg ? Message-ID: <4443EAF1.8080807@bnl.gov> Hi - We have had a RAID failure. We have somewhat recovered, but we continue to see the following ext3 message... Apr 17 14:59:14 acnlin84 kernel: EXT3-fs unexpected failure: (((jh2bh(jh))->b_state & (1UL << BH_Uptodate)) != 0); Apr 17 14:59:14 acnlin84 kernel: Possible IO failure. Since we have experienced several instances of ext3 file system corruption when we lose total communication with our raid, we were wondering if there was any concrete advice out there on what to do in this situation. Other messages we got before the ones above... Apr 17 13:40:42 acnlin84 kernel: EXT3-fs error (device sd(8,33)): ext3_free_blocks: bit already cleared for block 14943160 Apr 17 13:40:42 acnlin84 kernel: EXT3-fs error (device sd(8,33)): ext3_free_blocks: bit already cleared for block 3703794 Apr 17 13:40:43 acnlin84 kernel: EXT3-fs error (device sd(8,65)): ext3_get_inode_loc: unable to read inode block - inode=50931914, block=101843272 -- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 sev at bnl.gov From adilger at clusterfs.com Mon Apr 17 23:51:56 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Mon, 17 Apr 2006 17:51:56 -0600 Subject: EXT3-fs unexpected failure msg ?
In-Reply-To: <4443EAF1.8080807@bnl.gov> References: <4443EAF1.8080807@bnl.gov> Message-ID: <20060417235156.GO17364@schatzie.adilger.int> On Apr 17, 2006 15:22 -0400, Sev Binello wrote: > We have had a raid failure, we have some what recovered > but we continue to see the following ext3 message... > > Apr 17 14:59:14 acnlin84 kernel: EXT3-fs unexpected failure: > (((jh2bh(jh))->b_state & (1UL << BH_Uptodate)) != 0); > Apr 17 14:59:14 acnlin84 kernel: Possible IO failure. > > > Since we have experienced several instances of ext3 file system corruption > when we lose total communication with our raid, > we were wondering if there was any concrete advice out there > on what to do in this situation. You really, really, really need to mount your filesystem with "-o errors=remount-ro", at least to prevent filesystem corruption. I'm not sure if this is enough to prevent corruption in the case of your RAID disconnects (if it doesn't generate errors up to the filesystem, but still discards writes), but it is at least a minimum requirement. > Other messages we got before the ones above... > Apr 17 13:40:42 acnlin84 kernel: EXT3-fs error (device sd(8,33)): > ext3_free_blocks: bit already cleared for block 14943160 > Apr 17 13:40:42 acnlin84 kernel: EXT3-fs error (device sd(8,33)): > ext3_free_blocks: bit already cleared for block 3703794 > > Apr 17 13:40:43 acnlin84 kernel: EXT3-fs error (device sd(8,65)): > ext3_get_inode_loc: unable to read inode block - inode=50931914, > block=101843272 > -- Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From menscher at uiuc.edu Tue Apr 18 00:02:21 2006 From: menscher at uiuc.edu (Damian Menscher) Date: Mon, 17 Apr 2006 19:02:21 -0500 (CDT) Subject: EXT3-fs unexpected failure msg ? 
In-Reply-To: <20060417235156.GO17364@schatzie.adilger.int> References: <4443EAF1.8080807@bnl.gov> <20060417235156.GO17364@schatzie.adilger.int> Message-ID: On Mon, 17 Apr 2006, Andreas Dilger wrote: > > You really, really, really need to mount your filesystem with > "-o errors=remount-ro", at least to prevent filesystem corruption. > I'm not sure if this is enough to prevent corruption in the case > of your RAID disconnects (if it doesn't generate errors up to the > filesystem, but still discards writes), but it is at least a minimum > requirement. Since this was so strongly-worded, I just did a random spot-check of some of our filesystems (RHEL4) and discovered they all have: Errors behavior: Continue in the superblock (and mount apparently takes that option). This makes me curious: if it's so obvious that it should remount-ro on errors, why is the default (on RHEL4, at least) to continue? Damian Menscher -- -=#| www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=- -=#| The above opinions are not necessarily those of my employers. |#=- From sev at bnl.gov Tue Apr 18 01:30:01 2006 From: sev at bnl.gov (Sev Binello) Date: Mon, 17 Apr 2006 21:30:01 -0400 Subject: EXT3-fs unexpected failure msg ? In-Reply-To: References: <4443EAF1.8080807@bnl.gov> <20060417235156.GO17364@schatzie.adilger.int> Message-ID: <44444119.6000502@bnl.gov> Damian Menscher wrote: > On Mon, 17 Apr 2006, Andreas Dilger wrote: > >> >> You really, really, really need to mount your filesystem with >> "-o errors=remount-ro", at least to prevent filesystem corruption. >> I'm not sure if this is enough to prevent corruption in the case >> of your RAID disconnects (if it doesn't generate errors up to the >> filesystem, but still discards writes), but it is at least a minimum >> requirement. 
> > > Since this was so strongly-worded, I just did a random spot-check of > some of our filesystems (RHEL4) and discovered they all have: > > Errors behavior: Continue > > in the superblock (and mount apparently takes that option). This makes > me curious: if it's so obvious that it should remount-ro on errors, why > is the default (on RHEL4, at least) to continue? > > Damian Menscher Aside from the fact that this is the current default setting for RHEL linux systems, though maybe not the best, my question/concern is that since there are sometimes trivial errors that we often have to live with until we can take our operational systems down long enough to fsck, will this option automatically put us in ro mode no matter how trivial the problem is ? Also, when we had the problem earlier today (i.e. the raid controller didn't failover for about 20 mins), we did stop and fsck. But even so when we checked after it was done, it still said state was "clean with errors" ? We tried fscking again with no better results, though when it started it said... "ext3 recovery flag clear but journal has data" any advice here ? Thanks -Sev -- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 sev at bnl.gov From adilger at clusterfs.com Tue Apr 18 08:31:11 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 18 Apr 2006 02:31:11 -0600 Subject: EXT3-fs unexpected failure msg ? In-Reply-To: <44444119.6000502@bnl.gov> References: <4443EAF1.8080807@bnl.gov> <20060417235156.GO17364@schatzie.adilger.int> <44444119.6000502@bnl.gov> Message-ID: <20060418083111.GP17364@schatzie.adilger.int> On Apr 17, 2006 21:30 -0400, Sev Binello wrote: > Damian Menscher wrote: > >On Mon, 17 Apr 2006, Andreas Dilger wrote: > >>You really, really, really need to mount your filesystem with > >>"-o errors=remount-ro", at least to prevent filesystem corruption. 
> >>I'm not sure if this is enough to prevent corruption in the case > >>of your RAID disconnects (if it doesn't generate errors up to the > >>filesystem, but still discards writes), but it is at least a minimum > >>requirement. > > > >Since this was so strongly-worded, I just did a random spot-check of > >some of our filesystems (RHEL4) and discovered they all have: > > > > Errors behavior: Continue > > > >in the superblock (and mount apparently takes that option). This makes > >me curious: if it's so obvious that it should remount-ro on errors, why > >is the default (on RHEL4, at least) to continue? It was only so strongly worded because Sev has had repeated failures of the RAID hardware resulting in filesystem corruption, and it seems prudent to stop the filesystem at the first inkling of corruption in this case. Not all environments see so many problems, and the choice to use remount-ro is up to the admin (though I believe Debian uses this as the default). > my question/concern is that since there are sometimes trivial errors that > we often have to live with until we can take our operational systems down > long enough to fsck, will this option automatically put us in ro mode no > matter how trivial the problem is ? This will only trigger on cases where there is a consistency error detected in the ext3 metadata. It doesn't affect regular IO errors for file data. However, that said, it surprises me that you are getting any kind of errors, even "trivial" ones, often. I wouldn't consider a RAID system where you often get errors to be very reliable. > Also, when we had the problem earlier today (i.e. the raid controller > didn't failover for about 20 mins), we did stop and fsck. > But even so when we checked after it was done, it still said state was > "clean with errors" ? When you run e2fsck, are you specifying the "-f" flag? 
For ext3 filesystems, an e2fsck (without -f) will normally not do a full filesystem check unless the superblock has been flagged with an error. This allows e2fsck to run against the filesystem always at boot, but normally only do journal replay (seconds at most) unless there was an error reported. > We tried fscking again with no better results, > though when it started it said... > "ext3 recovery flag clear but journal has data" > any advice here ? Run "e2fsck -f"? I haven't seen this unless the superblock was corrupted and had to be restored from backup or similar. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From sev at bnl.gov Tue Apr 18 13:57:46 2006 From: sev at bnl.gov (Sev Binello) Date: Tue, 18 Apr 2006 09:57:46 -0400 Subject: EXT3-fs unexpected failure msg ? In-Reply-To: <20060418083111.GP17364@schatzie.adilger.int> References: <4443EAF1.8080807@bnl.gov> <20060417235156.GO17364@schatzie.adilger.int> <44444119.6000502@bnl.gov> <20060418083111.GP17364@schatzie.adilger.int> Message-ID: <4444F05A.8010800@bnl.gov> Andreas Dilger wrote: > On Apr 17, 2006 21:30 -0400, Sev Binello wrote: > >>Damian Menscher wrote: >> >>>On Mon, 17 Apr 2006, Andreas Dilger wrote: >>> >>>>You really, really, really need to mount your filesystem with >>>>"-o errors=remount-ro", at least to prevent filesystem corruption. >>>>I'm not sure if this is enough to prevent corruption in the case >>>>of your RAID disconnects (if it doesn't generate errors up to the >>>>filesystem, but still discards writes), but it is at least a minimum >>>>requirement. >>> >>>Since this was so strongly-worded, I just did a random spot-check of >>>some of our filesystems (RHEL4) and discovered they all have: >>> >>> Errors behavior: Continue >>> >>>in the superblock (and mount apparently takes that option). This makes >>>me curious: if it's so obvious that it should remount-ro on errors, why >>>is the default (on RHEL4, at least) to continue? 
> > > It was only so strongly worded because Sev has had repeated failures of > the RAID hardware resulting in filesystem corruption, and it seems prudent > to stop the filesystem at the first inkling of corruption in this case. > Not all environments see so many problems, and the choice to use remount-ro > is up to the admin (though I believe Debian uses this as the default). > > >>my question/concern is that since there are sometimes trivial errors that >>we often have to live with until we can take our operational systems down >>long enough to fsck, will this option automatically put us in ro mode no >>matter how trivial the problem is ? > > > This will only trigger on cases where there is a consistency error detected > in the ext3 metadata. It doesn't affect regular IO errors for file data. > Ok, I'm assuming this would be any error reported in /var/log/messages that is preceded by EXT3-fs > However, that said, it surprises me that you are getting any kind of errors, > even "trivial" ones, often. I wouldn't consider a RAID system where you > often get errors to be very reliable. > No argument from us. > >>Also, when we had the problem earlier today (i.e. the raid controller >>didn't failover for about 20 mins), we did stop and fsck. >>But even so when we checked after it was done, it still said state was >>"clean with errors" ? > > > When you run e2fsck, are you specifying the "-f" flag? For ext3 filesystems, > an e2fsck (without -f) will normally not do a full filesystem check unless > the superblock has been flagged with an error. This allows e2fsck to run > against the filesystem always at boot, but normally only do journal replay > (seconds at most) unless there was an error reported. > > >>We tried fscking again with no better results, >>though when it started it said... >> "ext3 recovery flag clear but journal has data" >>any advice here ? > > > Run "e2fsck -f"?
I haven't seen this unless the superblock was corrupted > and had to be restored from backup or similar. > Will try it Thanks > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. > -- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 sev at bnl.gov From agupta at cs.ubc.ca Tue Apr 18 20:27:50 2006 From: agupta at cs.ubc.ca (Abhishek Gupta) Date: Tue, 18 Apr 2006 13:27:50 -0700 (PDT) Subject: Use of journal->j_blk_offset Message-ID: Hi everyone, So this question is more for people who are familiar with the internals of ext3. I notice that the function journal_init_dev() sets the value journal->j_blk_offset = start This means that start can be any arbitrary block number on the device. However, later in the function journal_bmap() it is never actually used. The value of *retp in journal_bmap() is set to *retp = blocknr; /* + journal->j_blk_offset */ A comment on the top of journal_bmap() says that the addition can be included in the above operation if so be the need. Is there any specific reason (related to performance etc) why it has not been done. Please let me know. Thanks Abhishek From jengelh at linux01.gwdg.de Mon Apr 10 15:28:18 2006 From: jengelh at linux01.gwdg.de (Jan Engelhardt) Date: Mon, 10 Apr 2006 17:28:18 +0200 (MEST) Subject: deleting partition does not effect superblock? In-Reply-To: <20060406065832.GK13324@lug-owl.de> References: <1458d9610604052337p2cafa6c8j78fc6da8c5f8be1a@mail.gmail.com> <20060406065832.GK13324@lug-owl.de> Message-ID: >deleted) or otherwise modified. So it's perfectly okay to delete such >a container (eg. remove start and end from the partition table) and >recreate it at some time later (by adding those values back to the >partition table.) As long as the new container starts at the same >location, a filesystem driver will be able to find the old >information. If you start a block later, it won't find it's >superblocks. 
> If using a filesystem with replicated superblocks (ext*, xfs), then ...? [Includes expecting weird breakage.] Jan Engelhardt -- From dlochart at gmail.com Wed Apr 19 15:34:17 2006 From: dlochart at gmail.com (Doug Lochart) Date: Wed, 19 Apr 2006 15:34:17 +0000 Subject: Max filesystem size for ext3 using Adaptec RAID 5 on 64 bit CentOS Message-ID: <1e71f8880604190834k1759512as301503b7b3586c9b@mail.gmail.com> We are strategizing a set of backup servers and I have been trying to deduce what the maximum size of each RAID 5 array should be to match the OS we are using. We are currently running CentOS 4.3 64 bit. We have planned a 2 TB RAID 5 array for testing but we will need to set up several larger ones for production. I have poked around and I see people mention limits like 2TB max file size and 32TB max filesystem size. Many of these mentioned 2.4/2.5 kernels, and none specified 32 bit vs 64 bit. Can someone please provide the following: max file size for 2.6.9 64 bit CentOS kernel max partition/filesystem size for the same. Thanks Doug -- What profits a man if he gains the whole world yet loses his soul? From jlb17 at duke.edu Wed Apr 19 15:58:50 2006 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 19 Apr 2006 11:58:50 -0400 (EDT) Subject: Max filesystem size for ext3 using Adaptec RAID 5 on 64 bit CentOS In-Reply-To: <1e71f8880604190834k1759512as301503b7b3586c9b@mail.gmail.com> References: <1e71f8880604190834k1759512as301503b7b3586c9b@mail.gmail.com> Message-ID: On Wed, 19 Apr 2006 at 3:34pm, Doug Lochart wrote > We are strategizing a set of backup servers and I have been trying to > deduce what the maximum size of each RAID 5 array should be to match > the OS we are using. We are currently running CentOS 4.3 64 bit. We > have planned a 2 TB RAID 5 array for testing but we will need to set > up several larger ones for production. I have poked around and I see > people mention limits like 2TB max file size and 32TB max filesystem > size.
Many of these mentioned 2.4/2.5 kernels, and none specified > 32 bit vs 64 bit. > > Can someone please provide the following: > > max file size for 2.6.9 64 bit CentOS kernel > max partition/filesystem size for the same. http://www.redhat.com/rhel/details/limits/ -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From keld at dkuug.dk Fri Apr 21 09:55:53 2006 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Fri, 21 Apr 2006 11:55:53 +0200 Subject: e2fsck dies with signal 11 In-Reply-To: <20060417111532.GA23376@thunk.org> References: <20060416123029.GA11999@rap.rap.dk> <20060417084125.GC13985@thunk.org> <20060417103023.GA6782@rap.rap.dk> <20060417111532.GA23376@thunk.org> Message-ID: <20060421095553.GA28488@rap.rap.dk> On Mon, Apr 17, 2006 at 07:15:32AM -0400, Theodore Ts'o wrote: > On Mon, Apr 17, 2006 at 12:30:23PM +0200, Keld Jørn Simonsen wrote: > > open("/proc/apm", O_RDONLY) = 4 > ... > > read(4, > > This was caused by e2fsck trying to read from /proc/apm to see whether > or not your system was running on batteries. /proc/apm exists > (or the open would have returned an error), but reading from it > apparently causes a kernel oops. This is definitely a kernel bug, and > I suspect can be reproduced by the shell command "cat /proc/apm". > > Recompiling the kernel with CONFIG_APM disabled is probably the most > expedient answer, since for most systems ACPI is more functional (and > in some cases, required). Indeed, the APM code has been suffering > progressive bitrot, which probably explains the kernel oops. You > could try sending a complaint to LKML if you really need APM > functionality for your laptop, and for some reason ACPI is not > sufficient for your needs. My problem here has vanished; I don't know why. But why was e2fsck checking APM? None of the other fs fsck's do, AFAIK.
Best regards Keld From keld at dkuug.dk Fri Apr 21 10:00:00 2006 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Fri, 21 Apr 2006 12:00:00 +0200 Subject: problem with e2fsck not knowing xfs Message-ID: <20060421100000.GB28488@rap.rap.dk> Hi! I had a problem yesterday with e2fsck. It reported a bad superblock. I then tried to use one of the other superblocks. To no avail. Then later I remembered that I had switched the fs type to xfs. Maybe e2fsck could recognize other common fs types, and report this instead? best regards keld From keld at dkuug.dk Fri Apr 21 10:05:03 2006 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Fri, 21 Apr 2006 12:05:03 +0200 Subject: EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 Message-ID: <20060421100503.GA28673@rap.rap.dk> I often get the message: EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 I have googled for a reason and a way to solve this - but have not found anything I could use. Maybe somebody here knows what to do? best regards keld From herta.vandeneynde at cc.kuleuven.be Fri Apr 21 15:10:38 2006 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Fri, 21 Apr 2006 17:10:38 +0200 Subject: ext3 data=ordered - good enough for oracle? Message-ID: <4448F5EE.7030106@cc.kuleuven.be> Given that the default journaling mode of ext3 (i.e. ordered), does not guarantee write ordering after a crash, is this journaling mode safe enough to use for a database such as Oracle? If so, how are out of sync writes dealt with?
Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From jburgess at uklinux.net Fri Apr 21 18:39:30 2006 From: jburgess at uklinux.net (Jon Burgess) Date: Fri, 21 Apr 2006 19:39:30 +0100 Subject: EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 In-Reply-To: <20060421100503.GA28673@rap.rap.dk> References: <20060421100503.GA28673@rap.rap.dk> Message-ID: <1145644770.28767.7.camel@shark.home> On Fri, 2006-04-21 at 12:05 +0200, Keld Jørn Simonsen wrote: > I often get the message: > > EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 > > I have googled for a reason and a way to solve this - > but have not found anything I could use. Maybe somebody here knows > what to do? This can happen for several reasons:- 1) Make sure you specify ext3 in /etc/fstab, e.g. ... /dev/hda6 /boot ext3 defaults 1 2 2) 'ext3' may not be compiled into your kernel (or the module may be missing). What kernel are you using and did you compile it yourself? 3) You may be hard coding ext2 in some mount command. Make sure that the filesystem is unspecified or set it to ext3, e.g. $ mount -t ext3 /dev/hda6 /mnt/tmp Jon From keld at dkuug.dk Fri Apr 21 20:05:58 2006 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Fri, 21 Apr 2006 22:05:58 +0200 Subject: EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 In-Reply-To: <1145644770.28767.7.camel@shark.home> References: <20060421100503.GA28673@rap.rap.dk> <1145644770.28767.7.camel@shark.home> Message-ID: <20060421200558.GA7256@rap.rap.dk> On Fri, Apr 21, 2006 at 07:39:30PM +0100, Jon Burgess wrote: > On Fri, 2006-04-21 at 12:05 +0200, Keld Jørn Simonsen wrote: > > I often get the message: > > > > EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 > > > > I have googled for a reason and a way to solve this - > > but have not found anything I could use. Maybe somebody here knows > > what to do?
> > This can happen for several reasons:- > > 1) Make sure you specify ext3 in /etc/fstab, e.g. > ... > /dev/hda6 /boot ext3 defaults 1 2 It was there as ext3. > 2) 'ext3' may not be compiled into your kernel (or the module may be > missing). What kernel are you using and did you compile it yourself? ext3 is in the kernel. I did compile it myself. It is 2.6.3 from The Source. > 3) You may be hard coding ext2 in some mount command. Make sure that the > filesystem is unspecified or set it to ext3, e.g. > $ mount -t ext3 /dev/hda6 /mnt/tmp I have not hard coded it. Anyway, mount reports it as mounted as ext3. hda6 is my root fs for my default system on that machine. It may be that it is mounted as ext2 only during boot. Still strange, and then what about the journal, if my system was not shut down cleanly? best regards keld From adilger at clusterfs.com Sat Apr 22 20:29:34 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Sat, 22 Apr 2006 14:29:34 -0600 Subject: problem with e2fsck not knowing xfs In-Reply-To: <20060421100000.GB28488@rap.rap.dk> References: <20060421100000.GB28488@rap.rap.dk> Message-ID: <20060422202934.GC6075@schatzie.adilger.int> On Apr 21, 2006 12:00 +0200, Keld Jørn Simonsen wrote: > I had a problem yesterday with e2fsck. > It reported a bad superblock. > I then tried to use one of the other superblocks. > To no avail. > > Then later I remembered that I had switched the fs type to xfs. > Maybe e2fsck could recognize other common fs types, > and report this instead? Or, maybe you can change your /etc/fstab to report the filesystem type as xfs. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
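The suggestion in the thread above, that a checker could look for the signatures of other filesystems before reporting a bad superblock, can be illustrated with a small sketch. This is not how e2fsck behaves and is not taken from e2fsprogs; it only tests two well-known on-disk magic numbers: the ext2/ext3 magic 0xEF53, stored little-endian at byte offset 56 of the primary superblock (which starts 1024 bytes into the device), and the XFS magic "XFSB" in the first four bytes of the device. The names guess_fs_type and read_start are made up for this example.

```python
import struct

EXT2_SUPERBLOCK_OFFSET = 1024   # primary ext2/ext3 superblock starts here
EXT2_MAGIC_OFFSET = 56          # offset of s_magic inside the superblock
EXT2_MAGIC = 0xEF53             # 16-bit value, stored little-endian
XFS_MAGIC = b"XFSB"             # first four bytes of an XFS filesystem


def guess_fs_type(image: bytes) -> str:
    """Return 'xfs', 'ext2/ext3' or 'unknown' for the start of a device image."""
    # XFS keeps its superblock magic at the very beginning of the device.
    if image[:4] == XFS_MAGIC:
        return "xfs"
    # ext2/ext3 keep a 16-bit magic inside the superblock at offset 1024.
    pos = EXT2_SUPERBLOCK_OFFSET + EXT2_MAGIC_OFFSET
    if len(image) >= pos + 2:
        (magic,) = struct.unpack_from("<H", image, pos)
        if magic == EXT2_MAGIC:
            return "ext2/ext3"
    return "unknown"


def read_start(path: str, size: int = 2048) -> bytes:
    """Read the first bytes of a device or image file (hypothetical helper;
    reading a raw device usually needs root)."""
    with open(path, "rb") as dev:
        return dev.read(size)
```

Calling guess_fs_type(read_start(path)) on a partition would then give a hint such as "xfs" instead of a bare "bad superblock"; real tools check many more signatures than these two.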
From adilger at clusterfs.com Sat Apr 22 20:30:08 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Sat, 22 Apr 2006 14:30:08 -0600 Subject: EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 In-Reply-To: <20060421100503.GA28673@rap.rap.dk> References: <20060421100503.GA28673@rap.rap.dk> Message-ID: <20060422203008.GD6075@schatzie.adilger.int> On Apr 21, 2006 12:05 +0200, Keld Jørn Simonsen wrote: > I often get the message: > > EXT2-fs warning (device hda6): ext2_fill_super: mounting ext3 filesystem as ext2 > > I have googled for a reason and a way to solve this - > but have not found anything I could use. Maybe somebody here knows > what to do? It means your initrd (or /etc/fstab) is mounting an ext3 filesystem as ext2. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From keld at dkuug.dk Sat Apr 22 20:38:55 2006 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Sat, 22 Apr 2006 22:38:55 +0200 Subject: problem with e2fsck not knowing xfs In-Reply-To: <20060422202934.GC6075@schatzie.adilger.int> References: <20060421100000.GB28488@rap.rap.dk> <20060422202934.GC6075@schatzie.adilger.int> Message-ID: <20060422203855.GA17657@rap.rap.dk> On Sat, Apr 22, 2006 at 02:29:34PM -0600, Andreas Dilger wrote: > On Apr 21, 2006 12:00 +0200, Keld Jørn Simonsen wrote: > > I had a problem yesterday with e2fsck. > > It reported a bad superblock. > > I then tried to use one of the other superblocks. > > To no avail. > > > > Then later I remembered that I had switched the fs type to xfs. > > Maybe e2fsck could recognize other common fs types, > > and report this instead? > > Or, maybe you can change your /etc/fstab to report the filesystem type > as xfs. Of course I did so, to make it work. And I promise to never ever again make errors. Well, I am just asking for a more intelligent error message than bad superblock. I think some of the other mkfs programs do so.
best regards keld From johann.lombardi at bull.net Sun Apr 23 00:15:55 2006 From: johann.lombardi at bull.net (Johann Lombardi) Date: Sun, 23 Apr 2006 02:15:55 +0200 Subject: ext3 data=ordered - good enough for oracle? In-Reply-To: <4448F5EE.7030106@cc.kuleuven.be> References: <4448F5EE.7030106@cc.kuleuven.be> Message-ID: <20060423001555.GK11497@lombardij> > Given that the default journaling mode of ext3 (i.e. ordered), does not > guarantee write ordering after a crash, is this journaling mode safe > enough to use for a database such as Oracle? If so, how are out of sync > writes dealt with? Oracle manages its own I/O cache in userspace and handles data coherency related to that. So data=journal is useless in this case. I guess databases such as Oracle use O_SYNC to control the flushing of data or even O_DIRECT to bypass the kernel cache. Johann From tytso at mit.edu Sat Apr 22 08:37:57 2006 From: tytso at mit.edu (Theodore Ts'o) Date: Sat, 22 Apr 2006 04:37:57 -0400 Subject: e2fsck dies with signal 11 In-Reply-To: <20060421095553.GA28488@rap.rap.dk> References: <20060416123029.GA11999@rap.rap.dk> <20060417084125.GC13985@thunk.org> <20060417103023.GA6782@rap.rap.dk> <20060417111532.GA23376@thunk.org> <20060421095553.GA28488@rap.rap.dk> Message-ID: <20060422083756.GA8519@thunk.org> On Fri, Apr 21, 2006 at 11:55:53AM +0200, Keld Jørn Simonsen wrote: > But why was e2fsck checking APM? E2fsck will delay doing a full filesystem check based on number of mounts or time since last full filesystem check if APM or ACPI reports that the laptop is running on battery. Eventually, if the user is always booting without being connected to AC mains, e2fsck will force a check anyway, but for most usage patterns it means that the check is delayed for only a few boots until the user can boot while connected to AC power.
- Ted From herta.vandeneynde at cc.kuleuven.be Sun Apr 23 21:46:30 2006 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Sun, 23 Apr 2006 23:46:30 +0200 Subject: ext3 data=ordered - good enough for oracle? In-Reply-To: <20060423001555.GK11497@lombardij> References: <4448F5EE.7030106@cc.kuleuven.be> <20060423001555.GK11497@lombardij> Message-ID: <444BF5B6.8080505@cc.kuleuven.be> Johann Lombardi wrote: >>Given that the default journaling mode of ext3 (i.e. ordered), does not >>guarantee write ordering after a crash, is this journaling mode safe >>enough to use for a database such as Oracle? If so, how are out of sync >>writes delt with? > > > Oracle manages its own I/O cache in userspace and handles data coherency related > to that. So data=journal is useless in this case. > I guess databases such as Oracle uses O_SYNC to control the flushing of data > or even O_DIRECT to bypass the kernel cache. > > Johann > Thanks for the reply, Johann, but given that Oracle is still using the filesystem (unless you use raw devices or ASM), what good does caching do in case of a hard crash? The O_SYNC and O_DIRECT would help. Is there any way to verify that this is what Oracle actually does? (Reason I'm asking is that I had a number of corruptions during the past year, and I have better things to do at nights than restoring databases.) Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From mkatiyar at gmail.com Tue Apr 25 09:45:09 2006 From: mkatiyar at gmail.com (Manish Katiyar) Date: Tue, 25 Apr 2006 15:15:09 +0530 Subject: Debugging file system using debugfs Message-ID: Hello friends, I am trying to learn recovering of file using debugfs. But even though i delete the file and run lsdel in debugfs it always gives me 0 deleted nodes found. Where am i making mistake?. 
[root at windce7 linux-2.4.32]# fdisk -l Disk /dev/hda: 40.0 GB, 40016019456 bytes 255 heads, 63 sectors/track, 4865 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/hda1 * 1 13 104391 83 Linux /dev/hda2 14 4735 37929465 83 Linux /dev/hda3 4736 4865 1044225 82 Linux swap [root at windce7 linux-2.4.32]# debugfs /dev/hda2 debugfs 1.32 (09-Nov-2002) debugfs: lsdel Inode Owner Mode Size Blocks Time deleted 0 deleted inodes found. debugfs: Please help me......I am new to this -- Thanks & Regards, ******************************************** Manish Katiyar Ozone 2, SP Infocity (Software Park), New Survey #208 Manjari Stud Farms Ltd., Phursungi Village, Haveli Taluka, Saswad Road, Hadapsar, Pune - 412308, India *********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL:
From johann.lombardi at bull.net Tue Apr 25 14:49:50 2006 From: johann.lombardi at bull.net (Johann Lombardi) Date: Tue, 25 Apr 2006 16:49:50 +0200 Subject: ext3 data=ordered - good enough for oracle? In-Reply-To: <444BF5B6.8080505@cc.kuleuven.be> References: <4448F5EE.7030106@cc.kuleuven.be> <20060423001555.GK11497@lombardij> <444BF5B6.8080505@cc.kuleuven.be> Message-ID: <20060425144950.GB4037@chiva> Hi Herta, > Thanks for the reply, Johann, but given that Oracle is still using the > filesystem (unless you use raw devices or ASM), what good does caching > do in case of a hard crash? It's handled at the application level. > The O_SYNC and O_DIRECT would help. Is there any way to verify that > this is what Oracle actually does?
It does: http://www.oracle.com/technology/tech/linux/htdocs/oracleonlinux_faq.html#8 http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:618260965466 (thread entitled "Commited data not "guaranteed" ?") http://www.redhat.com/magazine/013nov05/features/oracle/ You can google "Oracle O_SYNC" for more pointers (or do it yourself with strace or gdb). Johann From adilger at clusterfs.com Tue Apr 25 18:18:59 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Tue, 25 Apr 2006 12:18:59 -0600 Subject: Debugging file system using debugfs In-Reply-To: References: Message-ID: <20060425181859.GD6075@schatzie.adilger.int> On Apr 25, 2006 15:15 +0530, Manish Katiyar wrote: > I am trying to learn recovering of file using debugfs. But even > though i delete the file and run lsdel in debugfs > it always gives me 0 deleted nodes found. Where am i making mistake?. The ext3 implementation makes it basically impossible to recover deleted files, unless you search the whole disk looking for the data that you want to recover. This is an implementation detail for truncate, and may conceivably be fixed (I've discussed an improvement to do this several times), but nobody has ever worked on it. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From zach.brown at oracle.com Mon Apr 24 16:51:26 2006 From: zach.brown at oracle.com (Zach Brown) Date: Mon, 24 Apr 2006 09:51:26 -0700 Subject: ext3 data=ordered - good enough for oracle? In-Reply-To: <4448F5EE.7030106@cc.kuleuven.be> References: <4448F5EE.7030106@cc.kuleuven.be> Message-ID: <444D020E.6080907@oracle.com> Herta Van den Eynde wrote: > Given that the default journaling mode of ext3 (i.e. ordered), does not > guarantee write ordering after a crash, is this journaling mode safe > enough to use for a database such as Oracle? Yes, the database doesn't rely on the kind of functionality that data=journal provides that data=ordered doesn't. data=ordered is fine.
> If so, how are out of sync writes delt with? The database, just like ext3/jbd, implements its own consistency mechanisms by careful write ordering. ext3 uses in-kernel device APIs to issue writes and find out when they're on disk, the database ideally uses O_DIRECT. I looked around otn.oracle.com to find a doc that talks about configuring and verifying AIO+O_DIRECT in the database but got tired of searching. You might be able to find something if you're more patient than I was. - z From danield at igb.uiuc.edu Wed Apr 26 18:33:16 2006 From: danield at igb.uiuc.edu (Daniel Davidson) Date: Wed, 26 Apr 2006 13:33:16 -0500 Subject: re-linking hard links Message-ID: <1146076397.3241.9.camel@arthur.igb.uiuc.edu> Hello, I have a situation where I have numerous files with numerous hard links to each of them on an ext3 RHEL4.2 system. Some of these files are duplicates of the others. I would like to re-link all of the duplicates to point to a single inode. For instance if file1 has hardlinks link1 and link2, and file2 has hardlinks link3 and link4, I need to change it so that link1, link2 (these two are already correct), file2, link3, and link4 are all hardinks to file1. The only information I have to start with are the inode numbers of file1 and file2 and the pathnames of file1 and file2. Any ideas beyond searching all of the filenames on the system and replacing them with the proper link? That takes a long time. thanks, Dan From herta.vandeneynde at cc.kuleuven.be Wed Apr 26 23:46:49 2006 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Thu, 27 Apr 2006 01:46:49 +0200 Subject: re-linking hard links In-Reply-To: <1146076397.3241.9.camel@arthur.igb.uiuc.edu> References: <1146076397.3241.9.camel@arthur.igb.uiuc.edu> Message-ID: <44500669.5080103@cc.kuleuven.be> Daniel Davidson wrote: > Hello, > > I have a situation where I have numerous files with numerous hard links > to each of them on an ext3 RHEL4.2 system. 
Some of these files are > duplicates of the others. I would like to re-link all of the > duplicates to point to a single inode. > For instance if file1 has > hardlinks link1 and link2, and file2 has hardlinks link3 and link4, I > need to change it so that link1, link2 (these two are already correct), > file2, link3, and link4 are all hardinks to file1. The only > information I have to start with are the inode numbers of file1 and > file2 and the pathnames of file1 and file2. Not sure I understand properly. It looks as though you want to compare every file on a given filesystem with every other file on that filesystem, and if they are duplicates, replace one of the actual files with a hard link to the other file. > Any ideas beyond searching all of the filenames on the system and > replacing them with the proper link? Remember that hardlinks cannot cross filesystem borders. > That takes a long time. I suppose you could write a script that cksums all files on the filesystem, sorts the output, and verifies that two files with the same cksum are actually the same. If they are, it could ask whether it's OK to overwrite one of the files with a hardlink to the other. And yes, depending on the size of your filesystem, that would take time. Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From herta.vandeneynde at cc.kuleuven.be Wed Apr 26 23:49:22 2006 From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde) Date: Thu, 27 Apr 2006 01:49:22 +0200 Subject: ext3 data=ordered - good enough for oracle? In-Reply-To: <444D020E.6080907@oracle.com> References: <4448F5EE.7030106@cc.kuleuven.be> <444D020E.6080907@oracle.com> Message-ID: <44500702.2040307@cc.kuleuven.be> Thanks for your replies and pointers, Johann and Zach. I hope to find time next week to study the extra information. Kind regards, Herta Zach Brown wrote: > Herta Van den Eynde wrote: > >>Given that the default journaling mode of ext3 (i.e. 
ordered), does not >>guarantee write ordering after a crash, is this journaling mode safe >>enough to use for a database such as Oracle? > > > Yes, the database doesn't rely the kind of functionality that > data=journaled provides that data=ordered doesn't. data=ordered is fine. > > >>If so, how are out of sync writes delt with? > > > The database, just like ext3/jbd, implements its own consistency > mechanisms by careful write ordering. ext3 uses in-kernel device APIs > to issue writes and find out when they're on disk, the database ideally > uses O_DIRECT. > > I looked around otn.oracle.com to find a doc that talks about > configuring and verifying AIO+O_DIRECT in the database but got tired of > searching. You might be able to find something if you're more patient > than I was. > > - z > Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From AjitN at ami.com Wed Apr 26 18:20:20 2006 From: AjitN at ami.com (Ajit Narayanan) Date: Wed, 26 Apr 2006 11:20:20 -0700 Subject: Kernel panic from EXT3 filesystem Message-ID: <3225AF1B8CBF83459982D4987F1549CE01746C@fre-ops.us.megatrends.com> Hi All, I'm using FC3 with 2.6.9 SMP Kernel. The root file system is EXT3 and the volume is a XFS volume. While doing IO over a NFS v3 share, with 'watch df' running in parallel, kernel panic at SLAB memory is noticed. When searching Internet, I also noticed that similar KP are reported at free_block function; but could not find fix for this. For me, this issue appears intermittently. It happens when doing IO over 60 NFS shares. >From the KP message it appears to be an issue in Linux SLAB memory module. Can anyone suggest a solution for this issue? Is this issue already addressed in later kernel like 2.6.12? 
------KP start------ kernel: Unable to handle kernel paging request at virtual address 49bec98e kernel: printing eip: kernel: 02140b7c kernel: *pde = 00000000 kernel: Oops: 0002 [#1] kernel: SMP kernel: Modules linked in: xfs i2c_i801 bccfg(U) dvm(U) sg st osst nfsd exportfs lockd md5 ipv6 autofs4 i2c_dev i2c_core sunrpc iptable_filter ip_tables dm_mod button battery ac sr_mod usb_storage uhci_hcd ehci_hcd e1000 floppy ext3 jbd bcraid aic79xx sd_mod scsi_mod kernel: CPU: 3 kernel: EIP: 0060:[<02140b7c>] Tainted: PF VLI kernel: EFLAGS: 00010087 (2.6.9-1.667smp) kernel: EIP is at free_block+0x62/0xd6 kernel: eax: 00000029 ebx: 41db3000 ecx: 03f1ccbb edx: 00000000 kernel: esi: 41f6f280 edi: 0000003c ebp: 00000011 esp: 3e384e10 kernel: ds: 007b es: 007b ss: 0068 kernel: Process atd (pid: 2831, threadinfo=3e384000 task=3e622930) kernel: Stack: 41e0d010 39f19000 09fa4540 41e0d010 0000003c 02140c5e 41e0d000 41f6f280 kernel: 41e0d000 09fa4540 41e0d010 00000202 0214102d 09fa4548 00000000 3f1bb808 kernel: 2674d1e0 42d02886 2674d1e0 41f47e00 3d5b2320 3d11ce8c 42d0291e 1a7b8880 kernel: Call Trace: kernel: [<02140c5e>] cache_flusharray+0x6e/0x9c kernel: [<0214102d>] kfree+0x43/0x51 kernel: [<42d02886>] free_rb_tree_fname+0x31/0x6c [ext3] kernel: [<42d0291e>] ext3_htree_free_dir_info+0x8/0x10 [ext3] kernel: [<42d02cd3>] ext3_release_dir+0xf/0x14 [ext3] kernel: [<021549ca>] __fput+0x55/0x100 kernel: [<021536f4>] filp_close+0x59/0x5f kernel: [<02121350>] put_files_struct+0x57/0xc0 kernel: [<02121f4e>] do_exit+0x227/0x3bd kernel: [<021221d2>] sys_exit_group+0x0/0xd kernel: [<021294f8>] get_signal_to_deliver+0x341/0x369 kernel: [<02105e6c>] do_signal+0x55/0xd5 kernel: [<0216396f>] filldir64+0x0/0x122 kernel: [<0214f08a>] rw_vm+0x27e/0x28c kernel: [<0214f3a5>] put_user_size+0x29/0x2d kernel: [<02163b30>] sys_getdents64+0x9f/0xa9 kernel: [<02105f14>] do_notify_resume+0x28/0x38 kernel: Code: 1c 8b 53 04 8b 03 89 50 04 89 02 31 d2 2b 4b 0c c7 03 00 01 10 00 c7 43 04 00 02 20 00 89 c8 f7 
b6 b0 00 00 00 89 c1 0f b7 43 14 <66> 89 44 4b 18 8b 43 10 66 89 4b 14 48 85 c0 89 43 10 75 41 8b ------ KP End ------ Thanks in Advance Srikumar -------------- next part -------------- An HTML attachment was scrubbed... URL: From danield at igb.uiuc.edu Thu Apr 27 17:27:30 2006 From: danield at igb.uiuc.edu (Daniel Davidson) Date: Thu, 27 Apr 2006 12:27:30 -0500 Subject: re-linking hard links In-Reply-To: <44500669.5080103@cc.kuleuven.be> References: <1146076397.3241.9.camel@arthur.igb.uiuc.edu> <44500669.5080103@cc.kuleuven.be> Message-ID: <1146158850.3876.10.camel@arthur.igb.uiuc.edu> Nope, I am only using one drive (with a single ext3 filesystem on it). I know I can do a find -inum, but I was wondering if there was something more efficient. I am actually using an md5 checksum to find duplicate files, but then I need to hunt down all their hard links. Dan On Thu, 2006-04-27 at 01:46 +0200, Herta Van den Eynde wrote: > Daniel Davidson wrote: > > Hello, > > > > I have a situation where I have numerous files with numerous hard links > > to each of them on an ext3 RHEL4.2 system. Some of these files are > > duplicates of the others. I would like to re-link all of the > > duplicates to point to a single inode. > > For instance if file1 has > > hardlinks link1 and link2, and file2 has hardlinks link3 and link4, I > > need to change it so that link1, link2 (these two are already correct), > > file2, link3, and link4 are all hardinks to file1. The only > > information I have to start with are the inode numbers of file1 and > > file2 and the pathnames of file1 and file2. > > Not sure I understand properly. It looks as though you want to compare > every file on a given filesystem with every other file on that > filesystem, and if they are duplicates, replace one of the actual files > with a hard link to the other file. > > > Any ideas beyond searching all of the filenames on the system and > > replacing them with the proper link? 
> > Remember that hardlinks cannot cross filesystem borders. > > > That takes a long time. > > I suppose you could write a script that cksums all files on the > filesystem, sorts the output, and verifies that two files with the same > cksum are actually the same. If they are, it could ask whether it's OK > to overwrite one of the files with a hardlink to the other. And yes, > depending on the size of your filesystem, that would take time. > > Kind regards, > > Herta > > Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm From smb94532543 at w-lan.mine.nu Thu Apr 27 17:46:39 2006 From: smb94532543 at w-lan.mine.nu (Niki Hammler) Date: Thu, 27 Apr 2006 19:46:39 +0200 Subject: Whats this for a block? Message-ID: <4451037F.1020109@stiftingtal.net> Hi, I have got a question concerning directory entries. I have the following block containing exactly the filenames I had in one specified folder on the same file system: http://www.sbox.tugraz.at/home/n/nobaq/ext2.dat I really hoped that this is an directory block which could point me to the inode of the files. But when I try to extract the data, I only get garbage. I'm reading the block this way: First 4 bytes are pointer to inode, second 4 bytes are length of the name and the the rest is the name itself and so on. The first two entries should be '.' and '..', so the name lengths should be only 1 and 2, shouldn't they? Do you know what's this for a data block? I'm just reading the wrong way? Is there a chance to reconstruct useful information from that data block? 
Thank you very much in advance, Nikolaus Hammler From jburgess at uklinux.net Thu Apr 27 20:04:28 2006 From: jburgess at uklinux.net (Jon Burgess) Date: Thu, 27 Apr 2006 21:04:28 +0100 Subject: re-linking hard links In-Reply-To: <1146158850.3876.10.camel@arthur.igb.uiuc.edu> References: <1146076397.3241.9.camel@arthur.igb.uiuc.edu> <44500669.5080103@cc.kuleuven.be> <1146158850.3876.10.camel@arthur.igb.uiuc.edu> Message-ID: <1146168268.28767.51.camel@shark.home> On Thu, 2006-04-27 at 12:27 -0500, Daniel Davidson wrote: > Nope, I am only using one drive (with a single ext3 filesystem on it). > I know I can do a find -inum, but I was wondering if there was something > more efficient. > > I am actually using an md5 checksum to find duplicate files, but then I > need to hunt down all their hard links. > > Dan > There are existing tools which do both the md5sum and hardlinking of duplicates for you, e.g. http://www.sodarock.com/hardlink/ AFAIK ext3 doesn't have any idea of the md5's of any file, nor is there any reference from the inode back to the directory entries. If you were doing this regularly I guess you might be able to cache some of this info in extended attributes but you'd have to make sure you kept the info up to date. Jon From sct at redhat.com Thu Apr 27 20:52:51 2006 From: sct at redhat.com (Stephen C. Tweedie) Date: Thu, 27 Apr 2006 21:52:51 +0100 Subject: Whats this for a block? In-Reply-To: <4451037F.1020109@stiftingtal.net> References: <4451037F.1020109@stiftingtal.net> Message-ID: <1146171171.16140.43.camel@sisko.sctweedie.blueyonder.co.uk> Hi, On Thu, 2006-04-27 at 19:46 +0200, Niki Hammler wrote: > I have got a question concerning directory entries. I have the following > block containing exactly the filenames I had in one specified folder on > the same file system: > > http://www.sbox.tugraz.at/home/n/nobaq/ext2.dat > > I really hoped that this is an directory block which could point me to > the inode of the files. Yes, it is. 
> But when I try to extract the data, I only get garbage. I'm reading the
> block this way: First 4 bytes are pointer to inode, second 4 bytes are
> length of the name and the rest is the name itself and so on.

Not quite. It's an ext2_dir_entry_2 struct from
linux/include/linux/ext2_fs.h :

    struct ext2_dir_entry_2 {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
        char    name[EXT2_NAME_LEN];    /* File name */
    };

so yes, the first 4 bytes are the inode number; but then you've got a
2-byte record length, which includes the 8 byte directory entry struct
plus the name length rounded up to the next 4 bytes (to keep the entries
4-byte aligned on disk); then the name length itself, and the inode
type, both of them just 1 byte long.

> The first two entries should be '.' and '..', so the name lengths should
> be only 1 and 2, shouldn't they?

They are: looking at the "hexdump -C" of the data, I see

00000000  01 40 01 00 0c 00 01 02  2e 00 00 00 b6 c1 08 00  |.@..............|
00000010  0c 00 02 02 2e 2e 00 00  02 40 01 00 14 00 09 01  |.........@......|

so you've got inode number 0x00014001, then 0x000c = 12 bytes record
length, then a 1-byte name and file_type 2, EXT2_FT_DIR; then "." for
the name. That completes the first record. Then you have inode
0x0008c1b6, record length 12, name length 2 and file_type 2, for the
name "..". And so on.

Cheers,
 Stephen

From asi.linux at yahoo.com Sat Apr 29 01:01:07 2006
From: asi.linux at yahoo.com (Muhammad Asif)
Date: Sat, 29 Apr 2006 02:01:07 +0100 (BST)
Subject: Ext3 Variables
Message-ID: <20060429010107.14172.qmail@web38304.mail.mud.yahoo.com>

Hello,
I want to create a script that should automatically free
the proxy partition, i.e. /var. I am able to create that script.
But the problem is that I don't know through which variables I can check
my partition's space that show me full detail, i.e. remaining size, etc.
One way is df -h and another is fdisk -l /dev/hdxx.
But when I use an if statement, which variable will I use for the
comparison? Please help me in this regard. Can you tell me any variables
in ext3 that can be used to check a partition's size?
Thanks

Muhammad Asif

From asi.linux at yahoo.com Sat Apr 29 20:58:12 2006
From: asi.linux at yahoo.com (Muhammad Asif)
Date: Sat, 29 Apr 2006 21:58:12 +0100 (BST)
Subject: ext3 variables
Message-ID: <20060429205812.27767.qmail@web38312.mail.mud.yahoo.com>

Hello,
I want to create a script that should automatically free
the proxy partition, i.e. /var. I am able to create that script.
But the problem is that I don't know through which variables I can check
my partition's space that show me full detail, i.e. remaining size, etc.
One way is df -h and another is fdisk -l /dev/hdxx.
But when I use an if statement, which variable will I use for the
comparison? Please help me in this regard. Can you tell me any variables
in ext3 that can be used to check a partition's size?
Thanks

Muhammad Asif

From herta.vandeneynde at cc.kuleuven.be Sat Apr 29 22:41:08 2006
From: herta.vandeneynde at cc.kuleuven.be (Herta Van den Eynde)
Date: Sun, 30 Apr 2006 00:41:08 +0200
Subject: ext3 variables
In-Reply-To: <20060429205812.27767.qmail@web38312.mail.mud.yahoo.com>
References: <20060429205812.27767.qmail@web38312.mail.mud.yahoo.com>
Message-ID: <4453EB84.1040004@cc.kuleuven.be>

Hi Muhammad,

Not sure if I understand properly. It looks like you're confused about
the difference in size between what fdisk and df report. If that's the
case: fdisk shows you the size of the disk partitions.
You create a filesystem on those partitions, and each filesystem has a
specific overhead (superblock, inode tables, ...). I.e. df shows you what
is actually available to the user of the filesystem.

Kind regards,

Herta

Muhammad Asif wrote:
> Hello,
> I want to create a script that should automatically free
> the proxy partition, i.e. /var. I am able to create that script.
> But the problem is that I don't know through which variables I can check
> my partition's space that show me full detail, i.e. remaining size, etc.
> One way is df -h and another is fdisk -l /dev/hdxx.
> But when I use an if statement, which variable will I use for the
> comparison? Please help me in this regard. Can you tell me any variables
> in ext3 that can be used to check a partition's size?
> Thanks
>
> Muhammad Asif
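
A note on the question in this thread: ext3 does not expose shell
"variables" for free space. The figures df prints come from the kernel's
statfs/statvfs interface, which a script can query directly. The sketch
below is only an illustration under stated assumptions: it uses Python's
os.statvfs, the helper names (space_report, percent_used, needs_cleanup)
and the 90% threshold are hypothetical and not from the thread, and the
percentage is computed the way df's Use% column is usually described
(used over used plus available to non-root users), so it may differ from
df's output by rounding.

```python
import os


def space_report(path):
    """Return (total, used, available) in bytes for the filesystem holding path."""
    st = os.statvfs(path)
    block = st.f_frsize                          # fundamental block size in bytes
    total = st.f_blocks * block                  # size of the filesystem, after overhead
    used = (st.f_blocks - st.f_bfree) * block    # blocks that are not free
    available = st.f_bavail * block              # free blocks usable by non-root users
    return total, used, available


def percent_used(path):
    """Approximate the Use% column of df for the filesystem holding path."""
    _total, used, available = space_report(path)
    usable = used + available
    return 100.0 * used / usable if usable else 0.0


def needs_cleanup(path, threshold_percent=90.0):
    """True when usage is at or above the (hypothetical) threshold."""
    return percent_used(path) >= threshold_percent
```

In a cleanup script this could be used as, for example,
needs_cleanup("/var") in the if statement, with /var being the mount
point mentioned in the question. The total reported here is the
filesystem size that df shows, not the partition size that fdisk shows,
which is the difference Herta describes.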
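
Returning to the "Whats this for a block?" thread above: the record
layout Stephen describes (4-byte inode, 2-byte rec_len, 1-byte name_len,
1-byte file_type, then the name, all little-endian) can be walked with a
few lines of code. This is a minimal sketch, not a tool from the thread:
the function name parse_dir_entries is hypothetical, the sample bytes are
the first 24 bytes of the hexdump quoted in Stephen's reply, and the
sanity checks are assumptions about how to stop on damaged data.

```python
import struct

# inode (le32), rec_len (le16), name_len (u8), file_type (u8): 8 bytes in total
HEADER = struct.Struct("<IHBB")


def parse_dir_entries(block):
    """Walk a directory block and return (inode, rec_len, file_type, name) tuples."""
    entries = []
    offset = 0
    while offset + HEADER.size <= len(block):
        inode, rec_len, name_len, file_type = HEADER.unpack_from(block, offset)
        # Stop on records that cannot be valid, so bad data cannot loop forever.
        if rec_len < HEADER.size + name_len or offset + rec_len > len(block):
            break
        name_start = offset + HEADER.size
        name = block[name_start:name_start + name_len]
        if inode != 0:                  # inode 0 marks an unused entry
            entries.append((inode, rec_len, file_type, name))
        offset += rec_len               # rec_len, not name_len, leads to the next record
    return entries


# First two records from the hexdump in the thread: "." and "..".
sample = bytes.fromhex(
    "01400100" "0c00" "01" "02" "2e000000"
    "b6c10800" "0c00" "02" "02" "2e2e0000"
)

for inode, rec_len, file_type, name in parse_dir_entries(sample):
    print(hex(inode), rec_len, file_type, name)
```

Stepping by rec_len rather than by the name length is the point of the
example: reading the second field as a 4-byte name length, as in the
original question, is what produces garbage.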