From swapana_ghosh at yahoo.com Mon Oct 1 13:18:15 2007 From: swapana_ghosh at yahoo.com (Swapana Ghosh) Date: Mon, 1 Oct 2007 06:18:15 -0700 (PDT) Subject: ext3 file system becoming read only In-Reply-To: <46FC9E49.6090900@cesca.es> Message-ID: <230929.8139.qm@web58302.mail.re3.yahoo.com> Thanks Jordi, Yes, we are checking everything; only then will we proceed with updating the kernel. Thanks again --- Jordi Prats wrote: > Hi Swapana, > An update is always a good idea. On RHEL updates usually go smoothly, but > have you checked your FC switch for errors on each port? You could > also check your SAN controllers, or run some diagnostics to be sure it's > not a problem on your SAN. If your active controller reboots suddenly it > can cause IO errors that corrupt your journal. > > regards, > Jordi > > > > Swapana Ghosh wrote: > > Hi, > > > > As I explained in my first posting, the 'read-only' issue is not limited to > > one server; it is happening on a few servers which are generally 'oracle' > > database oriented. Very recently it happened to an 'oracle' application server. > > As a temporary measure, we are re-mounting the file system and also running fsck. > > > While searching the Red Hat knowledge base, I found the following URL; the > > problem described there is similar to our issue: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=213921 > > > > It says the problem is a kernel bug. > > > > Not sure whether we should move to a newer kernel version or not; > > please advise. > > > > Thanks > > > > > > --- tweeks wrote: > > > > > >> The EL4 kernel is wacky when it comes to the I/O scheduler locking up > >> and causing ext3 to remount RO. Various hardware hiccups can cause it to > >> go RO. > >> > >> And when it does.. you need to tread lightly or you could lose everything. > >> > >> If your ext3 filesystem had problems and remounted read-only, I would > >> strongly advise /against/ simply fscking it.
Often, when your filesystem has > >> gone RO, it may have been that way for 30 minutes or more. Just rebooting or > >> fscking is a great way to lose everything (i.e. everything being dumped > >> into /lost+found/). > >> > >> Instead, I would recommend: > >> 1) rebooting into a rescue CD environment (not allowing the rescue > >> environment to mount or fsck your filesystems). > >> 2) Nuke the ext3 journal: > >> tune2fs -O ^has_journal /dev/ > >> (possibly doing the same for other problem partitions) > >> 3) Do a fake fsck to see the extent of damage: > >> fsck -fn /dev/ > >> (after checking things out.. use "-fy" once you're sure that it's safe) > >> 4) Rebuild the journal with "tune2fs -j /dev/ > >> (rerun at least once until the "clean" result is repeatable) > >> 5) Mount and check things out: > >> "mkdir /mnt/tmp && mount -t ext3 /dev/ /mnt/tmp" > >> 6) Gracefully umount & reboot: > >> "umount /mnt/tmp && shutdown -rf now && exit" > >> > >> Tweeks > >> > >> On Tuesday 25 September 2007 11:47, Swapana Ghosh wrote: > >> > >>> Hi Jordi, > >>> > >>> Thanks for your reply. I will test the way you suggested. > >>> > >>> Thanks > >>> -swapna > >>> > >>> --- Jordi Prats wrote: > >>> > >>>> Hi, > >>>> It seems like what happened to me. I did this to solve the issue: > >>>> > >>>> Mark the filesystem as not having a journal (take it to ext2): > >>>> > >>>> tune2fs -O ^has_journal /dev/cciss/c0d0p2 > >>>> > >>>> fsck it to delete the journal: > >>>> > >>>> e2fsck /dev/cciss/c0d0p2 > >>>> > >>>> Create the journal (take it back to ext3): > >>>> > >>>> tune2fs -j /dev/cciss/c0d0p2 > >>>> > >>>> and finally, remount it. > >>>> > >>>> In my case it was with a local disk, but with your SAN disk it should be > >>>> the same.
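The drop/check/rebuild sequence that Tweeks and Jordi describe above can be collected into one sketch. This is illustrative only: the device path below is a placeholder (the device names in the original mails were truncated in the archive), and by default the script merely prints each command instead of executing it, since these steps must be run from a rescue environment against an unmounted filesystem.

```shell
#!/bin/sh
# Sketch of the journal-drop/check/rebuild sequence from the thread above.
# DEV is a placeholder; pass your real (unmounted!) partition.  By default
# each command is only printed; add --run as the second argument to execute.
DEV=${1:-/dev/sdXN}

run() {
    echo "+ $*"
    if [ "$CONFIRM" = yes ]; then "$@"; fi
}
[ "$2" = "--run" ] && CONFIRM=yes

run tune2fs -O ^has_journal "$DEV"   # 1) drop the (possibly corrupt) journal
run fsck -fn "$DEV"                  # 2) read-only pass to gauge the damage
run fsck -fy "$DEV"                  # 3) repair once the -n output looks sane
run tune2fs -j "$DEV"                # 4) recreate the journal (back to ext3)
```

The `--run` confirmation convention and the placeholder device name are inventions of this sketch, not tools from the thread; preview first, execute only once you are sure.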
> >>>> > >>>> Jordi > >>>> > >>>> Swapana Ghosh wrote: > >>>> > >>>>> Hi > >>>>> > >>>>> In our office environment a few servers, mostly database servers (and > >>>>> yesterday, for the first time, an application server), have had a > >>>>> partition become "read only". > >>>>> > >>>>> I was checking the archives and found what may be similar issues in the > >>>>> 2007-July archives. > >>>>> It would be really helpful if someone could describe how they were solved. > >>>>> > >>>>> In our case, just as the problem started, we found the following line in > >>>>> the log file: > >>>>> > >>>>> EXT3-fs error (device dm-12): ext3_find_entry: reading directory > >>>>> #2015496 offset 2 > >>>>> > >>>>> Then one blank line > >>>>> Then the lines > >>>>> > >>>>> Aborting journal on device dm-12. > >>>>> ext3_abort called > >>>>> > >>>>> EXT3-fs error (device dm-12): ext3_journal_start_sb: Detected > >>>>> aborted journal > >>>>> Remounting filesystem read-only > >>>>> > >>>>> Then the following line repeats continuously: > >>>>> > >>>>> EXT3-fs error (device dm-12) in start_transaction: Journal has > >>>>> aborted > >>>>> > >>>>> The above message repeats until we remount the filesystem and the > >>>>> partition becomes 'read-write'. > >>>>> > >>>>> We could not figure out the root cause. > >>>>> > >>>>> We are using individual EMC LUNs configured into LVM volume groups and > >>>>> then mounted on logical volumes.
> >>>>> > >>>>> Here I am giving the server description: > >>>>> > >>>>> ____________________________________________________________ > >>>>> > >>>>> [root at server ~]# lsmod |grep -i qla > >>>>> qla2300 130304 0 > >>>>> qla2xxx_conf 305924 0 > >>>>> qla2xxx 307448 21 qla2300 > >>>>> scsi_mod 117709 5 sg,emcp,qla2xxx,cciss,sd_mod > >>>>> > >>>>> ____________________________________________________________ > >>>>> [root at server ~]# cat /etc/modprobe.conf > >>>>> alias eth0 tg3 > >>>>> alias eth1 tg3 > >>>>> alias eth2 e1000 > >>>>> alias eth3 e1000 > >>>>> alias eth4 e1000 > >>>>> alias eth5 e1000 > >>>>> alias bond0 bonding > >>>>> alias scsi_hostadapter cciss > >>>>> options bond0 max_bonds=2 miimon=100 mode=1 > >>>>> alias scsi_hostadapter1 qla2xxx > >>>>> alias scsi_hostadapter2 qla2xxx_conf > >>>>> #alias scsi_hostadapter3 qla6312 > >>>>> options qla2xxx ql2xmaxqdepth=16 qlport_down_retry=64 > >>>>> ql2xloginretrycount=30 ql2xfailover=0 ql2xlbType=0 > === message truncated === From tango at tiac.net Tue Oct 2 19:38:47 2007 From: tango at tiac.net (Thomas Watt) Date: Tue, 2 Oct 2007 15:38:47 -0400 (GMT-04:00) Subject: How are alternate superblocks repaired? Message-ID: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> Hi Ted, Ok, I think I understand now. I was assuming the backup superblocks played a role without the intervention of e2fsck and were ready to be used in a standby mode when the primary superblock gets corrupted. But, of course, there is a very real reason to be cautious when the kernel may do things unknown to users. My point-of-view was more flavored by something like the Multics structure marking that kept backup data structures free from damage.
It is clear there is another strategy at work here, but one that is workable and sufficient for the ext2/ext3 filesystem. In case you are interested, here is a link to a web page on Structure Marking: http://www.multicians.org/thvv/marking.html I'm so happy you sent the tip on using e2label to correct my problem. I've attached my script, which I wrote more out of curiosity than anything else: ca18e1eb99c1279e0298db56f43b1ab1 genallsbs.sh Regards, -- Tom From: Theodore Tso To: Thomas Watt Cc: Andreas Dilger , ext3-users at redhat.com Subject: Re: How are alternate superblocks repaired? Date: Sep 29, 2007 9:01 AM On Sat, Sep 29, 2007 at 03:29:13AM -0400, Thomas Watt wrote: > The only field not updated was the Filesystem state field. So, all > of the backup superblocks remain "not clean" and are now at least a > lot closer to being consistent with the primary superblock - just > not quite there yet as far as being usable in case the primary > superblock gets hosed. That's by design. The backup superblocks always have the filesystem state set to "not clean". They are written out that way! Keep in mind that the kernel does *not* update the backup superblocks under normal operations. So by definition, fields such as the free blocks, free inodes, last mount time, and mount count are always going to be out of date in the backup superblocks. AND THAT'S OK. The whole point of the backup superblocks is to have an extra copy of the fundamental filesystem parameters --- the blocksize, the number of inodes per block group, the block group size, the location of the inode table and the allocation bitmaps, and so on. That doesn't change under normal circumstances except when the filesystem is resized, so that's why it's OK for the kernel to not bother to update them.
If the primary superblock is destroyed, e2fsck will use the backup superblocks to reconstruct the filesystem, and in the process of reconstructing the filesystem, it will update the free blocks, free inodes, and the other more transient portions of the filesystem. I'm not sure why you are so concerned about keeping every last field in the backup superblocks identical to that of the primary. There are lots of good reasons why they are not the same; the less they are modified, the less likely they are to get corrupted or otherwise messed up. (For example, besides sparing us a much longer umount operation, the fact that the kernel never writes the backup superblocks means that we don't have to worry about what happens if the in-memory copy of the superblocks is corrupted --- say because the system administrator was too cheap to use ECC memory --- even if they are written to the primaries, the backups will still be OK for e2fsck to use for recovery purposes.) - Ted From: Thomas Watt To: Theodore Tso Cc: Andreas Dilger , ext3-users at redhat.com Subject: Re: How are alternate superblocks repaired? Date: Sep 29, 2007 3:29 AM Hi Ted, I just wanted to give you some feedback on running the e2label command to fix the problem of backup superblock inconsistency with the primary superblock. Since Linux filesystem name labels are optional and my filesystem volume name was not set, I wondered if that would make a difference. It did not. I did not opt to set a label, but just followed your suggested command. The following fields were updated: Filesystem features, Free blocks, Free inodes, Last mount time, Last write time, Mount count, Last checked, Next check after. The only field not updated was the Filesystem state field. So, all of the backup superblocks remain "not clean" and are now at least a lot closer to being consistent with the primary superblock - just not quite there yet as far as being usable in case the primary superblock gets hosed.
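As an aside for anyone repeating Tom's comparison: the locations such a dump-all-superblocks script has to probe follow a fixed rule. With the sparse_super feature, superblock copies sit at the start of block groups 0, 1, and every power of 3, 5 and 7. The sketch below is my illustration of that calculation (it is not Tom's attachment genallsbs.sh, which the archive scrubbed):

```shell
#!/bin/sh
# Print the block numbers of every ext2/ext3 superblock copy on a
# sparse_super filesystem: block groups 0, 1, and powers of 3, 5 and 7.
# Usage: sbs <block_count> <blocks_per_group> <first_block>
#        (first_block is 1 on 1 KB block-size filesystems, 0 otherwise)
sbs() {
    count=$1 per_group=$2 first=$3
    groups=$(( (count + per_group - 1) / per_group ))   # number of block groups
    list="0 1"
    for base in 3 5 7; do
        p=$base
        while [ "$p" -lt "$groups" ]; do
            list="$list $p"
            p=$(( p * base ))
        done
    done
    for g in $(printf '%s\n' $list | sort -n); do
        echo $(( first + g * per_group ))
    done
}

# Example with a 1 KB block size and 8192 blocks per group: the second
# line printed is 8193, the familiar "e2fsck -b 8193" backup location.
sbs 20578300 8192 1 | head -6
```

The example geometry matches the 1 KB block-size filesystem dumped later in this archive; for any real filesystem, take the three numbers from `dumpe2fs -h` output.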
At this point I don't suppose there is any way for e2fsck to make the backup superblocks "clean" (i.e. only when the primary is clean) until your enhancement gets released. It was fairly easy to make this assessment using the script I wrote to dump all of the superblocks and compare the before and after superblock states. Checking the result was the easy part. I want to make a few changes, test them out and donate the script to the e2fsprogs project. It should make it just a little bit easier for system administrators to keep an eye on the backup superblocks, and you also might find it useful in testing your enhancement to e2fsck. The only caveat is that the script has not been tested on ext2/ext3 filesystems with block sizes of 1024 or 2048. There are provisions for 1024- and 2048-block-size systems - that's the speculative part of the script that needs testing - assumptions always need testing/challenging - right? :) I hope this feedback helps in your enhancement efforts to e2fsck. Regards, -- Tom From: Theodore Tso To: Thomas Watt Cc: Andreas Dilger , ext3-users at redhat.com Subject: Re: How are alternate superblocks repaired? Date: Sep 28, 2007 2:55 PM On Fri, Sep 28, 2007 at 01:18:16AM -0400, Thomas Watt wrote: > The Maximum mount count is 30, and I have no reason to believe that > e2fsck has ever been run against this particular FC3 ext filesystem. > I have every reason to believe, however, that fsck has been run on > occasion when I either boot the FC3 system manually and the mount > count is over 30 or when I experience the situation where the > ext_attr goes missing and I then manually boot the system when it is > not clean in the primary superblock. The system was created at the > end of March, 2005 and as you can see from the differences the > backup superblock(s) have never even been touched after their > creation. > > What parameters do you suggest be used when e2fsck is run to repair > the backup superblocks?
Hi Tom, There are a couple of things going on here. First of all, out of general paranoia, neither e2fsck nor the kernel touch the backup superblocks. Most of the changes that you pointed out between the primary and backup superblocks are no big deal, and can easily be regenerated by e2fsck. The one exception is the feature bitmasks. Most of the time it's only tune2fs which makes changes to the feature compatibility bitmasks. Unfortunately, the kernel does make some changes "behind the user's back"; and one of them is the ext_attr feature flag. So thanks for pointing that out, and I'll have to make an enhancement to e2fsck to detect if the backup superblock's compatibility flags are different, and if so, to update the backup superblocks. For now, you can work around this and force an update to the backup superblocks by running the following command as root: e2label /dev/hdXXX "`e2label /dev/hdXXX`" This reads out the label from the filesystem, and then sets the label to its current value. This will force a copy from the primary to the backup superblocks. Regards, - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: genallsbs.sh Type: application/x-shellscript Size: 14176 bytes Desc: not available URL: From nicdnicd at gmail.com Tue Oct 2 21:27:52 2007 From: nicdnicd at gmail.com (Nickel Cadmium) Date: Tue, 2 Oct 2007 23:27:52 +0200 Subject: Bad magic number in super-block Message-ID: <9ec348a90710021427r3c7b333el91685c17a277aacc@mail.gmail.com> Hi, After a power failure, I can't mount one of my partitions anymore. Here is what I get from fsck: -- fsck.ext3 /dev/sdb1 e2fsck 1.39 (29-May-2006) Couldn't find ext2 superblock, trying backup blocks... fsck.ext3: Bad magic number in super-block while trying to open /dev/sdb1 The superblock could not be read or does not describe a correct ext2 filesystem.
If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock: e2fsck -b 8193 -- I tried to give the suggested superblock as a parameter but I get the same error message, and with dumpe2fs and tune2fs as well. Since I can't get the backup-superblock positions with dumpe2fs, I used a block size of 1K and tried all the supposed-to-be backup superblocks, but it does not help. Is there anything I can try to mount the partition again? Cheers, NiCd -------------- next part -------------- An HTML attachment was scrubbed... URL: From tytso at mit.edu Tue Oct 2 21:59:11 2007 From: tytso at mit.edu (Theodore Tso) Date: Tue, 2 Oct 2007 17:59:11 -0400 Subject: How are alternate superblocks repaired? In-Reply-To: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> References: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> Message-ID: <20071002215911.GA6012@thunk.org> On Tue, Oct 02, 2007 at 03:38:47PM -0400, Thomas Watt wrote: > In case you are interested, here is link to a web page on Structure Marking: > http://www.multicians.org/thvv/marking.html I actually have used a Multics system way back when (I was actually logged into MIT Multics when it was finally shut down[1]). The com_err library and the ss library in e2fsprogs were largely inspired by Multics, and I do use structure magic numbers in memory to protect against programming errors, which is basically a very simple structure marking technique. I'm a bit dubious about how useful simple structure marking would be for modern Linux systems, since a large number of errors really are silent bit flips in the data, that wouldn't be detected simply by checking the expected structure ID at the beginning of the on-disk object. We are planning on adding checksums to metadata for ext4, which will help a lot in terms of detecting bad metadata.
Regards, ("You are protected from preemption" :-) [1] http://stuff.mit.edu/afs/sipb/project/eichin/sipbscan/ - Ted From tango at tiac.net Wed Oct 3 03:30:46 2007 From: tango at tiac.net (Thomas Watt) Date: Tue, 2 Oct 2007 23:30:46 -0400 (GMT-04:00) Subject: Bad magic number in super-block Message-ID: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> Hi Nickel Cadmium, First, try running the command (as root): fdisk -l That should confirm whether /dev/sdb1 is a valid filesystem partition and not a swap partition. Look for an ID of 83, which identifies valid filesystem partitions. A partition with an ID of 82 is usually swap and won't have a superblock. That said, if /dev/sdb1 is not a valid filesystem partition, then choose one with an ID of 83 that looks like it has the majority of the space. Then you should be able to use: dumpe2fs -h /dev/sdb2, for example, and see if you get any other errors or can then successfully mount the partition. Sometimes after a reboot, the fdisk -l command reports partitions not in partition table order and will assign different partition names than the ones you may normally see to the disk/partition of interest. -- Tom From nicdnicd at gmail.com Wed Oct 3 06:48:25 2007 From: nicdnicd at gmail.com (Nickel Cadmium) Date: Wed, 3 Oct 2007 08:48:25 +0200 Subject: Bad magic number in super-block In-Reply-To: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> References: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> Message-ID: <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> Hi! Tom, thanks a lot: you solved my problem! With fdisk -l I discovered that the partition I was trying to mount was a Windows partition. The weird thing is that /dev/sdb1 used to be a Linux partition.
Thinking of it again, I had to pull apart my computer after the crash and I probably shuffled the disks around (or could the renumbering / device reassignment occur even without a hardware change?). But in short, the partition I was looking for is now /dev/sdc1 and updating the partition table solved it all. Thanks & cheers, NiCd On 10/3/07, Thomas Watt wrote: > > Hi Nickel Cadmium, > > First, try running the command (as root): fdisk -l > > That should confirm whether /dev/sdb1 is a valid filesystem partition and > not a > swap partition. Look for an ID of 83 which identifies valid filesystem > partitions. A partition with ID of 82 is usually swap and won't have a > superblock. > > That said, if /dev/sdb1 is not a valid filesystem partition, then choose > one > with an ID of 83 that looks like it has the majority of space. Then > you > should be able to use: dumpe2fs -h /dev/sdb2, for example, and see if you > get > any other errors or can then successfully mount the partition. > > Sometimes after a reboot, the fdisk -l command reports partitions not in > partition table order and will assign different partition names than the > ones > you may normally see to the disk/partition of interest. > > -- Tom > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bryan at kadzban.is-a-geek.net Wed Oct 3 11:01:08 2007 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Wed, 03 Oct 2007 07:01:08 -0400 Subject: Bad magic number in super-block In-Reply-To: <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> References: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> Message-ID: <47037674.3050908@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Nickel Cadmium wrote: > (or could the renumbering / device reassignment occur even without > hardware change?)
For SCSI, yes, it could have changed (depending on your hardware setup). SCSI disk scanning happens in parallel, and has ever since kernel 2.6.18 or .19 or somewhere around there. I believe it still depends on your low-level SCSI driver though. In any case, the sdX device names are no longer necessarily stable. That's why udev now creates the /dev/disk/by-* trees of symlinks, whose names are supposed to be stable. (I'd recommend by-id myself, but it depends on how your disks are set up.) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHA3ZyS5vET1Wea5wRA+X+AKCbk7mtSA79wvZ0uQKHnTrgWTvTGQCdE7mh LML2ihueJgirORxFAvczVZA= =JZO3 -----END PGP SIGNATURE----- From tytso at mit.edu Wed Oct 3 14:52:18 2007 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Oct 2007 10:52:18 -0400 Subject: Bad magic number in super-block In-Reply-To: <47037674.3050908@kadzban.is-a-geek.net> References: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> <47037674.3050908@kadzban.is-a-geek.net> Message-ID: <20071003145218.GC23294@thunk.org> On Wed, Oct 03, 2007 at 07:01:08AM -0400, Bryan Kadzban wrote: > > In any case, the sdX device names are no longer necessarily stable. > That's why udev now creates the /dev/disk/by-* trees of symlinks, whose > names are supposed to be stable. (I'd recommend by-id myself, but it > depends on how your disks are set up.) The recommended way of dealing with this is to put something like this in your /etc/fstab: UUID=57299143-64a5-45f3-8c3d-9b68e38247bd / ext3 defaults,errors=remount-ro 0 1 or LABEL=root / ext3 defaults,errors=remount-ro 0 1 Mount and fsck will automatically find the appropriate device, and this will work even if udev changes in the future. This approach will also work on much older systems, including ones that are pre-udev (i.e., RHEL4, etc.)
Note that you can get yourself in trouble with either approach if you have multiple filesystems with the same label or UUID. With UUIDs, that shouldn't ever happen unless you provision systems via partition images or use dd to copy filesystems around. If you do this, a *really* good idea is to use the command: tune2fs -U random /dev/sdXX ... after you copy a filesystem image, and then use dumpe2fs -h to determine the new UUID. That way, each filesystem will have its own unique UUID. This is especially important if you have a large cluster of machines which access their root filesystem across a SAN network to some large enterprise storage array. It is a really, really good idea to keep each filesystem image separate with its own universally unique ID. - Ted From tango at tiac.net Wed Oct 3 17:16:18 2007 From: tango at tiac.net (Thomas Watt) Date: Wed, 3 Oct 2007 13:16:18 -0400 (GMT-04:00) Subject: How are alternate superblocks repaired? Message-ID: <2818383.1191431778901.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> Hi Ted, That was pretty funny being "protected from preemption"! It turns out I did discover a bug in my script that I previously sent, and have fixed it. Only the 2048 filesystem blocksize still needs testing/verification. Sorry for the resend - it appears my mailer decided I needed to loosen the privileges to send the script. Here is the reworked script attached: 003a2b57b7d0c798b6d1044506634c3c genallsbs.sh Cheers, -- Tom -----Original Message----- >From: Theodore Tso >Sent: Oct 2, 2007 5:59 PM >To: Thomas Watt >Cc: Andreas Dilger , ext3-users at redhat.com >Subject: Re: How are alternate superblocks repaired? > >On Tue, Oct 02, 2007 at 03:38:47PM -0400, Thomas Watt wrote: >> In case you are interested, here is link to a web page on Structure Marking: >> http://www.multicians.org/thvv/marking.html > >I actually have used a Multics system way back when (I was actually >logged into MIT Multics when it was finally shutdown[1]).
The com_err >library and the ss library in e2fsprogs was largely inspired from >Multics, and I do use structure magic numbers in memory to protect >against programming errors, which is basically a very simple structure >marking technique. > >I'm a bit dubious about how useful simply structure matching would be >for modern Linux systems, since a large number of errors really are >silent bit flips in the data, that wouldn't be detected simply by >checking the expected structure ID at the beginning of the on-disk >object. We are planning on adding checksum to metadata for ext4, >which will help a lot in terms of detected bad metadata. > >Regards, ("You are protected from preemption" :-) > >[1] http://stuff.mit.edu/afs/sipb/project/eichin/sipbscan/ > > - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: genallsbs.sh Type: application/x-shellscript Size: 13942 bytes Desc: not available URL: From tytso at mit.edu Wed Oct 3 18:44:36 2007 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Oct 2007 14:44:36 -0400 Subject: How are alternate superblocks repaired? In-Reply-To: <20071002215911.GA6012@thunk.org> References: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> <20071002215911.GA6012@thunk.org> Message-ID: <20071003184436.GD23294@thunk.org> On Tue, Oct 02, 2007 at 05:59:11PM -0400, Theodore Tso wrote: > I'm a bit dubious about how useful simply structure matching would be > for modern Linux systems, since a large number of errors really are sorry, I meant to say "filesystems", not "systems" above > silent bit flips in the data, that wouldn't be detected simply by > checking the expected structure ID at the beginning of the on-disk > object. We are planning on adding checksum to metadata for ext4, > which will help a lot in terms of detected bad metadata. 
- Ted From tango at tiac.net Thu Oct 4 04:15:26 2007 From: tango at tiac.net (Thomas Watt) Date: Thu, 4 Oct 2007 00:15:26 -0400 (GMT-04:00) Subject: How are alternate superblocks repaired? Message-ID: <11773462.1191471326851.JavaMail.root@mswamui-bichon.atl.sa.earthlink.net> Thanks. It turns out there was a way to fully test the script, which is attached: eb89e01bde14d4ca25c778bbb13fb5fa genallsbs.sh.bz2 Looking forward to the new and improved filesystems from you and your filesystem colleagues. Regards, -- Tom -----Original Message----- >From: Theodore Tso >Sent: Oct 3, 2007 2:44 PM >To: Thomas Watt >Cc: Andreas Dilger , ext3-users at redhat.com >Subject: Re: How are alternate superblocks repaired? > >On Tue, Oct 02, 2007 at 05:59:11PM -0400, Theodore Tso wrote: >> I'm a bit dubious about how useful simply structure matching would be >> for modern Linux systems, since a large number of errors really are > sorry, I meant to say "filesystems", not "systems" above >> silent bit flips in the data, that wouldn't be detected simply by >> checking the expected structure ID at the beginning of the on-disk >> object. We are planning on adding checksum to metadata for ext4, >> which will help a lot in terms of detected bad metadata. > > - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: genallsbs.sh.bz2 Type: application/x-bzip Size: 3713 bytes Desc: not available URL: From ross at biostat.ucsf.edu Sat Oct 6 07:10:48 2007 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Sat, 06 Oct 2007 00:10:48 -0700 Subject: Very slow directory traversal Message-ID: <1191654648.8679.109.camel@corn.betterworld.us> My last full backup of my Cyrus mail spool had 1,393,569 files and consumed about 4G after compression. It took over 13 hours. Some investigation led to the following test: time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/ That took 15 minutes the first time it ran, and 32 seconds when run immediately thereafter.
There were 355,746 files. This is typical of what I've been seeing: the initial run is slow; later runs are much faster. df shows /dev/evms/CyrusSpool 19285771 17650480 606376 97% /var/spool/cyrus mount shows /dev/evms/CyrusSpool on /var/spool/cyrus type ext3 (rw,noatime) The spool was active when I did the tests just described, but inactive during backup. It's on top of LVM as managed by EVMS in a Linux 2.6.18 kernel, Pentium 4 processor. It might be significant that Linux treats this as an SMP machine with 2 processors, since the single processor has hyperthreading. I'm using a stock Debian kernel, -686 variant. # time dd if=/dev/evms/CyrusSpool bs=4096 skip=16k count=256k of=/dev/null 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 26.4824 seconds, 40.5 MB/s The spool was mostly populated all at once from another system, and the file names are mostly numbers. Perhaps that creates some hashing trouble? Can anyone explain this, or, even better, give me a hint how I could improve this situation? I found some earlier posts on similar issues, although they mostly concerned apparently empty directories that took a long time. Theodore Tso had a comment that seemed to indicate that hashing conflicts with Unix requirements. I think the implication was that you could end up with linearized, or partly linearized, searches under some scenarios. Since this is a mail spool, I think it gets lots of sync()'s. I conducted pretty extensive tests before picking ext3 for this file system; it was fastest for my tests of writing messages into the spool. I think I tested the "nearly full disk" scenario, but I probably didn't test the scale of files I have now. Obviously my problem now is reading, not writing.
# dumpe2fs -h /dev/evms/CyrusSpool
dumpe2fs 1.40.2 (12-Jul-2007)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 44507cfa-39ce-46f1-9e3e-87091225395d
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super
Filesystem flags: signed directory hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 10289152   # ca. 10x the number of files.
Block count: 20578300
Reserved block count: 1028915
Free blocks: 1651151
Free inodes: 8860352
First block: 1
Block size: 1024
Fragment size: 1024
Reserved GDT blocks: 236
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 4096
Inode blocks per group: 512
Filesystem created: Mon Jan 1 11:32:49 2007
Last mount time: Thu Oct 4 09:42:00 2007
Last write time: Thu Oct 4 09:42:00 2007
Mount count: 2
Maximum mount count: 25
Last checked: Fri Sep 28 09:26:39 2007
Check interval: 15552000 (6 months)
Next check after: Wed Mar 26 09:26:39 2008
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: 9f50511e-2078-4476-96f4-c6f3415fda4f
Journal backup: inode blocks
Journal size: 32M
I believe I created it this way; in particular, I'm pretty sure I've had dir_index from the start. From alex at alex.org.uk Sat Oct 6 11:06:28 2007 From: alex at alex.org.uk (Alex Bligh) Date: Sat, 06 Oct 2007 12:06:28 +0100 Subject: Very slow directory traversal In-Reply-To: <1191654648.8679.109.camel@corn.betterworld.us> References: <1191654648.8679.109.camel@corn.betterworld.us> Message-ID: <1C0A85F326C4B5EC68C47D44@[192.168.100.25]> --On 06 October 2007 00:10 -0700 Ross Boylan wrote: > I believe I created it this way; in particular, I'm pretty sure I've had > dir_index from the start.
find /var/spool/cyrus -type d -exec lsattr -lad \{\} \;

and check the large directories are actually indexed

Alex

From ross at biostat.ucsf.edu Sat Oct 6 16:30:40 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Sat, 06 Oct 2007 09:30:40 -0700
Subject: Very slow directory traversal
In-Reply-To: <1C0A85F326C4B5EC68C47D44@[192.168.100.25]>
References: <1191654648.8679.109.camel@corn.betterworld.us> <1C0A85F326C4B5EC68C47D44@[192.168.100.25]>
Message-ID: <1191688240.8679.114.camel@corn.betterworld.us>

On Sat, 2007-10-06 at 12:06 +0100, Alex Bligh wrote:
>
> --On 06 October 2007 00:10 -0700 Ross Boylan wrote:
>
> > I believe I created it this way; in particular, I'm pretty sure I've had
> > dir_index from the start.
>
> find /var/spool/cyrus -type d -exec lsattr -lad \{\} \;
>
> and check the large directories are actually indexed
>
> Alex

All the large directories are indexed, but some smaller or empty ones seem not to be. Here's a line from the directory I reported on, and then one that doesn't show as indexed. The find took about 3 minutes to run.

/var/spool/cyrus/mail/r/user/ross/debian/user Indexed_directory
/var/spool/cyrus/mail/r/user/ross/debian/devel ---

During the find, as during my other operations that take a long time, vmstat shows around 40-45% of the CPU time in io wait. I'm not sure if the pseudo-dual CPUs are throwing that off, i.e., if that really means 80-90%.

From alex at alex.org.uk Sun Oct 7 08:58:36 2007
From: alex at alex.org.uk (Alex Bligh)
Date: Sun, 07 Oct 2007 09:58:36 +0100
Subject: Very slow directory traversal
In-Reply-To: <1191688240.8679.114.camel@corn.betterworld.us>
References: <1191654648.8679.109.camel@corn.betterworld.us> <1C0A85F326C4B5EC68C47D44@[192.168.100.25]> <1191688240.8679.114.camel@corn.betterworld.us>
Message-ID:

> All the large directories are indexed, but some smaller or empty ones
> seem not to be.
I think that's correct; it doesn't build the index tree until the directory reaches (from memory) a couple of blocks. I vaguely recall that one can still use readdir / telldir and end up with an O(n^2) result, but I forget how. You've reached the limit of my knowledge here.

Alex

From adilger at clusterfs.com Wed Oct 10 15:59:20 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 10 Oct 2007 09:59:20 -0600
Subject: Very slow directory traversal
In-Reply-To: <1191654648.8679.109.camel@corn.betterworld.us>
References: <1191654648.8679.109.camel@corn.betterworld.us>
Message-ID: <20071010155920.GV8122@schatzie.adilger.int>

On Oct 06, 2007 00:10 -0700, Ross Boylan wrote:
> My last full backup of my Cyrus mail spool had 1,393,569 files and
> consumed about 4G after compression. It took over 13 hours. Some
> investigation led to the following test:
> time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/

FYI - "tar cf /dev/null" actually skips reading any file data. The code special cases /dev/null and skips the read entirely.

> That took 15 minutes the first time it ran, and 32 seconds when run
> immediately thereafter. There were 355,746 files. This is typical of
> what I've been seeing: initial run is slow; later runs are much faster.

I'd expect this is because on the initial run the on-disk inode ordering causes a lot of seeks, and later runs come straight from memory. Probably not a lot you can do directly, but e.g. pre-reading the inode table would be a good start.

> I found some earlier posts on similar issues, although they mostly
> concerned apparently empty directories that took a long time. Theodore
> Tso had a comment that seemed to indicate that hashing conflicts with
> Unix requirements. I think the implication was that you could end up
> with linearized, or partly linearized searches under some scenarios.
> Since this is a mail spool, I think it gets lots of sync()'s.
There was an LD_PRELOAD library that Ted wrote that may also help:
http://marc.info/?l=mutt-dev&m=107226330912347&w=2

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From ross at biostat.ucsf.edu Thu Oct 11 06:37:19 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Wed, 10 Oct 2007 23:37:19 -0700
Subject: Very slow directory traversal
In-Reply-To: <20071010155920.GV8122@schatzie.adilger.int>
References: <1191654648.8679.109.camel@corn.betterworld.us> <20071010155920.GV8122@schatzie.adilger.int>
Message-ID: <1192084639.2075.75.camel@corn.betterworld.us>

On Wed, 2007-10-10 at 09:59 -0600, Andreas Dilger wrote:
> On Oct 06, 2007 00:10 -0700, Ross Boylan wrote:
> > My last full backup of my Cyrus mail spool had 1,393,569 files and
> > consumed about 4G after compression. It took over 13 hours. Some
> > investigation led to the following test:
> > time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/
>
> FYI - "tar cf /dev/null" actually skips reading any file data. The
> code special cases /dev/null and skips the read entirely.
>
> > That took 15 minutes the first time it ran, and 32 seconds when run
> > immediately thereafter. There were 355,746 files. This is typical of
> > what I've been seeing: initial run is slow; later runs are much faster.
>
> I'd expect this is because on the initial run the on-disk inode ordering
> causes a lot of seeks, and later runs come straight from memory. Probably
> not a lot you can do directly, but e.g. pre-reading the inode table would
> be a good start.

Judging from your comments and the thread you reference below, the problem is that the order returned from readdir is not inode order. But if tar, in this special case (/dev/null), doesn't actually read from the file, why should it be so slow? Does it do something (stat?) that makes it have to fetch the inode anyway?
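[Editor's note: yes — even when no file data is read, tar must stat() each entry to build its archive headers, and stat() in directory-hash order hops all over the inode table. The core trick in Ted's LD_PRELOAD library is to sort the readdir results by inode number first, so the subsequent inode reads walk the table mostly front-to-back. A minimal sketch of that sorting step, not the actual spd_readdir code:]

```python
import os

def scan_in_inode_order(path):
    """Return (name, inode) pairs for `path`, sorted by inode number.

    readdir() on an htree directory yields names in hash order, so
    stat()ing in that order seeks randomly across the inode table.
    Sorting by the directory entry's inode number first makes the
    subsequent stat()s a mostly forward sweep instead.
    """
    entries = [(e.name, e.inode()) for e in os.scandir(path)]
    entries.sort(key=lambda pair: pair[1])
    return entries

# A backup tool would then stat()/read the files in this order:
for name, ino in scan_in_inode_order("."):
    pass  # e.g. os.stat(name), then archive the file
```

The same idea can be applied from a wrapper script: list the directory, sort by inode, and feed the sorted list to the backup tool.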
> >
> > I found some earlier posts on similar issues, although they mostly
> > concerned apparently empty directories that took a long time. Theodore
> > Tso had a comment that seemed to indicate that hashing conflicts with
> > Unix requirements. I think the implication was that you could end up
> > with linearized, or partly linearized searches under some scenarios.
> > Since this is a mail spool, I think it gets lots of sync()'s.
>
> There was an LD_PRELOAD library that Ted wrote that may also help:
> http://marc.info/?l=mutt-dev&m=107226330912347&w=2

I got the code, but am not having much luck making it work. I've tried various things. The most recent is

cc -shared -fpic -o libsd_readdir.so spd_readdir.c   # as me
# rest as root
# export LD_LIBRARY_PATH=./
# export LD_PRELOAD=libsd_readdir.so
# ldconfig -v -n $(pwd)
/usr/local/src/kernel/ext3-patch:
libsd_readdir.so -> libsd_readdir.so
corn:/usr/local/src/kernel/ext3-patch# date; time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/pol/asdnet/
Wed Oct 10 23:16:44 PDT 2007
tar: Removing leading `/' from member names
Segmentation fault

I don't know how to make something for preload; can anyone give any hints?

Should the module I'm attempting to load have any effect on the 15 minute time noted above for tar to /dev/null, or is it only relevant if I am pulling data off the disk files?

Would there be any value in having some other program traverse the directories before I do the backup, or would cache limits likely mean the stuff from the start would be gone from the cache by the time I got to the end, so that the backup would basically be starting fresh?

Thanks.
Ross

From jae at platinumpsi.com Sat Oct 13 16:46:27 2007
From: jae at platinumpsi.com (J)
Date: Sat, 13 Oct 2007 11:46:27 -0500
Subject: Commercial file recovery for ext3?
Message-ID: <4710F663.8020806@platinumpsi.com>

A user inflicted a massive change on an EXT-3 data partition.* I'm looking for an application that can recover deleted files.
( The majority of the files are Excel. ) I don't particularly care what it names the files, and I don't expect a 100% success rate, even though I told everyone to go home right after I found out it had been done.

* Over a gig of files on a Samba server were moved into another directory by mistake (by Windows XP Media Center), and then subsequently moved back to their previous location... except when a dialog came up showing the files being processed one-by-one, it was canceled in a panic.

The timing wasn't good: the backup scripts had been failing quietly.

Looking for the latest options. Anyone have anything they've used?

Thanks!
--J

From keld at dkuug.dk Sat Oct 13 17:59:56 2007
From: keld at dkuug.dk (Keld =?iso-8859-1?Q?J=F8rn?= Simonsen)
Date: Sat, 13 Oct 2007 19:59:56 +0200
Subject: Commercial file recovery for ext3?
In-Reply-To: <4710F663.8020806@platinumpsi.com>
References: <4710F663.8020806@platinumpsi.com>
Message-ID: <20071013175956.GA28717@rap.rap.dk>

On Sat, Oct 13, 2007 at 11:46:27AM -0500, J wrote:
> A user inflicted a massive change on an EXT-3 data partition.* I'm
> looking for an application that can recover deleted files. ( The
> majority of the files are Excel. ) I don't particularly care what it
> names the files, and I don't expect a 100% success rate, even though I
> told everyone to go home right after I found out it had been done.
>
> * Over a gig of files on a Samba server were moved into another
> directory by mistake (by Windows XP Media Center), and then subsequently
> moved back to their previous location... except when a dialog came up
> showing the files being processed one-by-one, it was canceled in a panic.
>
> The timing wasn't good: the backup scripts had been failing quietly.
>
> Looking for the latest options. Anyone have anything they've used?

I have made some software available at
http://std.dkuug.dk/keld/readme-salvage.html

It is not perfect, but try it out.
best regards
keld

From ross at biostat.ucsf.edu Mon Oct 15 17:41:54 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Mon, 15 Oct 2007 10:41:54 -0700
Subject: Very slow directory traversal
In-Reply-To: <1192084639.2075.75.camel@corn.betterworld.us>
References: <1191654648.8679.109.camel@corn.betterworld.us> <20071010155920.GV8122@schatzie.adilger.int> <1192084639.2075.75.camel@corn.betterworld.us>
Message-ID: <1192470114.8377.6.camel@corn.betterworld.us>

On Wed, 2007-10-10 at 23:37 -0700, Ross Boylan wrote:
> On Wed, 2007-10-10 at 09:59 -0600, Andreas Dilger wrote:
> > On Oct 06, 2007 00:10 -0700, Ross Boylan wrote:
> > > My last full backup of my Cyrus mail spool had 1,393,569 files and
> > > consumed about 4G after compression. It took over 13 hours. Some
> > > investigation led to the following test:
> > > time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/
> >
> > FYI - "tar cf /dev/null" actually skips reading any file data. The
> > code special cases /dev/null and skips the read entirely.
> >
> > > That took 15 minutes the first time it ran, and 32 seconds when run
> > > immediately thereafter. There were 355,746 files. This is typical of
> > > what I've been seeing: initial run is slow; later runs are much faster.
> >
> > I'd expect this is because on the initial run the on-disk inode ordering
> > causes a lot of seeks, and later runs come straight from memory. Probably
> > not a lot you can do directly, but e.g. pre-reading the inode table would
> > be a good start.
>
> Judging from your comments and the thread you reference below, the
> problem is that the order returned from readdir is not inode order. But
> if tar, in this special case (/dev/null), doesn't actually read from the
> file, why should it be so slow? Does it do something (stat?) that makes
> it have to fetch the inode anyway?
> >
> > > I found some earlier posts on similar issues, although they mostly
> > > concerned apparently empty directories that took a long time. Theodore
> > > Tso had a comment that seemed to indicate that hashing conflicts with
> > > Unix requirements. I think the implication was that you could end up
> > > with linearized, or partly linearized searches under some scenarios.
> > > Since this is a mail spool, I think it gets lots of sync()'s.
> >
> > There was an LD_PRELOAD library that Ted wrote that may also help:
> > http://marc.info/?l=mutt-dev&m=107226330912347&w=2
>
> I got the code, but am not having much luck making it work. I've tried
> various things. The most recent is
> cc -shared -fpic -o libsd_readdir.so spd_readdir.c   # as me
> # rest as root
> # export LD_LIBRARY_PATH=./
> # export LD_PRELOAD=libsd_readdir.so
> # ldconfig -v -n $(pwd)
> /usr/local/src/kernel/ext3-patch:
> libsd_readdir.so -> libsd_readdir.so
> corn:/usr/local/src/kernel/ext3-patch# date; time tar
> cf /dev/null /var/spool/cyrus/mail/r/user/ross/pol/asdnet/
> Wed Oct 10 23:16:44 PDT 2007
> tar: Removing leading `/' from member names
> Segmentation fault

Even stranger, when I try the same thing with a little test program that calls readdir, it works. I tried running tar as myself, but got the same segfault (the first test I reported I ran as root). tar doesn't look as if it's setuid:

# ls -l /bin/tar
-rwxr-xr-x 1 root root 231188 2007-09-05 02:42 /bin/tar

> I don't know how to make something for preload; can anyone give any
> hints?
>
> Should the module I'm attempting to load have any effect on the 15
> minute time noted above for tar to /dev/null, or is it only relevant if
> I am pulling data off the disk files?
>
> Would there be any value in having some other program traverse the
> directories before I do the backup, or would cache limits likely mean
> the stuff from the start would be gone from the cache by the time I got
> to the end, so that the backup would basically be starting fresh?
>
> Thanks.
> Ross

From wesley at terpstra.ca Sun Oct 14 18:34:40 2007
From: wesley at terpstra.ca (Wesley W. Terpstra)
Date: Sun, 14 Oct 2007 20:34:40 +0200
Subject: Big extended attributes
Message-ID:

Good evening!

I've recently been running into a space limitation for extended attributes in ext3. I understand that earlier versions of ext3 stored these in the inode record. Is this still the case? Is there any way to allow for more space for extended attributes in an ext3 partition? I know that xfs has no limits on extended attributes, but I have several orthogonal reasons for sticking with ext3.

PS. Please CC me as I am not a member of this list.

From adilger at clusterfs.com Fri Oct 19 16:56:11 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 19 Oct 2007 10:56:11 -0600
Subject: Big extended attributes
In-Reply-To:
References:
Message-ID: <20071019165611.GF8122@schatzie.adilger.int>

On Oct 14, 2007 20:34 +0200, Wesley W. Terpstra wrote:
> I've recently been running into a space limitation for extended
> attributes in ext3. I understand that earlier versions of ext3 stored
> these in the inode record. Is this still the case?

Actually, it is the converse - only new (and specially formatted) filesystems with larger inodes will store the EAs in the inode, for improved performance. Otherwise there is a single fs block for all EAs on a file.

If you need a small amount of extra EA space (e.g. 128 or 384 bytes) and you control the environment, then formatting the filesystem with "mke2fs -j -I 512" can give you some more space, but not a huge amount.
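[Editor's note: to put rough numbers on the "-I 512" suggestion: the in-inode EA area is at most the inode size minus the classic 128-byte inode, less a small header whose exact size varies by kernel and is ignored in this sketch, so treat these figures as upper bounds:]

```python
GOOD_OLD_INODE_SIZE = 128  # the classic ext2/ext3 on-disk inode

def max_in_inode_ea_space(inode_size):
    """Upper bound on bytes available for in-inode EAs.

    The space past the first 128 bytes of a large inode holds the
    EAs; a small header (a few bytes, ignored here) reduces this
    slightly in practice.
    """
    if inode_size <= GOOD_OLD_INODE_SIZE:
        return 0  # old-style inodes: EAs live in a separate fs block
    return inode_size - GOOD_OLD_INODE_SIZE

# mke2fs -j -I 512 leaves at most this much room for in-inode EAs:
print(max_in_inode_ea_space(512))   # 384, matching the figure above
print(max_in_inode_ea_space(256))   # 128
```

Anything larger than this still spills to the single shared EA block, which is capped at one filesystem block.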
-I == total inode size; includes 128 bytes for inode; can be up to 4096 bytes

> Is there any way
> to allow for more space for extended attributes in an ext3 partition?

Not currently. We did some work to allow large EAs to be stored in a separate inode, but that doesn't help if you have lots of small EAs.

> I know that xfs has no limits on extended attributes, but I have
> several orthogonal reasons for sticking with ext3.

Hmm, I thought XFS had a 64kB EA limit? What is it you are trying to do? There are often better solutions than storing a lot of data in EAs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From dpc22 at cam.ac.uk Mon Oct 22 15:03:35 2007
From: dpc22 at cam.ac.uk (David Carter)
Date: Mon, 22 Oct 2007 16:03:35 +0100 (BST)
Subject: EXT3-fs error in htree_dirblock_to_tree
Message-ID:

Hello all,

Does anyone know if the following is likely to be a software problem or a hardware fault?

Oct 22 14:01:43 cyrus-26 kernel: EXT3-fs error (device md0): htree_dirblock_to_tree: bad entry in directory #360809233: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

A quick Google didn't tell me much, although a couple of people seem to have seen similar problems after hardware problems, and one person seemed to be able to trigger it using an "insane file system test":

http://www.mail-archive.com/linux-ext4 at vger.kernel.org/msg02515.html

The filesystem in question is a Cyrus mailstore: lots of write (and fsync) activity with small files. It was created with:

mkfs.ext3 -T news -m 1 -O dir_index -j -J size=256 /dev/md0

and is currently mounted data=ordered. Platform is SLES10. We haven't seen one of these before, but we are in the process of moving from reiser (which never did anything like this) to ext3/htree, so it would be useful to know if it is a known problem. Thanks.
--
David Carter                        Email: David.Carter at ucs.cam.ac.uk
University Computing Service,       Phone: (01223) 334502
New Museums Site, Pembroke Street,  Fax:   (01223) 334679
Cambridge UK. CB2 3QH.

From ecashin at coraid.com Fri Oct 19 16:57:15 2007
From: ecashin at coraid.com (Ed L Cashin)
Date: Fri, 19 Oct 2007 12:57:15 -0400
Subject: sync in-cache fs data after remount ro on error?
Message-ID: <87d4vbt53o.fsf@coraid.com>

Hi. If a block device stops working and then starts working later, does the sysadmin have a way to ask ext3 to sync the now read-only filesystem to disk?

For example, I can temporarily shut down the network interfaces that make an AoE target accessible (simulating, e.g., somebody accidentally unplugging a network switch). When the I/O fails, the filesystem is automatically mounted read-only, which is great. But if valuable data has been committed to the in-cache filesystem but not the on-disk filesystem, it would ideally be possible to remount the filesystem read-write once the device is online again (from running aoe-revalidate), so that the new data could be sync'ed out to disk.

The mount command won't remount the ext3 read-write.

ellijay:~# mount -o remount,rw /mnt/e7.1
mount: block device /dev/etherd/e7.1 is write-protected, mounting read-only

A kernel message says, "Abort forced by user", which looks like it is coming from fs/ext3/super.c,

	if (sbi->s_mount_opt & EXT3_MOUNT_ABORT)
		ext3_abort(sb, __FUNCTION__, "Abort forced by user");

Checking the e2fsprogs manpages, I don't see a way to ask ext3 to stop aborting a read-write mount. If all the uncommitted in-cache data is still marked as dirty, it seems like it might be possible to safely commit it now that the sysadmin knows the block device is OK. Is there a way to commit the dirty changes when the block device has stopped failing I/O?
--
Ed L Cashin

From rjcarr at gmail.com Tue Oct 23 22:30:12 2007
From: rjcarr at gmail.com (rjcarr)
Date: Tue, 23 Oct 2007 15:30:12 -0700 (PDT)
Subject: Solution to Corrupt >2TB Filesystem in MSDOS Partition Table
In-Reply-To: <45F8642B.5080908@berkeley.edu>
References: <45F571C3.9090303@berkeley.edu> <20070313070433.GL5266@schatzie.adilger.int> <45F8642B.5080908@berkeley.edu>
Message-ID: <13375087.post@talk.nabble.com>

Jon Forrest-2 wrote:
>
> Thanks to Ted and several others, I was
> able to recover 100% of the corrupted
> file system that I posted about last week.
> (This was an >2TB ext3 file system that had been
> created in a MSDOS partition which had worked
> until the server was rebooted, at which time
> it wouldn't mount and fsck wouldn't fix the
> problem.)

I just wanted to add that I had the same exact situation and this solution also worked for me. My only difference was that my filesystem was xfs (not ext3). Also, in this part:

> 3) I then used the parted "rescue" command
> to recreate the partition. I gave it the original
> starting point at the start value and "-1s" as
> the ending value.

I knew the exact end value from when I created the partition, so I used it instead of -1. Not sure if it would have worked had I used -1, but I thought my number safer.

--
View this message in context: http://www.nabble.com/How-To-Recover-From-Creating-%3E2TB-ext3-Filesystem-on-MSDOS-Partition-Table--tf3390167.html#a13375087
Sent from the Ext3 - User mailing list archive at Nabble.com.

From ameet.nanda at wipro.com Wed Oct 24 09:55:44 2007
From: ameet.nanda at wipro.com (Naxor)
Date: Wed, 24 Oct 2007 02:55:44 -0700 (PDT)
Subject: Problem with file system
Message-ID: <13382672.post@talk.nabble.com>

While I untar a large archive on xfs and ext3 (ver 1.3 and ver 1.4) file systems, on a ppc processor with kernel 2.6.21, I get an error. Also sometimes, on ext3 (1.3 and 1.4), the file system goes read-only while untarring.
The same tar file, when untarred on an i386 machine, works properly.

ERROR:
--------------
tar: Skipping to next header

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error exit delayed from previous errors
-------------------

Can anyone suggest some tools or methods to investigate the crash or proceed with the task?

--
View this message in context: http://www.nabble.com/Problem-with-file-system-tf4683372.html#a13382672
Sent from the Ext3 - User mailing list archive at Nabble.com.

From lists at nerdbynature.de Thu Oct 25 08:07:36 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 25 Oct 2007 10:07:36 +0200 (CEST)
Subject: Problem with file system
In-Reply-To: <13382672.post@talk.nabble.com>
References: <13382672.post@talk.nabble.com>
Message-ID: <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de>

On Wed, October 24, 2007 11:55, Naxor wrote:
> While I untar a large archive on xfs and ext3 (ver 1.3 and ver 1.4) file
> systems, on a ppc processor with kernel 2.6.21, I get an error. Also
> sometimes, on ext3 (1.3 and 1.4), the file system goes read-only while
> untarring.

can you please post the errors from your syslog, when this happens? Also, did you fsck.ext3 your filesystem lately?

Christian.
--
BOFH excuse #442: Trojan horse ran out of hay

From ameet.nanda at wipro.com Thu Oct 25 09:18:20 2007
From: ameet.nanda at wipro.com (Ameet Nanda)
Date: Thu, 25 Oct 2007 14:48:20 +0530
Subject: Problem with file system
In-Reply-To: <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de>
Message-ID: <1193303900.6108.11.camel@ameet>

Hi,

I tried to untar using the command tar -xvzmf.
The error I got after tar runs for some time was:

---------------------------------------
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error exit delayed from previous errors
----------------------------------------

On doing a fsck.ext3 I get the result as:

--------------------------------------------------------------
/dev/sda2: ********** WARNING: Filesystem still has errors **********

   15162 inodes used (0.50%)
      84 non-contiguous inodes (0.6%)
         # of inodes with ind/dind/tind blocks: 1370/52/0
  605645 blocks used (10.05%)
       0 bad blocks
       2 large files

   11832 regular files
     886 directories
       0 character device files
       0 block device files
       0 fifos
4294967294 links
    2436 symbolic links (2414 fast symbolic links)
       0 sockets
--------
   15150 files
---------------------------------------------------------------------------

- Ameet

On Thu, 2007-10-25 at 10:07 +0200, Christian Kujau wrote:
> On Wed, October 24, 2007 11:55, Naxor wrote:
> > While I untar a large archive on xfs and ext3 (ver 1.3 and ver 1.4) file
> > systems, on a ppc processor with kernel 2.6.21, I get an error. Also
> > sometimes, on ext3 (1.3 and 1.4), the file system goes read-only while
> > untarring.
>
> can you please post the errors from your syslog, when this happens? Also,
> did you fsck.ext3 your filesystem lately?
>
> Christian.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From lists at nerdbynature.de Thu Oct 25 10:05:44 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 25 Oct 2007 12:05:44 +0200 (CEST)
Subject: Problem with file system
In-Reply-To: <1193303900.6108.11.camel@ameet>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de> <1193303900.6108.11.camel@ameet>
Message-ID: <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de>

Ameet,

On Thu, October 25, 2007 11:18, Ameet Nanda wrote:
> The error I got after tar runs for some time was:

Please post the errors from your system log (usually /var/log/messages, /var/log/kern.log or the like).

> On doing a fsck.ext3 I get the result as:
> --------------------------------------------------------------
> /dev/sda2: ********** WARNING: Filesystem still has errors **********

Did you unmount /dev/sda2 before running fsck.ext3? Please do, and then post the *whole* output of the "fsck.ext3 -v" run, not just the results.

C.
--
BOFH excuse #442: Trojan horse ran out of hay

From ameet.nanda at wipro.com Thu Oct 25 11:35:28 2007
From: ameet.nanda at wipro.com (Ameet Nanda)
Date: Thu, 25 Oct 2007 17:05:28 +0530
Subject: Problem with file system
In-Reply-To: <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de> <1193303900.6108.11.camel@ameet> <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de>
Message-ID: <1193312128.6108.29.camel@ameet>

Hi Chris,

I unmounted /dev/sda2 and ran fsck.ext3. This was the complete o/p:

===========================
root at 172:/root> fsck.ext3 /dev/sda2 -v -n
e2fsck 1.40.2 (12-Jul-2007)
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 1505 has imagic flag set. Clear? no
Inode 1505, i_blocks is 2561936855, should be 0. Fix? no
Inode 15393 has compression flag set on filesystem without compression support. Clear? no
Deleted inode 164029 has zero dtime. Fix? no
Inode 1463073 is in use, but has dtime set. Fix? no
Inode 1463073 has imagic flag set. Clear? no
Inode 1463073 has compression flag set on filesystem without compression support. Clear? no
Inode 1463073 has INDEX_FL flag set but is not a directory. Clear HTree index? no
HTREE directory inode 1463073 has an invalid root node. Clear HTree index? no
Error reading block 4294967295 (Invalid argument). Ignore error? no
HTREE directory inode 1463073 has an invalid root node. Clear HTree index? no
HTREE directory inode 1463073 has an invalid root node. Clear HTree index? no
Inode 1463073, i_blocks is 4294967295, should be 0. Fix? no
Deleted inode 1685409 has zero dtime. Fix? no
Inode 1835553 is in use, but has dtime set. Fix? no
Inode 1835553 has illegal block(s). Clear? no
Illegal block #0 (310724603) in inode 1835553. IGNORED.
Illegal block #1 (837540054) in inode 1835553. IGNORED.
Illegal block #2 (3716133180) in inode 1835553. IGNORED.
Illegal block #3 (2359092648) in inode 1835553. IGNORED.
Illegal block #4 (155050197) in inode 1835553. IGNORED.
Illegal block #5 (2295681145) in inode 1835553. IGNORED.
HTREE directory inode 1835553 has an invalid root node. Clear HTree index? no
Error reading block 310724603 (Invalid argument). Ignore error? no
HTREE directory inode 1835553 has an invalid root node. Clear HTree index? no
HTREE directory inode 1835553 has an invalid root node. Clear HTree index? no
Inode 1835553 is a zero-length directory. Clear? no
Inode 1835553, i_size is 1155516870, should be 0. Fix? no
Inode 1835553, i_blocks is 2500161256, should be 0. Fix? no
Pass 2: Checking directory structure
Entry 'pdf_fontmgr_cidfonttypes.ps' in /SYSROM_SRC/mfp/PRF/rbdisk0/PostScript (1835102) has an incorrect filetype (was 1, should be 2). Fix? no
Directory inode 1835553 has an unallocated block #6. Allocate? no
Directory inode 1835553 has an unallocated block #7. Allocate? no
Directory inode 1835553 has an unallocated block #8. Allocate? no
Directory inode 1835553 has an unallocated block #9. Allocate? no
Directory inode 1835553 has an unallocated block #10. Allocate? no
Directory inode 1835553 has an unallocated block #11. Allocate? no
Pass 3: Checking directory connectivity
'..' in /SYSROM_SRC/mfp/PRF/rbdisk0/PostScript/pdf_fontmgr_cidfonttypes.ps (1835553) is (0), should be /SYSROM_SRC/mfp/PRF/rbdisk0/PostScript (1835102). Fix? no
Pass 4: Checking reference counts
Inode 1505 (...) is an illegal socket. Clear? no
Unattached inode 1505
Connect to /lost+found? no
Unattached zero-length inode 10209. Clear? no
Unattached inode 10209
Connect to /lost+found? no
Inode 15393 (...) has invalid mode (0177777). Clear? no
Unattached inode 15393
Connect to /lost+found? no
Inode 16673 (...) has invalid mode (0177777). Clear? no
Unattached inode 16673
Connect to /lost+found? no
Inode 32801 (...) has invalid mode (0177777). Clear? no
Unattached inode 32801
Connect to /lost+found? no
Inode 33313 (...) has invalid mode (0177777). Clear? no
Unattached inode 33313
Connect to /lost+found? no
Inode 49185 (...) has invalid mode (0177777). Clear? no
Unattached inode 49185
Connect to /lost+found? no
Inode 49697 (...) has invalid mode (0177777). Clear? no
Unattached inode 49697
Connect to /lost+found? no
Inode 65569 (...) has invalid mode (0177777). Clear? no
Unattached inode 65569
Connect to /lost+found? no
Inode 66081 (...) has invalid mode (0177777). Clear? no
Unattached inode 66081
Connect to /lost+found? no
Inode 1463073 (...) has invalid mode (00). Clear? no
Unattached inode 1463073
Connect to /lost+found? no
WARNING: PROGRAMMING BUG IN E2FSCK! OR SOME BONEHEAD (YOU) IS CHECKING A MOUNTED (LIVE) FILESYSTEM.
inode_link_info[1835553] is 44779, inode.i_links_count is 1. They should be the same!
Inode 1835553 ref count is 1, should be 1. Fix? no
Pass 5: Checking group summary information
Block bitmap differences: -(305650--305665) -359594 -(359611--360268) -(3701464--3701466)
Fix? no
Inode bitmap differences: +1505 +10209 +15393 +16673 +32801 +33313 +49185 +49697 +65569 +66081 -164029 +1463073
Fix? no
Directories count wrong for group #112 (17, counted=18). Fix? no

/dev/sda2: ********** WARNING: Filesystem still has errors **********

   15162 inodes used (0.50%)
      81 non-contiguous inodes (0.5%)
         # of inodes with ind/dind/tind blocks: 1370/52/0
  605645 blocks used (10.05%)
       0 bad blocks
       1 large file

   11831 regular files
     886 directories
       0 character device files
       0 block device files
       0 fifos
4294967294 links
    2436 symbolic links (2414 fast symbolic links)
       0 sockets
--------
   15150 files

Here is the log from tail /var/log/kern.log:
=============================================
Oct 25 17:16:53 172 kernel: [  267.117373] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117396] sda2: rw=0, want=13777058744, limit=48195000
Oct 25 17:16:53 172 kernel: [  267.117404] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117411] sda2: rw=0, want=16416658088, limit=48195000
Oct 25 17:16:53 172 kernel: [  267.117419] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117425] sda2: rw=0, want=15853339616, limit=48195000
Oct 25 17:16:53 172 kernel: [  267.117432] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117439] sda2: rw=0, want=30048438328, limit=48195000

- Ameet

On Thu, 2007-10-25 at 12:05 +0200, Christian Kujau wrote:
> Ameet,
>
> On Thu, October 25, 2007 11:18, Ameet Nanda wrote:
> > The error I got after tar runs for some time was:
>
> Please post the errors from your system log (usually /var/log/messages,
> /var/log/kern.log or the like).
>
> > On doing a fsck.ext3 I get the result as:
> > --------------------------------------------------------------
> > /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> Did you unmount /dev/sda2 before running fsck.ext3? Please do, and then
> post the *whole* output of the "fsck.ext3 -v" run, not just the results.
>
> C.

From lists at nerdbynature.de Thu Oct 25 12:25:11 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 25 Oct 2007 14:25:11 +0200 (CEST)
Subject: Problem with file system
In-Reply-To: <1193312128.6108.29.camel@ameet>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de> <1193303900.6108.11.camel@ameet> <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de> <1193312128.6108.29.camel@ameet>
Message-ID: <43870.62.180.231.196.1193315111.squirrel@www.housecafe.de>

On Thu, October 25, 2007 13:35, Ameet Nanda wrote:
> I unmounted /dev/sda2 and ran fsck.ext3. This was the complete o/p

thanks for the log. Now the real gurus have something to work with :-)

> root at 172:/root> fsck.ext3 /dev/sda2 -v -n
> e2fsck 1.40.2 (12-Jul-2007)
> /dev/sda2 contains a file system with errors, check forced.

If the filesystem is corrupted, all kinds of things might happen to your .tar file. A good start would be to find out what could've caused the filesystem corruption in the first place. Did your box lose power and crash? Has the hardware been altered, new memory, new cables?

> Oct 25 17:16:53 172 kernel: [  267.117373] attempt to access beyond end
> of device
> Oct 25 17:16:53 172 kernel: [  267.117396] sda2: rw=0, want=13777058744,
> limit=48195000

Did someone/something alter the partition table? Can you do the following without getting errors in kern.log?

dd if=/dev/sda2 of=/dev/null bs=512

Christian.
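[Editor's note: the absurd values in that fsck/kern.log output are themselves diagnostic. Numbers like 4294967295 and 4294967294 are 0xFFFFFFFF and 0xFFFFFFFE, i.e. -1 and -2 read back as unsigned 32-bit counters, and the "want=" sectors lie far beyond the 48195000-sector partition, so the on-disk metadata contains garbage rather than plausible-but-wrong values. A quick sanity check one can script (the thresholds are taken directly from the log above):]

```python
# Interpret the suspicious numbers from the fsck/kern.log output.
# Values at or near 2**32 are almost always wrapped signed counters.

U32_MAX = 2**32 - 1

def looks_like_wrapped_counter(n, slack=16):
    """True if n is within `slack` of the unsigned 32-bit ceiling."""
    return U32_MAX - slack <= n <= U32_MAX

# From the e2fsck output: i_blocks and the link count
assert looks_like_wrapped_counter(4294967295)   # 0xFFFFFFFF == -1 as u32
assert looks_like_wrapped_counter(4294967294)   # 0xFFFFFFFE == -2 as u32

# From kern.log: requested sectors vs. the partition's real size
PARTITION_SECTORS = 48195000                    # "limit=" in the log
for want in (13777058744, 16416658088, 15853339616, 30048438328):
    assert want > PARTITION_SECTORS             # reads far past end of device
print("all suspicious values confirmed out of range")
```

Seeing -1/-2 patterns like this usually points at overwritten metadata (bad cable, memory, or a partition-table change) rather than a subtle logic bug.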
--
BOFH excuse #442:
Trojan horse ran out of hay

From h.m.holt at gmail.com  Wed Oct 31 01:00:50 2007
From: h.m.holt at gmail.com (Hans Holt)
Date: Wed, 31 Oct 2007 12:00:50 +1100
Subject: remounting ext3 file systems
Message-ID: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>

Hi,

I want to remount a mounted ext3 file system. Typically, the
"mount -o remount " option is used when an already mounted read-only
file system is remounted as read+write. Is it considered safe to
remount a file system that is already mounted read+write and has open
files in use? I want to change some mount options without killing the
processes accessing the file system, unmounting it, or restarting the
machine.

Thanks
Hans

From darkonc at gmail.com  Wed Oct 31 02:32:25 2007
From: darkonc at gmail.com (Stephen Samuel)
Date: Tue, 30 Oct 2007 19:32:25 -0700
Subject: remounting ext3 file systems
In-Reply-To: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>
References: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>
Message-ID: <6cd50f9f0710301932o7ea11815h7a46f48ee936d47@mail.gmail.com>

As long as you don't set any options that would interfere with what the
running processes are doing with the files on that filesystem, you
should be fine.

(For example: remounting the filesystem read-only while files were open
read-write would be problematic for the processes involved, and I don't
know what would happen if you remounted a filesystem nodev while people
had devices open on it.)

On 10/30/07, Hans Holt wrote:
>
> Hi,
>
> I want to remount a mounted ext3 file system. Typically, the
> "mount -o remount " option is used when an already mounted read-only
> file system is remounted as read+write. Is it considered safe to
> remount a file system that is already mounted read+write and has open
> files in use?
> I want to change some mount options without killing the processes
> accessing the file system, unmounting it, or restarting the machine.

--
Stephen Samuel  http://www.bcgreen.com  778-861-7641
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sandeen at redhat.com  Wed Oct 31 03:10:03 2007
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 30 Oct 2007 22:10:03 -0500
Subject: remounting ext3 file systems
In-Reply-To: <6cd50f9f0710301932o7ea11815h7a46f48ee936d47@mail.gmail.com>
References: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>
	<6cd50f9f0710301932o7ea11815h7a46f48ee936d47@mail.gmail.com>
Message-ID: <4727F20B.2070907@redhat.com>

Stephen Samuel wrote:
> As long as you don't set any options that would interfere with what
> the running processes are doing with the files on that filesystem,
> you should be fine.
>
> (For example: remounting the filesystem read-only while files were
> open read-write would be problematic for the processes involved,

In this case mount -o ro will fail with -EBUSY.

> and I don't know what would happen if you remounted a filesystem
> nodev while people had devices open on it.)

This should reject new device openers.

-Eric

From adilger at sun.com  Thu Oct 25 20:31:20 2007
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 25 Oct 2007 20:31:20 -0000
Subject: sync in-cache fs data after remount ro on error?
In-Reply-To: <87d4vbt53o.fsf@coraid.com>
References: <87d4vbt53o.fsf@coraid.com>
Message-ID: <20071025203104.GF3042@webber.adilger.int>

On Oct 19, 2007  12:57 -0400, Ed L Cashin wrote:
> For example, I can temporarily shut down the network interfaces that
> make an AoE target accessible (simulating, e.g., somebody accidentally
> unplugging a network switch). When the I/O fails, the filesystem is
> automatically mounted read-only, which is great.
> But if valuable data has been committed to the in-cache filesystem but
> not the on-disk filesystem, it would ideally be possible to remount
> the filesystem read-write once the device is online again (from
> running aoe-revalidate), so that the new data could be sync'ed out to
> disk.

No, there isn't any way to do this, because the filesystem has no way to
know which previous writes have succeeded and which have failed, so any
further writes from cache have a danger of corrupting the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
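[Editor's note: the remount questions above can be explored safely before touching anything, since on Linux the options currently in effect for every mount are visible in /proc/mounts without any privileges. A small sketch; the remount command itself is only a comment because it needs root, and "noatime" is just an example option:]

```shell
#!/bin/sh
# Show the options the root filesystem is currently mounted with.
# Field 2 of /proc/mounts is the mount point, field 4 the options.
awk '$2 == "/" {print "/ is mounted with: " $4}' /proc/mounts

# Changing a single option in place, without unmounting, would then be:
#   mount -o remount,noatime /      (as root; "noatime" is an example)
# Re-reading /proc/mounts afterwards shows whether the option took hold.
```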