From swapana_ghosh at yahoo.com Mon Oct 1 13:18:15 2007 From: swapana_ghosh at yahoo.com (Swapana Ghosh) Date: Mon, 1 Oct 2007 06:18:15 -0700 (PDT) Subject: ext3 file system becoming read only In-Reply-To: <46FC9E49.6090900@cesca.es> Message-ID: <230929.8139.qm@web58302.mail.re3.yahoo.com> Thanks Jordi, Yes, we are checking everything; only then will we proceed with updating the kernel. Thanks again --- Jordi Prats wrote: > Hi Swapana, > An update is always a good idea. On RHEL updates usually go smoothly, but > have you checked your FC switch for errors on each port? You could > also check your SAN controllers, or run some diagnostics to be sure it's > not a problem on your SAN. If your active controller reboots suddenly it > can cause IO errors that corrupt your journal. > > regards, > Jordi > > > > Swapana Ghosh wrote: > > Hi, > > > > As I explained in my first posting, the 'read-only' issue is not limited to > > one server; it is happening on a few servers which are generally 'oracle' > > database oriented. Very recently it happened to an 'oracle' application server. > > As a temporary measure, we are re-mounting the file system and also running fsck. > > > While searching the Red Hat knowledge base, I found the following URL; the > > problem described there is similar to our issue: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=213921 > > > > It says the problem is a kernel bug. > > > > Not sure whether we should move to a newer kernel version or not; > > please advise. > > > > Thanks > > > > > > --- tweeks wrote: > > > > > >> The EL4 kernel is wacky when it comes to the I/O scheduler locking up > >> and causing ext3 to remount RO. Various hardware hiccups can cause it to > >> go RO. > >> > >> And when it does.. you need to tread lightly or you could lose everything. > >> > >> If your ext3 filesystem had problems and remounted read-only, I would > >> strongly advise /against/ simply fscking it.
Often, when your filesystem has > >> gone RO, it may have been that way for 30 minutes or more. Just rebooting or > >> fscking is a great way to lose everything (i.e. everything being dumped > >> into /lost+found/). > >> > >> Instead, I would recommend: > >> 1) rebooting into a rescue CD environment (not allowing the rescue > >> environment to mount or fsck your filesystems). > >> 2) Nuke the ext3 journal: > >> tune2fs -O ^has_journal /dev/ > >> (possibly doing the same for other problem partitions) > >> 3) Do a fake fsck to see the extent of damage: > >> fsck -fn /dev/ > >> (after checking things out.. use "-fy" once you're sure that it's safe) > >> 4) Rebuild the journal with "tune2fs -j /dev/ > >> (rerun at least once until the "clean" result is repeatable) > >> 5) Mount and check things out: > >> "mkdir /mnt/tmp && mount -t ext3 /dev/ /mnt/tmp" > >> 6) Gracefully umount & reboot: > >> "umount /mnt/tmp && shutdown -rf now && exit" > >> > >> Tweeks > >> > >> On Tuesday 25 September 2007 11:47, Swapana Ghosh wrote: > >> > >>> Hi Jordi, > >>> > >>> Thanks for your reply. I will test the way you suggested. > >>> > >>> Thanks > >>> -swapna > >>> > >>> --- Jordi Prats wrote: > >>> > >>>> Hi, > >>>> It seems like what happened to me. I did this to solve the issue: > >>>> > >>>> Mark the filesystem as not having a journal (take it to ext2): > >>>> > >>>> tune2fs -O ^has_journal /dev/cciss/c0d0p2 > >>>> > >>>> fsck it to delete the journal: > >>>> > >>>> e2fsck /dev/cciss/c0d0p2 > >>>> > >>>> Create the journal (take it back to ext3): > >>>> > >>>> tune2fs -j /dev/cciss/c0d0p2 > >>>> > >>>> and finally, remount it. > >>>> > >>>> In my case it was with a local disk, but with your SAN disk it should be > >>>> the same.
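The drop/check/rebuild sequence that Tweeks and Jordi describe above can be collected into one sketch. This is illustrative only: the device path below is a placeholder (the device names in the original mails were truncated in the archive), and by default the script merely prints each command instead of executing it, since these steps must be run from a rescue environment against an unmounted filesystem.

```shell
#!/bin/sh
# Sketch of the journal-drop/check/rebuild sequence from the thread above.
# DEV is a placeholder; pass your real (unmounted!) partition.  By default
# each command is only printed; add --run as the second argument to execute.
DEV=${1:-/dev/sdXN}

run() {
    echo "+ $*"
    if [ "$CONFIRM" = yes ]; then "$@"; fi
}
[ "$2" = "--run" ] && CONFIRM=yes

run tune2fs -O ^has_journal "$DEV"   # 1) drop the (possibly corrupt) journal
run fsck -fn "$DEV"                  # 2) read-only pass to gauge the damage
run fsck -fy "$DEV"                  # 3) repair once the -n output looks sane
run tune2fs -j "$DEV"                # 4) recreate the journal (back to ext3)
```

The `--run` confirmation convention and the placeholder device name are inventions of this sketch, not tools from the thread; preview first, execute only once you are sure.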
> >>>> > >>>> Jordi > >>>> > >>>> Swapana Ghosh wrote: > >>>> > >>>>> Hi > >>>>> > >>>>> In our office environment a few servers, mostly database servers (and > >>>>> yesterday, for the first time, an application server), have had a > >>>>> partition become "read only". > >>>>> > >>>>> I was checking the archives and found what may be similar issues in the > >>>>> 2007-July archives. > >>>>> It would be really helpful if someone could describe how they were solved. > >>>>> > >>>>> In our case, just as the problem started, we found the following line in > >>>>> the log file: > >>>>> > >>>>> EXT3-fs error (device dm-12): ext3_find_entry: reading directory > >>>>> #2015496 offset 2 > >>>>> > >>>>> Then one blank line > >>>>> Then the lines > >>>>> > >>>>> Aborting journal on device dm-12. > >>>>> ext3_abort called > >>>>> > >>>>> EXT3-fs error (device dm-12): ext3_journal_start_sb: Detected > >>>>> aborted journal > >>>>> Remounting filesystem read-only > >>>>> > >>>>> Then the following line repeats continuously: > >>>>> > >>>>> EXT3-fs error (device dm-12) in start_transaction: Journal has > >>>>> aborted > >>>>> > >>>>> The above message repeats until we remount the filesystem and the > >>>>> partition becomes 'read-write'. > >>>>> > >>>>> We could not figure out the root cause. > >>>>> > >>>>> We are using individual EMC LUNs configured into LVM volume groups and > >>>>> then mounted on logical volumes.
> >>>>> > >>>>> Here I am giving the server description: > >>>>> > >>>>> ____________________________________________________________ > >>>>> > >>>>> [root at server ~]# lsmod |grep -i qla > >>>>> qla2300 130304 0 > >>>>> qla2xxx_conf 305924 0 > >>>>> qla2xxx 307448 21 qla2300 > >>>>> scsi_mod 117709 5 sg,emcp,qla2xxx,cciss,sd_mod > >>>>> > >>>>> ____________________________________________________________ > >>>>> [root at server ~]# cat /etc/modprobe.conf > >>>>> alias eth0 tg3 > >>>>> alias eth1 tg3 > >>>>> alias eth2 e1000 > >>>>> alias eth3 e1000 > >>>>> alias eth4 e1000 > >>>>> alias eth5 e1000 > >>>>> alias bond0 bonding > >>>>> alias scsi_hostadapter cciss > >>>>> options bond0 max_bonds=2 miimon=100 mode=1 > >>>>> alias scsi_hostadapter1 qla2xxx > >>>>> alias scsi_hostadapter2 qla2xxx_conf > >>>>> #alias scsi_hostadapter3 qla6312 > >>>>> options qla2xxx ql2xmaxqdepth=16 qlport_down_retry=64 > >>>>> ql2xloginretrycount=30 ql2xfailover=0 ql2xlbType=0 > === message truncated === From tango at tiac.net Tue Oct 2 19:38:47 2007 From: tango at tiac.net (Thomas Watt) Date: Tue, 2 Oct 2007 15:38:47 -0400 (GMT-04:00) Subject: How are alternate superblocks repaired? Message-ID: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> Hi Ted, Ok, I think I understand now. I was assuming the backup superblocks played a role without the intervention of e2fsck and were ready to be used in a standby mode when the primary superblock gets corrupted. But, of course, there is a very real reason to be cautious when the kernel may do things unknown to users. My point-of-view was more flavored by something like the Multics structure marking that kept backup data structures free from damage.
It is clear there is another strategy at work here, but one that is workable and sufficient for the ext2/ext3 filesystem. In case you are interested, here is a link to a web page on Structure Marking: http://www.multicians.org/thvv/marking.html I'm so happy you sent the tip on using e2label to correct my problem. I've attached my script, which I wrote more out of curiosity than anything else: ca18e1eb99c1279e0298db56f43b1ab1 genallsbs.sh Regards, -- Tom From: Theodore Tso To: Thomas Watt Cc: Andreas Dilger , ext3-users at redhat.com Subject: Re: How are alternate superblocks repaired? Date: Sep 29, 2007 9:01 AM On Sat, Sep 29, 2007 at 03:29:13AM -0400, Thomas Watt wrote: > The only field not updated was the Filesystem state field. So, all > of the backup superblocks remain "not clean" and are now at least a > lot closer to being consistent with the primary superblock - just > not quite there yet as far as being usable in case the primary > superblock gets hosed. That's by design. The backup superblocks always have the filesystem state set to "not clean". They are written out that way! Keep in mind that the kernel does *not* update the backup superblocks under normal operations. So by definition, fields such as the free blocks, free inodes, last mount time, and mount count are always going to be out of date in the backup superblocks. AND THAT'S OK. The whole point of the backup superblocks is to have an extra copy of the fundamental filesystem parameters --- the blocksize, the number of inodes per block group, the block group size, the location of the inode table and the allocation bitmaps, and so on. That doesn't change under normal circumstances except when the filesystem is resized, so that's why it's OK for the kernel to not bother to update them.
If the primary superblock is destroyed, e2fsck will use the backup superblocks to reconstruct the filesystem, and in the process of reconstructing the filesystem, it will update the free blocks, free inodes, and the other more transient portions of the filesystem. I'm not sure why you are so concerned about keeping every last field in the backup superblocks identical to that of the primary. There are lots of good reasons why they are not the same; the less they are modified, the less likely they are to get corrupted or otherwise messed up. (For example, besides sparing us a much longer umount operation, the fact that the kernel never writes the backup superblocks means that we don't have to worry about what happens if the in-memory copy of the superblocks is corrupted --- say because the system administrator was too cheap to use ECC memory --- even if they are written to the primaries, the backups will still be OK for e2fsck to use for recovery purposes.) - Ted From: Thomas Watt To: Theodore Tso Cc: Andreas Dilger , ext3-users at redhat.com Subject: Re: How are alternate superblocks repaired? Date: Sep 29, 2007 3:29 AM Hi Ted, I just wanted to give you some feedback on running the e2label command to fix the problem of backup superblock inconsistency with the primary superblock. Since Linux filesystem name labels are optional and my filesystem volume name was not set, I wondered if that would make a difference. It did not. I did not opt to set a label, but just followed your suggested command. The following fields were updated: Filesystem features, Free blocks, Free inodes, Last mount time, Last write time, Mount count, Last checked, Next check after. The only field not updated was the Filesystem state field. So, all of the backup superblocks remain "not clean" and are now at least a lot closer to being consistent with the primary superblock - just not quite there yet as far as being usable in case the primary superblock gets hosed.
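As an aside for anyone repeating Tom's comparison: the locations such a dump-all-superblocks script has to probe follow a fixed rule. With the sparse_super feature, superblock copies sit at the start of block groups 0, 1, and every power of 3, 5 and 7. The sketch below is my illustration of that calculation (it is not Tom's attachment genallsbs.sh, which the archive scrubbed):

```shell
#!/bin/sh
# Print the block numbers of every ext2/ext3 superblock copy on a
# sparse_super filesystem: block groups 0, 1, and powers of 3, 5 and 7.
# Usage: sbs <block_count> <blocks_per_group> <first_block>
#        (first_block is 1 on 1 KB block-size filesystems, 0 otherwise)
sbs() {
    count=$1 per_group=$2 first=$3
    groups=$(( (count + per_group - 1) / per_group ))   # number of block groups
    list="0 1"
    for base in 3 5 7; do
        p=$base
        while [ "$p" -lt "$groups" ]; do
            list="$list $p"
            p=$(( p * base ))
        done
    done
    for g in $(printf '%s\n' $list | sort -n); do
        echo $(( first + g * per_group ))
    done
}

# Example with a 1 KB block size and 8192 blocks per group: the second
# line printed is 8193, the familiar "e2fsck -b 8193" backup location.
sbs 20578300 8192 1 | head -6
```

The example geometry matches the 1 KB block-size filesystem dumped later in this archive; for any real filesystem, take the three numbers from `dumpe2fs -h` output.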
At this point I don't suppose there is any way for e2fsck to make the backup superblocks "clean" (i.e. only when the primary is clean) until your enhancement gets released. It was fairly easy to make this assessment using the script I wrote to dump all of the superblocks and compare the before and after superblock states. Checking the result was the easy part. I want to make a few changes, test them out and donate the script to the e2fsprogs project. It should make it just a little bit easier for system administrators to keep an eye on the backup superblocks, and you also might find it useful in testing your enhancement to e2fsck. The only caveat is that the script has not been tested on ext2/ext3 filesystems with block sizes of 1024 or 2048. There are provisions for 1024- and 2048-block-size systems - that's the speculative part of the script that needs testing - assumptions always need testing/challenging - right? :) I hope this feedback helps in your enhancement efforts to e2fsck. Regards, -- Tom From: Theodore Tso To: Thomas Watt Cc: Andreas Dilger , ext3-users at redhat.com Subject: Re: How are alternate superblocks repaired? Date: Sep 28, 2007 2:55 PM On Fri, Sep 28, 2007 at 01:18:16AM -0400, Thomas Watt wrote: > The Maximum mount count is 30, and I have no reason to believe that > e2fsck has ever been run against this particular FC3 ext filesystem. > I have every reason to believe, however, that fsck has been run on > occasion when I either boot the FC3 system manually and the mount > count is over 30 or when I experience the situation where the > ext_attr goes missing and I then manually boot the system when it is > not clean in the primary superblock. The system was created at the > end of March, 2005 and as you can see from the differences the > backup superblock(s) have never even been touched after their > creation. > > What parameters do you suggest be used when e2fsck is run to repair > the backup superblocks?
Hi Tom, There are a couple of things going on here. First of all, out of general paranoia, neither e2fsck nor the kernel touch the backup superblocks. Most of the changes that you pointed out between the primary and backup superblocks are no big deal, and can easily be regenerated by e2fsck. The one exception is the feature bitmasks. Most of the time it's only tune2fs which makes changes to the feature compatibility bitmasks. Unfortunately, the kernel does make some changes "behind the user's back"; and one of them is the ext_attr feature flag. So thanks for pointing that out, and I'll have to make an enhancement to e2fsck to detect if the backup superblock's compatibility flags are different, and if so, to update the backup superblocks. For now, you can work around this and force an update to the backup superblocks by running the following command as root: e2label /dev/hdXXX "`e2label /dev/hdXXX`" This reads out the label from the filesystem, and then sets the label to its current value. This will force a copy from the primary to the backup superblocks. Regards, - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: genallsbs.sh Type: application/x-shellscript Size: 14176 bytes Desc: not available URL: From nicdnicd at gmail.com Tue Oct 2 21:27:52 2007 From: nicdnicd at gmail.com (Nickel Cadmium) Date: Tue, 2 Oct 2007 23:27:52 +0200 Subject: Bad magic number in super-block Message-ID: <9ec348a90710021427r3c7b333el91685c17a277aacc@mail.gmail.com> Hi, After a power failure, I can't mount one of my partitions anymore. Here is what I get from fsck: -- fsck.ext3 /dev/sdb1 e2fsck 1.39 (29-May-2006) Couldn't find ext2 superblock, trying backup blocks... fsck.ext3: Bad magic number in super-block while trying to open /dev/sdb1 The superblock could not be read or does not describe a correct ext2 filesystem.
If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock: e2fsck -b 8193 -- I tried to give the suggested superblock as a parameter but I get the same error message, and with dumpe2fs and tune2fs as well. Since I can't get the backup-superblock positions with dumpe2fs, I used a block size of 1K and tried all the supposed-to-be backup superblocks, but it does not help. Is there anything I can try to mount the partition again? Cheers, NiCd -------------- next part -------------- An HTML attachment was scrubbed... URL: From tytso at mit.edu Tue Oct 2 21:59:11 2007 From: tytso at mit.edu (Theodore Tso) Date: Tue, 2 Oct 2007 17:59:11 -0400 Subject: How are alternate superblocks repaired? In-Reply-To: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> References: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> Message-ID: <20071002215911.GA6012@thunk.org> On Tue, Oct 02, 2007 at 03:38:47PM -0400, Thomas Watt wrote: > In case you are interested, here is link to a web page on Structure Marking: > http://www.multicians.org/thvv/marking.html I actually have used a Multics system way back when (I was actually logged into MIT Multics when it was finally shut down[1]). The com_err library and the ss library in e2fsprogs were largely inspired by Multics, and I do use structure magic numbers in memory to protect against programming errors, which is basically a very simple structure marking technique. I'm a bit dubious about how useful simple structure marking would be for modern Linux systems, since a large number of errors really are silent bit flips in the data, that wouldn't be detected simply by checking the expected structure ID at the beginning of the on-disk object. We are planning on adding checksums to metadata for ext4, which will help a lot in terms of detecting bad metadata.
Regards, ("You are protected from preemption" :-) [1] http://stuff.mit.edu/afs/sipb/project/eichin/sipbscan/ - Ted From tango at tiac.net Wed Oct 3 03:30:46 2007 From: tango at tiac.net (Thomas Watt) Date: Tue, 2 Oct 2007 23:30:46 -0400 (GMT-04:00) Subject: Bad magic number in super-block Message-ID: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> Hi Nickel Cadmium, First, try running the command (as root): fdisk -l That should confirm whether /dev/sdb1 is a valid filesystem partition and not a swap partition. Look for an ID of 83, which identifies valid filesystem partitions. A partition with an ID of 82 is usually swap and won't have a superblock. That said, if /dev/sdb1 is not a valid filesystem partition, then choose one with an ID of 83 that looks like it has the majority of the space. Then you should be able to use: dumpe2fs -h /dev/sdb2, for example, and see if you get any other errors or can then successfully mount the partition. Sometimes after a reboot, the fdisk -l command reports partitions not in partition table order and will assign different partition names than the ones you may normally see to the disk/partition of interest. -- Tom From nicdnicd at gmail.com Wed Oct 3 06:48:25 2007 From: nicdnicd at gmail.com (Nickel Cadmium) Date: Wed, 3 Oct 2007 08:48:25 +0200 Subject: Bad magic number in super-block In-Reply-To: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> References: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> Message-ID: <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> Hi! Tom, thanks a lot: you solved my problem! With fdisk -l I discovered that the partition I was trying to mount was a Windows partition. The weird thing is that /dev/sdb1 used to be a Linux partition.
Thinking of it again, I had to pull apart my computer after the crash and I probably shuffled the disks around (or could the renumbering / device reassignment occur even without a hardware change?). But in short, the partition I was looking for is now /dev/sdc1 and updating the partition table solved it all. Thanks & cheers, NiCd On 10/3/07, Thomas Watt wrote: > > Hi Nickel Cadmium, > > First, try running the command (as root): fdisk -l > > That should confirm whether /dev/sdb1 is a valid filesystem partition and > not a > swap partition. Look for an ID of 83 which identifies valid filesystem > partitions. A partition with ID of 82 is usually swap and won't have a > superblock. > > That said, if /dev/sdb1 is not a valid filesystem partition, then choose > one > with an ID of 83 that looks like it has the majority of space. Then > you > should be able to use: dumpe2fs -h /dev/sdb2, for example, and see if you > get > any other errors or can then successfully mount the partition. > > Sometimes after a reboot, the fdisk -l command reports partitions not in > partition table order and will assign different partition names than the > ones > you may normally see to the disk/partition of interest. > > -- Tom > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bryan at kadzban.is-a-geek.net Wed Oct 3 11:01:08 2007 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Wed, 03 Oct 2007 07:01:08 -0400 Subject: Bad magic number in super-block In-Reply-To: <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> References: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> Message-ID: <47037674.3050908@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Nickel Cadmium wrote: > (or could the renumbering / device reassignment occur even without > hardware change?)
For SCSI, yes, it could have changed (depending on your hardware setup). SCSI disk scanning happens in parallel, and has ever since kernel 2.6.18 or .19 or somewhere around there. I believe it still depends on your low-level SCSI driver though. In any case, the sdX device names are no longer necessarily stable. That's why udev now creates the /dev/disk/by-* trees of symlinks, whose names are supposed to be stable. (I'd recommend by-id myself, but it depends on how your disks are set up.) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHA3ZyS5vET1Wea5wRA+X+AKCbk7mtSA79wvZ0uQKHnTrgWTvTGQCdE7mh LML2ihueJgirORxFAvczVZA= =JZO3 -----END PGP SIGNATURE----- From tytso at mit.edu Wed Oct 3 14:52:18 2007 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Oct 2007 10:52:18 -0400 Subject: Bad magic number in super-block In-Reply-To: <47037674.3050908@kadzban.is-a-geek.net> References: <5486452.1191382247899.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> <9ec348a90710022348i496a03a4ib7c0296bad67f365@mail.gmail.com> <47037674.3050908@kadzban.is-a-geek.net> Message-ID: <20071003145218.GC23294@thunk.org> On Wed, Oct 03, 2007 at 07:01:08AM -0400, Bryan Kadzban wrote: > > In any case, the sdX device names are no longer necessarily stable. > That's why udev now creates the /dev/disk/by-* trees of symlinks, whose > names are supposed to be stable. (I'd recommend by-id myself, but it > depends on how your disks are set up.) The recommended way of dealing with this is to put something like this in your /etc/fstab: UUID=57299143-64a5-45f3-8c3d-9b68e38247bd / ext3 defaults,errors=remount-ro 0 1 or LABEL=root / ext3 defaults,errors=remount-ro 0 1 Mount and fsck will automatically find the appropriate device, and this will work even if udev changes in the future. This approach will also work on much older systems, including ones that are pre-udev (i.e., RHEL4, etc.)
Note that you can get yourself in trouble with either approach if you have multiple filesystems with the same label or UUID. With UUIDs, that shouldn't ever happen unless you provision systems via partition images or use dd to copy filesystems around. If you do this, a *really* good idea is to use the command: tune2fs -U random /dev/sdXX ... after you copy a filesystem image, and then use dumpe2fs -h to determine the new UUID. That way, each filesystem will have its own unique UUID. This is especially important if you have a large cluster of machines which access their root filesystem across a SAN network to some large enterprise storage array. It is a really, really good idea to keep each filesystem image separate with its own universally unique ID. - Ted From tango at tiac.net Wed Oct 3 17:16:18 2007 From: tango at tiac.net (Thomas Watt) Date: Wed, 3 Oct 2007 13:16:18 -0400 (GMT-04:00) Subject: How are alternate superblocks repaired? Message-ID: <2818383.1191431778901.JavaMail.root@mswamui-blood.atl.sa.earthlink.net> Hi Ted, That was pretty funny being "protected from preemption"! It turns out I did discover a bug in my script that I previously sent, and have fixed it. Only the 2048 filesystem blocksize still needs testing/verification. Sorry for the resend - it appears my mailer decided I needed to loosen the privileges to send the script. Here is the reworked script attached: 003a2b57b7d0c798b6d1044506634c3c genallsbs.sh Cheers, -- Tom -----Original Message----- >From: Theodore Tso >Sent: Oct 2, 2007 5:59 PM >To: Thomas Watt >Cc: Andreas Dilger , ext3-users at redhat.com >Subject: Re: How are alternate superblocks repaired? > >On Tue, Oct 02, 2007 at 03:38:47PM -0400, Thomas Watt wrote: >> In case you are interested, here is link to a web page on Structure Marking: >> http://www.multicians.org/thvv/marking.html > >I actually have used a Multics system way back when (I was actually >logged into MIT Multics when it was finally shutdown[1]).
The com_err >library and the ss library in e2fsprogs was largely inspired from >Multics, and I do use structure magic numbers in memory to protect >against programming errors, which is basically a very simple structure >marking technique. > >I'm a bit dubious about how useful simply structure matching would be >for modern Linux systems, since a large number of errors really are >silent bit flips in the data, that wouldn't be detected simply by >checking the expected structure ID at the beginning of the on-disk >object. We are planning on adding checksum to metadata for ext4, >which will help a lot in terms of detected bad metadata. > >Regards, ("You are protected from preemption" :-) > >[1] http://stuff.mit.edu/afs/sipb/project/eichin/sipbscan/ > > - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: genallsbs.sh Type: application/x-shellscript Size: 13942 bytes Desc: not available URL: From tytso at mit.edu Wed Oct 3 18:44:36 2007 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Oct 2007 14:44:36 -0400 Subject: How are alternate superblocks repaired? In-Reply-To: <20071002215911.GA6012@thunk.org> References: <24757522.1191353927349.JavaMail.root@mswamui-swiss.atl.sa.earthlink.net> <20071002215911.GA6012@thunk.org> Message-ID: <20071003184436.GD23294@thunk.org> On Tue, Oct 02, 2007 at 05:59:11PM -0400, Theodore Tso wrote: > I'm a bit dubious about how useful simply structure matching would be > for modern Linux systems, since a large number of errors really are sorry, I meant to say "filesystems", not "systems" above > silent bit flips in the data, that wouldn't be detected simply by > checking the expected structure ID at the beginning of the on-disk > object. We are planning on adding checksum to metadata for ext4, > which will help a lot in terms of detected bad metadata. 
- Ted From tango at tiac.net Thu Oct 4 04:15:26 2007 From: tango at tiac.net (Thomas Watt) Date: Thu, 4 Oct 2007 00:15:26 -0400 (GMT-04:00) Subject: How are alternate superblocks repaired? Message-ID: <11773462.1191471326851.JavaMail.root@mswamui-bichon.atl.sa.earthlink.net> Thanks. It turns out there was a way to fully test the script, which is attached: eb89e01bde14d4ca25c778bbb13fb5fa genallsbs.sh.bz2 Looking forward to the new and improved filesystems from you and your filesystem colleagues. Regards, -- Tom -----Original Message----- >From: Theodore Tso >Sent: Oct 3, 2007 2:44 PM >To: Thomas Watt >Cc: Andreas Dilger , ext3-users at redhat.com >Subject: Re: How are alternate superblocks repaired? > >On Tue, Oct 02, 2007 at 05:59:11PM -0400, Theodore Tso wrote: >> I'm a bit dubious about how useful simply structure matching would be >> for modern Linux systems, since a large number of errors really are > sorry, I meant to say "filesystems", not "systems" above >> silent bit flips in the data, that wouldn't be detected simply by >> checking the expected structure ID at the beginning of the on-disk >> object. We are planning on adding checksum to metadata for ext4, >> which will help a lot in terms of detected bad metadata. > > - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: genallsbs.sh.bz2 Type: application/x-bzip Size: 3713 bytes Desc: not available URL: From ross at biostat.ucsf.edu Sat Oct 6 07:10:48 2007 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Sat, 06 Oct 2007 00:10:48 -0700 Subject: Very slow directory traversal Message-ID: <1191654648.8679.109.camel@corn.betterworld.us> My last full backup of my Cyrus mail spool had 1,393,569 files and consumed about 4G after compression. It took over 13 hours. Some investigation led to the following test: time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/ That took 15 minutes the first time it ran, and 32 seconds when run immediately thereafter.
There were 355,746 files. This is typical of what I've been seeing: the initial run is slow; later runs are much faster. df shows /dev/evms/CyrusSpool 19285771 17650480 606376 97% /var/spool/cyrus mount shows /dev/evms/CyrusSpool on /var/spool/cyrus type ext3 (rw,noatime) The spool was active when I did the tests just described, but inactive during backup. It's on top of LVM as managed by EVMS in a Linux 2.6.18 kernel, Pentium 4 processor. It might be significant that Linux treats this as an SMP machine with 2 processors, since the single processor has hyperthreading. I'm using a stock Debian kernel, -686 variant. # time dd if=/dev/evms/CyrusSpool bs=4096 skip=16k count=256k of=/dev/null 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 26.4824 seconds, 40.5 MB/s The spool was mostly populated all at once from another system, and the file names are mostly numbers. Perhaps that creates some hashing trouble? Can anyone explain this, or, even better, give me a hint how I could improve this situation? I found some earlier posts on similar issues, although they mostly concerned apparently empty directories that took a long time. Theodore Tso had a comment that seemed to indicate that hashing conflicts with Unix requirements. I think the implication was that you could end up with linearized, or partly linearized, searches under some scenarios. Since this is a mail spool, I think it gets lots of sync()'s. I conducted pretty extensive tests before picking ext3 for this file system; it was fastest for my tests of writing messages into the spool. I think I tested the "nearly full disk" scenario, but I probably didn't test the scale of files I have now. Obviously my problem now is reading, not writing.
# dumpe2fs -h /dev/evms/CyrusSpool
dumpe2fs 1.40.2 (12-Jul-2007)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 44507cfa-39ce-46f1-9e3e-87091225395d
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super
Filesystem flags: signed directory hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 10289152   # ca. 10x the number of files.
Block count: 20578300
Reserved block count: 1028915
Free blocks: 1651151
Free inodes: 8860352
First block: 1
Block size: 1024
Fragment size: 1024
Reserved GDT blocks: 236
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 4096
Inode blocks per group: 512
Filesystem created: Mon Jan 1 11:32:49 2007
Last mount time: Thu Oct 4 09:42:00 2007
Last write time: Thu Oct 4 09:42:00 2007
Mount count: 2
Maximum mount count: 25
Last checked: Fri Sep 28 09:26:39 2007
Check interval: 15552000 (6 months)
Next check after: Wed Mar 26 09:26:39 2008
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: 9f50511e-2078-4476-96f4-c6f3415fda4f
Journal backup: inode blocks
Journal size: 32M
I believe I created it this way; in particular, I'm pretty sure I've had dir_index from the start. From alex at alex.org.uk Sat Oct 6 11:06:28 2007 From: alex at alex.org.uk (Alex Bligh) Date: Sat, 06 Oct 2007 12:06:28 +0100 Subject: Very slow directory traversal In-Reply-To: <1191654648.8679.109.camel@corn.betterworld.us> References: <1191654648.8679.109.camel@corn.betterworld.us> Message-ID: <1C0A85F326C4B5EC68C47D44@[192.168.100.25]> --On 06 October 2007 00:10 -0700 Ross Boylan wrote: > I believe I created it this way; in particular, I'm pretty sure I've had > dir_index from the start.
find /var/spool/cyrus -type d -exec lsattr -lad \{\} \;

and check the large directories are actually indexed

Alex

From ross at biostat.ucsf.edu Sat Oct 6 16:30:40 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Sat, 06 Oct 2007 09:30:40 -0700
Subject: Very slow directory traversal
In-Reply-To: <1C0A85F326C4B5EC68C47D44@[192.168.100.25]>
References: <1191654648.8679.109.camel@corn.betterworld.us> <1C0A85F326C4B5EC68C47D44@[192.168.100.25]>
Message-ID: <1191688240.8679.114.camel@corn.betterworld.us>

On Sat, 2007-10-06 at 12:06 +0100, Alex Bligh wrote:
>
> --On 06 October 2007 00:10 -0700 Ross Boylan wrote:
>
> > I believe I created it this way; in particular, I'm pretty sure I've had
> > dir_index from the start.
>
> find /var/spool/cyrus -type d -exec lsattr -lad \{\} \;
>
> and check the large directories are actually indexed
>
> Alex

All the large directories are indexed, but some smaller or empty ones seem not to be. Here's a line from the directory I reported on, and then one that doesn't show as indexed. The find took about 3 minutes to run.

/var/spool/cyrus/mail/r/user/ross/debian/user Indexed_directory
/var/spool/cyrus/mail/r/user/ross/debian/devel ---

During the find, as during my other operations that take a long time, vmstat shows around 40-45% of the CPU time in io wait. I'm not sure if the pseudo-dual CPUs are throwing that off, i.e., if that really means 80-90%.

From alex at alex.org.uk Sun Oct 7 08:58:36 2007
From: alex at alex.org.uk (Alex Bligh)
Date: Sun, 07 Oct 2007 09:58:36 +0100
Subject: Very slow directory traversal
In-Reply-To: <1191688240.8679.114.camel@corn.betterworld.us>
References: <1191654648.8679.109.camel@corn.betterworld.us> <1C0A85F326C4B5EC68C47D44@[192.168.100.25]> <1191688240.8679.114.camel@corn.betterworld.us>
Message-ID:

> All the large directories are indexed, but some smaller or empty ones
> seem not to be.
I think that's correct; it doesn't build the index tree until the directory reaches (from memory) a couple of blocks. I vaguely recall that one can still use readdir / telldir and end up with an O(n^2) result, but I forget how. You've reached the limit of my knowledge here.

Alex

From adilger at clusterfs.com Wed Oct 10 15:59:20 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 10 Oct 2007 09:59:20 -0600
Subject: Very slow directory traversal
In-Reply-To: <1191654648.8679.109.camel@corn.betterworld.us>
References: <1191654648.8679.109.camel@corn.betterworld.us>
Message-ID: <20071010155920.GV8122@schatzie.adilger.int>

On Oct 06, 2007 00:10 -0700, Ross Boylan wrote:
> My last full backup of my Cyrus mail spool had 1,393,569 files and
> consumed about 4G after compression. It took over 13 hours. Some
> investigation led to the following test:
> time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/

FYI - "tar cf /dev/null" actually skips reading any file data. The code special cases /dev/null and skips the read entirely.

> That took 15 minutes the first time it ran, and 32 seconds when run
> immediately thereafter. There were 355,746 files. This is typical of
> what I've been seeing: initial run is slow; later runs are much faster.

I'd expect this is because on the initial run the on-disk inode ordering causes a lot of seeks, and later runs come straight from memory. Probably not a lot you can do directly, but e.g. pre-reading the inode table would be a good start.

> I found some earlier posts on similar issues, although they mostly
> concerned apparently empty directories that took a long time. Theodore
> Tso had a comment that seemed to indicate that hashing conflicts with
> Unix requirements. I think the implication was that you could end up
> with linearized, or partly linearized searches under some scenarios.
> Since this is a mail spool, I think it gets lots of sync()'s.
There was an LD_PRELOAD library that Ted wrote that may also help:
http://marc.info/?l=mutt-dev&m=107226330912347&w=2

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From ross at biostat.ucsf.edu Thu Oct 11 06:37:19 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Wed, 10 Oct 2007 23:37:19 -0700
Subject: Very slow directory traversal
In-Reply-To: <20071010155920.GV8122@schatzie.adilger.int>
References: <1191654648.8679.109.camel@corn.betterworld.us> <20071010155920.GV8122@schatzie.adilger.int>
Message-ID: <1192084639.2075.75.camel@corn.betterworld.us>

On Wed, 2007-10-10 at 09:59 -0600, Andreas Dilger wrote:
> On Oct 06, 2007 00:10 -0700, Ross Boylan wrote:
> > My last full backup of my Cyrus mail spool had 1,393,569 files and
> > consumed about 4G after compression. It took over 13 hours. Some
> > investigation led to the following test:
> > time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/
>
> FYI - "tar cf /dev/null" actually skips reading any file data. The
> code special cases /dev/null and skips the read entirely.
>
> > That took 15 minutes the first time it ran, and 32 seconds when run
> > immediately thereafter. There were 355,746 files. This is typical of
> > what I've been seeing: initial run is slow; later runs are much faster.
>
> I'd expect this is because on the initial run the on-disk inode ordering
> causes a lot of seeks, and later runs come straight from memory. Probably
> not a lot you can do directly, but e.g. pre-reading the inode table would
> be a good start.

Judging from your comments and the thread you reference below, the problem is that the order returned from readdir is not inode order. But if tar, in this special case (/dev/null), doesn't actually read from the file, why should it be so slow? Does it do something (stat?) that makes it have to fetch the inode anyway?
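[Editor's note: yes — even when no file data is read, tar must stat() each entry to build its archive headers, and stat() in directory-hash order hops all over the inode table. The core trick in Ted's LD_PRELOAD library is to sort the readdir results by inode number first, so the subsequent inode reads walk the table mostly front-to-back. A minimal sketch of that sorting step, not the actual spd_readdir code:]

```python
import os

def scan_in_inode_order(path):
    """Return (name, inode) pairs for `path`, sorted by inode number.

    readdir() on an htree directory yields names in hash order, so
    stat()ing in that order seeks randomly across the inode table.
    Sorting by the directory entry's inode number first makes the
    subsequent stat()s a mostly forward sweep instead.
    """
    entries = [(e.name, e.inode()) for e in os.scandir(path)]
    entries.sort(key=lambda pair: pair[1])
    return entries

# A backup tool would then stat()/read the files in this order:
for name, ino in scan_in_inode_order("."):
    pass  # e.g. os.stat(name), then archive the file
```

The same idea can be applied from a wrapper script: list the directory, sort by inode, and feed the sorted list to the backup tool.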
> >
> > I found some earlier posts on similar issues, although they mostly
> > concerned apparently empty directories that took a long time. Theodore
> > Tso had a comment that seemed to indicate that hashing conflicts with
> > Unix requirements. I think the implication was that you could end up
> > with linearized, or partly linearized searches under some scenarios.
> > Since this is a mail spool, I think it gets lots of sync()'s.
>
> There was an LD_PRELOAD library that Ted wrote that may also help:
> http://marc.info/?l=mutt-dev&m=107226330912347&w=2

I got the code, but am not having much luck making it work. I've tried various things. The most recent is

cc -shared -fpic -o libsd_readdir.so spd_readdir.c   # as me
# rest as root
# export LD_LIBRARY_PATH=./
# export LD_PRELOAD=libsd_readdir.so
# ldconfig -v -n $(pwd)
/usr/local/src/kernel/ext3-patch:
libsd_readdir.so -> libsd_readdir.so
corn:/usr/local/src/kernel/ext3-patch# date; time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/pol/asdnet/
Wed Oct 10 23:16:44 PDT 2007
tar: Removing leading `/' from member names
Segmentation fault

I don't know how to make something for preload; can anyone give any hints?

Should the module I'm attempting to load have any effect on the 15 minute time noted above for tar to /dev/null, or is it only relevant if I am pulling data off the disk files?

Would there be any value in having some other program traverse the directories before I do the backup, or would cache limits likely mean the stuff from the start would be gone from the cache by the time I got to the end, so that the backup would basically be starting fresh?

Thanks.
Ross

From jae at platinumpsi.com Sat Oct 13 16:46:27 2007
From: jae at platinumpsi.com (J)
Date: Sat, 13 Oct 2007 11:46:27 -0500
Subject: Commercial file recovery for ext3?
Message-ID: <4710F663.8020806@platinumpsi.com>

A user inflicted a massive change on an EXT-3 data partition.* I'm looking for an application that can recover deleted files.
( The majority of the files are Excel. ) I don't particularly care what it names the files, and I don't expect a 100% success rate, even though I told everyone to go home right after I found out it had been done.

* Over a gig of files on a Samba server were moved into another directory by mistake (by Windows XP Media Center), and then subsequently moved back to their previous location... except when a dialog came up showing the files being processed one-by-one, it was canceled in a panic.

The timing wasn't good: the backup scripts had been failing quietly.

Looking for the latest options. Anyone have anything they've used?

Thanks!
--J

From keld at dkuug.dk Sat Oct 13 17:59:56 2007
From: keld at dkuug.dk (Keld =?iso-8859-1?Q?J=F8rn?= Simonsen)
Date: Sat, 13 Oct 2007 19:59:56 +0200
Subject: Commercial file recovery for ext3?
In-Reply-To: <4710F663.8020806@platinumpsi.com>
References: <4710F663.8020806@platinumpsi.com>
Message-ID: <20071013175956.GA28717@rap.rap.dk>

On Sat, Oct 13, 2007 at 11:46:27AM -0500, J wrote:
> A user inflicted a massive change on an EXT-3 data partition.* I'm
> looking for an application that can recover deleted files. ( The
> majority of the files are Excel. ) I don't particularly care what it
> names the files, and I don't expect a 100% success rate, even though I
> told everyone to go home right after I found out it had been done.
>
> * Over a gig of files on a Samba server were moved into another
> directory by mistake (by Windows XP Media Center), and then subsequently
> moved back to their previous location... except when a dialog came up
> showing the files being processed one-by-one, it was canceled in a panic.
>
> The timing wasn't good: the backup scripts had been failing quietly.
>
> Looking for the latest options. Anyone have anything they've used?

I have made some software available at
http://std.dkuug.dk/keld/readme-salvage.html

It is not perfect, but try it out.
best regards
keld

From ross at biostat.ucsf.edu Mon Oct 15 17:41:54 2007
From: ross at biostat.ucsf.edu (Ross Boylan)
Date: Mon, 15 Oct 2007 10:41:54 -0700
Subject: Very slow directory traversal
In-Reply-To: <1192084639.2075.75.camel@corn.betterworld.us>
References: <1191654648.8679.109.camel@corn.betterworld.us> <20071010155920.GV8122@schatzie.adilger.int> <1192084639.2075.75.camel@corn.betterworld.us>
Message-ID: <1192470114.8377.6.camel@corn.betterworld.us>

On Wed, 2007-10-10 at 23:37 -0700, Ross Boylan wrote:
> On Wed, 2007-10-10 at 09:59 -0600, Andreas Dilger wrote:
> > On Oct 06, 2007 00:10 -0700, Ross Boylan wrote:
> > > My last full backup of my Cyrus mail spool had 1,393,569 files and
> > > consumed about 4G after compression. It took over 13 hours. Some
> > > investigation led to the following test:
> > > time tar cf /dev/null /var/spool/cyrus/mail/r/user/ross/debian/user/
> >
> > FYI - "tar cf /dev/null" actually skips reading any file data. The
> > code special cases /dev/null and skips the read entirely.
> >
> > > That took 15 minutes the first time it ran, and 32 seconds when run
> > > immediately thereafter. There were 355,746 files. This is typical of
> > > what I've been seeing: initial run is slow; later runs are much faster.
> >
> > I'd expect this is because on the initial run the on-disk inode ordering
> > causes a lot of seeks, and later runs come straight from memory. Probably
> > not a lot you can do directly, but e.g. pre-reading the inode table would
> > be a good start.
>
> Judging from your comments and the thread you reference below, the
> problem is that the order returned from readdir is not inode order. But
> if tar, in this special case (/dev/null), doesn't actually read from the
> file, why should it be so slow? Does it do something (stat?) that makes
> it have to fetch the inode anyway?
> >
> > > I found some earlier posts on similar issues, although they mostly
> > > concerned apparently empty directories that took a long time. Theodore
> > > Tso had a comment that seemed to indicate that hashing conflicts with
> > > Unix requirements. I think the implication was that you could end up
> > > with linearized, or partly linearized searches under some scenarios.
> > > Since this is a mail spool, I think it gets lots of sync()'s.
> >
> > There was an LD_PRELOAD library that Ted wrote that may also help:
> > http://marc.info/?l=mutt-dev&m=107226330912347&w=2
>
> I got the code, but am not having much luck making it work. I've tried
> various things. The most recent is
> cc -shared -fpic -o libsd_readdir.so spd_readdir.c   # as me
> # rest as root
> # export LD_LIBRARY_PATH=./
> # export LD_PRELOAD=libsd_readdir.so
> # ldconfig -v -n $(pwd)
> /usr/local/src/kernel/ext3-patch:
> libsd_readdir.so -> libsd_readdir.so
> corn:/usr/local/src/kernel/ext3-patch# date; time tar
> cf /dev/null /var/spool/cyrus/mail/r/user/ross/pol/asdnet/
> Wed Oct 10 23:16:44 PDT 2007
> tar: Removing leading `/' from member names
> Segmentation fault

Even stranger, when I try the same thing with a little test program that calls readdir, it works. I tried running tar as myself, but got the same segfault (the first test I reported I ran as root). tar doesn't look as if it's setuid:

# ls -l /bin/tar
-rwxr-xr-x 1 root root 231188 2007-09-05 02:42 /bin/tar

> I don't know how to make something for preload; can anyone give any
> hints?
>
> Should the module I'm attempting to load have any effect on the 15
> minute time noted above for tar to /dev/null, or is it only relevant if
> I am pulling data off the disk files?
>
> Would there be any value in having some other program traverse the
> directories before I do the backup, or would cache limits likely mean
> the stuff from the start would be gone from the cache by the time I got
> to the end, so that the backup would basically be starting fresh?
>
> Thanks.
> Ross

From wesley at terpstra.ca Sun Oct 14 18:34:40 2007
From: wesley at terpstra.ca (Wesley W. Terpstra)
Date: Sun, 14 Oct 2007 20:34:40 +0200
Subject: Big extended attributes
Message-ID:

Good evening!

I've recently been running into a space limitation for extended attributes in ext3. I understand that earlier versions of ext3 stored these in the inode record. Is this still the case? Is there any way to allow for more space for extended attributes in an ext3 partition? I know that xfs has no limits on extended attributes, but I have several orthogonal reasons for sticking with ext3.

PS. Please CC me as I am not a member of this list.

From adilger at clusterfs.com Fri Oct 19 16:56:11 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 19 Oct 2007 10:56:11 -0600
Subject: Big extended attributes
In-Reply-To:
References:
Message-ID: <20071019165611.GF8122@schatzie.adilger.int>

On Oct 14, 2007 20:34 +0200, Wesley W. Terpstra wrote:
> I've recently been running into a space limitation for extended
> attributes in ext3. I understand that earlier versions of ext3 stored
> these in the inode record. Is this still the case?

Actually, it is the converse - only new (and specially formatted) filesystems with larger inodes will store the EAs in the inode, for improved performance. Otherwise there is a single fs block for all EAs on a file.

If you need a small amount of extra EA space (e.g. 128 or 384 bytes) and you control the environment, then formatting the filesystem with "mke2fs -j -I 512" can give you some more space, but not a huge amount.
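[Editor's note: to put rough numbers on the "-I 512" suggestion: the in-inode EA area is at most the inode size minus the classic 128-byte inode, less a small header whose exact size varies by kernel and is ignored in this sketch, so treat these figures as upper bounds:]

```python
GOOD_OLD_INODE_SIZE = 128  # the classic ext2/ext3 on-disk inode

def max_in_inode_ea_space(inode_size):
    """Upper bound on bytes available for in-inode EAs.

    The space past the first 128 bytes of a large inode holds the
    EAs; a small header (a few bytes, ignored here) reduces this
    slightly in practice.
    """
    if inode_size <= GOOD_OLD_INODE_SIZE:
        return 0  # old-style inodes: EAs live in a separate fs block
    return inode_size - GOOD_OLD_INODE_SIZE

# mke2fs -j -I 512 leaves at most this much room for in-inode EAs:
print(max_in_inode_ea_space(512))   # 384, matching the figure above
print(max_in_inode_ea_space(256))   # 128
```

Anything larger than this still spills to the single shared EA block, which is capped at one filesystem block.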
-I == total inode size; includes 128 bytes for inode; can be up to 4096 bytes

> Is there any way
> to allow for more space for extended attributes in an ext3 partition?

Not currently. We did some work to allow large EAs to be stored in a separate inode, but that doesn't help if you have lots of small EAs.

> I know that xfs has no limits on extended attributes, but I have
> several orthogonal reasons for sticking with ext3.

Hmm, I thought XFS had a 64kB EA limit? What is it you are trying to do? There are often better solutions than storing a lot of data in EAs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From dpc22 at cam.ac.uk Mon Oct 22 15:03:35 2007
From: dpc22 at cam.ac.uk (David Carter)
Date: Mon, 22 Oct 2007 16:03:35 +0100 (BST)
Subject: EXT3-fs error in htree_dirblock_to_tree
Message-ID:

Hello all,

Does anyone know if the following is likely to be a software problem or a hardware fault?

Oct 22 14:01:43 cyrus-26 kernel: EXT3-fs error (device md0): htree_dirblock_to_tree: bad entry in directory #360809233: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

A quick Google didn't tell me much, although a couple of people seem to have seen similar problems after hardware problems, and one person seemed to be able to trigger it using an "insane file system test":

http://www.mail-archive.com/linux-ext4 at vger.kernel.org/msg02515.html

The filesystem in question is a Cyrus mailstore: lots of write (and fsync) activity with small files. It was created with:

mkfs.ext3 -T news -m 1 -O dir_index -j -J size=256 /dev/md0

and is currently mounted data=ordered. Platform is SLES10. We haven't seen one of these before, but we are in the process of moving from reiser (which never did anything like this) to ext3/htree, so it would be useful to know if it is a known problem. Thanks.
--
David Carter                        Email: David.Carter at ucs.cam.ac.uk
University Computing Service,       Phone: (01223) 334502
New Museums Site, Pembroke Street,  Fax:   (01223) 334679
Cambridge UK. CB2 3QH.

From ecashin at coraid.com Fri Oct 19 16:57:15 2007
From: ecashin at coraid.com (Ed L Cashin)
Date: Fri, 19 Oct 2007 12:57:15 -0400
Subject: sync in-cache fs data after remount ro on error?
Message-ID: <87d4vbt53o.fsf@coraid.com>

Hi. If a block device stops working and then starts working later, does the sysadmin have a way to ask ext3 to sync the now read-only filesystem to disk?

For example, I can temporarily shut down the network interfaces that make an AoE target accessible (simulating, e.g., somebody accidentally unplugging a network switch). When the I/O fails, the filesystem is automatically mounted read-only, which is great. But if valuable data has been committed to the in-cache filesystem but not the on-disk filesystem, it would ideally be possible to remount the filesystem read-write once the device is online again (from running aoe-revalidate), so that the new data could be sync'ed out to disk.

The mount command won't remount the ext3 read-write.

ellijay:~# mount -o remount,rw /mnt/e7.1
mount: block device /dev/etherd/e7.1 is write-protected, mounting read-only

A kernel message says, "Abort forced by user", which looks like it is coming from fs/ext3/super.c,

	if (sbi->s_mount_opt & EXT3_MOUNT_ABORT)
		ext3_abort(sb, __FUNCTION__, "Abort forced by user");

Checking the e2fsprogs manpages, I don't see a way to ask ext3 to stop aborting a read-write mount. If all the uncommitted in-cache data is still marked as dirty, it seems like it might be possible to safely commit it now that the sysadmin knows the block device is OK. Is there a way to commit the dirty changes when the block device has stopped failing I/O?
--
Ed L Cashin

From rjcarr at gmail.com Tue Oct 23 22:30:12 2007
From: rjcarr at gmail.com (rjcarr)
Date: Tue, 23 Oct 2007 15:30:12 -0700 (PDT)
Subject: Solution to Corrupt >2TB Filesystem in MSDOS Partition Table
In-Reply-To: <45F8642B.5080908@berkeley.edu>
References: <45F571C3.9090303@berkeley.edu> <20070313070433.GL5266@schatzie.adilger.int> <45F8642B.5080908@berkeley.edu>
Message-ID: <13375087.post@talk.nabble.com>

Jon Forrest-2 wrote:
>
> Thanks to Ted and several others, I was
> able to recover 100% of the corrupted
> file system that I posted about last week.
> (This was an >2TB ext3 file system that had been
> created in a MSDOS partition which had worked
> until the server was rebooted, at which time
> it wouldn't mount and fsck wouldn't fix the
> problem.)

I just wanted to add that I had the same exact situation and this solution also worked for me. My only difference was that my filesystem was xfs (not ext3). Also, in this part:

> 3) I then used the parted "rescue" command
> to recreate the partition. I gave it the original
> starting point at the start value and "-1s" as
> the ending value.

I knew the exact end value from when I created the partition, so I used it instead of -1. Not sure if it would have worked had I used -1, but I thought my number safer.

--
View this message in context: http://www.nabble.com/How-To-Recover-From-Creating-%3E2TB-ext3-Filesystem-on-MSDOS-Partition-Table--tf3390167.html#a13375087
Sent from the Ext3 - User mailing list archive at Nabble.com.

From ameet.nanda at wipro.com Wed Oct 24 09:55:44 2007
From: ameet.nanda at wipro.com (Naxor)
Date: Wed, 24 Oct 2007 02:55:44 -0700 (PDT)
Subject: Problem with file system
Message-ID: <13382672.post@talk.nabble.com>

While I untar a large archive on xfs and ext3 (ver 1.3 and ver 1.4) file systems, on a ppc processor with kernel 2.6.21, I get an error. Also sometimes, on ext3 (1.3 and 1.4), the file system goes read-only while untarring.
The same tar file, when untarred on an i386 machine, works properly.

ERROR:
--------------
tar: Skipping to next header

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error exit delayed from previous errors
-------------------

Can anyone suggest some tools or methods to investigate the crash or proceed with the task?

--
View this message in context: http://www.nabble.com/Problem-with-file-system-tf4683372.html#a13382672
Sent from the Ext3 - User mailing list archive at Nabble.com.

From lists at nerdbynature.de Thu Oct 25 08:07:36 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 25 Oct 2007 10:07:36 +0200 (CEST)
Subject: Problem with file system
In-Reply-To: <13382672.post@talk.nabble.com>
References: <13382672.post@talk.nabble.com>
Message-ID: <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de>

On Wed, October 24, 2007 11:55, Naxor wrote:
> While I untar a large archive on xfs and ext3 (ver 1.3 and ver 1.4) file
> systems, on a ppc processor with kernel 2.6.21, I get an error. Also
> sometimes, on ext3 (1.3 and 1.4), the file system goes read-only while
> untarring.

can you please post the errors from your syslog, when this happens? Also, did you fsck.ext3 your filesystem lately?

Christian.
--
BOFH excuse #442: Trojan horse ran out of hay

From ameet.nanda at wipro.com Thu Oct 25 09:18:20 2007
From: ameet.nanda at wipro.com (Ameet Nanda)
Date: Thu, 25 Oct 2007 14:48:20 +0530
Subject: Problem with file system
In-Reply-To: <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de>
Message-ID: <1193303900.6108.11.camel@ameet>

Hi,

I tried to untar using the command tar -xvzmf.
The error I got after tar runs for some time was:

---------------------------------------
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error exit delayed from previous errors
----------------------------------------

On doing a fsck.ext3 I get the result as:

--------------------------------------------------------------
/dev/sda2: ********** WARNING: Filesystem still has errors **********

   15162 inodes used (0.50%)
      84 non-contiguous inodes (0.6%)
         # of inodes with ind/dind/tind blocks: 1370/52/0
  605645 blocks used (10.05%)
       0 bad blocks
       2 large files

   11832 regular files
     886 directories
       0 character device files
       0 block device files
       0 fifos
4294967294 links
    2436 symbolic links (2414 fast symbolic links)
       0 sockets
--------
   15150 files
---------------------------------------------------------------------------

- Ameet

On Thu, 2007-10-25 at 10:07 +0200, Christian Kujau wrote:
> On Wed, October 24, 2007 11:55, Naxor wrote:
> > While I untar a large archive on xfs and ext3 (ver 1.3 and ver 1.4) file
> > systems, on a ppc processor with kernel 2.6.21, I get an error. Also
> > sometimes, on ext3 (1.3 and 1.4), the file system goes read-only while
> > untarring.
>
> can you please post the errors from your syslog, when this happens? Also,
> did you fsck.ext3 your filesystem lately?
>
> Christian.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From lists at nerdbynature.de Thu Oct 25 10:05:44 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 25 Oct 2007 12:05:44 +0200 (CEST)
Subject: Problem with file system
In-Reply-To: <1193303900.6108.11.camel@ameet>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de> <1193303900.6108.11.camel@ameet>
Message-ID: <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de>

Ameet,

On Thu, October 25, 2007 11:18, Ameet Nanda wrote:
> The error I got after tar runs for some time was:

Please post the errors from your system log (usually /var/log/messages, /var/log/kern.log or the like).

> On doing a fsck.ext3 I get the result as:
> --------------------------------------------------------------
> /dev/sda2: ********** WARNING: Filesystem still has errors **********

Did you unmount /dev/sda2 before running fsck.ext3? Please do, and then post the *whole* output of the "fsck.ext3 -v" run, not just the results.

C.
--
BOFH excuse #442: Trojan horse ran out of hay

From ameet.nanda at wipro.com Thu Oct 25 11:35:28 2007
From: ameet.nanda at wipro.com (Ameet Nanda)
Date: Thu, 25 Oct 2007 17:05:28 +0530
Subject: Problem with file system
In-Reply-To: <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de> <1193303900.6108.11.camel@ameet> <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de>
Message-ID: <1193312128.6108.29.camel@ameet>

Hi Chris,

I unmounted /dev/sda2 and ran fsck.ext3. This was the complete o/p:

===========================
root at 172:/root> fsck.ext3 /dev/sda2 -v -n
e2fsck 1.40.2 (12-Jul-2007)
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 1505 has imagic flag set. Clear? no
Inode 1505, i_blocks is 2561936855, should be 0. Fix? no
Inode 15393 has compression flag set on filesystem without compression support. Clear? no
Deleted inode 164029 has zero dtime. Fix? no
Inode 1463073 is in use, but has dtime set. Fix? no
Inode 1463073 has imagic flag set. Clear? no
Inode 1463073 has compression flag set on filesystem without compression support. Clear? no
Inode 1463073 has INDEX_FL flag set but is not a directory. Clear HTree index? no
HTREE directory inode 1463073 has an invalid root node. Clear HTree index? no
Error reading block 4294967295 (Invalid argument). Ignore error? no
HTREE directory inode 1463073 has an invalid root node. Clear HTree index? no
HTREE directory inode 1463073 has an invalid root node. Clear HTree index? no
Inode 1463073, i_blocks is 4294967295, should be 0. Fix? no
Deleted inode 1685409 has zero dtime. Fix? no
Inode 1835553 is in use, but has dtime set. Fix? no
Inode 1835553 has illegal block(s). Clear? no
Illegal block #0 (310724603) in inode 1835553. IGNORED.
Illegal block #1 (837540054) in inode 1835553. IGNORED.
Illegal block #2 (3716133180) in inode 1835553. IGNORED.
Illegal block #3 (2359092648) in inode 1835553. IGNORED.
Illegal block #4 (155050197) in inode 1835553. IGNORED.
Illegal block #5 (2295681145) in inode 1835553. IGNORED.
HTREE directory inode 1835553 has an invalid root node. Clear HTree index? no
Error reading block 310724603 (Invalid argument). Ignore error? no
HTREE directory inode 1835553 has an invalid root node. Clear HTree index? no
HTREE directory inode 1835553 has an invalid root node. Clear HTree index? no
Inode 1835553 is a zero-length directory. Clear? no
Inode 1835553, i_size is 1155516870, should be 0. Fix? no
Inode 1835553, i_blocks is 2500161256, should be 0. Fix? no
Pass 2: Checking directory structure
Entry 'pdf_fontmgr_cidfonttypes.ps' in /SYSROM_SRC/mfp/PRF/rbdisk0/PostScript (1835102) has an incorrect filetype (was 1, should be 2). Fix? no
Directory inode 1835553 has an unallocated block #6. Allocate? no
Directory inode 1835553 has an unallocated block #7. Allocate? no
Directory inode 1835553 has an unallocated block #8. Allocate? no
Directory inode 1835553 has an unallocated block #9. Allocate? no
Directory inode 1835553 has an unallocated block #10. Allocate? no
Directory inode 1835553 has an unallocated block #11. Allocate? no
Pass 3: Checking directory connectivity
'..' in /SYSROM_SRC/mfp/PRF/rbdisk0/PostScript/pdf_fontmgr_cidfonttypes.ps (1835553) is (0), should be /SYSROM_SRC/mfp/PRF/rbdisk0/PostScript (1835102). Fix? no
Pass 4: Checking reference counts
Inode 1505 (...) is an illegal socket. Clear? no
Unattached inode 1505
Connect to /lost+found? no
Unattached zero-length inode 10209. Clear? no
Unattached inode 10209
Connect to /lost+found? no
Inode 15393 (...) has invalid mode (0177777). Clear? no
Unattached inode 15393
Connect to /lost+found? no
Inode 16673 (...) has invalid mode (0177777). Clear? no
Unattached inode 16673
Connect to /lost+found? no
Inode 32801 (...) has invalid mode (0177777). Clear? no
Unattached inode 32801
Connect to /lost+found? no
Inode 33313 (...) has invalid mode (0177777). Clear? no
Unattached inode 33313
Connect to /lost+found? no
Inode 49185 (...) has invalid mode (0177777). Clear? no
Unattached inode 49185
Connect to /lost+found? no
Inode 49697 (...) has invalid mode (0177777). Clear? no
Unattached inode 49697
Connect to /lost+found? no
Inode 65569 (...) has invalid mode (0177777). Clear? no
Unattached inode 65569
Connect to /lost+found? no
Inode 66081 (...) has invalid mode (0177777). Clear? no
Unattached inode 66081
Connect to /lost+found? no
Inode 1463073 (...) has invalid mode (00). Clear? no
Unattached inode 1463073
Connect to /lost+found? no
WARNING: PROGRAMMING BUG IN E2FSCK! OR SOME BONEHEAD (YOU) IS CHECKING A MOUNTED (LIVE) FILESYSTEM.
inode_link_info[1835553] is 44779, inode.i_links_count is 1. They should be the same!
Inode 1835553 ref count is 1, should be 1. Fix? no
Pass 5: Checking group summary information
Block bitmap differences: -(305650--305665) -359594 -(359611--360268) -(3701464--3701466)
Fix? no
Inode bitmap differences: +1505 +10209 +15393 +16673 +32801 +33313 +49185 +49697 +65569 +66081 -164029 +1463073
Fix? no
Directories count wrong for group #112 (17, counted=18). Fix? no

/dev/sda2: ********** WARNING: Filesystem still has errors **********

   15162 inodes used (0.50%)
      81 non-contiguous inodes (0.5%)
         # of inodes with ind/dind/tind blocks: 1370/52/0
  605645 blocks used (10.05%)
       0 bad blocks
       1 large file

   11831 regular files
     886 directories
       0 character device files
       0 block device files
       0 fifos
4294967294 links
    2436 symbolic links (2414 fast symbolic links)
       0 sockets
--------
   15150 files

Here is the log from tail /var/log/kern.log:
=============================================
Oct 25 17:16:53 172 kernel: [  267.117373] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117396] sda2: rw=0, want=13777058744, limit=48195000
Oct 25 17:16:53 172 kernel: [  267.117404] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117411] sda2: rw=0, want=16416658088, limit=48195000
Oct 25 17:16:53 172 kernel: [  267.117419] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117425] sda2: rw=0, want=15853339616, limit=48195000
Oct 25 17:16:53 172 kernel: [  267.117432] attempt to access beyond end of device
Oct 25 17:16:53 172 kernel: [  267.117439] sda2: rw=0, want=30048438328, limit=48195000

- Ameet

On Thu, 2007-10-25 at 12:05 +0200, Christian Kujau wrote:
> Ameet,
>
> On Thu, October 25, 2007 11:18, Ameet Nanda wrote:
> > The error I got after tar runs for some time was:
>
> Please post the errors from your system log (usually /var/log/messages,
> /var/log/kern.log or the like).
>
> > On doing a fsck.ext3 I get the result as:
> > --------------------------------------------------------------
> > /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> Did you unmount /dev/sda2 before running fsck.ext3? Please do, and then
> post the *whole* output of the "fsck.ext3 -v" run, not just the results.
>
> C.

From lists at nerdbynature.de Thu Oct 25 12:25:11 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 25 Oct 2007 14:25:11 +0200 (CEST)
Subject: Problem with file system
In-Reply-To: <1193312128.6108.29.camel@ameet>
References: <13382672.post@talk.nabble.com> <43600.62.180.231.196.1193299656.squirrel@www.housecafe.de> <1193303900.6108.11.camel@ameet> <42271.62.180.231.196.1193306744.squirrel@www.housecafe.de> <1193312128.6108.29.camel@ameet>
Message-ID: <43870.62.180.231.196.1193315111.squirrel@www.housecafe.de>

On Thu, October 25, 2007 13:35, Ameet Nanda wrote:
> I unmounted /dev/sda2 and ran fsck.ext3. This was the complete o/p

thanks for the log. Now the real gurus have something to work with :-)

> root at 172:/root> fsck.ext3 /dev/sda2 -v -n
> e2fsck 1.40.2 (12-Jul-2007)
> /dev/sda2 contains a file system with errors, check forced.

If the filesystem is corrupted, all kinds of things might happen to your .tar file. A good start would be to find out what could've caused the filesystem corruption in the first place. Did your box lose power and crash? Has the hardware been altered, new memory, new cables?

> Oct 25 17:16:53 172 kernel: [  267.117373] attempt to access beyond end
> of device
> Oct 25 17:16:53 172 kernel: [  267.117396] sda2: rw=0, want=13777058744,
> limit=48195000

Did someone/something alter the partition table? Can you do the following without getting errors in kern.log?

dd if=/dev/sda2 of=/dev/null bs=512

Christian.
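[Editor's note: the absurd values in that fsck/kern.log output are themselves diagnostic. Numbers like 4294967295 and 4294967294 are 0xFFFFFFFF and 0xFFFFFFFE, i.e. -1 and -2 read back as unsigned 32-bit counters, and the "want=" sectors lie far beyond the 48195000-sector partition, so the on-disk metadata contains garbage rather than plausible-but-wrong values. A quick sanity check one can script (the thresholds are taken directly from the log above):]

```python
# Interpret the suspicious numbers from the fsck/kern.log output.
# Values at or near 2**32 are almost always wrapped signed counters.

U32_MAX = 2**32 - 1

def looks_like_wrapped_counter(n, slack=16):
    """True if n is within `slack` of the unsigned 32-bit ceiling."""
    return U32_MAX - slack <= n <= U32_MAX

# From the e2fsck output: i_blocks and the link count
assert looks_like_wrapped_counter(4294967295)   # 0xFFFFFFFF == -1 as u32
assert looks_like_wrapped_counter(4294967294)   # 0xFFFFFFFE == -2 as u32

# From kern.log: requested sectors vs. the partition's real size
PARTITION_SECTORS = 48195000                    # "limit=" in the log
for want in (13777058744, 16416658088, 15853339616, 30048438328):
    assert want > PARTITION_SECTORS             # reads far past end of device
print("all suspicious values confirmed out of range")
```

Seeing -1/-2 patterns like this usually points at overwritten metadata (bad cable, memory, or a partition-table change) rather than a subtle logic bug.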
--
BOFH excuse #442:
Trojan horse ran out of hay

From h.m.holt at gmail.com  Wed Oct 31 01:00:50 2007
From: h.m.holt at gmail.com (Hans Holt)
Date: Wed, 31 Oct 2007 12:00:50 +1100
Subject: remounting ext3 file systems
Message-ID: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>

Hi,

I want to remount a mounted ext3 file system. Typically, the
"mount -o remount " option is used when an already mounted read-only
file system is remounted as read+write. Is it considered safe to
remount a file system that is already mounted read+write and has open
files in use? I want to change some mount options without killing the
processes accessing the file system, unmounting it, or restarting the
machine.

Thanks
Hans

From darkonc at gmail.com  Wed Oct 31 02:32:25 2007
From: darkonc at gmail.com (Stephen Samuel)
Date: Tue, 30 Oct 2007 19:32:25 -0700
Subject: remounting ext3 file systems
In-Reply-To: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>
References: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>
Message-ID: <6cd50f9f0710301932o7ea11815h7a46f48ee936d47@mail.gmail.com>

As long as you don't set any options that would interfere with what the
running processes are doing with the files on that filesystem, you
should be fine.

(For example: remounting the filesystem read-only while files were open
read-write would be problematic for the processes involved, and I don't
know what would happen if you remounted a filesystem nodev while people
had devices open on it.)

On 10/30/07, Hans Holt wrote:
>
> Hi,
>
> I want to remount a mounted ext3 file system. Typically, the
> "mount -o remount " option is used when an already mounted read-only
> file system is remounted as read+write. Is it considered safe to
> remount a file system that is already mounted read+write and has open
> files in use?
> I want to change some mount options without killing the processes
> accessing the file system, unmounting it, or restarting the machine.

--
Stephen Samuel  http://www.bcgreen.com  778-861-7641
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sandeen at redhat.com  Wed Oct 31 03:10:03 2007
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 30 Oct 2007 22:10:03 -0500
Subject: remounting ext3 file systems
In-Reply-To: <6cd50f9f0710301932o7ea11815h7a46f48ee936d47@mail.gmail.com>
References: <27057c670710301800x46b402adoc6f9da18c4baf8b5@mail.gmail.com>
	<6cd50f9f0710301932o7ea11815h7a46f48ee936d47@mail.gmail.com>
Message-ID: <4727F20B.2070907@redhat.com>

Stephen Samuel wrote:
> As long as you don't set any options that would interfere with what
> the running processes are doing with the files on that filesystem,
> you should be fine.
>
> (For example: remounting the filesystem read-only while files were
> open read-write would be problematic for the processes involved,

In this case mount -o ro will fail with -EBUSY.

> and I don't know what would happen if you remounted a filesystem
> nodev while people had devices open on it.)

This should reject new device openers.

-Eric

From adilger at sun.com  Thu Oct 25 20:31:20 2007
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 25 Oct 2007 20:31:20 -0000
Subject: sync in-cache fs data after remount ro on error?
In-Reply-To: <87d4vbt53o.fsf@coraid.com>
References: <87d4vbt53o.fsf@coraid.com>
Message-ID: <20071025203104.GF3042@webber.adilger.int>

On Oct 19, 2007  12:57 -0400, Ed L Cashin wrote:
> For example, I can temporarily shut down the network interfaces that
> make an AoE target accessible (simulating, e.g., somebody accidentally
> unplugging a network switch). When the I/O fails, the filesystem is
> automatically mounted read-only, which is great.
> But if valuable data has been committed to the in-cache filesystem but
> not the on-disk filesystem, it would ideally be possible to remount
> the filesystem read-write once the device is online again (from
> running aoe-revalidate), so that the new data could be sync'ed out to
> disk.

No, there isn't any way to do this, because the filesystem has no way to
know which previous writes have succeeded and which have failed, so any
further writes from cache have a danger of corrupting the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
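[Editor's note: the remount questions above can be explored safely before touching anything, since on Linux the options currently in effect for every mount are visible in /proc/mounts without any privileges. A small sketch; the remount command itself is only a comment because it needs root, and "noatime" is just an example option:]

```shell
#!/bin/sh
# Show the options the root filesystem is currently mounted with.
# Field 2 of /proc/mounts is the mount point, field 4 the options.
awk '$2 == "/" {print "/ is mounted with: " $4}' /proc/mounts

# Changing a single option in place, without unmounting, would then be:
#   mount -o remount,noatime /      (as root; "noatime" is an example)
# Re-reading /proc/mounts afterwards shows whether the option took hold.
```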