From kevin at ucsd.edu Mon Jan 4 03:28:07 2010 From: kevin at ucsd.edu (Kevin Bowen) Date: Sun, 3 Jan 2010 19:28:07 -0800 Subject: ext3 resize failed, data loss Message-ID: <9c9db87d1001031928g470b9939v479d5af4c316e40@mail.gmail.com> I used parted to resize (shrink) an ext3 filesystem and associated partition, and it buggered my system. The operation completed apparently successfully, reporting no errors, but after reboot, the fs wouldn't mount, being marked as having errors, and e2fsck said "The filesystem size (according to the superblock) is xxx blocks The physical size of the device is xxx blocks Either the superblock or the partition table is likely to be corrupt!". So the fs still thought it was its original size (larger than its partition). At this point, the fs would actually mount without errors if I mounted it manually (ro), and all my data seemed intact; it just thought it had way more free space than it should have, and it couldn't complete an fsck (and was obviously not safe to use mounted rw lest it try to write to space it didn't actually own). Google turned up some accounts of people with the identical issue, and suggestions to fix it by writing a new superblock with e2fsck -S, then fscking - I did this, and it totally trashed my filesystem. The fs is now the right size and mounts fine, but everything just got dumped into lost+found. Is there any way I can fix this and get my data back? At least get it back to its previous state so I can mount it ro and copy my data off? Is my old superblock backed up somewhere, or does e2fsck update the backup superblock as well? Would my old superblock even help, or did the fsck trash my inode structure? Currently I think I have all my data, just dumped in lost+found without filenames - is there any way to salvage anything from that? And is this a known bug in ext2resize? In parted? -- Kevin Bowen kevin at ucsd.edu From pop3 at flachtaucher.de Wed Jan 6 11:00:58 2010 From: pop3 at flachtaucher.de (Martin Baum) Date: Wed, 06 Jan 2010 12:00:58 +0100 Subject: Optimizing dd images of ext3 partitions: Only copy blocks in use by fs Message-ID: <20100106120058.12202uj1udebzcmc@webmail.df.eu> Hello, for bare-metal recovery I need to create complete disk images of ext3 partitions of about 30 servers. I'm doing this by creating lvm2-snapshots and then dd'ing the snapshot-device to my backup media. (I am aware that backups created by this procedure are the equivalent of hitting the power switch at the time the snapshot was taken.) This works great and avoids a lot of seeks on highly utilized file systems. However it wastes a lot of space for disks with nearly empty filesystems. It would be a lot better if I could only read the blocks from raw disk that are really in use by ext3 (the rest could be sparse in the imagefile created). Is there a way to do this? I am aware that e2image -r dumps all metadata. Is there a tool that does not only dump metadata but also the data blocks? (maybe even in a way that avoids seeks by compiling a list of blocks first and then reading them in disk-order) If not: Is there a tool I can extend to do so / can you point me in the right direction? (I tried dumpfs, however it dumps inodes on a per-directory basis. Skimming through the source I did not see any optimization regarding seeks. So on highly populated filesystems dumpfs still is slower than full images with dd for me.)
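Roughly the kind of thing I have in mind is sketched below. It is completely untested and only a starting point: it assumes the "Block size:", "Block count:" and per-group "Free blocks:" lines that dumpe2fs prints (wrapped continuation lines of the free-block lists are ignored, so a few free blocks may get copied as well, which only costs some space), and all the function names are just mine. It reads the used ranges in ascending disk order, so seeking should stay reasonable:

#!/usr/bin/env python
# Untested sketch: copy only the blocks ext3 reports as in use from a
# quiescent (e.g. snapshot) device into a sparse image file.
import subprocess
import sys

def read_layout(device):
    """Return (block_size, block_count, sorted free (first, last) ranges)."""
    out = subprocess.Popen(["dumpe2fs", device],
                           stdout=subprocess.PIPE).communicate()[0]
    block_size = block_count = None
    free = []
    in_groups = False
    for line in out.decode("utf-8", "replace").splitlines():
        line = line.strip()
        if line.startswith("Group "):
            in_groups = True                 # per-group section begins here
        elif not in_groups and line.startswith("Block count:"):
            block_count = int(line.split(":", 1)[1])
        elif not in_groups and line.startswith("Block size:"):
            block_size = int(line.split(":", 1)[1])
        elif in_groups and line.startswith("Free blocks:"):
            for chunk in line.split(":", 1)[1].split(","):
                chunk = chunk.strip()
                if not chunk:
                    continue
                if "-" in chunk:
                    first, last = chunk.split("-")
                else:
                    first = last = chunk
                free.append((int(first), int(last)))
    if block_size is None or block_count is None:
        raise SystemExit("could not parse dumpe2fs output")
    free.sort()
    return block_size, block_count, free

def used_ranges(block_count, free):
    """Yield (first, last) block ranges that are not in the free list."""
    next_block = 0
    for first, last in free:
        if first > next_block:
            yield next_block, first - 1
        next_block = max(next_block, last + 1)
    if next_block < block_count:
        yield next_block, block_count - 1

def copy_used(device, image):
    block_size, block_count, free = read_layout(device)
    src = open(device, "rb")
    dst = open(image, "wb")
    for first, last in used_ranges(block_count, free):
        src.seek(first * block_size)
        dst.seek(first * block_size)         # skipped ranges stay as holes
        remaining = (last - first + 1) * block_size
        while remaining > 0:
            data = src.read(min(remaining, 1 << 20))
            if not data:
                break                        # unexpected end of device
            dst.write(data)
            remaining -= len(data)
    dst.truncate(block_count * block_size)   # pad image to the full fs size
    src.close()
    dst.close()

if __name__ == "__main__":
    copy_used(sys.argv[1], sys.argv[2])

Since the unused ranges are never written, the image file stays sparse and should also compress well; restoring would just be a plain dd of the image back onto a device of the same size (the holes read back as zeroes).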
Thanks a lot, Martin From adilger at sun.com Wed Jan 6 21:09:10 2010 From: adilger at sun.com (Andreas Dilger) Date: Wed, 06 Jan 2010 14:09:10 -0700 Subject: Optimizing dd images of ext3 partitions: Only copy blocks in use by fs In-Reply-To: <20100106120058.12202uj1udebzcmc@webmail.df.eu> References: <20100106120058.12202uj1udebzcmc@webmail.df.eu> Message-ID: <1F6683F6-966A-4AD6-932F-DC80AB36DDBA@sun.com> On 2010-01-06, at 04:00, Martin Baum wrote: > for bare-metal recovery I need to create complete disk images of > ext3 partitions of about 30 servers. I'm doing this by creating > lvm2-snapshots and then dd'ing the snapshot-device to my backup media. (I > am aware that backups created by this procedure are the equivalent > of hitting the power switch at the time the snapshot was taken.) > > This works great and avoids a lot of seeks on highly utilized file > systems. However it wastes a lot of space for disks with nearly > empty filesystems. > > It would be a lot better if I could only read the blocks from raw > disk that are really in use by ext3 (the rest could be sparse in the > imagefile created). Is there a way to do this? You can use "dump" which will read only the in-use blocks, but it doesn't create a full disk image. The other trick that I've used for similar situations is to write a file of all zeroes to the filesystem until it is full (e.g. dd if=/dev/zero of=/foo) and then the backup will be able to compress quite well. If the filesystem is in use, you should stop before the filesystem is completely full, and also unlink the file right after it is created, so in case of trouble the file will automatically be unlinked (even after a crash). > I am aware that e2image -r dumps all metadata. Is there a tool that > does not only dump metadata but also the data blocks? (maybe even in > a way that avoids seeks by compiling a list of blocks first and then > reading them in disk-order) If not: Is there a tool I can extend to > do so / can you point me in the right direction? > > (I tried dumpfs, however it dumps inodes on a per-directory basis. > Skimming through the source I did not see any optimization regarding > seeks. So on highly populated filesystems dumpfs still is slower > than full images with dd for me.) Optimizing dump to e.g. sort inodes might help the performance, if that isn't already done. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From mrubin at google.com Mon Jan 4 18:57:49 2010 From: mrubin at google.com (Michael Rubin) Date: Mon, 4 Jan 2010 10:57:49 -0800 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20100104162748.GA11932@think> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091224234631.GA1028@ioremap.net> <20091225161146.GC32757@thunk.org> <20100104162748.GA11932@think> Message-ID: <532480951001041057w3ad8d1dfy361ced0346ebaaa4@mail.gmail.com> Google is currently in the middle of upgrading from ext2 to a more up-to-date file system. We ended up choosing ext4. This thread touches upon many of the issues we wrestled with, so I thought it would be interesting to share. We should be sending out more details soon. The driving performance reason to upgrade is that while ext2 had been "good enough" for a very long time, the metadata arrangement on a stale file system was leading to what we call "read inflation". This is where we end up doing many seeks to read one block of data.
In general latency from poor block allocation was causing performance hiccups. We spent a lot of time with unix standard benchmarks (dbench, compile bench, et al) on xfs, ext4, jfs to try to see which one was going to perform the best. In the end we mostly ended up using the benchmarks to validate our assumptions and do functional testing. Larry is completely right IMHO. These benchmarks were instrumental in helping us understand how the file systems worked in controlled situations and gain confidence from our customers. For our workloads we saw ext4 and xfs as "close enough" in performance in the areas we cared about. The fact that we had a much smoother upgrade path with ext4 clinched the deal. The only upgrade option we have is online. ext4 is already moving the bottleneck away from the storage stack for some of our most intensive applications. It was not until we moved from benchmarks to customer workload that we were able to make detailed performance comparisons and find bugs in our implementation. "Iterate often" seems to be the winning strategy for SW dev. But when it involves rebooting a cloud of systems and making a one way conversion of their data it can get messy. That said I see benchmarks as tools to build confidence before running traffic on redundant live systems. mrubin PS for some reason "dbench" holds mythical power over many folks I have met. They just believe it's the most trusted and standard benchmark for file systems. In my experience it often acts as a random number generator. It has found some bugs in our code as it exercises the VFS layer very well. From david at fromorbit.com Tue Jan 5 00:41:17 2010 From: david at fromorbit.com (Dave Chinner) Date: Tue, 5 Jan 2010 11:41:17 +1100 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20100104162748.GA11932@think> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091224234631.GA1028@ioremap.net> <20091225161146.GC32757@thunk.org> <20100104162748.GA11932@think> Message-ID: <20100105004117.GP13802@discord.disaster> On Mon, Jan 04, 2010 at 11:27:48AM -0500, Chris Mason wrote: > On Fri, Dec 25, 2009 at 11:11:46AM -0500, tytso at mit.edu wrote: > > On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote: > > > > [1] http://samba.org/ftp/tridge/dbench/README > > > > > > Was not able to resist to write a small notice, what no matter what, but > > > whatever benchmark is running, it _does_ show system behaviour in one > > > or another condition. And when system behaves rather badly, it is quite > > > a common comment, that benchmark was useless. But it did show that > > > system has a problem, even if rarely triggered one :) > > > > If people are using benchmarks to improve file system, and a benchmark > > shows a problem, then trying to remedy the performance issue is a good > > thing to do, of course. Sometimes, though the case which is > > demonstrated by a poor benchmark is an extremely rare corner case that > > doesn't accurately reflect common real-life workloads --- and if > > addressing it results in a tradeoff which degrades much more common > > real-life situations, then that would be a bad thing. > > > > In situations where benchmarks are used competitively, it's rare that > > it's actually a *problem*. Instead it's much more common that a > > developer is trying to prove that their file system is *better* to > > gullible users who think that a single one-dimentional number is > > enough for them to chose file system X over file system Y. 
> > [ Look at all this email from my vacation...sorry for the delay ] > > It's important that people take benchmarks from filesystem developers > with a big grain of salt, which is one reason the boxacle.net results > are so nice. Steve more than willing to take patches and experiment to > improve a given FS results, but his business is a fair representation of > performance and it shows. Just looking at the results there, I notice that the RAID system XFS mailserver results dropped by an order of magnitude between 2.6.29-rc2 and 2.6.31. The single disk results are pretty much identical across the two kernels. IIRC, in 2.6.31 RAID0 started passing barriers through so I suspect this is the issue. However, seeing as dmesg is not collected by the scripts after the run and the output of the mounttab does not show default options, I cannot tell if this is the case. This might be worth checking by running XFS with the "nobarrier" mount option.... FWIW, is it possible to get these benchmarks run on each filesystem for each kernel release so ext/xfs/btrfs all get some regular basic performance regression test coverage? Cheers, Dave. -- Dave Chinner david at fromorbit.com From chris.mason at oracle.com Mon Jan 4 16:27:48 2010 From: chris.mason at oracle.com (Chris Mason) Date: Mon, 4 Jan 2010 11:27:48 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091225161146.GC32757@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091224234631.GA1028@ioremap.net> <20091225161146.GC32757@thunk.org> Message-ID: <20100104162748.GA11932@think> On Fri, Dec 25, 2009 at 11:11:46AM -0500, tytso at mit.edu wrote: > On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote: > > > [1] http://samba.org/ftp/tridge/dbench/README > > > > Was not able to resist to write a small notice, what no matter what, but > > whatever benchmark is running, it _does_ show system behaviour in one > > or another condition. And when system behaves rather badly, it is quite > > a common comment, that benchmark was useless. But it did show that > > system has a problem, even if rarely triggered one :) > > If people are using benchmarks to improve file system, and a benchmark > shows a problem, then trying to remedy the performance issue is a good > thing to do, of course. Sometimes, though the case which is > demonstrated by a poor benchmark is an extremely rare corner case that > doesn't accurately reflect common real-life workloads --- and if > addressing it results in a tradeoff which degrades much more common > real-life situations, then that would be a bad thing. > > In situations where benchmarks are used competitively, it's rare that > it's actually a *problem*. Instead it's much more common that a > developer is trying to prove that their file system is *better* to > gullible users who think that a single one-dimentional number is > enough for them to chose file system X over file system Y. [ Look at all this email from my vacation...sorry for the delay ] It's important that people take benchmarks from filesystem developers with a big grain of salt, which is one reason the boxacle.net results are so nice. Steve more than willing to take patches and experiment to improve a given FS results, but his business is a fair representation of performance and it shows. 
> > For example, if I wanted to play that game and tell people that ext4 > is better, I'd might pick this graph: > > http://btrfs.boxacle.net/repository/single-disk/2.6.29-rc2/2.6.29-rc2/2.6.29-rc2_Mail_server_simulation._num_threads=32.html > > On the other hand, this one shows ext4 as the worst compared to all > other file systems: > > http://btrfs.boxacle.net/repository/single-disk/2.6.29-rc2/2.6.29-rc2/2.6.29-rc2_Large_file_random_writes_odirect._num_threads=8.html > > Benchmarking, like statistics, can be extremely deceptive, and if > people do things like carefully order a tar file so the files are > optimal for a file system, it's fair to ask whether that's a common > thing for people to be doing (either unpacking tarballs or unpacking > tarballs whose files have been carefully ordered for a particular file > systems). I tend to use compilebench for testing the ability to create lots of small files, which puts the file names into FS native order (by unpacking and then readdiring the results) before it does any timings. I'd agree with Larry that benchmarking is most useful to test a theory. Here's a patch that is supposed to do xyz, is that actually true. With that said we should also be trying to write benchmarks that show the worst case...we know some of our design weakness and should be able to show numbers for how bad it really is (see the random write btrfs.boxacle.net tests for that one). -chris From slpratt at austin.ibm.com Tue Jan 5 15:31:00 2010 From: slpratt at austin.ibm.com (Steven Pratt) Date: Tue, 05 Jan 2010 09:31:00 -0600 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20100105004117.GP13802@discord.disaster> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091224234631.GA1028@ioremap.net> <20091225161146.GC32757@thunk.org> <20100104162748.GA11932@think> <20100105004117.GP13802@discord.disaster> Message-ID: <4B435B34.20003@austin.ibm.com> Dave Chinner wrote: > On Mon, Jan 04, 2010 at 11:27:48AM -0500, Chris Mason wrote: > >> On Fri, Dec 25, 2009 at 11:11:46AM -0500, tytso at mit.edu wrote: >> >>> On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote: >>> >>>>> [1] http://samba.org/ftp/tridge/dbench/README >>>>> >>>> Was not able to resist to write a small notice, what no matter what, but >>>> whatever benchmark is running, it _does_ show system behaviour in one >>>> or another condition. And when system behaves rather badly, it is quite >>>> a common comment, that benchmark was useless. But it did show that >>>> system has a problem, even if rarely triggered one :) >>>> >>> If people are using benchmarks to improve file system, and a benchmark >>> shows a problem, then trying to remedy the performance issue is a good >>> thing to do, of course. Sometimes, though the case which is >>> demonstrated by a poor benchmark is an extremely rare corner case that >>> doesn't accurately reflect common real-life workloads --- and if >>> addressing it results in a tradeoff which degrades much more common >>> real-life situations, then that would be a bad thing. >>> >>> In situations where benchmarks are used competitively, it's rare that >>> it's actually a *problem*. Instead it's much more common that a >>> developer is trying to prove that their file system is *better* to >>> gullible users who think that a single one-dimentional number is >>> enough for them to chose file system X over file system Y. 
>>> >> [ Look at all this email from my vacation...sorry for the delay ] >> >> It's important that people take benchmarks from filesystem developers >> with a big grain of salt, which is one reason the boxacle.net results >> are so nice. Steve more than willing to take patches and experiment to >> improve a given FS results, but his business is a fair representation of >> performance and it shows. >> > > Just looking at the results there, I notice that the RAID system XFS > mailserver results dropped by an order of magnitude between > 2.6.29-rc2 and 2.6.31. The single disk results are pretty > much identical across the two kernels. > > IIRC, in 2.6.31 RAID0 started passing barriers through so I suspect > this is the issue. However, seeing as dmesg is not collected by > the scripts after the run and the output of the mounttab does > not show default options, I cannot tell if this is the case. Well the dmesg collection is done by the actual benchmark run which occurs after the mount command is issued, so if you are looking for dmesg related to mounting the xfs volume, it should be in the dmesg we did collect. If dmesg actually formatted timestamps, this would be easier to see. It seems that nothing from xfs is ending up in dmesg since we are running xfs with different thread counts in order without reboot, so the dmesg for 16-thread xfs is run right after 1-thread xfs, but the dmesg shows ext3 as the last thing, so it's safe to say no output from xfs is ending up in dmesg at all. > This > might be worth checking by running XFS with the "nobarrier" mount > option.... > I could give that a try for you. > FWIW, is it possible to get these benchmarks run on each filesystem for > each kernel release so ext/xfs/btrfs all get some regular basic > performance regression test coverage? > Possible, yes. Just need to find the time to do the runs, and more importantly postprocess the data in some meaningful way. I'll see what I can do. Steve > Cheers, > > Dave. > From lakshmipathi.g at gmail.com Wed Jan 13 09:05:14 2010 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Wed, 13 Jan 2010 14:35:14 +0530 Subject: Fwd: ext4_inode: i_block[] doubt In-Reply-To: References: Message-ID: ~~~~~~~~~~~~~~~~ I checked for an ext4-users mailing list but was unable to find one, so I posted this question to ext4-beta-list at redhat.com - it seems like that mailing list is less active. So I'm posting it again to the ext3 list. ~~~~~~~~~~~~~~~~ I was accessing an ext4 file using the ext2fs lib (from e2fsprogs-1.41.9, Fedora 12), and while parsing the inode contents I got this output. Let me know whether my assumptions are correct. --------------- //code-part : print inode values ext2fs_read_inode(current_fs, d->d_ino, &inode); for (i = 0; i < 15; i++) printf("\ni_block[%d] :%u", i, inode.i_block[i]); --------------- In struct ext4_inode i_block[EXT4_N_BLOCKS], i_block[0] to i_block[2] denote the extent header and tell whether this inode uses an Htree for storing the file's data blocks. //output i_block[0] :324362 i_block[1] :4 i_block[2] :0 //remaining i_block[3] to i_block[14] hold four extents in the following format //{extent index, number of blocks in extent, starting block number} i_block[3] :0 // --> first extent denoted as 0 i_block[4] :1 // --> has single block i_block[5] :36890 // --> block number is 36890 i_block[6] :1 // --> second extent denoted as 1 i_block[7] :2 // --> has two blocks i_block[8] :36892 // --> it uses 36892 and 36893 i_block[9] :3 // --??--> third extent -- why is it numbered as 3 instead of 2? i_block[10] :2 // --> has two blocks i_block[11] :36898 // --> uses 36898 and 36899 i_block[12] :5 // --??--> fourth and final extent -- again why does it come as 5 instead of 3? i_block[13] :11 // --> it uses 11 blocks i_block[14] :38402 // --> starting block is 38402. If my assumptions are correct, the question is: why does i_block[9] show 3 instead of 2, and why does i_block[12] say 5 instead of 3? Thanks. -- ---- Cheers, Lakshmipathi.G www.giis.co.in From shadowbu at gmail.com Wed Jan 13 15:27:41 2010 From: shadowbu at gmail.com (George Butler) Date: Wed, 13 Jan 2010 09:27:41 -0600 Subject: ext3 partition size In-Reply-To: <8005FD36-E520-4E0E-B461-30A0C0F4DFCB@sun.com> References: <4B399C4A.2010109@gmail.com> <8005FD36-E520-4E0E-B461-30A0C0F4DFCB@sun.com> Message-ID: <4B4DE66D.1060707@gmail.com> Andreas.... thanks for the suggestion, I did a *resize2fs -d -p /dev/sdb8* on the partition and it is now showing the size and disk usage correctly. Thanks for your advice. george On 12/31/2009 03:39 PM, Andreas Dilger wrote: > On 2009-12-28, at 23:06, George Butler wrote: >> I am running fedora 11 with kernel 2.6.30.9-102.fc11.x86_64 #1 >> SMP Fri Dec 4 00:18:53 EST 2009 x86_64 x86_64 x86_64 GNU/Linux. I am >> noticing a partition on my drive is reporting incorrect size with >> "df", the partition is ext3 size 204GB with about 79GB actual usage, >> the "df" result show the partition size to be 111GB, 93GB is missing. >> Please advice on what can be done to see why the system is reporting >> incorrect partition size. >> >> mount: /dev/sdb8 on /srv/multimedia type ext3 (rw,relatime) >> >> $ df -hT >> Filesystem Type Size Used Avail Use% Mounted on >> /dev/sdb2 ext3 30G 1.1G 28G 4% / >> /dev/sdb7 ext3 20G 1.3G 18G 7% /var >> /dev/sdb6 ext3 30G 12G 17G 43% /usr >> /dev/sdb5 ext3 40G 25G 13G 67% /home >> /dev/sdb1 ext3 107M 52M 50M 52% /boot >> /dev/sdb8 ext3 111G 79G 27G 76% /srv/multimedia >> tmpfs tmpfs 2.9G 35M 2.9G 2% /dev/shm >> >> Parted info: >> >> (parted) select /dev/sdb >> Using /dev/sdb >> (parted) print >> Model: ATA ST3500630AS (scsi) >> Disk /dev/sdb: 500GB >> Sector size (logical/physical): 512B/512B >> Partition Table: msdos >> >> Number Start End Size Type File system Flags >> 1 32.3kB 115MB 115MB primary ext3 boot >> 2 115MB 32.3GB 32.2GB primary ext3 >> 3 32.3GB 35.5GB 3224MB primary linux-swap >> 4 35.5GB 500GB 465GB extended >> 5 35.5GB 78.5GB 43.0GB logical ext3 >> 6 78.5GB 111GB 32.2GB logical ext3 >> 7 111GB 132GB 21.5GB logical ext3 >> 8 132GB 352GB 220GB logical ext3 >> 9 352GB 492GB 140GB logical ext3 > > It definitely looks strange. Did you resize this partition after it > was created? In any case, running "resize2fs /dev/sdb8" should > enlarge the > filesystem to fill the partition. > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at warner.fm Wed Jan 20 17:30:35 2010 From: doug at warner.fm (Doug Warner) Date: Wed, 20 Jan 2010 12:30:35 -0500 Subject: Slow fsck on adaptec SAS/SATA raid Message-ID: <4B573DBB.1070301@warner.fm> I'm trying to do an fsck on an ext3 partition but I'm seeing abysmally slow disk throughput; monitoring with "dstat" (like vmstat) shows ~1200-1500KB/s throughput to the disks. Even with 24hrs of fsck-ing I only get ~3% (still in pass1).
The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and about 3.7TB on an x86_64-based system with 4GB RAM. e2fsprogs is 1.41.9. During initial periods of the fsck I see throughput in the 30-70MB/s range (ie, <0.4% complete). Shortly after that throughput tanks and stays there. Just a rough extrapolation of the size of my filesystem (3.4TB used; ~95%) makes it look like this will take ~28 days to complete. I'm using approx 12M inodes out of my 243M available. -Doug -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 190 bytes Desc: OpenPGP digital signature URL: From pg_ext3 at ext3.for.sabi.co.UK Wed Jan 20 22:59:05 2010 From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi) Date: Wed, 20 Jan 2010 22:59:05 +0000 Subject: Slow fsck on adaptec SAS/SATA raid In-Reply-To: <4B573DBB.1070301@warner.fm> References: <4B573DBB.1070301@warner.fm> Message-ID: <19287.35513.523189.476521@tree.ty.sabi.co.uk> >>> On Wed, 20 Jan 2010 12:30:35 -0500, Doug Warner >>> said: doug> I'm trying to do an fsck on an ext3 partition but I'm doug> seeing abysmally slow disk throughput; monitoring with doug> "dstat" (like vmstat) shows ~1200-1500KB/s throughput to doug> the disks. That seems pretty good to me. Perhaps the impression of slowness is motivated by insufficient understanding of how 'fsck' works and the IOP limitations of small rotating mass storage arrays. doug> Even with 24hrs of fsck-ing I only get ~3% (still in doug> pass1). That's pretty good too. doug> The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and doug> about 3.7TB on an x86_64-based system with 4GB RAM. doug> e2fsprogs is 1.41.9. There have been reports of a 1.5TB 'ext3' filesystem taking over a month: http://www.sabi.co.uk/blog/anno05-4th.html#051009 even if in optimal cases it can be better: http://www.sabi.co.uk/blog/0802feb.html#080210 doug> Just a rough extrapolation of the size of my filesystem doug> (3.4TB used; ~95%) makes it look like this will take ~28 doug> days to complete. I'm using approx 12M inodes out of my doug> 243M available. That sounds about right for lots of small files (~280KB average) in a very large number of directories (or in directories that are really very long), and which is 95% used. You have designed that filesystem that way, and it is performing as expected or better. People who use filesystems as databases deserve what they get. From sandeen at redhat.com Fri Jan 22 20:01:48 2010 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 22 Jan 2010 14:01:48 -0600 Subject: Slow fsck on adaptec SAS/SATA raid In-Reply-To: <4B573DBB.1070301@warner.fm> References: <4B573DBB.1070301@warner.fm> Message-ID: <4B5A042C.4020608@redhat.com> Doug Warner wrote: > I'm trying to do an fsck on an ext3 partition but I'm seeing abysmally slow > disk throughput; monitoring with "dstat" (like vmstat) shows ~1200-1500KB/s > throughput to the disks. Even with 24hrs of fsck-ing I only get ~3% (still in > pass1). > > The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and about 3.7TB on an > x86_64-based system with 4GB RAM. e2fsprogs is 1.41.9. > > During initial periods of the fsck I see throughput in the 30-70MB/s range > (ie, <0.4% complete). Shortly after that throughput tanks and stays there. > Just a rough extrapolation of the size of my filesystem (3.4TB used; ~95%) > makes it look like this will take ~28 days to complete. I'm using approx 12M > inodes out of my 243M available.
It'd be interesting to use blktrace and/or seekwatcher to see if you are seeking madly all over the disk, that would certainly clobber perf. -Eric > -Doug From dshaw at jabberwocky.com Fri Jan 22 21:01:21 2010 From: dshaw at jabberwocky.com (David Shaw) Date: Fri, 22 Jan 2010 16:01:21 -0500 Subject: Extended attributes being cleared by e2fsck? Message-ID: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com> Twice now I have rebooted a box and seen a hundred or so unexpected messages from e2fsck about extended attributes being cleared: disk: Extended attribute in inode 1437565 has a value size (0) which is invalid CLEARED. (for many different inodes) This filesystem has a lot of files with single-byte xattrs (it is user.test='x'). After the fsck, I looked at a few of the files that correspond to the inodes mentioned by e2fsck, and that xattr was missing. However, some other files were not touched by e2fsck and still had the single-digit xattr. The only other clue I have at the moment is that in at least one of the examples, the filesystem had just been resized (online) with resize2fs. Both boxes are Fedora 11 with kernel-2.6.30.9-102.fc11.i586 and e2fsprogs-1.41.4-12.fc11.i586 Any suggestions on where to investigate next? David From rwheeler at redhat.com Fri Jan 22 21:11:34 2010 From: rwheeler at redhat.com (Ric Wheeler) Date: Fri, 22 Jan 2010 16:11:34 -0500 Subject: Extended attributes being cleared by e2fsck? In-Reply-To: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com> References: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com> Message-ID: <4B5A1486.2020801@redhat.com> On 01/22/2010 04:01 PM, David Shaw wrote: > Twice now I have rebooted a box and seen a hundred or so unexpected messages from e2fsck about extended attributes being cleared: > > disk: Extended attribute in inode 1437565 has a value size (0) which is invalid > CLEARED. > > (for many different inodes) > > This filesystem has a lot of files with single-byte xattrs (it is user.test='x'). After the fsck, I looked at a few of the files that correspond to the inodes mentioned by e2fsck, and that xattr was missing. However, some other files were not touched by e2fsck and still had the single-digit xattr. > > The only other clue I have at the moment is that in at least one of the examples, the filesystem had just been resized (online) with resize2fs. > > Both boxes are Fedora 11 with kernel-2.6.30.9-102.fc11.i586 and e2fsprogs-1.41.4-12.fc11.i586 > > Any suggestions on where to investigate next? > > David > > > Hi David, Could you open a fedora bugzilla ticket and fill in as much information as you have - the above would be a great start... Thanks! Ric From tytso at mit.edu Fri Jan 22 21:11:20 2010 From: tytso at mit.edu (tytso at mit.edu) Date: Fri, 22 Jan 2010 16:11:20 -0500 Subject: Slow fsck on adaptec SAS/SATA raid In-Reply-To: <4B573DBB.1070301@warner.fm> References: <4B573DBB.1070301@warner.fm> Message-ID: <20100122211120.GG21263@thunk.org> On Wed, Jan 20, 2010 at 12:30:35PM -0500, Doug Warner wrote: > I'm trying to do an fsck on an ext3 partition but I'm seeing abysmally slow > disk throughput; monitoring with "dstat" (like vmstat) shows ~1200-1500KB/s > throughput to the disks. Even with 24hrs of fsck-ing I only get ~3% (still in > pass1). > > The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and about 3.7TB on an > x86_64-based system with 4GB RAM. e2fsprogs is 1.41.9. > > During initial periods of the fsck I see throughput in the 30-70MB/s range > (ie, <0.4% complete). 
Shortly after that throughput tanks and stays there. > Just a rough extrapolation of the size of my filesystem (3.4TB used; ~95%) > makes it look like this will take ~28 days to complete. I'm using approx 12M > inodes out of my 243M available. Are you using some kind of backup scheme that creates a huge number of hard links to files? E2fsck could be thrashing due to lack of memory space. - Ted From dshaw at jabberwocky.com Fri Jan 22 21:40:01 2010 From: dshaw at jabberwocky.com (David Shaw) Date: Fri, 22 Jan 2010 16:40:01 -0500 Subject: Extended attributes being cleared by e2fsck? In-Reply-To: <4B5A1486.2020801@redhat.com> References: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com> <4B5A1486.2020801@redhat.com> Message-ID: On Jan 22, 2010, at 4:11 PM, Ric Wheeler wrote: > On 01/22/2010 04:01 PM, David Shaw wrote: >> Twice now I have rebooted a box and seen a hundred or so unexpected messages from e2fsck about extended attributes being cleared: >> >> disk: Extended attribute in inode 1437565 has a value size (0) which is invalid >> CLEARED. >> >> (for many different inodes) >> >> This filesystem has a lot of files with single-byte xattrs (it is user.test='x'). After the fsck, I looked at a few of the files that correspond to the inodes mentioned by e2fsck, and that xattr was missing. However, some other files were not touched by e2fsck and still had the single-digit xattr. >> >> The only other clue I have at the moment is that in at least one of the examples, the filesystem had just been resized (online) with resize2fs. >> >> Both boxes are Fedora 11 with kernel-2.6.30.9-102.fc11.i586 and e2fsprogs-1.41.4-12.fc11.i586 >> >> Any suggestions on where to investigate next? >> >> David >> >> >> > > Hi David, > > Could you open a fedora bugzilla ticket and fill in as much information as you have - the above would be a great start... Done: https://bugzilla.redhat.com/show_bug.cgi?id=557959 David From jarmstrong at postpath.com Fri Jan 22 21:49:32 2010 From: jarmstrong at postpath.com (Joe Armstrong) Date: Fri, 22 Jan 2010 21:49:32 +0000 (GMT) Subject: unsubscribe Message-ID: <6EB2467CC553DE11AD6B00221957FF8E5D12AC@ppst1mlb01.ppst3.intra> -------------- next part -------------- An HTML attachment was scrubbed... URL: From samix_119 at yahoo.com Tue Jan 26 13:15:22 2010 From: samix_119 at yahoo.com (Muhammed Sameer) Date: Tue, 26 Jan 2010 05:15:22 -0800 (PST) Subject: Going readonly too frequently Message-ID: <326877.26414.qm@web112608.mail.gq1.yahoo.com> Hey, * Our server is going readonly at least 10 - 15 times a day with the below error Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_free_blocks_sb: bit already cleared for block 53771686 Jan 26 12:57:36 mailbox kernel: Aborting journal on device sdb1.
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_truncate: Journal has aborted Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_orphan_del: Journal has aborted Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted Jan 26 12:57:36 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data Jan 26 12:57:36 mailbox last message repeated 3 times Jan 26 12:57:36 mailbox kernel: ext3_abort called. Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal Jan 26 12:57:36 mailbox kernel: Remounting filesystem read-only Jan 26 12:58:04 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data Jan 26 12:58:05 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data * Our kernel is 2.6.18-182.el5 * Our OS is Red Hat Enterprise Linux Server release 5.2 (Tikanga) * We even tried fsck but the problem persists Regards, Muhammed Sameer From sandeen at redhat.com Tue Jan 26 15:48:53 2010 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 26 Jan 2010 09:48:53 -0600 Subject: Going readonly too frequently In-Reply-To: <326877.26414.qm@web112608.mail.gq1.yahoo.com> References: <326877.26414.qm@web112608.mail.gq1.yahoo.com> Message-ID: <4B5F0EE5.9060403@redhat.com> Muhammed Sameer wrote: > Hey, > > * Our server is going readonly atleast 10 - 15 times a day with the below error ouch - and even just once is "too frequently" :) > > > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_free_blocks_sb: bit already cleared for block 53771686 > Jan 26 12:57:36 mailbox kernel: Aborting journal on device sdb1. > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_truncate: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_orphan_del: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted > Jan 26 12:57:36 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data > Jan 26 12:57:36 mailbox last message repeated 3 times > Jan 26 12:57:36 mailbox kernel: ext3_abort called. > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal > Jan 26 12:57:36 mailbox kernel: Remounting filesystem read-only > Jan 26 12:58:04 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data > Jan 26 12:58:05 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data > > > * Our kernel is > 2.6.18-182.el5 > > * Our OS is > Red Hat Enterprise Linux Server release 5.2 (Tikanga) > > * We even tried fsck but the problem persists It would probably be best to open a support ticket with Red Hat for this one, since it's a Red Hat kernel. 
If you have any way to reproduce it that'd be very useful information - perhaps even a test using an e2image of the filesystem in question. Thanks, -Eric > Regards, > Muhammed Sameer > > > > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users From adilger at sun.com Tue Jan 26 23:09:32 2010 From: adilger at sun.com (Andreas Dilger) Date: Tue, 26 Jan 2010 16:09:32 -0700 Subject: Going readonly too frequently In-Reply-To: <4B5F0EE5.9060403@redhat.com> References: <326877.26414.qm@web112608.mail.gq1.yahoo.com> <4B5F0EE5.9060403@redhat.com> Message-ID: <20BC1566-10EB-4760-A50D-C13E1B9F3384@sun.com> On 2010-01-26, at 08:48, Eric Sandeen wrote: > Muhammed Sameer wrote: >> Hey, >> >> * Our server is going readonly atleast 10 - 15 times a day with the >> below error > > ouch - and even just once is "too frequently" :) >> >> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): >> ext3_free_blocks_sb: bit already cleared for block 53771686 I haven't seen any similar bug reports, and given that the RHEL5.2 kernel has been out for a long time would steer me toward thinking this is a hardware error (bad RAM or cable, though only with PATA or SCSI drives). You could check for this by converting the block numbers to hex, and looking for consistent bits being cleared. > It would probably be best to open a support ticket with Red Hat for > this one, > since it's a Red Hat kernel. > > If you have any way to reproduce it that'd be very useful > information - > perhaps even a test using an e2image of the filesystem in question. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From darkonc at gmail.com Wed Jan 27 07:25:23 2010 From: darkonc at gmail.com (Stephen Samuel (gmail)) Date: Tue, 26 Jan 2010 23:25:23 -0800 Subject: Going readonly too frequently In-Reply-To: <326877.26414.qm@web112608.mail.gq1.yahoo.com> References: <326877.26414.qm@web112608.mail.gq1.yahoo.com> Message-ID: <6cd50f9f1001262325weee5738v3fa3fb1f65b05988@mail.gmail.com> Clearly you have a corrupt filesystem. Of all these times that the filesystem has gone read-only, how many times have you done FSCKs? Have you done two FSCKs in a row? It's rare, but I've seen a second FSCK find stuff that was missed on the first run. A smart report will at least tell you if the drive has been suffering errors recently. Do you have a SMART report for the drive? On Tue, Jan 26, 2010 at 5:15 AM, Muhammed Sameer wrote: > Hey, > > * Our server is going readonly atleast 10 - 15 times a day with the below > error > > > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): > ext3_free_blocks_sb: bit already cleared for block 53771686 > Jan 26 12:57:36 mailbox kernel: Aborting journal on device sdb1. 
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in > ext3_reserve_inode_write: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in > ext3_truncate: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in > ext3_reserve_inode_write: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in > ext3_orphan_del: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in > ext3_reserve_inode_write: Journal has aborted > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in > ext3_delete_inode: Journal has aborted > Jan 26 12:57:36 mailbox kernel: __journal_remove_journal_head: freeing > b_committed_data > Jan 26 12:57:36 mailbox last message repeated 3 times > Jan 26 12:57:36 mailbox kernel: ext3_abort called. > Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): > ext3_journal_start_sb: Detected aborted journal > Jan 26 12:57:36 mailbox kernel: Remounting filesystem read-only > Jan 26 12:58:04 mailbox kernel: __journal_remove_journal_head: freeing > b_committed_data > Jan 26 12:58:05 mailbox kernel: __journal_remove_journal_head: freeing > b_committed_data > > > * Our kernel is > 2.6.18-182.el5 > > * Our OS is > Red Hat Enterprise Linux Server release 5.2 (Tikanga) > > * We even tried fsck but the problem persists > > > Regards, > Muhammed Sameer > > > > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Stephen Samuel http://www.bcgreen.com Software, like love, 778-861-7641 grows when you give it away -------------- next part -------------- An HTML attachment was scrubbed... URL: