From mike.miller at hp.com  Fri Nov 2 21:54:17 2007
From: mike.miller at hp.com (Mike Miller)
Date: Fri, 2 Nov 2007 16:54:17 -0500
Subject: journal has aborted
Message-ID: <20071102215417.GA2231@roadking.cca.cpqcorp.net>

All,
We are encountering spurious errors with ext3. After some period of heavy IO
we may see messages similar to:

EXT3-fs error (device cciss/c0d0p5) in start_transaction: Journal has aborted

When this happens the filesystem is remounted read-only. If it's the root
filesystem the system becomes unresponsive and must be rebooted. An fsck on
the affected filesystem shows lots of corruption.

Any ideas on what we can do to help isolate this problem? We have 64 nodes
and the problem is random.

Thanks,
mikem

From sandeen at redhat.com  Sat Nov 3 02:00:13 2007
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 02 Nov 2007 21:00:13 -0500
Subject: journal has aborted
In-Reply-To: <20071102215417.GA2231@roadking.cca.cpqcorp.net>
References: <20071102215417.GA2231@roadking.cca.cpqcorp.net>
Message-ID: <472BD62D.3070705@redhat.com>

Mike Miller wrote:
> All,
> We are encountering spurious errors with ext3. After some period of heavy IO
> we may see messages similar to:
>
> EXT3-fs error (device cciss/c0d0p5) in start_transaction: Journal has
> aborted

You probably had relevant messages before that... what were they?

> When this happens the filesystem is remounted read-only. If it's the root
> filesystem the system becomes unresponsive and must be rebooted. An fsck on
> the affected filesystem shows lots of corruption.
> Any ideas on what we can do to help isolate this problem? We have 64 nodes
> and the problem is random.

Crazy question, but I have to ask - you don't have the same filesystem
mounted on all those nodes, do you?

What kernel is this?

-Eric

From jprats at cesca.es  Mon Nov 5 08:46:39 2007
From: jprats at cesca.es (Jordi Prats)
Date: Mon, 05 Nov 2007 09:46:39 +0100
Subject: journal has aborted
In-Reply-To: <472BD62D.3070705@redhat.com>
References: <20071102215417.GA2231@roadking.cca.cpqcorp.net> <472BD62D.3070705@redhat.com>
Message-ID: <472ED86F.4030306@cesca.es>

Hi,
This also happened to me using an HP Smart Array. Which model do you have?

What I did was this:

Mark the filesystem as not having a journal (turning it back into ext2):

tune2fs -O ^has_journal /dev/cciss/c0d0p2

fsck it to delete the journal:

e2fsck /dev/cciss/c0d0p2

Create the journal again (taking it back to ext3):

tune2fs -j /dev/cciss/c0d0p2

and finally, remount it. On a live system, just reboot it.

It has not happened again.

regards,
Jordi

Eric Sandeen wrote:
> Mike Miller wrote:
>
>> All,
>> We are encountering spurious errors with ext3. After some period of heavy IO
>> we may see messages similar to:
>>
>> EXT3-fs error (device cciss/c0d0p5) in start_transaction: Journal has
>> aborted
>>
>
> You probably had relevant messages before that... what were they?
>
>
>> When this happens the filesystem is remounted read-only. If it's the root
>> filesystem the system becomes unresponsive and must be rebooted. An fsck on
>> the affected filesystem shows lots of corruption.
>> Any ideas on what we can do to help isolate this problem? We have 64 nodes
>> and the problem is random.
>>
>
> Crazy question, but I have to ask - you don't have the same filesystem
> mounted on all those nodes, do you?
>
> What kernel is this?
>
> -Eric
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>
>

--
......................................................................
         __
        / /           Jordi Prats
  C E / S / C A       Dept. de Sistemes
     /_/              Centre de Supercomputació de Catalunya
                      Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
                      T. 93 205 6464 · F. 93 205 6979 · jprats at cesca.es
......................................................................

From worleys at gmail.com  Sat Nov 10 02:11:50 2007
From: worleys at gmail.com (Chris Worley)
Date: Fri, 9 Nov 2007 19:11:50 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
Message-ID:

How do you measure/gauge/assure proper alignment?

The physical disk has a block structure. What is it or how do you
find it? I'm guessing it's best to not partition disks in order to
assure that whatever its block read/write is isn't bisected by the
partition.

Then, mdadm has some block structure. The "-c" ("chunk") is in
"kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
what they're trying to do.

Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
size" in the man page (and I thought all your stripes added together
are a "stride"), as well as a block size.

It's important to make sure these all align properly, but their definitions
do.

Could somebody please clarify... with an example?

Thanks,

Chris

From worleys at gmail.com  Tue Nov 13 17:20:54 2007
From: worleys at gmail.com (Chris Worley)
Date: Tue, 13 Nov 2007 10:20:54 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To: <20071110061641.GK3966@webber.adilger.int>
References: <20071110061641.GK3966@webber.adilger.int>
Message-ID:

On Nov 9, 2007 11:16 PM, Andreas Dilger wrote:
> On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
> > How do you measure/gauge/assure proper alignment?
> >
> > The physical disk has a block structure. What is it or how do you
> > find it? I'm guessing it's best to not partition disks in order to
> > assure that whatever its block read/write is isn't bisected by the
> > partition.
>
> For Lustre we never partition the disks for exactly this reason, and if
> you are using LVM/md on the whole device it doesn't make sense either.
>
> > Then, mdadm has some block structure. The "-c" ("chunk") is in
> > "kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
> > what they're trying to do.
>
> That just means for RAID 0/5/6 that the amount of data or parity in a
> stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:
>
> disk0 disk1 disk2 disk3 disk4
> [64kB][64kB][64kB][64kB][64kB]
> [64kB][64kB]...
>
> > Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
> > size" in the man page (and I thought all your stripes added together
> > are a "stride"), as well as a block size.
>
> For ext2/3/4 the stride size (in kB) == the mdadm chunk size. Note that
> the ext2/3/4 stride size is in units of filesystem blocks, so if you have
> 4kB filesystem blocks (default for filesystems > 500MB) and a 64kB RAID5
> chunk size, this is 16:
>
> e2fsck -E stride=16 /dev/md0

So, if:

B = Ext block size
S = Ext stride size
C = MD chunk size

Then:

S = C/B

Is that correct?
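
(As a cross-check of that formula: a minimal sketch, assuming a hypothetical
/dev/md0, 4096-byte filesystem blocks, and the mdadm default 64 KiB chunk;
note that stride is normally passed at mkfs time as an extended option.)

CHUNK_KB=64                        # must match the value given to mdadm --chunk
BLOCK_KB=4                         # 4096-byte filesystem blocks
STRIDE=$((CHUNK_KB / BLOCK_KB))    # 64 / 4 = 16, matching the example quoted above
mkfs.ext3 -b 4096 -E stride=$STRIDE /dev/md0

(With a 1024 KiB chunk and 4 KiB blocks the same arithmetic gives stride=256.)
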
Ignorantly/randomly shopping around for values (using 1MB block sizes
and 16GB transfers in DD as the benchmark), I found performance
increased as I increased the MD chunk (testing just the MD device),
but above a chunk of 1024 the MD performance kept increasing while the
EXT fs got slower. Strangely, the EXT stride performed best set at
2048 (the above equation says 256 would have been correct):

mdadm --create /dev/md0 --level=0 --chunk=1024 --raid-devices 12 /dev/sd[b-m]
mkfs.ext2 -T largefile4 -b 4096 -E stride=2048 /dev/md0

So, it may be best put that "S", in the equation above, is some factor
of the stride value used.

Note that I am trying to optimize for big blocks and big files, with
little regard for data reliability.

I also found some strange performance differences using different
manufacturers' disks. I have a bunch of Maxtor 15K and Seagate 10K
SCSI disks. Streaming to a single drive serially, the Maxtor disks
are faster, but, in parallel, the Seagate drives are faster. I
measure this with something like:

for i in /dev/sd[e-r]
do
   /usr/bin/time -f "$i: %e" \
      dd bs=1024k count=16000 of=/dev/null if=$i 2>&1 \
      | grep -v records &
done
wait

This test doesn't truly emulate an MD device, as each disk is treated
independently; a given disk is allowed to get ahead of the rest... why
the Seagates outperform the Maxtors is unknown. They are evenly
distributed across the SCSI channels (as many Seagates on a channel as
Maxtors). I'm guessing the Seagate disks have deeper buffers.

I remember a few years ago increasing the number of outstanding
scatter/gather requests helped increase the performance of Qlogic FC
drivers... is there any such driver or kernel tweak these days?

I'd still like to know what the disks use for a block size.

Thanks,

Chris

P.S. Andreas: Hope you're having fun at SC07... I don't get to go :(

>
> > It's important to make sure these all align properly, but their definitions
> > do.
>
> ... do not?
>
> > Could somebody please clarify... with an example?
>
> Yes, I constantly wish the terminology were constant between different tools,
> but sadly there isn't any "proper" terminology out there as far as I've been
> able to see.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

From anirban.adhikary at gmail.com  Thu Nov 15 06:44:15 2007
From: anirban.adhikary at gmail.com (Anirban Adhikary)
Date: Thu, 15 Nov 2007 12:14:15 +0530
Subject: Linux File systems Performance Tuning
Message-ID: <71c73b070711142244y737a041cyb6b7a61551d0c4e5@mail.gmail.com>

Dear List,
I want to do some performance tuning jobs on an ext3 filesystem. So,
regarding this, what are the parameters I need to check or what are the
things I need to follow?
Thanks & Regards
Anirban Adhikary.

From lists at nerdbynature.de  Thu Nov 15 12:01:42 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 15 Nov 2007 13:01:42 +0100 (CET)
Subject: Linux File systems Performance Tuning
In-Reply-To: <71c73b070711142244y737a041cyb6b7a61551d0c4e5@mail.gmail.com>
References: <71c73b070711142244y737a041cyb6b7a61551d0c4e5@mail.gmail.com>
Message-ID: <43158.62.180.231.196.1195128102.squirrel@housecafe.dyndns.org>

On Thu, November 15, 2007 07:44, Anirban Adhikary wrote:
> I want to do some performance tuning jobs on an ext3 filesystem. So,
> regarding this, what are the parameters I need to check or what are the
> things I need to follow?

Well, there's http://tinyurl.com/2nue5f

The man pages for 'mkfs.ext3' and 'mount' also mention some tunables.
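
(For illustration only, a minimal sketch of the kinds of tunables those man
pages cover; /dev/sda5 is a placeholder and the values are examples rather
than recommendations.)

# mkfs-time choices: block size, inode-density profile, RAID stride
mkfs.ext3 -T largefile4 -b 4096 -E stride=16 /dev/sda5

# mount-time choices: skip atime updates, relax data/journal ordering, and
# lengthen the commit interval (both trade some crash safety for speed)
mount -o noatime,data=writeback,commit=30 /dev/sda5 /mnt/data

# on a data-only filesystem, shrink the blocks reserved for root from 5% to 1%
tune2fs -m 1 /dev/sda5
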
But first you need to find out what you want to tune your fs for. Lots of
small files in one directory? Lots of directories? Lots of writes? Reads?
And don't forget to measure performance with the application you intend to
run. Benchmark programs like bonnie++ and stuff might help, but you're
probably only interested in how your application will perform.

Christian.
--
BOFH excuse #442:

Trojan horse ran out of hay

From adilger at sun.com  Sat Nov 10 06:16:41 2007
From: adilger at sun.com (Andreas Dilger)
Date: Fri, 9 Nov 2007 23:16:41 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To:
References:
Message-ID: <20071110061641.GK3966@webber.adilger.int>

On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
> How do you measure/gauge/assure proper alignment?
>
> The physical disk has a block structure. What is it or how do you
> find it? I'm guessing it's best to not partition disks in order to
> assure that whatever its block read/write is isn't bisected by the
> partition.

For Lustre we never partition the disks for exactly this reason, and if
you are using LVM/md on the whole device it doesn't make sense either.

> Then, mdadm has some block structure. The "-c" ("chunk") is in
> "kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
> what they're trying to do.

That just means for RAID 0/5/6 that the amount of data or parity in a
stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:

disk0 disk1 disk2 disk3 disk4
[64kB][64kB][64kB][64kB][64kB]
[64kB][64kB]...

> Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
> size" in the man page (and I thought all your stripes added together
> are a "stride"), as well as a block size.

For ext2/3/4 the stride size (in kB) == the mdadm chunk size. Note that
the ext2/3/4 stride size is in units of filesystem blocks, so if you have
4kB filesystem blocks (default for filesystems > 500MB) and a 64kB RAID5
chunk size, this is 16:

e2fsck -E stride=16 /dev/md0

> It's important to make sure these all align properly, but their definitions
> do.

... do not?

> Could somebody please clarify... with an example?

Yes, I constantly wish the terminology were constant between different tools,
but sadly there isn't any "proper" terminology out there as far as I've been
able to see.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From jpiszcz at lucidpixels.com  Thu Nov 15 13:42:49 2007
From: jpiszcz at lucidpixels.com (Justin Piszcz)
Date: Thu, 15 Nov 2007 08:42:49 -0500 (EST)
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To: <20071110061641.GK3966@webber.adilger.int>
References: <20071110061641.GK3966@webber.adilger.int>
Message-ID:

On Fri, 9 Nov 2007, Andreas Dilger wrote:

> On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
>> How do you measure/gauge/assure proper alignment?
>>
>> The physical disk has a block structure. What is it or how do you
>> find it? I'm guessing it's best to not partition disks in order to
>> assure that whatever its block read/write is isn't bisected by the
>> partition.
>
> For Lustre we never partition the disks for exactly this reason, and if
> you are using LVM/md on the whole device it doesn't make sense either.
>
>> Then, mdadm has some block structure. The "-c" ("chunk") is in
>> "kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
>> what they're trying to do.
>
> That just means for RAID 0/5/6 that the amount of data or parity in a
> stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:
>
> disk0 disk1 disk2 disk3 disk4
> [64kB][64kB][64kB][64kB][64kB]
> [64kB][64kB]...
>
>> Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
>> size" in the man page (and I thought all your stripes added together
>> are a "stride"), as well as a block size.
>
> For ext2/3/4 the stride size (in kB) == the mdadm chunk size. Note that
> the ext2/3/4 stride size is in units of filesystem blocks, so if you have
> 4kB filesystem blocks (default for filesystems > 500MB) and a 64kB RAID5
> chunk size, this is 16:
>
> e2fsck -E stride=16 /dev/md0
>
>> It's important to make sure these all align properly, but their definitions
>> do.
>
> ... do not?
>
>> Could somebody please clarify... with an example?
>
> Yes, I constantly wish the terminology were constant between different tools,
> but sadly there isn't any "proper" terminology out there as far as I've been
> able to see.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

Quick question Andreas, if you do not provide a -E stride=16 on a RAID5
filesystem, how much worse does the performance become on say a 2.0 or
5.0TB ext3 filesystem?

Justin.

From adilger at sun.com  Thu Nov 15 18:06:24 2007
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 15 Nov 2007 11:06:24 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To:
References: <20071110061641.GK3966@webber.adilger.int>
Message-ID: <20071115180624.GN3966@webber.adilger.int>

On Nov 15, 2007 08:42 -0500, Justin Piszcz wrote:
> Quick question Andreas, if you do not provide a -E stride=16 on a RAID5
> filesystem, how much worse does the performance become on say a 2.0 or
> 5.0TB ext3 filesystem?

Sorry, I don't have any numbers on that. It really depends on the
back-end RAID hardware and the IO load. If it has a write cache it
might not be any significant overhead.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From sundevil007 at gmail.com  Fri Nov 16 19:21:45 2007
From: sundevil007 at gmail.com (ViVu)
Date: Fri, 16 Nov 2007 11:21:45 -0800 (PST)
Subject: File System Traces
Message-ID: <13799180.post@talk.nabble.com>

Hello All,

I'm trying to collect the following information about an application at the
file system layer:

Type of request - Read/Write
Sector Number to which the request is directed
Time of request

Can anyone please let me know what changes should I make in which modules to
extract this information? Thanks a lot!!

Rgds
SunDevil
--
View this message in context: http://www.nabble.com/File-System-Traces-tf4823170.html#a13799180
Sent from the Ext3 - User mailing list archive at Nabble.com.

From skyfalcon866 at gmail.com  Mon Nov 19 23:45:44 2007
From: skyfalcon866 at gmail.com (skyhawk)
Date: Mon, 19 Nov 2007 15:45:44 -0800 (PST)
Subject: fsck
Message-ID: <13848110.post@talk.nabble.com>

Why does fsck take 10 minutes to finish on my 250GB hdd? JFS fsck takes
1 minute.
--
View this message in context: http://www.nabble.com/fsck-tf4840279.html#a13848110
Sent from the Ext3 - User mailing list archive at Nabble.com.
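
(A side note on the fsck question above, as a sketch rather than a definitive
answer: when e2fsck does run a full check it walks all of the filesystem
metadata, whereas JFS's fsck mostly replays its log, and ext3 filesystems are
also forced through a full check every N mounts or days; tune2fs can show and
adjust those triggers. /dev/sdXN is a placeholder for the actual partition.)

tune2fs -l /dev/sdXN | grep -i -e 'mount count' -e 'check'   # show the forced-check triggers
tune2fs -c 50 -i 180d /dev/sdXN                              # example: full check every 50 mounts or 180 days
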
From sandeen at redhat.com  Mon Nov 26 15:51:13 2007
From: sandeen at redhat.com (Eric Sandeen)
Date: Mon, 26 Nov 2007 09:51:13 -0600
Subject: File System Traces
In-Reply-To: <13799180.post@talk.nabble.com>
References: <13799180.post@talk.nabble.com>
Message-ID: <474AEB71.2040903@redhat.com>

ViVu wrote:
> Hello All,
>
> I'm trying to collect the following information about an application at the
> file system layer:
>
> Type of request - Read/Write
> Sector Number to which the request is directed
> Time of request
>
> Can anyone please let me know what changes should I make in which modules to
> extract this information? Thanks a lot!!

I'd probably use Jens Axboe's blktrace; google can find it for you (or
Fedora has rpms, and other distros probably do too).

The vm.block_dump sysctl might also help.

-Eric
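
(A minimal sketch of the two approaches Eric mentions, assuming the blktrace
package is installed and using /dev/sda as a placeholder for the device being
traced; blktrace reads its events through debugfs.)

# per-request tracing: direction (R/W), sector, and timestamp for one device
mount -t debugfs debugfs /sys/kernel/debug 2>/dev/null
blktrace -d /dev/sda -o - | blkparse -i -

# coarser alternative: log block I/O to the kernel log via the sysctl
sysctl -w vm.block_dump=1    # or: echo 1 > /proc/sys/vm/block_dump
dmesg | tail                 # lines show process, READ/WRITE, block number, and device
sysctl -w vm.block_dump=0
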