From mike.miller at hp.com  Fri Nov 2 21:54:17 2007
From: mike.miller at hp.com (Mike Miller)
Date: Fri, 2 Nov 2007 16:54:17 -0500
Subject: journal has aborted
Message-ID: <20071102215417.GA2231@roadking.cca.cpqcorp.net>

All,
We are encountering spurious errors with ext3. After some period of heavy IO
we may see messages similar to:

EXT3-fs error (device cciss/c0d0p5) in start_transaction: Journal has aborted

When this happens the filesystem is remounted read-only. If it's the root
filesystem the system becomes unresponsive and must be rebooted. An fsck on
the affected filesystem shows lots of corruption.

Any ideas on what we can do to help isolate this problem? We have 64 nodes
and the problem is random.

Thanks,
mikem

From sandeen at redhat.com  Sat Nov 3 02:00:13 2007
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 02 Nov 2007 21:00:13 -0500
Subject: journal has aborted
In-Reply-To: <20071102215417.GA2231@roadking.cca.cpqcorp.net>
References: <20071102215417.GA2231@roadking.cca.cpqcorp.net>
Message-ID: <472BD62D.3070705@redhat.com>

Mike Miller wrote:
> All,
> We are encountering spurious errors with ext3. After some period of heavy IO
> we may see messages similar to:
>
> EXT3-fs error (device cciss/c0d0p5) in start_transaction: Journal has
> aborted

You probably had relevant messages before that... what were they?

> When this happens the filesystem is remounted read-only. If it's the root
> filesystem the system becomes unresponsive and must be rebooted. An fsck on
> the affected filesystem shows lots of corruption.
> Any ideas on what we can do to help isolate this problem? We have 64 nodes
> and the problem is random.

Crazy question, but I have to ask - you don't have the same filesystem
mounted on all those nodes, do you?

What kernel is this?

-Eric

From jprats at cesca.es  Mon Nov 5 08:46:39 2007
From: jprats at cesca.es (Jordi Prats)
Date: Mon, 05 Nov 2007 09:46:39 +0100
Subject: journal has aborted
In-Reply-To: <472BD62D.3070705@redhat.com>
References: <20071102215417.GA2231@roadking.cca.cpqcorp.net> <472BD62D.3070705@redhat.com>
Message-ID: <472ED86F.4030306@cesca.es>

Hi,
This also happened to me using an HP Smart Array. Which model do you have?

What I did was this:

Mark the filesystem as not having a journal (turning it back into ext2):

tune2fs -O ^has_journal /dev/cciss/c0d0p2

fsck it to delete the journal:

e2fsck /dev/cciss/c0d0p2

Create the journal again (taking it back to ext3):

tune2fs -j /dev/cciss/c0d0p2

and finally, remount it. On a live system, just reboot it.

It has not happened again.

regards,
Jordi

Eric Sandeen wrote:
> Mike Miller wrote:
>
>> All,
>> We are encountering spurious errors with ext3. After some period of heavy IO
>> we may see messages similar to:
>>
>> EXT3-fs error (device cciss/c0d0p5) in start_transaction: Journal has
>> aborted
>>
>
> You probably had relevant messages before that... what were they?
>
>
>> When this happens the filesystem is remounted read-only. If it's the root
>> filesystem the system becomes unresponsive and must be rebooted. An fsck on
>> the affected filesystem shows lots of corruption.
>> Any ideas on what we can do to help isolate this problem? We have 64 nodes
>> and the problem is random.
>>
>
> Crazy question, but I have to ask - you don't have the same filesystem
> mounted on all those nodes, do you?
>
> What kernel is this?
>
> -Eric
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>
>

--
......................................................................
         __
        / /           Jordi Prats
  C E / S / C A       Dept. de Sistemes
     /_/              Centre de Supercomputació de Catalunya
                      Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
                      T. 93 205 6464 · F. 93 205 6979 · jprats at cesca.es
......................................................................

From worleys at gmail.com  Sat Nov 10 02:11:50 2007
From: worleys at gmail.com (Chris Worley)
Date: Fri, 9 Nov 2007 19:11:50 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
Message-ID:

How do you measure/gauge/assure proper alignment?

The physical disk has a block structure. What is it or how do you
find it? I'm guessing it's best to not partition disks in order to
assure that whatever its block read/write is isn't bisected by the
partition.

Then, mdadm has some block structure. The "-c" ("chunk") is in
"kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
what they're trying to do.

Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
size" in the man page (and I thought all your stripes added together
are a "stride"), as well as a block size.

It's important to make sure these all align properly, but their definitions
do.

Could somebody please clarify... with an example?

Thanks,

Chris

From worleys at gmail.com  Tue Nov 13 17:20:54 2007
From: worleys at gmail.com (Chris Worley)
Date: Tue, 13 Nov 2007 10:20:54 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To: <20071110061641.GK3966@webber.adilger.int>
References: <20071110061641.GK3966@webber.adilger.int>
Message-ID:

On Nov 9, 2007 11:16 PM, Andreas Dilger wrote:
> On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
> > How do you measure/gauge/assure proper alignment?
> >
> > The physical disk has a block structure. What is it or how do you
> > find it? I'm guessing it's best to not partition disks in order to
> > assure that whatever its block read/write is isn't bisected by the
> > partition.
>
> For Lustre we never partition the disks for exactly this reason, and if
> you are using LVM/md on the whole device it doesn't make sense either.
>
> > Then, mdadm has some block structure. The "-c" ("chunk") is in
> > "kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
> > what they're trying to do.
>
> That just means for RAID 0/5/6 that the amount of data or parity in a
> stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:
>
> disk0 disk1 disk2 disk3 disk4
> [64kB][64kB][64kB][64kB][64kB]
> [64kB][64kB]...
>
> > Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
> > size" in the man page (and I thought all your stripes added together
> > are a "stride"), as well as a block size.
>
> For ext2/3/4 the stride size (in kB) == the mdadm chunk size. Note that
> the ext2/3/4 stride size is in units of filesystem blocks, so if you have
> 4kB filesystem blocks (default for filesystems > 500MB) and a 64kB RAID5
> chunk size, this is 16:
>
> e2fsck -E stride=16 /dev/md0

So, if:

B = Ext block size
S = Ext stride size
C = MD chunk size

Then:

S = C/B

Is that correct?
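
(As a cross-check of that formula: a minimal sketch, assuming a hypothetical
/dev/md0, 4096-byte filesystem blocks, and the mdadm default 64 KiB chunk;
note that stride is normally passed at mkfs time as an extended option.)

CHUNK_KB=64                        # must match the value given to mdadm --chunk
BLOCK_KB=4                         # 4096-byte filesystem blocks
STRIDE=$((CHUNK_KB / BLOCK_KB))    # 64 / 4 = 16, matching the example quoted above
mkfs.ext3 -b 4096 -E stride=$STRIDE /dev/md0

(With a 1024 KiB chunk and 4 KiB blocks the same arithmetic gives stride=256.)
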
Ignorantly/randomly shopping around for values (using 1MB block sizes
and 16GB transfers in DD as the benchmark), I found performance
increased as I increased the MD chunk (testing just the MD device),
but above a chunk of 1024 the MD performance kept increasing while the
EXT fs got slower. Strangely, the EXT stride performed best set at
2048 (the above equation says 256 would have been correct):

mdadm --create /dev/md0 --level=0 --chunk=1024 --raid-devices 12 /dev/sd[b-m]
mkfs.ext2 -T largefile4 -b 4096 -E stride=2048 /dev/md0

So, it may be best put that "S", in the equation above, is some factor
of the stride value used.

Note that I am trying to optimize for big blocks and big files, with
little regard for data reliability.

I also found some strange performance differences using different
manufacturers' disks. I have a bunch of Maxtor 15K and Seagate 10K
SCSI disks. Streaming to a single drive serially, the Maxtor disks
are faster, but, in parallel, the Seagate drives are faster. I
measure this with something like:

for i in /dev/sd[e-r]
do
   /usr/bin/time -f "$i: %e" \
      dd bs=1024k count=16000 of=/dev/null if=$i 2>&1 \
      | grep -v records &
done
wait

This test doesn't truly emulate an MD device, as each disk is treated
independently; a given disk is allowed to get ahead of the rest... why
the Seagates outperform the Maxtors is unknown. They are evenly
distributed across the SCSI channels (as many Seagates on a channel as
Maxtors). I'm guessing the Seagate disks have deeper buffers.

I remember a few years ago increasing the number of outstanding
scatter/gather requests helped increase the performance of Qlogic FC
drivers... is there any such driver or kernel tweak these days?

I'd still like to know what the disks use for a block size.

Thanks,

Chris

P.S. Andreas: Hope you're having fun at SC07... I don't get to go :(

>
> > It's important to make sure these all align properly, but their definitions
> > do.
>
> ... do not?
>
> > Could somebody please clarify... with an example?
>
> Yes, I constantly wish the terminology were constant between different tools,
> but sadly there isn't any "proper" terminology out there as far as I've been
> able to see.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

From anirban.adhikary at gmail.com  Thu Nov 15 06:44:15 2007
From: anirban.adhikary at gmail.com (Anirban Adhikary)
Date: Thu, 15 Nov 2007 12:14:15 +0530
Subject: Linux File systems Performance Tuning
Message-ID: <71c73b070711142244y737a041cyb6b7a61551d0c4e5@mail.gmail.com>

Dear List,
I want to do some performance tuning jobs on an ext3 filesystem. So,
regarding this, what are the parameters I need to check or what are the
things I need to follow?
Thanks & Regards
Anirban Adhikary.

From lists at nerdbynature.de  Thu Nov 15 12:01:42 2007
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 15 Nov 2007 13:01:42 +0100 (CET)
Subject: Linux File systems Performance Tuning
In-Reply-To: <71c73b070711142244y737a041cyb6b7a61551d0c4e5@mail.gmail.com>
References: <71c73b070711142244y737a041cyb6b7a61551d0c4e5@mail.gmail.com>
Message-ID: <43158.62.180.231.196.1195128102.squirrel@housecafe.dyndns.org>

On Thu, November 15, 2007 07:44, Anirban Adhikary wrote:
> I want to do some performance tuning jobs on an ext3 filesystem. So,
> regarding this, what are the parameters I need to check or what are the
> things I need to follow?

Well, there's http://tinyurl.com/2nue5f

The man pages for 'mkfs.ext3' and 'mount' also mention some tunables.
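
(For illustration only, a minimal sketch of the kinds of tunables those man
pages cover; /dev/sda5 is a placeholder and the values are examples rather
than recommendations.)

# mkfs-time choices: block size, inode-density profile, RAID stride
mkfs.ext3 -T largefile4 -b 4096 -E stride=16 /dev/sda5

# mount-time choices: skip atime updates, relax data/journal ordering, and
# lengthen the commit interval (both trade some crash safety for speed)
mount -o noatime,data=writeback,commit=30 /dev/sda5 /mnt/data

# on a data-only filesystem, shrink the blocks reserved for root from 5% to 1%
tune2fs -m 1 /dev/sda5
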
But first you need to find out what you want to tune your fs for. Lots of
small files in one directory? Lots of directories? Lots of writes? Reads?
And don't forget to measure performance with the application you intend to
run. Benchmark programs like bonnie++ and stuff might help, but you're
probably only interested in how your application will perform.

Christian.
--
BOFH excuse #442:

Trojan horse ran out of hay

From adilger at sun.com  Sat Nov 10 06:16:41 2007
From: adilger at sun.com (Andreas Dilger)
Date: Fri, 9 Nov 2007 23:16:41 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To:
References:
Message-ID: <20071110061641.GK3966@webber.adilger.int>

On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
> How do you measure/gauge/assure proper alignment?
>
> The physical disk has a block structure. What is it or how do you
> find it? I'm guessing it's best to not partition disks in order to
> assure that whatever its block read/write is isn't bisected by the
> partition.

For Lustre we never partition the disks for exactly this reason, and if
you are using LVM/md on the whole device it doesn't make sense either.

> Then, mdadm has some block structure. The "-c" ("chunk") is in
> "kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
> what they're trying to do.

That just means for RAID 0/5/6 that the amount of data or parity in a
stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:

disk0 disk1 disk2 disk3 disk4
[64kB][64kB][64kB][64kB][64kB]
[64kB][64kB]...

> Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
> size" in the man page (and I thought all your stripes added together
> are a "stride"), as well as a block size.

For ext2/3/4 the stride size (in kB) == the mdadm chunk size. Note that
the ext2/3/4 stride size is in units of filesystem blocks, so if you have
4kB filesystem blocks (default for filesystems > 500MB) and a 64kB RAID5
chunk size, this is 16:

e2fsck -E stride=16 /dev/md0

> It's important to make sure these all align properly, but their definitions
> do.

... do not?

> Could somebody please clarify... with an example?

Yes, I constantly wish the terminology were constant between different tools,
but sadly there isn't any "proper" terminology out there as far as I've been
able to see.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From jpiszcz at lucidpixels.com  Thu Nov 15 13:42:49 2007
From: jpiszcz at lucidpixels.com (Justin Piszcz)
Date: Thu, 15 Nov 2007 08:42:49 -0500 (EST)
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To: <20071110061641.GK3966@webber.adilger.int>
References: <20071110061641.GK3966@webber.adilger.int>
Message-ID:

On Fri, 9 Nov 2007, Andreas Dilger wrote:

> On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
>> How do you measure/gauge/assure proper alignment?
>>
>> The physical disk has a block structure. What is it or how do you
>> find it? I'm guessing it's best to not partition disks in order to
>> assure that whatever its block read/write is isn't bisected by the
>> partition.
>
> For Lustre we never partition the disks for exactly this reason, and if
> you are using LVM/md on the whole device it doesn't make sense either.
>
>> Then, mdadm has some block structure. The "-c" ("chunk") is in
>> "kibibytes" (feed the dog kibbles?), with a default of 64. Not a clue
>> what they're trying to do.
>
> That just means for RAID 0/5/6 that the amount of data or parity in a
> stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:
>
> disk0 disk1 disk2 disk3 disk4
> [64kB][64kB][64kB][64kB][64kB]
> [64kB][64kB]...
>
>> Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
>> size" in the man page (and I thought all your stripes added together
>> are a "stride"), as well as a block size.
>
> For ext2/3/4 the stride size (in kB) == the mdadm chunk size. Note that
> the ext2/3/4 stride size is in units of filesystem blocks, so if you have
> 4kB filesystem blocks (default for filesystems > 500MB) and a 64kB RAID5
> chunk size, this is 16:
>
> e2fsck -E stride=16 /dev/md0
>
>> It's important to make sure these all align properly, but their definitions
>> do.
>
> ... do not?
>
>> Could somebody please clarify... with an example?
>
> Yes, I constantly wish the terminology were constant between different tools,
> but sadly there isn't any "proper" terminology out there as far as I've been
> able to see.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

Quick question Andreas, if you do not provide a -E stride=16 on a RAID5
filesystem, how much worse does the performance become on say a 2.0 or
5.0TB ext3 filesystem?

Justin.

From adilger at sun.com  Thu Nov 15 18:06:24 2007
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 15 Nov 2007 11:06:24 -0700
Subject: Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks
In-Reply-To:
References: <20071110061641.GK3966@webber.adilger.int>
Message-ID: <20071115180624.GN3966@webber.adilger.int>

On Nov 15, 2007 08:42 -0500, Justin Piszcz wrote:
> Quick question Andreas, if you do not provide a -E stride=16 on a RAID5
> filesystem, how much worse does the performance become on say a 2.0 or
> 5.0TB ext3 filesystem?

Sorry, I don't have any numbers on that. It really depends on the
back-end RAID hardware and the IO load. If it has a write cache it
might not be any significant overhead.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From sundevil007 at gmail.com  Fri Nov 16 19:21:45 2007
From: sundevil007 at gmail.com (ViVu)
Date: Fri, 16 Nov 2007 11:21:45 -0800 (PST)
Subject: File System Traces
Message-ID: <13799180.post@talk.nabble.com>

Hello All,

I'm trying to collect the following information about an application at the
file system layer:

Type of request - Read/Write
Sector Number to which the request is directed
Time of request

Can anyone please let me know what changes should I make in which modules to
extract this information? Thanks a lot!!

Rgds
SunDevil
--
View this message in context: http://www.nabble.com/File-System-Traces-tf4823170.html#a13799180
Sent from the Ext3 - User mailing list archive at Nabble.com.

From skyfalcon866 at gmail.com  Mon Nov 19 23:45:44 2007
From: skyfalcon866 at gmail.com (skyhawk)
Date: Mon, 19 Nov 2007 15:45:44 -0800 (PST)
Subject: fsck
Message-ID: <13848110.post@talk.nabble.com>

Why does fsck take 10 minutes to finish on my 250GB hdd? JFS fsck takes
1 minute.
--
View this message in context: http://www.nabble.com/fsck-tf4840279.html#a13848110
Sent from the Ext3 - User mailing list archive at Nabble.com.
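
(A side note on the fsck question above, as a sketch rather than a definitive
answer: when e2fsck does run a full check it walks all of the filesystem
metadata, whereas JFS's fsck mostly replays its log, and ext3 filesystems are
also forced through a full check every N mounts or days; tune2fs can show and
adjust those triggers. /dev/sdXN is a placeholder for the actual partition.)

tune2fs -l /dev/sdXN | grep -i -e 'mount count' -e 'check'   # show the forced-check triggers
tune2fs -c 50 -i 180d /dev/sdXN                              # example: full check every 50 mounts or 180 days
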
From sandeen at redhat.com  Mon Nov 26 15:51:13 2007
From: sandeen at redhat.com (Eric Sandeen)
Date: Mon, 26 Nov 2007 09:51:13 -0600
Subject: File System Traces
In-Reply-To: <13799180.post@talk.nabble.com>
References: <13799180.post@talk.nabble.com>
Message-ID: <474AEB71.2040903@redhat.com>

ViVu wrote:
> Hello All,
>
> I'm trying to collect the following information about an application at the
> file system layer:
>
> Type of request - Read/Write
> Sector Number to which the request is directed
> Time of request
>
> Can anyone please let me know what changes should I make in which modules to
> extract this information? Thanks a lot!!

I'd probably use Jens Axboe's blktrace; google can find it for you (or
Fedora has rpms, and other distros probably do too).

The vm.block_dump sysctl might also help.

-Eric
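
(A minimal sketch of the two approaches Eric mentions, assuming the blktrace
package is installed and using /dev/sda as a placeholder for the device being
traced; blktrace reads its events through debugfs.)

# per-request tracing: direction (R/W), sector, and timestamp for one device
mount -t debugfs debugfs /sys/kernel/debug 2>/dev/null
blktrace -d /dev/sda -o - | blkparse -i -

# coarser alternative: log block I/O to the kernel log via the sysctl
sysctl -w vm.block_dump=1    # or: echo 1 > /proc/sys/vm/block_dump
dmesg | tail                 # lines show process, READ/WRITE, block number, and device
sysctl -w vm.block_dump=0
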