From adilger at dilger.ca  Mon Nov  1 06:13:33 2010
From: adilger at dilger.ca (Andreas Dilger)
Date: Mon, 1 Nov 2010 00:13:33 -0600
Subject: How to generate a large file allocating space
In-Reply-To: <2A382F5D94CB78493D1760C9@Ximines.local>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
Message-ID: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>

On 2010-10-31, at 09:05, Alex Bligh wrote:
> I am trying to allocate huge files on ext4. I will then read the extents
> within the file and write to the disk at a block level rather than using
> ext4 (the FS will not be mounted at this point). This will allow me to
> have several iSCSI clients hitting the same LUN r/w safely. And at
> some point when I know the relevant iSCSI stuff has stopped and been
> flushed to disk, I may unlink the file.

Hmm, why not simply use a cluster filesystem to do this?

GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), and Lustre uses ext4 as the underlying filesystem, and though it doesn't allow direct client writes to the same disk it will allow writing at 95% of the underlying raw disk performance from multiple clients.

Cheers, Andreas








From alex at alex.org.uk  Mon Nov  1 06:14:09 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Mon, 01 Nov 2010 07:14:09 +0100
Subject: How to generate a large file allocating space
In-Reply-To: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
Message-ID: <8E4E90C9C1C482942DD479C6@nimrod.local>



--On 1 November 2010 00:13:33 -0600 Andreas Dilger <adilger at dilger.ca> 
wrote:

> Hmm, why not simply use a cluster filesystem to do this?
>
> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),
> and Lustre uses ext4 as the underlying filesystem, and though it doesn't
> allow direct client writes to the same disk it will allow writing at 95%
> of the underlying raw disk performance from multiple clients.

Essentially because none of them do exactly what I need them to do,
so I am reinventing a slightly different wheel...

-- 
Alex Bligh



From adilger.kernel at dilger.ca  Mon Nov  1 21:45:12 2010
From: adilger.kernel at dilger.ca (Andreas Dilger)
Date: Mon, 1 Nov 2010 15:45:12 -0600
Subject: How to generate a large file allocating space
In-Reply-To: <8E4E90C9C1C482942DD479C6@nimrod.local>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
Message-ID: <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>

On 2010-11-01, at 00:14, Alex Bligh wrote:
> --On 1 November 2010 00:13:33 -0600 Andreas Dilger <adilger at dilger.ca> wrote:
>> Hmm, why not simply use a cluster filesystem to do this?
>> 
>> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),
>> and Lustre uses ext4 as the underlying filesystem, and though it doesn't
>> allow direct client writes to the same disk it will allow writing at 95%
>> of the underlying raw disk performance from multiple clients.
> 
> Essentially because none of them do exactly what I need them to do,
> so I am reinventing a slightly different wheel...

Personally, I hate re-inventing things vs. improving something to make it do what you want, since it means (probably) that your code will be used by you alone, while making an improvement to an existing cluster filesystem will both meet your needs and allow others to benefit as well.

What is it you really want to do in the end?  Shared concurrent writers to the same file?  High-bandwidth IO to the underlying disk?

Cheers, Andreas








From alex at alex.org.uk  Mon Nov  1 22:58:12 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Mon, 01 Nov 2010 22:58:12 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
Message-ID: <B06DC20F102CA75451E9914C@Ximines.local>



--On 1 November 2010 15:45:12 -0600 Andreas Dilger 
<adilger.kernel at dilger.ca> wrote:

> What is it you really want to do in the end?  Shared concurrent writers
> to the same file?  High-bandwidth IO to the underlying disk?

High bandwidth I/O to the underlying disk is part of it - only one
reader/writer per file. We're really using ext4 just for its extents
capability, i.e. allocating space, plus the convenience of directory
lookup to find the set of extents.

It's easier to do this than to write this bit from scratch, and the
files are pretty static in size (i.e. they only grow, and grow
infrequently by large amounts). The files on ext4 correspond to large
chunks of disks we are combining together using an device-mapper
type thing (but different), and on top of that lives arbitary real
filing systems. Because our device-mapper type thing already
understands what blocks have been written to, we already have a layer
that prevents the data on the disk before the file's creation being
exposed. That's why I don't need ext4 to zero them out. I suppose
in that sense it is like the swap file case.

Oh, and because these files are allocated infrequently, I am not
/that/ concerned about performance (famous last words). The performance
critical stuff is done via direct writes to the SAN and don't even
pass through ext4 (or indeed through any single host).

-- 
Alex Bligh



From tytso at mit.edu  Tue Nov  2 01:49:46 2010
From: tytso at mit.edu (Ted Ts'o)
Date: Mon, 1 Nov 2010 21:49:46 -0400
Subject: How to generate a large file allocating space
In-Reply-To: <B06DC20F102CA75451E9914C@Ximines.local>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
Message-ID: <20101102014946.GB24500@thunk.org>

On Mon, Nov 01, 2010 at 10:58:12PM +0000, Alex Bligh wrote:
> High bandwidth I/O to the underlying disk is part of it - only one
> reader/writer per file. We're really using ext4 just for its extents
> capability, i.e. allocating space, plus the convenience of directory
> lookup to find the set of extents.
> 
> It's easier to do this than to write this bit from scratch, and the
> files are pretty static in size (i.e. they only grow, and grow
> infrequently by large amounts). The files on ext4 correspond to large
> chunks of disks we are combining together using an device-mapper
> type thing (but different), and on top of that lives arbitary real
> filing systems. Because our device-mapper type thing already
> understands what blocks have been written to, we already have a layer
> that prevents the data on the disk before the file's creation being
> exposed. That's why I don't need ext4 to zero them out. I suppose
> in that sense it is like the swap file case.

But why not just use O_DIRECT?  Do you really need to access the
disk directly, as opposed to using O_DIRECT?

						- Ted



From adilger.kernel at dilger.ca  Tue Nov  2 03:21:10 2010
From: adilger.kernel at dilger.ca (Andreas Dilger)
Date: Mon, 1 Nov 2010 21:21:10 -0600
Subject: How to generate a large file allocating space
In-Reply-To: <B06DC20F102CA75451E9914C@Ximines.local>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
Message-ID: <A3371ED8-02E1-4D07-A3D6-58AC6D615F0B@dilger.ca>

On 2010-11-01, at 16:58, Alex Bligh wrote:
> --On 1 Nov 2010 15:45:12 Andreas Dilger <adilger.kernel at dilger.ca> wrote:
>> What is it you really want to do in the end?  Shared concurrent writers
>> to the same file?  High-bandwidth IO to the underlying disk?
> 
> High bandwidth I/O to the underlying disk is part of it - only one
> reader/writer per file. We're really using ext4 just for its extents
> capability, i.e. allocating space, plus the convenience of directory
> lookup to find the set of extents.
> 
> It's easier to do this than to write this bit from scratch, and the
> files are pretty static in size (i.e. they only grow, and grow
> infrequently by large amounts). The files on ext4 correspond to large
> chunks of disks we are combining together using an device-mapper
> type thing (but different), and on top of that lives arbitary real
> filing systems. Because our device-mapper type thing already
> understands what blocks have been written to, we already have a layer
> that prevents the data on the disk before the file's creation being
> exposed. That's why I don't need ext4 to zero them out. I suppose
> in that sense it is like the swap file case.
> 
> Oh, and because these files are allocated infrequently, I am not
> /that/ concerned about performance (famous last words). The performance
> critical stuff is done via direct writes to the SAN and don't even
> pass through ext4 (or indeed through any single host).

Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well.  

Cheers, Andreas








From alex at alex.org.uk  Tue Nov  2 07:58:02 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Tue, 2 Nov 2010 07:58:02 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <20101102014946.GB24500@thunk.org>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
Message-ID: <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>

Ted,

On 2 Nov 2010, at 01:49, "Ted Ts'o" <tytso at mit.edu> wrote:

> But why not just use O_DIRECT?  Do you really need to access the
> disk directly, as opposed to using O_DIRECT?
> 
Because more than one machine will be accessing the data on the ext4 volume (over iSCSI), though access to the large files is mediated by locks higher up. To use O_DIRECT each accessing machine would need to have the volume mounted, rather than merely receiving a list of extents.

-- 
Alex Bligh



From alex at alex.org.uk  Tue Nov  2 08:01:28 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Tue, 2 Nov 2010 08:01:28 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <A3371ED8-02E1-4D07-A3D6-58AC6D615F0B@dilger.ca>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<A3371ED8-02E1-4D07-A3D6-58AC6D615F0B@dilger.ca>
Message-ID: <252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk>

On 2 Nov 2010, at 03:21, Andreas Dilger <adilger.kernel at dilger.ca> wrote:

> 
> Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well.  

Unfortunately I need something not prototype. Fortunately I don't need many of lustre's or ceph's features.

-- 
Alex Bligh




From rwheeler at redhat.com  Tue Nov  2 11:20:48 2010
From: rwheeler at redhat.com (Ric Wheeler)
Date: Tue, 02 Nov 2010 07:20:48 -0400
Subject: How to generate a large file allocating space
In-Reply-To: <252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk>
References: <9A62FED22DF5F54862C68579@nimrod.local>	<20101031152351.GA20833@wolff.to>	<2A382F5D94CB78493D1760C9@Ximines.local>	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>	<8E4E90C9C1C482942DD479C6@nimrod.local>	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>	<B06DC20F102CA75451E9914C@Ximines.local>	<A3371ED8-02E1-4D07-A3D6-58AC6D615F0B@dilger.ca>
	<252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk>
Message-ID: <4CCFF410.3040905@redhat.com>

  On 11/02/2010 04:01 AM, Alex Bligh wrote:
> On 2 Nov 2010, at 03:21, Andreas Dilger<adilger.kernel at dilger.ca>  wrote:
>
>> Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well.
> Unfortunately I need something not prototype. Fortunately I don't need many of lustre's or ceph's features.
>

Sounds like you will end up writing something brand new - much less stable than 
any of the options mentioned in the thread previously.

Ric



From alex at alex.org.uk  Tue Nov  2 17:37:29 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Tue, 02 Nov 2010 17:37:29 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <4CCFF410.3040905@redhat.com>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<A3371ED8-02E1-4D07-A3D6-58AC6D615F0B@dilger.ca>
	<252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk>
	<4CCFF410.3040905@redhat.com>
Message-ID: <2DD84B976B2AFDE5E6F66A7E@nimrod.local>



--On 2 November 2010 07:20:48 -0400 Ric Wheeler <rwheeler at redhat.com> wrote:

> Sounds like you will end up writing something brand new - much less
> stable than any of the options mentioned in the thread previously.

Well, the new component will be something simple. All I really
need to know is how to mark the inodes as allocated and initialised,
rather than unwritten.

-- 
Alex Bligh



From bothie at gmx.de  Thu Nov  4 12:46:38 2010
From: bothie at gmx.de (Bodo Thiesen)
Date: Thu, 4 Nov 2010 13:46:38 +0100
Subject: How to generate a large file allocating space
In-Reply-To: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
Message-ID: <20101104134638.31d98a5c@gmx.de>

Hello Alex, hello Andreas

* Andreas Dilger <adilger at dilger.ca> hat geschrieben:
>* On 2010-10-31, at 09:05, Alex Bligh wrote:
>> I am trying to allocate huge files on ext4. I will then read the extents
>> within the file and write to the disk at a block level rather than using
>> ext4 (the FS will not be mounted at this point). This will allow me to
>> have several iSCSI clients hitting the same LUN r/w safely. And at
>> some point when I know the relevant iSCSI stuff has stopped and been
>> flushed to disk, I may unlink the file.

Question: Did you consider using plain LVM for this purpose? By creating a
logical volume, no data is initialized, only the meta data is created
(what seems to be exactly what you need). Then, each client may access one
logical volume r/w. Retrieving the extents list is very easy as well. And
because there are no group management data (cluster bitmaps, inode bitmaps
and tables) of any kind, you will end up with only one single extent in
most cases regardless of the size of the volume you've created.

> Hmm, why not simply use a cluster filesystem to do this?
> 
> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),

They are SUPPOSED to do that - in theory. The last two weekends I tried to
set up a stable DRDB+GFS2 setup - I failed. Then I tried OCFS2 - again I
failed. The setup was simple: Two identical Systems with 10*500GB disks
and a hardware RAID6 yielding 4GB user disk space. That was used to create
a DRDB (no LVM or other stuff like crypto in betreen). Both were set to
primary and then I created GFS2 (later OCFS2) and started the additional
tools like clvm/o2bc. Then mounting the file systems on both machines -
everything worked up to here.

machine1: dd if=/dev/zero of=/mnt/4tb/file1
machine2: dd if=/dev/zero of=/mnt/4tb/file2

Worked well in both setups on both machines

machine1: let i=0; while let i=i+1; do echo "A$i" >> /mnt/4tb/file3; done
machine2: let i=0; while let i=i+1; do echo "B$i" >> /mnt/4tb/file3; done

GFS2: First machine works well, second machine starts returning EIO on
*ANY* request (even ls /mnt/4tb). Umount impossible. Had to reboot ->
#gfs2 #fail
OCFS2: passed this test as well as the next one

machine1: let i=0; while let i=i+1; do echo "A$i"; done >> /mnt/4tb/file4
machine2: let i=0; while let i=i+1; do echo "B$i"; done >> /mnt/4tb/file4

Then I rebooted one machine with "echo b > /proc/sysrq-trigger" while the
last test was still in progress. Guess what: The other machine stopped
working. No reads, no writes. It didn't evern go on when the first machine
came back. I had then to reboot the second one as well to continue using
the file system.

Maybe I did something wrong, maybe the file systems just aren't as stable
as we expected them to be, anyways, we decided now to use stable systems,
i.e. drbd in primary/secondary setup and ext3 with failover to the other
system if the primary goes down, and as the system already went
productive, we're not gonna change anything here in the near future. So
consider this report as strictly informative.

BTW: No, I do not longer have the config files, I didn't save them and the
systems have been completely reinstalled after testing the final setup
succeeded to wipe out everything left over from the previous attempts.

Regards, Bodo



From tytso at mit.edu  Thu Nov  4 16:16:13 2010
From: tytso at mit.edu (Ted Ts'o)
Date: Thu, 4 Nov 2010 12:16:13 -0400
Subject: How to generate a large file allocating space
In-Reply-To: <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
	<687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
Message-ID: <20101104161613.GC4631@thunk.org>

On Tue, Nov 02, 2010 at 07:58:02AM +0000, Alex Bligh wrote:
> On 2 Nov 2010, at 01:49, "Ted Ts'o" <tytso at mit.edu> wrote:
> > But why not just use O_DIRECT?  Do you really need to access the
> > disk directly, as opposed to using O_DIRECT?
> > 
> Because more than one machine will be accessing the data on the ext4
> volume (over iSCSI), though access to the large files is mediated by
> locks higher up. To use O_DIRECT each accessing machine would need
> to have the volume mounted, rather than merely receiving a list of
> extents.

Well, I would personally not be against an extension to fallocate()
where if the caller of the syscall specifies a new flag, that might be
named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root
privs or (if capabilities are enabled) CAP_DAC_OVERRIDE &&
CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents
would be marked as initialized without actually initializing the
blocks first.

I don't know whether it will get past the fs-devel bike shed painting
crew, but I do have some cluster file system users who would like
something similar.  In their case they will be writing the files using
Direct I/O, and the objects are all checksumed at the cluster file
system level, and if the object has the wrong checksum, then the
cluster file system will ask another server for the object.  Since the
cluster file system is considered trusted, and it verifies the
expected object checksum before releasing the data, there is no
security issue.

You do realize, though, that it sounds like with your design you are
replicating the servers, but not the disk devices --- so if your disk
device explodes, you're Sadly Out of Luck.  Sure you can use
super-expensive storage arrays, but if you're writing your own cluster
file system, why not create a design which uses commodity disks and
worry about replicating data across servers at the cluster file system
level?

						- Ted



From alex at alex.org.uk  Thu Nov  4 18:22:39 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Thu, 04 Nov 2010 18:22:39 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <20101104134638.31d98a5c@gmx.de>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<20101104134638.31d98a5c@gmx.de>
Message-ID: <23A690AD52CEF71A1C450278@nimrod.local>



--On 4 November 2010 13:46:38 +0100 Bodo Thiesen <bothie at gmx.de> wrote:

> <adilger at dilger.ca> hat geschrieben:
>> * On 2010-10-31, at 09:05, Alex Bligh wrote:
>>> I am trying to allocate huge files on ext4. I will then read the extents
>>> within the file and write to the disk at a block level rather than using
>>> ext4 (the FS will not be mounted at this point). This will allow me to
>>> have several iSCSI clients hitting the same LUN r/w safely. And at
>>> some point when I know the relevant iSCSI stuff has stopped and been
>>> flushed to disk, I may unlink the file.
>
> Question: Did you consider using plain LVM for this purpose?
> By creating a
> logical volume, no data is initialized, only the meta data is created
> (what seems to be exactly what you need). Then, each client may access one
> logical volume r/w. Retrieving the extents list is very easy as well. And
> because there are no group management data (cluster bitmaps, inode bitmaps
> and tables) of any kind, you will end up with only one single extent in
> most cases regardless of the size of the volume you've created.

Plain LVM or Clustered LVM? Clustered LVM has some severe limitations,
including needing to restart the entire cluster to add nodes, which
is not acceptable.

Plain LVM has two types of issue:

1. Without clustered LVM, as far as I can tell there is no locking
   of metadata. I have no guarantees that access to the disk does not
   go outside the LV's allocation. For instance, when a CoW snapshot is
   written to and expanded, the metadata must be written to, and there
   is no locking for that.

2. Snapshots suffer severe limitations. For instance,
   it is not possible to generate arbitrarily deep trees of snapshots
   (i.e. CoW on top of CoW) without an arbitrarily deep tree of loopback
   mounted lvm devices, which does not sound like a good idea.

I think you can only use lvm like this where you have simple volumes
mounted, and in essence take no snapshots.

To answer the implied question, yes we have a (partial) lvm replacement.

>> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),
>
> They are SUPPOSED to do that - in theory

We have had similar experiences and don't actually need all the features
(and thus complexity) that a true clustered filing system presents.

-- 
Alex Bligh



From alex at alex.org.uk  Thu Nov  4 18:29:47 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Thu, 04 Nov 2010 18:29:47 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <20101104161613.GC4631@thunk.org>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
	<687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
	<20101104161613.GC4631@thunk.org>
Message-ID: <90029CFD2E61A5BA29AD505F@nimrod.local>

Ted,

--On 4 November 2010 12:16:13 -0400 Ted Ts'o <tytso at mit.edu> wrote:

> Well, I would personally not be against an extension to fallocate()
> where if the caller of the syscall specifies a new flag, that might be
> named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root
> privs or (if capabilities are enabled) CAP_DAC_OVERRIDE &&
> CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents
> would be marked as initialized without actually initializing the
> blocks first.

That sounds a lot like "send patches" which I just might do, if only
to gain better understanding as to what is going on.

I seem to remember (from lwn's summary of lkml) that the proposed
options for fallocate() got a bit baroque to start with, and people
then simplified down to zero options. Perhaps that was a simplification
too far.

In the mean time, particularly as I'd ideally like to avoid a kernel
modification, is there a safe way I could use or modify the ext2
library to run through the extents of a fallocated() file and clear
the "unwritten" bit? If I clear that (which from memory is the top
bit of the extent length), is that alone safe? (on an unmounted
file system, obviously).

> You do realize, though, that it sounds like with your design you are
> replicating the servers, but not the disk devices --- so if your disk
> device explodes, you're Sadly Out of Luck.  Sure you can use
> super-expensive storage arrays, but if you're writing your own cluster
> file system, why not create a design which uses commodity disks and
> worry about replicating data across servers at the cluster file system
> level?

The particular use case here is for customers that have sunk huge
amounts of money into expensive storage arrays, or for whatever
reason have an aversion to storing anything on anything other than
expensive storage arrays.

I would tend to agree that replicating across commodity disks is
in almost all cases a better technological solution, but the
technology is still further away from readiness there. Sadly
technological arguments don't always win the day, and we need
something in the mean time...

-- 
Alex Bligh



From tytso at mit.edu  Thu Nov  4 19:17:34 2010
From: tytso at mit.edu (Ted Ts'o)
Date: Thu, 4 Nov 2010 15:17:34 -0400
Subject: How to generate a large file allocating space
In-Reply-To: <90029CFD2E61A5BA29AD505F@nimrod.local>
References: <20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
	<687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
	<20101104161613.GC4631@thunk.org>
	<90029CFD2E61A5BA29AD505F@nimrod.local>
Message-ID: <20101104191734.GG7553@thunk.org>

On Thu, Nov 04, 2010 at 06:29:47PM +0000, Alex Bligh wrote:
> 
> >Well, I would personally not be against an extension to fallocate()
> >where if the caller of the syscall specifies a new flag, that might be
> >named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root
> >privs or (if capabilities are enabled) CAP_DAC_OVERRIDE &&
> >CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents
> >would be marked as initialized without actually initializing the
> >blocks first.
> 
> That sounds a lot like "send patches" which I just might do, if only
> to gain better understanding as to what is going on.

Patches to do this wouldn't be that hard.  The harder part would
probably be the politics on fs-devel regarding the semantics of
FALLOC_FL_EXPOSE_OLD_DATA.

> I seem to remember (from lwn's summary of lkml) that the proposed
> options for fallocate() got a bit baroque to start with, and people
> then simplified down to zero options. Perhaps that was a simplification
> too far.

It was simplified down to one flag.  But that means we have a flags
field we can use to extend fallocate.

> In the mean time, particularly as I'd ideally like to avoid a kernel
> modification, is there a safe way I could use or modify the ext2
> library to run through the extents of a fallocated() file and clear
> the "unwritten" bit? If I clear that (which from memory is the top
> bit of the extent length), is that alone safe? (on an unmounted
> file system, obviously).

Yes, there most certainly is.  The functions you'd probably want to
use are ext2fs_extent_open(), and then either use ext2fs_extent_goto()
to go to a specific extent, use ext2fs_extent_get() with the
EXT2_EXTENT_NEXT operation to iterate over the extents, and then use
ext2fs_extent_replace() to mutate the extent.  Oh, and then use
ext2fs_extent_close() when you're done looking at and/or changing the
extents of a file.

If you build tst_extents in lib/ext2fs, you can use commands like
"inode" (to open the extents for a particular inode), and "root",
"current", "next", "prev", "next_leaf", "prev_leaf", "next_sibling",
"prev_sibling", "delete_node", "insert_node", "replace_node",
"split_node", "print_all", "goto", etc.  Please don't use this in
production, but it's not a bad way to play with an extent tree, either
for learning purposes or to create test cases.  tst_extents.c is also
a good way of seeing how the various libext2fs extent API's work.

> I would tend to agree that replicating across commodity disks is
> in almost all cases a better technological solution, but the
> technology is still further away from readiness there. Sadly
> technological arguments don't always win the day, and we need
> something in the mean time...

Well, things like Hadoopfs exist today, and Ceph (if you need a
POSIX-level access) is admittedly less stable.  But if you're starting
from scratch, wouldn't that be pretty far away from readiness as well?

     	      	       	       	      	  - Ted



From bothie at gmx.de  Thu Nov  4 23:05:45 2010
From: bothie at gmx.de (Bodo Thiesen)
Date: Fri, 5 Nov 2010 00:05:45 +0100
Subject: How to generate a large file allocating space
In-Reply-To: <23A690AD52CEF71A1C450278@nimrod.local>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<20101104134638.31d98a5c@gmx.de>
	<23A690AD52CEF71A1C450278@nimrod.local>
Message-ID: <20101105000545.70ee4750@gmx.de>

Hello Alex

* Alex Bligh <alex at alex.org.uk> hat geschrieben:
>* --On 4 November 2010 13:46:38 +0100 Bodo Thiesen <bothie at gmx.de> wrote:
>> Question: Did you consider using plain LVM for this purpose?
>> By creating a
>> logical volume, no data is initialized, only the meta data is created
>> (what seems to be exactly what you need). Then, each client may access one
>> logical volume r/w. Retrieving the extents list is very easy as well. And
>> because there are no group management data (cluster bitmaps, inode bitmaps
>> and tables) of any kind, you will end up with only one single extent in
>> most cases regardless of the size of the volume you've created.
> Plain LVM or Clustered LVM? Clustered LVM has some severe limitations,
> including needing to restart the entire cluster to add nodes, which
> is not acceptable.
> 
> Plain LVM has two types of issue:
> 
> 1. Without clustered LVM, as far as I can tell there is no locking
>    of metadata.

Possible (I don't know exactly)

>    I have no guarantees that access to the disk does not
>    go outside the LV's allocation.

In LVM you create one logical volume. In the process of creating that
volume, metadata get's updated. But just using the pre-existing logical
volumes doesn't change the metadata. So, if you do all creation and
removing of logical volumes on the same node, then you shouldn't get any
problems here. "lvchange -a[yn] $lv" doesn't even change the metadata,
it's a completely local operation (the local lvm cache get's updated, but
that's all). So, if you provide access via nbd or something like that to
the pv, all nodes could just use their portion of the lv without any
problems. Besides: You wanted to use ext4. I suggested to use lvm in the
same way you initially wanted to use ext4. So: On the main node you use
the command "lvdisplay -v $lv" (or thatever the exact command line is) and
you get a list of extents as result. Then you transfer that list to the
client and it can access the disk directly without issuing any lvm command
at all.

>    For instance, when a CoW snapshot is
>    written to and expanded, the metadata must be written to, and there
>    is no locking for that.

Right, but that was not part of your use-case. If you need such things,
you can't use ext4 as well.

> 2. Snapshots suffer severe limitations. For instance,
>    it is not possible to generate arbitrarily deep trees of snapshots
>    (i.e. CoW on top of CoW) without an arbitrarily deep tree of loopback
>    mounted lvm devices, which does not sound like a good idea.
> 
> I think you can only use lvm like this where you have simple volumes
> mounted, and in essence take no snapshots.

Yea, and I mentioned lvm, because that was exactly your use-case ;)

> To answer the implied question, yes we have a (partial) lvm replacement.

---> Did you consider using plain LVM for this purpose? <---

That was an explicit question ;)

>>> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),
>> They are SUPPOSED to do that - in theory
> We have had similar experiences and don't actually need all the features
> (and thus complexity) that a true clustered filing system presents.

Ok, so not my fault ;)

Regards, Bodo



From alex at alex.org.uk  Fri Nov  5 08:08:13 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Fri, 05 Nov 2010 08:08:13 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <20101105000545.70ee4750@gmx.de>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<20101104134638.31d98a5c@gmx.de>	<23A690AD52CEF71A1C450278@nimrod.local>
	<20101105000545.70ee4750@gmx.de>
Message-ID: <B9C57562859DC18CBC21818B@nimrod.local>



--On 5 November 2010 00:05:45 +0100 Bodo Thiesen <bothie at gmx.de> wrote:

>>    For instance, when a CoW snapshot is
>>    written to and expanded, the metadata must be written to, and there
>>    is no locking for that.
>
> Right, but that was not part of your use-case. If you need such things,
> you can't use ext4 as well.

I should have been clearer. We aren't using ext4 as anything other than
a block store. The CoW snapshots are done using our LVM replacement
type thing which stores metadata in such a way that it safe to access
it from multiple readers/writers. It would be lovely to use LVM for
this, but not (as far as I can tell) possible.

I might have another look at using lvm as a blockstore, then running our
stuff inside lvm. But I didn't think lvm was capable of running thousands
of LVs per volume group. ext4 is just fine for that. Perhaps I am
slating lvm unfairly.

-- 
Alex Bligh



From alex at alex.org.uk  Fri Nov  5 08:14:56 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Fri, 05 Nov 2010 08:14:56 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <20101104191734.GG7553@thunk.org>
References: <20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
	<687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
	<20101104161613.GC4631@thunk.org>
	<90029CFD2E61A5BA29AD505F@nimrod.local>
	<20101104191734.GG7553@thunk.org>
Message-ID: <8B59E33D0CCCC9F5DF30CF2E@nimrod.local>

Ted,

--On 4 November 2010 15:17:34 -0400 Ted Ts'o <tytso at mit.edu> wrote:

> Patches to do this wouldn't be that hard.  The harder part would
> probably be the politics on fs-devel regarding the semantics of
> FALLOC_FL_EXPOSE_OLD_DATA.

Also presumably there would be some pressure to make it work for
every filesystem that supported fallocate().

>> In the mean time, particularly as I'd ideally like to avoid a kernel
>> modification, is there a safe way I could use or modify the ext2
>> library to run through the extents of a fallocated() file and clear
>> the "unwritten" bit? If I clear that (which from memory is the top
>> bit of the extent length), is that alone safe? (on an unmounted
>> file system, obviously).
>
> Yes, there most certainly is.  The functions you'd probably want to
> use are ext2fs_extent_open(), and then either use ext2fs_extent_goto()
> to go to a specific extent, use ext2fs_extent_get() with the
> EXT2_EXTENT_NEXT operation to iterate over the extents, and then use
> ext2fs_extent_replace() to mutate the extent.  Oh, and then use
> ext2fs_extent_close() when you're done looking at and/or changing the
> extents of a file.
>
> If you build tst_extents in lib/ext2fs, you can use commands like
> "inode" (to open the extents for a particular inode), and "root",
> "current", "next", "prev", "next_leaf", "prev_leaf", "next_sibling",
> "prev_sibling", "delete_node", "insert_node", "replace_node",
> "split_node", "print_all", "goto", etc.  Please don't use this in
> production, but it's not a bad way to play with an extent tree, either
> for learning purposes or to create test cases.  tst_extents.c is also
> a good way of seeing how the various libext2fs extent API's work.

Thaks, that's really helpful. Are the extents always the leaves? IE
will next_leaf take me through extent by extent?

Does your "please don't use this in production" warning apply to
tst_extents.c or to the whole of lib/ext2fs? The library calls
seem quite a good way to get the list of extents and are
presumably what fsck etc. use.

> Well, things like Hadoopfs exist today, and Ceph (if you need a
> POSIX-level access)

No, just block layer access fortunately

> is admittedly less stable.  But if you're starting
> from scratch, wouldn't that be pretty far away from readiness as well?

The idea was to base as much as possible on existing running code (e.g.
ext4) with as few variations as possible. I'd be very surprised if we
end up exceeding a few thousand lines of code. All the cluster, lock
management etc we are borrowing from elsewhere, for instance.

-- 
Alex Bligh



From bothie at gmx.de  Fri Nov  5 11:32:49 2010
From: bothie at gmx.de (Bodo Thiesen)
Date: Fri, 5 Nov 2010 12:32:49 +0100
Subject: How to generate a large file allocating space
In-Reply-To: <B9C57562859DC18CBC21818B@nimrod.local>
References: <9A62FED22DF5F54862C68579@nimrod.local>
	<20101031152351.GA20833@wolff.to>
	<2A382F5D94CB78493D1760C9@Ximines.local>
	<250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<20101104134638.31d98a5c@gmx.de>
	<23A690AD52CEF71A1C450278@nimrod.local>
	<20101105000545.70ee4750@gmx.de>
	<B9C57562859DC18CBC21818B@nimrod.local>
Message-ID: <20101105123249.3427c1ab@gmx.de>

* Alex Bligh <alex at alex.org.uk> hat geschrieben:
> I might have another look at using lvm as a blockstore, then running our
> stuff inside lvm. But I didn't think lvm was capable of running thousands
> of LVs per volume group. ext4 is just fine for that. Perhaps I am
> slating lvm unfairly.

The number of logical volumes you can create should be mostly dependand on
the size of the metadata area. A short look on man pvcreate revealed the
command line argument --metadatasize size. Besides of this, lvm should be
able to handle any arbitrary number of logical volumes as long as the
metadata area is big enough to hold the new config. (The same applies to
ext2 and ext3 - if you don't have inodes left, you can't create new files
even with thousands of free terabytes - don't know, if this limitation
still exists in ext4, I'd guess "yes".)

So, my tip would be to just create a pv with a very bit metadata
size (i.e. 512 MB or even bigger) and write a script to create a few
thousand pv on that pv, something like this

pvcreate --metadatasize 512M /dev/foobar
lvcreate foobars /dev/foobar
for i in $(seq 1 1 5000)
do
	lvcreate --size 256M -n foobar$i foobars
done

Either it works - or not ...

Regards, Bodo



From tytso at mit.edu  Sat Nov  6 16:30:21 2010
From: tytso at mit.edu (Ted Ts'o)
Date: Sat, 6 Nov 2010 12:30:21 -0400
Subject: How to generate a large file allocating space
In-Reply-To: <8B59E33D0CCCC9F5DF30CF2E@nimrod.local>
References: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
	<687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
	<20101104161613.GC4631@thunk.org>
	<90029CFD2E61A5BA29AD505F@nimrod.local>
	<20101104191734.GG7553@thunk.org>
	<8B59E33D0CCCC9F5DF30CF2E@nimrod.local>
Message-ID: <20101106163021.GA2935@thunk.org>

On Fri, Nov 05, 2010 at 08:14:56AM +0000, Alex Bligh wrote:
> 
> >Patches to do this wouldn't be that hard.  The harder part would
> >probably be the politics on fs-devel regarding the semantics of
> >FALLOC_FL_EXPOSE_OLD_DATA.
> 
> Also presumably there would be some pressure to make it work for
> every filesystem that supported fallocate().

No, I don't think so.  There are plenty of file systems that don't
support fallocate(), and it's not a short step to consider adding new
flags which might not be supported by all.

> Thaks, that's really helpful. Are the extents always the leaves? IE
> will next_leaf take me through extent by extent?

Yes, to both questions.

> Does your "please don't use this in production" warning apply to
> tst_extents.c or to the whole of lib/ext2fs? The library calls
> seem quite a good way to get the list of extents and are
> presumably what fsck etc. use.

No, only to tst_extents.c.  It has a tst_ prefix precisely because
it's a little hacky, and it was something that I had never intended to
be installed by distributions.  (I got a little burned by "filefrag",
which was never intended to be installed at distribution, which is why
the code is so hackish, and why it's not internationalized, etc.)  I
just want to make sure tst_extents doesn't similarly escape.

The libext2fs is designed to be a production-quality codebase, with a
stable ABI.  So feel free to use it in good health.  :-)

     	     	       		   	   	     - Ted



From alex at alex.org.uk  Sat Nov  6 19:44:22 2010
From: alex at alex.org.uk (Alex Bligh)
Date: Sat, 06 Nov 2010 19:44:22 +0000
Subject: How to generate a large file allocating space
In-Reply-To: <20101106163021.GA2935@thunk.org>
References: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca>
	<8E4E90C9C1C482942DD479C6@nimrod.local>
	<4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca>
	<B06DC20F102CA75451E9914C@Ximines.local>
	<20101102014946.GB24500@thunk.org>
	<687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk>
	<20101104161613.GC4631@thunk.org>
	<90029CFD2E61A5BA29AD505F@nimrod.local>
	<20101104191734.GG7553@thunk.org>
	<8B59E33D0CCCC9F5DF30CF2E@nimrod.local>
	<20101106163021.GA2935@thunk.org>
Message-ID: <0483B9053CD0D9ED0F75E70F@nimrod.local>



--On 6 November 2010 12:30:21 -0400 Ted Ts'o <tytso at mit.edu> wrote:

> On Fri, Nov 05, 2010 at 08:14:56AM +0000, Alex Bligh wrote:
>>
>> > Patches to do this wouldn't be that hard.  The harder part would
>> > probably be the politics on fs-devel regarding the semantics of
>> > FALLOC_FL_EXPOSE_OLD_DATA.
>>
>> Also presumably there would be some pressure to make it work for
>> every filesystem that supported fallocate().
>
> No, I don't think so.  There are plenty of file systems that don't
> support fallocate(), and it's not a short step to consider adding new
> flags which might not be supported by all.

Thanks. I might have a go. Patches to linux-ext4@ ?

>> Thaks, that's really helpful. Are the extents always the leaves? IE
>> will next_leaf take me through extent by extent?
>
> Yes, to both questions.
>
>> Does your "please don't use this in production" warning apply to
>> tst_extents.c or to the whole of lib/ext2fs? The library calls
>> seem quite a good way to get the list of extents and are
>> presumably what fsck etc. use.
>
> No, only to tst_extents.c.
...
> The libext2fs is designed to be a production-quality codebase, with a
> stable ABI.  So feel free to use it in good health.  :-)

Again, thanks for that.

-- 
Alex Bligh



From kernel at nedharvey.com  Wed Nov 10 23:38:37 2010
From: kernel at nedharvey.com (Edward Ned Harvey)
Date: Wed, 10 Nov 2010 18:38:37 -0500
Subject: Challenge: dump | restore
Message-ID: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com>

This runs for a few minutes, and results in a broken pipe.  After which, at
least some fragments of the filesystem have been restored on the destination
filesystem.  At least some directories.

cd /mnt/newFS

dump -0af - /dev/someVG/sourceFS | restore -rf -

 

This works fine.

cd ~

dump -0af somefile /dev/someVG/sourceFS 

cd /mnt/newFS

restore -rf ~/newFS

 

Source and destination filesystems are ext3, 194G and 857G.  Destination
filesystem is created with simply default mkfs.ext3.  There are only approx.

200M used in the source filesystem, of which, there's no particularly huge
directory or number of inodes or anything unusual...  I forced the fsck, and
it came back clean.

My only guess is that there seems to be something wrong with the pipe.
Like, it's not streaming the bits properly or something.  Is it possible to
overflow a pipe or something?  I can't think of any good explanation for
this weird behavior.  What could cause a pipe to break, aside from the
receiving process terminating unexpectedly?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20101110/0cd1ad95/attachment.htm>

From bothie at gmx.de  Sat Nov 13 15:40:19 2010
From: bothie at gmx.de (Bodo Thiesen)
Date: Sat, 13 Nov 2010 16:40:19 +0100
Subject: Challenge: dump | restore
In-Reply-To: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com>
References: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com>
Message-ID: <20101113164019.45ad3a0b@gmx.de>

* Edward Ned Harvey <kernel at nedharvey.com> hat geschrieben:
> dump -0af - /dev/someVG/sourceFS | restore -rf -
> My only guess is that there seems to be something wrong with the pipe.
> Like, it's not streaming the bits properly or something.  Is it possible to
> overflow a pipe or something?

The sending process should block until the receiving process reads the
data.

> I can't think of any good explanation for
> this weird behavior.  What could cause a pipe to break, aside from the
> receiving process terminating unexpectedly?

I recommend using strace to trace it down:

strace -f -o dump.strace dump -0af - /dev/someVG/sourceFS | \
strace -f -o restore.strace restore -rf -

Then take a closer look on the tails of the two files, maybe that reveals
the problem already.

Regards, Bodo



From samuel at bcgreen.com  Tue Nov 16 11:12:51 2010
From: samuel at bcgreen.com (Stephen Samuel)
Date: Tue, 16 Nov 2010 03:12:51 -0800
Subject: Challenge: dump | restore
In-Reply-To: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com>
References: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com>
Message-ID: <AANLkTik=2125foS2hSoWV0-3WqKTwsLwU7J7fTuxO6f6@mail.gmail.com>

Try :

cd ~

dump -0af -  /dev/someVG/sourceFS | ( cd /mnt/newFS; restore -rf - ~/newFS )


On Wed, Nov 10, 2010 at 3:38 PM, Edward Ned Harvey <kernel at nedharvey.com>wrote:

> This runs for a few minutes, and results in a broken pipe.  After which, at
> least some fragments of the filesystem have been restored on the destination
> filesystem.  At least some directories.
>
> cd /mnt/newFS
>
> dump -0af - /dev/someVG/sourceFS | restore -rf -
>
>
>
> This works fine.
>
> cd ~
>
> dump -0af somefile /dev/someVG/sourceFS
>
> cd /mnt/newFS
>
> restore -rf ~/newFS
>
>
>
> Source and destination filesystems are ext3, 194G and 857G.  Destination
> filesystem is created with simply default mkfs.ext3.  There are only approx.
>
> 200M used in the source filesystem, of which, there's no particularly huge
> directory or number of inodes or anything unusual...  I forced the fsck, and
> it came back clean.
>
> My only guess is that there seems to be something wrong with the pipe.
> Like, it's not streaming the bits properly or something.  Is it possible to
> overflow a pipe or something?  I can't think of any good explanation for
> this weird behavior.  What could cause a pipe to break, aside from the
> receiving process terminating unexpectedly?
>


-- 
Stephen Samuel http://www.bcgreen.com  Software, like love,
778-861-7641                              grows when you give it away
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20101116/23685017/attachment.htm>

From lists at alpha345.com  Tue Nov 16 17:33:07 2010
From: lists at alpha345.com (Matthew Dickinson)
Date: Tue, 16 Nov 2010 11:33:07 -0600 (CST)
Subject: corrupted quotas/running quotacheck on a mounted filesystem
Message-ID: <alpine.LRH.2.00.1011161128320.22001@tornado.pdcompsys.com>

Hi,

1TB ext3 volume mounted via iSCSI on a RHEL5.5 system

the quotas on one of my systems (with ~5k users) seems to have become out 
of sync with reality (500MB reported, but 100G+ in reality) - i'm seeing 
some odd behavior when running the quota 
tools also. For example, "quota -u" shows no quota for some users, but 
when running "edquota", they're visible in the list.

As such, I think i'm in need to running the quotacheck utility. From the 
man page, it would appear that this is to be run on an unmounted 
filesystem - is this accurate? can it be safely run on a mounted 
filesystem? I understand that the results might not be completely accurate 
if information changes during the quotacheck run, but it should be more 
accurate than it is now!

I'm not really able to take the system offline for an unmounted filesystem 
for another month or so, but would really like to get some more accurate 
figures in the quota.

Or is there another option i've missed?

Thanks,

Matthew



From jpiszcz at lucidpixels.com  Tue Nov  9 10:24:28 2010
From: jpiszcz at lucidpixels.com (Justin Piszcz)
Date: Tue, 9 Nov 2010 05:24:28 -0500 (EST)
Subject: Assertion failure in journal_commit_transaction() at
	fs/jbd/commit.c:496: "commit_transaction->t_nr_buffers <=
	commit_transaction->t_outstanding_credits"
Message-ID: <alpine.DEB.2.00.1011090519120.5074@p34.internal.lan>

Hi,

I have the same errors as these folks:
https://bugzilla.redhat.com/show_bug.cgi?id=563247

OS: RHEL 5 x86_64
Kernel: 2.6.18

I see this on a wide variety of hardware and according to the bug report,
it happens whether its hardware raid or dm.

Since there are no records of this bug/issue on LKML I thought I'd pose the
question.  I am just looking into what is the root cause here, is it an ext3
bug?

Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:496: "commit_transaction->t_nr_buffers <= commit_transaction->t_outstanding_credits"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/jbd/commit.c:496
invalid opcode: 0000 [1] SMP
last sysfs file: /class/scsi_host/host0/stats
CPU 3
Modules linked in: i2c_dev eeprom adm1026 hwmon_vid i2c_amd756 nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_mod video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp floppy shpchp k8temp k8_edac hwmon parport_pc amd_rng edac_mc parport i2c_amd8111 tg3 serio_raw i2c_core pcspkr sg 3w_9xxx sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 1473, comm: kjournald Not tainted 2.6.18-164.el5az00 #1
RIP: 0010:[<ffffffff88033b05>]  [<ffffffff88033b05>] :jbd:journal_commit_transaction+0x6a3/0x106a
RSP: 0018:ffff81020ee63de0  EFLAGS: 00010286
RAX: 000000000000009d RBX: ffff810133f3c130 RCX: ffffffff80304ba8
RDX: ffffffff80304ba8 RSI: 0000000000000000 RDI: ffffffff80304ba0
RBP: ffff81010f6f4200 R08: ffffffff80304ba8 R09: 000000000000003d
R10: ffff81020ee63a80 R11: 0000000000000280 R12: ffff81011520b730
R13: ffff8101139b41c0 R14: 0000000000000001 R15: ffff81010e441000
FS:  00002b03c3c59d30(0000) GS:ffff8101139aa6c0(0000) knlGS:00000000f7f228d0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aac9c7f20a0 CR3: 00000001fe72d000 CR4: 00000000000006e0
Process kjournald (pid: 1473, threadinfo ffff81020ee62000, task ffff81010e8aa820)
Stack:  00021f55e0d52ccc ffff81010e441000 ffff810100000000 000000b500000000
  0000000000000000 ffff81010e8aa820 ffffffff8009f468 ffff81020ee63e18
  ffff81020ee63e18 00000000ffffffff 0000000000000286 ffffffff8004b241
Call Trace:
  [<ffffffff8009f468>] autoremove_wake_function+0x0/0x2e
  [<ffffffff8004b241>] try_to_del_timer_sync+0x51/0x5a
  [<ffffffff8803758c>] :jbd:kjournald+0xc1/0x213
  [<ffffffff8009f468>] autoremove_wake_function+0x0/0x2e
  [<ffffffff880374cb>] :jbd:kjournald+0x0/0x213
  [<ffffffff8009f250>] keventd_create_kthread+0x0/0xc4
  [<ffffffff8003295e>] kthread+0xfe/0x132
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff8009f250>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032860>] kthread+0x0/0x132
  [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 0f 0b 68 74 8e 03 88 c2 f0 01 45 31 e4 45 31 ff 45 31 f6 c7
RIP  [<ffffffff88033b05>] :jbd:journal_commit_transaction+0x6a3/0x106a
  RSP <ffff81020ee63de0>

Message from<0>Kernel panic - not syncing: Fatal exception