From adilger at dilger.ca Mon Nov 1 06:13:33 2010 From: adilger at dilger.ca (Andreas Dilger) Date: Mon, 1 Nov 2010 00:13:33 -0600 Subject: How to generate a large file allocating space In-Reply-To: <2A382F5D94CB78493D1760C9@Ximines.local> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> Message-ID: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> On 2010-10-31, at 09:05, Alex Bligh wrote: > I am trying to allocate huge files on ext4. I will then read the extents > within the file and write to the disk at a block level rather than using > ext4 (the FS will not be mounted at this point). This will allow me to > have several iSCSI clients hitting the same LUN r/w safely. And at > some point when I know the relevant iSCSI stuff has stopped and been > flushed to disk, I may unlink the file. Hmm, why not simply use a cluster filesystem to do this? GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), and Lustre uses ext4 as the underlying filesystem, and though it doesn't allow direct client writes to the same disk it will allow writing at 95% of the underlying raw disk performance from multiple clients. Cheers, Andreas From alex at alex.org.uk Mon Nov 1 06:14:09 2010 From: alex at alex.org.uk (Alex Bligh) Date: Mon, 01 Nov 2010 07:14:09 +0100 Subject: How to generate a large file allocating space In-Reply-To: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> Message-ID: <8E4E90C9C1C482942DD479C6@nimrod.local> --On 1 November 2010 00:13:33 -0600 Andreas Dilger wrote: > Hmm, why not simply use a cluster filesystem to do this? > > GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), > and Lustre uses ext4 as the underlying filesystem, and though it doesn't > allow direct client writes to the same disk it will allow writing at 95% > of the underlying raw disk performance from multiple clients. Essentially because none of them do exactly what I need them to do, so I am reinventing a slightly different wheel... -- Alex Bligh From adilger.kernel at dilger.ca Mon Nov 1 21:45:12 2010 From: adilger.kernel at dilger.ca (Andreas Dilger) Date: Mon, 1 Nov 2010 15:45:12 -0600 Subject: How to generate a large file allocating space In-Reply-To: <8E4E90C9C1C482942DD479C6@nimrod.local> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> Message-ID: <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> On 2010-11-01, at 00:14, Alex Bligh wrote: > --On 1 November 2010 00:13:33 -0600 Andreas Dilger wrote: >> Hmm, why not simply use a cluster filesystem to do this? >> >> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), >> and Lustre uses ext4 as the underlying filesystem, and though it doesn't >> allow direct client writes to the same disk it will allow writing at 95% >> of the underlying raw disk performance from multiple clients. > > Essentially because none of them do exactly what I need them to do, > so I am reinventing a slightly different wheel... Personally, I hate re-inventing things vs. 
improving something to make it do what you want, since it means (probably) that your code will be used by you alone, while making an improvement to an existing cluster filesystem will both meet your needs and allow others to benefit as well. What is it you really want to do in the end? Shared concurrent writers to the same file? High-bandwidth IO to the underlying disk? Cheers, Andreas From alex at alex.org.uk Mon Nov 1 22:58:12 2010 From: alex at alex.org.uk (Alex Bligh) Date: Mon, 01 Nov 2010 22:58:12 +0000 Subject: How to generate a large file allocating space In-Reply-To: <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> Message-ID: --On 1 November 2010 15:45:12 -0600 Andreas Dilger wrote: > What is it you really want to do in the end? Shared concurrent writers > to the same file? High-bandwidth IO to the underlying disk? High bandwidth I/O to the underlying disk is part of it - only one reader/writer per file. We're really using ext4 just for its extents capability, i.e. allocating space, plus the convenience of directory lookup to find the set of extents. It's easier to do this than to write this bit from scratch, and the files are pretty static in size (i.e. they only grow, and grow infrequently by large amounts). The files on ext4 correspond to large chunks of disks we are combining together using an device-mapper type thing (but different), and on top of that lives arbitary real filing systems. Because our device-mapper type thing already understands what blocks have been written to, we already have a layer that prevents the data on the disk before the file's creation being exposed. That's why I don't need ext4 to zero them out. I suppose in that sense it is like the swap file case. Oh, and because these files are allocated infrequently, I am not /that/ concerned about performance (famous last words). The performance critical stuff is done via direct writes to the SAN and don't even pass through ext4 (or indeed through any single host). -- Alex Bligh From tytso at mit.edu Tue Nov 2 01:49:46 2010 From: tytso at mit.edu (Ted Ts'o) Date: Mon, 1 Nov 2010 21:49:46 -0400 Subject: How to generate a large file allocating space In-Reply-To: References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> Message-ID: <20101102014946.GB24500@thunk.org> On Mon, Nov 01, 2010 at 10:58:12PM +0000, Alex Bligh wrote: > High bandwidth I/O to the underlying disk is part of it - only one > reader/writer per file. We're really using ext4 just for its extents > capability, i.e. allocating space, plus the convenience of directory > lookup to find the set of extents. > > It's easier to do this than to write this bit from scratch, and the > files are pretty static in size (i.e. they only grow, and grow > infrequently by large amounts). The files on ext4 correspond to large > chunks of disks we are combining together using an device-mapper > type thing (but different), and on top of that lives arbitary real > filing systems. 
Because our device-mapper type thing already > understands what blocks have been written to, we already have a layer > that prevents the data on the disk before the file's creation being > exposed. That's why I don't need ext4 to zero them out. I suppose > in that sense it is like the swap file case. But why not just use O_DIRECT? Do you really need to access the disk directly, as opposed to using O_DIRECT? - Ted From adilger.kernel at dilger.ca Tue Nov 2 03:21:10 2010 From: adilger.kernel at dilger.ca (Andreas Dilger) Date: Mon, 1 Nov 2010 21:21:10 -0600 Subject: How to generate a large file allocating space In-Reply-To: References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> Message-ID: On 2010-11-01, at 16:58, Alex Bligh wrote: > --On 1 Nov 2010 15:45:12 Andreas Dilger wrote: >> What is it you really want to do in the end? Shared concurrent writers >> to the same file? High-bandwidth IO to the underlying disk? > > High bandwidth I/O to the underlying disk is part of it - only one > reader/writer per file. We're really using ext4 just for its extents > capability, i.e. allocating space, plus the convenience of directory > lookup to find the set of extents. > > It's easier to do this than to write this bit from scratch, and the > files are pretty static in size (i.e. they only grow, and grow > infrequently by large amounts). The files on ext4 correspond to large > chunks of disks we are combining together using an device-mapper > type thing (but different), and on top of that lives arbitary real > filing systems. Because our device-mapper type thing already > understands what blocks have been written to, we already have a layer > that prevents the data on the disk before the file's creation being > exposed. That's why I don't need ext4 to zero them out. I suppose > in that sense it is like the swap file case. > > Oh, and because these files are allocated infrequently, I am not > /that/ concerned about performance (famous last words). The performance > critical stuff is done via direct writes to the SAN and don't even > pass through ext4 (or indeed through any single host). Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well. Cheers, Andreas From alex at alex.org.uk Tue Nov 2 07:58:02 2010 From: alex at alex.org.uk (Alex Bligh) Date: Tue, 2 Nov 2010 07:58:02 +0000 Subject: How to generate a large file allocating space In-Reply-To: <20101102014946.GB24500@thunk.org> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> Message-ID: <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> Ted, On 2 Nov 2010, at 01:49, "Ted Ts'o" wrote: > But why not just use O_DIRECT? Do you really need to access the > disk directly, as opposed to using O_DIRECT? > Because more than one machine will be accessing the data on the ext4 volume (over iSCSI), though access to the large files is mediated by locks higher up. To use O_DIRECT each accessing machine would need to have the volume mounted, rather than merely receiving a list of extents. 
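For context, the mechanics being described can be sketched in a few lines of C: the one node that has the ext4 filesystem mounted preallocates the file with fallocate(), then asks the filesystem for the physical extent list via the FIEMAP ioctl, and that list is what gets handed to the iSCSI clients. This is purely illustrative; the path and size below are invented and none of the thread's actual tooling is shown.

    /*
     * Illustrative sketch only: preallocate a large file, then read back
     * its physical extent map with FIEMAP.  Minimal error handling; a
     * real tool would loop if the file has more than `max` extents.
     */
    #define _GNU_SOURCE
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(void)
    {
        const char *path = "/mnt/ext4/blockstore/chunk0001";  /* hypothetical name */
        off_t len = (off_t)16 << 30;                          /* 16 GiB, for example */
        unsigned int max = 512, i;

        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Allocate space without writing it; the extents come back "unwritten". */
        if (fallocate(fd, 0, 0, len) < 0) { perror("fallocate"); return 1; }

        /* Ask the filesystem for the physical extents backing the file. */
        struct fiemap *fm = calloc(1, sizeof(*fm) + max * sizeof(struct fiemap_extent));
        if (!fm) return 1;
        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_flags = FIEMAP_FLAG_SYNC;
        fm->fm_extent_count = max;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FIEMAP"); return 1; }

        for (i = 0; i < fm->fm_mapped_extents; i++) {
            struct fiemap_extent *e = &fm->fm_extents[i];
            printf("logical %llu physical %llu length %llu flags 0x%x\n",
                   (unsigned long long)e->fe_logical,
                   (unsigned long long)e->fe_physical,
                   (unsigned long long)e->fe_length,
                   e->fe_flags);  /* FIEMAP_EXTENT_UNWRITTEN will be set here */
        }
        free(fm);
        close(fd);
        return 0;
    }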
-- Alex Bligh From alex at alex.org.uk Tue Nov 2 08:01:28 2010 From: alex at alex.org.uk (Alex Bligh) Date: Tue, 2 Nov 2010 08:01:28 +0000 Subject: How to generate a large file allocating space In-Reply-To: References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> Message-ID: <252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk> On 2 Nov 2010, at 03:21, Andreas Dilger wrote: > > Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well. Unfortunately I need something not prototype. Fortunately I don't need many of lustre's or ceph's features. -- Alex Bligh From rwheeler at redhat.com Tue Nov 2 11:20:48 2010 From: rwheeler at redhat.com (Ric Wheeler) Date: Tue, 02 Nov 2010 07:20:48 -0400 Subject: How to generate a large file allocating space In-Reply-To: <252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk> Message-ID: <4CCFF410.3040905@redhat.com> On 11/02/2010 04:01 AM, Alex Bligh wrote: > On 2 Nov 2010, at 03:21, Andreas Dilger wrote: > >> Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well. > Unfortunately I need something not prototype. Fortunately I don't need many of lustre's or ceph's features. > Sounds like you will end up writing something brand new - much less stable than any of the options mentioned in the thread previously. Ric From alex at alex.org.uk Tue Nov 2 17:37:29 2010 From: alex at alex.org.uk (Alex Bligh) Date: Tue, 02 Nov 2010 17:37:29 +0000 Subject: How to generate a large file allocating space In-Reply-To: <4CCFF410.3040905@redhat.com> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <252E7275-714E-4881-8B27-E7CB1D0C5424@alex.org.uk> <4CCFF410.3040905@redhat.com> Message-ID: <2DD84B976B2AFDE5E6F66A7E@nimrod.local> --On 2 November 2010 07:20:48 -0400 Ric Wheeler wrote: > Sounds like you will end up writing something brand new - much less > stable than any of the options mentioned in the thread previously. Well, the new component will be something simple. All I really need to know is how to mark the inodes as allocated and initialised, rather than unwritten. 
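For reference, "unwritten" is a property of each extent record rather than of the inode as a whole. A rough sketch of the on-disk ext4 extent record and its length encoding follows (field names recalled from the kernel's fs/ext4/ext4_extents.h of that era, so treat them as approximate; on-disk values are little-endian):

    /*
     * Rough sketch of the on-disk ext4 extent record.  An extent is
     * unwritten when ee_len > 32768 (the high bit of ee_len); the real
     * length is then ee_len - 32768.  ee_len == 32768 exactly is a
     * full-length *initialized* extent, hence the ">" below.
     */
    #include <stdint.h>

    struct ext4_extent_sketch {
        uint32_t ee_block;    /* first logical block covered by the extent */
        uint16_t ee_len;      /* block count; MSB doubles as the "unwritten" flag */
        uint16_t ee_start_hi; /* high 16 bits of the physical start block */
        uint32_t ee_start_lo; /* low 32 bits of the physical start block */
    };

    #define EXT_INIT_MAX_LEN 32768U   /* 1 << 15 */

    static int ext_is_unwritten(const struct ext4_extent_sketch *ex)
    {
        return ex->ee_len > EXT_INIT_MAX_LEN;
    }

    static unsigned int ext_actual_len(const struct ext4_extent_sketch *ex)
    {
        return ex->ee_len <= EXT_INIT_MAX_LEN ? ex->ee_len
                                              : ex->ee_len - EXT_INIT_MAX_LEN;
    }

Marking an extent initialised therefore amounts to clearing that high bit, which is what the libext2fs approach discussed later in the thread does.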
-- Alex Bligh From bothie at gmx.de Thu Nov 4 12:46:38 2010 From: bothie at gmx.de (Bodo Thiesen) Date: Thu, 4 Nov 2010 13:46:38 +0100 Subject: How to generate a large file allocating space In-Reply-To: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> Message-ID: <20101104134638.31d98a5c@gmx.de> Hello Alex, hello Andreas * Andreas Dilger hat geschrieben: >* On 2010-10-31, at 09:05, Alex Bligh wrote: >> I am trying to allocate huge files on ext4. I will then read the extents >> within the file and write to the disk at a block level rather than using >> ext4 (the FS will not be mounted at this point). This will allow me to >> have several iSCSI clients hitting the same LUN r/w safely. And at >> some point when I know the relevant iSCSI stuff has stopped and been >> flushed to disk, I may unlink the file. Question: Did you consider using plain LVM for this purpose? By creating a logical volume, no data is initialized, only the meta data is created (what seems to be exactly what you need). Then, each client may access one logical volume r/w. Retrieving the extents list is very easy as well. And because there are no group management data (cluster bitmaps, inode bitmaps and tables) of any kind, you will end up with only one single extent in most cases regardless of the size of the volume you've created. > Hmm, why not simply use a cluster filesystem to do this? > > GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), They are SUPPOSED to do that - in theory. The last two weekends I tried to set up a stable DRDB+GFS2 setup - I failed. Then I tried OCFS2 - again I failed. The setup was simple: Two identical Systems with 10*500GB disks and a hardware RAID6 yielding 4GB user disk space. That was used to create a DRDB (no LVM or other stuff like crypto in betreen). Both were set to primary and then I created GFS2 (later OCFS2) and started the additional tools like clvm/o2bc. Then mounting the file systems on both machines - everything worked up to here. machine1: dd if=/dev/zero of=/mnt/4tb/file1 machine2: dd if=/dev/zero of=/mnt/4tb/file2 Worked well in both setups on both machines machine1: let i=0; while let i=i+1; do echo "A$i" >> /mnt/4tb/file3; done machine2: let i=0; while let i=i+1; do echo "B$i" >> /mnt/4tb/file3; done GFS2: First machine works well, second machine starts returning EIO on *ANY* request (even ls /mnt/4tb). Umount impossible. Had to reboot -> #gfs2 #fail OCFS2: passed this test as well as the next one machine1: let i=0; while let i=i+1; do echo "A$i"; done >> /mnt/4tb/file4 machine2: let i=0; while let i=i+1; do echo "B$i"; done >> /mnt/4tb/file4 Then I rebooted one machine with "echo b > /proc/sysrq-trigger" while the last test was still in progress. Guess what: The other machine stopped working. No reads, no writes. It didn't evern go on when the first machine came back. I had then to reboot the second one as well to continue using the file system. Maybe I did something wrong, maybe the file systems just aren't as stable as we expected them to be, anyways, we decided now to use stable systems, i.e. drbd in primary/secondary setup and ext3 with failover to the other system if the primary goes down, and as the system already went productive, we're not gonna change anything here in the near future. So consider this report as strictly informative. 
BTW: No, I do not longer have the config files, I didn't save them and the systems have been completely reinstalled after testing the final setup succeeded to wipe out everything left over from the previous attempts. Regards, Bodo From tytso at mit.edu Thu Nov 4 16:16:13 2010 From: tytso at mit.edu (Ted Ts'o) Date: Thu, 4 Nov 2010 12:16:13 -0400 Subject: How to generate a large file allocating space In-Reply-To: <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> Message-ID: <20101104161613.GC4631@thunk.org> On Tue, Nov 02, 2010 at 07:58:02AM +0000, Alex Bligh wrote: > On 2 Nov 2010, at 01:49, "Ted Ts'o" wrote: > > But why not just use O_DIRECT? Do you really need to access the > > disk directly, as opposed to using O_DIRECT? > > > Because more than one machine will be accessing the data on the ext4 > volume (over iSCSI), though access to the large files is mediated by > locks higher up. To use O_DIRECT each accessing machine would need > to have the volume mounted, rather than merely receiving a list of > extents. Well, I would personally not be against an extension to fallocate() where if the caller of the syscall specifies a new flag, that might be named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root privs or (if capabilities are enabled) CAP_DAC_OVERRIDE && CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents would be marked as initialized without actually initializing the blocks first. I don't know whether it will get past the fs-devel bike shed painting crew, but I do have some cluster file system users who would like something similar. In their case they will be writing the files using Direct I/O, and the objects are all checksumed at the cluster file system level, and if the object has the wrong checksum, then the cluster file system will ask another server for the object. Since the cluster file system is considered trusted, and it verifies the expected object checksum before releasing the data, there is no security issue. You do realize, though, that it sounds like with your design you are replicating the servers, but not the disk devices --- so if your disk device explodes, you're Sadly Out of Luck. Sure you can use super-expensive storage arrays, but if you're writing your own cluster file system, why not create a design which uses commodity disks and worry about replicating data across servers at the cluster file system level? - Ted From alex at alex.org.uk Thu Nov 4 18:22:39 2010 From: alex at alex.org.uk (Alex Bligh) Date: Thu, 04 Nov 2010 18:22:39 +0000 Subject: How to generate a large file allocating space In-Reply-To: <20101104134638.31d98a5c@gmx.de> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <20101104134638.31d98a5c@gmx.de> Message-ID: <23A690AD52CEF71A1C450278@nimrod.local> --On 4 November 2010 13:46:38 +0100 Bodo Thiesen wrote: > hat geschrieben: >> * On 2010-10-31, at 09:05, Alex Bligh wrote: >>> I am trying to allocate huge files on ext4. 
I will then read the extents >>> within the file and write to the disk at a block level rather than using >>> ext4 (the FS will not be mounted at this point). This will allow me to >>> have several iSCSI clients hitting the same LUN r/w safely. And at >>> some point when I know the relevant iSCSI stuff has stopped and been >>> flushed to disk, I may unlink the file. > > Question: Did you consider using plain LVM for this purpose? > By creating a > logical volume, no data is initialized, only the meta data is created > (what seems to be exactly what you need). Then, each client may access one > logical volume r/w. Retrieving the extents list is very easy as well. And > because there are no group management data (cluster bitmaps, inode bitmaps > and tables) of any kind, you will end up with only one single extent in > most cases regardless of the size of the volume you've created. Plain LVM or Clustered LVM? Clustered LVM has some severe limitations, including needing to restart the entire cluster to add nodes, which is not acceptable. Plain LVM has two types of issue: 1. Without clustered LVM, as far as I can tell there is no locking of metadata. I have no guarantees that access to the disk does not go outside the LV's allocation. For instance, when a CoW snapshot is written to and expanded, the metadata must be written to, and there is no locking for that. 2. Snapshots suffer severe limitations. For instance, it is not possible to generate arbitrarily deep trees of snapshots (i.e. CoW on top of CoW) without an arbitrarily deep tree of loopback mounted lvm devices, which does not sound like a good idea. I think you can only use lvm like this where you have simple volumes mounted, and in essence take no snapshots. To answer the implied question, yes we have a (partial) lvm replacement. >> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), > > They are SUPPOSED to do that - in theory We have had similar experiences and don't actually need all the features (and thus complexity) that a true clustered filing system presents. -- Alex Bligh From alex at alex.org.uk Thu Nov 4 18:29:47 2010 From: alex at alex.org.uk (Alex Bligh) Date: Thu, 04 Nov 2010 18:29:47 +0000 Subject: How to generate a large file allocating space In-Reply-To: <20101104161613.GC4631@thunk.org> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> <20101104161613.GC4631@thunk.org> Message-ID: <90029CFD2E61A5BA29AD505F@nimrod.local> Ted, --On 4 November 2010 12:16:13 -0400 Ted Ts'o wrote: > Well, I would personally not be against an extension to fallocate() > where if the caller of the syscall specifies a new flag, that might be > named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root > privs or (if capabilities are enabled) CAP_DAC_OVERRIDE && > CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents > would be marked as initialized without actually initializing the > blocks first. That sounds a lot like "send patches" which I just might do, if only to gain better understanding as to what is going on. I seem to remember (from lwn's summary of lkml) that the proposed options for fallocate() got a bit baroque to start with, and people then simplified down to zero options. 
Perhaps that was a simplification too far. In the mean time, particularly as I'd ideally like to avoid a kernel modification, is there a safe way I could use or modify the ext2 library to run through the extents of a fallocated() file and clear the "unwritten" bit? If I clear that (which from memory is the top bit of the extent length), is that alone safe? (on an unmounted file system, obviously). > You do realize, though, that it sounds like with your design you are > replicating the servers, but not the disk devices --- so if your disk > device explodes, you're Sadly Out of Luck. Sure you can use > super-expensive storage arrays, but if you're writing your own cluster > file system, why not create a design which uses commodity disks and > worry about replicating data across servers at the cluster file system > level? The particular use case here is for customers that have sunk huge amounts of money into expensive storage arrays, or for whatever reason have an aversion to storing anything on anything other than expensive storage arrays. I would tend to agree that replicating across commodity disks is in almost all cases a better technological solution, but the technology is still further away from readiness there. Sadly technological arguments don't always win the day, and we need something in the mean time... -- Alex Bligh From tytso at mit.edu Thu Nov 4 19:17:34 2010 From: tytso at mit.edu (Ted Ts'o) Date: Thu, 4 Nov 2010 15:17:34 -0400 Subject: How to generate a large file allocating space In-Reply-To: <90029CFD2E61A5BA29AD505F@nimrod.local> References: <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> <20101104161613.GC4631@thunk.org> <90029CFD2E61A5BA29AD505F@nimrod.local> Message-ID: <20101104191734.GG7553@thunk.org> On Thu, Nov 04, 2010 at 06:29:47PM +0000, Alex Bligh wrote: > > >Well, I would personally not be against an extension to fallocate() > >where if the caller of the syscall specifies a new flag, that might be > >named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root > >privs or (if capabilities are enabled) CAP_DAC_OVERRIDE && > >CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents > >would be marked as initialized without actually initializing the > >blocks first. > > That sounds a lot like "send patches" which I just might do, if only > to gain better understanding as to what is going on. Patches to do this wouldn't be that hard. The harder part would probably be the politics on fs-devel regarding the semantics of FALLOC_FL_EXPOSE_OLD_DATA. > I seem to remember (from lwn's summary of lkml) that the proposed > options for fallocate() got a bit baroque to start with, and people > then simplified down to zero options. Perhaps that was a simplification > too far. It was simplified down to one flag. But that means we have a flags field we can use to extend fallocate. > In the mean time, particularly as I'd ideally like to avoid a kernel > modification, is there a safe way I could use or modify the ext2 > library to run through the extents of a fallocated() file and clear > the "unwritten" bit? If I clear that (which from memory is the top > bit of the extent length), is that alone safe? (on an unmounted > file system, obviously). Yes, there most certainly is. 
The functions you'd probably want to use are ext2fs_extent_open(), and then either use ext2fs_extent_goto() to go to a specific extent, use ext2fs_extent_get() with the EXT2_EXTENT_NEXT operation to iterate over the extents, and then use ext2fs_extent_replace() to mutate the extent. Oh, and then use ext2fs_extent_close() when you're done looking at and/or changing the extents of a file. If you build tst_extents in lib/ext2fs, you can use commands like "inode" (to open the extents for a particular inode), and "root", "current", "next", "prev", "next_leaf", "prev_leaf", "next_sibling", "prev_sibling", "delete_node", "insert_node", "replace_node", "split_node", "print_all", "goto", etc. Please don't use this in production, but it's not a bad way to play with an extent tree, either for learning purposes or to create test cases. tst_extents.c is also a good way of seeing how the various libext2fs extent API's work. > I would tend to agree that replicating across commodity disks is > in almost all cases a better technological solution, but the > technology is still further away from readiness there. Sadly > technological arguments don't always win the day, and we need > something in the mean time... Well, things like Hadoopfs exist today, and Ceph (if you need a POSIX-level access) is admittedly less stable. But if you're starting from scratch, wouldn't that be pretty far away from readiness as well? - Ted From bothie at gmx.de Thu Nov 4 23:05:45 2010 From: bothie at gmx.de (Bodo Thiesen) Date: Fri, 5 Nov 2010 00:05:45 +0100 Subject: How to generate a large file allocating space In-Reply-To: <23A690AD52CEF71A1C450278@nimrod.local> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <20101104134638.31d98a5c@gmx.de> <23A690AD52CEF71A1C450278@nimrod.local> Message-ID: <20101105000545.70ee4750@gmx.de> Hello Alex * Alex Bligh hat geschrieben: >* --On 4 November 2010 13:46:38 +0100 Bodo Thiesen wrote: >> Question: Did you consider using plain LVM for this purpose? >> By creating a >> logical volume, no data is initialized, only the meta data is created >> (what seems to be exactly what you need). Then, each client may access one >> logical volume r/w. Retrieving the extents list is very easy as well. And >> because there are no group management data (cluster bitmaps, inode bitmaps >> and tables) of any kind, you will end up with only one single extent in >> most cases regardless of the size of the volume you've created. > Plain LVM or Clustered LVM? Clustered LVM has some severe limitations, > including needing to restart the entire cluster to add nodes, which > is not acceptable. > > Plain LVM has two types of issue: > > 1. Without clustered LVM, as far as I can tell there is no locking > of metadata. Possible (I don't know exactly) > I have no guarantees that access to the disk does not > go outside the LV's allocation. In LVM you create one logical volume. In the process of creating that volume, metadata get's updated. But just using the pre-existing logical volumes doesn't change the metadata. So, if you do all creation and removing of logical volumes on the same node, then you shouldn't get any problems here. "lvchange -a[yn] $lv" doesn't even change the metadata, it's a completely local operation (the local lvm cache get's updated, but that's all). 
So, if you provide access via nbd or something like that to the pv, all nodes could just use their portion of the lv without any problems. Besides: You wanted to use ext4. I suggested to use lvm in the same way you initially wanted to use ext4. So: On the main node you use the command "lvdisplay -v $lv" (or thatever the exact command line is) and you get a list of extents as result. Then you transfer that list to the client and it can access the disk directly without issuing any lvm command at all. > For instance, when a CoW snapshot is > written to and expanded, the metadata must be written to, and there > is no locking for that. Right, but that was not part of your use-case. If you need such things, you can't use ext4 as well. > 2. Snapshots suffer severe limitations. For instance, > it is not possible to generate arbitrarily deep trees of snapshots > (i.e. CoW on top of CoW) without an arbitrarily deep tree of loopback > mounted lvm devices, which does not sound like a good idea. > > I think you can only use lvm like this where you have simple volumes > mounted, and in essence take no snapshots. Yea, and I mentioned lvm, because that was exactly your use-case ;) > To answer the implied question, yes we have a (partial) lvm replacement. ---> Did you consider using plain LVM for this purpose? <--- That was an explicit question ;) >>> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK), >> They are SUPPOSED to do that - in theory > We have had similar experiences and don't actually need all the features > (and thus complexity) that a true clustered filing system presents. Ok, so not my fault ;) Regards, Bodo From alex at alex.org.uk Fri Nov 5 08:08:13 2010 From: alex at alex.org.uk (Alex Bligh) Date: Fri, 05 Nov 2010 08:08:13 +0000 Subject: How to generate a large file allocating space In-Reply-To: <20101105000545.70ee4750@gmx.de> References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <20101104134638.31d98a5c@gmx.de> <23A690AD52CEF71A1C450278@nimrod.local> <20101105000545.70ee4750@gmx.de> Message-ID: --On 5 November 2010 00:05:45 +0100 Bodo Thiesen wrote: >> For instance, when a CoW snapshot is >> written to and expanded, the metadata must be written to, and there >> is no locking for that. > > Right, but that was not part of your use-case. If you need such things, > you can't use ext4 as well. I should have been clearer. We aren't using ext4 as anything other than a block store. The CoW snapshots are done using our LVM replacement type thing which stores metadata in such a way that it safe to access it from multiple readers/writers. It would be lovely to use LVM for this, but not (as far as I can tell) possible. I might have another look at using lvm as a blockstore, then running our stuff inside lvm. But I didn't think lvm was capable of running thousands of LVs per volume group. ext4 is just fine for that. Perhaps I am slating lvm unfairly. 
-- Alex Bligh From alex at alex.org.uk Fri Nov 5 08:14:56 2010 From: alex at alex.org.uk (Alex Bligh) Date: Fri, 05 Nov 2010 08:14:56 +0000 Subject: How to generate a large file allocating space In-Reply-To: <20101104191734.GG7553@thunk.org> References: <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> <20101104161613.GC4631@thunk.org> <90029CFD2E61A5BA29AD505F@nimrod.local> <20101104191734.GG7553@thunk.org> Message-ID: <8B59E33D0CCCC9F5DF30CF2E@nimrod.local> Ted, --On 4 November 2010 15:17:34 -0400 Ted Ts'o wrote: > Patches to do this wouldn't be that hard. The harder part would > probably be the politics on fs-devel regarding the semantics of > FALLOC_FL_EXPOSE_OLD_DATA. Also presumably there would be some pressure to make it work for every filesystem that supported fallocate(). >> In the mean time, particularly as I'd ideally like to avoid a kernel >> modification, is there a safe way I could use or modify the ext2 >> library to run through the extents of a fallocated() file and clear >> the "unwritten" bit? If I clear that (which from memory is the top >> bit of the extent length), is that alone safe? (on an unmounted >> file system, obviously). > > Yes, there most certainly is. The functions you'd probably want to > use are ext2fs_extent_open(), and then either use ext2fs_extent_goto() > to go to a specific extent, use ext2fs_extent_get() with the > EXT2_EXTENT_NEXT operation to iterate over the extents, and then use > ext2fs_extent_replace() to mutate the extent. Oh, and then use > ext2fs_extent_close() when you're done looking at and/or changing the > extents of a file. > > If you build tst_extents in lib/ext2fs, you can use commands like > "inode" (to open the extents for a particular inode), and "root", > "current", "next", "prev", "next_leaf", "prev_leaf", "next_sibling", > "prev_sibling", "delete_node", "insert_node", "replace_node", > "split_node", "print_all", "goto", etc. Please don't use this in > production, but it's not a bad way to play with an extent tree, either > for learning purposes or to create test cases. tst_extents.c is also > a good way of seeing how the various libext2fs extent API's work. Thaks, that's really helpful. Are the extents always the leaves? IE will next_leaf take me through extent by extent? Does your "please don't use this in production" warning apply to tst_extents.c or to the whole of lib/ext2fs? The library calls seem quite a good way to get the list of extents and are presumably what fsck etc. use. > Well, things like Hadoopfs exist today, and Ceph (if you need a > POSIX-level access) No, just block layer access fortunately > is admittedly less stable. But if you're starting > from scratch, wouldn't that be pretty far away from readiness as well? The idea was to base as much as possible on existing running code (e.g. ext4) with as few variations as possible. I'd be very surprised if we end up exceeding a few thousand lines of code. All the cluster, lock management etc we are borrowing from elsewhere, for instance. 
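To make the API discussion above concrete, here is a minimal, untested sketch of walking one file's extent tree on an unmounted filesystem and clearing the uninitialized flag, using the libext2fs calls Ted listed. The device path, file lookup and error handling are illustrative only, and the exact flag and function names should be checked against the e2fsprogs headers before use.

    /*
     * Untested sketch: on an UNMOUNTED ext4 filesystem, walk the extent
     * tree of one file and clear the "uninitialized" flag on each leaf
     * extent via the libext2fs extent API discussed in this thread.
     */
    #include <stdio.h>
    #include <ext2fs/ext2fs.h>

    int main(int argc, char **argv)
    {
        const char *device = argv[1];   /* e.g. /dev/someVG/sourceFS */
        const char *path   = argv[2];   /* file whose extents to convert */
        ext2_filsys fs;
        ext2_ino_t ino;
        ext2_extent_handle_t handle;
        struct ext2fs_extent extent;
        errcode_t err;

        err = ext2fs_open(device, EXT2_FLAG_RW, 0, 0, unix_io_manager, &fs);
        if (err) { fprintf(stderr, "ext2fs_open: %ld\n", (long)err); return 1; }

        err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
        if (err) { fprintf(stderr, "ext2fs_namei: %ld\n", (long)err); return 1; }

        err = ext2fs_extent_open(fs, ino, &handle);
        if (err) { fprintf(stderr, "ext2fs_extent_open: %ld\n", (long)err); return 1; }

        /* Iterate over every node; only leaves describe real block ranges. */
        err = ext2fs_extent_get(handle, EXT2_EXTENT_ROOT, &extent);
        while (!err) {
            if ((extent.e_flags & EXT2_EXTENT_FLAGS_LEAF) &&
                (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)) {
                extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
                err = ext2fs_extent_replace(handle, 0, &extent);
                if (err) break;
            }
            err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT, &extent);
        }
        if (err && err != EXT2_ET_EXTENT_NO_NEXT)
            fprintf(stderr, "extent walk: %ld\n", (long)err);

        ext2fs_extent_free(handle);   /* release the handle (the "close" step) */
        ext2fs_close(fs);             /* flush metadata; run e2fsck afterwards */
        return 0;
    }

A kernel-side FALLOC_FL_EXPOSE_OLD_DATA flag, as discussed earlier in the thread, would make such a userspace pass unnecessary.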
-- Alex Bligh From bothie at gmx.de Fri Nov 5 11:32:49 2010 From: bothie at gmx.de (Bodo Thiesen) Date: Fri, 5 Nov 2010 12:32:49 +0100 Subject: How to generate a large file allocating space In-Reply-To: References: <9A62FED22DF5F54862C68579@nimrod.local> <20101031152351.GA20833@wolff.to> <2A382F5D94CB78493D1760C9@Ximines.local> <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <20101104134638.31d98a5c@gmx.de> <23A690AD52CEF71A1C450278@nimrod.local> <20101105000545.70ee4750@gmx.de> Message-ID: <20101105123249.3427c1ab@gmx.de> * Alex Bligh hat geschrieben: > I might have another look at using lvm as a blockstore, then running our > stuff inside lvm. But I didn't think lvm was capable of running thousands > of LVs per volume group. ext4 is just fine for that. Perhaps I am > slating lvm unfairly. The number of logical volumes you can create should be mostly dependent on the size of the metadata area. A short look at the pvcreate man page revealed the command-line argument --metadatasize size. Besides this, lvm should be able to handle any arbitrary number of logical volumes as long as the metadata area is big enough to hold the new config. (The same applies to ext2 and ext3 - if you don't have inodes left, you can't create new files even with thousands of free terabytes - I don't know if this limitation still exists in ext4, I'd guess "yes".) So, my tip would be to just create a PV with a very big metadata area (e.g. 512 MB or even bigger) and write a script to create a few thousand LVs on that PV, something like this:
pvcreate --metadatasize 512M /dev/foobar
vgcreate foobars /dev/foobar
for i in $(seq 1 1 5000)
do
    lvcreate --size 256M -n foobar$i foobars
done
Either it works - or not ... Regards, Bodo From tytso at mit.edu Sat Nov 6 16:30:21 2010 From: tytso at mit.edu (Ted Ts'o) Date: Sat, 6 Nov 2010 12:30:21 -0400 Subject: How to generate a large file allocating space In-Reply-To: <8B59E33D0CCCC9F5DF30CF2E@nimrod.local> References: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> <20101104161613.GC4631@thunk.org> <90029CFD2E61A5BA29AD505F@nimrod.local> <20101104191734.GG7553@thunk.org> <8B59E33D0CCCC9F5DF30CF2E@nimrod.local> Message-ID: <20101106163021.GA2935@thunk.org> On Fri, Nov 05, 2010 at 08:14:56AM +0000, Alex Bligh wrote: > > >Patches to do this wouldn't be that hard. The harder part would > >probably be the politics on fs-devel regarding the semantics of > >FALLOC_FL_EXPOSE_OLD_DATA. > > Also presumably there would be some pressure to make it work for > every filesystem that supported fallocate(). No, I don't think so. There are plenty of file systems that don't support fallocate() at all, and it's only a short step from there to new flags which might not be supported by all. > Thaks, that's really helpful. Are the extents always the leaves? IE > will next_leaf take me through extent by extent? Yes, to both questions. > Does your "please don't use this in production" warning apply to > tst_extents.c or to the whole of lib/ext2fs? The library calls > seem quite a good way to get the list of extents and are > presumably what fsck etc. use. No, only to tst_extents.c. It has a tst_ prefix precisely because it's a little hacky, and it was something that I had never intended to be installed by distributions.
(I got a little burned by "filefrag", which was never intended to be installed at distribution, which is why the code is so hackish, and why it's not internationalized, etc.) I just want to make sure tst_extents doesn't similarly escape. The libext2fs is designed to be a production-quality codebase, with a stable ABI. So feel free to use it in good health. :-) - Ted From alex at alex.org.uk Sat Nov 6 19:44:22 2010 From: alex at alex.org.uk (Alex Bligh) Date: Sat, 06 Nov 2010 19:44:22 +0000 Subject: How to generate a large file allocating space In-Reply-To: <20101106163021.GA2935@thunk.org> References: <250FAB8E-BB5F-4489-B377-D00C680671C7@dilger.ca> <8E4E90C9C1C482942DD479C6@nimrod.local> <4ADC8721-2D8C-4793-B583-F7F4AEA510BC@dilger.ca> <20101102014946.GB24500@thunk.org> <687E728B-761B-4917-8377-EE90820E9DF9@alex.org.uk> <20101104161613.GC4631@thunk.org> <90029CFD2E61A5BA29AD505F@nimrod.local> <20101104191734.GG7553@thunk.org> <8B59E33D0CCCC9F5DF30CF2E@nimrod.local> <20101106163021.GA2935@thunk.org> Message-ID: <0483B9053CD0D9ED0F75E70F@nimrod.local> --On 6 November 2010 12:30:21 -0400 Ted Ts'o wrote: > On Fri, Nov 05, 2010 at 08:14:56AM +0000, Alex Bligh wrote: >> >> > Patches to do this wouldn't be that hard. The harder part would >> > probably be the politics on fs-devel regarding the semantics of >> > FALLOC_FL_EXPOSE_OLD_DATA. >> >> Also presumably there would be some pressure to make it work for >> every filesystem that supported fallocate(). > > No, I don't think so. There are plenty of file systems that don't > support fallocate(), and it's not a short step to consider adding new > flags which might not be supported by all. Thanks. I might have a go. Patches to linux-ext4@ ? >> Thaks, that's really helpful. Are the extents always the leaves? IE >> will next_leaf take me through extent by extent? > > Yes, to both questions. > >> Does your "please don't use this in production" warning apply to >> tst_extents.c or to the whole of lib/ext2fs? The library calls >> seem quite a good way to get the list of extents and are >> presumably what fsck etc. use. > > No, only to tst_extents.c. ... > The libext2fs is designed to be a production-quality codebase, with a > stable ABI. So feel free to use it in good health. :-) Again, thanks for that. -- Alex Bligh From kernel at nedharvey.com Wed Nov 10 23:38:37 2010 From: kernel at nedharvey.com (Edward Ned Harvey) Date: Wed, 10 Nov 2010 18:38:37 -0500 Subject: Challenge: dump | restore Message-ID: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com> This runs for a few minutes, and results in a broken pipe. After which, at least some fragments of the filesystem have been restored on the destination filesystem. At least some directories. cd /mnt/newFS dump -0af - /dev/someVG/sourceFS | restore -rf - This works fine. cd ~ dump -0af somefile /dev/someVG/sourceFS cd /mnt/newFS restore -rf ~/newFS Source and destination filesystems are ext3, 194G and 857G. Destination filesystem is created with simply default mkfs.ext3. There are only approx. 200M used in the source filesystem, of which, there's no particularly huge directory or number of inodes or anything unusual... I forced the fsck, and it came back clean. My only guess is that there seems to be something wrong with the pipe. Like, it's not streaming the bits properly or something. Is it possible to overflow a pipe or something? I can't think of any good explanation for this weird behavior. What could cause a pipe to break, aside from the receiving process terminating unexpectedly? 
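As background for the question above: a pipe cannot overflow, since a writer simply blocks once the kernel's pipe buffer (typically 64K on 2.6 kernels) is full, and "broken pipe" only occurs once the read end has gone away. A tiny stand-alone illustration of that behaviour, unrelated to dump or restore's own code:

    /*
     * Demonstration that "broken pipe" only means the reader is gone:
     * with the read end closed, write() fails with EPIPE; with the read
     * end open, write() simply blocks when the pipe buffer fills up.
     */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[4096];

        signal(SIGPIPE, SIG_IGN);   /* so write() returns EPIPE instead of killing us */
        memset(buf, 'x', sizeof(buf));

        if (pipe(fds) < 0) { perror("pipe"); return 1; }

        close(fds[0]);              /* simulate the receiving process exiting */
        if (write(fds[1], buf, sizeof(buf)) < 0 && errno == EPIPE)
            puts("write failed with EPIPE: the read end went away");

        return 0;
    }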
-------------- next part -------------- An HTML attachment was scrubbed... URL: From bothie at gmx.de Sat Nov 13 15:40:19 2010 From: bothie at gmx.de (Bodo Thiesen) Date: Sat, 13 Nov 2010 16:40:19 +0100 Subject: Challenge: dump | restore In-Reply-To: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com> References: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com> Message-ID: <20101113164019.45ad3a0b@gmx.de> * Edward Ned Harvey hat geschrieben: > dump -0af - /dev/someVG/sourceFS | restore -rf - > My only guess is that there seems to be something wrong with the pipe. > Like, it's not streaming the bits properly or something. Is it possible to > overflow a pipe or something? The sending process should block until the receiving process reads the data. > I can't think of any good explanation for > this weird behavior. What could cause a pipe to break, aside from the > receiving process terminating unexpectedly? I recommend using strace to trace it down: strace -f -o dump.strace dump -0af - /dev/someVG/sourceFS | \ strace -f -o restore.strace restore -rf - Then take a closer look on the tails of the two files, maybe that reveals the problem already. Regards, Bodo From samuel at bcgreen.com Tue Nov 16 11:12:51 2010 From: samuel at bcgreen.com (Stephen Samuel) Date: Tue, 16 Nov 2010 03:12:51 -0800 Subject: Challenge: dump | restore In-Reply-To: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com> References: <000501cb8130$68f9b640$3aed22c0$@nedharvey.com> Message-ID: Try : cd ~ dump -0af - /dev/someVG/sourceFS | ( cd /mnt/newFS; restore -rf - ~/newFS ) On Wed, Nov 10, 2010 at 3:38 PM, Edward Ned Harvey wrote: > This runs for a few minutes, and results in a broken pipe. After which, at > least some fragments of the filesystem have been restored on the destination > filesystem. At least some directories. > > cd /mnt/newFS > > dump -0af - /dev/someVG/sourceFS | restore -rf - > > > > This works fine. > > cd ~ > > dump -0af somefile /dev/someVG/sourceFS > > cd /mnt/newFS > > restore -rf ~/newFS > > > > Source and destination filesystems are ext3, 194G and 857G. Destination > filesystem is created with simply default mkfs.ext3. There are only approx. > > 200M used in the source filesystem, of which, there's no particularly huge > directory or number of inodes or anything unusual... I forced the fsck, and > it came back clean. > > My only guess is that there seems to be something wrong with the pipe. > Like, it's not streaming the bits properly or something. Is it possible to > overflow a pipe or something? I can't think of any good explanation for > this weird behavior. What could cause a pipe to break, aside from the > receiving process terminating unexpectedly? > -- Stephen Samuel http://www.bcgreen.com Software, like love, 778-861-7641 grows when you give it away -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alpha345.com Tue Nov 16 17:33:07 2010 From: lists at alpha345.com (Matthew Dickinson) Date: Tue, 16 Nov 2010 11:33:07 -0600 (CST) Subject: corrupted quotas/running quotacheck on a mounted filesystem Message-ID: Hi, 1TB ext3 volume mounted via iSCSI on a RHEL5.5 system the quotas on one of my systems (with ~5k users) seems to have become out of sync with reality (500MB reported, but 100G+ in reality) - i'm seeing some odd behavior when running the quota tools also. For example, "quota -u" shows no quota for some users, but when running "edquota", they're visible in the list. As such, I think i'm in need to running the quotacheck utility. 
From the man page, it would appear that this is to be run on an unmounted filesystem - is this accurate? can it be safely run on a mounted filesystem? I understand that the results might not be completely accurate if information changes during the quotacheck run, but it should be more accurate than it is now! I'm not really able to take the system offline for an unmounted filesystem for another month or so, but would really like to get some more accurate figures in the quota. Or is there another option i've missed? Thanks, Matthew From jpiszcz at lucidpixels.com Tue Nov 9 10:24:28 2010 From: jpiszcz at lucidpixels.com (Justin Piszcz) Date: Tue, 9 Nov 2010 05:24:28 -0500 (EST) Subject: Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:496: "commit_transaction->t_nr_buffers <= commit_transaction->t_outstanding_credits" Message-ID: Hi, I have the same errors as these folks: https://bugzilla.redhat.com/show_bug.cgi?id=563247 OS: RHEL 5 x86_64 Kernel: 2.6.18 I see this on a wide variety of hardware and according to the bug report, it happens whether its hardware raid or dm. Since there are no records of this bug/issue on LKML I thought I'd pose the question. I am just looking into what is the root cause here, is it an ext3 bug? Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:496: "commit_transaction->t_nr_buffers <= commit_transaction->t_outstanding_credits" ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at fs/jbd/commit.c:496 invalid opcode: 0000 [1] SMP last sysfs file: /class/scsi_host/host0/stats CPU 3 Modules linked in: i2c_dev eeprom adm1026 hwmon_vid i2c_amd756 nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_mod video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp floppy shpchp k8temp k8_edac hwmon parport_pc amd_rng edac_mc parport i2c_amd8111 tg3 serio_raw i2c_core pcspkr sg 3w_9xxx sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 1473, comm: kjournald Not tainted 2.6.18-164.el5az00 #1 RIP: 0010:[] [] :jbd:journal_commit_transaction+0x6a3/0x106a RSP: 0018:ffff81020ee63de0 EFLAGS: 00010286 RAX: 000000000000009d RBX: ffff810133f3c130 RCX: ffffffff80304ba8 RDX: ffffffff80304ba8 RSI: 0000000000000000 RDI: ffffffff80304ba0 RBP: ffff81010f6f4200 R08: ffffffff80304ba8 R09: 000000000000003d R10: ffff81020ee63a80 R11: 0000000000000280 R12: ffff81011520b730 R13: ffff8101139b41c0 R14: 0000000000000001 R15: ffff81010e441000 FS: 00002b03c3c59d30(0000) GS:ffff8101139aa6c0(0000) knlGS:00000000f7f228d0 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00002aac9c7f20a0 CR3: 00000001fe72d000 CR4: 00000000000006e0 Process kjournald (pid: 1473, threadinfo ffff81020ee62000, task ffff81010e8aa820) Stack: 00021f55e0d52ccc ffff81010e441000 ffff810100000000 000000b500000000 0000000000000000 ffff81010e8aa820 ffffffff8009f468 ffff81020ee63e18 ffff81020ee63e18 00000000ffffffff 0000000000000286 ffffffff8004b241 Call Trace: [] autoremove_wake_function+0x0/0x2e [] try_to_del_timer_sync+0x51/0x5a [] :jbd:kjournald+0xc1/0x213 [] autoremove_wake_function+0x0/0x2e [] :jbd:kjournald+0x0/0x213 [] keventd_create_kthread+0x0/0xc4 [] kthread+0xfe/0x132 [] child_rip+0xa/0x11 [] keventd_create_kthread+0x0/0xc4 [] kthread+0x0/0x132 [] child_rip+0x0/0x11 Code: 0f 0b 68 74 8e 03 88 c2 f0 01 45 31 e4 45 31 ff 45 31 f6 c7 RIP [] :jbd:journal_commit_transaction+0x6a3/0x106a RSP Message from<0>Kernel panic - not syncing: Fatal exception