From danielk1977 at gmail.com Tue Jul 13 13:47:08 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Tue, 13 Jul 2010 20:47:08 +0700
Subject: Should SQLite users be setting barrier=1?
Message-ID: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>

Hi,

Should sqlite users who are paranoid about losing data
when hard resets occur be setting the barrier=1 mount
option with ext3?

The situation is that we think SQLite has written data
to a series of 4K blocks in a file and then called
fsync() on the file descriptor. After this a hard reset
occurs. Upon recovery it seems like one of the 4K blocks
has been zeroed. The others are all fine.

Happens every now and again under stress testing.

System is using data=journaled, but not barrier=1.

Should users also be setting barrier=1 for extra robustness
in the face of hard resets?

Thanks,
Dan.

From rwheeler at redhat.com Tue Jul 13 16:23:20 2010
From: rwheeler at redhat.com (Ric Wheeler)
Date: Tue, 13 Jul 2010 12:23:20 -0400
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
Message-ID: <4C3C92F8.30801@redhat.com>

On 07/13/2010 09:47 AM, Dan Kennedy wrote:
> Hi,
>
> Should sqlite users who are paranoid about losing data
> when hard resets occur be setting the barrier=1 mount
> option with ext3?
>
> The situation is that we think SQLite has written data
> to a series of 4K blocks in a file and then called
> fsync() on the file descriptor. After this a hard reset
> occurs. Upon recovery it seems like one of the 4K blocks
> has been zeroed. The others are all fine.
>
> Happens every now and again under stress testing.
>
> System is using data=journaled, but not barrier=1.
>
> Should users also be setting barrier=1 for extra robustness
> in the face of hard resets?
>
> Thanks,
> Dan.
>

Hi Dan,

If you do not use barriers, your storage device could very well lose
data if it loses power.
There is no easy answer, you need to understand the type and
configuration of your storage.

For a local SAS/S-ATA drive, you should have barriers enabled when the
write cache is enabled (check that with hdparm for example on S-ATA).
Note that you could also be safe by disabling the write cache and
leaving barriers off as well.

If you have a non-volatile write cache (for example on an external,
enterprise class array), you can safely mount without barriers.

Regards,

Ric

From sandeen at redhat.com Tue Jul 13 16:26:14 2010
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 13 Jul 2010 11:26:14 -0500
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
Message-ID: <4C3C93A6.8010400@redhat.com>

On 07/13/2010 08:47 AM, Dan Kennedy wrote:
> Hi,
>
> Should sqlite users who are paranoid about losing data
> when hard resets occur be setting the barrier=1 mount
> option with ext3?

barriers should be enabled whenever you wish to ensure a consistent
filesystem post-powerloss, and you have write caches on your drives
which may reorder or lose data when power is lost.

Whether your resets drop power to drive caches, I dunno.

> The situation is that we think SQLite has written data
> to a series of 4K blocks in a file and then called
> fsync() on the file descriptor. After this a hard reset
> occurs. Upon recovery it seems like one of the 4K blocks
> has been zeroed. The others are all fine.

See ext3_sync_file:

        /*
         * In case we didn't commit a transaction, we have to flush
         * disk caches manually so that data really is on persistent
         * storage
         */
        if (needs_barrier)
                blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
                                BLKDEV_IFL_WAIT);

so w/o barriers you are not flushing the drive cache and that data
will be lost.

> Happens every now and again under stress testing.
>
> System is using data=journaled, but not barrier=1.
>
> Should users also be setting barrier=1 for extra robustness
> in the face of hard resets?

s/extra// - but yes.

-Eric

> Thanks,
> Dan.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

From danielk1977 at gmail.com Tue Jul 13 16:56:24 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Tue, 13 Jul 2010 23:56:24 +0700
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <4C3C92F8.30801@redhat.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com> <4C3C92F8.30801@redhat.com>
Message-ID: <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>

> If you do not use barriers, your storage device could very well lose
> data if it loses power. There is no easy answer, you need to
> understand the type and configuration of your storage.
>
> For a local SAS/S-ATA drive, you should have barriers enabled when
> the write cache is enabled (check that with hdparm for example on
> S-ATA). Note that you could also be safe by disabling the write cache
> and leaving barriers off as well.
>
> If you have a non-volatile write cache (for example on an external,
> enterprise class array), you can safely mount without barriers.
>
> Regards,
>
> Ric

Hi Ric,

Thanks very much for the quick response (and Eric, thanks
as well).

Richard put a paragraph with a link to your answer in
our documentation here:

http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem

Please let us know if this misrepresents the situation.
Or if there is something else we should add to clarify
it.

Thanks again,
Dan.

From rwheeler at redhat.com Tue Jul 13 17:10:43 2010
From: rwheeler at redhat.com (Ric Wheeler)
Date: Tue, 13 Jul 2010 13:10:43 -0400
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com> <4C3C92F8.30801@redhat.com> <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
Message-ID: <4C3C9E13.8010208@redhat.com>

On 07/13/2010 12:56 PM, Dan Kennedy wrote:
>
>> If you do not use barriers, your storage device could very well lose
>> data if it loses power. There is no easy answer, you need to
>> understand the type and configuration of your storage.
>>
>> For a local SAS/S-ATA drive, you should have barriers enabled when the
>> write cache is enabled (check that with hdparm for example on S-ATA).
>> Note that you could also be safe by disabling the write cache and
>> leaving barriers off as well.
>>
>> If you have a non-volatile write cache (for example on an external,
>> enterprise class array), you can safely mount without barriers.
>>
>> Regards,
>>
>> Ric
>
>
> Hi Ric,
>
> Thanks very much for the quick response (and Eric, thanks
> as well).
>
> Richard put a paragraph with a link to your answer in
> our documentation here:
>
> http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem
>
> Please let us know if this misrepresents the situation.
> Or if there is something else we should add to clarify
> it.
>
> Thanks again,
> Dan.
>
>

I would suggest that the drives are designed to cache data - this is not
a "defective" controller per se. Think of it like lossy vs lossless
audio encoding - by design, you get more (performance) but pay a price
in data integrity.

The traditional, easy answer was always "disable the write cache" on
SAS or S-ATA drives, but the barriers do allow you to get back most of
the performance.

One other note is that you should be very careful not to use fsync() too
much. Better to use it intentionally (for example, in your commit phase)
than to sprinkle it in too often.
I know that database people understand this, think of the fsync() as our
file system level commit since it costs a lot to do, but carries data
integrity promises with it :-)

ric

From tytso at mit.edu Wed Jul 14 08:35:54 2010
From: tytso at mit.edu (Ted Ts'o)
Date: Wed, 14 Jul 2010 04:35:54 -0400
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com> <4C3C92F8.30801@redhat.com> <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
Message-ID: <20100714083554.GB27114@thunk.org>

On Tue, Jul 13, 2010 at 11:56:24PM +0700, Dan Kennedy wrote:
> Richard put a paragraph with a link to your answer in
> our documentation here:
>
> http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem
>
> Please let us know if this misrepresents the situation.
> Or if there is something else we should add to clarify
> it.

Something else that might be useful to put in the FAQ is some
suggestions to application programmers about methods to get no more
than the safety needed for their application. For example, firefox's
"awesome bar" (myself, I don't think it's so awesome) at one point was
doing an SQL COMMIT after every single time a user visited a page
(either by clicking on a link or entering a URL in the "awesome bar").
Worse yet, they were doing this in the main UI loop of their
application. Not only was this a performance disaster, if you had an
SSD it was doing you no favors, the firefox/SQLite combination was
also doing a third of a megabyte of disk writes for every single page
that you visited.

Did they really need that level of safety? Probably not. It probably
would have been better if they had used a memory-only SQLite database
for immediate history, and then every 50 pages or so, in a background
thread, contents of the in-memory SQLite database could be flushed to
the disk-resident SQLite database.
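
The batching idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not what Firefox or SQLite actually does: it uses Python's standard sqlite3 module, and the `HistoryLog` class, the `visits` table and its columns are hypothetical names invented for the example. The background thread mentioned above is left out to keep the sketch short; the flush simply runs inline once enough visits have accumulated.

```python
import sqlite3

FLUSH_EVERY = 50  # visits between flushes, the figure suggested above

# Hypothetical schema, used only for this illustration.
SCHEMA = "CREATE TABLE IF NOT EXISTS visits (url TEXT, visited_at TEXT)"


class HistoryLog:
    """Buffer page visits in an in-memory database, write them to disk in batches."""

    def __init__(self, disk_path, flush_every=FLUSH_EVERY):
        self.disk_path = disk_path
        self.flush_every = flush_every
        self.pending = 0
        # In-memory database: inserts here never touch the disk.
        self.mem = sqlite3.connect(":memory:")
        self.mem.execute(SCHEMA)
        self._write_to_disk([])  # make sure the on-disk table exists

    def _write_to_disk(self, rows):
        disk = sqlite3.connect(self.disk_path)
        try:
            with disk:  # one transaction, committed once when the block exits
                disk.execute(SCHEMA)
                disk.executemany("INSERT INTO visits VALUES (?, ?)", rows)
        finally:
            disk.close()

    def record(self, url):
        """Record one visit in memory; flush only when the batch is full."""
        self.mem.execute("INSERT INTO visits VALUES (?, datetime('now'))", (url,))
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def flush(self):
        """Copy all buffered visits to the on-disk database and clear the buffer."""
        rows = self.mem.execute("SELECT url, visited_at FROM visits").fetchall()
        self._write_to_disk(rows)
        self.mem.execute("DELETE FROM visits")
        self.mem.commit()
        self.pending = 0
```

With this shape, the on-disk database sees one commit per batch rather than one per page visit, at the cost of losing up to `flush_every - 1` buffered visits if the process or machine dies before the next flush.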
After all, if someone exits a 3D
game and their crappy proprietary ATI or Nvidia driver crashes their
laptop, remembering the last 10 or so pages of their web browser
history might not be the most important thing in the world....

So putting in some text to help application programmers think about
performance issues as well as robustness issues, and creative ways of
trading off between them, I think would be a good idea. It's a rare
application programmer these days who can think at the XML and HTML and
Java/Python level, and then understand what SQLite is doing, and then
understand the implications at the level of the kernel, the hard disk,
SSD, write wear issues, and barriers.

Thanks, regards,

- Ted

From danielk1977 at gmail.com Wed Jul 14 11:20:29 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Wed, 14 Jul 2010 18:20:29 +0700
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <20100714083554.GB27114@thunk.org>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com> <4C3C92F8.30801@redhat.com> <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com> <20100714083554.GB27114@thunk.org>
Message-ID: <160750B8-D889-4112-BCD2-0CBE12044E2C@gmail.com>

On Jul 14, 2010, at 3:35 PM, Ted Ts'o wrote:
> On Tue, Jul 13, 2010 at 11:56:24PM +0700, Dan Kennedy wrote:
>> Richard put a paragraph with a link to your answer in
>> our documentation here:
>>
>> http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem
>>
>> Please let us know if this misrepresents the situation.
>> Or if there is something else we should add to clarify
>> it.
>
> Something else that might be useful to put in the FAQ is some
> suggestions to application programmers about methods to get no more
> than the safety needed for their application. For example, firefox's
> "awesome bar" (myself, I don't think it's so awesome) at one point was
> doing an SQL COMMIT after every single time a user visited a page
> (either by clicking on a link or entering a URL in the "awesome bar").
> Worse yet, they were doing this in the main UI loop of their
> application. Not only was this a performance disaster, if you had an
> SSD it was doing you no favors, the firefox/SQLite combination was
> also doing a third of a megabyte of disk writes for every single page
> that you visited.
>
> Did they really need that level of safety? Probably not. It probably
> would have been better if they had used a memory-only SQLite database
> for immediate history, and then every 50 pages or so, in a background
> thread, contents of the in-memory SQLite database could be flushed to
> the disk-resident SQLite database. After all, if someone exits a 3D
> game and their crappy proprietary ATI or Nvidia driver crashes their
> laptop, remembering the last 10 or so pages of their web browser
> history might not be the most important thing in the world....

Hi Ted,

Good points. Thanks.

Recent versions of Firefox are doing pretty much as you suggest.
One of the problems is that applications have to jump through some
fairly involved hoops to get SQLite to do this. Usually, if an app
needs to commit data to an SQLite database (so that other SQL
applications can see it), you have two choices:

  (a) update the database safely, with all the fsync() calls that
      involves, or
  (b) omit the fsync() calls, and risk corrupting entire database
      tables if an inopportune power failure occurs.

Folks who didn't care so much about the last 10 seconds of data were
balking at the idea of corrupting an entire history log.

The upcoming 3.7.0 has a new mode that allows applications to safely
write data to databases without calling fsync() so that other
applications can read it. If a power failure occurs after data is
written in this mode, you only risk losing the new unsynced data.
Hopefully people can start to use this to reduce the number of
fsync() calls made to sync non-critical data.

Off the top of your head is there anything else related to ext2 or
ext4 mount options that we could mention on the webpage?
Or anything else that we should be emphasizing to do with ext3?

Quite a few SQLite users like to run these power-failure/hard-reset
tests to see if they can manage to corrupt an SQLite database file.

The SQLite code assumes that:

  * once an fsync() call has returned, all previous writes to the
    file have made it all the way to the persistent media and are
    safe even if a power failure occurs, and that

  * if a power failure occurs before an fsync() call has returned
    successfully, any blocks written to since the previous fsync()
    may contain the new data, the old data or garbage data following
    system recovery, i.e. they cannot be trusted when reconstructing
    the database.

In other words, we want an fsync() that behaves the way one who
knows nothing about hardware might optimistically assume it
behaves. :)

Assuming they are using regular disks with volatile write-caches,
how should we tell people to configure ext3 to get as close as
possible to this ideal? Do they have to use any particular "data="
mode? With ext2, is the only option to disable the disk's write-cache,
or is there an equivalent to the barrier=1 parameter?

Thanks very much,
Dan.

> So putting in some text to help application programmers think about
> performance issues as well as robustness issues, and creative ways of
> trading off between them, I think would be a good idea. It's a rare
> application programmer these days who can think at the XML and HTML and
> Java/Python level, and then understand what SQLite is doing, and then
> understand the implications at the level of the kernel, the hard disk,
> SSD, write wear issues, and barriers.
>
> Thanks, regards,
>
> - Ted

From danielk1977 at gmail.com Wed Jul 14 11:22:45 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Wed, 14 Jul 2010 18:22:45 +0700
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <4C3C9E13.8010208@redhat.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com> <4C3C92F8.30801@redhat.com> <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com> <4C3C9E13.8010208@redhat.com>
Message-ID: <3B39ECF9-48BD-4A07-904A-7ED53FD7098D@gmail.com>

>
> I would suggest that the drives are designed to cache data - this is
> not a "defective" controller per se. Think of it like lossy vs
> lossless audio encoding - by design, you get more (performance) but
> pay a price in data integrity.

"defective" removed and replaced with a better explanation. Thanks!

Dan.

From ryan at bitlackeys.com Mon Jul 19 19:39:29 2010
From: ryan at bitlackeys.com (Ryan O'Neill)
Date: Mon, 19 Jul 2010 19:39:29 -0000
Subject: problem with custom fs that utilizes ext3
Message-ID:

I have developed a file system, it is a virtual memory file system with
persistence -- the persistence essentially works by storing the fs data
(from memory) into the slack space of ext3 files (We are working on
CentOS 5.3 -- old I know). The following details should be sufficient.

I keep the inode size the same so that utilities don't see the hidden
data -- it appears that do_sync_read, which ext3 uses, and the functions
that it uses (such as generic_file_read) do not read past the size of
the inode.

When storing this persistent data in the slack space of ext3, I create
something like a journal that contains the names of the ext3 files and
how much data we have in the slack space. So when I remount my file
system, a read_journal() happens, and here is the issue --

I temporarily extend the size of the inode of the ext3 file so I can get
at the hidden data, then I put the inode size back. For a long time this
returned 0's. I then (I believe this made the change) marked the inode
as dirty and flushed the page so it forced the change of the inode
extension, because I knew the data was there.
Now I get the actual data (using do_sync_read) -- but it only works for
the first file in the journal (98% of the time). read_journal() works in
a loop, it reads through each journal entry, and when it tries to
perform the same operation on another ext3 file, after extending the
inode it just gets zeroes back. But I know the data is there, because if
I leave the inode extended, I use 'vi' to open the file and I can see
the data.

Is this some type of page cache issue? How can I get around this? Any
input would be greatly appreciated. Thank you.

I can't really give code slices, but the general idea is

    i_size = i_size_read(inode);
    i_size += extended_size;
    i_size_write(inode, i_size);
    mark_inode_dirty(inode);
    wakeup_pdflush(0);  /* I realize this is overkill, but I was just
                           trying to get it to work before I used
                           aops->commit_write on the actual page of
                           the inode. */
    ext3_file->f_op->llseek(seek to start of hidden data);
    ext3_file->f_op->read(read in the data that is hidden at this location)

The first time I go through this operation it works: I get the data back
into memory and can reconstruct a file in virtual memory. All subsequent
attempts fail -- although I believe once or twice it did work. I simply
don't understand the underlying page cache well enough, I'm guessing.

Any help would be greatly appreciated on this, thank you all.

Regards,
Ryan.