From danielk1977 at gmail.com  Tue Jul 13 13:47:08 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Tue, 13 Jul 2010 20:47:08 +0700
Subject: Should SQLite users be setting barrier=1?
Message-ID: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>

Hi,

Should sqlite users who are paranoid about losing data
when hard resets occur be setting the barrier=1 mount
option with ext3?

The situation is that we think SQLite has written data
to a series of 4K blocks in a file and then called
fsync() on the file descriptor. After this a hard reset
occurs. Upon recovery it seems like one of the 4K blocks
has been zeroed. The others are all fine.

Happens every now and again under stress testing.

System is using data=journaled, but not barrier=1.

Should users also be setting barrier=1 for extra robustness
in the face of hard resets?

Thanks,
Dan.


From rwheeler at redhat.com  Tue Jul 13 16:23:20 2010
From: rwheeler at redhat.com (Ric Wheeler)
Date: Tue, 13 Jul 2010 12:23:20 -0400
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
Message-ID: <4C3C92F8.30801@redhat.com>

On 07/13/2010 09:47 AM, Dan Kennedy wrote:
> Hi,
>
> Should sqlite users who are paranoid about losing data
> when hard resets occur be setting the barrier=1 mount
> option with ext3?
>
> The situation is that we think SQLite has written data
> to a series of 4K blocks in a file and then called
> fsync() on the file descriptor. After this a hard reset
> occurs. Upon recovery it seems like one of the 4K blocks
> has been zeroed. The others are all fine.
>
> Happens every now and again under stress testing.
>
> System is using data=journaled, but not barrier=1.
>
> Should users also be setting barrier=1 for extra robustness
> in the face of hard resets?
>
> Thanks,
> Dan.
>


Hi Dan,

If you do not use barriers, your storage device could very well lose data if it
loses power. There is no easy answer; you need to understand the type and
configuration of your storage.

For a local SAS/S-ATA drive, you should have barriers enabled when the write 
cache is enabled (check that with hdparm for example on S-ATA). Note that you 
could also be safe by disabling the write cache and leaving barriers off as well.

If you have a non-volatile write cache (for example on an external, enterprise 
class array), you can safely mount without barriers.

Regards,

Ric


From sandeen at redhat.com  Tue Jul 13 16:26:14 2010
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 13 Jul 2010 11:26:14 -0500
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
Message-ID: <4C3C93A6.8010400@redhat.com>

On 07/13/2010 08:47 AM, Dan Kennedy wrote:
> Hi,
> 
> Should sqlite users who are paranoid about losing data
> when hard resets occur be setting the barrier=1 mount
> option with ext3?

barriers should be enabled whenever you wish to ensure a consistent
filesystem post-powerloss, and you have write caches on your drives
which may reorder or lose data when power is lost.

Whether your resets drop power to drive caches, I dunno.

> The situation is that we think SQLite has written data
> to a series of 4K blocks in a file and then called
> fsync() on the file descriptor. After this a hard reset
> occurs. Upon recovery it seems like one of the 4K blocks
> has been zeroed. The others are all fine.

See ext3_sync_file:

        /*
         * In case we didn't commit a transaction, we have to flush
         * disk caches manually so that data really is on persistent
         * storage
         */
        if (needs_barrier)
                blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
                                BLKDEV_IFL_WAIT);


so w/o barriers you are not flushing the drive cache, and data still
sitting in it can be lost on power failure.
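
As a rough illustration of checking for the option discussed here, the
following C sketch tests whether a comma-separated mount option string
contains barrier=1. The helper name has_mount_option is hypothetical, and
where the string comes from (for example the options field of a
/proc/mounts line) is an assumption; whether a given kernel lists
barrier=1 there is not something this thread establishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return true if the comma-separated option string
 * `opts` contains exactly the option `want` (e.g. "barrier=1"). */
static bool has_mount_option(const char *opts, const char *want)
{
    assert(opts != NULL && want != NULL);

    size_t want_len = strlen(want);
    const char *p = opts;

    while (*p != '\0') {
        /* Each option runs up to the next comma or the end of the string. */
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* Compare whole options only, so "nobarrier=1" does not match. */
        if (len == want_len && strncmp(p, want, len) == 0)
            return true;
        if (end == NULL)
            break;
        p = end + 1;
    }
    return false;
}
```

For example, has_mount_option("rw,data=journal,barrier=1", "barrier=1")
is true, while the same call on "rw,data=journal" is false.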

> Happens every now and again under stress testing.
> 
> System is using data=journaled, but not barrier=1.
> 
> Should users also be setting barrier=1 for extra robustness
> in the face of hard resets?

s/extra// - but yes.

-Eric

> Thanks,
> Dan.
> 


From danielk1977 at gmail.com  Tue Jul 13 16:56:24 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Tue, 13 Jul 2010 23:56:24 +0700
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <4C3C92F8.30801@redhat.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
	<4C3C92F8.30801@redhat.com>
Message-ID: <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>


> If you do not use barriers, your storage device could very well lose  
> data if it loses power. There is no easy answer, you need to  
> understand the type and configuration of your storage.
>
> For a local SAS/S-ATA drive, you should have barriers enabled when  
> the write cache is enabled (check that with hdparm for example on S- 
> ATA). Note that you could also be safe by disabling the write cache  
> and leaving barriers off as well.
>
> If you have a non-volatile write cache (for example on an external,  
> enterprise class array), you can safely mount without barriers.
>
> Regards,
>
> Ric


Hi Ric,

Thanks very much for the quick response (and Eric, thanks
as well).

Richard put a paragraph with a link to your answer in
our documentation here:

   http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem

Please let us know if this misrepresents the situation.
Or if there is something else we should add to clarify
it.

Thanks again,
Dan.


From rwheeler at redhat.com  Tue Jul 13 17:10:43 2010
From: rwheeler at redhat.com (Ric Wheeler)
Date: Tue, 13 Jul 2010 13:10:43 -0400
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
	<4C3C92F8.30801@redhat.com>
	<0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
Message-ID: <4C3C9E13.8010208@redhat.com>

On 07/13/2010 12:56 PM, Dan Kennedy wrote:
>
>> If you do not use barriers, your storage device could very well lose
>> data if it loses power. There is no easy answer, you need to
>> understand the type and configuration of your storage.
>>
>> For a local SAS/S-ATA drive, you should have barriers enabled when the
>> write cache is enabled (check that with hdparm for example on S-ATA).
>> Note that you could also be safe by disabling the write cache and
>> leaving barriers off as well.
>>
>> If you have a non-volatile write cache (for example on an external,
>> enterprise class array), you can safely mount without barriers.
>>
>> Regards,
>>
>> Ric
>
>
>
> Hi Ric,
>
> Thanks very much for the quick response (and Eric, thanks
> as well).
>
> Richard put a paragraph with a link to your answer in
> our documentation here:
>
> http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem
>
> Please let us know if this misrepresents the situation.
> Or if there is something else we should add to clarify
> it.
>
> Thanks again,
> Dan.
>
>

I would suggest that the drives are designed to cache data - this is not a 
"defective" controller per se. Think of it like lossy vs lossless audio encoding 
- by design, you get more (performance) but pay a price in data integrity.

The traditional, easy answer was always "disable the write cache" on SAS or 
S-ATA drives, but the barriers do allow you to get back most of the performance.

One other note is that you should be very careful not to use fsync() too much.
It is better to use it intentionally (for example, in your commit phase) than to
sprinkle it in too often. I know that database people understand this; think of
fsync() as our file-system-level commit, since it costs a lot to do but
carries data integrity promises with it :-)


ric


From tytso at mit.edu  Wed Jul 14 08:35:54 2010
From: tytso at mit.edu (Ted Ts'o)
Date: Wed, 14 Jul 2010 04:35:54 -0400
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
	<4C3C92F8.30801@redhat.com>
	<0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
Message-ID: <20100714083554.GB27114@thunk.org>

On Tue, Jul 13, 2010 at 11:56:24PM +0700, Dan Kennedy wrote:
> Richard put a paragraph with a link to your answer in
> our documentation here:
> 
>   http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem
> 
> Please let us know if this misrepresents the situation.
> Or if there is something else we should add to clarify
> it.

Something else that might be useful to put in the FAQ is some
suggestions to application programmers about methods to get no more
than the safety needed for their application.  For example, firefox's
"awesome bar" (myself, I don't think it's so awesome) at one point was
doing an SQL COMMIT after every single time a user visited a page
(either by clicking on a link or entering a URL in the "awesome bar").
Worse yet, they were doing this in the main UI loop of their
application.  Not only was this a performance disaster, but if you had
an SSD it was doing you no favors either: the firefox/SQLite combination
was also doing a third of a megabyte of disk writes for every single
page that you visited.

Did they really need that level of safety?  Probably not.  It probably
would have been better if they had used a memory-only SQLite database
for immediate history, and then every 50 pages or so, in a background
thread, contents of the in-memory SQLite database could be flushed to
the disk-resident SQLite database.  After all, if someone exits a 3D
game and their crappy proprietary ATI or Nvidia driver crashes their
laptop, remembering the last 10 or so pages of their web browser
history might not be the most important thing in the world....
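
The batching idea can be sketched in plain POSIX C. This is only an
illustration of "buffer in memory, flush every 50 entries with a single
fsync()": it uses a flat text file as a stand-in for the disk-resident
database rather than SQLite, and the names history_batch, history_init,
history_add and history_flush are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE 50    /* "every 50 pages or so" */
#define MAX_URL    256

struct history_batch {
    char urls[BATCH_SIZE][MAX_URL];
    int  count;      /* entries currently held only in memory */
    int  flushes;    /* number of successful, fsync()ed flushes */
};

static void history_init(struct history_batch *b)
{
    memset(b, 0, sizeof *b);
}

/* Append every buffered entry to `path`, fsync() once, then empty the
 * buffer. On failure the entries stay buffered; a later retry may
 * duplicate lines that were already written. */
static int history_flush(struct history_batch *b, const char *path)
{
    if (b->count == 0)
        return 0;

    FILE *f = fopen(path, "a");
    if (f == NULL)
        return -1;

    int ok = 1;
    for (int i = 0; i < b->count; i++) {
        if (fprintf(f, "%s\n", b->urls[i]) < 0)
            ok = 0;
    }
    /* Push stdio buffers to the kernel, then ask for durability once. */
    if (fflush(f) != 0 || fsync(fileno(f)) != 0)
        ok = 0;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok)
        return -1;

    b->count = 0;
    b->flushes++;
    return 0;
}

/* Record one visited URL; only every BATCH_SIZE-th call touches the disk. */
static int history_add(struct history_batch *b, const char *path,
                       const char *url)
{
    assert(url != NULL);

    /* Only reachable after an earlier failed flush: retry before adding. */
    if (b->count == BATCH_SIZE && history_flush(b, path) != 0)
        return -1;

    snprintf(b->urls[b->count], MAX_URL, "%s", url);
    b->count++;
    return b->count == BATCH_SIZE ? history_flush(b, path) : 0;
}
```

With this shape, a crash loses at most the entries still in the buffer
(fewer than 50), in exchange for one fsync() per batch instead of one
per page visit.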

So putting in some text to help application programmers think about
performance issues as well as robustness issues, and creative ways of
trading off between them, I think would be a good idea.  It's a rare
application programmer these days who thinks at the XML and HTML and
Java/Python level, and then understands what SQLite is doing, and then
understands the implications at the level of the kernel, the hard disk,
SSD, write wear issues, and barriers.

Thanks, regards,

						- Ted


From danielk1977 at gmail.com  Wed Jul 14 11:20:29 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Wed, 14 Jul 2010 18:20:29 +0700
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <20100714083554.GB27114@thunk.org>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
	<4C3C92F8.30801@redhat.com>
	<0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
	<20100714083554.GB27114@thunk.org>
Message-ID: <160750B8-D889-4112-BCD2-0CBE12044E2C@gmail.com>


On Jul 14, 2010, at 3:35 PM, Ted Ts'o wrote:

> On Tue, Jul 13, 2010 at 11:56:24PM +0700, Dan Kennedy wrote:
>> Richard put a paragraph with a link to your answer in
>> our documentation here:
>>
>>  http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem
>>
>> Please let us know if this misrepresents the situation.
>> Or if there is something else we should add to clarify
>> it.
>
> Something else that might be useful to put in the FAQ is some
> suggestions to application programmers about methods to get no more
> than the safety needed for their application.  For example, firefox's
> "awesome bar" (myself, I don't think it's so awesome) at one point was
> doing an SQL COMMIT after every single time a user visited a page
> (either by clicking on a link or entering a URL in the "awesome bar").
> Worse yet, they were doing this in the main UI loop of their
> application.  Not only was this a performance disaster, but if you had
> an SSD it was doing you no favors either: the firefox/SQLite combination
> was also doing a third of a megabyte of disk writes for every single
> page that you visited.
>
> Did they really need that level of safety?  Probably not.  It probably
> would have been better if they had used a memory-only SQLite database
> for immediate history, and then every 50 pages or so, in a background
> thread, contents of the in-memory SQLite database could be flushed to
> the disk-resident SQLite database.  After all, if someone exits a 3D
> game and their crappy proprietary ATI or Nvidia driver crashes their
> laptop, remembering the last 10 or so pages of their web browser
> history might not be the most important thing in the world....


Hi Ted,

Good points. Thanks.

Recent versions of Firefox are doing pretty much as you suggest. One
of the problems is that applications have to jump through some
fairly involved hoops to get SQLite to do this. Usually, if an app
needs to commit data to an SQLite database (so that other SQL
applications can see it), you have two choices: (a) update the database
safely, with all the fsync() calls that involves, or (b) omit the
fsync() calls, and risk corrupting entire database tables if an
inopportune power failure occurs.

Folks who didn't care so much about the last 10 seconds of data
were balking at the idea of corrupting an entire history log.

The upcoming 3.7.0 has a new mode that allows applications to safely
write data to databases without calling fsync() so that other
applications can read it. If a power failure occurs after data is
written in this mode, you only risk losing the new unsynced data.
Hopefully people can start to use this to reduce the number of
fsync() calls made to sync non-critical data.

Off the top of your head, is there anything else related to ext2 or
ext4 mount options that we could mention on the webpage? Or anything
else to do with ext3 that we should be emphasizing?

Quite a few SQLite users like to run these power-failure/hard-reset
tests to see if they can manage to corrupt an SQLite database file.
The SQLite code assumes that:

   * once an fsync() call has returned, all previous writes to the
     file have made it all the way to the persistent media and are
     safe even if a power failure occurs, and that

   * if a power failure occurs before an fsync() call has returned
     successfully, any blocks written to since the previous fsync()
     may contain the new data, the old data or garbage data following
     system recovery. i.e. they cannot be trusted when reconstructing
     the database.

In other words, we want an fsync() that behaves the way someone who
knows nothing about hardware might optimistically assume it behaves. :)
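
For application code, the two assumptions above amount to a simple rule:
treat written blocks as durable only once fsync() has returned 0. A
minimal POSIX C sketch under that assumption follows. The helper names
write_all and write_blocks_then_sync are hypothetical (they are not
SQLite functions), and, as discussed earlier in this thread, the
guarantee still depends on barriers or on a disabled or non-volatile
write cache underneath.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write exactly `len` bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Append `nblocks` copies of `block` to `path`, then fsync() once.
 * Returns 0 only if every write, the fsync() and the close() succeeded;
 * on any other return value the blocks must not be trusted. */
static int write_blocks_then_sync(const char *path, const char *block,
                                  size_t block_size, int nblocks)
{
    assert(block_size > 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    for (int i = 0; i < nblocks; i++) {
        if (write_all(fd, block, block_size) != 0) {
            close(fd);
            return -1;
        }
    }
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would, for instance, write three 4K blocks with
write_blocks_then_sync(path, block, 4096, 3) and proceed to the next
step of its commit only if the result is 0.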

Assuming they are using regular disks with volatile write-caches,
how should we tell people to configure ext3 to get as close as
possible to this ideal? Do they have to use any particular "data="
mode?

With ext2, is the only option to disable the disk's write cache, or
is there an equivalent to the barrier=1 parameter?

Thanks very much,
Dan.


> So putting in some text to help application programmers think about
> performance issues as well as robustness issues, and creative ways of
> trading off between them, I think would be a good idea.  It's a rare
> application programmer these days who thinks at the XML and HTML and
> Java/Python level, and then understands what SQLite is doing, and then
> understands the implications at the level of the kernel, the hard disk,
> SSD, write wear issues, and barriers.
>
> Thanks, regards,
>
> 						- Ted


From danielk1977 at gmail.com  Wed Jul 14 11:22:45 2010
From: danielk1977 at gmail.com (Dan Kennedy)
Date: Wed, 14 Jul 2010 18:22:45 +0700
Subject: Should SQLite users be setting barrier=1?
In-Reply-To: <4C3C9E13.8010208@redhat.com>
References: <380A8F95-AD8B-442D-8B9D-48DDF07838EA@gmail.com>
	<4C3C92F8.30801@redhat.com>
	<0A54407E-16A9-4DC0-A99C-C42172BB6802@gmail.com>
	<4C3C9E13.8010208@redhat.com>
Message-ID: <3B39ECF9-48BD-4A07-904A-7ED53FD7098D@gmail.com>

>
> I would suggest that the drives are designed to cache data - this is  
> not a "defective" controller per se. Think of it like lossy vs  
> lossless audio encoding - by design, you get more (performance) but  
> pay a price in data integrity.


"defective" removed and replaced with a better explanation.
Thanks!

Dan.
  

From ryan at bitlackeys.com  Mon Jul 19 19:39:29 2010
From: ryan at bitlackeys.com (Ryan O'Neill)
Date: Mon, 19 Jul 2010 19:39:29 -0000
Subject: problem with custom fs that utilizes ext3
Message-ID: <AANLkTinjfPYvoHqVHAOMFfu96k7cxO-NZMH6nm10Kyzm@mail.gmail.com>

I have developed a file system: a virtual memory file system with
persistence. The persistence essentially works by storing the fs data
(from memory) in the slack space of ext3 files (we are working on
CentOS 5.3 -- old, I know). The following details should be sufficient.

I keep the inode size (i_size) the same so that utilities don't see the
hidden data -- it appears that do_sync_read, which ext3 uses, and the
functions it calls (such as generic_file_read) do not read past the size
of the inode.

When storing this persistent data in the slack space of ext3, I create
something like a journal that contains the names of the ext3 files and
how much data we have in the slack space. So when I remount my file
system, a read_journal() happens, and here is the issue --

I temporarily extend the size of the inode of the ext3 file so I can
get at the hidden data, then I put the inode size back. For a long time
this returned zeroes. I then (I believe this made the change) marked the
inode as dirty and flushed the page so that it forced the change of the
inode extension, because I knew the data was there. Now I get the actual
data (using do_sync_read) -- but it only works for the first file in the
journal (98% of the time). read_journal() works in a loop: it reads
through each journal entry, and when it tries to perform the same
operation on another ext3 file, after extending the inode it just gets
zeroes back. But I know the data is there, because if I leave the inode
extended, I can use 'vi' to open the file and see the data.

Is this some type of page cache issue? How can I get around this? Any
input would be greatly appreciated. Thank you.

I can't really give code slices, but the general idea is

i_size = i_size_read(inode);
i_size += extended_size;
i_size_write(inode, i_size);
mark_inode_dirty(inode);
wakeup_pdflush(0);  /* I realize this is overkill, but I was just
                       trying to get it to work before I used
                       aops->commit_write on the actual page of
                       the inode. */

ext3_file->f_op->llseek(seek to start of hidden data);
ext3_file->f_op->read(read in the data that is hidden at this location)

The first time I go through this operation it works: I get the data
back into memory and can reconstruct a file in virtual memory. All
subsequent attempts fail -- although I believe once or twice it did work.

I'm guessing I simply don't understand the underlying page cache well
enough. Any help on this would be greatly appreciated. Thank you all.

Regards, Ryan.