[linux-lvm] fsync() and LVM

Tue Mar 17 16:00:54 UTC 2009

Stuart D. Gathman wrote:
> That is clearly wrong - since fsync() isn't LVM's responsibility.
> I think they mean that fsync() can't guarantee that any writes are
> actually on the platter.

Even if the disk cache is in write-thru mode, that is.

>> that data doesn't even get to the controller, and it doesn't matter
>> if the disks have write caches enabled or not. Or if they have battery backed
>> caches. Please read the thread I linked. If what they say is true,
> 
> That is clearly wrong.  If writes don't work, nothing works.

It's the flush (= write NOW) supposedly not working, not the write.
Writes happen, just later and potentially not in order. You seem to assume
that fsync() is the only way to have the data written. That's clearly not
the case: most userland processes just issue write(), never fsync(), and
data gets written anyway, sooner or later.

>> you can't use LVM for anything that needs fsync(), including mail queues
>> (sendmail), mail storage (imapd), as such. So I'd really like to know.
> 
> fsync() is a file system call that writes dirty buffers,

Sure, but it's not the only way to have dirty pages flushed. There's
a kernel thread that flushes them every now and then, and there's
also memory pressure. So a broken fsync() can go unnoticed; you become
aware of it if and only if:

1) you run some application that needs it (most don't even use it);
2) the system crashes (power loss);
3) you are unlucky enough to hit the window of vulnerability.

If any of these conditions is not met, you won't be aware of a
malfunctioning fsync().

But I think I understand what you mean: if the API to flush to physical
storage is the same (used by fsync(), by pdflush, by the VM system)
then you're right, everything is broken. But I've been using LVM
for years now, I'm assuming that's not the case. :)

> and then waits
> for the physical writes to complete.  It is only the waiting part that
> is broken.

Half-broken is broken. And the bigger issue here is not even the delay.
The issue is ordering. For a database, losing the last transactions is bad
enough; losing transactions in the middle of the timeline is even worse.
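
To make the ordering point concrete, here is a minimal sketch of a two-step
commit, assuming Linux/POSIX. The function name commit_record(), the "COMMIT"
marker and the record layout are made up for illustration; they are not taken
from any real database. The idea is only that the payload must be durable
before the marker that declares it valid.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf at the given offset; returns 0 on success, -1 on error. */
static int write_all_at(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Hypothetical commit protocol: payload first, then a commit marker.
 * Each fdatasync() acts as the ordering barrier. If it returns before the
 * data is on stable storage, the marker may hit the disk before the payload,
 * and a crash can leave a "committed" record whose payload is missing.
 */
int commit_record(int fd, off_t off, const void *payload, size_t len)
{
    const char *marker = "COMMIT\n";   /* illustrative marker, an assumption */

    if (write_all_at(fd, payload, len, off) < 0)
        return -1;
    if (fdatasync(fd) < 0)             /* barrier 1: payload before marker */
        return -1;
    if (write_all_at(fd, marker, strlen(marker), off + (off_t)len) < 0)
        return -1;
    if (fdatasync(fd) < 0)             /* barrier 2: marker before success */
        return -1;
    return 0;
}
```

If the flush only delays data, a crash loses the tail of the log. If it also
lets the two writes be reordered, the log can end up with a hole in the middle.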

For the mail subsystems, there's almost no ordering requirement; still,
losing messages is no good.

---------------

Ehm, I've decided to write a small test program. My system runs Fedora 7,
so nowhere near recent. My setup:

/home is a LV, belonging to VG 'vg_data', whose only PV is /dev/md6.
/dev/md6 is a RAID1 md device, whose members are /dev/sda10 and /dev/sdb10.
/dev/sda and /dev/sdb are both Seagate ST3320620AS SATA disks.

The filesystem is EXT3, mounted with noatime,data=ordered.

The attached program writes the same block to a file N times (looping on
lseek/write). Depending on how it's compiled, it issues an fdatasync() after
each write.
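
Since the archive does not reproduce the attachment, here is a rough sketch of
what such a loop might look like. It is a guess at the shape of the test, not
the original test.c. The function name rewrite_block() is made up, and the
sync behaviour is selected with a runtime flag here rather than at compile
time.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { BLOCK_SIZE = 8192 };   /* transaction size quoted in the results */

/*
 * Overwrite the first BLOCK_SIZE bytes of `path` `count` times.
 * If do_sync is nonzero, call fdatasync() after every write.
 * Returns the number of writes done, or -1 on any error.
 */
long rewrite_block(const char *path, long count, int do_sync)
{
    char block[BLOCK_SIZE];
    long done;
    int fd;

    memset(block, 'x', sizeof block);

    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    for (done = 0; done < count; done++) {
        /* Rewind so every iteration hits the same block. */
        if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
            break;
        if (write(fd, block, sizeof block) != (ssize_t)sizeof block)
            break;
        /* The "transaction commit": wait until the data is flushed. */
        if (do_sync && fdatasync(fd) < 0)
            break;
    }

    if (close(fd) < 0 || done != count)
        return -1;
    return done;
}
```

A main() that calls rewrite_block("testfile", 4096, 1) and is run under
time(1) would correspond to the _sync case; passing 0 as the last argument
would correspond to _nosync.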

Here are the results, for 32MB of data written:

$ time ./test_nosync

real    0m0.056s
user    0m0.004s
sys     0m0.052s

Clearly, no disk activity here.

$ time ./test_sync

real    0m2.070s
user    0m0.002s
sys     0m0.152s

Now the same after hdparm -W0 /dev/sda; hdparm -W0 /dev/sdb:

$ time ./test_sync

real    1m16.431s
user    0m0.004s
sys     0m0.273s

These are 4096 "transactions" of size 8192, without the overhead of
allocating new blocks (it writes to the same block over and over).
The first test is meaningless (the writes are never really committed).
The second test gives about 2000 transactions per second. Too many.
In the third test, I got only about 50 transactions per second,
which makes a lot of sense.
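
The rates are just the transaction count divided by the "real" time. A small
sketch of that arithmetic follows; the helper names are made up for
illustration. For scale, assuming these are 7200 rpm drives, the platter makes
120 revolutions per second, so a few tens of truly synchronous rewrites of the
same block per second is plausible, while about 2000 is not.

```c
#include <stdio.h>

/* Transactions per second for `count` transactions in `seconds` of wall time. */
double tx_per_second(long count, double seconds)
{
    return seconds > 0.0 ? (double)count / seconds : 0.0;
}

/* Print the rates for the three "real" times reported above. */
void print_rates(void)
{
    printf("nosync:          %.0f tx/s\n", tx_per_second(4096, 0.056));
    printf("sync, cache on:  %.0f tx/s\n", tx_per_second(4096, 2.070));
    printf("sync, cache off: %.1f tx/s\n", tx_per_second(4096, 76.431));
}
```

With the figures above this works out to roughly 1979 per second with the
write cache enabled and roughly 53.6 per second with it disabled.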

It seems to me that in my setup, disabling the caches on the disks does
bring data to the platters, and that no one is "lying" about fsync.

Now I'm _really_ confused.

(the following isn't meaningful for the discussion)

For the curious among you (I was), I commented out the lseek(). For the
_nosync version it's the same (1/2 a second).

For the _sync version, with -W1 I get:

$ time ./test_sync

real    0m48.816s
user    0m0.002s
sys     0m0.483s

and with -W0:

$ time ./test_sync

real    3m6.674s
user    0m0.006s
sys     0m0.526s

Since all the tests were done deleting the file each time, I think what
happens here is that the file is increasing in size, so fdatasync()
each time triggers a write of the inode. It's two writes per loop.
So I tried keeping the file around, having my test program write to
preallocated blocks.
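
One way to get preallocated blocks, sketched under the assumption that simply
writing the file out once is good enough: fill it to its final size, flush it
(data and inode), and have later runs open it without O_TRUNC. The function
name preallocate_file() is made up for illustration; in the test above the
file was just kept around from a previous run.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { PREALLOC_BLOCK = 8192 };   /* same block size as the test */

/*
 * Fill `path` with `nblocks` zeroed blocks and flush them, so that later
 * runs overwrite existing blocks instead of growing the file.
 * Returns 0 on success, -1 on error.
 */
int preallocate_file(const char *path, long nblocks)
{
    char block[PREALLOC_BLOCK];
    long i;
    int fd, rc = 0;

    memset(block, 0, sizeof block);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    for (i = 0; i < nblocks; i++) {
        if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
            rc = -1;
            break;
        }
    }

    /* fsync(), not fdatasync(): the new size and block map must go out too. */
    if (rc == 0 && fsync(fd) < 0)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

After this, a loop that writes within the existing size no longer changes the
file length, so each fdatasync() should only have the data block to flush.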

With -W1:
$ time ./test_sync

real    0m11.253s
user    0m0.001s
sys     0m0.244s

with -W0:
$ time ./test_sync

real    0m46.353s
user    0m0.005s
sys     0m0.249s

.TM.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 807 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-lvm/attachments/20090317/432352a0/attachment.bin>