EXT2 vs. EXT3: mount w/sync or fdatasync

brian stone skye0507 at yahoo.com
Fri Mar 23 13:17:06 UTC 2007


I am currently leaning towards: 
mount in ordered mode with the dirsync option and use fsync().

That was the most consistent configuration in my performance tests.  Some of the other configs would stall in the middle of a run, hesitating for a second or two, while ordered mode with fsync() was rock solid.  Also, I think journaling the data when you are already syncing it is more than one needs.

Without going into unnecessary detail, here is a glimpse of what this application is doing.

Machine A, which I will call an app server, generates binary chunks/blocks of data ranging from 28 bytes to a maximum of 1MB.  There are multiple app servers.  The app servers need to quickly store these blocks on one of several Machine Bs, which I will call volume servers.  When a block is transferred from an app server to a volume server, it must be done reliably ... thus the need to sync.  If the volume server says, "I got that block", then it really must have it ... on disk.
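
To make the "I got that block" rule concrete, here is a minimal C sketch of what the volume server's write path could look like. The names store_block and write_all are made up for illustration, and O_EXCL assumes each block is written exactly once; neither comes from the real server code. The point is only that success is reported after fsync() returns, never before.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: store one block in a new file and return 0 only
 * after fsync() has succeeded. The caller would send "I got that block"
 * only on a 0 return. The new directory entry is assumed to be covered
 * by the dirsync mount option; without dirsync the parent directory
 * would need its own fsync().
 */
int store_block(const char *path, const void *buf, size_t len)
{
    /* O_EXCL is an assumption: a block file is never overwritten. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    if (write_all(fd, buf, len) != 0 || fsync(fd) != 0) {
        int saved = errno;
        close(fd);
        unlink(path);   /* do not leave a partial block behind */
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

If store_block() returns -1, the server would report a failure (or say nothing) so that the app server can retry on another volume server.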

>>Are you using EAs, like selinux or similar
File system permissions and security attributes are meaningless in this system.  selinux is disabled.  These blocks are not browsed by users.  I actually mount using "noatime,nodiratime,noacl,nouser_xattr".  Only the app servers have any idea what these blocks mean.  The volume server is nothing more than a dumping ground out on the network.  We even toyed with writing raw: opening a device directly with no fs and using O_DIRECT.  Not a bad idea, just a heck of a lot of work!  It is easier to fiddle with ext3 until the config is right.
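
For reference, the mount configuration I am leaning towards could be expressed through mount(2) roughly as follows. This is only a sketch: the function names are made up, the device and mount point are whatever the caller passes in, and the split between generic flags and the ext3 option string is my reading of the man page rather than something taken from the real setup.

```c
#include <sys/mount.h>

/* Generic VFS flags corresponding to "noatime,nodiratime,dirsync". */
unsigned long volume_mount_flags(void)
{
    return MS_NOATIME | MS_NODIRATIME | MS_DIRSYNC;
}

/* ext3-specific options are passed to the filesystem as a string. */
const char *volume_mount_data(void)
{
    return "data=ordered,noacl,nouser_xattr";
}

/*
 * Hypothetical wrapper: dev and dir are supplied by the caller.
 * Needs root; returns 0 on success, -1 with errno set on failure.
 */
int mount_volume(const char *dev, const char *dir)
{
    return mount(dev, dir, "ext3", volume_mount_flags(), volume_mount_data());
}
```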

So, maybe the volume servers need two fs configs: one for blocks less than 128KB and one for blocks over 128KB.  
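
If the two-config idea goes anywhere, the routing decision itself would be trivial. A sketch, with made-up names, and with the assumption that a block of exactly 128KB goes to the large-block filesystem (I have not decided that boundary case):

```c
#include <stddef.h>

/* Threshold from the idea above: 128KB. */
#define SMALL_BLOCK_LIMIT ((size_t)128 * 1024)

/* Hypothetical labels for the two filesystem configurations. */
enum volume_class { VOLUME_SMALL, VOLUME_LARGE };

/* Assumption: a block of exactly 128KB is treated as large. */
enum volume_class pick_volume(size_t block_len)
{
    return block_len < SMALL_BLOCK_LIMIT ? VOLUME_SMALL : VOLUME_LARGE;
}
```

With the current average of around 100KB, most blocks would land on the small-block filesystem.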

I tested with 1MB blocks because that would be the worst case; I wanted to know how it would perform.  The average block size is currently around 100KB.

Thanks so much for your thoughts.


Andreas Dilger <adilger at clusterfs.com> wrote:
On Mar 22, 2007  20:44 -0700, brian stone wrote:
> Machine A connects to machine B on a gigabit lan.  Machine A sends 
> 1024 1MB chunks of data; 1 GB in total. Machine B, the server, reads 
> in each 1MB chunk and writes it to a file.
> 
> NOTE: server and client are little test programs written in C.  
> 
> Machine B (Server) hardware:
> - Single (no raid) Seagate Cheetah 70G Ultra320 15K
> - Quad Opteron 870
> - 16G DDR400
> - Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8)
> 
> Sync methods include:
> 1. mount with sync option
>   - tried sync,dirsync which added no additional overhead
> 2. use O_SYNC open() flag
> 3. use fdatasync() just before closing the file
>   - fsync() and fdatasync() produced the same results
> 
> 
> EXT2 tests
> ==========================================
> No sync                     12.3 seconds  (83 MB/Sec)
> mount=sync                  44.3 seconds  (23 MB/Sec)
> O_SYNC                      31.7 seconds  (32 MB/Sec)
> fdatasync()                 31.3 seconds  (32 MB/Sec)
> 
> 
> EXT3 tests
> ===========================================
> No sync data=writeback      14.5 seconds  (70 MB/Sec)
> No sync data=ordered        17 seconds    (60 MB/Sec)
> No sync data=journal        65 seconds    (15 MB/Sec)
> data=ordered O_SYNC         49 seconds    (20 MB/Sec)
> data=ordered,sync           52 seconds    (19 MB/Sec)
> data=ordered fdatasync()    45.5 seconds  (22 MB/Sec)
> data=journal O_SYNC         72.5 seconds  (14 MB/Sec)
> data=journal,sync           81 seconds    (12 MB/Sec)
> data=journal fdatasync()    60.5 seconds  (17 MB/Sec)

If you are doing a large number of 1MB writes then I agree that
data=journal is probably not the way to go because it means you
can get at most 1/2 of the bandwidth of the disk (unless you
create the journal on a separate disk).  data=journal is good
for small writes and lots of transactions, like mail servers
that need lots of sync operations.

For large writes, I'd suggest you put the journal on a separate
device, and make it 1 or 2 GB (your server has plenty of RAM,
so that isn't a problem).  Are you using EAs, like selinux or
similar?  If yes, then you should also format your filesystem
with large inodes (-I 256).

You may also want to try out ext4dev with the mballoc and delalloc
patches from Alex Tomas, as this code has been optimized for
doing large power-of-two allocations in the filesystem.  They've
been posted to the ext4-devel lists a couple of times.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



 