[Cluster-devel] Problem in ops_address.c :: gfs_writepage ?

Mathieu Avila mathieu.avila at seanodes.com
Tue Feb 20 10:59:04 UTC 2007


On Mon, 19 Feb 2007 17:00:10 -0500,
Wendy Cheng <wcheng at redhat.com> wrote:

> Mathieu Avila wrote:
> > Hello all,
> > But precisely, in that situation, there are multiple times when
> > gfs_writepage cannot perform its duty, because of the transaction
> > lock. 

> Yes, we did have this problem in the past with direct IO and SYNC
> flag.

I understand that in that case, the data is written with the
get_transaction lock taken, so gfs_writepage never writes the pages, and
you go beyond the dirty_ratio limit if you write too many pages. How did
you get rid of it?

> > [snip]
> > we've experienced it using the test program "bonnie++", whose
> > purpose is to test FS performance. Bonnie++ creates multiple files
> > of 1 GB when it is asked to run long multi-GB writes. There is no
> > problem with 5 GB (5 files of 1 GB) but many machines in the
> > cluster are OOM killed with 10 GB bonnies....
> >   

> I would like to know more about your experiments. So these
> bonnie++(s) are run on each cluster node with independent file sets ?
> 

Exactly. I run "bonnie++ -s 10G" on each node, in different directories
of the same GFS file system. To make the problem occur reliably and
more quickly, I tune pdflush so that it is almost disabled:
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 50000 > /proc/sys/vm/dirty_writeback_centisecs
..., so that only /proc/sys/vm/dirty_ratio comes into play. Without this
tuning the problem is harder to reproduce (but still reproducible).

The problem happens only when writeback of the dirty pages of the 2nd
file starts, once writeback of the 1st file is done. We still have to
determine why. So you can get the same problem with a program that:
- opens 2 files,
- writes 1 GB to the 1st one,
- writes indefinitely to the second file.
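
As a rough illustration, the three steps above could be sketched in C
as follows. This is not the program we ran (we used bonnie++): the
1 MiB chunk size and the names write_zeros, run_repro and
REPRO_STANDALONE are made up for this sketch, and main() is only
compiled when REPRO_STANDALONE is defined.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)  /* 1 MiB per write(); arbitrary choice */

/* Write `total` zero bytes to fd in CHUNK_SIZE pieces.
 * total == 0 means "keep writing until write() fails".
 * Returns 0 on success, -1 on error. */
int write_zeros(int fd, unsigned long long total)
{
    static char buf[CHUNK_SIZE];
    unsigned long long done = 0;

    assert(fd >= 0);
    while (total == 0 || done < total) {
        size_t want = CHUNK_SIZE;
        if (total != 0 && total - done < want)
            want = (size_t)(total - done);

        ssize_t n = write(fd, buf, want);
        if (n < 0 && errno == EINTR)
            continue;            /* interrupted, retry */
        if (n <= 0)
            return -1;           /* real error (disk full, I/O error, ...) */
        done += (unsigned long long)n;
    }
    return 0;
}

/* Open both files first, fill the first one, then write to the second.
 * Passing second_bytes == 0 makes the second write run indefinitely. */
int run_repro(const char *path1, const char *path2,
              unsigned long long first_bytes,
              unsigned long long second_bytes)
{
    int fd1 = open(path1, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd1 < 0)
        return -1;
    int fd2 = open(path2, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    int rc = write_zeros(fd1, first_bytes);
    if (rc == 0)
        rc = write_zeros(fd2, second_bytes);

    close(fd1);
    close(fd2);
    return rc;
}

#ifdef REPRO_STANDALONE
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s FILE1 FILE2\n", argv[0]);
        return 2;
    }
    /* 1 GiB to the first file, then write to the second without limit. */
    if (run_repro(argv[1], argv[2], 1ULL << 30, 0) != 0) {
        perror("run_repro");
        return 1;
    }
    return 0;
}
#endif
```

Built with something like "cc -DREPRO_STANDALONE repro.c -o repro" and
run with two file names inside a directory of the GFS mount, it writes
to the second file until it is interrupted or write() fails.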

We use a particular block device that doesn't read/write at the same
speed on all nodes. I don't know whether this is relevant.

> > Keep a counter of pages in gfs_inode whose value represents those
> > not written in gfs_writepage, and at the end of do_do_write_buf,
> > call "balance_dirty_pages_ratelimited(file->f_mapping);" as many
> > times. The counter is possibly shared by multiple processes, but we
> > are assured that there is no transaction at that moment so pages
> > can be flushed, if "balance_dirty_pages_ratelimited" determines
> > that it must reclaim dirty pages. Otherwise performance is not
> > affected. 
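
To make the proposal quoted above concrete, here is a small user-space
model of it in C. It is only a toy, not a kernel patch: toy_inode,
toy_writepage, toy_balance and toy_write_end are invented stand-ins for
gfs_inode, gfs_writepage, balance_dirty_pages_ratelimited and the end
of do_do_write_buf. In particular, the real
balance_dirty_pages_ratelimited is rate limited and does not write
exactly one page per call as the stand-in does.

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant part of gfs_inode. */
struct toy_inode {
    unsigned long dirty_pages;    /* pages waiting to be written back */
    unsigned long skipped_pages;  /* writepage attempts refused in a transaction */
};

int in_transaction;               /* 1 while the "transaction lock" is held */
unsigned long balance_calls;      /* how often toy_balance() was invoked */

/* Stand-in for gfs_writepage: inside a transaction the page stays dirty
 * and no error is reported; we only count the missed attempt. */
int toy_writepage(struct toy_inode *ip)
{
    if (in_transaction) {
        ip->skipped_pages++;
        return 0;
    }
    if (ip->dirty_pages > 0)
        ip->dirty_pages--;
    return 0;
}

/* Stand-in for balance_dirty_pages_ratelimited: here it simply tries to
 * write one page per call. */
void toy_balance(struct toy_inode *ip)
{
    balance_calls++;
    toy_writepage(ip);
}

/* Stand-in for the end of the write path, after the transaction has
 * ended: call the balance function once per skipped page. */
void toy_write_end(struct toy_inode *ip)
{
    unsigned long n = ip->skipped_pages;

    assert(!in_transaction);
    ip->skipped_pages = 0;
    while (n-- > 0)
        toy_balance(ip);
}
```

In this model, three writepage attempts refused inside a transaction
lead to three balance calls once the transaction is over, so the pages
that could not be written are eventually given a chance to be flushed.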

> In general, this approach looks ok if we do have this flushing
> problem. However, GFS flush code has been embedded in glock code so I
> would think it would be better to do this within glock code. 

Here are the points we do not understand:

- I understand that the flushing is done in the glock code (when the
lock is demoted or taken by another node, isn't it?). Why isn't it
possible to let the kernel decide which pages to flush when it needs
to? For example, in this particular case, it is not a good idea to
flush pages only when the lock is lost, since the kernel itself needs
to flush pages.

- Why doesn't gfs_writepage return an error when the page cannot be
flushed?

- The balance_dirty_pages_ratelimited function is called inside
get_transaction/set_transaction (i.e. between gfs_trans_begin and
gfs_trans_end), therefore gfs_writepage should never work. Do you know
whether there is some kind of asynchronous (kiocb?) write, so that
gfs_writepage is called later by the same process? If not, what could
make gfs_writepage run sometimes inside a transaction, and sometimes
not?

- Ext3 should be affected as well, but it isn't. Is that because the
transaction lock is held for a much shorter period of time, so that
dirty pages that are not flushed while the lock is held will be
successfully flushed later?

- Some other file systems in the kernel, NTFS and ReiserFS, make
explicit calls to balance_dirty_pages_ratelimited:
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L284
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L2101
http://lxr.linux.no/source/fs/reiserfs/file.c?v=2.6.11;a=x86_64#L1351
But they also redefine some generic functions from the kernel. Maybe
they have a strong reason to do so?

> -- Wendy

Thank you for your answer,

--
Mathieu



