[Cluster-devel] Problem in ops_address.c :: gfs_writepage ?

Wendy Cheng wcheng at redhat.com
Mon Feb 19 22:00:10 UTC 2007


Mathieu Avila wrote:
> Hello all,
>
> I need advice about a bug in GFS that may also affect other filesystems
> (like ext3).
>
> The problem:
> It is possible that the function "ops_address.c :: gfs_writepage" does
> not write the page it is asked to write, because the transaction lock
> is held. In that case one might expect an error code to be returned,
> so that the kernel knows the page could not be written, but the
> function returns 0 instead. I've looked at ext3, and it does the same.
> This is valid and there is no corruption, since the page is
> "redirtied" so that it will be flushed later. Returning an error code
> is not a solution, because it is possible that no page is flushable,
> and 'sync' would also misinterpret the error code as an I/O error.
> There may be other implications, too.
>
> The problem comes when the filesystem is under heavy stress.
> I've made a test program that opens one file, writes 1 GB
> (at least more than the system's total memory), then opens a second
> file and writes as much data as it can.
> When the number of dirty pages goes beyond /proc/sys/vm/dirty_ratio,
> some pages must be flushed synchronously, so that the writer is
> throttled and the system does not run out of free clean pages.
>
> But precisely in that situation, there are many occasions where
> gfs_writepage cannot do its work because of the transaction lock.
>   
Yes, we did have this problem in the past with direct I/O and the SYNC flag.
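
For illustration, a minimal sketch of the pattern described above (the
lock test and get_block helper here are hypothetical stand-ins, not the
actual GFS code): when the transaction lock is already held, the page
is simply redirtied and 0 is returned, so the VM never learns that
nothing was actually written.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>

static int sketch_writepage(struct page *page, struct writeback_control *wbc)
{
	/* transaction_in_progress() and sketch_get_block() are made-up names */
	if (transaction_in_progress(page->mapping->host->i_sb)) {
		redirty_page_for_writepage(wbc, page);	/* flush it later */
		unlock_page(page);
		return 0;				/* no error reported */
	}
	return block_write_full_page(page, sketch_get_block, wbc);
}
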
> [snip]
> We've experienced it using the test program "bonnie++", whose purpose
> is to test FS performance. Bonnie++ creates multiple 1 GB files when
> it is asked to run long multi-GB writes. There is no problem with 5 GB
> (5 files of 1 GB), but many machines in the cluster get OOM-killed
> with 10 GB bonnie++ runs....
>   
I would like to know more about your experiments. So these bonnie++
runs are on each cluster node, with independent file sets?

> Setting more aggressive parameters for dirty_ratio and pdflush is not
> a complete solution (although the problem then happens much later or
> not at all), and it kills performance.
>
> Proposed solution:
>
> Keep a counter in gfs_inode of the pages that gfs_writepage could not
> write, and at the end of do_do_write_buf call
> "balance_dirty_pages_ratelimited(file->f_mapping);" that many times.
> The counter may be shared by multiple processes, but we are assured
> that no transaction is held at that moment, so pages can be flushed
> if "balance_dirty_pages_ratelimited" decides that dirty pages must be
> reclaimed. Otherwise performance is not affected.
>   
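To make the proposal concrete, here is a rough sketch of the suggested
accounting (the counter field and helper names below are illustrative
assumptions, not tested code):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
#include <asm/atomic.h>

/* hypothetical counter of pages gfs_writepage() had to skip; in
 * practice it would be a per-inode field in struct gfs_inode */
static atomic_t unwritten_pages = ATOMIC_INIT(0);

/* gfs_writepage() would do this when the transaction lock forces a skip */
static void note_skipped_page(struct writeback_control *wbc, struct page *page)
{
	atomic_inc(&unwritten_pages);
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
}

/* do_do_write_buf() would call this once the transaction has ended */
static void throttle_skipped_pages(struct file *file)
{
	while (atomic_read(&unwritten_pages) > 0) {
		atomic_dec(&unwritten_pages);
		/* lets the VM throttle the writer and flush dirty pages */
		balance_dirty_pages_ratelimited(file->f_mapping);
	}
}
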
In general, this approach looks ok if we do have this flushing problem.
However, the GFS flush code is embedded in the glock code, so I would
think it would be better to do this within the glock code. My cluster
nodes happen to be out at the moment. I will look into this when the
cluster is re-assembled.

-- Wendy



