[Cluster-devel] [PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap

Bob Peterson rpeterso at redhat.com
Wed Oct 22 01:09:55 UTC 2014


Hi,

This is my sixth rework of this patch. The problem with the previous
version is that the underlying file system's block_map function
may set the buffer_head's b_state to 0, which fiemap could misinterpret
as meaning the hole size has been returned in b_size.
This version implements a suggestion from Steve Whitehouse: it
improves on the design by (unfortunately) using two flags: one to
tell block_map to return the hole size if possible, and another to
tell fiemap that the hole size has been returned. This way there is
no possibility of a misunderstanding.

The problem:
If you do a fiemap operation on a very large sparse file, it can take
an extremely long time (we're talking days here) because function
__generic_block_fiemap does a block-by-block search when it
encounters a hole.

The solution:
Allow the underlying file system to return the hole size so that
function __generic_block_fiemap can quickly skip the hole. I have a
companion patch to GFS2 that takes advantage of the new flags to
speed up fiemap on sparse GFS2 files. Other file systems can do the
same as they see fit. For GFS2, the time it takes to skip a 1PB hole
in a sparse file goes from days to milliseconds.

Patch description:
This patch changes function __generic_block_fiemap so that it sets the new
want_holesize bit before calling the file system's block_map function. The
new bit signals to the underlying file system that, if a hole is encountered
at the requested block, it should (if possible) return the size of that hole.
If the block_map function encounters a hole and sets the got_holesize bit,
fiemap takes the returned b_size to be the size of the hole, in bytes. It
then skips the hole and moves on to the next block. This may be repeated
several times in a row, especially for large holes, due to possible
limitations of the fs-specific block_map function, but it is still much
faster than trying each block individually. If the block_map function does
not set got_holesize, the request for the hole size has been ignored, and
fiemap falls back to today's method of doing a block-by-block search for the
next valid block.
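
For illustration, here is a minimal sketch of how a file system's block_map
(get_block) function might honor the new flags. This is not the GFS2
companion patch: examplefs_get_block, examplefs_find_extent and the extent
lookup are made-up names for the example; only buffer_want_holesize(),
set_buffer_got_holesize() and map_bh() are real helpers (the first two are
added by this patch).

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Made-up helper: look up the extent (or hole) containing lblock. */
static int examplefs_find_extent(struct inode *inode, sector_t lblock,
                                 sector_t *pblock, unsigned long *nr_blocks);

/* Hypothetical fs-specific get_block; fiemap calls it with create == 0. */
static int examplefs_get_block(struct inode *inode, sector_t lblock,
                               struct buffer_head *bh_map, int create)
{
        sector_t pblock;                /* physical block, 0 means hole */
        unsigned long nr_blocks;        /* length of the extent or hole */
        int ret;

        ret = examplefs_find_extent(inode, lblock, &pblock, &nr_blocks);
        if (ret)
                return ret;

        if (pblock) {
                /* Mapped extent: behave exactly as before this patch. */
                map_bh(bh_map, inode->i_sb, pblock);
                bh_map->b_size = nr_blocks << inode->i_blkbits;
                return 0;
        }

        /*
         * Hole. If the caller set want_holesize, return the hole size
         * in b_size (in bytes) and set got_holesize so fiemap knows
         * b_size is valid; the buffer is left unmapped either way.
         */
        if (buffer_want_holesize(bh_map)) {
                bh_map->b_size = nr_blocks << inode->i_blkbits;
                set_buffer_got_holesize(bh_map);
        }
        return 0;
}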

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpeterso at redhat.com> 
---
 fs/ioctl.c                  | 7 ++++++-
 include/linux/buffer_head.h | 4 ++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad..9c07a40 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -291,13 +291,18 @@ int __generic_block_fiemap(struct inode *inode,
 		memset(&map_bh, 0, sizeof(struct buffer_head));
 		map_bh.b_size = len;
 
+		set_buffer_want_holesize(&map_bh);
 		ret = get_block(inode, start_blk, &map_bh, 0);
 		if (ret)
 			break;
 
 		/* HOLE */
 		if (!buffer_mapped(&map_bh)) {
-			start_blk++;
+			if (buffer_got_holesize(&map_bh))
+				start_blk += logical_to_blk(inode,
+							    map_bh.b_size);
+			else
+				start_blk++;
 
 			/*
 			 * We want to handle the case where there is an
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..fc8e47c 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,8 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_Want_Holesize, /* Return hole size if possible */
+	BH_Got_Holesize, /* Hole size has been returned */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -128,6 +130,8 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
+BUFFER_FNS(Want_Holesize, want_holesize)
+BUFFER_FNS(Got_Holesize, got_holesize)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
-- 
1.9.3



