[Cluster-devel] [GFS2 Patch] GFS2: Reduce file fragmentation

Bob Peterson rpeterso at redhat.com
Thu Jul 5 21:52:29 UTC 2012


----- Original Message -----
| Hi,
| 
| Some comments...
(snip)
| > +	/* Make sure the reservation rgrp is in the rlist too, because
| > +	   we need to free it. */
| > +	if (gfs2_rs_active(ip))
| > +		gfs2_rlist_add(ip, &rlist, ip->i_res->rs_start);
| > +
| I don't understand this... why do we need to add the reservation to
| the rlist here? It would be better to just drop it if we start to
| deallocate blocks since we'll probably want to start again in a
| different place if we then begin allocating again.
You're absolutely right. I added this in an earlier prototype that
had a slightly different take on claiming reserved blocks: instead of
claiming them one by one, it claimed them ALL, and if some blocks weren't
used when the reservation was freed, the unused blocks were refunded.
That favors apps that use all their reserved blocks, but it requires
one more glock on the rgrp for the refund. I forgot to remove that glock
from the current implementation; the replacement patch removes it.
 
| Why do we need to take an exclusive cluster lock here? Reservations
| are local things, so we should be able to drop them without taking
| cluster locks.

You're right. I've fixed this in the latest patch.
| I think it ought to protect more than just the reservation tree. We
| probably should be using it to protect the rest of the rgrp fields
| too, so let's just call it rd_mutex.

Renamed as per your suggestion.

| > +	struct gfs2_inode *rs_ip;     /* pointer to the inode */
| This doesn't hold a reference on the inode, so what prevents the
| inode from vanishing? It would be better to arrange to drop the
| reservations by scanning the inodes, rather than the rgrps, so that
| we don't need to store this pointer. Either that or find a way to
| allow the reservations to be dropped later when the inode is disposed
| of, and just mark them invalid when disposing of the rgrps.

Good idea. I replaced the inode pointer with the ip->i_no_addr.
Also, it makes more sense to pass the reservation pointer around
rather than the ip pointer in most cases anyway. I've adjusted the
latest patch so that the i_no_addr is only saved in the reservation
for reference: it helps a great deal in reading the new reservation
trace points. The structure is a bit big, however, so we could just
eliminate the field.
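
To make the tracing angle concrete, here is roughly the shape I have in
mind (a sketch only; the field names are illustrative, not the ones in
the patch):

#include <linux/types.h>
#include <linux/rbtree.h>

/* Illustrative only -- not the actual structure from the patch.  The
 * reservation carries the inode's block number (ip->i_no_addr) purely
 * for tracing, so no struct gfs2_inode pointer (and no inode
 * reference) needs to be held. */
struct rs_sketch {
	struct rb_node rs_node;	/* linkage in the per-rgrp reservation tree */
	u64 rs_inum;		/* copy of ip->i_no_addr, tracepoints only */
	u64 rs_start;		/* first block of the reserved extent */
	u32 rs_free;		/* reserved blocks not yet claimed */
};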
 
| I'm not sure I understand where this goal block has come from... the
| goal block should be ip->i_goal in the normal case.

What I'm doing here is keeping track of where to start looking for the
next reservation, which is immediately after the most recent reservation.
My theory is that it's more efficient to start after the last reservation
than to start where the ip left off, because if there are lots of files
open, with writes being done to many of them, there's no need to search
again the areas that have already been reserved by other writes.
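
Roughly, the bookkeeping amounts to this (a sketch only; the names are
made up for illustration and are not the ones in the patch):

#include <linux/types.h>

/* Illustrative only: the rgrp remembers where the last reservation
 * ended, and that is where the next reservation search starts, rather
 * than at each inode's own i_goal. */
struct rgrp_goal_sketch {
	u64 rd_data0;		/* first data block of the rgrp */
	u64 rd_rs_goal;		/* where the next reservation search starts */
};

static void note_reservation(struct rgrp_goal_sketch *rgd,
			     u64 rs_start, u32 rs_len)
{
	/* skip past the area just reserved for all later searches */
	rgd->rd_rs_goal = rs_start + rs_len;
}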

| > +do_search:
| > +	/* We multiply the size hint by the number of active reservations.
| > +	   That makes heavily contended resource groups grab bigger
| > +	   reservations so they don't bump into one another. */
| This doesn't tell you whether the rgrp is contended or not. It is
| possible to have multiple reservations in an rgrp which are all being
| used locally by inodes which are filling up very slowly and not to
| have any contention at all on the rgrp.

Yep. This comment was vestigial, left over from a different patch that
attempted to do ext3-like reservation size doubling. It turned out that
the scheme did more harm than good: we spent more time doing calculations
than we would have doing the straight searches. I pulled that patch (at
least temporarily) but forgot to pull this comment. I've removed it now.

| > +		/* We have to keep the reservations aligned on u64 boundaries
| > +		   otherwise we could get situations where a byte can't be
| > +		   used because it's after a reservation, but a free bit still
| > +		   is within the reservation's area. */
| I'm not sure I understand. If there are free blocks after an existing
| inode, then the reservation must start at that point and continue
| until it runs into something preventing it from extending further.
| Finding aligned blocks is ok, but only if we've already exhausted the
| possibility of adding on to an existing extent first.

The comment is a bit misleading. In fact, it does what you're saying.
The reservations are aligned to u64 boundaries and are used up entirely
(their blocks are "claimed"). When there are no more blocks to claim, a
new reservation is acquired, and the search for an appropriate place for
it begins immediately after the last reservation in that rgrp.
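
As an illustration only (assuming GFS2's two bitmap bits per block, so
one u64 of bitmap covers 32 blocks; the constant and helper names here
are made up), the alignment rule amounts to rounding reservation
boundaries to whole bitmap words:

#include <linux/types.h>

#define RS_ALIGN_BLKS 32	/* blocks covered by one u64 of 2-bit entries */

/* Round a block number down to the start of its u64 bitmap word. */
static inline u64 rs_align_start(u64 blk)
{
	return blk & ~(u64)(RS_ALIGN_BLKS - 1);
}

/* Round a block count up so a reservation spans whole u64 words,
 * leaving no word that is partly reserved and partly free. */
static inline u32 rs_align_len(u32 len)
{
	return (len + RS_ALIGN_BLKS - 1) & ~(u32)(RS_ALIGN_BLKS - 1);
}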

| > +			nonzero = memchr_inv(ptr, 0, search_bytes);
| > +			/* If the lot is all zeroes, reserve the whole size. If
| > +			   there's enough zeroes to satisfy the request, use
| > +			   what we can. If there's not enough, keep looking. */
| Why not use gfs2_bitfit to search for the first free block and then
| see how long the extent is? That seems to make more sense than
| writing something new from scratch, bearing in mind that we've
| already spent a lot of time optimising it.

The reason is efficiency. Yes, gfs2_bitfit is well optimized, but
"seeing how long the extent is", and then looping when the extent turns
out to be too small, is inefficient. memchr_inv is also optimized and
works pretty well afaict.
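
In simplified form the search step looks something like this (a sketch,
not the patch code; it uses memchr_inv() from <linux/string.h>, and
relies on the fact that an all-zero bitmap byte means four free blocks):

#include <linux/string.h>

/* Return how many leading bytes of the search window are entirely
 * free.  memchr_inv() gives the first byte that is not zero, or NULL
 * when every byte in the window is zero. */
static size_t free_bytes_at(const unsigned char *ptr, size_t search_bytes)
{
	const unsigned char *nonzero = memchr_inv(ptr, 0, search_bytes);

	if (!nonzero)			/* whole window is free: take it all */
		return search_bytes;
	return nonzero - ptr;		/* length of the leading free run */
}

If that run satisfies the request we use it; otherwise the search
continues past the nonzero byte, which is the "keep looking" case in
the comment quoted above.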
 
| > +	gfs2_print_dbg(seq, "All reservations:\n");
| This doesn't match the format for this file

I hadn't thought about making the output abbreviated, but it makes
good sense. Great idea. In my latest patch, I've reformatted the
output to be like the existing file format.

| > rr%lu",
| Missing : after rr

Good catch. Fixed in my latest patch.

| > +	TP_printk("%u,%u rg:%llu rf:%lu rr:%lu bmap %llu rs:%llu blks:%lu %s "
| > +		  "c:%llu f:%lu",
| This format is incorrect for a bmap tracepoint - it needs to match
| the common fields for the other bmap tracepoints.

Again, a good suggestion; I've implemented it in the latest patch.

I'll post the replacement patch in a separate email.

Regards,

Bob Peterson
Red Hat File Systems



