[Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps

Bob Peterson rpeterso at redhat.com
Tue Oct 21 12:30:22 UTC 2014


----- Original Message -----
> I assume that you are trying to get the number of nodes here? I'm not
> sure that this is a good way to do that. I would expect that with older
> userspace, num_slots might indeed be 0 so that is something that needs
> to be checked. Also, I suspect that you want to know how many nodes
> could be in the cluster, rather than how many there are now, otherwise
> there will be odd results when mounting the cluster.
> 
> Counting the number of journals would be simpler I think, and less
> likely to give odd results.
(snip)
My original version used the number of journals, which is fairly easy to
determine. The problem is that customers often allocate extra journals to
their file system, anticipating that they will add more nodes in the future.
Case in point: our own performance group, which has a four-node cluster but
allocated five journals at mkfs time. Doing so tends to leave large gaps
that will never be used until space gets low, and then it's chaos, with all
the nodes trying to use those shunned rgrps at the same time.
I know people don't need to do that anymore, and it's a carry-over from
the GFS1 days, but people still do it.

I don't know of a better way to determine the number of nodes. The DLM
would know, but it doesn't share that information in any way other than
through the recovery code that I'm currently using with this patch.

I'm open to suggestions if there's a better way.

> This existing gfs2_rgrp_congested() function should be giving the answer
> as to which rgrp should be preferred, so the question is whether that is
> giving the wrong answer for some reason? I think that needs to be looked
> into and fixed if required,

The trouble is this:
The gfs2_rgrp_congested() function tells you whether the rgrp is congested
at any given moment in time, and that is highly variable. What tends to
happen is that all the nodes create a bunch of files in a haphazard fashion
as part of initialization. At that time, each node (accurately) sees that
there is _currently_ no congestion, so they all decide to use rgrp X.
They all make big multi-block reservations in rgrp X, and then they all
proceed to fight over who has the lock for rgrp X. There are two reasons:
(a) when the initial files are set up, there are too few samples to gauge
congestion with any accuracy, and (b) there really ISN'T any contention
during setup, because no one has begun to do any serious writing: there's
a trickle-in effect. The problem is that once you've chosen a rgrp, you
tend to stick with it, due to reservations and due to the way "goal blocks"
work, both of which preempt searching for a different rgrp.
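
To make that ordering concrete, here is a minimal, self-contained sketch
(plain C, compilable on its own) of the selection behaviour described
above. It is NOT the actual GFS2 allocator; the types and helpers
(alloc_ctx, least_congested_rgrp(), choose_rgrp()) are hypothetical
stand-ins for the real code paths:

#include <stdio.h>

struct alloc_ctx {
	int reserved_rgrp;	/* rgrp pinned by a multi-block reservation, or -1 */
	int goal_rgrp;		/* rgrp implied by the goal block, or -1 */
};

/* Stand-in for a congestion-based search (gfs2_rgrp_congested() in real
   GFS2): early on, every node tends to get the same "uncongested" answer. */
static int least_congested_rgrp(void)
{
	return 0;
}

static int choose_rgrp(const struct alloc_ctx *ctx)
{
	if (ctx->reserved_rgrp >= 0)	/* an existing reservation wins first */
		return ctx->reserved_rgrp;
	if (ctx->goal_rgrp >= 0)	/* then the goal block */
		return ctx->goal_rgrp;
	return least_congested_rgrp();	/* search only as a last resort */
}

int main(void)
{
	struct alloc_ctx fresh = { -1, -1 };	/* node that hasn't chosen yet */
	struct alloc_ctx sticky = { 7, 7 };	/* node already settled on rgrp 7 */

	printf("fresh node picks rgrp %d\n", choose_rgrp(&fresh));
	printf("sticky node stays on rgrp %d\n", choose_rgrp(&sticky));
	return 0;
}

Once the reservation or goal block is in place, the congestion check never
even runs, which is why the early "everyone picks rgrp X" decision persists.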

Ordinarily, you would think the problem would get better (and therefore
faster) over time, because there are more samples and better information
about which rgrps really are congested, but in practice it doesn't work
like that. All the nodes continue to fight over the same rgrps. I suspect
this is because, in many use cases, the workload is evenly distributed
across the worker nodes, so they all go through phases of (1) setup,
(2) analysis of data, and (3) writing, and because of that even
distribution they often hit the same phases at roughly the same times.

Experience has shown (both in GFS1 in years past and in GFS2) that
letting each node pick a unique subset of rgrps results in the least
amount of contention.
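
For what it's worth, the basic idea can be sketched in a few lines of
plain, self-contained C. This is an illustration only, not the patch
itself; rgrp_is_preferred(), node_index and node_count are hypothetical
stand-ins for the real rgrp list and the per-node journal/slot number.
Each node takes a round-robin stripe of the rgrps, so the nodes start out
working in disjoint regions of the file system:

#include <stdio.h>

/* Node n prefers rgrps n, n + node_count, n + 2 * node_count, ... */
static int rgrp_is_preferred(unsigned int rgrp_index,
			     unsigned int node_index,
			     unsigned int node_count)
{
	return (rgrp_index % node_count) == node_index;
}

int main(void)
{
	const unsigned int rgrp_count = 12, node_count = 4;
	unsigned int node, r;

	for (node = 0; node < node_count; node++) {
		printf("node %u prefers rgrps:", node);
		for (r = 0; r < rgrp_count; r++)
			if (rgrp_is_preferred(r, node, node_count))
				printf(" %u", r);
		printf("\n");
	}
	return 0;
}

With 12 rgrps and 4 nodes, node 0 would prefer rgrps 0, 4 and 8, node 1
would prefer 1, 5 and 9, and so on; in this sketch the non-preferred rgrps
would still be usable once a node's preferred set fills up.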

Regards,

Bob Peterson
Red Hat File Systems



