[Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases

Thu Jan 25 11:47:04 UTC 2018

Hi,

Some further thoughts...

Whenever we find a problem related to a lock, it is a good plan to
understand where the problem actually lies: in other words, whether the
locking itself is slow, or whether some action being performed under
the lock is the issue. We have the ability to easily create histograms
of DLM lock times, and almost as easily create histograms of the glock
times (gfs2_glock_queue -> gfs2_promote). We can easily filter on glock
type (rgrp) and the lock transitions that we care about (any -> EX)
too. So it would be interesting to look at this in order to get more of
an insight into what is really going on.

Taking the raw histogram and multiplying each bucket's count by the
centre of that bucket gives us the total time spent at each lock
latency. Then it is easy to see which latencies are the ones causing
the most delay.
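
As a rough sketch of that weighting (in Python, purely for illustration):
the bucket boundaries and counts here are made up, not measured data, and
the names histogram and weighted_by_latency are hypothetical. It uses the
arithmetic centre of each bucket, which is an assumption; for log-scaled
buckets a different centre may be more appropriate.

```python
# Hypothetical histogram: (bucket_low_us, bucket_high_us, count).
# These values are invented for illustration, not real glock data.
histogram = [
    (0, 10, 5000),
    (10, 100, 800),
    (100, 1000, 40),
    (1000, 10000, 6),
]


def weighted_by_latency(buckets):
    """Return (centre, total_time) pairs, where total_time = count * centre."""
    result = []
    for low, high, count in buckets:
        centre = (low + high) / 2  # arithmetic centre of the bucket
        result.append((centre, count * centre))
    return result


totals = weighted_by_latency(histogram)

# The bucket contributing the most total delay, not the most events.
worst = max(totals, key=lambda pair: pair[1])

for centre, total in totals:
    print(f"centre {centre:8.1f} us -> total {total:10.1f} us")
```

With these invented numbers, the most frequent bucket is not the one
contributing the most total time, which is the point of weighting the
counts.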

It would also be interesting to know how long it takes to allocate and 
deallocate a block. What are the operations that take the most time? 
Unfortunately our block allocation tracepoint doesn't give us that info, 
but it is probably not that tricky to alter it so that it does.

That would give us a much more detailed picture of what is going on, I think.

Steve.