[Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

Wed Dec 21 15:40:48 UTC 2011

On Wed, Dec 21, 2011 at 10:45:21AM +0000, Steven Whitehouse wrote:
> I don't think I understand whats going on in that case. What I thought
> should be happening was this:
> 
>  - Try to get mounter lock in EX
>    - If successful, then we are the first mounter so recover all
>      journals
>    - Write info into LVB
>    - Drop mounter lock to PR so other nodes can mount
> 
>  - If failed to get mounter lock in EX, then wait for lock in PR state
>    - This will block until the EX lock is dropped to PR
>    - Read info from LVB
> 
> So a node with the mounter lock in EX knows that it is always the first
> mounter and will recover all journals before demoting the mounter lock
> to PR. A node with the mounter lock in PR may only recover its own
> journal (at mount time).

I previously used one lock similar to that, but had to change it a bit.
I had to split it across two separate locks, called control_lock and
mounted_lock.  There need to be two because of two conflicting requirements.

The control_lock lvb is used to communicate the generation number and jid
bits.  Writing the lvb requires an EX lock, and EX prevents others from
continually holding a PR lock.  Without mounted nodes continually holding
a PR lock we can't use EX to indicate first mounter.

So, the mounted_lock (no lvb) is used to indicate the first mounter.
Here all mounted nodes continually hold a PR lock, and a mounting node
attempts to get an EX lock, so any node to get an EX lock is the first
mounter.

(I previously used control_lock with "zero lvb" to indicate first mounter,
but there are some fairly common cases where the lvb may not be zero when
we need a first mounter.)

Now back to the reason why we need to retry lock requests and can't just
block.  It's not related to the first mounter case.  When a node mounts,
it needs to wait for other (previously mounted) nodes to update the
control_lock lvb with the latest generation number, and then it also needs
to wait for any bits set in the lvb to be cleared.  i.e. it needs to wait
for any unrecovered journals to be recovered before it finishes mounting.

To do this, it needs to wait in a loop reading the control_lock lvb.  The
question is whether we want to add some sort of delay to that loop or not,
and how.  msleep_interruptible(), schedule_timeout_interruptible(),
something else?

Dave