[Linux-cluster] DLM behavior after lockspace recovery

Fri Oct 8 12:03:32 UTC 2004

Friday, October 8, 2004, 3:14:25 AM, David Teigland wrote:

> On Thu, Oct 07, 2004 at 07:26:35AM -0400, Jeff wrote:

>> My preference would be that it has the most current copy from
>> the surviving members. If the nodes keep track of the change count,
>> this would be the copy with the highest value. An alternative,
>> although I suspect this is more difficult to implement, would be for
>> each surviving node to return the VALNOTVALID status until it writes
>> the lock value block. In this case after one node has written the value
>> block it would be important that the current, valid, value is used.
>> 
>> Here's the problem with simply resetting the value block to zero.
>> We're using the value block as a counter to track whether a block
>> on disk has changed or not. Each cluster member keeps a copy of the
>> value block counter in memory along with the associated disk block.
>> When a process converts a NL lock to a higher mode it reads the
>> current copy of the value block to decide whether it needs to re-read
>> the block from disk.
>> 
>> When the lock request completes with VALNOTVALID as a status the
>> process knows that it needs to re-read the block from disk. The big
>> question though is what does it write into the lock value block at
>> that point so the other systems will know this as well. If the lock
>> value block is guaranteed to have the most recent value seen by the
>> existing nodes then the process can simply increment the value and
>> it will know that the result will not match what any other system has
>> cached. If the lock value block is zeroed or set to an arbitrary
>> value from any one of the surviving nodes, then it might be a value
>> which is lower than exists on one or more of the nodes. There are ways
>> we can deal with this but it means more bookkeeping.
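
A minimal sketch of the counter scheme described above, in Python. The names `CachedBlock`, `needs_reread`, and `next_counter` are hypothetical, chosen for illustration only; they are not part of any DLM API.

```python
from dataclasses import dataclass


@dataclass
class CachedBlock:
    data: bytes
    counter: int  # LVB counter value seen when `data` was last read from disk


def needs_reread(cached: CachedBlock, lvb_counter: int, valnotvalid: bool) -> bool:
    """Decide whether the cached disk block must be re-read after converting
    a NL lock to a higher mode (hypothetical helper)."""
    if valnotvalid:
        # The LVB contents cannot be trusted, so the cache cannot be validated.
        return True
    return cached.counter != lvb_counter


def next_counter(lvb_counter: int) -> int:
    """Value to write into the LVB after the block has changed."""
    return lvb_counter + 1
```

This also shows the hazard: if the LVB were zeroed after recovery and a writer stored 0 + 1 = 1, a node that had cached the block at counter 1 would see a matching counter and wrongly keep its stale copy. Incrementing from the highest value seen by any surviving node avoids that collision.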

> That makes sense.  Here's an outline of LVB recovery.  While recovering
> resource R on node N:

> - If N was the master of R before recovery, we just leave R's LVB contents
>   as they are.  (We are certain this LVB was the most recent one written.)

> - If N is a new master of R (assigned during recovery) we rebuild 
>   R's locks from remaining nodes, then:

>   o If any of the locks has mode > CR, we take the LVB from that lock
>     as the copy for R.  (We are certain this is the most recent LVB
>     that was written.)

>   o If all locks on R have mode <= CR, we cannot know if any of the
>     LVBs on the remaining locks represent R's last LVB prior to
>     recovery.  We can, however, pick the most recent copy from the
>     remaining locks by using LVB sequence numbers.  (This is the part
>     we don't do now.)

> Lock_dlm can use the VALNOTVALID flag to zero the LVB in this last case as
> GFS requires.
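
The outline above can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the actual DLM recovery code: `Lock`, `lvb_seq`, and `recover_lvb` are hypothetical names, and the handling of a resource with no remaining locks is an assumption the outline does not cover.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# DLM lock modes in increasing order of strength.
NL, CR, CW, PR, PW, EX = range(6)


@dataclass
class Lock:
    mode: int
    lvb: bytes
    lvb_seq: int  # hypothetical LVB sequence number kept with each copy


def recover_lvb(was_master: bool, current_lvb: bytes,
                locks: List[Lock]) -> Tuple[Optional[bytes], bool]:
    """Return (lvb, valnotvalid) for resource R while recovering on node N."""
    if was_master:
        # N was already the master: its LVB is the most recent one written.
        return current_lvb, False
    for lk in locks:
        if lk.mode > CR:
            # A lock above CR carries the most recent LVB that was written.
            return lk.lvb, False
    if not locks:
        # Assumption: nothing to rebuild from, so report the LVB as invalid.
        return None, True
    # All locks <= CR: pick the newest copy, but mark it as not known valid.
    newest = max(locks, key=lambda lk: lk.lvb_seq)
    return newest.lvb, True
```

With the flag returned in the last case, a caller such as lock_dlm could choose to zero the LVB, while other applications could keep the newest surviving copy.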

And in step #3 (the last case, where all locks on R have mode <= CR)
the resource is marked VALNOTVALID, which is sent across with
subsequent grants until the lock value block is written.
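
A minimal sketch of that behavior, in Python, assuming the flag lives on the resource and is cleared by the first LVB write. The `Resource` class and its method names are hypothetical, for illustration only.

```python
class Resource:
    """Hypothetical model of a resource whose LVB may be marked invalid."""

    def __init__(self, lvb: bytes, valnotvalid: bool = False):
        self.lvb = lvb
        self.valnotvalid = valnotvalid

    def grant(self):
        # Every grant reports the current LVB along with the flag.
        return self.lvb, self.valnotvalid

    def write_lvb(self, new_lvb: bytes) -> None:
        # Writing the LVB makes it valid again for later grants.
        self.lvb = new_lvb
        self.valnotvalid = False
```

Here every holder granted the lock before the first write sees VALNOTVALID and knows to re-read the block from disk; holders granted after the write see a valid value.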