[Cluster-devel] Question on LVB when the node that held EX lock crash

Wed Nov 30 09:07:22 UTC 2016

Hi David,

On 11/16/2016 11:08 PM, David Teigland wrote:
>>> convert(R1, EX)
>>> get LVB
>>> Question: what is the LVB then? x or y?
>>> ======
>>>
>>> Is this a valid question? Or am I missing something?
> It's a good question, and it's been enough years that the details are now
> hazy.  I think the current behavior emulates the original VMS dlm model
> fairly well, so any documentation you can find on that may help.
>
> If you look at the recover_lvb() comment, you'll see a little information
> about this.  The LVB can be lost in a crash in some fairly common cases,
> and in those cases, the dlm should set the VALNOTVALID flag to tell the
> application that the LVB may be lost/stale.  So, an application cannot
> rely entirely on the LVB, and must be able to go without it, or
> reconstruct the value, i.e. the LVB data is usually used as part of an
> optimization (e.g. caching).
>
> The two cases mentioned in that comment are:
>
> 1. if N1 was R1 master when N2 crashed: N1 would purge the EX lock from
> N2, and set VALNOTVALID on R1, because the latest LVB from N2 was never
> seen by N1.
>
> 2. if N2 was R1 master when N2 crashed: N1 would become the new R1 master
> (if it kept a NL lock on it), and would set VALNOTVALID because it doesn't
> know if N2 had any EX locks from other nodes that might have also crashed,
> or an LVB that had been updated since N1 last saw it.
I have a few doubts about the patch below and hope you can clarify them ;-)

1) The commit:
"""
commit da8c66638ae684c99abcb30e89d2803402e7ca20
Author: David Teigland <teigland at redhat.com>
Date:   Thu Nov 15 15:01:51 2012 -0600

     dlm: fix lvb invalidation conditions

     When a node is removed that held a PW/EX lock, the
     existing master node should invalidate the lvb on the
     resource due to the purged lock.

     Previously, the existing master node was invalidating
     the lvb if it found only NL/CR locks on the resource
     during recovery for the removed node.  This could lead
     to cases where it invalidated the lvb and shouldn't
     have, or cases where it should have invalidated and
     didn't.

     When recovery selects a *new* master node for a
     resource, and that new master finds only NL/CR locks
     on the resource after lock recovery, it should
     invalidate the lvb.  This case was handled correctly
     (but was incorrectly applied to the existing master
     case also.)

     When a process exits while holding a PW/EX lock,
     the lvb on the resource should be invalidated.
     This was not happening.

     The lvb contents and VALNOTVALID flag should be
     recovered before granting locks in recovery so that
     the recovered lvb state is provided in the callback.
     The lvb was being recovered after the lock was granted.

     Signed-off-by: David Teigland <teigland at redhat.com>
"""

2) The code snippet that I cannot understand:
"""
@@ -852,12 +868,19 @@ void dlm_recover_rsbs(struct dlm_ls *ls)
                 if (is_master(r)) {
                         if (rsb_flag(r, RSB_RECOVER_CONVERT))
                                 recover_conversion(r);
+
+                       /* recover lvb before granting locks so the updated
+                          lvb/VALNOTVALID is presented in the completion */
+                       recover_lvb(r);
+
                         if (rsb_flag(r, RSB_NEW_MASTER2))
                                 recover_grant(r);
-                       recover_lvb(r);
                         count++;
+               } else {
+                       rsb_clear_flag(r, RSB_VALNOTVALID);
                 }
"""

3) Questions:

a. Should recover_lvb() be called even before recover_conversion()? If not, why?
b. Why should we clear the RSB_VALNOTVALID flag in the else branch?

Best Regards,
Eric

>
> Dave
>