[Linux-cluster] Deadlock detection in libdlm

Tue Jan 25 20:01:00 UTC 2011

I've been trying to make use of deadlock detection in libdlm, but
without any luck so far. I'm hoping someone can tell me what I'm doing
wrong, or how to debug this further.

My test code looks like this:

#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define _REENTRANT
#include <libdlm.h>

void lock(struct dlm_lksb *l, const char *name, int mode) {
  printf("[%d] Attempting to lock %s, mode %d\n",getpid(),name,mode);
  int status = dlm_lock_wait(LKM_NLMODE, l, LKF_EXPEDITE, name, strlen(name),
                            0, NULL, NULL, NULL);
  if(status != 0) abort();

  status = dlm_lock_wait(mode, l, LKF_CONVERT | LKF_CONVDEADLK, name,
                         strlen(name), 0, NULL, NULL, NULL);
  if(status == 0) status = l->sb_status;

  printf("[%d] Status was %d\n",getpid(),status);
}

int main(void) {

  pid_t pid = fork();

  if(pid == 0) { // child process
    if(dlm_pthread_init() != 0) abort();

    struct dlm_lksb l1,l2;
    memset(&l1,0,sizeof(l1));
    memset(&l2,0,sizeof(l2));

    lock(&l1,"A",LKM_PRMODE);

    lock(&l2,"B",LKM_EXMODE);

    dlm_unlock_wait(l1.sb_lkid,0,&l1);
    dlm_unlock_wait(l2.sb_lkid,0,&l2);
    return EXIT_SUCCESS;
  } else { // parent process
    if(dlm_pthread_init() != 0) abort();

    struct dlm_lksb l1,l2;
    memset(&l1,0,sizeof(l1));
    memset(&l2,0,sizeof(l2));

    lock(&l1,"B",LKM_PRMODE);

    sleep(5); // wait to ensure child has grabbed A
    lock(&l2,"A",LKM_EXMODE);

    dlm_unlock_wait(l2.sb_lkid,0,&l2);
    dlm_unlock_wait(l1.sb_lkid,0,&l1);
  }

  return EXIT_SUCCESS;
}

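This should cause a classic deadlock: the parent holds B in PR mode and
is waiting to convert its second lock on A to EX, which is blocked by
the child's PR lock on A. The child holds A in PR mode and is waiting
to convert its second lock on B to EX, which is blocked by the parent's
PR lock on B.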
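
To make the cycle I'm expecting concrete, here is a minimal sketch of
the wait-for relationship between the two processes. It is only a
model: has_wait_cycle() and the waits_for array are hypothetical names
of my own, nothing here calls libdlm, and it says nothing about how the
DLM itself looks for deadlocks.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model, not part of libdlm.
 * waits_for[i] is the index of the process that process i is blocked
 * on, or -1 if process i is not blocked.
 *
 * Returns true if following the waits_for chain from `start` leads
 * back to `start`, i.e. `start` is part of a wait cycle.
 */
bool has_wait_cycle(const int *waits_for, size_t n, size_t start)
{
    size_t cur = start;

    /* A cycle through `start` has at most n edges, so n steps suffice. */
    for (size_t steps = 0; steps < n; steps++) {
        int next = waits_for[cur];

        if (next < 0 || (size_t)next >= n)
            return false;          /* chain ends: cur is not blocked */
        if ((size_t)next == start)
            return true;           /* came back to where we started */
        cur = (size_t)next;
    }
    return false;                  /* start is not on any cycle */
}
```

With the parent as index 0 and the child as index 1, waits_for = {1, 0}
describes the situation above, and has_wait_cycle(waits_for, 2, 0)
returns true.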

From the manpage, I would expect this to be detected and resolved by
one of the lock requests being refused:

"Return values
      *snip*
       EDEADLOCK       The lock operation is causing a deadlock and has been
                       cancelled. If this was a conversion then the lock is
                       reverted to its previously granted state. If it was a
                       new lock then it has not been granted. (NB Only
                       conversion deadlocks are currently detected)"

But instead, both processes hang indefinitely until I kill them:

$ ./a.out
[27986] Attempting to lock A, mode 3
[27985] Attempting to lock B, mode 3
[27986] Status was 0
[27986] Attempting to lock B, mode 5
[27985] Status was 0
[27985] Attempting to lock A, mode 5
<hangs here>

Here's the output of lockdump:

$ /sbin/dlm_tool lockdump default
id 01aa0005 gr PR rq IV pid 27986 master 2 "A"
id 034f0004 gr NL rq EX pid 27985 master 2 "A"
id 03630001 gr PR rq IV pid 27985 master 4 "B"
id 02070004 gr NL rq EX pid 27986 master 4 "B"

and lockdebug:

$ /sbin/dlm_tool lockdebug default

Resource ffff810c1f02c080 Name (len=1) "A"
Local Copy, Master is node 2
Granted Queue
01aa0005 PR Master:     03b80003
Conversion Queue
034f0004 NL (EX) Master:     02310005
Waiting Queue

Resource ffff810c1f02cc80 Name (len=1) "B"
Local Copy, Master is node 4
Granted Queue
03630001 PR Master:     030c0001
Conversion Queue
02070004 NL (EX) Master:     03530003
Waiting Queue
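
In case it helps anyone compare dumps, the lockdump lines shown earlier
can be picked apart with a small parser. This is a rough sketch only:
the field layout is inferred purely from the four lines in this post
(not from any documented format), struct dump_entry, parse_dump_line()
and is_blocked() are hypothetical names of my own, and I am assuming
that "rq IV" means no request is pending.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one lockdump line; not a libdlm type. */
struct dump_entry {
    unsigned int id;     /* lock id, printed in hex */
    char granted[3];     /* granted mode, e.g. "PR" or "NL" */
    char requested[3];   /* requested mode, "IV" assumed to mean none */
    int pid;             /* owning process */
    int master;          /* master node id */
    char name[64];       /* resource name without the quotes */
};

/* Parse a line such as:
 *   id 034f0004 gr NL rq EX pid 27985 master 2 "A"
 * Returns true only if all six fields were read. */
bool parse_dump_line(const char *line, struct dump_entry *e)
{
    int n = sscanf(line,
                   "id %x gr %2s rq %2s pid %d master %d \"%63[^\"]\"",
                   &e->id, e->granted, e->requested,
                   &e->pid, &e->master, e->name);
    return n == 6;
}

/* Assumption: any requested mode other than "IV" is still pending. */
bool is_blocked(const struct dump_entry *e)
{
    return strcmp(e->requested, "IV") != 0;
}
```

Run over the dump above, this marks 034f0004 (pid 27985, on "A") and
02070004 (pid 27986, on "B") as blocked, which matches the two entries
on the conversion queues.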

The machine I'm using is running RHEL5.