[Linux-cluster] dlm spinlock BUG

Thu Apr 19 08:04:17 UTC 2007

Hi,

On Wed, Apr 18, 2007 at 04:04:13PM +0100, Patrick Caulfield wrote:
> Jens Beyer wrote:
> >
> > I am using a vanilla 2.6.20.6 (same with 2.6.20.x).
> > 
> 
> Hmm, I'm not sure how that got left unfixed upstream
> 
> Here's the patch:
> 

The patch fixed one spinlock BUG, but now I get another one:

[  315.040936] BUG: spinlock already unlocked on CPU#1, dlm_recvd/14593
[  315.040949]  lock: ee108f64, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
[  315.040964]  [<c01d62ac>] _raw_spin_unlock+0x70/0x72
[  315.040976]  [<f0b63f09>] dlm_lowcomms_commit_buffer+0x2f/0x9a [dlm]
[  315.040998]  [<f0b5fb67>] send_rcom+0xa/0x12 [dlm]
...
This seems to be fixed in 2.6.21-rc6, from which I took the following change:

--- fs/dlm/lowcomms-tcp.c.orig  2007-04-19 09:42:53.000000000 +0200
+++ fs/dlm/lowcomms-tcp.c       2007-04-19 09:43:23.000000000 +0200
@@ -748,6 +748,7 @@
        struct connection *con = e->con;
        int users;
 
+        spin_lock(&con->writequeue_lock);
        users = --e->users;
        if (users)
                goto out;
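
For illustration only (this is not kernel code): the trace suggests that
dlm_lowcomms_commit_buffer reached a spin_unlock without a matching
spin_lock, which is what the one-line change above addresses. A minimal
userspace sketch of that imbalance follows. The toy_lock type and both
commit_* functions are hypothetical names invented for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a debug-checked lock; NOT the kernel's spinlock_t. */
struct toy_lock {
	bool locked;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->locked = true;
}

/* Returns 0 on success, -1 if the lock was not held when released. */
static int toy_lock_release(struct toy_lock *l)
{
	if (!l->locked) {
		fprintf(stderr, "BUG: toy lock already unlocked\n");
		return -1;
	}
	l->locked = false;
	return 0;
}

/* Hypothetical "before" path: releases a lock it never acquired. */
static int commit_without_lock(struct toy_lock *l)
{
	return toy_lock_release(l);
}

/* Hypothetical "after" path: acquire first, then release. */
static int commit_with_lock(struct toy_lock *l)
{
	toy_lock_acquire(l);
	return toy_lock_release(l);
}
```

Calling commit_without_lock() on a fresh lock reports the imbalance and
returns -1, while commit_with_lock() returns 0 and leaves the lock
released.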


But now it hangs during mount:

boxfe01:/home/jbe # mount -t gfs2 -v /dev/sdd1 /export/vol1
/sbin/mount.gfs2: mount /dev/sdd1 /export/vol1
/sbin/mount.gfs2: parse_opts: opts = "rw"
/sbin/mount.gfs2:   clear flag 1 for "rw", flags = 0
/sbin/mount.gfs2: parse_opts: flags = 0
/sbin/mount.gfs2: parse_opts: extra = ""
/sbin/mount.gfs2: parse_opts: hostdata = ""
/sbin/mount.gfs2: parse_opts: lockproto = ""
/sbin/mount.gfs2: parse_opts: locktable = ""
/sbin/mount.gfs2: message to gfs_controld: asking to join mountgroup:
/sbin/mount.gfs2: write "join /export/vol1 gfs2 lock_dlm boxfe:clustervol1 rw /dev/sdd1"
/sbin/mount.gfs2: message from gfs_controld: response to join request:
/sbin/mount.gfs2: lock_dlm_join: read "0"
/sbin/mount.gfs2: message from gfs_controld: mount options:
/sbin/mount.gfs2: lock_dlm_join: read "hostdata=jid=1:id=262146:first=0"
/sbin/mount.gfs2: lock_dlm_join: hostdata: "hostdata=jid=1:id=262146:first=0"
/sbin/mount.gfs2: lock_dlm_join: extra_plus: "hostdata=jid=1:id=262146:first=0"

boxfe01:/home/jbe # dmesg | tail -15
[  137.276428] GFS2 (built Apr 19 2007 09:15:21) installed
[  137.285199] Lock_DLM (built Apr 19 2007 09:15:33) installed
[  149.628806] drbd1: role( Secondary -> Primary ) 
[  149.628827] drbd1: Writing meta data super block now.
[  156.324500] GFS2: fsid=: Trying to join cluster "lock_dlm", "boxfe:clustervol1"
[  156.397920] dlm: got connection from 2
[  156.399738] dlm: clustervol1: recover 1
[  156.399792] dlm: clustervol1: add member 2
[  156.399796] dlm: clustervol1: add member 1
[  156.400514] dlm: clustervol1: config mismatch: 32,0 nodeid 2: 11,0
[  156.400519] dlm: clustervol1: ping_members aborted -22 last nodeid 2
[  156.400523] dlm: clustervol1: total members 2 error -22
[  156.400526] dlm: clustervol1: recover_members failed -22
[  156.400529] dlm: clustervol1: recover 1 error -22
[  156.404760] GFS2: fsid=boxfe:clustervol1.1: Joined cluster. Now mounting FS...
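
A note on reading the dlm lines above: kernel code reports failures as
negative errno values, and 22 is EINVAL on Linux, so "error -22" means
"Invalid argument". Together with the "config mismatch" line, this
suggests that the two nodes disagree on some lockspace setting, though the
log alone does not say which one. A small sketch for decoding such codes
in userspace follows. Both helper names are hypothetical.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: text for a kernel-style (negative) return code. */
static const char *kernel_rc_message(int rc)
{
	/* Kernel code returns -errno, so flip the sign before the lookup. */
	return strerror(rc < 0 ? -rc : rc);
}

/* Hypothetical helper: nonzero if rc is the kernel-style form of EINVAL. */
static int rc_is_einval(int rc)
{
	return rc == -EINVAL;
}
```

Passing -22 to rc_is_einval() gives a nonzero result on Linux, and
kernel_rc_message(-22) returns the C library's description of EINVAL.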

Regards,
Jens