[Linux-cluster] gfs2, kvm setup

J. Bruce Fields bfields at fieldses.org
Sun Jul 6 21:51:05 UTC 2008


On Fri, Jun 27, 2008 at 01:41:17PM -0500, David Teigland wrote:
> On Fri, Jun 27, 2008 at 01:28:56PM -0400, david m. richter wrote:
> > 	i also have another setup in vmware; while i doubt it's 
> > substantively different than bruce's, i'm a ready and willing tester.  is 
> > there a different branch (or repo, or just a stack of patches somewhere) 
> > that i should/could be using?
> 
> If on 2.6.25, then use
> 
>   ftp://ftp%40openais%2Eorg:downloads@openais.org/downloads/openais-0.80.3/openais-0.80.3.tar.gz
>   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.04.tar.gz
> 
> If on 2.6.26-rc, then you'll need to add the attached patch to cluster.

I tried that patch against STABLE2, and needed the following to get it
to compile.

diff --git a/group/gfs_controld/plock.c b/group/gfs_controld/plock.c
index 5e4f56b..f04a6b8 100644
--- a/group/gfs_controld/plock.c
+++ b/group/gfs_controld/plock.c
@@ -790,7 +790,7 @@ static void write_result(struct mountgroup *mg, struct dlm_plock_info *in,
 		in->fsid = mg->associated_ls_id;
 
 	in->rv = rv;
-	write(control_fd, in, sizeof(struct gdlm_plock_info));
+	write(control_fd, in, sizeof(struct dlm_plock_info));
 }
 
 static void do_waiters(struct mountgroup *mg, struct resource *r)
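
(For context on the rename: in 2.6.26 the plock handling moved from lock_dlm into the dlm proper, so the old gdlm_plock_info type is gone from the headers and the sizeof() no longer compiles.  For reference, the structure gfs_controld writes back to the control device looks roughly like this -- I'm quoting include/linux/dlm_plock.h from memory, so verify against your tree:)

/* from memory of include/linux/dlm_plock.h (2.6.26);
 * __u32 and friends come from <linux/types.h> */
struct dlm_plock_info {
	__u32 version[3];	/* DLM_PLOCK_VERSION_* handshake */
	__u8 optype;		/* DLM_PLOCK_OP_LOCK/UNLOCK/GET */
	__u8 ex;
	__u8 wait;
	__u8 flags;
	__u32 pid;
	__s32 nodeid;
	__s32 rv;		/* result written back by gfs_controld */
	__u32 fsid;
	__u64 number;
	__u64 start;
	__u64 end;
	__u64 owner;
};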

I built everything with debugging turned on.  The second mount again
hangs, with a lot of this in the logs:

Jul  1 14:06:42 piglet2 kernel: dlm: connecting to 1
Jul  1 14:06:42 piglet2 kernel: dlm: connect from non cluster node
Jul  1 14:06:42 piglet2 kernel: dlm: connect from non cluster node
Jul  1 14:08:35 piglet2 kernel: INFO: task mount.gfs2:6130 blocked for more than 120 seconds.
Jul  1 14:08:35 piglet2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  1 14:08:35 piglet2 kernel: mount.gfs2    D c09f0244  1896  6130   6129
Jul  1 14:08:35 piglet2 kernel:        ce920bc4 00000046 ce9d28e0 c09f0244 6f5e11cb 00000621 ce9d2b40 ce9d2b40 
Jul  1 14:08:35 piglet2 kernel:        00000046 cf167db8 ce9d28e0 0077d2a4 00000000 6fd5e46f 00000621 ce9d28e0 
Jul  1 14:08:35 piglet2 kernel:        00000003 ce9e7874 00000002 7fffffff ce920bec c063cdc5 7fffffff ce920be0 
Jul  1 14:08:35 piglet2 kernel: Call Trace:
Jul  1 14:08:35 piglet2 kernel:  [<c063cdc5>] schedule_timeout+0x75/0xb0
Jul  1 14:08:35 piglet2 kernel:  [<c0138ccd>] ? trace_hardirqs_on+0x9d/0x110
Jul  1 14:08:35 piglet2 kernel:  [<c063c60e>] wait_for_common+0x9e/0x110
Jul  1 14:08:35 piglet2 kernel:  [<c0116340>] ? default_wake_function+0x0/0x10
Jul  1 14:08:35 piglet2 kernel:  [<c063c712>] wait_for_completion+0x12/0x20
Jul  1 14:08:35 piglet2 kernel:  [<c01bdf06>] dlm_new_lockspace+0x766/0x7f0
Jul  1 14:08:35 piglet2 kernel:  [<c03b9734>] gdlm_mount+0x304/0x430
Jul  1 14:08:35 piglet2 kernel:  [<c03a7bcf>] gfs2_mount_lockproto+0x13f/0x160
Jul  1 14:08:35 piglet2 kernel:  [<c03ad252>] fill_super+0x3d2/0x6e0
Jul  1 14:08:35 piglet2 kernel:  [<c03a0df0>] ? gfs2_glock_cb+0x0/0x150
Jul  1 14:08:35 piglet2 kernel:  [<c01ade75>] ? disk_name+0x25/0x90
Jul  1 14:08:35 piglet2 kernel:  [<c016db3f>] get_sb_bdev+0xef/0x120
Jul  1 14:08:35 piglet2 kernel:  [<c0182435>] ? alloc_vfsmnt+0xd5/0x110
Jul  1 14:08:35 piglet2 kernel:  [<c03abe25>] gfs2_get_sb+0x15/0x40
Jul  1 14:08:35 piglet2 kernel:  [<c03ace80>] ? fill_super+0x0/0x6e0
Jul  1 14:08:35 piglet2 kernel:  [<c016d613>] vfs_kern_mount+0x53/0x120
Jul  1 14:08:35 piglet2 kernel:  [<c016d731>] do_kern_mount+0x31/0xc0
Jul  1 14:08:35 piglet2 kernel:  [<c0183626>] do_new_mount+0x56/0x80
Jul  1 14:08:35 piglet2 kernel:  [<c0183816>] do_mount+0x1c6/0x1f0
Jul  1 14:08:35 piglet2 kernel:  [<c0166c91>] ? cache_alloc_debugcheck_after+0x71/0x1a0
Jul  1 14:08:35 piglet2 kernel:  [<c014f69b>] ? __get_free_pages+0x1b/0x30
Jul  1 14:08:35 piglet2 kernel:  [<c01814ea>] ? copy_mount_options+0x2a/0x130
Jul  1 14:08:35 piglet2 kernel:  [<c01838aa>] sys_mount+0x6a/0xb0
Jul  1 14:08:35 piglet2 kernel:  [<c0103182>] syscall_call+0x7/0xb
Jul  1 14:08:35 piglet2 kernel:  =======================
Jul  1 14:08:35 piglet2 kernel: 4 locks held by mount.gfs2/6130:
Jul  1 14:08:35 piglet2 kernel:  #0:  (&type->s_umount_key#20){--..}, at: [<c016ce66>] sget+0x176/0x360
Jul  1 14:08:35 piglet2 kernel:  #1:  (lmh_lock){--..}, at: [<c03a7ab0>] gfs2_mount_lockproto+0x20/0x160
Jul  1 14:08:35 piglet2 kernel:  #2:  (&ls_lock){--..}, at: [<c01bd7be>] dlm_new_lockspace+0x1e/0x7f0
Jul  1 14:08:35 piglet2 kernel:  #3:  (&ls->ls_in_recovery){--..}, at: [<c01bdd6f>] dlm_new_lockspace+0x5cf/0x7f0
Jul  1 14:10:44 piglet2 kernel: INFO: task mount.gfs2:6130 blocked for more than 120 seconds.
Jul  1 14:10:44 piglet2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  1 14:10:44 piglet2 kernel: mount.gfs2    D c09f0244  1896  6130   6129
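
Those "dlm: connect from non cluster node" lines look relevant: as far
as I can tell they come from the dlm lowcomms accept path, which looks
up the peer's source address in the node/address table the cluster
stack pushed into configfs and drops the connection when nothing
matches.  A toy user-space model of that check (the addresses and
names here are invented for illustration, this is not the kernel code):

#include <stdio.h>
#include <arpa/inet.h>

/* one entry per <clusternode> the userland stack configured */
struct node {
	int nodeid;
	struct in_addr addr;
};

static struct node nodes[2];	/* filled in main() below */

/* models dlm's address -> nodeid lookup: an incoming connection
 * whose source address matches no configured node is rejected */
static int nodeid_from_addr(struct in_addr peer)
{
	for (int i = 0; i < 2; i++)
		if (nodes[i].addr.s_addr == peer.s_addr)
			return nodes[i].nodeid;
	return -1;
}

int main(void)
{
	struct in_addr peer;

	nodes[0].nodeid = 1; inet_aton("192.168.122.11", &nodes[0].addr);
	nodes[1].nodeid = 2; inet_aton("192.168.122.12", &nodes[1].addr);

	/* a node that connects out a different interface than the
	 * address listed for it in cluster.conf looks like this: */
	inet_aton("10.0.0.11", &peer);
	if (nodeid_from_addr(peer) < 0)
		printf("dlm: connect from non cluster node\n");
	return 0;
}

If that's what's happening here, I wonder whether my nodes each have
more than one interface (the kvm bridge plus the host network) and are
connecting out the "wrong" one, so the source address doesn't match
what cluster.conf lists.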

So I gave up on this and went back to v2.6.25 with the suggested
cluster-2.03.04, but the second mount still hangs, and a sysrq-T trace
shows the mount system call stuck in dlm_new_lockspace().

Since this is, I gather, a known-working set of software versions, I'm
assuming there's something wrong with my setup....

It looks like dlm_new_lockspace() is waiting on dlm_recoverd, which is
in "D" state in dlm_rcom_status(), so I guess the second node isn't
getting some dlm reply it expects?
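
For whoever wants the shape of the hang: my reading of fs/dlm/rcom.c
(paraphrased from memory) is that dlm_rcom_status() sends a
DLM_RCOM_STATUS message to the other node and then sleeps until the
reply handler flips a flag, so if the reply never comes back (say,
because lowcomms rejected the connection as above) the waiter just
stays asleep.  A user-space toy of that wait pattern, with pthreads
standing in for the kernel's wait_event (all names here invented):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int reply_received;	/* models the "rcom reply is in" flag */

/* models the receive path: runs only if the other node's
 * DLM_RCOM_STATUS reply actually arrives over lowcomms */
static void *reply_handler(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	reply_received = 1;
	pthread_cond_signal(&wake);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* models dlm_rcom_status(): send the request, then block until the
 * reply handler wakes us.  No reply -> blocked indefinitely, which
 * is the uninterruptible "D" state dlm_recoverd is showing. */
static void rcom_status(void)
{
	pthread_mutex_lock(&lock);
	while (!reply_received)
		pthread_cond_wait(&wake, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t t;

	/* with this thread the "reply" arrives and we return; without
	 * it, rcom_status() would block forever, like the hang above */
	pthread_create(&t, NULL, reply_handler, NULL);
	rcom_status();
	printf("got the status reply; recovery can proceed\n");
	pthread_join(t, NULL);
	return 0;
}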

--b.
