[Linux-cluster] mount hang during test runs
Daniel McNeil
daniel at osdl.org
Tue Jan 11 00:50:20 UTC 2005
I started another test run last week and let it run
over the weekend. A 3-node test was running when it hung.
I set /proc/cluster/config/cman/max_retries to 9
and /proc/cluster/config/cman/hello_timer to 1.
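For anyone trying to reproduce the setup, here is a minimal sketch of applying those two settings from a script. It assumes the values are set by writing plain decimal text to the /proc files, which is not spelled out above; the function name and the config_dir parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Values quoted in this post. That they are applied by writing text
# to the /proc files is an assumption, not something stated above.
CMAN_TUNABLES = {"max_retries": 9, "hello_timer": 1}


def set_cman_tunables(tunables, config_dir="/proc/cluster/config/cman"):
    """Write each tunable to <config_dir>/<name>; return the paths written."""
    base = Path(config_dir)
    written = []
    for name, value in tunables.items():
        path = base / name
        path.write_text(f"{value}\n")  # same effect as: echo VALUE > PATH
        written.append(str(path))
    return written


# Usage on a cluster node (as root):
#   set_cman_tunables(CMAN_TUNABLES)
```

The directory is a parameter so the function can be pointed at a scratch directory when trying it off-cluster.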
This time I hit a mount hang. The mount is hung on cl032:
mount D C170F414 0 18375 18369 (NOTLB)
e2dbbc20 00000082 e1dbda10 c170f414 0003e36e 00000000 00000008 c011bb10
d5ea8d58 57435700 0003e36e c18880ac e2dbbc00 e1dbda10 00000000 c170f8c0
c170ef60 00000000 000038d3 57435987 0003e36e e1dbcf50 e1dbd0b8 00000000
Call Trace:
[<c03dbac4>] wait_for_completion+0xa4/0xe0
[<f8a92ed2>] kcl_join_service+0x162/0x1a0 [cman]
[<f8966fbf>] init_mountgroup+0x6f/0xc0 [lock_dlm]
[<f8969411>] lm_dlm_mount+0xa1/0xf0 [lock_dlm]
[<f8812355>] lm_mount+0x155/0x250 [lock_harness]
[<f8affa0d>] gfs_lm_mount+0x1fd/0x390 [gfs]
[<f8b0ee53>] fill_super+0x513/0x1330 [gfs]
[<f8b0fe49>] gfs_get_sb+0x199/0x210 [gfs]
[<c0168e4c>] do_kern_mount+0x5c/0x110
[<c0180138>] do_new_mount+0x98/0xe0
[<c0180905>] do_mount+0x165/0x1b0
[<c0180dd5>] sys_mount+0xb5/0x140
[<c010537d>] sysenter_past_esp+0x52/0x71
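To make the module chain in a trace like this easier to read, here is a rough sketch that pulls the function name and module out of each frame. The regular expression is my own guess at the line format based only on the frames shown here, and labeling frames without a bracketed module as "kernel" is an assumption.

```python
import re

# Matches frames such as:
#   [<f8a92ed2>] kcl_join_service+0x162/0x1a0 [cman]
# The trailing [module] is optional; frames without it are assumed
# to belong to the core kernel.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<func>[\w.]+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(trace_text):
    """Return (function, module) pairs in the order they appear."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("func"), m.group("module") or "kernel"))
    return frames


# First three frames of the trace above.
TRACE = """\
[<c03dbac4>] wait_for_completion+0xa4/0xe0
[<f8a92ed2>] kcl_join_service+0x162/0x1a0 [cman]
[<f8966fbf>] init_mountgroup+0x6f/0xc0 [lock_dlm]
"""

frames = parse_frames(TRACE)
for func, module in frames:
    print(f"{module:12s} {func}")
```

Run over the whole trace, this lists the frames from wait_for_completion back through cman, lock_dlm, lock_harness and gfs to sys_mount.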
Looks like a problem joining the mount group.
/proc/cluster/services shows:
[root@cl030 cman]# cat /proc/cluster/services
Service          Name          GID  LID  State   Code
Fence Domain:    "default"       1    2  run     -
[1 2 3]
DLM Lock Space:  "stripefs"    324  693  run     -
[1 2 3]
GFS Mount Group: "stripefs"    325  694  update  U-4,1,3
[1 2 3]
[root@cl031 cluster]# cat /proc/cluster/services
Service          Name          GID  LID  State   Code
Fence Domain:    "default"       1    2  run     -
[1 2 3]
DLM Lock Space:  "stripefs"    324  457  run     -
[1 2 3]
GFS Mount Group: "stripefs"    325  458  update  U-4,1,3
[1 2 3]
[root@cl032 cluster]# cat /proc/cluster/services
Service          Name          GID  LID  State   Code
Fence Domain:    "default"       1    2  run     -
[1 2 3]
DLM Lock Space:  "stripefs"    324  225  run     -
[1 2 3]
GFS Mount Group: "stripefs"    325  226  join    S-6,20,3
[1 2 3]
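When comparing this output across nodes, a small script can pick out the services that are not in the "run" state. This is only a sketch: the line format is inferred from the output above, the function name is hypothetical, and it makes no attempt to interpret the Code column.

```python
import re

# One service line looks like:
#   GFS Mount Group: "stripefs" 325 226 join S-6,20,3
# \s+ between fields accepts single or multiple spaces.
SERVICE_RE = re.compile(
    r'^(?P<type>[^:\[]+):\s+"(?P<name>[^"]+)"\s+(?P<gid>\d+)\s+'
    r'(?P<lid>\d+)\s+(?P<state>\S+)\s+(?P<code>\S+)\s*$'
)


def services_not_running(services_text):
    """Return (type, name, state, code) for every service not in 'run'."""
    stuck = []
    for line in services_text.splitlines():
        m = SERVICE_RE.match(line.strip())
        if m and m.group("state") != "run":
            stuck.append(
                (m.group("type"), m.group("name"),
                 m.group("state"), m.group("code"))
            )
    return stuck


# Output from cl032 as shown above.
CL032 = """\
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2 3]
DLM Lock Space: "stripefs" 324 225 run -
[1 2 3]
GFS Mount Group: "stripefs" 325 226 join S-6,20,3
[1 2 3]
"""

stuck = services_not_running(CL032)
for entry in stuck:
    print(entry)
```

On the three nodes here it would report the GFS mount group in "update" on cl030 and cl031 and in "join" on cl032, with the fence domain and DLM lock space filtered out.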
I collected stack traces and a bunch of other info. It is
available here:
http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/
Any ideas on debugging this one?
Daniel