[Linux-cluster] Re: rgmanager stuck, hung on futex
aberoham at gmail.com
Mon Dec 11 18:22:19 UTC 2006
Another clue -- haldaemon crashed on this node, perhaps at the same time
clurgmgrd started to hang?
Latest dmesg entry --
hal[3509]: segfault at 0000000000000000 rip 0000000000400ec7 rsp
0000007fbfffd7e0 error 4
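
For reference, the x86-64 page-fault error code in that line is a bitmask
(bit 0 = protection violation, bit 1 = write access, bit 2 = user mode), so
"error 4" is a user-mode read of a not-present page at address 0 -- a
NULL-pointer dereference inside hald. A quick shell sketch of the decoding:

err=4   # the "error N" value from the dmesg line
[ $((err & 4)) -ne 0 ] && echo "user mode" || echo "kernel mode"
[ $((err & 2)) -ne 0 ] && echo "write access" || echo "read access"
[ $((err & 1)) -ne 0 ] && echo "protection violation" || echo "page not present"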
grep clurgmgrd /var/log/messages --
[snip]
Dec 11 06:39:43 bamf01 clurgmgrd: [7983]: <info> Executing
/etc/init.d/rsyncd-tiger status
Dec 11 06:39:44 bamf01 clurgmgrd: [7983]: <info> Executing
/etc/init.d/httpd.cluster status
Dec 11 06:39:44 bamf01 clurgmgrd: [7983]: <info> Executing
/etc/init.d/rsyncd-hartigan status
Dec 11 06:41:11 bamf01 clurgmgrd[7983]: <err> #48: Unable to obtain cluster
lock: Connection timed out
Dec 11 06:41:56 bamf01 clurgmgrd[7983]: <err> #50: Unable to obtain cluster
lock: Connection timed out
[snip]
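
Those #48/#50 errors mean clurgmgrd's lock requests are timing out;
rgmanager takes its cluster-wide locks through magma, which here sits on
the "Magma" DLM lock space visible in cman_tool services below. One way to
peek at what that lock space is doing -- a sketch, assuming the RHEL4-era
/proc/cluster debug interface is available on these kernels:

# select the lock space by name, then read back its lock table
echo "Magma" > /proc/cluster/dlm_locks
cat /proc/cluster/dlm_locks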
On 12/11/06, aberoham at gmail.com <aberoham at gmail.com> wrote:
>
>
> I have a five-node cluster, RHEL4.4 with the latest errata. One node is
> telling me "Timed out waiting for a response from Resource Group Manager"
> when I run clustat, and if I strace one of its clurgmgrd PIDs it seems to be
> stuck on a futex locking call ---
>
> root at bamf01:~
> (1)>clustat
> Timed out waiting for a response from Resource Group Manager
> Member Status: Quorate
>
> Member Name Status
> ------ ---- ------
> bamf01 Online, Local, rgmanager
> bamf02 Online
> bamf03 Online, rgmanager
> bamf04 Online, rgmanager
> bamf05 Online, rgmanager
>
>
> Other nodes are fine:
>
> root at bamf03:/etc/init.d
> (0)>clustat
> Member Status: Quorate
>
> Member Name Status
> ------ ---- ------
> bamf01 Online, rgmanager
> bamf02 Online
> bamf03 Online, Local, rgmanager
> bamf04 Online, rgmanager
> bamf05 Online, rgmanager
>
> Service Name Owner (Last) State
> ------- ---- ----- ------ -----
> goat-design bamf05 started
> cougar-compout bamf05 started
> cheetah-renderout bamf01 started
> postgresql-blur bamf04 started
> tiger-jukebox bamf01 started
> hartigan-home bamf01 started
>
>
> cman_tool status on the rgmanager-failed node (bamf01) matches cman_tool
> status on the other nodes except for the "Active subsystems" count. The
> difference is that the node with the failed rgmanager is running a service
> that uses GFS, so it has +4 active subsystems: two DLM lock spaces for a
> GFS filesystem and two GFS mount groups (a loop for pulling this from every
> node is sketched below, after the outputs) --
>
> root at bamf01:~
> (0)>cman_tool status
> Protocol version: 5.0.1
> Config version: 34
> Cluster name: bamf
> Cluster ID: 1492
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 5
> Expected_votes: 5
> Total_votes: 5
> Quorum: 3
> Active subsystems: 8
> Node name: bamf01
> Node ID: 2
> Node addresses: 10.0.19.21
>
> root at bamf05:~
> (0)>cman_tool status
> Protocol version: 5.0.1
> Config version: 34
> Cluster name: bamf
> Cluster ID: 1492
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 5
> Expected_votes: 5
> Total_votes: 5
> Quorum: 3
> Active subsystems: 5
> Node name: bamf05
> Node ID: 4
> Node addresses: 10.0.19.25
>
>
> root at bamf01:~
> (0)>cman_tool services
> Service Name GID LID State Code
> Fence Domain: "default" 1 2 run -
> [2 1 4 5 3]
>
> DLM Lock Space: "clvmd" 2 3 update
> U-4,1,3
> [1 2 4 5 3]
>
> DLM Lock Space: "Magma" 4 5 run -
> [1 2 4 5]
>
> DLM Lock Space: "gfs1" 5 6 run -
> [2]
>
> GFS Mount Group: "gfs1" 6 7 run -
> [2]
>
> User: "usrm::manager" 3 4 run -
> [1 2 4 5]
>
>
> root at bamf04:~
> (0)>cman_tool services
> Service Name GID LID State Code
> Fence Domain: "default" 1 2 run -
> [1 2 4 5 3]
>
> DLM Lock Space: "clvmd" 2 3 update
> U-4,1,3
> [1 2 4 5 3]
>
> DLM Lock Space: "Magma" 4 5 run -
> [1 2 4 5]
>
> User: "usrm::manager" 3 4 run -
> [1 2 4 5]
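>
> To compare this across the whole cluster at once, a simple loop does the
> trick (node names as above; assumes passwordless ssh between the nodes):
>
> for n in bamf01 bamf02 bamf03 bamf04 bamf05; do
>     echo "== $n =="; ssh $n cman_tool services
> done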
>
>
> An strace of the two running clurgmgrd processes on an OK node shows this:
>
> root at bamf05:~
> (1)>ps auxw |grep clurg |grep -v grep
> root 7988 0.0 0.0 9568 376 ? S<s Dec08 0:00 clurgmgrd
> root 7989 0.0 0.0 58864 5012 ? S<l Dec08 0:35 clurgmgrd
>
> root at bamf05:~
> (0)>strace -p 7988
> Process 7988 attached - interrupt to quit
> wait4(7989,
> [nothing]
>
> root at bamf05:~
> (0)>strace -p 7989
> Process 7989 attached - interrupt to quit
> select(7, [4 5 6], NULL, NULL, {7, 760000}) = 0 (Timeout)
> socket(PF_FILE, SOCK_STREAM, 0) = 12
> connect(12, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"}, 110) =
> 0
> write(12, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
> read(12, "\1\0\0\0\0\0\0\0|o_\0\0\0\0\0\0\0\0\0", 20) = 20
> close(12) = 0
> [snip]
>
>
> An strace of the clurgmgrd PIDs on the failed node shows:
>
> root at bamf01:~
> (0)>ps auxw |grep clurg |grep -v grep
> root 7982 0.0 0.0 9568 376 ? S<s Dec08 0:00 clurgmgrd
> root 7983 0.0 0.0 61592 7220 ? S<l Dec08 1:03 clurgmgrd
>
> root at bamf01:~
> (0)>strace -p 7982
> Process 7982 attached - interrupt to quit
> wait4(7983,
> [nothing]
>
> root at bamf01:~
> (0)>strace -p 7983
> Process 7983 attached - interrupt to quit
> futex(0x522e28, FUTEX_WAIT, 5, NULL
> [nothing]
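>
> One way to see where in userspace that futex wait sits -- a sketch,
> assuming gdb (and ideally rgmanager debuginfo) is installed on the node:
>
> gdb -p 7983                  # attach to the hung clurgmgrd
> (gdb) thread apply all bt    # backtrace every thread
> (gdb) detach
> (gdb) quit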
>
> Abe
>
>
>