<br>I have a five-node cluster running RHEL 4.4 with the latest errata. One node reports "Timed out waiting for a response from Resource Group Manager" when I run clustat, and an strace of one of its clurgmgrd PIDs shows it stuck in a futex wait:
<br><br>root@bamf01:~<br>(1)>clustat<br>Timed out waiting for a response from Resource Group Manager<br>Member Status: Quorate<br><br> Member Name Status<br> ------ ---- ------
<br> bamf01 Online, Local, rgmanager<br> bamf02 Online<br> bamf03 Online, rgmanager<br> bamf04 Online, rgmanager
<br> bamf05 Online, rgmanager<br><br><br>Other nodes are fine:<br><br>root@bamf03:/etc/init.d<br>(0)>clustat<br>Member Status: Quorate<br><br> Member Name Status
<br> ------ ---- ------<br> bamf01 Online, rgmanager<br> bamf02 Online<br> bamf03 Online, Local, rgmanager
<br> bamf04 Online, rgmanager<br> bamf05 Online, rgmanager<br><br> Service Name Owner (Last) State<br> ------- ---- ----- ------ -----
<br> goat-design bamf05 started<br> cougar-compout bamf05 started<br> cheetah-renderout bamf01 started<br> postgresql-blur bamf04 started
<br> tiger-jukebox bamf01 started<br> hartigan-home bamf01 started<br><br><br>cman_tool status on the node with the failed rgmanager (bamf01) matches cman_tool status on the other nodes except for the "Active subsystems" count. The difference is that the node with the failed rgmanager is running a service that uses GFS, so it has +4 active subsystems: two DLM lock spaces for a GFS filesystem and two GFS mount groups:
<br><br>root@bamf01:~<br>(0)>cman_tool status<br>Protocol version: 5.0.1<br>Config version: 34<br>Cluster name: bamf<br>Cluster ID: 1492<br>Cluster Member: Yes<br>Membership state: Cluster-Member<br>Nodes: 5<br>Expected_votes: 5
<br>Total_votes: 5<br>Quorum: 3<br>Active subsystems: 8<br>Node name: bamf01<br>Node ID: 2<br>Node addresses: 10.0.19.21<br><br>root@bamf05:~<br>(0)>cman_tool status<br>Protocol version:
5.0.1<br>Config version: 34<br>Cluster name: bamf<br>Cluster ID: 1492<br>Cluster Member: Yes<br>Membership state: Cluster-Member<br>Nodes: 5<br>Expected_votes: 5<br>Total_votes: 5<br>Quorum: 3<br>Active subsystems: 5<br>Node name: bamf05
<br>Node ID: 4<br>Node addresses: 10.0.19.25<br><br><br>root@bamf01:~<br>(0)>cman_tool services<br>Service Name GID LID State Code<br>Fence Domain: "default" 1 2 run -
<br>[2 1 4 5 3]<br><br>DLM Lock Space: "clvmd" 2 3 update U-4,1,3<br>[1 2 4 5 3]<br><br>DLM Lock Space: "Magma" 4 5 run -<br>[1 2 4 5]
<br><br>DLM Lock Space: "gfs1" 5 6 run -<br>[2]<br><br>GFS Mount Group: "gfs1" 6 7 run -<br>[2]<br><br>User: "usrm::manager" 3 4 run -
<br>[1 2 4 5]<br><br><br>root@bamf04:~<br>(0)>cman_tool services<br>Service Name GID LID State Code<br>Fence Domain: "default" 1 2 run -
<br>[1 2 4 5 3]<br><br>DLM Lock Space: "clvmd" 2 3 update U-4,1,3<br>[1 2 4 5 3]<br><br>DLM Lock Space: "Magma" 4 5 run -<br>[1 2 4 5]
<br><br>User: "usrm::manager" 3 4 run -<br>[1 2 4 5]<br><br><br>An strace of the two running clurgmgrd processes on an OK node shows this:<br><br>root@bamf05:~<br>(1)>ps auxw |grep clurg |grep -v grep
<br>root 7988 0.0 0.0 9568 376 ? S<s Dec08 0:00 clurgmgrd<br>root 7989 0.0 0.0 58864 5012 ? S<l Dec08 0:35 clurgmgrd<br><br>root@bamf05:~<br>(0)>strace -p 7988<br>Process 7988 attached - interrupt to quit
<br>wait4(7989,<br>[nothing]<br><br>root@bamf05:~<br>(0)>strace -p 7989<br>Process 7989 attached - interrupt to quit<br>select(7, [4 5 6], NULL, NULL, {7, 760000}) = 0 (Timeout)<br>socket(PF_FILE, SOCK_STREAM, 0) = 12
<br>connect(12, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"}, 110) = 0<br>write(12, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20<br>read(12, "\1\0\0\0\0\0\0\0|o_\0\0\0\0\0\0\0\0\0", 20) = 20
<br>close(12) = 0<br>[snip]<br><br><br>An strace of the clurgmgrd PIDs on the failed node shows:<br><br>root@bamf01:~<br>(0)>ps auxw |grep clurg |grep -v grep<br>root 7982 0.0 0.0 9568 376 ? S<s Dec08 0:00 clurgmgrd
<br>root 7983 0.0 0.0 61592 7220 ? S<l Dec08 1:03 clurgmgrd<br><br>root@bamf01:~<br>(0)>strace -p 7982<br>Process 7982 attached - interrupt to quit<br>wait4(7983,<br>[nothing]<br><br>root@bamf01:~<br>
(0)>strace -p 7983<br>Process 7983 attached - interrupt to quit<br>futex(0x522e28, FUTEX_WAIT, 5, NULL<br>[nothing]<br><br>Abe<br><br><br>
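To make the two `cman_tool services` listings above easier to compare, here is a minimal Python sketch that parses pasted output in the layout shown in this post and reports which groups exist on only one node and which are in a state other than `run`. The function names `parse_services` and `compare` are hypothetical helpers, not part of any cluster tool, and treating every non-`run` state as worth a look is only an assumption of this sketch. The sample text is copied from the bamf01 and bamf04 listings.

```python
import re

# Matches service lines such as: DLM Lock Space: "clvmd" 2 3 update U-4,1,3
SERVICE_RE = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]+)"\s+(?P<gid>\d+)\s+(?P<lid>\d+)'
    r'\s+(?P<state>\S+)\s+(?P<code>\S+)\s*$'
)
# Matches the member list that follows each service line, e.g. [1 2 4 5 3]
MEMBERS_RE = re.compile(r'^\[(?P<ids>[\d\s]*)\]$')


def parse_services(text):
    """Parse pasted `cman_tool services` output into {(kind, name): info}."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = SERVICE_RE.match(line)
        if m:
            current = (m.group("kind"), m.group("name"))
            groups[current] = {
                "state": m.group("state"),
                "code": m.group("code"),
                "members": set(),
            }
            continue
        m = MEMBERS_RE.match(line)
        if m and current is not None:
            groups[current]["members"] = {int(n) for n in m.group("ids").split()}
    return groups


def compare(a, b):
    """Return (only_in_a, only_in_b, not_running) as sorted lists of keys."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    # Assumption: any state other than "run" is flagged for a closer look.
    not_running = sorted(
        key for key in set(a) | set(b)
        if any(g.get(key, {}).get("state", "run") != "run" for g in (a, b))
    )
    return only_a, only_b, not_running


BAMF01 = '''Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[2 1 4 5 3]

DLM Lock Space: "clvmd" 2 3 update U-4,1,3
[1 2 4 5 3]

DLM Lock Space: "Magma" 4 5 run -
[1 2 4 5]

DLM Lock Space: "gfs1" 5 6 run -
[2]

GFS Mount Group: "gfs1" 6 7 run -
[2]

User: "usrm::manager" 3 4 run -
[1 2 4 5]
'''

BAMF04 = '''Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2 4 5 3]

DLM Lock Space: "clvmd" 2 3 update U-4,1,3
[1 2 4 5 3]

DLM Lock Space: "Magma" 4 5 run -
[1 2 4 5]

User: "usrm::manager" 3 4 run -
[1 2 4 5]
'''

only_01, only_04, not_running = compare(parse_services(BAMF01),
                                        parse_services(BAMF04))
print("only on bamf01:", only_01)
print("only on bamf04:", only_04)
print("not in 'run' state:", not_running)
```

The parser keys each group on its type and quoted name, so the same group can be matched across nodes even when the member IDs are listed in a different order.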