I have a five-node cluster running RHEL 4.4 with the latest errata. One node reports "Timed out waiting for a response from Resource Group Manager" when I run clustat, and if I strace one of its clurgmgrd PIDs it appears to be stuck in a futex locking call:

root@bamf01:~
(1)>clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  bamf01                                   Online, Local, rgmanager
  bamf02                                   Online
  bamf03                                   Online, rgmanager
  bamf04                                   Online, rgmanager
  bamf05                                   Online, rgmanager


Other nodes are fine:

root@bamf03:/etc/init.d
(0)>clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  bamf01                                   Online, rgmanager
  bamf02                                   Online
  bamf03                                   Online, Local, rgmanager
  bamf04                                   Online, rgmanager
  bamf05                                   Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  goat-design          bamf05                         started
  cougar-compout       bamf05                         started
  cheetah-renderout    bamf01                         started
  postgresql-blur      bamf04                         started
  tiger-jukebox        bamf01                         started
  hartigan-home        bamf01                         started


cman_tool status on the rgmanager-failed node (bamf01) matches cman_tool status on the other nodes apart from the "Active subsystems" count. The difference is that the node with the failed rgmanager is running a service that uses GFS, so it has +4 active subsystems: two DLM lock spaces for a GFS filesystem and two GFS mount groups:

root@bamf01:~
(0)>cman_tool status
Protocol version: 5.0.1
Config version: 34
Cluster name: bamf
Cluster ID: 1492
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 5
Expected_votes: 5
Total_votes: 5
Quorum: 3
Active subsystems: 8
Node name: bamf01
Node ID: 2
Node addresses: 10.0.19.21

root@bamf05:~
(0)>cman_tool status
Protocol version: 5.0.1
Config version: 34
Cluster name: bamf
Cluster ID: 1492
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 5
Expected_votes: 5
Total_votes: 5
Quorum: 3
Active subsystems: 5
Node name: bamf05
Node ID: 4
Node addresses: 10.0.19.25
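
For what it's worth, the "Active subsystems" counts can be pulled from all five nodes in one pass with a quick loop like the one below (just a sketch; it assumes root ssh access from wherever it is run to every cluster member):

# print each member's "Active subsystems" line as reported by cman_tool
# (assumes root ssh to every node works without a password)
for n in bamf01 bamf02 bamf03 bamf04 bamf05; do
    echo -n "$n: "
    ssh root@$n "cman_tool status | grep -i 'active subsystems'"
done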

root@bamf01:~
(0)>cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 4 5 3]

DLM Lock Space:  "Magma"                             4   5 run       -
[1 2 4 5]

DLM Lock Space:  "gfs1"                              5   6 run       -
[2]

GFS Mount Group: "gfs1"                              6   7 run       -
[2]

User:            "usrm::manager"                     3   4 run       -
[1 2 4 5]


root@bamf04:~
(0)>cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 4 5 3]

DLM Lock Space:  "Magma"                             4   5 run       -
[1 2 4 5]

User:            "usrm::manager"                     3   4 run       -
[1 2 4 5]


An strace of the two running clurgmgrd processes on an OK node shows this:

root@bamf05:~
(1)>ps auxw |grep clurg |grep -v grep
root      7988  0.0  0.0  9568  376 ?        S<s  Dec08   0:00 clurgmgrd
root      7989  0.0  0.0 58864 5012 ?        S<l  Dec08   0:35 clurgmgrd

root@bamf05:~
(0)>strace -p 7988
Process 7988 attached - interrupt to quit
wait4(7989,
[nothing]

root@bamf05:~
(0)>strace -p 7989
Process 7989 attached - interrupt to quit
select(7, [4 5 6], NULL, NULL, {7, 760000}) = 0 (Timeout)
socket(PF_FILE, SOCK_STREAM, 0)         = 12
connect(12, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"}, 110) = 0
write(12, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
read(12, "\1\0\0\0\0\0\0\0|o_\0\0\0\0\0\0\0\0\0", 20) = 20
close(12)                               = 0
[snip]


An strace of the clurgmgrd PIDs on the failed node shows:

root@bamf01:~
(0)>ps auxw |grep clurg |grep -v grep
root      7982  0.0  0.0  9568  376 ?        S<s  Dec08   0:00 clurgmgrd
root      7983  0.0  0.0 61592 7220 ?        S<l  Dec08   1:03 clurgmgrd

root@bamf01:~
(0)>strace -p 7982
Process 7982 attached - interrupt to quit
wait4(7983,
[nothing]

root@bamf01:~
(0)>strace -p 7983
Process 7983 attached - interrupt to quit
futex(0x522e28, FUTEX_WAIT, 5, NULL
[nothing]

Abe
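
P.S. If a fuller picture of where that second clurgmgrd process is blocked would help, the next thing I can try is a gdb backtrace of the stuck thread, roughly like this (just a sketch; 7983 is the hung PID from the strace above, and readable symbols would need the matching debuginfo packages installed):

# attach to the hung clurgmgrd (PID 7983 above), dump every thread's stack,
# then detach so the daemon keeps running
gdb -p 7983
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

(Or gstack 7983, if the gstack helper from the gdb package is installed on that node.)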