[Linux-cluster] rgmanager blocked

Mon Nov 15 12:34:22 UTC 2010

Hello,

I'm trying to migrate an older CentOS 5 / rhcs2 cluster to the newer
rhcs3. Being eager to play around, I decided to run my tests on Fedora
14 until CentOS 6 is out.

Although everything seemed to work fine at the beginning, after a few
hours of cluster uptime I came across a strange situation of rgmanager
being apparently blocked. The process is still there, but:

1. It no longer produces any output. It runs in a "screen" session
with params "-fd". Normally it's very verbose (I can see a lot of debug
messages, including output from agent scripts). It's been more than a
week since it blocked, and it hasn't output a single line of debug.

2. Resources from node 1 were (automatically) relocated to node 2 when
node 1 blocked, but node 2 blocked in a similar manner a few hours
later.

3. Resources are still active on node 2, but on both nodes the output
of "clustat" looks like this:

Service states unavailable: Temporary failure; try again
Cluster Status for ****** @ Mon Nov 15 14:14:22 2010
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 storage1.******                                                1 Online, Local
 storage2.******                                                2 Online

I've already tried several simple things like:
* looking at the process tree for some hung resource agents - no luck;
it's just clurgmgrd and its child threads;
* looking at the open files of clurgmgrd in /proc/NNN/fd - nothing
unusual
* tracing (with strace) the main clurgmgrd thread and the children.
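
The process-tree and /proc checks above could be complemented by looking at the scheduler state of each clurgmgrd thread. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function name `thread_states` is hypothetical and not part of rgmanager. It reads `/proc/<pid>/task/<tid>/status` for every thread and, where the kernel exposes it, `/proc/<pid>/task/<tid>/wchan` (the kernel function the thread is sleeping in). Depending on kernel configuration and privileges, the wchan value may be unreadable or just "0".

```python
import os
import sys


def thread_states(pid):
    """Return a list of (tid, name, state, wchan) tuples for the threads of pid.

    Hypothetical helper for illustration; it only reads standard /proc files.
    """
    rows = []
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        fields = {}
        try:
            # status holds "Key:\tvalue" lines, e.g. "State:\tS (sleeping)"
            with open(f"{task_dir}/{tid}/status") as fh:
                for line in fh:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
        try:
            with open(f"{task_dir}/{tid}/wchan") as fh:
                wchan = fh.read().strip() or "-"
        except OSError:
            wchan = "?"  # not readable on this kernel / with these privileges
        rows.append((int(tid), fields.get("Name", "?"),
                     fields.get("State", "?"), wchan))
    return rows


if __name__ == "__main__":
    # Pass the clurgmgrd PID as the first argument; defaults to this process.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for tid, name, state, wchan in thread_states(pid):
        print(f"{tid:>7}  {name:<16} {state:<14} {wchan}")
```

A thread that stays in state "D (disk sleep)" or sits in the same wait channel across repeated runs would be a candidate to trace further; threads that are all in "S (sleeping)" point more towards a lock or a message that never arrives.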

At this point I'm totally clueless, so any suggestion would be welcome.
I can provide further info / logs about the running system / processes.

Thanks,

Radu Rendec