[Linux-cluster] rgmanager blocked
radu.rendec at mindbit.ro
Mon Nov 15 12:34:22 UTC 2010
I'm trying to migrate an older CentOS 5 / rhcs2 cluster to the newer
rhcs3. Being eager to experiment, I decided to run my tests on Fedora
14 until CentOS 6 is out.
Although everything seemed to work fine at the beginning, after a few
hours of cluster uptime I came across a strange situation of rgmanager
being apparently blocked. The process is still there, but:
1. It no longer produces any output. It runs in a "screen" session
with the parameters "-fd". Normally it's very verbose (I can see a lot
of debug messages, including output from agent scripts). It's been more
than a week since it blocked, and it hasn't output a single line of
debug since.
2. Resources from node 1 were (automatically) relocated to node 2 when
node 1 blocked, but node 2 blocked in a similar manner a few hours
later.
3. Resources are still active on node 2 now, and on both nodes the
output of "clustat" looks like this:

  Service states unavailable: Temporary failure; try again
  Cluster Status for ****** @ Mon Nov 15 14:14:22 2010
  Member Status: Quorate

   Member Name            ID   Status
   ------ ----            ---- ------
   storage1.******           1 Online, Local
   storage2.******           2 Online

I've already tried several simple things like:
* looking at the process tree for some hung resource agents - no luck;
it's just clurgmgrd and its child threads;
* looking at the open files of clurgmgrd in /proc/NNN/fd - nothing
* tracing (with strace) the main clurgmgrd thread and the children.
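
For anyone who wants to repeat the /proc checks above, here is a minimal
sketch in Python. It assumes a Linux /proc layout and has not been run
against this cluster; the function names (parse_task_status,
thread_report, open_files) are made up for illustration, and the PID of
clurgmgrd has to be supplied by hand. It lists each thread's state and
wait channel, then the targets of the open file descriptors. A thread
sitting in state "D" (uninterruptible sleep) would be the first thing to
look at.

```python
import os
from pathlib import Path


def parse_task_status(text):
    """Pull the Name and State fields out of a /proc/<pid>/task/<tid>/status dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("Name", "State"):
            fields[key] = value.strip()
    return fields


def thread_report(pid, proc_root="/proc"):
    """Return one dict per thread of `pid`: tid, name, state and wait channel."""
    report = []
    task_dir = Path(proc_root) / str(pid) / "task"
    for tid_dir in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        info = parse_task_status((tid_dir / "status").read_text())
        try:
            # wchan names the kernel function the thread is sleeping in, if any
            wchan = (tid_dir / "wchan").read_text().strip()
        except OSError:
            wchan = "?"
        report.append({
            "tid": int(tid_dir.name),
            "name": info.get("Name", "?"),
            "state": info.get("State", "?"),
            "wchan": wchan,
        })
    return report


def open_files(pid, proc_root="/proc"):
    """Map each open file descriptor number of `pid` to the target it points at."""
    targets = {}
    for entry in (Path(proc_root) / str(pid) / "fd").iterdir():
        try:
            targets[int(entry.name)] = os.readlink(entry)
        except OSError:
            targets[int(entry.name)] = "?"
    return targets
```

Run as root, something like "for t in thread_report(NNN): print(t)"
followed by "print(open_files(NNN))", with NNN replaced by the real PID,
should give the same picture as browsing /proc/NNN by hand.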
At this point I'm totally clueless, so any suggestion would be welcome.
I can provide further info / logs about the running system / processes.