[Linux-cluster] rgmanager or clustat problem

Mon Apr 9 17:22:26 UTC 2007

I am running a four-node GFS cluster with about 20 services per node.  All
four nodes belong to the same failover domain, and they each have a priority
of 1.  My shared storage is an iSCSI SAN.

After rgmanager has been running for a couple of days, clustat produces the
following result on all four nodes:

Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  node01                                   Online, rgmanager
  node02                                   Online, Local, rgmanager
  node03                                   Online, rgmanager
  node04                                   Online, rgmanager

I also get a timeout when I try to determine the status of a particular
service with "clustat -s servicename".

All of the services seem to be up and running, but clustat does not work.
Is there something wrong?  Is there a way for me to increase the timeout?
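
To put a number on how long clustat takes before it gives up, the command
can be wrapped in a small timing helper.  This is only a sketch in Python:
the helper name timed_run and the 60-second limit in the usage comment are
my own assumptions, and it does not change clustat's internal timeout; it
only measures from the outside.

```python
import subprocess
import time


def timed_run(cmd, limit_s):
    """Run cmd and return (returncode, elapsed_seconds, stdout).

    returncode is None if the command had not finished after limit_s
    seconds; in that case the child process is killed.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=limit_s
        )
    except subprocess.TimeoutExpired:
        return None, time.monotonic() - start, ""
    return proc.returncode, time.monotonic() - start, proc.stdout


# Hypothetical usage on a cluster node (limit chosen arbitrarily):
#   rc, elapsed, out = timed_run(["clustat"], 60.0)
#   print(rc, round(elapsed, 1))
```

Running this periodically after a restart of rgmanager would show whether
the response time degrades gradually or fails all at once.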

clurgmgrd and dlm_recvd seem to be using a lot of CPU on node02: 40 and 60
percent, respectively.

Thank you for your help.

Output of "cman_tool services" on each node:

NODE01:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           4   2 run       -
[1 3 2 4]

DLM Lock Space:  "clvmd"                             1   3 run       -
[1 3 2 4]

DLM Lock Space:  "Magma"                             3   5 run       -
[1 3 2 4]

DLM Lock Space:  "gfslv"                             5   6 run       -
[2 1 3 4]

GFS Mount Group: "gfslv"                             6   7 run       -
[2 1 3 4]

User:            "usrm::manager"                     2   4 run       -
[1 3 2 4]

NODE02:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           4   5 run       -
[1 3 2 4]

DLM Lock Space:  "clvmd"                             1   1 run       -
[1 3 2 4]

DLM Lock Space:  "Magma"                             3   3 run       -
[1 3 2 4]

DLM Lock Space:  "gfslv"                             5   6 run       -
[1 4 2 3]

GFS Mount Group: "gfslv"                             6   7 run       -
[1 4 2 3]

User:            "usrm::manager"                     2   2 run       -
[1 3 2 4]

NODE03:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           4   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                             1   3 run       -
[1 2 3 4]

DLM Lock Space:  "Magma"                             3   5 run       -
[1 2 3 4]

DLM Lock Space:  "gfslv"                             5   6 run       -
[1 2 4 3]

GFS Mount Group: "gfslv"                             6   7 run       -
[1 2 4 3]

User:            "usrm::manager"                     2   4 run       -
[1 2 3 4]

NODE04:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           4   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                             1   3 run       -
[1 2 3 4]

DLM Lock Space:  "Magma"                             3   5 run       -
[1 2 3 4]

DLM Lock Space:  "gfslv"                             5   6 run       -
[1 4 2 3]

GFS Mount Group: "gfslv"                             6   7 run       -
[1 4 2 3]

User:            "usrm::manager"                     2   4 run       -
[1 2 3 4]
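
The member lists above appear in a different order on each node (for
example, "gfslv" is [2 1 3 4] on NODE01 and [1 4 2 3] on NODE02).  To check
that only the order differs and not the membership itself, the listings can
be compared with a short script.  The following is only a sketch in Python;
the function names parse_services and membership_mismatches are my own, the
two samples are excerpts of the listings above, and it assumes the output
format is exactly as pasted.

```python
import re

# Excerpts copied from the NODE01 and NODE02 listings above.
NODE01_SAMPLE = '''Service          Name                              GID LID State     Code
DLM Lock Space:  "gfslv"                             5   6 run       -
[2 1 3 4]

User:            "usrm::manager"                     2   4 run       -
[1 3 2 4]
'''

NODE02_SAMPLE = '''Service          Name                              GID LID State     Code
DLM Lock Space:  "gfslv"                             5   6 run       -
[1 4 2 3]

User:            "usrm::manager"                     2   2 run       -
[1 3 2 4]
'''


def parse_services(text):
    """Return {(service_type, name): [member ids]} for one node's listing."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # A group line looks like: <type>:  "<name>"  GID LID State Code
        header = re.match(r'^(.+?):\s+"([^"]+)"', line)
        if header:
            current = (header.group(1), header.group(2))
            groups[current] = []
        elif current is not None and line.startswith("[") and line.endswith("]"):
            # The bracketed line after a group line holds its member ids.
            groups[current] = [int(x) for x in line[1:-1].split()]
            current = None
    return groups


def membership_mismatches(per_node):
    """Return groups whose member *set* is not identical on every node.

    per_node maps a node name to the dict produced by parse_services.
    Member order is ignored; only the set of ids is compared.
    """
    all_keys = set()
    for groups in per_node.values():
        all_keys.update(groups)
    mismatches = {}
    for key in sorted(all_keys):
        sets = {n: frozenset(g.get(key, [])) for n, g in per_node.items()}
        if len(set(sets.values())) > 1:
            mismatches[key] = sets
    return mismatches


result = membership_mismatches({
    "node01": parse_services(NODE01_SAMPLE),
    "node02": parse_services(NODE02_SAMPLE),
})
print(result)
```

An empty result means every group has the same members on every node, even
if the order in which they are listed differs.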