[Linux-cluster] recovering from "resource groups locked" error?

aberoham at gmail.com
Fri Jun 16 19:36:40 UTC 2006


If clustat reports rgmanager as online, why would any clusvcadm operation
fail with "Try again (resource groups locked)"?

Is there any way to recover from that rgmanager failure/error besides
resetting the entire cluster?

Details --

Yesterday evening a technician connected a Netgear GS748T switch to my
network. The new switch somehow caused a storm of traffic that in turn
caused a disruption of network connectivity across the entire LAN, including
to all of my CS/GFS cluster nodes, for a few minutes until the new switch
was removed from the network.

This morning when I finally had a chance to investigate I found that all of
the cluster members that are supposed to be online were online and that the
cluster was quorate. But rgmanager would not work and services running under
rgmanager were hung. (The cluster must have become inquorate and blocked
access to the shared GFS volume while the outage was in progress. But some
of the services and rgmanager never recovered?)

I first tried resetting the "lead" member. (This is a pool of mirrored
storage servers where the lead member creates an rsync batch off of a main
fileserver, and all of the other members then replay that rsync batch, which
sits on a shared filesystem, against their local filesystem mirror of the
main fileserver.)

No matter what I did rgmanager would not start. cman_tool services would
report code "S-1,80,4" --

root@gfs05:~
(0)>cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 4 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 4 3]

User:            "usrm::manager"                     0   4 join      S-1,80,4
[]
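
A check for this condition could be scripted. What follows is only a rough
sketch in Python, not anything shipped with cman or rgmanager: it scans
`cman_tool services` output and lists every service whose State column is
not "run". The function name stuck_services and the pattern ROW are made up
for illustration, and treating anything other than "run" as unsettled is an
assumption on my part.

```python
import re

# Sample input: the `cman_tool services` output quoted in this message.
SAMPLE = """\
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 4 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 4 3]

User:            "usrm::manager"                     0   4 join
S-1,80,4
[]
"""

# One service row: type, quoted name, GID, LID, state, and an optional code.
# The code is optional so the row parses whether or not the code ended up
# on the following line.
ROW = re.compile(
    r'^(?P<type>[^:]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)(?:\s+(?P<code>\S+))?$'
)


def stuck_services(text):
    """Return (type, name, state) for every service whose state is not 'run'."""
    stuck = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        # Assumption: "run" is the only settled state worth ignoring.
        if match and match.group("state") != "run":
            stuck.append(
                (match.group("type"), match.group("name"), match.group("state"))
            )
    return stuck


if __name__ == "__main__":
    for svc_type, name, state in stuck_services(SAMPLE):
        print(f"{svc_type} {name!r} is in state {state!r}")
```

Run against the output above, this picks out only the "usrm::manager"
entry, which is sitting in "join". It says nothing about why the join never
completes.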

Other cluster members would report rgmanager as online, yet when I tried to
operate on member services, the operation would fail with "Try again
(resource groups locked)".

root@gfs06:~
(1)>clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  gfs04                                    Online, rgmanager
  gfs05                                    Online
  gfs06                                    Online, Local, rgmanager
  gfs07                                    Online, rgmanager
  gfs08                                    Offline

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  mapsmirror1          gfs05                          started
  mapsmirror2          gfs06                          started
  mapsmirror3          gfs07                          started
  mapsmirror4          gfs04                          started
  mapsmirror5          (none)                         stopped
root@gfs06:~
(0)>clusvcadm -d mapsmirror1
Member gfs06 disabling mapsmirror1...failed: Try again (resource groups locked)
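
The mismatch in the member table can also be picked out mechanically. The
sketch below, again in Python and again purely illustrative, reads clustat
output and lists members that are Online but do not carry the rgmanager
flag. The function name members_without_rgmanager is hypothetical, and the
parsing assumes the two-table layout shown above (a "Member Name" table
followed by a "Service Name" table).

```python
# Sample input: the clustat output quoted in this message.
SAMPLE = """\
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  gfs04                                    Online, rgmanager
  gfs05                                    Online
  gfs06                                    Online, Local, rgmanager
  gfs07                                    Online, rgmanager
  gfs08                                    Offline

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  mapsmirror1          gfs05                          started
  mapsmirror2          gfs06                          started
  mapsmirror3          gfs07                          started
  mapsmirror4          gfs04                          started
  mapsmirror5          (none)                         stopped
"""


def members_without_rgmanager(text):
    """Return names of members that are Online but lack the rgmanager flag."""
    missing = []
    in_members = False
    for raw in text.splitlines():
        line = raw.strip()
        # Track which table we are in (assumed layout: members, then services).
        if line.startswith("Member Name"):
            in_members = True
            continue
        if line.startswith("Service Name"):
            in_members = False
            continue
        # Skip everything outside the member table, blank lines, and rulers.
        if not in_members or not line or line.startswith("-"):
            continue
        name, _, status = line.partition(" ")
        flags = [flag.strip() for flag in status.split(",")]
        if "Online" in flags and "rgmanager" not in flags:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(members_without_rgmanager(SAMPLE))
```

On the output above this reports gfs05, the node where rgmanager would not
start. It does not explain why the other members, which do show rgmanager,
still answer with "resource groups locked".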

Eventually I just gave up and power-cycled all cluster members at once.
Everything, including rgmanager, then came back online OK.