[Linux-cluster] Re: recovering from "resource groups locked" error?
aberoham at gmail.com
Fri Jun 16 19:39:59 UTC 2006
Btw, all members run kernel 2.6.9-34.ELsmp, cman-1.0.4-0, and
cman-kernel-smp-2.6.9-43.8 with rgmanager-1.9.46-0.
On 6/16/06, aberoham at gmail.com <aberoham at gmail.com> wrote:
>
>
> If clustat reports rgmanager as online, why would any clusvcadm operation
> fail with "Try again (resource groups locked)"?
>
> Is there any way to recover from that rgmanager failure/error besides
> resetting the entire cluster?
>
> Details --
>
> Yesterday evening a technician connected a Netgear GS748T switch to my
> network. The new switch somehow caused a storm of traffic that in turn
> caused a disruption of network connectivity across the entire LAN, including
> to all of my CS/GFS cluster nodes, for a few minutes until the new switch
> was removed from the network.
>
> This morning when I finally had a chance to investigate I found that all
> of the cluster members that are supposed to be online were online and that
> the cluster was quorate. But rgmanager would not work and services running
> under rgmanager were hung. (The cluster must have become inquorate and
> blocked access to the shared GFS volume while the outage was in progress.
> But some of the services and rgmanager never recovered?)
>
> I first tried resetting the "lead" member. (This is a pool of mirrored
> storage servers where the lead member creates an rsync batch off of a main
> fileserver, and all of the other members then replay the rsync batch, which
> is on a shared filesystem, against their local filesystem mirror of the main
> fileserver.)
>
> No matter what I did rgmanager would not start. cman_tool services would
> report code "S-1,80,4" --
>
> root at gfs05:~
> (0)>cman_tool services
> Service          Name                 GID LID State     Code
> Fence Domain:    "default"              1   2 run       -
> [2 1 4 3]
>
> DLM Lock Space:  "clvmd"                2   3 run       -
> [2 1 4 3]
>
> User:            "usrm::manager"        0   4 join      S-1,80,4
> []
>
> Other cluster members would report rgmanager as online, yet when I tried
> to operate on member services, the operation would fail with "Try again
> (resource groups locked)".
>
> root at gfs06:~
> (1)>clustat
> Member Status: Quorate
>
> Member Name          Status
> ------ ----          ------
> gfs04                Online, rgmanager
> gfs05                Online
> gfs06                Online, Local, rgmanager
> gfs07                Online, rgmanager
> gfs08                Offline
>
> Service Name         Owner (Last)         State
> ------- ----         ----- ------         -----
> mapsmirror1          gfs05                started
> mapsmirror2          gfs06                started
> mapsmirror3          gfs07                started
> mapsmirror4          gfs04                started
> mapsmirror5          (none)               stopped
> root at gfs06:~
> (0)>clusvcadm -d mapsmirror1
> Member gfs06 disabling mapsmirror1...failed: Try again (resource groups locked)
>
> Eventually I just gave up and power cycled all cluster members at once.
> Everything, including rgmanager, then came back online OK.
>
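For anyone who wants to watch for this state, here is a minimal Python sketch that scans `cman_tool services` output like the listing quoted above and reports any service group that is not in the `run` state. The function name `stuck_groups`, the regular expression, and the rule "anything other than `run` is worth a look" are my own assumptions for illustration, not part of cman or rgmanager. The sample text is taken from the quoted output.

```python
import re

# Matches one service-group line as shown in the quoted output, e.g.
#   User: "usrm::manager" 0 4 join
# The layout is inferred from that listing only (assumption).
LINE_RE = re.compile(
    r'^(?P<kind>[^:"]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)'
)


def stuck_groups(output):
    """Return (kind, name, state) for every group whose state is not 'run'."""
    stuck = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        # Header, member-list, and wrapped code lines do not match and are skipped.
        if match and match.group("state") != "run":
            stuck.append(
                (match.group("kind"), match.group("name"), match.group("state"))
            )
    return stuck


SAMPLE = '''Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[2 1 4 3]

DLM Lock Space: "clvmd" 2 3 run -
[2 1 4 3]

User: "usrm::manager" 0 4 join
S-1,80,4
[]
'''

if __name__ == "__main__":
    for kind, name, state in stuck_groups(SAMPLE):
        print(f"{kind} group {name!r} is in state {state!r}")
```

On the sample this flags only the `usrm::manager` group, which is the one stuck in `join` on gfs05. To run it against a live node you would have to capture the command's output yourself, for example with `subprocess`, and pass the text to `stuck_groups`.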