[Linux-cluster] Rhel 5.7 Cluster - gfs2 volume in "LEAVE_START_WAIT" status
Cedric Kimaru
rhel_cluster at ckimaru.com
Tue Jun 5 14:14:57 UTC 2012
Hi Dan,
Thanks for the response and breadcrumb. The link to David's document will
hopefully shed more light on this state.
I tried fencing the node with the pending sync restart, 12 in my case, but
that didn't seem to get the volume out of the weeds. Attempting to restart
gfs2 from the other nodes also fails, since it has to unmount first, which it
can't ... weeds, weeds, weeds.
Now, could you elaborate on which diags you are referring to, the glock dumps?
thanks,
-Cedric
On Mon, Jun 4, 2012 at 10:52 AM, Dan Riley <dan131riley at gmail.com> wrote:
> Hi Cedric,
>
> About the only doc I've found that describes the barrier state transitions
> is in the cluster2 architecture doc
>
> http://people.redhat.com/teigland/cluster2-arch.txt
>
> When group membership changes, there's a barrier operation that stops the
> group, changes the membership, and restarts the group, so that all members
> synchronize on the membership change. LEAVE_START_WAIT means
> that a node (12) left the group, but restarting the group hasn't completed
> because not all the nodes have acknowledged agreement. You should do
> 'group_tool -v' on the different nodes of the cluster and look for a node
> where the final 'local_done' flag is 0, or where the group membership is
> inconsistent with the other nodes. Dumping the debug buffer for the group
> on the various nodes may also identify which node is being waited on. In
> the cases where we've found inconsistent group membership, fencing the node
> with the inconsistency let the group finish starting.
>
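[Editor's note: Dan's checks above can be condensed into a small sketch. The awk filter below scans `group_tool -v` output for any group whose state column isn't "none"; the sample lines are reused from the real output quoted later in this thread. The hostnames and loop in the trailing comments are hypothetical, and `group_tool dump gfs` is the debug-buffer dump Dan refers to.]

```shell
# Hedged sketch: flag groups stuck in a non-"none" state. The sample
# reuses two lines from the group_tool -v output quoted in this thread.
sample='fence            0 default         0001000d none
gfs              2 cluster3_disk2  00030005 LEAVE_START_WAIT 12 c000b0002 1'

# Column 5 is the state; column 6 is the node the barrier is waiting on
# (only present when the state is not "none").
printf '%s\n' "$sample" |
  awk '$5 != "none" { print "stuck group:", $3, "state:", $5, "waiting on node:", $6 }'

# On a live cluster you would run something like (hypothetical hostnames):
#   for n in node01 node02 node03; do ssh "$n" group_tool -v; done
# and dump the group debug buffer on a suspect node:
#   group_tool dump gfs
```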
> [as an aside--is there a plan to reengineer the RH cluster group
> membership protocol stack to take advantage of the virtual synchrony
> capabilities of Corosync/TOTEM?]
>
> -dan
>
> On Jun 2, 2012, at 9:25 PM, Cedric Kimaru wrote:
>
> > Fellow Cluster Compatriots,
> > I'm looking for some guidance here. Whenever my RHEL 5.7 cluster gets
> > into "LEAVE_START_WAIT" on a given iSCSI volume, the following occurs:
> > • I can't r/w io to the volume.
> > • Can't unmount it, from any node.
> > • In-flight/pending IOs are impossible to determine or kill, since
> > lsof on the mount fails. Basically all IO operations stall/fail.
> > So my questions are:
> >
> > • What does the output from group_tool -v really indicate:
> > "00030005 LEAVE_START_WAIT 12 c000b0002 1"? The group_tool man page
> > doesn't list these fields.
> > • Does anyone have a list of what these fields represent ?
> > • Corrective actions: how do I get out of this state without
> > rebooting the entire cluster?
> > • Is it possible to determine the offending node ?
> > thanks,
> > -Cedric
> >
> >
> > //misc output
> >
> > root@bl13-node13:~# group_tool -v
> > type level name id state node id local_done
> > fence 0 default 0001000d none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 clvmd 0001000c none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 cluster3_disk1 00020005 none
> > [4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 cluster3_disk2 00040005 none
> > [4 5 6 7 8 9 10 11 13 14 15]
> > dlm 1 cluster3_disk7 00060005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 cluster3_disk8 00080005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 cluster3_disk9 000a0005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 disk10 000c0005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 rgmanager 0001000a none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > dlm 1 cluster3_disk3 00020001 none
> > [1 5 6 7 8 9 10 11 12 13]
> > dlm 1 cluster3_disk6 00020008 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs 2 cluster3_disk1 00010005 none
> > [4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs 2 cluster3_disk2 00030005 LEAVE_START_WAIT 12 c000b0002 1
> > [4 5 6 7 8 9 10 11 13 14 15]
> > gfs 2 cluster3_disk7 00050005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs 2 cluster3_disk8 00070005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs 2 cluster3_disk9 00090005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs 2 disk10 000b0005 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
> > gfs 2 cluster3_disk3 00010001 none
> > [1 5 6 7 8 9 10 11 12 13]
> > gfs 2 cluster3_disk6 00010008 none
> > [1 4 5 6 7 8 9 10 11 12 13 14 15]
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>