[Linux-cluster] RHEL 5.7: cpg_leave error retrying
Robert Hayden
rhayden.public at gmail.com
Fri Sep 2 13:38:25 UTC 2011
Has anyone experienced the following error/hang/loop when attempting
to stop rgmanager or cman on the last node of a two-node cluster?
groupd[4909]: cpg_leave error retrying
Basic scenario:
RHEL 5.7 with the latest errata for cman.
Create a two-node cluster with qdisk and a higher totem token (token="70000").
Start cman on both nodes and wait for qdisk to come online with a master determined.
Stop cman on node1 and wait for it to complete.
Stop cman on node2.
The error "cpg_leave error retrying" is seen in the logging output.
Observations:
The "service cman stop" command hangs at "Stopping fencing" output
If I cycle openais service with "service openais restart", then the
"service cman stop" will complete (need to manually stop the openais
service afterwards).
When hung, the command "group_tool dump" hangs (any group_tool command hangs).
The hang is inconsistent which, in my mind, implies a timing issue.
Inconsistent meaning that every once in a while, then shutdown will
complete (maybe 20% of the time).
I have seen the issue when stopping both rgmanager and cman; the example
below has been stripped down to show the hang with cman alone.
I have tested varying the length of time to wait before stopping the
second node, with no difference (the hang still occurs periodically).
I have tested with the totem token and quorum_dev_poll settings commented
out and still experienced the hang. (We use the longer timeouts to help
survive network and SAN blips.)
I have dug through some of the source code. The message comes from the
function do_cpg_leave() in group's cpg.c, which calls the cpg_leave()
function provided by the openais package.
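Here is a minimal sketch of that retry pattern as I read it (a
paraphrase, not the verbatim RHEL 5.7 source; the wrapper function,
log call, and names are stand-ins of mine):

/* Paraphrased sketch of groupd's do_cpg_leave() retry loop, assuming
 * the whitetank-era openais cpg API (<openais/cpg.h>). */
#include <stdio.h>
#include <unistd.h>
#include <openais/cpg.h>

static void leave_group(cpg_handle_t handle, struct cpg_name *name)
{
        cpg_error_t error;
        int logged = 0;

 retry:
        error = cpg_leave(handle, name);
        if (error == CPG_ERR_TRY_AGAIN) {
                if (!logged++)
                        printf("cpg_leave error retrying\n");
                sleep(1);   /* matches the nanosleep()/sleep() frames
                               in the groupd stack below */
                goto retry; /* no upper bound: if openais keeps
                               returning TRY_AGAIN, this loops forever */
        }
        if (error != CPG_OK)
                printf("cpg_leave error %d\n", error);
}

If that reading is right, the hang would mean cpg_leave() keeps
returning CPG_ERR_TRY_AGAIN indefinitely on the last node.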
If I attach to the groupd process with gdb, I get the following stack
(the repeated "in time ()" frames just mean the symbols are not
resolved; with the cman debuginfo installed they would presumably show
the retry loop above). Watching with strace, groupd is just looping.
(gdb) where
#0 0x000000341409a510 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x000000341409a364 in sleep () from /lib64/libc.so.6
#2 0x000000000040a410 in time ()
#3 0x000000000040bd09 in time ()
#4 0x000000000040e2cb in time ()
#5 0x000000000040ebe0 in time ()
#6 0x000000000040f394 in time ()
#7 0x000000341401d994 in __libc_start_main () from /lib64/libc.so.6
#8 0x00000000004018f9 in time ()
#9 0x00007fff04a671c8 in ?? ()
#10 0x0000000000000000 in ?? ()
If I attach to the aisexec process with gdb, I see the following, which
looks like its normal main poll loop, so openais itself does not appear
to be stuck:
(gdb) where
#0 0x00000034140cb696 in poll () from /lib64/libc.so.6
#1 0x0000000000405c50 in poll_run ()
#2 0x0000000000418aae in main ()
As you can see in the cluster.conf example below, I have attempted
many different ways to create more debug logging. I do see debug
messages from openais's cpg.c component during startup, but nothing is
logged during the shutdown hang.
I would appreciate any guidance on how to troubleshoot further,
especially on increasing the tracing of the openais calls in cpg.c.
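In case it helps isolate the problem, below is a small standalone
client I put together to drive cpg_join()/cpg_leave() directly and
print every return code. This is an untested sketch assuming the
whitetank-era openais cpg API from openais-devel; the group name
"cpgtest" and the 30-try limit are arbitrary choices of mine:

/* cpgtest.c - join and immediately leave a cpg group, printing every
 * return code, to see whether cpg_leave() keeps returning
 * CPG_ERR_TRY_AGAIN outside of groupd.
 * Build: gcc -o cpgtest cpgtest.c -lcpg */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <openais/cpg.h>

int main(void)
{
        cpg_handle_t handle;
        struct cpg_name name;
        cpg_callbacks_t callbacks = { NULL, NULL }; /* no dispatch needed */
        cpg_error_t error;
        int i;

        strcpy(name.value, "cpgtest");  /* arbitrary test group */
        name.length = strlen(name.value);

        error = cpg_initialize(&handle, &callbacks);
        printf("cpg_initialize: %d\n", error);

        error = cpg_join(handle, &name);
        printf("cpg_join: %d\n", error);

        /* Leave, logging every return code instead of retrying
           silently; give up after 30 tries rather than looping
           forever the way groupd appears to. */
        for (i = 0; i < 30; i++) {
                error = cpg_leave(handle, &name);
                printf("cpg_leave: %d\n", error);
                if (error != CPG_ERR_TRY_AGAIN)
                        break;
                sleep(1);
        }

        cpg_finalize(handle);
        return 0;
}

Running this on the last node while the other node is stopped might
show whether cpg_leave() keeps returning CPG_ERR_TRY_AGAIN even
outside of groupd.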
Thanks
Robert
Example cluster.conf:
<?xml version="1.0"?>
<cluster config_version="33" name="cluster_app_1">
<logging to_syslog="yes" syslog_facility="local4"
timestamp="on" debug="on">
<logger ident="CPG" debug="on"/>
<logger ident="CMAN" debug="on"/>
</logging>
<cman expected_nodes="2" expected_votes="3" quorum_dev_poll="70000">
<multicast addr="239.192.1.192"/>
</cman>
<totem token="70000"/>
<fence_daemon clean_start="0" log_facility="local4"
post_fail_delay="10" post_join_delay="60"/>
<quorumd interval="1" label="rhcs_qdisk" log_facility="local4"
log_level="7" min_score="1" tko="60" votes="1">
<heuristic interval="2" program="/bin/ping -c1 -t2
-Ibond0 10.162.106.1" score="1" tko="3"/>
</quorumd>
<clusternodes>
<clusternode name="node1-priv" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="iLO_node1"/>
</method>
</fence>
<multicast addr="239.192.1.192" interface="bond1"/>
</clusternode>
<clusternode name="node2-priv" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="iLO_node2"/>
</method>
</fence>
<multicast addr="239.192.1.192" interface="bond1"/>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice action="off" agent="fence_ipmilan"
ipaddr="X.X.X.X" login="node1_fence" name="iLO_node1"
passwd="password" power_wait="10" lanplus="1"/>
<fencedevice action="off" agent="fence_ipmilan"
ipaddr="X.X.X.X" login="node2_fence" name="iLO_node2"
passwd="password" power_wait="10" lanplus="1"/>
</fencedevices>
<rm log_level="7"/>
</cluster>