[Linux-cluster] Cluster managed KVM guest failure in rgmanager
Aaron Benner
tfrumbacher at gmail.com
Thu Dec 29 18:50:41 UTC 2011
All,
I'm not at all sure what is going on here. I have a large number of KVM guests managed by a 5-node RHEL 5.6 cluster, and recently, whenever I modify the cluster config or reload/restart libvirtd (to add/remove guests), rgmanager goes berserk. When this happens rgmanager lists the guests as "failed" services, and this is the result in the log:
Dec 29 10:44:17 plieadies1 clurgmgrd[6770]: <debug> 5 events processed
Dec 29 10:49:56 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:49:59 plieadies1 last message repeated 3 times
Dec 29 10:49:59 plieadies1 clurgmgrd[6770]: <notice> status on vm "Demeter" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "IoA" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "IoF" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "Pluto" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "Venus" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:Demeter
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:Demeter
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:IoA
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:IoA
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "Demeter" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:IoF
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "IoA" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:Demeter failed to stop; intervention required
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:Demeter is failed
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:IoF
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:Pluto
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:IoA failed to stop; intervention required
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:IoA is failed
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:Venus
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:Pluto
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "IoF" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:Venus
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:IoF failed to stop; intervention required
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:IoF is failed
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "Venus" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:Venus failed to stop; intervention required
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:Venus is failed
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "Pluto" returned 2 (invalid argument(s))
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:Pluto failed to stop; intervention required
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:Pluto is failed
Dec 29 10:50:02 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:12 plieadies1 last message repeated 4 times
Dec 29 10:50:19 plieadies1 clurgmgrd[6770]: <debug> 13 events processed
Dec 29 10:50:20 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> status on vm "saturn" returned 2 (invalid argument(s))
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:saturn
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:saturn
Dec 29 10:50:20 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> stop on vm "saturn" returned 2 (invalid argument(s))
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:saturn failed to stop; intervention required
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> Service vm:saturn is failed
Dec 29 10:50:31 plieadies1 clurgmgrd[6770]: <debug> 1 events processed
Dec 29 10:59:30 plieadies1 clurgmgrd[6770]: <debug> 1 events processed
The "Could not determine Hypervisor" message is coming from the following block of code in vm.sh:
# If someone selects a hypervisor, honor it.
# Otherwise, ask virsh what the hypervisor is.
#
if [ -z "$OCF_RESKEY_hypervisor" ] ||
   [ "$OCF_RESKEY_hypervisor" = "auto" ]; then
	export OCF_RESKEY_hypervisor="`virsh version | grep \"Running hypervisor:\" | awk '{print $3}' | tr A-Z a-z`"
	if [ -z "$OCF_RESKEY_hypervisor" ]; then
		ocf_log err "Could not determine Hypervisor"
		return $OCF_ERR_ARGS
	fi
	echo Hypervisor: $OCF_RESKEY_hypervisor
fi
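As far as I can tell, one way to sidestep the auto-detection entirely would be to pin the hypervisor explicitly on each vm resource in cluster.conf (assuming the vm resource agent's hypervisor attribute is what populates OCF_RESKEY_hypervisor, e.g. hypervisor="qemu") -- the block above then never runs virsh at all. The branch logic in isolation, as a sketch:

```shell
# Mirrors the vm.sh decision: honor an explicit hypervisor setting,
# otherwise fall back to auto-detection (placeholder stands in for the
# virsh pipeline). choose_hypervisor is an illustrative name, not vm.sh code.
choose_hypervisor() {
	if [ -z "$OCF_RESKEY_hypervisor" ] || [ "$OCF_RESKEY_hypervisor" = "auto" ]; then
		echo "auto-detect"   # vm.sh would run the virsh version pipeline here
	else
		echo "$OCF_RESKEY_hypervisor"
	fi
}

OCF_RESKEY_hypervisor="qemu"
choose_hypervisor   # prints "qemu" without ever touching virsh
```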
What's really twisting my shorts is that the command being run to determine the hypervisor works fine at the command prompt:
[root@plieadies1 ~]# virsh version | grep "Running hypervisor:" | awk '{print $3}' | tr A-Z a-z
qemu
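One plausible failure mode (a guess on my part, not something I've confirmed): if libvirtd is still restarting when the status check fires, `virsh version` emits only an error on stderr, the grep matches nothing, and the detected value comes back empty, which is exactly the `-z` branch that logs the error. A minimal sketch of the pipeline with the virsh output passed in as a string:

```shell
# Sketch of the detection pipeline from vm.sh, with virsh's stdout
# simulated so the healthy and empty cases can be compared side by side.
detect_hypervisor() {
	printf '%s\n' "$1" | grep "Running hypervisor:" | awk '{print $3}' | tr A-Z a-z
}

detect_hypervisor "Running hypervisor: QEMU 0.9.1"   # prints "qemu"
detect_hypervisor ""   # prints nothing -> the OCF_ERR_ARGS path above
```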
I can work around this: migrate the still-running guest to another node, use clusvcadm to disable it in rgmanager, and then use a wrapper around virsh that returns 0 when asked to start an already-running guest, so the still-running VM comes back under cluster control. However, I'm hugely concerned that I'm going to end up with a host failure and a heap of trouble at some point.
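For reference, the wrapper amounts to something like the following sketch (the function name and exact logic here are illustrative, not my actual script; it assumes `virsh domstate` reports "running" for an active guest):

```shell
# Hypothetical wrapper logic: treat "start an already-running guest" as
# success so rgmanager's start operation succeeds and re-adopts the VM.
start_vm() {
	state="$(virsh domstate "$1" 2>/dev/null)"
	if [ "$state" = "running" ]; then
		return 0   # already running: report success to rgmanager
	fi
	virsh start "$1"
}
```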
Anyone seen something similar or have thoughts on this? Guesses as to why rgmanager / vm.sh are failing to detect the running hypervisor?
--AB