[Linux-cluster] Cluster managed KVM guest failure in rgmanager

Aaron Benner tfrumbacher at gmail.com
Thu Dec 29 18:50:41 UTC 2011


All,

I'm not at all sure what is going on here.  I have a large number of KVM guests being managed by a 5-node RHEL 5.6 cluster, and recently, whenever I modify the cluster config or reload/restart libvirtd (to add or remove guests), rgmanager goes berserk.  When this happens, rgmanager lists the guests as "failed" services, and this is the result in the log:

Dec 29 10:44:17 plieadies1 clurgmgrd[6770]: <debug> 5 events processed 
Dec 29 10:49:56 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:49:59 plieadies1 last message repeated 3 times
Dec 29 10:49:59 plieadies1 clurgmgrd[6770]: <notice> status on vm "Demeter" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "IoA" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "IoF" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "Pluto" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> status on vm "Venus" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:Demeter 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:Demeter 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:IoA 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:IoA 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "Demeter" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:IoF 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "IoA" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:Demeter failed to stop; intervention required 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:Demeter is failed 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:IoF 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:Pluto 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:IoA failed to stop; intervention required 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:IoA is failed 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:Venus 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:Pluto 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "IoF" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:Venus 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:IoF failed to stop; intervention required 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:IoF is failed 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "Venus" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:Venus failed to stop; intervention required 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:Venus is failed 
Dec 29 10:50:00 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> stop on vm "Pluto" returned 2 (invalid argument(s)) 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:Pluto failed to stop; intervention required 
Dec 29 10:50:00 plieadies1 clurgmgrd[6770]: <notice> Service vm:Pluto is failed 
Dec 29 10:50:02 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:12 plieadies1 last message repeated 4 times
Dec 29 10:50:19 plieadies1 clurgmgrd[6770]: <debug> 13 events processed 
Dec 29 10:50:20 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> status on vm "saturn" returned 2 (invalid argument(s)) 
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <debug> No other nodes have seen vm:saturn 
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> Stopping service vm:saturn 
Dec 29 10:50:20 plieadies1 clurgmgrd: [6770]: <err> Could not determine Hypervisor 
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> stop on vm "saturn" returned 2 (invalid argument(s)) 
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <crit> #12: RG vm:saturn failed to stop; intervention required 
Dec 29 10:50:20 plieadies1 clurgmgrd[6770]: <notice> Service vm:saturn is failed 
Dec 29 10:50:31 plieadies1 clurgmgrd[6770]: <debug> 1 events processed 
Dec 29 10:59:30 plieadies1 clurgmgrd[6770]: <debug> 1 events processed 

The "Could not determine Hypervisor" message is coming from the following block of code in vm.sh:

	# If someone selects a hypervisor, honor it.
	# Otherwise, ask virsh what the hypervisor is.
	#
	if [ -z "$OCF_RESKEY_hypervisor" ] ||
	   [ "$OCF_RESKEY_hypervisor" = "auto" ]; then
		export OCF_RESKEY_hypervisor="`virsh version | grep \"Running hypervisor:\" | awk '{print $3}' | tr A-Z a-z`"
		if [ -z "$OCF_RESKEY_hypervisor" ]; then
			ocf_log err "Could not determine Hypervisor"
			return $OCF_ERR_ARGS
		fi
		echo Hypervisor: $OCF_RESKEY_hypervisor 
	fi
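
One thing I notice is that the probe runs a plain "virsh version" with no connection URI, so it depends on whatever virsh picks as its default connection.  Something I can try (a sketch, assuming qemu:///system is the right URI on these hosts) is forcing the URI to see whether the probe still comes back empty:

    # force the system connection instead of relying on virsh's default
    virsh -c qemu:///system version | grep "Running hypervisor:" | awk '{print $3}' | tr A-Z a-z

Alternatively, if I'm reading the block above right, vm.sh honors an explicitly set hypervisor, so pinning hypervisor="qemu" on the vm resources in cluster.conf should skip this probe entirely.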

What's really twisting my shorts is that the command being run to determine the hypervisor works fine at the command prompt:

[root@plieadies1 ~]# virsh version | grep "Running hypervisor:" | awk '{print $3}' | tr A-Z a-z
qemu
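
Since the probe works interactively but fails from rgmanager, I'm also planning to re-run it under a stripped-down environment, the way a daemon would see it (a rough sketch -- the PATH below is just my guess at what clurgmgrd inherits):

    # mimic a daemon environment: minimal PATH, no LIBVIRT_DEFAULT_URI etc.
    env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin \
        virsh version | grep "Running hypervisor:" | awk '{print $3}' | tr A-Z a-z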

As a workaround, I can migrate the still-running guest to another node, use clusvcadm to disable it in rgmanager, and then use a wrapper around virsh (one that returns 0 when asked to start an already-running guest) to bring the still-running VM back under cluster control.  However, I'm hugely concerned that I'm going to end up with a host failure and a heap of trouble at some point.
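
For reference, the wrapper is roughly this (a from-memory sketch, not the exact script -- the "already active" match is what virsh prints here when asked to start a running domain):

    #!/bin/bash
    # pass everything through to the real virsh, but treat "start" on an
    # already-running domain as success so rgmanager can re-adopt the guest
    out=$(/usr/bin/virsh "$@" 2>&1)
    rc=$?
    if [ "$1" = "start" ] && [ $rc -ne 0 ] && \
       echo "$out" | grep -q "already active"; then
        exit 0
    fi
    echo "$out"
    exit $rc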

Has anyone seen something similar, or have thoughts on this?  Any guesses as to why rgmanager / vm.sh is failing to detect the running hypervisor?

--AB





