[Linux-cluster] Services not relocated after successful fencing

Mon Jul 13 23:36:50 UTC 2009

Hi all, first mail to this mailing list.

I'm experimenting with the STABLE2 branch (using cluster-2.03.11
release) on a couple of gentoo servers (2 node cluster) using DRBD in
primary/primary.
I use rhcs for clvm, fencing, and failover of services (kvm with libvirt
and a primary/secondary drbd device used for backups). Every node has 3
gigabit ethernet interfaces: two of them are trunked in a bond device and
used for drbd replication and cluster communication, while the third is
the public interface.
cluster.conf is attached.

I've gone through all the steps, configured cman, fenced using ipmi lan,
rgmanager (with vm.sh taken from git to use libvirt) and everything is
working as expected. At least issuing

clusvcadm -M vm:vm01 -m node2 

makes the machine migrate to the other node. Similarly,
enabling/disabling/relocating a vm works too.

Obviously there's a problem :) While testing the failover I noticed a
behaviour similar to what was reported on the ML in April:
http://www.mail-archive.com/linux-cluster@redhat.com/msg05919.html

After issuing a power off using ipmi on a node to simulate a failure, I
saw in the log files:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

clustat then showed node2 as offline but its services were still marked
as "started" on the fenced node2. When node2 came back services did not
relocate back. 
I tried to trace the problem in the code, and found that in 
cluster-2.03.11/fence/fenced/agent.c

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

then on line 358 victim_nodename is freed 

357                 if (victim_nodename)
358                         free(victim_nodename);

and then update_cman is called with "victim" as node name. Since "victim"
still points to the freed buffer, the nodeid cannot be retrieved and the
call fails (and garbage is printed to syslog)

361                 if (!error) {
362                         update_cman(victim, good_device);
363                         break;

I admit I don't understand why ccs_lookup_nodename returns 0, but moving
the free call after the update_cman call makes everything work: services
relocate to the other node, and when node2 comes back and rejoins the
cluster they migrate back to the original node, as expected.
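
To show the pattern in isolation, here is a minimal standalone sketch of
the ordering issue. It is not the fenced code: lookup_nodename() and
report_fenced() are hypothetical stand-ins for ccs_lookup_nodename() and
update_cman(), and I am only assuming that the lookup hands back a
heap-allocated string that the caller must free.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ccs_lookup_nodename(): returns 0 and stores a
 * heap-allocated copy of the name in *out, or -1 if allocation fails. */
static int lookup_nodename(const char *name, char **out)
{
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);

    if (!copy)
        return -1;
    memcpy(copy, name, len);
    *out = copy;
    return 0;
}

/* Hypothetical stand-in for update_cman(): records the name it was given. */
static char last_reported[64];

static void report_fenced(const char *victim)
{
    snprintf(last_reported, sizeof(last_reported), "%s", victim);
}

/* Safe ordering: every use of the alias happens before the free. */
static int fence_node(const char *requested)
{
    const char *victim = requested;
    char *victim_nodename = NULL;

    assert(requested != NULL);

    if (lookup_nodename(victim, &victim_nodename) == 0)
        victim = victim_nodename;   /* victim now aliases the heap buffer */

    report_fenced(victim);          /* last use of the alias */

    free(victim_nodename);          /* free(NULL) is a no-op */
    return 0;
}
```

If free(victim_nodename) is moved above report_fenced(victim), "victim"
becomes a dangling pointer and reading it is undefined behaviour, which
would be consistent with the garbage I see in syslog.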

Complete patch:
diff -Nuar a/fence/fenced/agent.c b/fence/fenced/agent.c

--- a/fence/fenced/agent.c        2009-01-22 13:33:51.000000000 +0100
+++ b/fence/fenced/agent.c        2009-07-14 01:19:26.385518781 +0200
@@ -354,14 +354,14 @@

                if (device)
                        free(device);
-               if (victim_nodename)
-                       free(victim_nodename);
                free(method);

                if (!error) {
                        update_cman(victim, good_device);
                        break;
                }
+               if (victim_nodename) 
+                       free(victim_nodename);
        }

        ccs_disconnect(cd);

The question is: should I open a bug on bugzilla? Or is my setup
(gentoo, vm.sh backported, etc) too unusual for this to be useful?
Or is it just a problem in the configuration?

Sorry for my English but I'm not a native speaker.

Regards,
	Giacomo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/xml
Size: 2707 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090714/f34e5748/attachment.wsdl>