[Linux-cluster] share experience migrating cluster suite from centos 5.3 to centos 5.4

Gianluca Cecchi gianluca.cecchi at gmail.com
Thu Nov 5 09:38:34 UTC 2009

On Wed, Nov 4, 2009 at 12:57 PM, Gianluca Cecchi <gianluca.cecchi at gmail.com> wrote:

> On Wed, 4 Nov 2009 15:33:19 +1000 Peter Tiggerdine wrote:
> > 7. You're going to need to copy this over manually, otherwise it
> > will fail; I've fallen victim to this before. All cluster nodes need to start on
> > the current revision of the file before you update it. I think this is a chicken
> > and egg problem.
> I have already encountered this situation in the past, and in all cases the starting node detected that its version was not up to date and got the new config from the other node.
> My scenario was:
> node 1 and node 2 up
> node 2 shutdown
> change node1 config (I mean here in terms of services; probably not valid when inserting a qdiskd section that was not there before, or possibly in other cases)
> power on node2
> node 2 gets the new config and applies it (based on availability and correctness of definitions)
> So I don't think this is correct.....
> Would anyone care to comment on this?
> Do you have the error messages from when you hit this problem?
> On Wed, 4 Nov 2009 12:30:57 +0100 Jakov Sosic wrote:
> > Well, I usually do rolling updates (I relocate the services to other
> > nodes, update one node, then restart it and see if it joins the
> > cluster).
> OK. In fact I'm now working on a test cluster, just to get the correct
> workflow.
> But you are saying you also did this for 5.3 -> 5.4, while I experienced
> the OOM problem that David documented too, with the entry in Bugzilla......
> So you joined a just-updated 5.4 node to its previous cluster (composed
> entirely of 5.3 nodes) and you didn't get any problem at all?
> Gianluca

OK. All went well in my virtual environment.
Moreover, in step 7, I created a new IP service and updated my config on the
first updated node, enabling it while the second node, still on 5.3, was
Below is the diff against the pre-5.4

< <cluster alias="clumm" config_version="7" name="clumm">
> <cluster alias="clumm" config_version="5" name="clumm">
<             <failoverdomain name="MM3" restricted="1" ordered="1"
<                 <failoverdomainnode name="node1" priority="2"/>
<                 <failoverdomainnode name="node2" priority="1"/>
<             </failoverdomain>
<             <ip address="" monitor_link="0"/>
<         <service domain="MM3" autostart="1" name="MM3SRV">
<             <ip ref=""/>
<         </service>

When the second node joins the cluster in step 11, it does indeed get the
updated config and all goes well.
I also successfully relocated this new service from the former node to the other
No OOM with this approach, as David wrote.
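
To make the version comparison concrete, here is a minimal Python sketch of the behaviour I observed: the joining node ends up with whichever cluster.conf carries the higher config_version. This only illustrates the outcome, not how the cluster daemons actually implement it; the function names are hypothetical and the two one-line documents are reduced to the root elements shown in the diff above.

```python
import xml.etree.ElementTree as ET


def config_version(conf_xml):
    """Return the config_version attribute of a cluster.conf document."""
    return int(ET.fromstring(conf_xml).get("config_version"))


def newer_config(local_xml, peer_xml):
    """Pick the document with the higher config_version (a tie keeps local)."""
    if config_version(peer_xml) > config_version(local_xml):
        return peer_xml
    return local_xml


# Root elements taken from the diff: node1 was updated, node2 was powered off.
node1 = '<cluster alias="clumm" config_version="7" name="clumm"/>'
node2 = '<cluster alias="clumm" config_version="5" name="clumm"/>'

# node2 joins and compares its own copy with the one held by node1.
chosen = newer_config(node2, node1)
print(config_version(chosen))
```

In this sketch node2 would end up with version 7, which matches what happened when it joined.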

Two other things:
1) I see these messages about quorum on the first node, which did not appear
during the previous days in the 5.3 environment
Nov  5 08:00:14 mork clurgmgrd: [2692]: <notice> Getting status
Nov  5 08:27:08 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov  5 08:27:08 mork qdiskd[2206]: <warning> In 40 more seconds, we will be
Nov  5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov  5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov  5 09:48:23 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov  5 09:48:23 mork qdiskd[2206]: <warning> In 40 more seconds, we will be
Nov  5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov  5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status

Did any timings change between releases?
The relevant timing lines in my cluster.conf were the same in 5.3 and remain
unchanged in 5.4:

<cluster alias="clumm" config_version="7" name="clumm">
        <totem token="162000"/>
        <cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
        <fence_daemon clean_start="1" post_fail_delay="0"

        <quorumd device="/dev/sda" interval="5" label="clummquorum" log_facility="local4" log_level="7" tko="16" votes="1">
                <heuristic interval="2" program="ping -c1 -w1" score="1" tko="3000"/>

(The heuristic tko is very large because I was testing the best and safest way
to make on-the-fly changes to the heuristic, since network maintenance makes the
gateway disappear for some time, which the network guys cannot predict...)
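
As a sanity check on these numbers, here is a small Python sketch that derives the effective timeouts from the values quoted above. It assumes the usual qdisk rule that a timeout is interval * tko seconds and that token and quorum_dev_poll are in milliseconds. The closing tags are added only so the fragment parses, the truncated fence_daemon line is left out, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment rebuilt from the lines quoted above; closing tags are an addition
# so that the excerpt parses, nothing else was changed.
CONF = """
<cluster alias="clumm" config_version="7" name="clumm">
  <totem token="162000"/>
  <cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
  <quorumd device="/dev/sda" interval="5" label="clummquorum" tko="16" votes="1">
    <heuristic interval="2" program="ping -c1 -w1" score="1" tko="3000"/>
  </quorumd>
</cluster>
"""


def timings(conf_xml):
    """Return the main timeouts of a cluster.conf document, in seconds."""
    root = ET.fromstring(conf_xml)
    quorumd = root.find("quorumd")
    heuristic = quorumd.find("heuristic")
    return {
        # interval * tko: time before qdiskd gives up on a node
        "qdisk_timeout_s": int(quorumd.get("interval")) * int(quorumd.get("tko")),
        # same rule applied to the heuristic
        "heuristic_timeout_s": int(heuristic.get("interval")) * int(heuristic.get("tko")),
        # millisecond values converted to seconds
        "quorum_dev_poll_s": int(root.find("cman").get("quorum_dev_poll")) / 1000,
        "token_s": int(root.find("totem").get("token")) / 1000,
    }


for name, value in timings(CONF).items():
    print(name, value)
```

With these values the qdisk timeout is 5 * 16 = 80 seconds, equal to quorum_dev_poll, and the totem token of 162 seconds is slightly more than twice that. The warning (40 seconds hung, 40 more to go) adds up to the same 80 seconds, so the message seems consistent with the configured values rather than with a changed default, though I cannot confirm that from the logs alone.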

I don't know whether this message comes from a latency problem in my
virtual env or not....
On the host side I don't see any message with the dmesg command or in
2) I saw that a new kernel was just released...... ;-(
Any hints about possible interference with the cluster infrastructure?

