[Linux-cluster] having problems trying to setup a two node cluster
Rick Stevens
rstevens at vitalstream.com
Wed Dec 1 18:39:49 UTC 2004
vahram wrote:
> Rick Stevens wrote:
>
>>
>> I had a similar issue. The problem was with the multicast routing.
>> I was using two NICs on each node...one public (eth0) and one private
>> (eth1), with the default gateway going out eth0.
>>
>> The route for the multicast (224.x.x.x) was going out the default
>> gateway and not reaching the other machine. By putting in a fixed route
>> in for multicast:
>>
>> route add -net 224.0.0.0/8 dev eth1
>>
>> it all started working. This was my fix, it may not work for you.
>> Also, I use the CVS code from http://sources.redhat.com/cluster and
>> not the source RPMs from where you specified.
>> ----------------------------------------------------------------------
>> - Rick Stevens, Senior Systems Engineer rstevens at vitalstream.com -
>> - VitalStream, Inc. http://www.vitalstream.com -
>> - -
>> - Veni, Vidi, VISA: I came, I saw, I did a little shopping. -
>> ----------------------------------------------------------------------
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> http://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> Yep, both boxes have two NICs: eth0 is public and eth1 is private
> (192.168.2.x). I tried adding the route, but that didn't fix it. I've
> also tried disabling the private NIC and running with just the public
> NIC, and that didn't fix it either. One other interesting thing I
> noticed: when I run cman_tool join on nodeA, netstat shows ccsd trying
> to do this:
>
> tcp        0      0 127.0.0.1:50006     127.0.0.1:739     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:738     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:737     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:736     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:743     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:742     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:741     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:740     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:727     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:731     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:730     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:729     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:728     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:735     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:734     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:733     TIME_WAIT   -
> tcp        0      0 127.0.0.1:50006     127.0.0.1:732     TIME_WAIT   -
>
>
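Those are all loopback connections piling up in TIME_WAIT against ccsd's
port 50006. If you'd rather tally them from a netstat dump than eyeball
it, here's a rough sketch (the field layout and the port number are
assumptions based on the listing above):

```python
# Rough sketch: tally TIME_WAIT connections to a given local port from
# `netstat -tn`-style output. Port 50006 (ccsd, per the listing above)
# and the column layout are assumptions, not guaranteed on every system.

def count_time_wait(netstat_output, port=50006):
    """Count TIME_WAIT entries whose local address ends in :port."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, remote, state
        if len(fields) >= 6 and fields[0] == "tcp" and fields[5] == "TIME_WAIT":
            if fields[3].endswith(":%d" % port):
                count += 1
    return count

sample = """tcp 0 0 127.0.0.1:50006 127.0.0.1:739 TIME_WAIT
tcp 0 0 127.0.0.1:50006 127.0.0.1:738 TIME_WAIT
tcp 0 0 127.0.0.1:22 10.0.0.5:40000 ESTABLISHED"""

print(count_time_wait(sample))  # 2 for this sample
```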
Looking back at your cluster.conf, I see you're using broadcast. I used
multicast because, in the first CVS checkout I did, broadcast didn't
work properly. It's possible your SRPMs have the same flaw. Why not
try multicast and see if that works? Add the route I mentioned, and
here's my cluster.conf, which you can crib from:
<?xml version="1.0"?>
<cluster name="test" config_version="1">
  <cman two-node="1" expected_votes="1">
    <multicast addr="224.0.0.1"/>
  </cman>
  <nodes>
    <node name="gfs-01-001" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
      <fence>
        <method name="single">
          <device name="human" ipaddr="gfs-01-001"/>
        </method>
      </fence>
    </node>
    <node name="gfs-01-002" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
      <fence>
        <method name="single">
          <device name="human" ipaddr="gfs-01-002"/>
        </method>
      </fence>
    </node>
  </nodes>
  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>
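If you adapt this, it's worth making sure your edited file still parses
and that the per-node multicast addresses agree with the one under
<cman>. A quick standalone sanity check (just a sketch, not part of the
cluster tools; the embedded config mirrors the one above):

```python
# Sanity-check a cluster.conf like the one above: confirm it parses as
# XML and that each node's multicast addr matches the <cman> one.
# Standalone sketch only; the fence elements are omitted for brevity.
import xml.etree.ElementTree as ET

conf = """<?xml version="1.0"?>
<cluster name="test" config_version="1">
  <cman two-node="1" expected_votes="1">
    <multicast addr="224.0.0.1"/>
  </cman>
  <nodes>
    <node name="gfs-01-001" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
    </node>
    <node name="gfs-01-002" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
    </node>
  </nodes>
</cluster>"""

root = ET.fromstring(conf)
cman_addr = root.find("cman/multicast").get("addr")
mismatches = [
    node.get("name")
    for node in root.findall("nodes/node")
    if node.find("multicast").get("addr") != cman_addr
]
print("cman multicast:", cman_addr)
print("nodes with a different addr:", mismatches)
```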
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer rstevens at vitalstream.com -
- VitalStream, Inc. http://www.vitalstream.com -
- -
- What's small, yellow and very, VERY dangerous? The root canary! -
----------------------------------------------------------------------