[Linux-cluster] cman startup issue

Patrick Caulfield pcaulfie at redhat.com
Wed Nov 7 14:17:26 UTC 2007


Patrick Caulfield wrote:
> gordan at bobich.net wrote:
>> On Wed, 7 Nov 2007, Patrick Caulfield wrote:
>>
>>>>>>>> I'm having a weird problem. I am using a shared GFS root file
>>>>>>>> system, and the same initrd image on all the machines. The
>>>>>>>> cluster has three machines on it at the moment, and one refuses
>>>>>>>> to join the cluster, regardless of which order I bring them up in.
>>>>>>>>
>>>>>>>> When cman service is being started, it fails when starting cman:
>>>>>>>>
>>>>>>>> cman not started: Can't find local node name in cluster.conf
>>>>>>>> /usr/local/sbin/cman_tool: aisexec daemon didn't start
>>>>>>>>
>>>>>>>> If I try to run aisexec, I get:
>>>>>>>> aisexec: totemsrp.c:2867: memb_ring_id_store: Assertion `0' failed.
>>>>>>>>
>>>>>>>> Where should I be looking for causes of this? I double-checked
>>>>>>>> my cluster.conf, and the MAC addresses, IP addresses and
>>>>>>>> interface names are correct in each node's config.
>>>>>>> Check that the new node can write into /tmp - where it is trying
>>>>>>> to store the current ring-id. It could be SELinux or perhaps the
>>>>>>> permissions on the file it is trying to create.
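
A quick way to test that by hand, as a rough sketch (this assumes
aisexec runs as root and a default RHEL5 layout; the test file name
below is just illustrative):

    # confirm root can create files in /tmp, where the ring-id is kept
    touch /tmp/ringid-test && rm /tmp/ringid-test
    # see whether SELinux is enforcing; a denial would also be logged
    # in /var/log/audit/audit.log
    getenforce
    # /tmp itself should normally be world-writable with the sticky
    # bit set (drwxrwxrwt)
    ls -ld /tmp
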
>>>>>> That fixed the aisexec problem, but the "Can't find local node name in
>>>>>> cluster.conf" problem remains, and cman still won't start. :-(
>>>>> Well, it won't start if it can't find the local node name in
>>>>> cluster.conf ...
>>>>> Have you double-checked that the name(s) in cluster.conf match
>>>>> those on the ethernet interfaces?
>>>> You mean as in:
>>>> <eth name="eth1" mac="my:ma:ca:dd:re:ss" ip="10.1.2.3"
>>>> mask="255.255.255.0"/>
>>>> ?
>>>>
>>>> If so, then yes, I checked it about 10 times. That was the first thing I
>>>> thought was wrong. :-(
>>> As I don't have your cluster.conf or access to your DNS server, it's
>>> hard to say from here, but that message does mean what it says. If
>>> you have older software it might not detect anything other than the
>>> node's main hostname, but later versions will check all the
>>> interfaces on the system for something that matches anything in
>>> cluster.conf.
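
One way to see what there is to match, sketched by hand (this assumes
the stock config path /etc/cluster/cluster.conf with standard
<clusternode name="..."> entries; node1.example.com below is just a
placeholder):

    # the node names cman will try to match
    grep 'clusternode name=' /etc/cluster/cluster.conf
    # the addresses actually configured on this node's interfaces
    ip addr show | grep 'inet '
    # each clusternode name should resolve (via DNS or /etc/hosts)
    # to one of those addresses
    getent hosts node1.example.com
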
>> Well, the thing that really puzzles me is that the same cluster used to
>> work before. All I effectively did was move it to a different IP range
>> and change cluster.conf. I can't figure out what could have changed in
>> the meantime to break it, other than cluster.conf. The only other thing
>> that's different is that some of the machines have eth1 and eth0
>> reversed. Before, they all used eth1 for cluster communication, and now
>> one of them uses eth0 (slightly different model, and the manufacturer
>> mislabeled the ports on them). But I have two identical machines, and one
>> connects, the other doesn't. It really has me stumped.
>>
>>> I see you're using eth1, so make sure you do have an up-to-date cman.
>> I'm running the latest that is available for RHEL5.
> 
> If that's what came with 5.0, then there's a bug in the name matching.
> Unfortunately, I can't figure out from the CVS tags in which package
> this was fixed.
> 
> "revision 1.26
>  date: 2007/03/15 11:12:33;  author: pcaulfield;  state: Exp;  lines: +16 -13
>  If the machine is multi-homed, then using a truncated name in uname but not in
>  cluster.conf would fail to match them up."

Well, I can tell you that the fix is NOT in cman-2.0.61, and it IS in
cman-2.0.73. Sorry I can't be more specific!
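
If you want to check what you're running, the installed package version
tells you whether the fix is present (assuming an RPM-based install
such as RHEL5):

    # the name-matching fix is absent in cman-2.0.61 and present
    # in cman-2.0.73
    rpm -q cman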

-- 
Patrick



