[Linux-cluster] why do all services stop when a node reboots?

ESGLinux esggrupos at gmail.com
Fri Feb 13 08:23:36 UTC 2009


Hello,

The services run fine on node1. If I halt node2 and try to run the services,
they run fine on node1.
If I run the services without the cluster, they also run fine.

I have removed the HTTP service and left only the BBDD service in order to
debug the problem. Here is the log from when the service is running on node2
and node1 comes up:

Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering GATHER state from 11.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Creating commit token because I am the rep.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Saving state aru 1a high seq received 1a
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Storing new sequence id for ring 17f4
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering COMMIT state.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering RECOVERY state.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [0] member 192.168.1.185:
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.185
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 1a high delivered 1a received flag 1
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [1] member 192.168.1.188:
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.188
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 9 high delivered 9 received flag 1
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Did not need to originate any messages in recovery.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Sending initial ORF token
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
Feb 13 09:16:00 NODE2 openais[3326]: [SYNC ] This node is within the primary component and will provide service.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering OPERATIONAL state.
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message 192.168.1.185
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message 192.168.1.188
Feb 13 09:16:00 NODE2 openais[3326]: [CPG  ] got joinlist message from node 2
Feb 13 09:16:03 NODE2 kernel: dlm: connecting to 1
Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Relocating service:BBDD to better node node1
Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Stopping service service:BBDD
Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb > Failed - Application Is Still Running
Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb > Failed
Feb 13 09:16:25 NODE2 clurgmgrd[4001]: <notice> stop on mysql "mydb" returned 1 (generic error)
Feb 13 09:16:25 NODE2 avahi-daemon[3872]: Withdrawing address record for 192.168.1.183 on eth0.
Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <crit> #12: RG service:BBDD failed to stop; intervention required
Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <notice> Service service:BBDD is failed
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <warning> #70: Failed to relocate service:BBDD; restarting locally
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <err> #43: Service service:BBDD has failed; can not start.
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #2: Service service:BBDD returned failure code.  Last Owner: node2
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #4: Administrator intervention required.


As you can see, the log says "Relocating service:BBDD to better node node1",
but the relocation fails and the service ends up in the failed state.
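
To clear the failed state I have to intervene by hand, roughly like this (a
sketch using the standard rgmanager clusvcadm commands; I am not sure it is
the cleanest way):

  # disable the service so rgmanager drops the failed state
  clusvcadm -d service:BBDD

  # enable it again, asking for node1 explicitly
  clusvcadm -e service:BBDD -m node1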

Another error that appears frequently in my logs is this one:

<err> Checking Existence Of File /var/run/cluster/mysql/mysql:mydb.pid [mysql:mydb] > Failed - File Doesn't Exist

I don't know if this is important, but I think it is what leads to the
"Stopping Service mysql:mydb > Failed - Application Is Still Running" error
and so makes the service fail (I'm just guessing...).
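
In case it is useful, this is what I plan to check on node2 to see why that
PID file never appears (just a sketch, assuming the stock mysqld/httpd init
scripts and that /etc/my.cnf is the only MySQL configuration file):

  # is a mysqld started by init running outside the cluster?
  service mysqld status
  chkconfig --list mysqld

  # make sure only rgmanager starts MySQL
  chkconfig mysqld off

  # does my.cnf point the pid file somewhere else?
  grep -i pid-file /etc/my.cnf

  # which mysqld processes are actually running?
  ps aux | grep mysqld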

Any idea?

ESG


2009/2/12 rajveer singh <torajveersingh at gmail.com>

> Hi,
>
> OK, perhaps there is some problem with the services on node1. Are you able
> to run these services on node1 without the cluster? First stop the cluster,
> then try to run these services on node1.
>
> They should run.
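>
> Something like this, for example (just a rough sketch, assuming the usual
> init scripts; if the daemons bind to the cluster IP 192.168.1.183 you will
> also have to add that address to eth0 by hand):
>
>   # stop the cluster stack on node1
>   service rgmanager stop
>   service cman stop
>
>   # then start the services directly
>   service httpd start
>   service mysqld start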
>
> Re,
> Rajveer Singh
>
> 2009/2/13 ESGLinux <esggrupos at gmail.com>
>
>> Hello,
>>
>> That's what I want: when node1 comes up, I want the services to relocate to
>> node1, but what I get is all my services stopped and in a failed state.
>>
>> With my configuration I expect to have the services running on node1.
>>
>> Any idea about this behaviour?
>>
>> Thanks
>>
>> ESG
>>
>>
>> 2009/2/12 rajveer singh <torajveersingh at gmail.com>
>>
>>
>>>
>>> 2009/2/12 ESGLinux <esggrupos at gmail.com>
>>>
>>>>  Hello all,
>>>>
>>>> I'm testing a cluster using luci as the admin tool. I have configured 2
>>>> nodes with 2 services, http + mysql. This configuration works almost
>>>> fine: I have the services running on node1, and then I reboot node1. The
>>>> services relocate to node2 and everything continues working, but when
>>>> node1 comes back up, all the services stop.
>>>>
>>>> I think that node1, when it comes back alive, tries to run the services,
>>>> and that makes the services stop. Can that be true? I think node1 should
>>>> not start anything, because the services are already running on node2.
>>>>
>>>> Perhaps it is a problem with the configuration, perhaps with fencing (I
>>>> have not configured fencing at all).
>>>>
>>>> Here is my cluster.conf. Any idea?
>>>>
>>>> Thanks in advance
>>>>
>>>> ESG
>>>>
>>>>
>>>> <?xml version="1.0"?>
>>>> <cluster alias="MICLUSTER" config_version="29" name="MICLUSTER">
>>>>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>>>         <clusternodes>
>>>>                 <clusternode name="node1" nodeid="1" votes="1">
>>>>                         <fence/>
>>>>                 </clusternode>
>>>>                 <clusternode name="node2" nodeid="2" votes="1">
>>>>                         <fence/>
>>>>                 </clusternode>
>>>>         </clusternodes>
>>>>         <cman expected_votes="1" two_node="1"/>
>>>>         <fencedevices/>
>>>>         <rm>
>>>>                 <failoverdomains>
>>>>                         <failoverdomain name="DOMINIOFAIL" nofailback="0" ordered="1" restricted="1">
>>>>                                 <failoverdomainnode name="node1" priority="1"/>
>>>>                                 <failoverdomainnode name="node2" priority="2"/>
>>>>                         </failoverdomain>
>>>>                 </failoverdomains>
>>>>                 <resources>
>>>>                         <ip address="192.168.1.183" monitor_link="1"/>
>>>>                 </resources>
>>>>                 <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="HTTP" recovery="relocate">
>>>>                         <apache config_file="conf/httpd.conf" name="http" server_root="/etc/httpd" shutdown_wait="0"/>
>>>>                         <ip ref="192.168.1.183"/>
>>>>                 </service>
>>>>                 <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="BBDD" recovery="relocate">
>>>>                         <mysql config_file="/etc/my.cnf" listen_address="192.168.1.183" name="mydb" shutdown_wait="0"/>
>>>>                         <ip ref="192.168.1.183"/>
>>>>                 </service>
>>>>         </rm>
>>>> </cluster>
>>>>
>>>>
>>>
>>> Hi ESG,
>>>
>>> Of course: since you have defined the priority of node1 as 1 and node2 as
>>> 2, node1 has the higher priority, so whenever it comes up it will try to
>>> run the service on itself and will therefore relocate the service from
>>> node2 to node1.
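>>>
>>> If you do not want that automatic failback, one option (just a sketch; I
>>> have not tested it against your exact configuration) is to set
>>> nofailback="1" on the failover domain. You can also move the service by
>>> hand whenever you want:
>>>
>>>   # relocate the service to a specific node manually
>>>   clusvcadm -r service:BBDD -m node1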
>>>
>>>
>>> Re,
>>> Rajveer Singh