[Linux-cluster] why all services stops when a node reboots?

ESGLinux esggrupos at gmail.com
Fri Feb 13 08:39:45 UTC 2009


More clues:

Using system-config-cluster:

When I try to start a service that is in the failed state, I always get an error.
I have to disable the service first, so that it goes to the disabled state. From that
state I can restart the services.

I think I have a problem with relocation, because I can't do it with luci, with
system-config-cluster, or with clusvcadm.

I always get an error when I try it.
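
For reference, these are the kinds of commands I am running (just a sketch of the
usual clusvcadm calls, with the service and node names from my cluster.conf):

    clusvcadm -d BBDD            # disable the failed service so it goes to "disabled"
    clusvcadm -e BBDD -m node1   # enable (start) the service on node1
    clusvcadm -r BBDD -m node1   # relocate the service to node1 -- this is what always fails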

greetings

ESG

2009/2/13 ESGLinux <esggrupos at gmail.com>

> Hello,
>
> The services run OK on node1. If I halt node2 and try to start the services,
> they run OK on node1.
> If I run the services outside the cluster, they also run OK.
>
> I have removed the HTTP service and kept only the BBDD service to debug the
> problem. Here is the log from when the service is running on node2 and
> node1 comes up:
>
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering GATHER state from 11.
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Creating commit token because I am the rep.
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Saving state aru 1a high seq received 1a
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Storing new sequence id for ring 17f4
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering COMMIT state.
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering RECOVERY state.
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [0] member 192.168.1.185:
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.185
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 1a high delivered 1a received flag 1
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [1] member 192.168.1.188:
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.188
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 9 high delivered 9 received flag 1
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Did not need to originate any messages in recovery.
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Sending initial ORF token
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
> Feb 13 09:16:00 NODE2 openais[3326]: [SYNC ] This node is within the primary component and will provide service.
> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering OPERATIONAL state.
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message 192.168.1.185
> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message 192.168.1.188
> Feb 13 09:16:00 NODE2 openais[3326]: [CPG  ] got joinlist message from node 2
> Feb 13 09:16:03 NODE2 kernel: dlm: connecting to 1
> Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Relocating service:BBDD to better node node1
> Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Stopping service service:BBDD
> Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb > Failed - Application Is Still Running
> Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb > Failed
> Feb 13 09:16:25 NODE2 clurgmgrd[4001]: <notice> stop on mysql "mydb" returned 1 (generic error)
> Feb 13 09:16:25 NODE2 avahi-daemon[3872]: Withdrawing address record for 192.168.1.183 on eth0.
> Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <crit> #12: RG service:BBDD failed to stop; intervention required
> Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <notice> Service service:BBDD is failed
> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <warning> #70: Failed to relocate service:BBDD; restarting locally
> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <err> #43: Service service:BBDD has failed; can not start.
> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #2: Service service:BBDD returned failure code.  Last Owner: node2
> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #4: Administrator intervention required.
>
>
> As you can see in the log, the message "Relocating service:BBDD to better
> node node1" appears, but the relocation fails.
>
> Another error that appears frequently in my logs is this one:
>
> <err> Checking Existence Of File /var/run/cluster/mysql/mysql:mydb.pid [mysql:mydb] > Failed - File Doesn't Exist
>
> I don't know if this is important, but I think this is what causes the message
> "<err> Stopping Service mysql:mydb > Failed - Application Is Still Running",
> and that is what makes the service fail (I'm just guessing...).
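>
> In case it helps, these are the kinds of checks I can run by hand on the node
> that owns the service (plain shell, outside the cluster tools), just to see
> whether mysqld is really up and where it writes its pid file:
>
>     ps aux | grep [m]ysqld                       # is mysqld actually running?
>     ls -l /var/run/cluster/mysql/                # does the cluster pid directory exist?
>     cat /var/run/cluster/mysql/mysql:mydb.pid    # the file the agent complains about
>     grep -i pid /etc/my.cnf                      # is mysqld told to write its pid somewhere else?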
>
> Any idea?
>
>
> ESG
>
>
> 2009/2/12 rajveer singh <torajveersingh at gmail.com>
>
>> Hi,
>>
>> OK, perhaps there is some problem with the services on node1. Are you
>> able to run these services on node1 without the cluster? First stop the
>> cluster, then try to run these services on node1.
>>
>> They should run.
>>
>> Re,
>> Rajveer Singh
>>
>> 2009/2/13 ESGLinux <esggrupos at gmail.com>
>>
>>> Hello,
>>>
>>> That's what I want: when node1 comes up I want the services to relocate to
>>> node1, but what I get is all my services stopped and in the failed state.
>>>
>>> With my configuration I expect to have the services running on node1.
>>>
>>> Any idea about this behaviour?
>>>
>>> Thanks
>>>
>>> ESG
>>>
>>>
>>> 2009/2/12 rajveer singh <torajveersingh at gmail.com>
>>>
>>>
>>>>
>>>> 2009/2/12 ESGLinux <esggrupos at gmail.com>
>>>>
>>>>>  Hello all,
>>>>>
>>>>> I'm testing a cluster using luci as the admin tool. I have configured 2
>>>>> nodes with 2 services, http + mysql. This configuration almost works fine: I
>>>>> have the services running on node1 and I reboot node1. Then the services
>>>>> relocate to node2 and everything continues working, but when node1 comes
>>>>> back up all the services stop.
>>>>>
>>>>> I think that node1, when it comes back up, tries to run the services and
>>>>> that makes the services stop. Can that be true? I think node1 should not
>>>>> start anything, because the services are already running on node2.
>>>>>
>>>>> Perhaps it is a problem with the configuration, perhaps with fencing (I
>>>>> have not configured fencing at all).
>>>>>
>>>>> Here is my cluster.conf. Any ideas?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> ESG
>>>>>
>>>>>
>>>>> <?xml version="1.0"?>
>>>>> <cluster alias="MICLUSTER" config_version="29" name="MICLUSTER">
>>>>>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>>>>         <clusternodes>
>>>>>                 <clusternode name="node1" nodeid="1" votes="1">
>>>>>                         <fence/>
>>>>>                 </clusternode>
>>>>>                 <clusternode name="node2" nodeid="2" votes="1">
>>>>>                         <fence/>
>>>>>                 </clusternode>
>>>>>         </clusternodes>
>>>>>         <cman expected_votes="1" two_node="1"/>
>>>>>         <fencedevices/>
>>>>>         <rm>
>>>>>                 <failoverdomains>
>>>>>                         <failoverdomain name="DOMINIOFAIL" nofailback="0" ordered="1" restricted="1">
>>>>>                                 <failoverdomainnode name="node1" priority="1"/>
>>>>>                                 <failoverdomainnode name="node2" priority="2"/>
>>>>>                         </failoverdomain>
>>>>>                 </failoverdomains>
>>>>>                 <resources>
>>>>>                         <ip address="192.168.1.183" monitor_link="1"/>
>>>>>                 </resources>
>>>>>                 <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="HTTP" recovery="relocate">
>>>>>                         <apache config_file="conf/httpd.conf" name="http" server_root="/etc/httpd" shutdown_wait="0"/>
>>>>>                         <ip ref="192.168.1.183"/>
>>>>>                 </service>
>>>>>                 <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="BBDD" recovery="relocate">
>>>>>                         <mysql config_file="/etc/my.cnf" listen_address="192.168.1.183" name="mydb" shutdown_wait="0"/>
>>>>>                         <ip ref="192.168.1.183"/>
>>>>>                 </service>
>>>>>         </rm>
>>>>> </cluster>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Hi ESG,
>>>>
>>>> Of course: as you have defined the priority of node1 as 1 and node2 as 2,
>>>> node1 has the higher priority, so whenever it comes up it will try to run
>>>> the service on itself, and so it will relocate the service from node2 to
>>>> node1.
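>>>>
>>>> If you do not want that automatic failback when node1 rejoins, one option
>>>> (just a sketch based on the cluster.conf you posted) is to set
>>>> nofailback="1" on the failover domain:
>>>>
>>>>         <failoverdomain name="DOMINIOFAIL" nofailback="1" ordered="1" restricted="1">
>>>>                 <failoverdomainnode name="node1" priority="1"/>
>>>>                 <failoverdomainnode name="node2" priority="2"/>
>>>>         </failoverdomain>
>>>>
>>>> With that setting the service should stay on node2 when node1 comes back,
>>>> and you can relocate it by hand whenever you want.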
>>>>
>>>>
>>>> Re,
>>>> Rajveer Singh
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>

