[Linux-cluster] service stuck in "starting" state

jason at monsterjam.org jason at monsterjam.org
Mon Jul 13 23:24:27 UTC 2009


On Mon, Jul 13, 2009 at 12:16:00PM -0700, Rick Stevens wrote:
> jason at monsterjam.org wrote:
>> On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
>>> jason at monsterjam.org wrote:
>>>> hey cluster gurus..
>>>> I have a 2 node cluster thats been running without issue for quite a 
>>>> while.. all of a sudden one of the nodes will not completely start the 
>>>> apache webserver service.. it looks like this
>>>> [root at tf1 ~]# clustat
>>>> Member Status: Quorate
>>>>   Member Name                              Status
>>>>   ------ ----                              ------
>>>>   tf1                                      Online, Local, rgmanager
>>>>   tf2                                      Online, rgmanager
>>>>   Service Name         Owner (Last)                   State           
>>>> ------- ----         ----- ------                   -----           
>>>> Apache Service       tf1                            starting          
>>>> postfix service      tf1                            started         
>>>> [root at tf1 ~]#
>>>> and I see that the httpd is NOT started. although, if I do
>>>> /etc/init.d/httpd start
>>>> the service starts without issue.
>>>> grepping for apache and http in the logs, I see this..
>>>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
>>>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>>>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
>>>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>>>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop
>>>> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>>>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop
>>>> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>>>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>>>> [root at tf1 log]# grep apache  messages
>>>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>>> [root at tf1 log]#
>>>> Im guessing its the  stop on script "cluster_apache" returned 1 (generic error)
>>>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both 
>>>> the same size
>>>> [root at tf2 ~]# ls -al /etc/init.d/httpd
>>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>>> [root at tf1 log]# ls -al /etc/init.d/httpd
>>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>>> and the apache service starts/stops just fine on tf2 when the services 
>>>> get failed over to that machine.
>>>> any ideas on what can be wrong?
>>> tf1 is complaining about a bad SSL cert.  The fact that it's complaining
>>> when being started by clurgmgrd but not when started manually indicates
>>> that clurgmgrd is starting it differently (specifying a different
>>> httpd.conf file perhaps?).
>> well, heres the relevant part of my config file
>>         <rm>
>>                 <failoverdomains>
>>                         <failoverdomain name="httpd" ordered="1" restricted="1">
>>                                 <failoverdomainnode name="tf1" priority="1"/>
>>                                 <failoverdomainnode name="tf2" priority="2"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <resources>
>>                         <script file="/etc/init.d/httpd" name="cluster_apache"/>
>>                         <ip address="192.168.1.7" monitor_link="1"/>
>>                         <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
>>                 </resources>
>>                 <service autostart="1" domain="httpd" name="Apache Service">
>>                         <ip ref="192.168.1.7"/>
>>                         <script ref="cluster_apache"/>
>>                 </service>
>>                 <service autostart="1" domain="httpd" name="postfix service">
>>                         <ip ref="192.168.1.7"/>
>>                         <script ref="cluster_posstfix"/>
>>                 </service>
>>         </rm>
>> ive never seen that ssl error when starting the service manually.
>> the other thing that I noticed.. is that when I try to do
>> [root at tf1 cluster]# clusvcadm -d "Apache Service"
>> Member tf1 disabling Apache Service...
>> it just hangs there and never returns.
>
> Sorry about the delay in responding.  Was out of town for the weekend.
>
> Does clusvcadm or clurgmgrd run as a different user...one that either
> can't read the SSL certs or the directory containing them?  Normally
> the stuff in /etc/init.d runs as root.  Running one of those scripts as
> a different user can lead to lots of permissions issues.  It's bitten
> me before.

OK, so I think I've found out what's going on.
There is a custom program from a 3rd party vendor that they're trying to get going on
this cluster, and it is somehow interfering with the apache service coming up.
If this 3rd party application is NOT started (from /etc/init.d/rc.local) when the server boots up,
then the apache service comes up fine. If the 3rd party application IS allowed to start when the
server boots up, it somehow causes the apache service to not come up correctly, and clustat shows:
  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  Apache Service       tf1                            starting
  postfix service      tf1                            started

But like I said earlier,
/etc/init.d/httpd start
still works fine either way.

If the 3rd party program is NOT running, the cluster services come up fine on their own.
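
Since the logs show clurgmgrd reporting that stop on script "cluster_apache" returned 1, one thing that might help narrow this down is comparing the exit codes of the init script with and without the 3rd party program running. This is only a rough sketch: check_initscript is a made-up helper name, and I'm assuming the exit code of the init script is what matters to the cluster here.

```shell
# Hypothetical helper: run an init script with each given action and
# print the exit code it returns.
# Usage: check_initscript /path/to/initscript action [action ...]
check_initscript() {
    script="$1"
    shift
    for action in "$@"; do
        # Discard the script's own output; only the exit code is reported.
        "$script" "$action" >/dev/null 2>&1
        echo "$action -> exit $?"
    done
}
```

For example, running check_initscript /etc/init.d/httpd status once with the 3rd party program running and once without, then comparing the two. Passing stop or start really does stop or start the service, so those are best tried on the node that does not currently own the service.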

funky. 

thanks for the help,
Jason



