[Linux-cluster] service stuck in "starting" state

Rick Stevens ricks at nerd.com
Mon Jul 13 19:16:00 UTC 2009


jason at monsterjam.org wrote:
> On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
>> jason at monsterjam.org wrote:
>>> hey cluster gurus..
>>> I have a 2 node cluster thats been running without issue for quite a 
>>> while.. all of a sudden one of the nodes will not completely start the 
>>> apache webserver service.. it looks like this
>>> [root at tf1 ~]# clustat
>>> Member Status: Quorate
>>>   Member Name                              Status
>>>   ------ ----                              ------
>>>   tf1                                      Online, Local, rgmanager
>>>   tf2                                      Online, rgmanager
>>>   Service Name         Owner (Last)                   State
>>>   ------- ----         ----- ------                   -----
>>>   Apache Service       tf1                            starting
>>>   postfix service      tf1                            started
>>> [root at tf1 ~]#
>>> and I see that the httpd is NOT started. although, if I do 
>>> /etc/init.d/httpd start
>>> the service starts without issue.
>>> grepping for apache and http in the logs, I see this..
>>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of 
>>> /etc/httpd/conf.d/ssl.conf:
>>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file 
>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of 
>>> /etc/httpd/conf.d/ssl.conf:
>>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file 
>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop
>>> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop
>>> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>>> [root at tf1 log]# grep apache  messages
>>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> [root at tf1 log]#
>>> Im guessing its the  stop on script "cluster_apache" returned 1 (generic error)
>>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the 
>>> same size
>>> [root at tf2 ~]# ls -al /etc/init.d/httpd
>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>> [root at tf1 log]# ls -al /etc/init.d/httpd
>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>> and the apache service starts/stops just fine on tf2 when the services get 
>>> failed over to that machine.
>>> any ideas on what can be wrong?
>> tf1 is complaining about a bad SSL cert.  The fact that it's complaining
>> when being started by clurgmgrd but not when started manually indicates
>> that clurgmgrd is starting it differently (specifying a different
>> httpd.conf file perhaps?).
> 
> well, heres the relevant part of my config file
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="httpd" ordered="1" restricted="1">
>                                 <failoverdomainnode name="tf1" priority="1"/>
>                                 <failoverdomainnode name="tf2" priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <script file="/etc/init.d/httpd" name="cluster_apache"/>
>                         <ip address="192.168.1.7" monitor_link="1"/>
>                         <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
>                 </resources>
>                 <service autostart="1" domain="httpd" name="Apache Service">
>                         <ip ref="192.168.1.7"/>
>                         <script ref="cluster_apache"/>
>                 </service>
>                 <service autostart="1" domain="httpd" name="postfix service">
>                         <ip ref="192.168.1.7"/>
>                         <script ref="cluster_posstfix"/>
>                 </service>
>         </rm>
> 
> ive never seen that ssl error when starting the service manually.
> 
> 
> the other thing that I noticed.. is that when I try to do 
> 
> [root at tf1 cluster]# clusvcadm -d "Apache Service"
> Member tf1 disabling Apache Service...
> 
> it just hangs there and never returns.

Sorry about the delay in responding.  Was out of town for the weekend.

Does clusvcadm or clurgmgrd run as a different user...one that either
can't read the SSL certs or the directory containing them?  Normally
the stuff in /etc/init.d runs as root.  Running one of those scripts as
a different user can lead to lots of permissions issues.  It's bitten
me before.
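
If it helps, here is a rough sketch of how the cert file itself could be
checked. It assumes nothing beyond a POSIX shell; cert_status is a
made-up helper name, not part of httpd or rgmanager. It separates the
cases that the httpd message lumps together ("does not exist or is
empty") and adds a readability test.

```shell
#!/bin/sh
# cert_status is a hypothetical helper (not part of any package).
# It reports why a certificate file might be rejected:
#   missing    - no such path
#   unreadable - exists, but the current user cannot read it
#   empty      - exists and is readable, but has zero length
#   ok         - exists, readable, non-empty
cert_status() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing"
    elif [ ! -r "$f" ]; then
        echo "unreadable"
    elif [ ! -s "$f" ]; then
        echo "empty"
    else
        echo "ok"
    fi
}
```

Something like "cert_status /etc/httpd/conf/ssl.crt/server.crt" on tf1
and on tf2 would show whether the file named in your log is there at
all. Keep in mind that root passes the readability test on nearly any
file, so "unreadable" only means something if the check is run as the
user the daemon actually uses. On a procps-based system,
"ps -o user= -C clurgmgrd" should show that user.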
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer                      ricks at nerd.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
- Millihelen, adj: The amount of beauty required to launch one ship. -
----------------------------------------------------------------------
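
P.S. On the init script comparison quoted above: matching size and
date in "ls -al" doesn't prove the two copies of /etc/init.d/httpd are
identical, since an edit can leave the byte count unchanged. Comparing
checksums is a stronger test. A minimal sketch, assuming the standard
cksum and awk utilities are available on both nodes (file_sum is just
an illustrative name):

```shell
#!/bin/sh
# file_sum is a hypothetical helper: print only the CRC that cksum
# computes over the file's contents (cksum also prints the byte count,
# which is dropped here).
file_sum() {
    cksum < "$1" | awk '{ print $1 }'
}

# Run on each node and compare the two numbers, for example:
#   file_sum /etc/init.d/httpd
```

The same check may be worth running on /etc/httpd/conf.d/ssl.conf,
since that is the file the syntax error points at.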



