[Linux-cluster] service stuck in "starting" state
Rick Stevens
ricks at nerd.com
Mon Jul 13 19:16:00 UTC 2009
jason at monsterjam.org wrote:
> On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
>> jason at monsterjam.org wrote:
>>> hey cluster gurus..
>>> I have a 2-node cluster that's been running without issue for quite a
>>> while.. all of a sudden one of the nodes will not completely start the
>>> apache webserver service.. it looks like this
>>> [root at tf1 ~]# clustat
>>> Member Status: Quorate
>>>
>>>   Member Name          Status
>>>   ------ ----          ------
>>>   tf1                  Online, Local, rgmanager
>>>   tf2                  Online, rgmanager
>>>
>>>   Service Name         Owner (Last)         State
>>>   ------- ----         ----- ------         -----
>>>   Apache Service       tf1                  starting
>>>   postfix service      tf1                  started
>>> [root at tf1 ~]#
>>> and I see that the httpd is NOT started. although, if I do
>>> /etc/init.d/httpd start
>>> the service starts without issue.
>>> grepping for apache and http in the logs, I see this..
>>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of
>>> /etc/httpd/conf.d/ssl.conf:
>>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file
>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of
>>> /etc/httpd/conf.d/ssl.conf:
>>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file
>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop
>>> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop
>>> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>>> [root at tf1 log]# grep apache messages
>>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>>> [root at tf1 log]#
>>> I'm guessing it's the stop on script "cluster_apache" returned 1 (generic
>>> error)
>>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the
>>> same size
>>> [root at tf2 ~]# ls -al /etc/init.d/httpd
>>> -rwxr-xr-x 1 root root 3201 Jan 30 2007 /etc/init.d/httpd
>>> [root at tf1 log]# ls -al /etc/init.d/httpd
>>> -rwxr-xr-x 1 root root 3201 Jan 30 2007 /etc/init.d/httpd
>>> and the apache service starts/stops just fine on tf2 when the services get
>>> failed over to that machine.
>>> any ideas on what can be wrong?
>> tf1 is complaining about a bad SSL cert. The fact that it's complaining
>> when being started by clurgmgrd but not when started manually indicates
>> that clurgmgrd is starting it differently (specifying a different
>> httpd.conf file perhaps?).
>
> well, here's the relevant part of my config file
> <rm>
>   <failoverdomains>
>     <failoverdomain name="httpd" ordered="1" restricted="1">
>       <failoverdomainnode name="tf1" priority="1"/>
>       <failoverdomainnode name="tf2" priority="2"/>
>     </failoverdomain>
>   </failoverdomains>
>   <resources>
>     <script file="/etc/init.d/httpd" name="cluster_apache"/>
>     <ip address="192.168.1.7" monitor_link="1"/>
>     <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
>   </resources>
>   <service autostart="1" domain="httpd" name="Apache Service">
>     <ip ref="192.168.1.7"/>
>     <script ref="cluster_apache"/>
>   </service>
>   <service autostart="1" domain="httpd" name="postfix service">
>     <ip ref="192.168.1.7"/>
>     <script ref="cluster_posstfix"/>
>   </service>
> </rm>
>
> I've never seen that ssl error when starting the service manually.
>
>
> the other thing that I noticed.. is that when I try to do
>
> [root at tf1 cluster]# clusvcadm -d "Apache Service"
> Member tf1 disabling Apache Service...
>
> it just hangs there and never returns.
Sorry about the delay in responding. Was out of town for the weekend.
Does clusvcadm or clurgmgrd run as a different user...one that either
can't read the SSL certs or the directory containing them? Normally
the stuff in /etc/init.d runs as root. Running one of those scripts as
a different user can lead to lots of permissions issues. It's bitten
me before.
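
One way to test the permissions theory is to check the certificate file
directly, as the same user the cluster daemon runs as. The following is a
minimal sketch, not something taken from httpd or rgmanager: check_cert is
a hypothetical helper name, and the path is the one from the log lines
quoted above. It distinguishes the three cases that could plausibly produce
"does not exist or is empty": the file is missing, the file is zero
length, or the file cannot be read by the current user.

```shell
#!/bin/sh
# Hypothetical helper (not part of httpd or rgmanager): report why a
# certificate file might be rejected as "does not exist or is empty".
check_cert() {
    cert="$1"
    if [ ! -e "$cert" ]; then
        # No such file (or a parent directory cannot be searched).
        echo "missing: $cert"
        return 1
    elif [ ! -s "$cert" ]; then
        # File exists but has zero length.
        echo "empty: $cert"
        return 2
    elif [ ! -r "$cert" ]; then
        # File exists and has content, but this user cannot read it.
        echo "unreadable by $(id -un): $cert"
        return 3
    fi
    echo "ok: $cert"
    return 0
}

# Path taken from the log lines quoted above; adjust if ssl.conf differs.
check_cert /etc/httpd/conf/ssl.crt/server.crt || echo "check failed with status $?"
```

Running this as root and again as whatever user owns the clurgmgrd process
should show whether the two see the file differently. If both report "ok",
"httpd -t" will re-run the configuration syntax check that produced the
original error.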
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer ricks at nerd.com -
- AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 -
- -
- Millihelen, adj: The amount of beauty required to launch one ship. -
----------------------------------------------------------------------