[Linux-cluster] failover questions after upgrade

Wed Nov 15 17:02:23 UTC 2006

On Tue, 2006-11-14 at 20:06 -0500, jason at monsterjam.org wrote:

> 
> and when I reboot both servers of the 2-node cluster, they come up fine..
> [jason at tf2 ~]$ clustat
> Member Status: Quorate, Group Member
> 
>   Member Name                              State      ID
>   ------ ----                              -----      --
>   tf1                                      Online     0x0000000000000001
>   tf2                                      Online     0x0000000000000002
> 
>   Service Name         Owner (Last)                   State         
>   ------- ----         ----- ------                   -----         
>   Apache Service       tf1                            started   
> [jason at tf2 ~]$
> 
> when I reboot (shutdown -r now) tf1,
> tf2 never takes over 
> 
> [jason at tf2 ~]$ clustat
> Member Status: Quorate, Group Member
> 
>   Member Name                              State      ID
>   ------ ----                              -----      --
>   tf2                                      Online     0x0000000000000002
> 
>   Service Name         Owner (Last)                   State         
>   ------- ----         ----- ------                   -----         
>   Apache Service       ((null)                      ) failed    
> [jason at tf2 ~]$
> 
> here are the logs from tf2:
> 
> Nov 14 19:48:21 tf2 clurgmgrd[5345]: <info> Logged in SG "usrm::manager" 
> Nov 14 19:48:21 tf2 clurgmgrd[5345]: <info> Magma Event: Membership Change 
> Nov 14 19:48:21 tf2 clurgmgrd[5345]: <info> State change: Local UP 
> Nov 14 19:48:22 tf2 clurgmgrd[5345]: <info> State change: tf1 UP 
> Nov 14 19:48:25 tf2 snmpd[5195]: Got trap from peer on fd 13 
> Nov 14 19:48:44 tf2 kernel: process `omaws32' is using obsolete setsockopt SO_BSDCOMPAT
> Nov 14 19:48:58 tf2 Server Administrator: Storage Service EventID: 2164  See readme.txt for a list of validated controller driver versions.
> Nov 14 19:49:00 tf2 snmpd[5195]: Got trap from peer on fd 13 
> Nov 14 19:50:31 tf2 sshd(pam_unix)[6920]: session opened for user jason by (uid=0)
> Nov 14 19:51:03 tf2 sshd(pam_unix)[6951]: session opened for user jason by (uid=0)
> 
> Nov 14 19:51:39 tf2 clurgmgrd[5345]: <info> Magma Event: Membership Change 
> Nov 14 19:51:39 tf2 clurgmgrd[5345]: <info> State change: tf1 DOWN 
> Nov 14 19:52:19 tf2 ntpd[4896]: synchronized to 193.162.159.97, stratum 2
> Nov 14 19:52:19 tf2 ntpd[4896]: kernel time sync disabled 0041
> Nov 14 19:52:28 tf2 kernel: e100: eth2: e100_watchdog: link down
> Nov 14 19:52:34 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats
> Nov 14 19:52:58 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> Nov 14 19:55:14 tf2 kernel: CMAN: node tf1 rejoining
> Nov 14 19:55:45 tf2 clurgmgrd[5345]: <info> Magma Event: Membership Change 
> Nov 14 19:55:45 tf2 clurgmgrd[5345]: <info> State change: tf1 UP 
> 
> 
> then when tf1 comes back up, my apache service doesn't come up correctly..
> 
> [jason at tf2 ~]$ clustat
> Member Status: Quorate, Group Member
> 
>   Member Name                              State      ID
>   ------ ----                              -----      --
>   tf1                                      Online     0x0000000000000001
>   tf2                                      Online     0x0000000000000002
> 
>   Service Name         Owner (Last)                   State         
>   ------- ----         ----- ------                   -----         
>   Apache Service       (tf1                         ) failed    
> [jason at tf2 ~]$ 
> 
> 
> and I see this in the logs on tf1 as it's booting up.
> Nov 14 19:55:44 tf1 rhnsd[5445]: Red Hat Network Services Daemon starting up.
> Nov 14 19:55:44 tf1 rhnsd: rhnsd startup succeeded
> Nov 14 19:55:44 tf1 cups-config-daemon: cups-config-daemon startup succeeded
> Nov 14 19:55:44 tf1 haldaemon: haldaemon startup succeeded
> Nov 14 19:55:44 tf1 clurgmgrd[5488]: <info> Loading Service Data 
> Nov 14 19:55:44 tf1 rgmanager: clurgmgrd startup succeeded
> Nov 14 19:55:44 tf1 fstab-sync[5764]: removed all generated mount points
> Nov 14 19:55:45 tf1 clurgmgrd[5488]: <info> Initializing Services 
> Nov 14 19:55:45 tf1 fstab-sync[6152]: added mount point /media/cdrom for /dev/hda
> Nov 14 19:55:45 tf1 httpd: httpd shutdown failed
> Nov 14 19:55:45 tf1 clurgmgrd[5488]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
> Nov 14 19:55:45 tf1 clurgmgrd[5488]: <info> Services Initialized 
> Nov 14 19:55:45 tf1 clurgmgrd[5488]: <info> Logged in SG "usrm::manager" 
> Nov 14 19:55:45 tf1 clurgmgrd[5488]: <info> Magma Event: Membership Change 
> Nov 14 19:55:45 tf1 clurgmgrd[5488]: <info> State change: Local UP 
> Nov 14 19:55:46 tf1 fstab-sync[6465]: added mount point /media/floppy for /dev/fd0
> Nov 14 19:55:46 tf1 clurgmgrd[5488]: <info> State change: tf2 UP 
> 
> any suggestions?
> 

http://sources.redhat.com/cluster/faq.html#rgm_wontrestart

The init script is probably returning 1 for stop-after-stop (that is,
stop when already stopped), when it should be returning 0.  That matches
the 'stop on script "cluster_apache" returned 1' line in your tf1 log.
This is a bug in the initscripts package, and here's a patch to
/etc/init.d/functions to make httpd work normally:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998
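
If patching /etc/init.d/functions is not an option, the same effect can
probably be had by pointing the cluster's script resource at a small
wrapper instead of the init script itself.  The following is only a
sketch of that idea, not the patch above: the names INIT_SCRIPT,
idempotent_stop and dispatch are made up for illustration, the default
path is an assumption, and it treats any nonzero "status" result as
"already stopped", which may be too generous on some setups.

```shell
#!/bin/sh
# Sketch only: make "stop" idempotent for a SysV-style init script.
# INIT_SCRIPT is an assumed path; point it at the script your service uses.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/httpd}"

# idempotent_stop (hypothetical helper): succeed if the service is already
# stopped, otherwise pass through the exit status of the real "stop".
idempotent_stop() {
    if ! "$INIT_SCRIPT" status >/dev/null 2>&1; then
        # Assumption: a nonzero "status" means nothing is running,
        # so there is nothing to stop and we report success.
        return 0
    fi
    "$INIT_SCRIPT" stop
}

# dispatch (hypothetical helper): route "stop" through the wrapper and
# hand every other action to the init script unchanged.
dispatch() {
    case "$1" in
        stop) idempotent_stop ;;
        *)    "$INIT_SCRIPT" "$@" ;;
    esac
}

# Only act when called with an action, e.g. "wrapper stop".
if [ "$#" -gt 0 ]; then
    dispatch "$@"
fi
```

Either way, it is worth confirming the symptom first: run the init
script's stop action twice by hand and check that the second exit status
is 0.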

-- Lon