[Linux-cluster] Node has joined cluster but services cannot be started on it, why?

Ralph.Grothe at itdz-berlin.de Ralph.Grothe at itdz-berlin.de
Wed Aug 31 08:25:00 UTC 2011


Hello everyone,

I am seeing strange behaviour on one of our RHCS clusters.

During a scheduled downtime I ran a few cluster tests, in the
course of which I also fenced one node (by issuing "fence_node
barosic" from the other node of this two-node cluster). Since
then that node has been causing me some pain: it refuses to start
any service, even when explicitly told to via the "-m" option of
e.g. the clusvcadm command.

It appears to me as if communication with the clurgmgrd on this
node is disrupted, although the daemon is running.

This can also be seen from the incomplete output of clustat when
compared to that of the fully integrated cluster node (i.e.
arubaic in this case).

At the moment I cannot issue a service relocation to show the
resulting output, because that requires a scheduled downtime.
For now I can only run commands that don't affect the running
services.

Here's clustat's output on the "working node":
(in agreement with the customer I froze all services to prevent
any unwanted mangling by clurgmgrd, because we aren't HA in the
current situation anyway)


[root at aruba:~]
# clustat
Cluster Status for rhcs-voebb @ Wed Aug 31 09:43:10 2011
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 arubaic                               1 Online, Local, RG-Master
 barosic                               2 Online

 Service Name             Owner (Last)          State
 ------- ----             ----- ------          -----
 service:alma             arubaic               started    [Z]
 service:lola             arubaic               started    [Z]
 service:vb_bz_zlb        arubaic               started    [Z]



Whereas when I issue the same command on the reluctant node, I get this:


[root at baros:~]
# clustat
Cluster Status for rhcs-voebb @ Wed Aug 31 09:44:46 2011
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 arubaic                               1 Online
 barosic                               2 Online, Local
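
The visible difference between the two listings is that the second one has no service table at all. A minimal sketch of how that could be checked mechanically, assuming the "Service Name" column heading shown above is a reliable marker (the function name `has_service_section` is hypothetical, not part of RHCS):

```shell
#!/bin/sh
# Sketch only: reads a clustat dump on stdin and reports whether it
# contains a service table. Assumes the table header starts with
# "Service Name", as in the listing from the working node.
has_service_section() {
    if grep -q '^[[:space:]]*Service Name'; then
        echo "services visible"
    else
        echo "no service section"
    fi
}
```

On a live node this would be fed from the real command, e.g. `clustat | has_service_section`.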


I monitor our RHCS clusters through Nagios and have defined a
check_multi command for this purpose that checks what I deem the
vital functions of the RHCS cluster stack.
Its OK output also shows that all the required daemons are
running on barosic.
Here's the output of this check run on barosic:


[nagios at baros:~]
$ /usr/lib64/nagios/plugins/contrib/check_multi/libexec/check_multi \
    -l /usr/lib64/nagios/plugins \
    -f /etc/nagios/check_multi/rhcs_status.cmd
OK - 20 plugins checked, 20 ok
[ 1] proc_ccsd PROCS OK: 1 process with command name 'ccsd'
[ 2] proc_clurgmgrd PROCS OK: 2 processes with command name 'clurgmgrd'
[ 3] proc_fenced PROCS OK: 1 process with command name 'fenced'
[ 4] proc_groupd PROCS OK: 1 process with command name 'groupd'
[ 5] proc_clvmd PROCS OK: 1 process with command name 'clvmd'
[ 6] proc_gfs_controld PROCS OK: 1 process with command name 'clvmd'
[ 7] proc_dlm_controld PROCS OK: 1 process with command name 'clvmd'
[ 8] ic_node_ip 192.168.5.58 
[ 9] ic_bond_dev bond1
[10] ic_mii_status up
[11] ic_slave1 eth1
[12] ic_slave2 eth4
[13] slave1_props  8000Mb/s
  Full
  yes
[14] slave2_props  8000Mb/s
  Full
  yes
[15] slave1_link  yes
[16] slave2_link  yes
[17] slave1_speed 8000
[18] slave2_speed 8000
[19] slave1_mode  full
[20] slave2_mode  full|check_multi::check_multi::plugins=20 time=0.257608
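
Note that checks [6] and [7] above report the command name 'clvmd', so they may not actually be looking at gfs_controld and dlm_controld. A minimal sketch of an independent cross-check, assuming the daemon names from the check labels are the ones that matter (the function name `missing_daemons` is hypothetical):

```shell
#!/bin/sh
# Sketch only: reads one command name per line on stdin and prints the
# required daemons that do not appear in that list.
# The daemon names are taken from the check labels in the output above.
REQUIRED="ccsd clurgmgrd fenced groupd clvmd gfs_controld dlm_controld"

missing_daemons() {
    running=$(cat)
    missing=""
    for d in $REQUIRED; do
        # -x matches the whole line, so "clvmd" does not satisfy "dlm_controld"
        if ! printf '%s\n' "$running" | grep -qx "$d"; then
            missing="$missing $d"
        fi
    done
    echo "${missing# }"
}
```

On a live node the list could come from `ps -e -o comm= | missing_daemons`; empty output would mean every listed daemon was found.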



cman_tool also reports that all is OK with barosic (if I
interpret its output correctly).
Yet I'm not able to relocate any of the three services to
barosic.

What could be going wrong or missing, and where else should I look?


Regards
Ralph



[root at baros:~]
# cman_tool status
Version: 6.2.0
Config Version: 64
Cluster Name: rhcs-voebb
Cluster Id: 44402
Cluster Member: Yes
Cluster Generation: 516
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1  
Active subsystems: 9
Flags: 2node Dirty 
Ports Bound: 0 11  
Node name: barosic
Node ID: 2
Multicast addresses: 239.192.173.32 
Node addresses: 192.168.5.58 
[root at baros:~]
# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    516   2011-08-28 19:27:38  arubaic
   2   M    512   2011-08-28 19:27:38  barosic
[root at baros:~]
# cman_tool services
type             level name       id       state       
fence            0     default    00010001 none        
[1 2]
dlm              1     clvmd      00020001 none        
[1 2]
dlm              1     rgmanager  00010002 none        
[1 2]
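
The "cman_tool services" listing above puts each group's member list on the line after the group itself, which makes it easy to misread. A minimal sketch that pairs them up, assuming the layout is exactly as shown (the function name `group_members` is hypothetical):

```shell
#!/bin/sh
# Sketch only: reads "cman_tool services" output on stdin and prints one
# "type/name: members" line per group. Assumes a single header line and
# a bracketed member list on the line following each group.
group_members() {
    awk '
        NR == 1 { next }                  # skip the column header
        /^\[/   {                         # member list for the previous group
            gsub(/\[|\]/, "")
            print name ": " $0
            next
        }
        NF >= 3 { name = $1 "/" $3 }      # remember group type and name
    '
}
```

On a live node: `cman_tool services | group_members`. For the listing above, this shows nodes 1 and 2 as members of all three groups, including dlm/rgmanager.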



