[Linux-cluster] Split Brain

Luis Godoy Gonzalez lgodoy at atichile.com
Thu Jan 31 18:13:17 UTC 2008


Hi

I have a problem with my current cluster.
We have a two-node cluster (HP DL385 G2 servers, no external storage) running
Red Hat Enterprise Linux 4 Update 5 and Cluster Suite 4 Update 5.

When the nodes lose communication, we end up with two cluster instances, each
with the service up :( ... too bad.
I don't understand why neither node tries to fence the other before
re-forming the cluster.
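
For reference, our cluster.conf follows the stock two-node layout. A minimal
sketch of that kind of configuration looks like the following (the cluster
name, node names, fence device attributes, iLO addresses, and credentials
below are placeholders, not our real values):

=====================================================================
<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <!-- two_node="1" lets a single node keep quorum, so fencing is the
       only protection against split brain -->
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="3" post_fail_delay="0"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="ilo-node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="ilo-node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- placeholder iLO addresses and credentials -->
    <fencedevice name="ilo-node1" agent="fence_ilo" hostname="10.0.0.1"
                 login="Administrator" passwd="secret"/>
    <fencedevice name="ilo-node2" agent="fence_ilo" hostname="10.0.0.2"
                 login="Administrator" passwd="secret"/>
  </fencedevices>
  <rm>
    <!-- rgmanager service definition for myservice omitted -->
  </rm>
</cluster>
=====================================================================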

These are the logs:

Node 1
==================================================
Jan 20 22:17:42 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:17:48 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:17:48 node1 su(pam_unix)[11307]: session opened for user app_usr by (uid=0)
Jan 20 22:17:48 node1 su(pam_unix)[11307]: session closed for user app_usr
Jan 20 22:18:18 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:18:18 node1 su(pam_unix)[11533]: session opened for user app_usr by (uid=0)
Jan 20 22:18:18 node1 su(pam_unix)[11533]: session closed for user app_usr
Jan 20 22:18:33 node1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Down
Jan 20 22:18:33 node1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 20 22:18:33 node1 kernel: bonding: bond0: making interface eth0 the new active one.
Jan 20 22:18:37 node1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Jan 20 22:18:37 node1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 20 22:18:43 node1 kernel: bnx2: eth0 NIC Link is Down
Jan 20 22:18:43 node1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 20 22:18:43 node1 kernel: bonding: bond0: making interface eth2 the new active one.
Jan 20 22:18:46 node1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jan 20 22:18:46 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:19:03 node1 kernel: CMAN: removing node node2 from the cluster : Missed too many heartbeats
Jan 20 22:19:05 node1 clurgmgrd[4081]: <info> Magma Event: Membership Change
Jan 20 22:19:05 node1 clurgmgrd[4081]: <info> State change: node2 DOWN
Jan 20 22:19:06 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:19:06 node1 su(pam_unix)[11780]: session opened for user app_usr by (uid=0)
Jan 20 22:19:06 node1 su(pam_unix)[11780]: session closed for user app_usr
Jan 20 22:19:22 node1 kernel: bnx2: eth0 NIC Link is Down
Jan 20 22:19:22 node1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 20 22:19:25 node1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jan 20 22:19:25 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:19:40 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:19:40 node1 su(pam_unix)[12037]: session opened for user app_usr by (uid=0)
Jan 20 22:19:40 node1 su(pam_unix)[12037]: session closed for user app_usr
Jan 20 22:20:10 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:20:10 node1 su(pam_unix)[12236]: session opened for user app_usr by (uid=0)
Jan 20 22:20:10 node1 su(pam_unix)[12236]: session closed for user app_usr
Jan 20 22:20:40 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:20:40 node1 su(pam_unix)[12461]: session opened for user app_usr by (uid=0)
Jan 20 22:20:40 node1 su(pam_unix)[12461]: session closed for user app_usr
=====================================================================

Node 2
=====================================================================
Jan 20 22:10:22 node2 sshd(pam_unix)[22703]: session opened for user app_usr by (uid=0)
Jan 20 22:10:22 node2 sshd(pam_unix)[22703]: session closed for user app_usr
Jan 20 22:10:24 node2 sshd(pam_unix)[22741]: session opened for user app_usr by (uid=0)
Jan 20 22:10:24 node2 sshd(pam_unix)[22741]: session closed for user app_usr
Jan 20 22:20:07 node2 sshd(pam_unix)[23541]: session opened for user app_usr by (uid=0)
Jan 20 22:20:07 node2 sshd(pam_unix)[23541]: session closed for user app_usr
Jan 20 22:20:09 node2 sshd(pam_unix)[23578]: session opened for user app_usr by (uid=0)
Jan 20 22:20:09 node2 sshd(pam_unix)[23578]: session closed for user app_usr
Jan 20 22:21:38 node2 kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats
Jan 20 22:21:40 node2 clurgmgrd[4177]: <info> Magma Event: Membership Change
Jan 20 22:21:40 node2 clurgmgrd[4177]: <info> State change: node1 DOWN
Jan 20 22:21:41 node2 clurgmgrd[4177]: <notice> Taking over service myservice from down member (null)
Jan 20 22:21:41 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.65.1 to bond0
Jan 20 22:21:42 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.65.10 to bond0
Jan 20 22:21:43 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh start
Jan 20 22:21:43 node2 su(pam_unix)[23855]: session opened for user app_usr by (uid=0)
Jan 20 22:21:43 node2 su(pam_unix)[23855]: session closed for user app_usr
Jan 20 22:21:43 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.70.20 to bond1
Jan 20 22:21:44 node2 clurgmgrd[4177]: <notice> Service myservice started
Jan 20 22:21:50 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:21:50 node2 su(pam_unix)[24022]: session opened for user app_usr by (uid=0)
Jan 20 22:21:50 node2 su(pam_unix)[24022]: session closed for user app_usr
Jan 20 22:22:20 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:22:20 node2 su(pam_unix)[24244]: session opened for user app_usr by (uid=0)
Jan 20 22:22:20 node2 su(pam_unix)[24244]: session closed for user app_usr
Jan 20 22:22:50 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:22:50 node2 su(pam_unix)[24469]: session opened for user app_usr by (uid=0)
=================================================================

I have configured the fence devices, and the power-off itself works fine ...
when I power the machines up, the first one to finish starting fences the
other, and startup then continues OK.
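
For what it is worth, the fencing can also be exercised by hand, outside the
cluster, to confirm that each node can reach the other's iLO; something like
the following (the iLO address and credentials are placeholders):

=====================================================================
# query the power state through iLO directly
fence_ilo -a 10.0.0.2 -l Administrator -p secret -o status

# or ask the cluster's fence system to fence a member by name
fence_node node2
=====================================================================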


Any help will be appreciated.
Sorry for my bad English.
Luis G.




