[Linux-cluster] Split Brain

Luis Godoy Gonzalez lgodoy at atichile.com
Thu Jan 31 18:13:17 UTC 2008


Hi

I have a problem with my current cluster.
We have a two-node cluster (HP DL385 G2 servers, no external storage) running
Red Hat Enterprise Linux 4 Update 5 and Cluster Suite 4 Update 5.

When the nodes lose communication, we end up with two cluster instances, each
with the service up :( ... too bad.
I don't understand why neither node tries to fence the other before
re-forming the cluster.
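
For reference, our cluster.conf follows the stock two-node layout. A minimal
sketch of that kind of configuration looks like the following (the cluster
name, node names, fence device attributes, iLO addresses, and credentials
below are placeholders, not our real values):

=====================================================================
<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <!-- two_node="1" lets a single node keep quorum, so fencing is the
       only protection against split brain -->
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="3" post_fail_delay="0"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="ilo-node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="ilo-node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- placeholder iLO addresses and credentials -->
    <fencedevice name="ilo-node1" agent="fence_ilo" hostname="10.0.0.1"
                 login="Administrator" passwd="secret"/>
    <fencedevice name="ilo-node2" agent="fence_ilo" hostname="10.0.0.2"
                 login="Administrator" passwd="secret"/>
  </fencedevices>
  <rm>
    <!-- rgmanager service definition for myservice omitted -->
  </rm>
</cluster>
=====================================================================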

These are the logs:

Node 1
==================================================
Jan 20 22:17:42 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:17:48 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:17:48 node1 su(pam_unix)[11307]: session opened for user app_usr by (uid=0)
Jan 20 22:17:48 node1 su(pam_unix)[11307]: session closed for user app_usr
Jan 20 22:18:18 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:18:18 node1 su(pam_unix)[11533]: session opened for user app_usr by (uid=0)
Jan 20 22:18:18 node1 su(pam_unix)[11533]: session closed for user app_usr
Jan 20 22:18:33 node1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Down
Jan 20 22:18:33 node1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 20 22:18:33 node1 kernel: bonding: bond0: making interface eth0 the new active one.
Jan 20 22:18:37 node1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Jan 20 22:18:37 node1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 20 22:18:43 node1 kernel: bnx2: eth0 NIC Link is Down
Jan 20 22:18:43 node1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 20 22:18:43 node1 kernel: bonding: bond0: making interface eth2 the new active one.
Jan 20 22:18:46 node1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jan 20 22:18:46 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:19:03 node1 kernel: CMAN: removing node node2 from the cluster : Missed too many heartbeats
Jan 20 22:19:05 node1 clurgmgrd[4081]: <info> Magma Event: Membership Change
Jan 20 22:19:05 node1 clurgmgrd[4081]: <info> State change: node2 DOWN
Jan 20 22:19:06 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:19:06 node1 su(pam_unix)[11780]: session opened for user app_usr by (uid=0)
Jan 20 22:19:06 node1 su(pam_unix)[11780]: session closed for user app_usr
Jan 20 22:19:22 node1 kernel: bnx2: eth0 NIC Link is Down
Jan 20 22:19:22 node1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 20 22:19:25 node1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jan 20 22:19:25 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:19:40 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:19:40 node1 su(pam_unix)[12037]: session opened for user app_usr by (uid=0)
Jan 20 22:19:40 node1 su(pam_unix)[12037]: session closed for user app_usr
Jan 20 22:20:10 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:20:10 node1 su(pam_unix)[12236]: session opened for user app_usr by (uid=0)
Jan 20 22:20:10 node1 su(pam_unix)[12236]: session closed for user app_usr
Jan 20 22:20:40 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:20:40 node1 su(pam_unix)[12461]: session opened for user app_usr by (uid=0)
Jan 20 22:20:40 node1 su(pam_unix)[12461]: session closed for user app_usr
=====================================================================

Node 2
=====================================================================
Jan 20 22:10:22 node2 sshd(pam_unix)[22703]: session opened for user app_usr by (uid=0)
Jan 20 22:10:22 node2 sshd(pam_unix)[22703]: session closed for user app_usr
Jan 20 22:10:24 node2 sshd(pam_unix)[22741]: session opened for user app_usr by (uid=0)
Jan 20 22:10:24 node2 sshd(pam_unix)[22741]: session closed for user app_usr
Jan 20 22:20:07 node2 sshd(pam_unix)[23541]: session opened for user app_usr by (uid=0)
Jan 20 22:20:07 node2 sshd(pam_unix)[23541]: session closed for user app_usr
Jan 20 22:20:09 node2 sshd(pam_unix)[23578]: session opened for user app_usr by (uid=0)
Jan 20 22:20:09 node2 sshd(pam_unix)[23578]: session closed for user app_usr
Jan 20 22:21:38 node2 kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats
Jan 20 22:21:40 node2 clurgmgrd[4177]: <info> Magma Event: Membership Change
Jan 20 22:21:40 node2 clurgmgrd[4177]: <info> State change: node1 DOWN
Jan 20 22:21:41 node2 clurgmgrd[4177]: <notice> Taking over service myservice from down member (null)
Jan 20 22:21:41 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.65.1 to bond0
Jan 20 22:21:42 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.65.10 to bond0
Jan 20 22:21:43 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh start
Jan 20 22:21:43 node2 su(pam_unix)[23855]: session opened for user app_usr by (uid=0)
Jan 20 22:21:43 node2 su(pam_unix)[23855]: session closed for user app_usr
Jan 20 22:21:43 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.70.20 to bond1
Jan 20 22:21:44 node2 clurgmgrd[4177]: <notice> Service myservice started
Jan 20 22:21:50 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:21:50 node2 su(pam_unix)[24022]: session opened for user app_usr by (uid=0)
Jan 20 22:21:50 node2 su(pam_unix)[24022]: session closed for user app_usr
Jan 20 22:22:20 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:22:20 node2 su(pam_unix)[24244]: session opened for user app_usr by (uid=0)
Jan 20 22:22:20 node2 su(pam_unix)[24244]: session closed for user app_usr
Jan 20 22:22:50 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:22:50 node2 su(pam_unix)[24469]: session opened for user app_usr by (uid=0)
=================================================================

I have configured the fence devices, and the power-off itself works fine ...
when I power the machines up, the first one to finish starting fences the
other, and startup then continues OK.
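
For what it is worth, the fencing can also be exercised by hand, outside the
cluster, to confirm that each node can reach the other's iLO; something like
the following (the iLO address and credentials are placeholders):

=====================================================================
# query the power state through iLO directly
fence_ilo -a 10.0.0.2 -l Administrator -p secret -o status

# or ask the cluster's fence system to fence a member by name
fence_node node2
=====================================================================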


Any help will be appreciated.
Sorry for my bad English.
Luis G.




