[Linux-cluster] GFS-6.0.2.20-2 doesn't accept rebooted nodes

Stefano Schiavi Stefano.Schiavi at aem.torino.it
Mon Jun 26 13:29:38 UTC 2006


Hi gurus.

We have a three-node Itanium 64 cluster running GFS in conjunction with OCFS
for an Oracle RAC.

We found many physical problems in our switch and replaced the switches.
Here is the problem:
the first node of the cluster does not log back in to GFS.
Here is the situation, as seen from the master:

[root@sapcl02 spool]# gulm_tool nodelist sapcl02:core
 Name: sapcl03.aem.torino.it
  ip    = 100.2.254.210
  state = Logged in
  mode = Slave
  missed beats = 0
  last beat = 1151328027843676
  delay avg = 10000443
  max delay = 13047588

 Name: sapcl01.aem.torino.it
  ip    = 100.2.254.208
  state = Expired
  mode = Slave
  missed beats = 0
  last beat = 0
  delay avg = 0
  max delay = 0

 Name: sapcl02.aem.torino.it
  ip    = 100.2.254.209
  state = Logged in
  mode = Master
  missed beats = 0
  last beat = 1151328021593557
  delay avg = 10000849
  max delay = 113821588141

As you can see, sapcl01 is in the Expired state.
On sapcl01 the startup of lock_gulmd hangs.
From /var/log/messages on the master I see the following, repeated endlessly:


Jun 26 15:23:32 sapcl02 lock_gulmd_core[22601]: Gonna exec fence_node sapcl01.aem.torino.it
Jun 26 15:23:32 sapcl02 fence_node[22601]: Cannot locate the cluster node, sapcl01.aem.torino.it
Jun 26 15:23:32 sapcl02 fence_node[22601]: All fencing methods FAILED!
Jun 26 15:23:32 sapcl02 fence_node[22601]: Fence of "sapcl01.aem.torino.it" was unsuccessful.
Jun 26 15:23:32 sapcl02 lock_gulmd_core[7499]: Fence failed. [22601] Exit code:1 Running it again.
Jun 26 15:23:32 sapcl02 lock_gulmd_core[7499]: Forked [22604] fence_node sapcl01.aem.torino.it with a 5 pause.

Even if I power down the sapcl01 node, the master keeps trying to fence the
slave node over and over.
I have also tried, on both the master and the slave, to fence manually in
order to clear the expiration, but with no result (see the sketch just below
for what I mean by that).
It seems that the only way to realign the cluster is to GLOBALLY power down
all the nodes and restart.
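
To be clear, by "fence manually" I mean running fence_node and the fence
agent by hand from the master, roughly like this. The stdin key names for
fence_wti are quoted from memory and may not be exact for this version; the
values are simply the ones from our fence.ccs/nodes.ccs:

[root@sapcl02 ~]# fence_node sapcl01.aem.torino.it
[root@sapcl02 ~]# fence_wti <<EOF
agent=fence_wti
ipaddr=100.2.254.254
login=nps
passwd=password
port=1
EOF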

Here are the configuration files:
########### fence.ccs ########################################
fence_devices {
    nps {
        agent = "fence_wti"
        ipaddr = "100.2.254.254"
        login = "nps"
        passwd = "password"
    }
}
[root@sapcl01 gfs]# more nodes.ccs
#### nodes.ccs #######################################
nodes {
    sapcl01 {
        ip_interfaces {
            eth1 = "192.168.2.208"
        }
        fence {
            power {
                nps {
                    port = 1
                }
            }
        }
    }
    sapcl02 {
        ip_interfaces {
            eth1 = "192.168.2.209"
        }
        fence {
            power {
                nps {
                    port = 2
                }
            }
        }
    }
    sapcl03 {
        ip_interfaces {
            eth1 = "192.168.2.210"
        }
        fence {
            power {
                nps {
                    port = 3
                }
            }
        }
    }
}
[root@sapcl01 gfs]# more cluster.ccs
#### cluster.ccs #####################################
cluster {
    name = "gfsrac"
    lock_gulm {
        servers = [ "sapcl01","sapcl02","sapcl03" ]
    }
}
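
In case it is useful: after editing these files we rebuild the CCS archive
onto the CCA pool device with something like the command below. The directory
and device names are just our local placeholders, and I am quoting the
ccs_tool create syntax from memory, so check it against the GFS 6.0
documentation:

[root@sapcl01 gfs]# ccs_tool create /root/gfs /dev/pool/gfsrac_cca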


PS: the cluster had been fully operational for the past 7 months; the switch
replacement is what triggered the problem.

Best regards
Stefano



