[Linux-cluster] 10 Node Installation - Losing Heartbeat

Richard Mayhew rmayhew at mweb.com
Thu Feb 3 12:53:08 UTC 2005


Hi All,
 
I have been able to successfully set up a 10-node GFS cluster: 4 locking
servers and 6 clients. Each server is a Dell PowerEdge 1750 with 1GB
RAM and dual P4 2.8 HT CPUs, running Red Hat Enterprise Server V3 Update
4. The first Ethernet interface is used for normal network traffic and
the second Ethernet interface is used for GFS heartbeats only. Both
interfaces run at 1Gb full duplex on two separate switches. I have
installed GFS-6.0.2-25 from
http://bender.it.swin.edu.au/centos-3/RHGFS/i686/ on the latest Red Hat
ES kernel.
 
The storage (4 x 50GB) is made available from an EMC CX600 SAN over a
McDATA fibre switch. The pool service uses the storage from the SAN
through the EMC PowerPath software, via the emcpower pseudo devices.
Pool is able to assemble the pools with no problem, and the ccsd service
is able to retrieve the GFS archive with no problems. The lock_gulm
server loads with no problem and communicates with the master lock
server without any errors or missed beats. GFS mounts without any errors
etc. I am able to access the storage on each server with no throughput
problems.
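
For reference, this is roughly how each layer can be sanity-checked
from a shell after boot; the exact flags may vary by release, so treat
this as a sketch and check each tool's -h output:

    # assemble all pools found on the emcpower pseudo devices
    pool_assemble -a

    # confirm ccsd is running and serving the archive
    service ccsd status

    # ask the master lock server which nodes it can see
    gulm_tool nodelist store-01.mc.mweb.net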

 
The problem I am experiencing is as follows.
 
Once the GFS system has been running for a few hours with some usage on
each of the servers, some of the servers start missing beats. I
increased the heartbeat interval to one beat every 60 seconds, failing
after 10 missed beats; this just delayed the servers being fenced. The
only explanation I can come up with is that the locking server is buggy
and stops responding to heartbeats. When the master server detects that
a server has skipped the required number of beats, it tries to fence it
and fails. I have set up the fencing to use the mcdata module and I
have specified the correct login details. When the fenced server has had
its lock server restarted, it tries to log back in to the master lock
server. This fails for obvious reasons, as the master refuses to allow
it to reconnect due to the previous fencing failures. Manual fencing
works without any problems, but I have only tried this on the command
line.
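
The manual test I mean is along these lines (the -a/-l/-p/-n/-o flags
are assumed from the usual GFS 6.0 fence agent conventions; fence_mcdata
-h will show the exact set on a given build):

    # disable store-01's port on the first switch, then bring it back --
    # the same operation the master should perform automatically
    fence_mcdata -a xxx.xxx.xxx.xxx -l XXXXXXXX -p xxxxxx -n 3 -o disable
    fence_mcdata -a xxx.xxx.xxx.xxx -l XXXXXXXX -p xxxxxx -n 3 -o enable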
 
Does anyone have an idea as to why the locking servers are hanging when
it comes to sending heartbeats, and possibly why the fencing isn't
working?
 
Here are my configs with some of the privileged information changed.
 
fence.ccs
fence_devices {
        EMC_Switch_01 {
                agent = "fence_mcdata"
                ipaddr = "xxx.xxx.xxx.xxx"
                login = "XXXXXXXX"
                password = "xxxxxx"
        }

        EMC_Switch_02 {
                agent = "fence_mcdata"
                ipaddr = "xxx.xxx.xxx.xxx"
                login = "XXXXXXXXXX"
                password = "xxxxxx"
        }
}


cluster.ccs
cluster {
        name = "mail"
        lock_gulm {
                servers = ["store-01.mc.mweb.net", "store-02.mc.mweb.net",
                           "store-03.mc.mweb.net", "store-04.mc.mweb.net"]
                heartbeat_rate = 60
                allowed_misses = 10
        }
}
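
(As I understand the gulm semantics, with these values a node is only
declared expired after roughly heartbeat_rate x allowed_misses =
60 x 10 = 600 seconds, i.e. about ten minutes of missed beats before
the master attempts the fence.)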

nodes.ccs
nodes {
    store-01.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 3
                }
            }
        }
    }

    store-02.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 27
                }
            }
        }
    }

    store-03.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 9
                }
            }
        }
    }

    store-04.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 31
                }
            }
        }
    }

    serv-01.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 19
                }
            }
        }
    }

    serv-02.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 27
                }
            }
        }
    }

    serv-03.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_01 {
                    port = 31
                }
            }
        }
    }

    serv-04.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_02 {
                    port = 3
                }
            }
        }
    }
    serv-05.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_02 {
                    port = 9
                }
            }
        }
    }
    serv-06.mc.mweb.net {
        ip_interfaces {
            eth1 = "xxx.xxx.xxx.xxx"
        }
        fence {
            san {
                EMC_Switch_02 {
                    port = 19
                }
            }
        }
    }
}
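
(For completeness: after editing any of these files, the CCS archive has
to be rebuilt and ccsd restarted on each node. A minimal sketch, with a
placeholder source directory and CCA pool device name:

    # rebuild the CCS archive from the directory holding the .ccs files
    ccs_tool create /root/cluster-config /dev/pool/cluster_cca
)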
--

Regards

Richard Mayhew
Unix Specialist

MWEB Business
Tel:  + 27 11 340 7200
Fax:  + 27 11 340 7288
Website: www.mwebbusiness.co.za