[Linux-cluster] Non-Deterministic Cluster Failure (can't communicate with fenced -1)

Fri Mar 14 10:58:32 UTC 2008

Hi,

We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
single DS4700 SAN, with IBM 2005-B16 fence devices. The system is
configured as a high-availability cluster for database systems. We are
facing serious non-deterministic problems: they can happen anywhere, at
any time, without a single clue.

One of the most frequently recurring problems is fence_tool related.

  # service cman start
  Starting cluster:
     Loading modules... done
     Mounting configfs... done
     Starting ccsd... done
     Starting cman... done
     Starting daemons... done
     Starting fencing... fence_tool: can't communicate with fenced -1

  # fenced -D
  1204556546 cman_init error 0 111

  # clustat
  CMAN is not running.

  # cman_tool join

  # clustat
  msg_open: Connection refused
  Member Status: Quorate
    Member Name                        ID   Status
    ------ ----                        ---- ------
    mobilizc1                             1 Online, Local
    mobilizc2                             2 Offline

  # groupd -D
  1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
  1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
  1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
  1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
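
For anyone trying to reproduce this, here is a minimal sketch of how the
leftover lockspaces could be listed before starting cman. It assumes
they appear as entries under /sys/kernel/dlm, as the groupd output above
suggests. The function name is made up for illustration and is not part
of the cluster tools.

```shell
# Hypothetical helper: print the names of DLM lockspaces still present
# in the kernel. Assumption: each lockspace is an entry under
# /sys/kernel/dlm (the path is taken from the groupd output above).
list_uncontrolled_dlm() {
    dir="${1:-/sys/kernel/dlm}"
    # Nothing to report if the directory does not exist
    # (for example, when the dlm module is not loaded).
    [ -d "$dir" ] || return 0
    for entry in "$dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        [ -e "$entry" ] || continue
        basename "$entry"
    done
}
```

Running list_uncontrolled_dlm with no arguments on the failing node
would be expected to list rgmanager and clvmd, matching what groupd
reported. An empty result would suggest there is nothing left over.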

Sometimes this problem goes away if the two machines are rebooted at
the same time. But in the current HA configuration, I cannot guarantee
that both systems will be rebooted at the same time for every problem
we face. At least one of them should be able to start without a problem.

Moreover, we were facing problems with rgmanager. Below are the
related /var/log/messages lines:

  kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
  clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...

We contacted our RH support and they asked us for a clurgmgrd
backtrace. Unfortunately, we couldn't get the cman service started,
which we need in order to start clurgmgrd. (Why couldn't we start cman?
I really don't know. It was the same "fence_tool: can't communicate
with fenced -1" problem. As I said previously, it sometimes works and
sometimes doesn't.) Later, they sent us a not-yet-released
rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow, we managed to start
cman on both machines and then started the rgmanager service with this
new RPM. We could not reproduce the clurgmgrd segfault with it, so it
appears to have solved that problem. But we are still occasionally
getting "can't communicate with fenced -1" errors.

Sorry for the long post, but I wanted to give as much detail as
possible to anyone who tries to help figure out the problem. I have
also attached my cluster.conf file to this post. Any kind of help will
be really, really appreciated! Thanks so much for your kind interest in
reading this far.

Regards.