[Linux-cluster] Re: Non-Deterministic Cluster Failure (can't communicate with fenced -1)
Volkan YAZICI
yazicivo at ttmail.com
Fri Mar 14 11:16:34 UTC 2008
Oops! Here is the cluster.conf file.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1494 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080314/bb982ffb/attachment.obj>
-------------- next part --------------
On Fri, 14 Mar 2008, Volkan YAZICI <yazicivo at ttmail.com> writes:
> We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
> single DS4700 SAN with IBM 2005-B16 fence devices. The system is
> configured as a high-availability system for database systems. We are
> facing serious non-deterministic problems (they can happen anywhere, at
> any time, without a single clue).
>
> One of the most frequent problems is fence_tool related.
>
> # service cman start
> Starting cluster:
> Loading modules... done
> Mounting configfs... done
> Starting ccsd... done
> Starting cman... done
> Starting daemons... done
> Starting fencing... fence_tool: can't communicate with fenced -1
>
> # fenced -D
> 1204556546 cman_init error 0 111
>
> # clustat
> CMAN is not running.
>
> # cman_tool join
>
> # clustat
> msg_open: Connection refused
> Member Status: Quorate
> Member Name ID Status
> ------ ---- ---- ------
> mobilizc1 1 Online, Local
> mobilizc2 2 Offline
>
>
> # groupd -D
> 1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
> 1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
> 1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
> 1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
>
> Sometimes this problem goes away if the two machines are rebooted at
> the same time. But in the current HA configuration, I cannot guarantee
> that both systems will be rebooted at the same time for every problem
> we face. At least one of them should start without a problem.
>
> Moreover, we were facing problems with the rgmanager. Below are the
> related /var/log/messages lines:
>
> kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
> clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...
>
> We contacted our RH support and they asked us for a clurgmgrd
> backtrace. Unfortunately, we couldn't manage to start the cman
> service, so we couldn't start clurgmgrd either. (Why couldn't we start
> cman? I really don't know. It is the same "fence_tool: can't
> communicate with fenced -1" problem. As I said previously, it sometimes
> works and sometimes doesn't.) Later, they sent us a not-yet-released
> rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow, we managed to
> start cman on both machines and then started the rgmanager service with
> this new rgmanager RPM. We couldn't reproduce the clurgmgrd segfault,
> so this appears to have solved that problem. But we are still getting
> "can't communicate with fenced -1" errors occasionally.
>
> Sorry for the long post, but I wanted to give enough detail to the
> people who will try to help figure out the problem. I also attach my
> cluster.conf file with the post. Any kind of help will be really,
> really appreciated! Thanks so much for your kind interest in reading
> this far.
>
>
> Regards.
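
For anyone comparing their own groupd output with the one quoted above,
here is a rough sketch that pulls the "uncontrolled kernel object" names
out of "groupd -D" debug lines. The pattern is an assumption based only
on the three lines quoted in this post, and the function name
uncontrolled_objects is made up for illustration; other groupd versions
may word these messages differently.

```python
import re

# Pattern derived only from the groupd -D lines quoted in this post
# (an assumption, not a documented log format). An optional "> " quote
# prefix is tolerated so lines pasted from a mail reply also match.
UNCONTROLLED = re.compile(
    r"^\s*(?:>\s*)?\d+ found uncontrolled kernel object (\S+) in (\S+)"
)


def uncontrolled_objects(lines):
    """Return (name, path) pairs for each 'uncontrolled kernel object' line."""
    found = []
    for line in lines:
        match = UNCONTROLLED.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


# The groupd -D output exactly as quoted in the original message.
sample = [
    "1204556993 cman: our nodeid 1 name mobilizc1 quorum 1",
    "1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm",
    "1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm",
    "1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm",
]

for name, path in uncontrolled_objects(sample):
    print(f"{name} is still registered under {path}")
```

On the quoted output this reports rgmanager and clvmd, the two instances
groupd says must be cleared by resetting the node.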