[Linux-cluster] Re: Non-Deterministic Cluster Failure (can't communicate with fenced -1)
Steven Dake
sdake at redhat.com
Fri Mar 14 19:14:46 UTC 2008
ls -lr /var/lib/openais
If there are core files, openais has crashed for some reason.
If this is the issue, contact me off list.
Regards
-steve
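A minimal sketch of that core-file check (the /var/lib/openais path is from the command above; the "core*" naming and the helper function are assumptions, so adjust them to your installation):

```shell
# Sketch of the suggested check: count core files in openais's working
# directory. The "core*" filename pattern is an assumption; openais drops
# a core there when its daemon crashes.
count_cores() {
    dir="$1"
    n=0
    for f in "$dir"/core*; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Example: count_cores /var/lib/openais
```

A non-zero count means the daemon has crashed at least once and the cores are worth sending along for a backtrace.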
On Fri, 2008-03-14 at 13:16 +0200, Volkan YAZICI wrote:
> Oops! Here is the cluster.conf file.
> On Fri, 14 Mar 2008, Volkan YAZICI <yazicivo at ttmail.com> writes:
> > We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
> > single DS4700 SAN with IBM 2005-B16 fence devices. The system is
> > configured for high availability of database services. We are facing
> > serious non-deterministic problems (they can happen anywhere, at any
> > time, without a single clue).
> >
> > One of the most frequently recurring problems is fence_tool related.
> >
> > # service cman start
> > Starting cluster:
> > Loading modules... done
> > Mounting configfs... done
> > Starting ccsd... done
> > Starting cman... done
> > Starting daemons... done
> > Starting fencing... fence_tool: can't communicate with fenced -1
> >
> > # fenced -D
> > 1204556546 cman_init error 0 111
> >
> > # clustat
> > CMAN is not running.
> >
> > # cman_tool join
> >
> > # clustat
> > msg_open: Connection refused
> > Member Status: Quorate
> > Member Name ID Status
> > ------ ---- ---- ------
> > mobilizc1 1 Online, Local
> > mobilizc2 2 Offline
> >
> >
> > # groupd -D
> > 1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
> > 1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
> > 1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
> > 1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
> >
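The "uncontrolled kernel object" messages above mean dlm lockspaces survived in the kernel from a previous cluster incarnation. A minimal sketch for inspecting them (the /sys/kernel/dlm path comes from the groupd output above; the helper function name is made up for illustration):

```shell
# Sketch: list leftover DLM lockspaces (e.g. rgmanager, clvmd) that groupd
# flags as "uncontrolled". Defaults to /sys/kernel/dlm as in the groupd -D
# output; a non-empty listing means the node still holds dlm/gfs state,
# which is why groupd insists the node be reset before it can rejoin.
list_lockspaces() {
    dir="${1:-/sys/kernel/dlm}"
    if [ -d "$dir" ]; then
        ls "$dir"
    fi
}
```

Running this before starting cman would show whether a clean reboot is needed on that node.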
> > Sometimes this problem goes away if the two machines are rebooted at
> > the same time. But in the current HA configuration, I cannot guarantee
> > that both systems will be rebooted at the same time for every problem
> > we face. At least one of them should start without a problem.
> >
> > Moreover, we were also facing problems with rgmanager. Below are the
> > related /var/log/messages lines:
> >
> > kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
> > clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...
> >
> > We contacted our RH support and they asked us for a clurgmgrd
> > backtrace. Unfortunately, we couldn't manage to start the cman
> > service in order to start clurgmgrd. (You are asking why we couldn't
> > start cman? Really dunno. Same "fence_tool: can't communicate with
> > fenced -1" problem. As I said previously, it sometimes works and
> > sometimes doesn't.) Later, they sent us a not-yet-released
> > rgmanager-2.0.36-1.el5.x86_64 RPM to try. Somehow, we managed to
> > start cman on both machines and then started the rgmanager service
> > with this new RPM. (We couldn't reproduce the clurgmgrd segfault.)
> > This solved the clurgmgrd segfault problem, but we still get "can't
> > communicate with fenced -1" errors occasionally.
> >
> > Sorry for the long post, but I'm trying to give as much information as
> > possible to anyone who will try to figure out the problem. I have also
> > attached my cluster.conf file to this post. Any kind of help will be
> > really, really appreciated! Thanks so much for your kind interest in
> > reading this far.
> >
> >
> > Regards.
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster