[Linux-cluster] Re: Non-Deterministic Cluster Failure (can't communicate with fenced -1)

Fri Mar 14 19:14:46 UTC 2008

ls -lr /var/lib/openais

If there are core files, openais has crashed for some reason.

If this is the issue, contact me off list.
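
The check above can also be scripted, for example to run it on both nodes.
What follows is only a minimal sketch in Python, not an openais tool. The
helper name find_core_files is hypothetical, and it assumes core dumps use
the common Linux names "core" or "core.<pid>"; if kernel.core_pattern has
been changed on your boxes, adjust the pattern accordingly.

```python
import os
import re

# Assumption: core dumps are named "core" or "core.<pid>" (the usual Linux
# defaults). Adjust this if kernel.core_pattern is set to something else.
CORE_NAME = re.compile(r"^core(\.\d+)?$")


def find_core_files(directory="/var/lib/openais"):
    """Return the sorted paths of regular files in `directory` that look like core dumps."""
    try:
        names = os.listdir(directory)
    except OSError:
        # Directory missing or unreadable: report nothing rather than fail.
        return []
    paths = (os.path.join(directory, name) for name in names if CORE_NAME.match(name))
    return sorted(path for path in paths if os.path.isfile(path))


if __name__ == "__main__":
    cores = find_core_files()
    if cores:
        print("Possible openais core files:")
        for path in cores:
            print("  " + path)
    else:
        print("No core files found.")
```

An empty result only means that no file matching the assumed names was
found; it does not prove that openais never crashed.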

Regards
-steve

On Fri, 2008-03-14 at 13:16 +0200, Volkan YAZICI wrote:
> Oops! Here is the cluster.conf file.
> On Fri, 14 Mar 2008, Volkan YAZICI <yazicivo at ttmail.com> writes:
> > We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
> > single DS4700 SAN with IBM 2005-B16 fence devices. System is configured
> > as a high-availability system for database systems. We are facing
> > serious non-deterministic problems (they can happen anywhere, at any
> > time, without a single clue).
> >
> > One of the most frequently recurring problems is fence_tool related.
> >
> >   # service cman start
> >   Starting cluster:
> >      Loading modules... done
> >      Mounting configfs... done
> >      Starting ccsd... done
> >      Starting cman... done
> >      Starting daemons... done
> >      Starting fencing... fence_tool: can't communicate with fenced -1
> >
> >   # fenced -D
> >   1204556546 cman_init error 0 111
> >
> >   # clustat
> >   CMAN is not running.
> >   
> >   # cman_tool join
> >   
> >   # clustat
> >   msg_open: Connection refused
> >   Member Status: Quorate
> >     Member Name                        ID   Status
> >     ------ ----                        ---- ------
> >     mobilizc1                             1 Online, Local
> >     mobilizc2                             2 Offline
> >   
> >   
> >   # groupd -D
> >   1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
> >   1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
> >   1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
> >   1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
> >
> > Sometimes this problem gets solved if the two machines are rebooted at
> > the same time. But in the current HA configuration, I cannot guarantee
> > two systems will be rebooted at the same time for every problem we
> > face. At least one of them should start without a problem.
> >
> > Moreover, we were facing problems with the rgmanager. Below are the
> > related /var/log/messages lines:
> >
> >   kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
> >   clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...
> >
> > We contacted our RH support and they asked us for a clurgmgrd
> > backtrace. Unfortunately, we couldn't manage to start the cman
> > service, which we need in order to start clurgmgrd. (Why couldn't we
> > start cman? I really don't know. It was the same "fence_tool: can't
> > communicate with fenced -1" problem. As I said previously, it
> > sometimes works and sometimes doesn't.) Later, they sent us a new,
> > not-yet-released rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow, we
> > managed to start cman on both machines and then started the rgmanager
> > service with this new rgmanager RPM. (We couldn't reproduce the
> > clurgmgrd segfault.) So this solved the clurgmgrd segfault problem,
> > but we are still having "can't communicate with fenced -1" errors
> > occasionally.
> >
> > Sorry for the long post, but I am trying to help the people who will
> > try to help figure out the problem. I have also attached my
> > cluster.conf file to the post. Any kind of help will be really, really
> > appreciated! Thanks so much for your kind interest in reading this far.
> >
> >
> > Regards.
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster