[Linux-cluster] Re: Non-Deterministic Cluster Failure (can't communicate with fenced -1)

Volkan YAZICI yazicivo at ttmail.com
Fri Mar 14 11:16:34 UTC 2008


Oops! Here is the cluster.conf file.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1494 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080314/bb982ffb/attachment.obj>
-------------- next part --------------
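
For readers who cannot retrieve the attachment, a minimal two-node RHEL5 cluster.conf of the general shape described in the quoted message might look roughly like the sketch below. Every name, address, credential, port number, and the choice of fence agent here is an illustrative assumption, not the actual configuration from the attached file.

  <?xml version="1.0"?>
  <!-- illustrative two-node sketch only; names, ports, addresses and the
       fence agent are assumptions, not the real cluster.conf -->
  <cluster name="examplecluster" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="mobilizc1" nodeid="1" votes="1">
        <fence>
          <method name="1">
            <device name="sanswitch" port="1"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="mobilizc2" nodeid="2" votes="1">
        <fence>
          <method name="1">
            <device name="sanswitch" port="2"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_brocade" name="sanswitch"
                   ipaddr="192.168.0.10" login="admin" passwd="secret"/>
    </fencedevices>
    <rm>
      <failoverdomains/>
      <resources/>
    </rm>
  </cluster>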

On Fri, 14 Mar 2008, Volkan YAZICI <yazicivo at ttmail.com> writes:
> We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
> single DS4700 SAN, with IBM 2005-B16 switches as fence devices. The
> system is configured for high availability of our database systems. We
> are facing serious non-deterministic problems (they can happen anywhere,
> at any time, without a single clue).
>
> One of the most frequently recurring problems is fence_tool related.
>
>   # service cman start
>   Starting cluster:
>      Loading modules... done
>      Mounting configfs... done
>      Starting ccsd... done
>      Starting cman... done
>      Starting daemons... done
>      Starting fencing... fence_tool: can't communicate with fenced -1
>
>   # fenced -D
>   1204556546 cman_init error 0 111
>
>   # clustat
>   CMAN is not running.
>   
>   # cman_tool join
>   
>   # clustat
>   msg_open: Connection refused
>   Member Status: Quorate
>     Member Name                        ID   Status
>     ------ ----                        ---- ------
>     mobilizc1                             1 Online, Local
>     mobilizc2                             2 Offline
>   
>   
>   # groupd -D
>   1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
>   1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
>   1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
>   1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
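>
> The "cman_init error 0 111" looks like errno 111 (ECONNREFUSED), i.e.
> fenced failing to reach cman, and the groupd output above points at dlm
> lockspaces left over from a previous run. As a sketch of the kind of
> pre-start check this suggests (illustrative commands, not output
> captured from the failing nodes):
>
>   # is cman actually up and quorate?
>   cman_tool status
>
>   # which groups does groupd currently know about?
>   group_tool
>
>   # any leftover dlm lockspaces? with cman/clvmd/rgmanager stopped,
>   # this directory should be empty; if it is not, groupd refuses to
>   # continue and the node needs a reset, as the message above says
>   ls /sys/kernel/dlm/
>
>   # only once the above looks clean, retry joining the fence domain
>   fence_tool join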
>
> Sometimes this problem goes away if the two machines are rebooted at
> the same time. But in the current HA configuration, I cannot guarantee
> that both systems will be rebooted at the same time for every problem
> we face; at least one of them should be able to start without a problem.
>
> Moreover, we were also facing problems with rgmanager. Below are the
> related /var/log/messages lines:
>
>   kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
>   clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...
>
> We contacted our RH support and they asked us for a clurgmgrd
> backtrace. But unfortunately, we couldn't manage to start the cman
> service in order to start clurgmgrd. (You're asking why we couldn't
> start cman? I really don't know. It is the same "fence_tool: can't
> communicate with fenced -1" problem; as I said previously, it sometimes
> works and sometimes doesn't.) Later, they sent us a new, not-yet-released
> rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow we managed to start
> cman on both machines and then started the rgmanager service with this
> new rgmanager RPM. (We couldn't reproduce the clurgmgrd segfault
> afterwards.) That solved the clurgmgrd segfault problem, but we are
> still getting the "can't communicate with fenced -1" errors occasionally.
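>
> A rough sketch of how such a backtrace could be collected next time the
> crash happens (a sketch only; the clurgmgrd path and the core file
> location are assumptions):
>
>   # allow core dumps in the shell that starts rgmanager
>   ulimit -c unlimited
>   service rgmanager start
>
>   # after a segfault, load the core file into gdb and dump the stacks
>   gdb /usr/sbin/clurgmgrd /path/to/core.<pid>
>   (gdb) bt full
>   (gdb) thread apply all bt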
>
> Sorry for the long post, but I am trying to give as much information as
> I can to the people who will try to help figure out the problem. I have
> also attached my cluster.conf file to the post. Any kind of help will be
> really, really appreciated! Thank you so much for your kind interest in
> reading this far.
>
>
> Regards.

