[Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2

Fri Jun 5 17:20:11 UTC 2009

On Fri, 5 Jun 2009, Steven Dake wrote:

> On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote:
>> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote:
>>>
>>> On Fri, 5 Jun 2009, David Teigland wrote:
>>>
>>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote:
>>>>>
>>>>> On Fri, 5 Jun 2009, David Teigland wrote:
>>>>>
>>>>>> They are all complaining that the cluster is down, which is a polite way
>>>>>> of saying that aisexec has died/crashed/failed/killed/gone-away.
>>>>>
>>>>> Thanks. Why might that have occurred? Where would I look for clues? How
>>>>> can I increase logging output from aisexec?
>>>>
>>>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for
>>>> disappearing without leaving any clues about why.
>>>
>>> That's very disconcerting to hear. Doesn't sound like HA. :-(
>>
>> To clarify, aisexec does not often disappear, it's very reliable.  The point
>> was that in the rare case when it does, it's notorious for not leaving any
>> reasons behind.
>>
>> Dave
>>
>
> 99.9% of the time there would be a core file in /var/lib/openais/core*
> if aisexec faults.

The only file I have there is named:

ringid_10.39.171.212

>  We have not seen faults during normal operations for
> years in a released version under typical gfs2 usage scenarios.  If
> there is no core, it means some other component failed, exited, and
> caused that node to be fenced, or the core file could not be written by
> the OS because of some other OS specific failure.  Another option is
> that the OOM killer killed aisexec.

No sign of the oom killer in the log I quoted yesterday.

>  I would have a hard time believing
> aisexec would crash without a core file while the operating system was
> still functional.
>
> In the trunk we are enhancing our failure analysis to do full-time event
> tracing so failures can be debugged more rapidly than looking at a core
> file.  I hope that helps.

Thanks.

I'll try to reproduce the scenario. Meanwhile I'm still looking for hints
as to how to get more visibility into what is happening.
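
For anyone chasing the same symptom, the two checks discussed above (is there a core file under /var/lib/openais, and did the OOM killer leave a trace in the log) can be scripted. This is only a rough sketch: the helper names are made up for illustration, the log location /var/log/messages is an assumption about the syslog setup, and the OOM message patterns are a guess that may need adjusting for a given kernel.

```python
import glob
import os
import re

# Assumption: kernel OOM-killer messages contain one of these phrases.
# The exact wording varies between kernel versions.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_core_files(core_dir="/var/lib/openais"):
    """Return the sorted paths in core_dir whose names start with 'core'."""
    return sorted(glob.glob(os.path.join(core_dir, "core*")))


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    cores = find_core_files()
    print("core files:", cores if cores else "none found")

    log_path = "/var/log/messages"  # assumed syslog location
    if os.path.exists(log_path):
        with open(log_path, errors="replace") as log:
            for hit in find_oom_lines(log):
                print(hit)
```

With only a ringid_* file in the directory, find_core_files() returns an empty list, which matches what I see here.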