[Linux-cluster] cluster failed after 53 hours

Daniel McNeil daniel at osdl.org
Tue Jan 18 23:10:20 UTC 2005


On Tue, 2005-01-18 at 00:48, Patrick Caulfield wrote:
> On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:
> > My 3 node cluster ran tests for 53 hours before hitting a problem.
> > 
> > 
> > Node cl031 hit the 1st problem CMAN: killed by STARTTRANS or
> > NOMINATE.  There is a DLM assert on cl031 also, but that is
> > after a whole bunch of debug output.  The full logs are
> > here (http://developer.osdl.org/daniel/GFS/test.12jan2005/)
> > 
> > Any ideas on what is going on?
> > 
> > Here is simplified output (in the README file):
> > test started Jan Wed 12 17:18
> > hung after Fri Jan 14 22:00
> > 
> > cl031 got an error in just under 53 hours.
> > ==========================================
> > Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
> 
> It's the usual thing. missing messages.
> 
> patrick

There is a DLM ASSERT farther down in the log that shows error = -105,
which is ENOBUFS.  Is this happening after the node has decided
to leave the cluster?  I just want to make sure an out-of-memory
condition isn't what is causing the failure.
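
For anyone who wants to double-check the errno mapping, here is a
minimal sketch that decodes a kernel-style negative return code into
its symbolic name.  The helper name errno_name is made up for
illustration, and the numbering is platform-specific, so this needs
to run on a Linux box to match the kernel log.  Note that ENOBUFS
("No buffer space available") is a different code from ENOMEM.

```python
import errno
import os


def errno_name(code):
    """Return the symbolic errno name for a kernel-style return code.

    Kernel code conventionally reports failures as negative errno
    values, so the sign is dropped before the lookup.
    errno_name is a hypothetical helper, not part of any cluster tool.
    """
    return errno.errorcode.get(abs(code), "UNKNOWN")


if __name__ == "__main__":
    code = -105  # value reported by the DLM assert in the log
    # errno numbering differs between platforms; run this on Linux.
    print(code, errno_name(code), os.strerror(abs(code)))
```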

Daniel




