[Linux-cluster] Re: How to configure a cluster to remain up in the event of node failure

Wed Aug 13 11:18:03 UTC 2008

Hi Christine, thanks for the feedback (and while I'm thanking you, also
for the programming locking applications book :)

On Wed, Aug 13, 2008 at 12:09 PM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
>> I think I found a problem with the way it starts up...  See just below
>> the startup output for more info...
>>> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
>>> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
>>> dlm: Using TCP for communications
>>> dlm: connecting to 2
>>> dlm: got connection to 2
>>> dlm: connecting to 2
>>> dlm: got connection from 4
>>
>> Could this be the problem?
>
> Yes, that's bad! You should only get one "connecting to" message per node.
> If you're getting two it looks like the connection is being closed by the
> remote node for some reason. Are there any messages on node 2 that might
> give a clue as to what's happening ?

That was it. The qdiskd service was not running on all nodes, and I had
restarted it a few times. On top of that, I had run a few config
updates with ccs_tool, and also cman_tool expected 4 to lower my
quorum, because the cluster kept locking up when it lost quorum.
Obviously, it was losing quorum because the qdiskd service was not
running and the cluster was 2 votes short. The node2 logs showed
"Cluster is not quorate, refusing connection". Eventually I had to
restart the entire cluster to get things running; it seems that GFS
does not recover that well once it loses quorum.

* 1st rule of troubleshooting - check the logs.
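
For anyone who hits the same thing, here is a rough sketch of how the two symptoms from this thread could be picked out of a log automatically: more than one "dlm: connecting to N" line for the same node, and the "Cluster is not quorate" refusal. The function name scan_log and the sample lines are my own illustration, not part of any cluster tool, and the message strings are just the ones quoted above; other versions may word them differently.

```python
import re
from collections import Counter

# Message formats taken from the output quoted in this thread (assumed stable).
CONNECT_RE = re.compile(r"dlm: connecting to (\d+)")
NOT_QUORATE = "Cluster is not quorate"


def scan_log(lines):
    """Return (nodes with repeated connection attempts, lines reporting quorum loss)."""
    attempts = Counter()
    quorum_lines = []
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            attempts[int(match.group(1))] += 1
        if NOT_QUORATE in line:
            quorum_lines.append(line.strip())
    # Only one "connecting to" message is expected per node.
    repeated = {node: count for node, count in attempts.items() if count > 1}
    return repeated, quorum_lines


if __name__ == "__main__":
    # Hypothetical sample built from the messages quoted in this thread.
    sample = [
        "dlm: connecting to 2",
        "dlm: got connection to 2",
        "dlm: connecting to 2",
        "dlm: got connection from 4",
        "Cluster is not quorate, refusing connection",
    ]
    repeated, quorum_lines = scan_log(sample)
    for node, count in sorted(repeated.items()):
        print(f"node {node}: {count} connection attempts")
    for line in quorum_lines:
        print(f"quorum problem: {line}")
```

To run it against a real log, pass the open file object to scan_log instead of the sample list. Where those messages end up depends on your syslog setup, so I have left the path out.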