[Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman

Digimer lists at alteeve.ca
Thu Jul 26 16:46:59 UTC 2012


For automatic recovery, you have to use power fencing. Fabric fencing 
(like fencing at a SAN switch) is perfectly safe, but it requires human 
intervention: the cut-off node keeps running with its out-of-sync state 
until someone restarts its cluster stack by hand.
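
For reference, power fencing on RHEL 6 is configured in 
/etc/cluster/cluster.conf. A minimal sketch using IPMI (fence_ipmilan) 
might look like this; the node names, IP addresses, and credentials are 
placeholders for your environment:

  <?xml version="1.0"?>
  <cluster name="example" config_version="2">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1">
        <fence>
          <method name="power">
            <device name="ipmi_node1" action="reboot"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2" nodeid="2">
        <fence>
          <method name="power">
            <device name="ipmi_node2" action="reboot"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="10.0.0.1" login="admin" passwd="secret"/>
      <fencedevice name="ipmi_node2" agent="fence_ipmilan" ipaddr="10.0.0.2" login="admin" passwd="secret"/>
    </fencedevices>
  </cluster>

With two_node="1", either node can fence the other without quorum, which 
is what makes automatic recovery possible in a two-node setup.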

The problem is that the messages passed around the cluster in the closed 
process group (CPG) are sequenced. Once a node falls out of sequence, its 
cluster stack needs to be restarted. To automate this, power fence the 
node. When it boots back up, it should automatically rejoin the cluster 
with a clean state.
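
If you want to see the mechanism by hand before relying on it, you can 
trigger a fence yourself and watch the rejoin. Assuming working fence 
devices in cluster.conf, something like:

  # fence_node node2      <- runs node2's configured fence agent
  # cman_tool nodes       <- after node2 reboots, it should show as a member again

fence_node(8) invokes the fence agent the same way fenced would during a 
real failure, so it is a good way to prove the configuration works.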

May I ask why you're so careful to avoid a restart? The whole idea of 
clustering is to have no/minimal interruption of service during a node 
failure.

Digimer

On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
> Thanks Digimer,
>
> Yes, this works but it cannot be done automatically - and that's my problem.
> I'm trying to figure out why CMAN gets killed. What if I use a SAN switch as a fencing device to block access to the SAN? My node won't be rebooted, so will I run into the same situation?
> Is it at all possible for the node to rejoin the cluster without rebooting or restarting CMAN?
> And if it is not, what about the SAN switch fencing scenario?
>
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 11:48 AM
> To: linux clustering
> Cc: DIMITROV, TANIO
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>> Hello,
>> I'm testing RHEL 6.2 cluster using CMAN.
>> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up, the nodes kill each other's cman instances:
>>
>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>
>> Can this be avoided somehow?
>>
>> Thanks in advance!
>
> Use real fencing.
>
> The problem is, I believe, that the CPG messages fall out of sync. You
> could try stopping cman on one node, reconnecting the network, and
> restarting cman on that node again.
>


-- 
Digimer
Papers and Projects: https://alteeve.com



