[Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman

Thu Jul 26 17:20:11 UTC 2012

Sorry, I sent the previous message to the wrong address.

The reason I don't want to reboot/fence the node is that my nodes are actually semi-independent: each one writes to its local file system, which is then backed up to the other node when that node becomes available.

So, the only way to rejoin the cluster is to restart the CPG sequence from 0 (a clean state), either by rebooting the node or by restarting CMAN?

-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca] 
Sent: Thursday, July 26, 2012 12:47 PM
To: DIMITROV, TANIO
Cc: linux clustering
Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman

For automatic recovery, you have to use power fencing. Fabric fencing 
(like fencing at a SAN switch) is perfectly safe, but it requires human 
intervention.

The problem is that the messages passed around the cluster in the closed 
process group (CPG) are sequenced. Once a node falls out of sequence, it 
needs to be restarted. To automate this, power fence the node. When it 
boots back up, it should automatically rejoin the cluster with a clean 
state.
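
As a rough illustration of why a node that has fallen out of sequence cannot
simply resume, here is a toy Python model. It is not how corosync implements
CPG; the Node class, can_merge() and the payload strings are hypothetical
names made up for this sketch, and the "one log must be a prefix of the
other" rule is a simplifying assumption.

```python
# Toy model only: NOT corosync's actual CPG implementation.
# All names here (Node, can_merge, demo) are hypothetical.

class Node:
    def __init__(self, name):
        self.name = name
        self.seq = 0    # sequence number of the last delivered message
        self.log = []   # (seq, payload) pairs delivered so far

    def deliver(self, payload):
        """Deliver one group message and advance the local sequence."""
        self.seq += 1
        self.log.append((self.seq, payload))

    def restart(self):
        """Model a reboot or cman restart: all group state is discarded."""
        self.seq = 0
        self.log = []


def can_merge(a, b):
    """Assumed rule: histories are compatible only if one log is a
    prefix of the other, so the node behind can replay what it missed."""
    shorter, longer = sorted((a.log, b.log), key=len)
    return longer[:len(shorter)] == shorter


def demo():
    n1, n2 = Node("node1"), Node("node2")
    for n in (n1, n2):
        n.deliver("join")            # both nodes see the same message
    before_split = can_merge(n1, n2)

    # Network partition: each node keeps sequencing on its own.
    n1.deliver("event seen only by node1")
    n2.deliver("event seen only by node2")
    after_split = can_merge(n1, n2)  # same seq numbers, different content

    n2.restart()                     # clean state, as after a power fence
    after_restart = can_merge(n1, n2)
    return before_split, after_split, after_restart


if __name__ == "__main__":
    print(demo())
```

In this model the two logs conflict after the split, and only the restarted
node, with an empty log, can be brought back in line with the survivor.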

May I ask why you're so careful to avoid a restart? The whole idea of 
clustering is to have no/minimal interruption of service during a node 
failure.

Digimer

On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
> Thanks Digimer,
>
> Yes, this works but it cannot be done automatically - and that's my problem.
> I'm trying to understand why CMAN gets killed. If I use a SAN switch as the fencing device to block access to the SAN, my node won't be rebooted, so won't I run into the same situation?
> Is it possible at all for the node to rejoin the cluster without rebooting or restarting CMAN?
> And if it is not, what about the SAN switch fencing scenario?
>
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 11:48 AM
> To: linux clustering
> Cc: DIMITROV, TANIO
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>> Hello,
>> I'm testing RHEL 6.2 cluster using CMAN.
>> It is a two-node cluster with no shared data. If there is a connectivity problem between the nodes, each of them continues working stand-alone, which is OK (no shared data, manual fencing). But when the connection comes back up, the nodes kill each other's cman instances:
>>
>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>
>> Can this be avoided somehow?
>>
>> Thanks in advance!
>
> Use real fencing.
>
> The problem is, I believe, that the CPG messages fall out of sync. You
> could try stopping cman on one node, reconnecting the network, and then
> starting cman on that node again.
>

-- 
Digimer
Papers and Projects: https://alteeve.com