[Linux-cluster] Unexpected problems with clvmd

Shaun Mccullagh Shaun.Mccullagh at espritxb.nl
Wed Dec 3 20:10:56 UTC 2008


Hi,

I found a workaround for this by removing cman, clvmd and gfs from the
startup directories and rebooting all the nodes, then starting cman
simultaneously on all nodes by hand to avoid fencing.

I then started clvmd the same way.
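
In case it is useful to anyone else, this is roughly what I did (the exact
service names below are from our RHEL 5 boxes, so treat this as a sketch):

# on every node, take the cluster services out of the boot sequence
chkconfig cman off
chkconfig clvmd off
chkconfig gfs off

# reboot all nodes, then bring the services up by hand, starting each one
# as close to simultaneously as possible on all nodes
service cman start     # on all nodes at the same time
service clvmd start    # once cman is up everywhere
service gfs start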

I think this problem may have been caused by our Cisco 2960G gigabit
switches, which might be blocking multicast traffic.
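
On the switch side I was planning to look at the IGMP snooping state first,
something like this on each 2960G (IOS syntax from memory, so please correct
me if it is wrong):

show ip igmp snooping
show ip igmp snooping groups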

How can I check that the multicast traffic used by cman/clvmd is being
received by all nodes?
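
I was thinking of something along these lines on each node, using the
multicast address that cman_tool status reports below (not sure if this is
the right approach):

# watch for cluster multicast traffic arriving on the bonded interface
tcpdump -n -i bond0 host 239.192.54.42

# or narrow it to the default openais port, assuming mcastport has not been
# changed in cluster.conf
tcpdump -n -i bond0 host 239.192.54.42 and udp port 5405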

We have two Cisco 2960G switches; eth0 and eth1 are bonded across them so
that we have a redundant data path for all the CMAN traffic.
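
For what it's worth, the bonding is set up roughly like this; the mode and
addresses shown here are illustrative rather than copied verbatim from our
nodes:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=1 miimon=100    # active-backup across the two switches

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=172.16.1.60     # this node's cluster address (brik3-gfs shown as an example)
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none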

The nodes are connected to a SAN via Fibre Channel HBAs.

Are these switches suitable for this application?

TIA for any help...

And many thanks for the help already given....

Shaun


-----Original Message-----
From: Shaun Mccullagh 
Sent: Wednesday, December 03, 2008 2:57 PM
To: linux clustering
Subject: RE: [Linux-cluster] Unexpected problems with clvmd

Hi Chrissie


Fence status is 'none' on all nodes.

Shaun


for i in 10.0.154.10 pan5.tmf pan6.tmf pan4.tmf ; do ssh $i /sbin/group_tool | grep fence; done
fence            0     default     00000000 none
fence            0     default     00000000 none
fence            0     default     00000000 none
fence            0     default     00000000 none

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine
Caulfield
Sent: Wednesday, December 3, 2008 14:53
To: linux clustering
Subject: Re: [Linux-cluster] Unexpected problems with clvmd

Shaun Mccullagh wrote:
> Hi,
> 
> I tried to add another node to our 3 node cluster this morning.
> 
> Initially things went well; but I wanted to check the new node booted 
> correctly.
> 
> After the second reboot clvmd failed to start up on the new node 
> (called
> pan4):
> 
> [root at pan4 ~]# clvmd -d1 -T20
> CLVMD[8e1e8300]: Dec  3 14:24:09 CLVMD started
> CLVMD[8e1e8300]: Dec  3 14:24:09 Connected to CMAN
> CLVMD[8e1e8300]: Dec  3 14:24:12 CMAN initialisation complete
> 
> Group_tool reports this output for clvmd on all four nodes in the
> cluster:
> 
> dlm              1     clvmd       00010005 FAIL_START_WAIT
> dlm              1     clvmd       00010005 FAIL_ALL_STOPPED
> dlm              1     clvmd       00010005 FAIL_ALL_STOPPED
> dlm              1     clvmd       00000000 JOIN_STOP_WAIT
> 
> Otherwise the cluster is OK:
> 
> [root at brik3 ~]# clustat
> Cluster Status for mtv_gfs @ Wed Dec  3 14:38:26 2008
> Member Status: Quorate
> 
>  Member Name                                               ID   Status
>  ------ ----                                               ---- ------
>  pan4                                                          4 Online
>  pan5                                                          5 Online
>  nfs-pan                                                       6 Online
>  brik3-gfs                                                     7 Online, Local
> 
> [root at brik3 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 4
> Cluster Name: mtv_gfs
> Cluster Id: 14067
> Cluster Member: Yes
> Cluster Generation: 172
> Membership state: Cluster-Member
> Nodes: 4
> Expected votes: 4
> Total votes: 4
> Quorum: 3
> Active subsystems: 8
> Flags: Dirty
> Ports Bound: 0 11
> Node name: brik3-gfs
> Node ID: 7
> Multicast addresses: 239.192.54.42
> Node addresses: 172.16.1.60
> 
> It seems I have created a deadlock; what is the best way to fix this?
> 
> TIA
> 
>
The first thing to do is check the fencing status via group_tool and
syslog. If fencing hasn't completed, the DLM can't recover.
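
For example, on each node (the log location may differ depending on your
syslog setup):

group_tool ls                      # state of the fence, dlm and gfs groups
cman_tool nodes                    # membership as cman sees it
grep -i fence /var/log/messages    # fenced logs what it is trying to do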
-- 

Chrissie









