[Linux-cluster] Some nodes won't join after being fenced

ted tedley at gmail.com
Fri Aug 1 01:01:57 UTC 2008


We seem to have found part of the culprit.

We're using an Extreme switch that handles all of our traffic in separate
VLANs, and the IGMP handling in ExtremeOS seems to be interfering with the
cluster's ability to recover from such an episode.
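
If IGMP snooping is indeed the culprit, disabling it on the cluster VLAN
should confirm it. On ExtremeOS the command is something like the following
(syntax from memory, so check your release; "cluster-vlan" is a placeholder
for whatever VLAN carries the openais traffic):

    disable igmp snooping vlan cluster-vlan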

At the moment we're leaning towards the Juniper switch: we moved to an
identically configured (as far as ports and VLANs go) Juniper EX-4200, and
the cluster was able to recover itself after a single node (of nine) was
fenced.

While on the Extreme, each node had to be fenced in turn before the cluster
could recover fully.  By "recover fully" we mean each node being able to
mount the GFS filesystem r/w and actually write and delete test files on
the mount point.
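
For reference, the per-node check is nothing fancy; something along these
lines (the /mnt/testgfs1 mount point is just an example):

    # run on each node after it rejoins the cluster
    mount | grep gfs                              # GFS mount present?
    touch /mnt/testgfs1/fence-test.$(hostname)    # can we write?
    rm /mnt/testgfs1/fence-test.$(hostname)       # and delete?

When recovery hasn't completed, operations on the mount tend to block
rather than fail, so a hanging touch is itself a useful symptom.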

Our testing continues, and we're trying to come up with "real" evidence,
such as proof that some of the multicast traffic is or isn't being handled
properly.  So far the empirical evidence supports the above conclusions.
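
One straightforward check is to watch for the cluster's multicast group on
each node (the 239.192.6.148 address comes from the cman_tool status output
quoted below; eth0 is a placeholder for the cluster interface):

    # is the openais traffic arriving?
    tcpdump -n -i eth0 host 239.192.6.148

    # are the IGMP membership reports going out?
    tcpdump -n -i eth0 igmp

If a node stops seeing traffic for 239.192.6.148 after a fence, that points
at the switch having dropped it from the multicast group.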

-ted


On 7/31/08, Brandon Young <bkyoung at gmail.com> wrote:
>
> I have occasionally run into this problem, too.  I have found that
> sometimes I can work around the problem by chkconfig'ing clvmd, cman, and
> rgmanager off, rebooting, then manually starting cman, rgmanager, and clvmd
> (in that order).  Usually, after that, I am able to fence the node(s) and they
> will rejoin automatically (after re-enabling automatic startup with
> chkconfig, of course).  I know this workaround doesn't explain *why* it
> happens, but it has more than once helped me get my cluster nodes back
> online without having to reboot all the nodes.
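>
> Spelled out, that's roughly the following (stock CentOS init scripts; the
> start order below is simply the one that has worked for me):
>
>   chkconfig cman off; chkconfig clvmd off; chkconfig rgmanager off
>   reboot
>   # after the node comes back up:
>   service cman start
>   service rgmanager start
>   service clvmd start
>   # once it has rejoined cleanly:
>   chkconfig cman on; chkconfig clvmd on; chkconfig rgmanager on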
>
> On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <ml at adamdein.com> wrote:
>
>> Hello,
>>
>> I currently have a 9-node CentOS 5.1 cman/GFS cluster which I've managed
>> to break.
>>
>> It is broken in almost exactly the same way as stated in these two
>> previous threads:
>>
>> http://www.spinics.net/lists/cluster/msg10304.html
>> http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html
>>
>> However, I can find no resolution in the archives. My only guaranteed
>> resolution at this point is a cold restart of all nodes, which to me seems
>> ridiculous (i.e., I'm missing something).
>>
>> To add a little detail: I have nodes cluster1 through cluster9, and nodes
>> 7 & 8 are broken. When I fence/reboot them, cman starts but times out on
>> starting fencing. cman_tool nodes shows them as joined, but the fence
>> domain looks broken.
>>
>> Any ideas?
>>
>> I have included some information from a good node, a bad node, and
>> /var/log/messages from a good node that did the fencing.
>>
>> Good Node:
>>
>> [root@cluster1 ~]# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>   1   M    768   2008-07-31 12:47:19  cluster1-rhc
>>   2   M    776   2008-07-31 12:47:37  cluster2-rhc
>>   3   M    772   2008-07-31 12:47:19  cluster3-rhc
>>   4   M    788   2008-07-31 12:56:20  cluster4-rhc
>>   5   M    772   2008-07-31 12:47:19  cluster5-rhc
>>   6   M    784   2008-07-31 12:52:50  cluster6-rhc
>>   7   M    808   2008-07-31 13:24:24  cluster7-rhc
>>   8   X    800                        cluster8-rhc
>>   9   M    772   2008-07-31 12:47:19  cluster9-rhc
>> [root@cluster1 ~]# cman_tool services
>> type             level name      id       state
>> fence            0     default   00010003 FAIL_START_WAIT
>> [1 2 3 4 5 6 9]
>> dlm              1     testgfs1  00020005 none
>> [1 2 3 4 5 6]
>> gfs              2     testgfs1  00010005 none
>> [1 2 3 4 5 6]
>> [root@cluster1 ~]# cman_tool status
>> Version: 6.1.0
>> Config Version: 13
>> Cluster Name: test
>> Cluster Id: 1678
>> Cluster Member: Yes
>> Cluster Generation: 808
>> Membership state: Cluster-Member
>> Nodes: 8
>> Expected votes: 9
>> Total votes: 8
>> Quorum: 5
>> Active subsystems: 7
>> Flags: Dirty
>> Ports Bound: 0
>> Node name: cluster1-rhc
>> Node ID: 1
>> Multicast addresses: 239.192.6.148
>> Node addresses: 10.128.161.81
>> [root@cluster1 ~]# group_tool
>> type             level name      id       state
>> fence            0     default   00010003 FAIL_START_WAIT
>> [1 2 3 4 5 6 9]
>> dlm              1     testgfs1  00020005 none
>> [1 2 3 4 5 6]
>> gfs              2     testgfs1  00010005 none
>> [1 2 3 4 5 6]
>> [root@cluster1 ~]#
>>
>>
>> Bad/broken Node:
>>
>> [root@cluster7 ~]# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>   1   M    808   2008-07-31 13:24:24  cluster1-rhc
>>   2   M    808   2008-07-31 13:24:24  cluster2-rhc
>>   3   M    808   2008-07-31 13:24:24  cluster3-rhc
>>   4   M    808   2008-07-31 13:24:24  cluster4-rhc
>>   5   M    808   2008-07-31 13:24:24  cluster5-rhc
>>   6   M    808   2008-07-31 13:24:24  cluster6-rhc
>>   7   M    804   2008-07-31 13:24:24  cluster7-rhc
>>   8   X      0                        cluster8-rhc
>>   9   M    808   2008-07-31 13:24:24  cluster9-rhc
>> [root@cluster7 ~]# cman_tool services
>> type             level name     id       state
>> fence            0     default  00000000 JOIN_STOP_WAIT
>> [1 2 3 4 5 6 7 9]
>> [root@cluster7 ~]# cman_tool status
>> Version: 6.1.0
>> Config Version: 13
>> Cluster Name: test
>> Cluster Id: 1678
>> Cluster Member: Yes
>> Cluster Generation: 808
>> Membership state: Cluster-Member
>> Nodes: 8
>> Expected votes: 9
>> Total votes: 8
>> Quorum: 5
>> Active subsystems: 7
>> Flags: Dirty
>> Ports Bound: 0
>> Node name: cluster7-rhc
>> Node ID: 7
>> Multicast addresses: 239.192.6.148
>> Node addresses: 10.128.161.87
>> [root@cluster7 ~]# group_tool
>> type             level name     id       state
>> fence            0     default  00000000 JOIN_STOP_WAIT
>> [1 2 3 4 5 6 7 9]
>> [root@cluster7 ~]#
>>
>>
>> /var/log/messages:
>>
>> Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was
>> successful
>> Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was
>> successful
>> Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from
>> 12.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from
>> 11.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high
>> seq received 89
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id
>> for ring 324
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member
>> 10.128.161.81:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member
>> 10.128.161.82:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member
>> 10.128.161.83:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member
>> 10.128.161.84:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member
>> 10.128.161.85:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member
>> 10.128.161.86:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member
>> 10.128.161.89:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate
>> any messages in recovery.
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.87)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.88)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the
>> primary component and will provide service.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL
>> state.
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.82
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.83
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.84
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.85
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.86
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.89
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 2
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 3
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 4
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 5
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 6
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 9
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from
>> 11.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high
>> seq received 68
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id
>> for ring 328
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member
>> 10.128.161.81:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member
>> 10.128.161.82:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member
>> 10.128.161.83:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member
>> 10.128.161.84:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member
>> 10.128.161.85:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member
>> 10.128.161.86:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member
>> 10.128.161.87:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.87
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member
>> 10.128.161.89:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate
>> any messages in recovery.
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.87)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.87)
>> Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the
>> primary component and will provide service.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL
>> state.
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.82
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.83
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.84
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.85
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.86
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.87
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.89
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 6
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 9
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 2
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 3
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 4
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 5
>>
>> Thanks!
>>
>> Adam