[Linux-cluster] Some nodes won't join after being fenced

Brandon Young bkyoung at gmail.com
Thu Jul 31 20:25:26 UTC 2008


I have occasionally run into this problem, too.  Sometimes I can work around
it by using chkconfig to turn clvmd, cman, and rgmanager off, rebooting, and
then manually starting cman, rgmanager, and clvmd (in that order).  Usually,
after that, I am able to fence the node(s) and they rejoin automatically
(after re-enabling automatic startup with chkconfig, of course).  I know this
workaround doesn't explain *why* it happens, but it has more than once helped
me get my cluster nodes back online without having to reboot all the nodes.
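
For reference, the steps above can be sketched roughly as the shell functions
below.  This is only a sketch, assuming the stock CentOS 5 init scripts and the
chkconfig and service commands; the function names and the DRY_RUN variable are
made up for illustration, and by default the commands are only printed, not run.

```shell
#!/bin/sh
# Sketch of the workaround described above (hypothetical helper names).
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 and
# run as root on the affected node to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1 (before rebooting the stuck node): disable automatic startup.
disable_autostart() {
    for svc in clvmd cman rgmanager; do
        run chkconfig "$svc" off
    done
}

# Step 2 (after the reboot): start the services by hand, in this order.
manual_start() {
    for svc in cman rgmanager clvmd; do
        run service "$svc" start
    done
}

# Step 3 (once the node rejoins after fencing): re-enable automatic startup.
enable_autostart() {
    for svc in cman rgmanager clvmd; do
        run chkconfig "$svc" on
    done
}
```

Source the file on the broken node, call disable_autostart, reboot, then call
manual_start; once the node rejoins after being fenced, call enable_autostart.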

On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <ml at adamdein.com> wrote:

> Hello,
>
> I currently have a 9-node CentOS 5.1 cman/gfs cluster which I've managed to
> break.
>
> It is broken in almost exactly the same way as stated in these two previous
> threads:
>
> http://www.spinics.net/lists/cluster/msg10304.html
> http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html
>
> However, I can find no resolution in the archives. My only guaranteed
> resolution at this point is a cold restart of all nodes, which to me seems
> ridiculous (i.e., I'm missing something).
>
> To add a little detail, I have nodes cluster1...9. Nodes 7 & 8 are broken.
> When I fence/reboot them, cman starts but times out on starting fencing.
> cman_tool nodes shows them as joined but the fence domain looks broken.
>
> Any ideas?
>
> I have included some information for a good node, bad node, and
> /var/log/messages from a good node that did the fencing.
>
> Good Node:
>
> [root at cluster1 ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>   1   M    768   2008-07-31 12:47:19  cluster1-rhc
>   2   M    776   2008-07-31 12:47:37  cluster2-rhc
>   3   M    772   2008-07-31 12:47:19  cluster3-rhc
>   4   M    788   2008-07-31 12:56:20  cluster4-rhc
>   5   M    772   2008-07-31 12:47:19  cluster5-rhc
>   6   M    784   2008-07-31 12:52:50  cluster6-rhc
>   7   M    808   2008-07-31 13:24:24  cluster7-rhc
>   8   X    800                        cluster8-rhc
>   9   M    772   2008-07-31 12:47:19  cluster9-rhc
> [root at cluster1 ~]# cman_tool services
> type             level name      id       state
> fence            0     default   00010003 FAIL_START_WAIT
> [1 2 3 4 5 6 9]
> dlm              1     testgfs1  00020005 none
> [1 2 3 4 5 6]
> gfs              2     testgfs1  00010005 none
> [1 2 3 4 5 6]
> [root at cluster1 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 13
> Cluster Name: test
> Cluster Id: 1678
> Cluster Member: Yes
> Cluster Generation: 808
> Membership state: Cluster-Member
> Nodes: 8
> Expected votes: 9
> Total votes: 8
> Quorum: 5
> Active subsystems: 7
> Flags: Dirty
> Ports Bound: 0
> Node name: cluster1-rhc
> Node ID: 1
> Multicast addresses: 239.192.6.148
> Node addresses: 10.128.161.81
> [root at cluster1 ~]# group_tool
> type             level name      id       state
> fence            0     default   00010003 FAIL_START_WAIT
> [1 2 3 4 5 6 9]
> dlm              1     testgfs1  00020005 none
> [1 2 3 4 5 6]
> gfs              2     testgfs1  00010005 none
> [1 2 3 4 5 6]
> [root at cluster1 ~]#
>
>
> Bad/broken Node:
>
> [root at cluster7 ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>   1   M    808   2008-07-31 13:24:24  cluster1-rhc
>   2   M    808   2008-07-31 13:24:24  cluster2-rhc
>   3   M    808   2008-07-31 13:24:24  cluster3-rhc
>   4   M    808   2008-07-31 13:24:24  cluster4-rhc
>   5   M    808   2008-07-31 13:24:24  cluster5-rhc
>   6   M    808   2008-07-31 13:24:24  cluster6-rhc
>   7   M    804   2008-07-31 13:24:24  cluster7-rhc
>   8   X      0                        cluster8-rhc
>   9   M    808   2008-07-31 13:24:24  cluster9-rhc
> [root at cluster7 ~]# cman_tool services
> type             level name     id       state
> fence            0     default  00000000 JOIN_STOP_WAIT
> [1 2 3 4 5 6 7 9]
> [root at cluster7 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 13
> Cluster Name: test
> Cluster Id: 1678
> Cluster Member: Yes
> Cluster Generation: 808
> Membership state: Cluster-Member
> Nodes: 8
> Expected votes: 9
> Total votes: 8
> Quorum: 5
> Active subsystems: 7
> Flags: Dirty
> Ports Bound: 0
> Node name: cluster7-rhc
> Node ID: 7
> Multicast addresses: 239.192.6.148
> Node addresses: 10.128.161.87
> [root at cluster7 ~]# group_tool
> type             level name     id       state
> fence            0     default  00000000 JOIN_STOP_WAIT
> [1 2 3 4 5 6 7 9]
> [root at cluster7 ~]#
>
>
> /var/log/messages:
>
> Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was
> successful
> Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was
> successful
> Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from
> 12.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from
> 11.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high
> seq received 89
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id for
> ring 324
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member
> 10.128.161.81:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member
> 10.128.161.82:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member
> 10.128.161.83:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member
> 10.128.161.84:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member
> 10.128.161.85:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member
> 10.128.161.86:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member
> 10.128.161.89:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
> received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.81)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.82)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.83)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.84)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.85)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.86)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.89)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.87)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.88)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.81)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.82)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.83)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.84)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.85)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.86)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.89)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
> Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.82
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.83
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.84
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.85
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.86
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.89
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 2
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 3
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 4
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 5
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 6
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 9
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from
> 11.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high
> seq received 68
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id for
> ring 328
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member
> 10.128.161.81:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member
> 10.128.161.82:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member
> 10.128.161.83:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member
> 10.128.161.84:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member
> 10.128.161.85:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member
> 10.128.161.86:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member
> 10.128.161.87:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.87
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member
> 10.128.161.89:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
> received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.81)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.82)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.83)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.84)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.85)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.86)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.89)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.81)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.82)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.83)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.84)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.85)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.86)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.87)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.89)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
> 10.128.161.87)
> Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the
> primary component and will provide service.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.82
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.83
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.84
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.85
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.86
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.87
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
> 10.128.161.89
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 6
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 9
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 1
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 2
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 3
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 4
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
> node 5
>
> Thanks!
>
> Adam
>