From tedley at gmail.com  Fri Aug  1 01:01:57 2008
From: tedley at gmail.com (ted)
Date: Thu, 31 Jul 2008 21:01:57 -0400
Subject: [Linux-cluster] Some nodes won't join after being fenced
In-Reply-To: <824ffea00807311325u186e8129kf5218e6dbc2a4d06@mail.gmail.com>
References: <48920785.4060300@adamdein.com>
	<824ffea00807311325u186e8129kf5218e6dbc2a4d06@mail.gmail.com>
Message-ID: <a5f85ede0807311801x2d1ddc55vd9c83de7cf8be3ef@mail.gmail.com>

We seem to have found part of the culprit.

We're using an Extreme switch that handles all of our traffic in seperate
VLAN's, and the IGMP bits of the ExtremeOS seem to be interfering with the
clusters ability to recover itself from such an episode.

At the moment we're leaning towards the Juniper switch as we moved to an
identically configured (as far as ports and VLANs go) Juniper EX-4200 and
the cluster was able to recover itself with a single node (of nine) being
fenced.

While on the Extreme, each node needed to be fenced in turn for the cluster
to be able to recover fully.  This means each node being able to mount the
GFS mount r/w and actually be able to write and delete test files on the
mount point.

Our testing continues and we're trying to come up with "real" evidence such
as proof that some parts of the multicast traffic are or aren't being dealt
with properly.  So far the empirical evidence suggests the above
conclusions.

-ted


On 7/31/08, Brandon Young <bkyoung at gmail.com> wrote:
>
> I have occasionally run into this problem, too.  I have found that
> sometimes I can work around the problem by chkconfig'ing clvmd,cman,and
> rgmanager off, rebooting, then manually starting cman, rgmanager, clvmd (in
> that order).  Usually, after that, I am able to fence the node(s) and they
> will rejoin automatically (after re-enabling automatic startup with
> chkconfig, of course).  I know this workaround doesn't explain *why* it
> happens, but it has more than once helped me get my cluster nodes back
> online without having to reboot all the nodes.
>
> On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <ml at adamdein.com> wrote:
>
>> Hello,
>>
>> I currently have a 9 node centos 5.1 cman/gfs cluster which I've managed
>> to break.
>>
>> It is broken in almost exactly the same way as stated in these two
>> previous threads:
>>
>> http://www.spinics.net/lists/cluster/msg10304.html
>> http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html
>>
>> However, I can find no resolution in the archives. My only guaranteed
>> resolution at this point is a cold restart of all nodes which to me seems
>> ridiculous (ie: I'm missing something).
>>
>> To add a little details, I have nodes cluster1...9. Nodes 7 & 8 are
>> broken. When I fence/reboot them, cman starts but times out on starting
>> fencing. cman_tools nodes shows them as joined but the fence domain looks
>> broke.
>>
>> Any ideas?
>>
>> I have included some information for a good node, bad node, and
>> /var/log/messages from a good node that did the fencing.
>>
>> Good Node:
>>
>> [root at cluster1 ~]# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>   1   M    768   2008-07-31 12:47:19  cluster1-rhc
>>   2   M    776   2008-07-31 12:47:37  cluster2-rhc
>>   3   M    772   2008-07-31 12:47:19  cluster3-rhc
>>   4   M    788   2008-07-31 12:56:20  cluster4-rhc
>>   5   M    772   2008-07-31 12:47:19  cluster5-rhc
>>   6   M    784   2008-07-31 12:52:50  cluster6-rhc
>>   7   M    808   2008-07-31 13:24:24  cluster7-rhc
>>   8   X    800                        cluster8-rhc
>>   9   M    772   2008-07-31 12:47:19  cluster9-rhc
>> [root at cluster1 ~]# cman_tool services
>> type             level name      id       state
>> fence            0     default   00010003 FAIL_START_WAIT
>> [1 2 3 4 5 6 9]
>> dlm              1     testgfs1  00020005 none
>> [1 2 3 4 5 6]
>> gfs              2     testgfs1  00010005 none
>> [1 2 3 4 5 6]
>> [root at cluster1 ~]# cman_tool status
>> Version: 6.1.0
>> Config Version: 13
>> Cluster Name: test
>> Cluster Id: 1678
>> Cluster Member: Yes
>> Cluster Generation: 808
>> Membership state: Cluster-Member
>> Nodes: 8
>> Expected votes: 9
>> Total votes: 8
>> Quorum: 5
>> Active subsystems: 7
>> Flags: Dirty
>> Ports Bound: 0
>> Node name: cluster1-rhc
>> Node ID: 1
>> Multicast addresses: 239.192.6.148
>> Node addresses: 10.128.161.81
>> [root at cluster1 ~]# group_tool
>> type             level name      id       state
>> fence            0     default   00010003 FAIL_START_WAIT
>> [1 2 3 4 5 6 9]
>> dlm              1     testgfs1  00020005 none
>> [1 2 3 4 5 6]
>> gfs              2     testgfs1  00010005 none
>> [1 2 3 4 5 6]
>> [root at cluster1 ~]#
>>
>>
>> Bad/broken Node:
>>
>> [root at cluster7 ~]# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>   1   M    808   2008-07-31 13:24:24  cluster1-rhc
>>   2   M    808   2008-07-31 13:24:24  cluster2-rhc
>>   3   M    808   2008-07-31 13:24:24  cluster3-rhc
>>   4   M    808   2008-07-31 13:24:24  cluster4-rhc
>>   5   M    808   2008-07-31 13:24:24  cluster5-rhc
>>   6   M    808   2008-07-31 13:24:24  cluster6-rhc
>>   7   M    804   2008-07-31 13:24:24  cluster7-rhc
>>   8   X      0                        cluster8-rhc
>>   9   M    808   2008-07-31 13:24:24  cluster9-rhc
>> [root at cluster7 ~]# cman_tool services
>> type             level name     id       state
>> fence            0     default  00000000 JOIN_STOP_WAIT
>> [1 2 3 4 5 6 7 9]
>> [root at cluster7 ~]# cman_tool status
>> Version: 6.1.0
>> Config Version: 13
>> Cluster Name: test
>> Cluster Id: 1678
>> Cluster Member: Yes
>> Cluster Generation: 808
>> Membership state: Cluster-Member
>> Nodes: 8
>> Expected votes: 9
>> Total votes: 8
>> Quorum: 5
>> Active subsystems: 7
>> Flags: Dirty
>> Ports Bound: 0
>> Node name: cluster7-rhc
>> Node ID: 7
>> Multicast addresses: 239.192.6.148
>> Node addresses: 10.128.161.87
>> [root at cluster7 ~]# group_tool
>> type             level name     id       state
>> fence            0     default  00000000 JOIN_STOP_WAIT
>> [1 2 3 4 5 6 7 9]
>> [root at cluster7 ~]#
>>
>>
>> /var/log/messages:
>>
>> Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was
>> successful
>> Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was
>> successful
>> Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from
>> 12.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from
>> 11.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high
>> seq received 89
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id
>> for ring 324
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member
>> 10.128.161.81:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member
>> 10.128.161.82:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member
>> 10.128.161.83:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member
>> 10.128.161.84:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member
>> 10.128.161.85:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member
>> 10.128.161.86:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member
>> 10.128.161.89:
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89
>> received flag 1
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate
>> any messages in recovery.
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.87)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.88)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the
>> primary component and will provide service.
>> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL
>> state.
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.81
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.82
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.83
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.84
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.85
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.86
>> Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.89
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 2
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 3
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 4
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 5
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 6
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 9
>> Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from
>> 11.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high
>> seq received 68
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id
>> for ring 328
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member
>> 10.128.161.81:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member
>> 10.128.161.82:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member
>> 10.128.161.83:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member
>> 10.128.161.84:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member
>> 10.128.161.85:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member
>> 10.128.161.86:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member
>> 10.128.161.87:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.87
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member
>> 10.128.161.89:
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68
>> received flag 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate
>> any messages in recovery.
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.81)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.82)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.83)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.84)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.85)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.86)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.87)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.89)
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(
>> 10.128.161.87)
>> Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the
>> primary component and will provide service.
>> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL
>> state.
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.81
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.82
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.83
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.84
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.85
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.86
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.87
>> Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message
>> 10.128.161.89
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 6
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 9
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 1
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 2
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 3
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 4
>> Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from
>> node 5
>>
>> Thanks!
>>
>> Adam
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080731/de8758e9/attachment.htm>

From fdinitto at redhat.com  Fri Aug  1 09:17:50 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 1 Aug 2008 11:17:50 +0200 (CEST)
Subject: [Linux-cluster] Cluster 2.99.07 (development snapshot) released
Message-ID: <Pine.LNX.4.64.0808011102440.14806@trider-g7>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The cluster team and its community are proud to announce the 2.99.07 
release from the master branch.

The development cycle for 3.0 is proceeding at a very good speed and 
mostlikely one of the next releases will be 3.0alpha1. All features 
designed for 3.0 are being completed and taking a proper shape, the 
library API has been stable for sometime (and will soon be marked as 3.0 
soname). Stay tuned for upcoming updates!

The 2.99.XX releases are _NOT_ meant to be used for production
environments.. yet.

The master branch is the main development tree that receives all new
features, code, clean up and a whole brand new set of bugs,

At some point in time this code will become the 3.0 stable release.

Everybody with test equipment and time to spare, is highly encouraged to
download, install and test the 2.99 releases and more important report
problems.

In order to build the 2.99.07 release you will need:

- - openais svn r1579. Porting to corosync is a work in progress.
- - linux kernel (2.6.26) from
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git
(but userland can run on 2.6.25 in compatibility mode)

NOTE to packagers: the library API/ABI's are _NOT_ stable (hence 2.9). We
are still shipping shared libraries but remember that they can change
anytime without warning. A bunch of new shared libraries have been added.

The new source tarball can be downloaded here:

   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.99.07.tar.gz
   https://fedorahosted.org/releases/c/l/cluster/cluster-2.99.07.tar.gz

In order to use GFS1, the Linux kernel requires a minimal patch:

   ftp://sources.redhat.com/pub/cluster/releases/lockproto-exports.patch
   https://fedorahosted.org/releases/c/l/cluster/lockproto-exports.patch

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 2.99.06):

Andrew Price (1):
       [GFS2] libgfs2: Build with -fPIC

Bob Peterson (14):
       Print log header flags for gfs journals.
       Speed up userspace bitmap manipulation code.
       gfs_fsck crosswrite for block number sanity checking
       Fix some bad references to gfs_tool and gfs_fsck
       Deleted unused function print_map
       Shrink memory 1: eliminate b_size from pseudo-buffer-heads
       Shrink memory 2: get rid of 3 huge in-core bitmaps
       Shrink memory 3: smaller link counts in inode_info
       Better error reporting in gfs2_fsck
       RGRepair: Account for RG blocks inside journals
       gfs2_fsck dupl. blocks between EA and data
       gfs2_edit: Ability to enter "journalX" in block number.
       gfs2_edit: was parsing out gfs1 log descriptors improperly
       gfs2_edit: Improved gfs journal dumps

Christine Caulfield (13):
       [CCS] Set errno when an error occurs.
       [CMAN] Don't use logsys in config modules.
       Revert "[CMAN] Don't use logsys in config modules."
       [CMAN] Don't use logsys in config modules.
       [CCS] Fold ccs_test into ccs_tool and tidy
       [CCS] add -c flag to ccs_tool query
       [CONFIG] Add some more errnos to libccsconfdb
       [CCS] Set return status on failure
       [CCS] Make ccs_tool/ccs_test more consistent
       [CMAN] Fix overridden node names
       [CMAN] pass COROSYNC_ env variables to the daemon
       [CMAN] Display the node's votes in cman_tool status
       qdisk: fix compile error when building without debug.

David Teigland (19):
       gfs_controld: change start message from new members
       gfs_controld: add missing endian conversion
       gfs_controld: byte swap ids earlier
       gfs_controld: close dlm_controld connection
       fenced: improved start messages
       fenced: munge config option code
       fenced: debug logsys options
       dlm_controld: improved start messages
       fenced: complete messages copy start messages
       fenced: munge logging
       dlm_controld: use logsys
       gfs_controld: use logsys
       dlm_controld/gfs_controld: add logging.c file
       groupd: use logsys
       groupd: detect group_mode
       fenced: use group_mode detection
       dlm_controld: use group_mode detection
       gfs_controld: use group_mode detection
       fence_tool: add domain member checks

Fabio M. Di Nitto (42):
       [CCS] Fix LEGACY_CODE ifdef
       [BUILD] Implement --enable_legacy_code in the build system
       [BUILD] Add ccs_test replacement when building legacy_code
       [BUILD] Fix ccs.h include path
       [BUILD] Fix doc install target when building objects outside source tree
       [CCS] Kill obsolted ccs_test
       [RGMANAGER] Port all resource agents to new ccs interface
       [RGMANAGER] Port smb resource agent to ccs_tool
       [BUILD] Fix race condition in oldconfig update/execution
       [RGMANAGER] Use proper ccs_tool query output
       [BUILD] Fix ccs_tool/ccs_test build with new compat code
       [CCS] Inflict hopefully last compat issues love to ccs_t*
       Revert "[RGMANAGER] Use proper ccs_tool query output"
       [RGMANAGER] Port ccs_get to proper ccs_tool output
       [RGMANGER] Fix call to ccs_tool
       [BUILD] Fix ccs_tool linking dir order
       [BUILD] Fix logrotate snippet filename
       [FENCE] Sync fence_apc_snmp from RHEL47 branch
       [BUILD] Fix LOGDIR usage
       [FENCE] Fix fence_apc_snmp logging
       [BUILD] Cleanup linking order for logsys
       [BUILD] Cleanup groupd makefile
       build: update .gitignore
       Revert "fence: port scsi agent to use ccs_tool query and drop XML::LibXML requirement"
       Revert "fence: simplify init script"
       Revert "rgmanger: remove check on cluster.conf from rgmanager init script"
       rgmanger: remove check on cluster.conf from rgmanager init script
       fence: simplify init script
       fence: port scsi agent to use ccs_tool query and drop XML::LibXML requirement
       rgmanager: fix clean target
       cman: init script should not user cluster.conf directly
       rgmanager: init script does not need network config
       config: allow users to override default config file in xmlconfig
       test commit
       Revert "test commit"
       bindings: add first cut of perl Cluster:CCS
       bindings: improve Cluster::CCS description
       build: clean up perl bindings build system
       misc: clean up "char const *" vs "const char *"
       init: standardize init scripts to /etc/sysconfig/cluster
       build: fix bindings build when using external object tree
       bindings: fix CCS.pm doc

Lon Hohberger (2):
       [rgmanager] Add optional save/restore to vm resource
       [qdisk] Make stop_cman="1" work if heuristics fail during initialization

Ryan McCabe (1):
       fence: update apc snmp agent

Ryan O'Hara (3):
       gfs_mkfs: change the way we check to see if a device is mounted
       cman: add option to init script to prevent joining the fence domain
       cman: fix typo (#!/bin/bash) from previous commit

  .gitignore                                       |    7 +
  bindings/perl/Makefile                           |    4 +-
  bindings/perl/ccs/CCS.pm.in                      |  145 +++++
  bindings/perl/ccs/CCS.xs                         |   82 +++
  bindings/perl/ccs/MANIFEST                       |    7 +
  bindings/perl/ccs/META.yml.in                    |   13 +
  bindings/perl/ccs/Makefile.PL                    |   28 +
  bindings/perl/ccs/Makefile.bindings              |   11 +
  bindings/perl/ccs/test.pl                        |   20 +
  bindings/perl/ccs/typemap                        |    1 +
  ccs/ccs_tool/Makefile                            |   35 +-
  ccs/ccs_tool/ccs_tool.c                          |  261 ++++++++-
  ccs/ccs_tool/old_parser.c                        |  688 ----------------------
  ccs/ccs_tool/old_parser.h                        |   64 --
  ccs/ccs_tool/upgrade.c                           |  259 --------
  ccs/ccs_tool/upgrade.h                           |    6 -
  ccs/libccscompat/libccscompat.h                  |    2 +-
  ccs/man/Makefile                                 |    5 +
  ccs/man/ccs_test.8                               |  132 +++++
  cman/cman_tool/cman_tool.h                       |    2 +-
  cman/cman_tool/join.c                            |   19 +-
  cman/cman_tool/main.c                            |    7 +-
  cman/daemon/cman-preconfig.c                     |   35 +-
  cman/init.d/Makefile                             |   16 +-
  cman/init.d/cman                                 |  648 ++++++++++++++++++++
  cman/init.d/cman.in                              |  592 -------------------
  cman/qdisk/main.c                                |    4 +-
  config/libs/libccsconfdb/ccs.h                   |    2 +-
  config/libs/libccsconfdb/libccs.c                |   69 ++-
  config/plugins/ldap/configldap.c                 |   10 +-
  config/plugins/xml/config.c                      |   20 +-
  config/tools/Makefile                            |    2 +-
  config/tools/ccs_test/Makefile                   |   32 -
  config/tools/ccs_test/ccs_test.c                 |  147 -----
  config/tools/man/Makefile                        |    2 +-
  config/tools/man/ccs_test.8                      |  132 -----
  configure                                        |   23 +-
  doc/Makefile                                     |    6 +-
  fence/agents/apc_snmp/fence_apc_snmp.py          |  581 +++++++++++--------
  fence/agents/scsi/fence_scsi.pl                  |   22 +-
  fence/agents/scsi/fence_scsi_test.pl             |   26 +-
  fence/agents/scsi/scsi_reserve                   |   24 +-
  fence/fence_tool/fence_tool.c                    |  260 ++++-----
  fence/fenced/Makefile                            |    6 +-
  fence/fenced/config.c                            |   68 ++-
  fence/fenced/config.h                            |   29 +
  fence/fenced/cpg.c                               |  565 +++++++++++-------
  fence/fenced/fd.h                                |   40 +-
  fence/fenced/group.c                             |   29 +
  fence/fenced/logging.c                           |   42 +-
  fence/fenced/main.c                              |   90 ++--
  fence/fenced/member_cman.c                       |    3 +-
  fence/fenced/recover.c                           |   21 +-
  fence/libfenced/libfenced.h                      |    3 +
  gfs/gfs_mkfs/main.c                              |   29 +-
  gfs2/edit/hexedit.c                              |  290 +++++++---
  gfs2/edit/savemeta.c                             |    9 +-
  gfs2/fsck/eattr.c                                |   21 +-
  gfs2/fsck/eattr.h                                |   20 +-
  gfs2/fsck/fs_recovery.c                          |    4 +-
  gfs2/fsck/fsck.h                                 |    5 +-
  gfs2/fsck/initialize.c                           |   10 +-
  gfs2/fsck/lost_n_found.c                         |    7 +-
  gfs2/fsck/main.c                                 |   35 +-
  gfs2/fsck/metawalk.c                             |  177 ++++--
  gfs2/fsck/metawalk.h                             |   16 +-
  gfs2/fsck/pass1.c                                |  405 +++++++++-----
  gfs2/fsck/pass1b.c                               |   95 ++--
  gfs2/fsck/pass1c.c                               |   69 ++-
  gfs2/fsck/pass2.c                                |   61 ++-
  gfs2/fsck/pass3.c                                |   20 +-
  gfs2/fsck/pass4.c                                |   11 +-
  gfs2/fsck/pass5.c                                |    2 +-
  gfs2/fsck/rgrepair.c                             |   58 ++-
  gfs2/libgfs2/Makefile                            |    1 +
  gfs2/libgfs2/bitmap.c                            |   79 ++-
  gfs2/libgfs2/block_list.c                        |  232 ++++----
  gfs2/libgfs2/buf.c                               |    1 -
  gfs2/libgfs2/fs_bits.c                           |    2 +-
  gfs2/libgfs2/fs_ops.c                            |   38 +-
  gfs2/libgfs2/libgfs2.h                           |   93 ++-
  gfs2/libgfs2/recovery.c                          |    2 +-
  gfs2/libgfs2/rgrp.c                              |    8 +
  group/daemon/Makefile                            |   10 +-
  group/daemon/app.c                               |    3 +
  group/daemon/cpg.c                               |  369 ++++++++++++
  group/daemon/gd_internal.h                       |   51 ++-
  group/daemon/logging.c                           |  170 ++++++
  group/daemon/main.c                              |  177 ++++++-
  group/dlm_controld/Makefile                      |    8 +-
  group/dlm_controld/config.c                      |   39 ++-
  group/dlm_controld/config.h                      |    5 +-
  group/dlm_controld/cpg.c                         |  350 ++++++------
  group/dlm_controld/dlm_daemon.h                  |   34 +-
  group/dlm_controld/group.c                       |   29 +
  group/dlm_controld/logging.c                     |  171 ++++++
  group/dlm_controld/main.c                        |   63 +--
  group/dlm_controld/member_cman.c                 |    3 +-
  group/gfs_controld/Makefile                      |    6 +-
  group/gfs_controld/config.c                      |   59 ++-
  group/gfs_controld/config.h                      |    5 +-
  group/gfs_controld/cpg-new.c                     |  188 ++++---
  group/gfs_controld/gfs_daemon.h                  |   44 ++-
  group/gfs_controld/group.c                       |   29 +
  group/gfs_controld/logging.c                     |  171 ++++++
  group/gfs_controld/main.c                        |   52 ++-
  group/gfs_controld/member_cman.c                 |    1 +
  group/gfs_controld/util.c                        |    1 +
  group/lib/libgroup.c                             |   25 +
  group/lib/libgroup.h                             |    2 +
  make/binding-passthrough.mk                      |    7 +
  make/defines.mk.input                            |    3 +-
  make/fencebuild.mk                               |    1 +
  make/install.mk                                  |    4 +-
  make/perl-binding-common.mk                      |   30 +
  rgmanager/init.d/Makefile                        |   12 +-
  rgmanager/init.d/rgmanager                       |  141 +++++
  rgmanager/init.d/rgmanager.in                    |  154 -----
  rgmanager/src/resources/apache.sh                |   11 +-
  rgmanager/src/resources/mysql.sh                 |   12 +-
  rgmanager/src/resources/named.sh                 |   11 +-
  rgmanager/src/resources/openldap.sh              |   12 +-
  rgmanager/src/resources/postgres-8.sh            |   12 +-
  rgmanager/src/resources/samba.sh                 |   12 +-
  rgmanager/src/resources/smb.sh                   |  104 +---
  rgmanager/src/resources/tomcat-5.sh              |   12 +-
  rgmanager/src/resources/utils/config-utils.sh.in |   66 +--
  rgmanager/src/resources/utils/messages.sh        |    4 -
  rgmanager/src/resources/vm.sh                    |   30 +
  129 files changed, 5659 insertions(+), 4191 deletions(-)

- --
I'm going to make him an offer he can't refuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQIVAwUBSJLUxAgUGcMLQ3qJAQKMCw//Ud5jm6xhZlrUJvAhB3JsnromDFEgJiwt
KYFJ+pzmvfTvkw3q+SyJu8vBSvJ3tFVeu1/fIiFGtVJSiucROKl3ToDhjDUz1Y+4
OYvyMdPMHlw1GK92XnCA8cnKFlejnSMTvgSpfJkWWsOfp/MKB5zwrUBaSKAdutPV
d7Y4nD8zEKhLWgZ76flrq5uPOvGTazU6Q3aNMJJIhyDkrLNSBOTEjIWBRtwtAAMq
RX4mv0aQCgcRPat602BiAVb8+DVHmmxFkjmWjnARi8LypMOxxAEZX5g8dFFWPMC7
C5Quul6AhjAfbzWkOxINjk8aa/i7USqSkwmVkNnkifrcGFdH+Su3pDMzGAOpWSqO
4UPZF00rKqr8hH51BDufCtebieZ5qIyE2yBLpuQSqs5ZGk7oSaa0cog3QqUqhvDf
d32QIbRZ/bR6ChJnQu2IHH8FNZGMscsnkPcNt2BzXVYsgQMJUJtWf44r3H2jCWoO
bsjT1EDJIAgM3urYm09o/jURW8eckYlA5oH5xuQuydOYRr5EKW31W0LNP4PMfWSR
WNBAs0U3vB0RI41v40IqyRWmNqoOIdkBJe59Kb9r5z0Z/AvbASVUES3FCjLv12tY
Gn4CEqiL1ti7kGZpX73W+1ydvYO+ZQUvqP4bfqYNLwB1OPrsUXT6rG5wx2lWs+rn
XAqCkmBqcKo=
=IH1P
-----END PGP SIGNATURE-----



From balajisundar at midascomm.com  Fri Aug  1 10:06:49 2008
From: balajisundar at midascomm.com (Balaji)
Date: Fri, 01 Aug 2008 15:36:49 +0530
Subject: [Linux-cluster] HP ILO Fence Configuration
Message-ID: <4892E039.3050701@midascomm.com>

Dear All,

  Currently i am using HP x6600 Server and I have installed RHEL4 Update 
4 AS Linux and
  RHEL4 Update 4 Support Cluster Suite in my server
  I am new in fence and can any one help me how to configure HP ILO 
fence in my server
  and HP ILO Fence Functionality

Regards
-S.Balaji



From ajeet.singh.raina at logica.com  Fri Aug  1 10:16:05 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Fri, 1 Aug 2008 15:46:05 +0530
Subject: [Linux-cluster] Directories gets Deleted during Failover
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B179E6@in-ex004.groupinfra.com>

Hi,

I have been busy setting up Two Node cluster Setup and find that during
the failover the directories created under mount point gets deleted.
Please do let me know why it is behaving so?


ajeet


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080801/568db7ac/attachment.htm>

From fdinitto at redhat.com  Fri Aug  1 11:08:46 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 1 Aug 2008 13:08:46 +0200 (CEST)
Subject: [Linux-cluster] Cluster 2.03.06 released
Message-ID: <Pine.LNX.4.64.0808011302120.14806@trider-g7>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The cluster team and its vibrant community are proud to announce the 7th
release from the STABLE2 branch: 2.03.06.

The STABLE2 branch collects, on a daily base, all bug fixes and the bare 
minimal changes required to run the cluster on top of the most recent 
Linux kernel (2.6.26) and rock solid openais (0.80.3).

The 2.03.06 release features porting to the 2.6.26 kernel for the kernel 
modules and userland. Userland can also run in compatibility mode with 
2.6.25 kernel.

NOTE The stable2 branch will not build on top of corosync/openais new tree 
for this release. The very latest code from openais that can be used is 
svn r1579. Porting to corosync will happen in future.

The new source tarball can be downloaded here:

   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.06.tar.gz
   https://fedorahosted.org/releases/c/l/cluster/cluster-2.03.06.tar.gz

In order to use GFS1, the Linux kernel requires a minimal patch:

   ftp://sources.redhat.com/pub/cluster/releases/lockproto-exports.patch
   https://fedorahosted.org/releases/c/l/cluster/lockproto-exports.patch

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 2.03.05):

Bob Peterson (15):
       Replace put_inode with drop_inode
       Print log header flags for gfs journals.
       Speed up userspace bitmap manipulation code.
       gfs_fsck crosswrite for block number sanity checking
       Fix some bad references to gfs_tool and gfs_fsck
       Deleted unused function print_map
       Shrink memory 1: eliminate b_size from pseudo-buffer-heads
       Shrink memory 2: get rid of 3 huge in-core bitmaps
       Shrink memory 3: smaller link counts in inode_info
       Better error reporting in gfs2_fsck
       RGRepair: Account for RG blocks inside journals
       gfs2_fsck dupl. blocks between EA and data
       gfs2_edit: Ability to enter "journalX" in block number.
       gfs2_edit: was parsing out gfs1 log descriptors improperly
       gfs2_edit: Improved gfs journal dumps

Christine Caulfield (2):
       [CMAN] Add node votes to 'cman_tool status' output
       cman: revert dirty patch

David Teigland (3):
       gfs_controld: read plocks from dlm or lock_dlm
       fenced: update cman only after complete success
       groupd: ignore nolock gfs

Fabio M. Di Nitto (5):
       [GNBD] Update gnbd to work with 2.6.26
       [GFS] Make gfs build with 2.6.26 (DO NOT USE!)
       [GFS] Fix comment
       [BUILD] Add install/uninstall snippets for documents
       [FENCE] Sync fence_apc_snmp from RHEL47 branch

Lon Hohberger (1):
       [qdisk] Make stop_cman="1" work if heuristics fail during initialization

Ryan McCabe (1):
       fence: update apc snmp agent

Ryan O'Hara (2):
       gfs_mkfs: change the way we check to see if a device is mounted
       cman: add option to init script to prevent joining the fence domain

  cman/cman_tool/main.c                   |    1 +
  cman/daemon/commands.c                  |    3 +-
  cman/init.d/cman.in                     |   93 ++++--
  cman/qdisk/main.c                       |    2 +
  fence/agents/apc_snmp/fence_apc_snmp.py |  581 ++++++++++++++++++-------------
  fence/fenced/agent.c                    |   16 +-
  gfs-kernel/src/gfs/ops_address.c        |    2 +-
  gfs-kernel/src/gfs/ops_super.c          |    7 +-
  gfs-kernel/src/gfs/quota.c              |    4 +-
  gfs/gfs_mkfs/main.c                     |   29 +-
  gfs2/edit/hexedit.c                     |  290 ++++++++++++----
  gfs2/edit/savemeta.c                    |    9 +-
  gfs2/fsck/eattr.c                       |   21 +-
  gfs2/fsck/eattr.h                       |   20 +-
  gfs2/fsck/fs_recovery.c                 |    4 +-
  gfs2/fsck/fsck.h                        |    5 +-
  gfs2/fsck/initialize.c                  |   10 +-
  gfs2/fsck/lost_n_found.c                |    7 +-
  gfs2/fsck/main.c                        |   35 +--
  gfs2/fsck/metawalk.c                    |  177 +++++++----
  gfs2/fsck/metawalk.h                    |   16 +-
  gfs2/fsck/pass1.c                       |  405 ++++++++++++++--------
  gfs2/fsck/pass1b.c                      |   95 +++---
  gfs2/fsck/pass1c.c                      |   69 +++--
  gfs2/fsck/pass2.c                       |   61 ++--
  gfs2/fsck/pass3.c                       |   20 +-
  gfs2/fsck/pass4.c                       |   11 +-
  gfs2/fsck/pass5.c                       |    2 +-
  gfs2/fsck/rgrepair.c                    |   58 +++-
  gfs2/libgfs2/bitmap.c                   |   79 ++++-
  gfs2/libgfs2/block_list.c               |  232 ++++++-------
  gfs2/libgfs2/buf.c                      |    1 -
  gfs2/libgfs2/fs_bits.c                  |    2 +-
  gfs2/libgfs2/fs_ops.c                   |   38 +-
  gfs2/libgfs2/libgfs2.h                  |   93 ++++--
  gfs2/libgfs2/recovery.c                 |    2 +-
  gfs2/libgfs2/rgrp.c                     |    8 +
  gnbd-kernel/src/gnbd.c                  |   91 +++---
  gnbd-kernel/src/gnbd.h                  |    4 +-
  group/daemon/main.c                     |   28 ++-
  group/gfs_controld/lock_dlm.h           |    1 +
  group/gfs_controld/plock.c              |  254 +++++++++++---
  make/install.mk                         |    4 +
  make/uninstall.mk                       |    3 +
  44 files changed, 1841 insertions(+), 1052 deletions(-)

- --
I'm going to make him an offer he can't refuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQIVAwUBSJLuxAgUGcMLQ3qJAQJu0xAApnUtjXaP72FlznIFBXHIyIvDWxozRf9u
HNSAM7dO94Iu2nCUyuehFKNNzyL80s9U/LhrTZfokxwTqHLp3YGYAzcMJ2WmqDDp
DiskzoofGbYp2BT3LBeZKuNeGi+eWoK4C6kfgKMTGpex/1CrSdT4lm/x9kya+zwR
h24fl1kp74z+90gcU5aqkwb6GbDdmu9CLmUrufciHsaLAx6Cw96SU794BRpOBNiH
zw1deZMHvnNQYlJmBF0icpHS3GbdKF/wNt2m3ux1fPcAsaDRbSLfkyqgxd3qaC8p
fOGh1seQIW8iefh/2kJlSmcZ8D2SOnycdyXK7wLKUMOuXNjbxgLHjguXjqaKsg6V
oxQaY6IWuczW47KOdti6A3SNU86obz74zc8D+7LXPbf3HC7TIvqvgCwl6RJ7ODSs
0sbgZ6QYZvNlN3hwGnuaE2dh5UgsL5foUgogJSgJ4alTp6RCXPwv8Lm9uGAtcT6l
BMull8I/R+/SmLHi8bnXm/w/7HSCziT8CZhXIwXkBTkTkt7V4s30o8QJOAABDxp0
ehavfsjqX/ualz4CKFykEKi3CIbXvXqrxcYrncNd8UWcHrLNQHNbEQ0xsnmrvhgj
zVjNWbPnfa/FEOjMjLZ1xqnSXXGpIzR7bjoOy2PUZ3THmhwq85nf9Eyo+56Dzgdi
IkL0+pbpH4Q=
=brA9
-----END PGP SIGNATURE-----



From ozgurakan at gmail.com  Fri Aug  1 13:33:59 2008
From: ozgurakan at gmail.com (Ozgur Akan)
Date: Fri, 1 Aug 2008 09:33:59 -0400
Subject: [Linux-cluster] network for cluster communication
Message-ID: <68f132770808010633t1d6421f2va9adaf388ac7480e@mail.gmail.com>

Hi,

I have two important questions regardin cluster performance.

I attached two ethernet cards as second interfaces on two nodes that I have.


- How can I configure cluster to use this new interface (network) to
communicate between eachother.?

- Is speed of this local network between two nodes an important criteria for
file locks on GFS ?

thanks,
Ozgur Akan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080801/6be0eda5/attachment.htm>

From rpeterso at redhat.com  Fri Aug  1 13:34:40 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 01 Aug 2008 08:34:40 -0500
Subject: [Linux-cluster] Directories gets Deleted during Failover
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B179E6@in-ex004.groupinfra.com>
References: <0139539A634FD04A99C9B8880AB70CB209B179E6@in-ex004.groupinfra.com>
Message-ID: <1217597680.9521.31.camel@technetium.msp.redhat.com>

Hi Ajeet,

On Fri, 2008-08-01 at 15:46 +0530, Singh Raina, Ajeet wrote:
> Hi,
> 
> I have been busy setting up Two Node cluster Setup and find that
> during the failover the directories created under mount point gets
> deleted.
> 
> Please do let me know why it is behaving so?

You haven't given us enough information.
You haven't even said whether the file system is GFS, GFS2, EXT3,
XFS, etc., or NFS over one of the above.  In general, directories
should not just disappear.  Perhaps one of your nodes has the
file system mounted and the other does not, so when failover
occurs, it just looks like the directories are gone?

Regards,

Bob Peterson
Red Hat Clustering & GFS




From ccaulfie at redhat.com  Fri Aug  1 13:38:38 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 01 Aug 2008 14:38:38 +0100
Subject: [Linux-cluster] network for cluster communication
In-Reply-To: <68f132770808010633t1d6421f2va9adaf388ac7480e@mail.gmail.com>
References: <68f132770808010633t1d6421f2va9adaf388ac7480e@mail.gmail.com>
Message-ID: <489311DE.6050701@redhat.com>

Ozgur Akan wrote:
> Hi,
> 
> I have two important questions regardin cluster performance.
> 
> I attached two ethernet cards as second interfaces on two nodes that I 
> have.
> 
> - How can I configure cluster to use this new interface (network) to 
> communicate between eachother.?

Put the host name or IP address of the new interface in cluster.conf, in 
place of the existing host names.


> - Is speed of this local network between two nodes an important criteria 
> for file locks on GFS ?
> 

Yes, very :)


Chrissie



From lhh at redhat.com  Fri Aug  1 19:34:02 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 01 Aug 2008 15:34:02 -0400
Subject: [Linux-cluster] "Inc" column description/semnification
In-Reply-To: <200807311004.23788.linux@vfemail.net>
References: <200807301452.41459.linux@vfemail.net>
	<1217439408.30587.195.camel@ayanami>
	<200807311004.23788.linux@vfemail.net>
Message-ID: <1217619242.11524.214.camel@ayanami>

On Thu, 2008-07-31 at 10:04 +0300, Alex wrote:
> On Wednesday 30 July 2008 20:36, Lon Hohberger wrote:
> > On Wed, 2008-07-30 at 14:52 +0300, Alex wrote:
> > > Hello,
> > >
> > > What does it mean "Inc" column in the output of the cman_tool nodes
> > > command?
> > >
> > > [root at rs2 ~]# cman_tool nodes
> > > Node  Sts   Inc   Joined               Name
> > >    1   M      8   2008-07-30 11:03:12  192.168.113.5
> > >    2   M      4   2008-07-30 10:59:34  192.168.113.4
> > > [root at rs2 ~]#
> > >
> > > Can anybody tell me what represent 4 and 8 in Inc coulmn?
> >
> > Local incarnation # for the node, if I recall correctly.  They usually
> > do not match cluster-wide.
> 
> Because we know what is its name, let me ask you about Inc signification, how 
> can be interpreted and what represent 8 and 4 in above column... 8m, 8pps, 
> 8kbps, 8kv, womans, mans, aliens? In manual and documentation is absolutely 
> missing any info about Inc column!

I'm pretty sure it's the Totem protocol sequence # the local node
recorded for when it first "saw" the node.  The "Joined" time is the
same thing, except it's according to the local node's clock instead of
the Totem token sequence #.

That's all they are.  They don't indicate anything useful for
monitoring.


> And another question: why numbers in Inc column is changing everytime a node 
> is rebooted and remain constant till next reboot?

The sequence # is different the next time the node is "seen".  You'll
also notice the "Joined" value is different.

The "Inc" column and "Joined" column are set at the same time but are
not related to each other value-wise.

-- Lon



From lhh at redhat.com  Fri Aug  1 19:38:18 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 01 Aug 2008 15:38:18 -0400
Subject: [Linux-cluster] how to mount a gfs2 volume on all our
	real	webservers in /var/www/html
In-Reply-To: <200807311322.46231.linux@vfemail.net>
References: <200807311143.22407.linux@vfemail.net>
	<48917D66.7050801@aokaifh.cn>  <200807311322.46231.linux@vfemail.net>
Message-ID: <1217619498.11524.220.camel@ayanami>

On Thu, 2008-07-31 at 13:22 +0300, Alex wrote:
> On Thursday 31 July 2008 11:52, ??? wrote:
> > This is a typical LVS model.
> 
> Indeed is a LVS. I have an router in front of rs1, rs2, rs3 webservers which 
> is configured as LVS with load balancing.
> 
> > Do not add your httpd script and mount script into source in your
> > cluster.conf
> 
> In redhat howto "Example of Setting Up Apache HTTP Server" they are saying to 
> not start httpd server at boot time and leave the cluster to do that! Thats 
> why i added http_service in my cluster.conf.

It's a different use case than what you want.  The one in the
documentation you were reading is referring to failover of a single
instance of httpd, not running httpd on 3 nodes at the same time.

* put your gfs2 volumes in /etc/fstab
* turn on httpd

-- Lon




From lhh at redhat.com  Fri Aug  1 19:40:28 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 01 Aug 2008 15:40:28 -0400
Subject: [Linux-cluster] 2 questions regarding gfs and gfs2
In-Reply-To: <200807311158.29161.linux@vfemail.net>
References: <200807311158.29161.linux@vfemail.net>
Message-ID: <1217619628.11524.224.camel@ayanami>

On Thu, 2008-07-31 at 11:58 +0300, Alex wrote:
> Hello,
> 
> Using conga, to generate cluster.conf file i saw by default, when is choosen 
> GFS File system, in cluster.conf file is generated fsid="35790" and 
> fstype="gfs" vor a gfs volume.
> 
> [snip from my cluster.conf]
> clusterfs device="/dev/myvg1/mylv1" force_unmount="0" 
> fsid="35790" fstype="gfs" mountpoint="/var/www/html
> 
> With this config, mylv1 has failed to mount because /dev/myvg1/mylv1 is gfs2 
> formatted. In this case, I changed manually in cluster.conf 
> fstype="gfs2" (leaving unchanged fsid="35790"), and now mylv1 is mounted 
> without problem.
> 
> Questions:
> - GFS2 has the same fsid as GFS? If not, which value is correct?

fsid is not related to file system types, it's for preserving NFS client
file handles in the event of a server-side failover when devices do not
match up.

> - On centos-5.2, i saw that by default is used GFS2, which many peoples says 
> that is not good for production use. Is this true or in centos/rhel-5.2 this 
> has been changed and GFS2 is enough mature to be considered "production 
> quality"?

No, it's not yet production quality.

-- Lon

> 
> Regards,
> Alx
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Fri Aug  1 19:48:15 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 01 Aug 2008 15:48:15 -0400
Subject: [Linux-cluster] Directories gets Deleted during Failover
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B179E6@in-ex004.groupinfra.com>
References: <0139539A634FD04A99C9B8880AB70CB209B179E6@in-ex004.groupinfra.com>
Message-ID: <1217620095.11524.228.camel@ayanami>

On Fri, 2008-08-01 at 15:46 +0530, Singh Raina, Ajeet wrote:
> Hi,
> 
> I have been busy setting up Two Node cluster Setup and find that
> during the failover the directories created under mount point gets
> deleted.
> 
> Please do let me know why it is behaving so?

Cluster.conf! Cluster.conf!

-- Lon




From j.buzzard at dundee.ac.uk  Sat Aug  2 22:52:14 2008
From: j.buzzard at dundee.ac.uk (Jonathan Buzzard)
Date: Sat, 02 Aug 2008 23:52:14 +0100
Subject: [Linux-cluster] Fencing using iDRAC/ Dell M600
In-Reply-To: <c1ace6860807300312n1ba2085gf754e872314ab50e@mail.gmail.com>
References: <c1ace6860807290842l5c17ad3bt65319843994c8b65@mail.gmail.com>	<824ffea00807291305w4c542f2fr764ae54a29585897@mail.gmail.com>
	<c1ace6860807300312n1ba2085gf754e872314ab50e@mail.gmail.com>
Message-ID: <4894E51E.2050205@dundee.ac.uk>

David J Craigon wrote:
> Are you sure you are using an actual M600 blade chassis? On the ones
> I've got, they speak a different language after the telnet from other
> DRAC cards, hence the problem.
> 

Indeed, they are SMASH-CLP

http://publib.boulder.ibm.com/infocenter/toolsctr/v1r0/index.jsp?topic=/com.ibm.smash1_3.doc/smash_t_usingclp.html

As far as I can make out it is designed to be a vendor neutral out of 
band management processor interface. So a DRAC, ILO, LOM, etc. all look 
the same. I guess in about 10 years when everything in the data centre 
has such an interface it will make life simpler in multi vendor 
environments.

It is full of XML goodness if that sort of stuff is your cup of tea, and 
is supposed to be easier to script up.

You can get it on a standard DRAC5 by issuing a smclp command after login.

All that said it is the most tortuous pile of dino droppings I have had 
the misfortune to use. Not helped by a lack of documentation. Looks like 
it came right out of the same committee that dreamt up ACPI.


JAB.

-- 
Jonathan A. Buzzard                      Tel: +441382-386998
Storage Administrator, College of Life Sciences
University of Dundee, DD1 5EH



From brettcave at gmail.com  Mon Aug  4 08:11:26 2008
From: brettcave at gmail.com (Brett Cave)
Date: Mon, 4 Aug 2008 10:11:26 +0200
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <4892E039.3050701@midascomm.com>
References: <4892E039.3050701@midascomm.com>
Message-ID: <c0773fd30808040111w1fb8628h5ab1a99730204cc@mail.gmail.com>

On Fri, Aug 1, 2008 at 12:06 PM, Balaji <balajisundar at midascomm.com> wrote:
> Dear All,
>
>  Currently i am using HP x6600 Server and I have installed RHEL4 Update 4 AS
> Linux and
>  RHEL4 Update 4 Support Cluster Suite in my server
>  I am new in fence and can any one help me how to configure HP ILO fence in
> my server
>  and HP ILO Fence Functionality

I have just set it up, have not tested 100%, but what I have so far is:
1) create fence usernames and passwords ILO on each of your devices.
2) Update cluster.conf as follows:
<clusternodes>
        <clusternode name="worker1" nodeid="1">
                <fence>
                        <method name="fabric">
                                <device name="ilo-worker1"/>
                        </method>
                </fence>
        </clusternode>
       <!-- the rest of your nodes here -->
<fencedevices>
        <fencedevice name="ilo-worker1" agent="fence_ilo"
hostname="192.168.0.101" login="fence" passwd="fencepassword"/>
        <!-- 1 fence device for each clusternode, with corrent hostname info -->
</fencedevices>

According to the docs, that SHOULD work, I am still having hanging
issues on access to certain files / directories on GFS, but still
pretty new to it, so not 100% sure whether its related to fencing or
not.

> Regards
> -S.Balaji
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From brettcave at gmail.com  Mon Aug  4 09:02:51 2008
From: brettcave at gmail.com (Brett Cave)
Date: Mon, 4 Aug 2008 11:02:51 +0200
Subject: [Linux-cluster] How to determine what is causing GFS to hang?
Message-ID: <c0773fd30808040202w2f82e7bbp35a347065d8de39a@mail.gmail.com>

Hi,

I have a GFS cluster set up on a fibre SAN.
<?xml version="1.0"?>
<cluster name="mydisk1" config_version="8">

<quorumd interval="3" tko="10" label="myqdisk" votes="5"/>
<cman expected_votes="11" port="6809">
</cman>

<fence_daemon post_join_delay="60" post_fail_delay="30">
</fence_daemon>

<clusternodes>
         <clusternode name="worker1" nodeid="1">
                <fence>
                        <method name="fabric">
                                <device name="ilo-worker1"/>
                        </method>
                </fence>
        </clusternode>
        <!-- repeated for nodes through to worker6 -->
</clusternodes>
<fencedevices>
        <fencedevice name="ilo-worker1" agent="fence_ilo"
hostname="192.168.0.101" login="fence" passwd="fencerPass"/>
         <!-- repeated through to ilo-worker6 -->
</fencedevices>


Selected output from cman_tool status:
Membership state: Cluster-Member
Nodes: 6
Expected votes: 11
Total votes: 11
Quorum: 6
Active subsystems: 7
Flags:

cman_tool nodes (0 = qdisk):
Node  Sts   Inc   Joined               Name
   0   M      0   2008-07-25 03:00:29  /dev/sda1
   1   M   1156   2008-07-25 02:59:16  worker1
   2   M   1160   2008-07-25 02:59:20  worker2
# and so on, all sts columns = M, all have valid Joined time, all have
different Inc column.

cman_tool services - think there might be something here, not sure
what to make of this - is this fencing trying to take place??
[root at hecate ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 none
[1 2 3 4 5 6]
dlm              1     storage  00030001 none
[1 2 3 4 5 6]
dlm              1     cache1   00050001 none
[1 2 3 4 5 6]
gfs              2     storage  00020001 none
[1 2 3 4 5 6]
gfs              2     cache1   00040001 none
[1 2 3 4 5 6]

cache1 and storage are the 2 GFS volumes in the cluster.

when I run an "ls" on a directory in storage, it just hangs. How would
I get GFS to recover from this?

Regards.
Brett



From ben.yarwood at juno.co.uk  Mon Aug  4 11:43:03 2008
From: ben.yarwood at juno.co.uk (Ben Yarwood)
Date: Mon, 4 Aug 2008 12:43:03 +0100
Subject: [Linux-cluster] GFS Mounting Issues
In-Reply-To: <474534909BE4064E853161350C47578E0BABF8EE@ncrmail1.corp.navcan.ca>
References: <474534909BE4064E853161350C47578E0BABF8EE@ncrmail1.corp.navcan.ca>
Message-ID: <047101c8f627$410cbb50$c32631f0$@yarwood@juno.co.uk>

I pretty sure you need to be running fenced and clvmd as well to get this to work, there was a message relating to this in your
original post.

/sbin/mount.gfs: node not a member of the default fence domain
/sbin/mount.gfs: error mounting lockproto lock_dlm

You should see something like this in the output from cman_tool services.

type             level name       id       state       
fence            0     default    00010001 none        
[1 2]
dlm              1     clvmd      00020001 none        
[1 2]
dlm              1     rgmanager  00030001 none        
[1 2]

The fence domain will need to be configured correctly in your cluster.conf file and I believe will start automatically when you
start cman.  There will probably be some errors in your log stating the fence domain couldn't start up when you started cman.

Ben


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Caron,
> Chris
> Sent: 31 July 2008 18:30
> To: linux clustering
> Subject: RE: [Linux-cluster] GFS Mounting Issues
> 
> Bob,
> 
> Thank you for replying; I should have included more information.  I was
> going by the bases people assumed a valid cluster was running (but we
> should never assume that right? :) ).  After your email I ran a few
> status tools to report more information in hopes may have helped guide
> anyone to an answer.  Had you not sent your email, I wouldn't have
> uncovered the very odd one at the bottom of this email.
> 
> [root at node01 ~]# service cman status
> cman is running.
> 
> [root at node01 ~]# clustat
> Cluster Status for rhc1 @ Thu Jul 31 13:21:35 2008
> Member Status: Quorate
> 
>  Member Name                                   ID   Status
>  ------ ----                                   ---- ------
>  node01.rhc1                                      1 Online, Local
>  node02.rhc1                                      2 Online
>  node03.rhc1                                      3 Online
>  node04.rhc1                                      4 Online
>  node05.rhc1                                      5 Offline
> 
> (Note: I tailored the above output so it wouldn't wrap)
> 
> [root at node01 ~]# service rgmanager status
> clurgmgrd (pid 13235) is running...
> 
> [root at node01 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 8
> Cluster Name: rhc1
> Cluster Id: 1575
> Cluster Member: Yes
> Cluster Generation: 36
> Membership state: Cluster-Member
> Nodes: 4
> Expected votes: 5
> Total votes: 4
> Quorum: 3
> Active subsystems: 8
> Flags: Dirty
> Ports Bound: 0 177
> Node name: node01.rhc1
> Node ID: 1
> Multicast addresses: <not important>
> Node addresses: <not important>
> 
> This one concerns me :
> [root at node01 ~]# cman_tool services
> type             level name       id       state
> dlm              1     rgmanager  00010002 FAIL_ALL_STOPPED
> [1 2 3]
> 
> Chris Caron
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster






From pbruna at it-linux.cl  Mon Aug  4 17:30:04 2008
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 4 Aug 2008 13:30:04 -0400 (CLT)
Subject: [Linux-cluster] GFS and Directory with lots of small files
Message-ID: <23892998.97721217871004822.JavaMail.root@lisa.itlinux.cl>

Hi, 
I had found on the list that i can improve the performance of GFS with small files if i adapt the size of the rsbtbl_size/lkbtbl_size values. 
But it also found that this has to be done after loading the dlm module, but before the lockspace is created. What means "before the lockspace is created", before the GFS partitions are mounted? 

How do i do this? 

PD: I send this same email to antoher list by mistake. 

------------------------------------ 
Patricio Bruna V. 
IT Linux Ltda. 
http://www.it-linux.cl 
Fono : (+56-2) 333 0578 - Chile 
Fono: (+54-11) 6632 2760 - Argentina 
M?vil : (+56-09) 8827 0342 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080804/666232f5/attachment.htm>

From grimme at atix.de  Mon Aug  4 20:57:07 2008
From: grimme at atix.de (Marc Grimme)
Date: Mon, 4 Aug 2008 22:57:07 +0200
Subject: [Linux-cluster] GFS and Directory with lots of small files
In-Reply-To: <23892998.97721217871004822.JavaMail.root@lisa.itlinux.cl>
References: <23892998.97721217871004822.JavaMail.root@lisa.itlinux.cl>
Message-ID: <200808042257.07193.grimme@atix.de>

On Monday 04 August 2008 19:30:04 Patricio A. Bruna wrote:
> Hi,
> I had found on the list that i can improve the performance of GFS with
> small files if i adapt the size of the rsbtbl_size/lkbtbl_size values. But
> it also found that this has to be done after loading the dlm module, but
> before the lockspace is created. What means "before the lockspace is
> created", before the GFS partitions are mounted?
>
> How do i do this?
Umount gfs fs. Add changes to the proc-fs in a resource skript that is startet 
before gfs is mounted  and apply it. Then remount the gfs and there you go.

The lockspaces will get created for every filesystem when the filesystem is 
mounted.

-marc.
>
> PD: I send this same email to antoher list by mistake.
>
> ------------------------------------
> Patricio Bruna V.
> IT Linux Ltda.
> http://www.it-linux.cl
> Fono : (+56-2) 333 0578 - Chile
> Fono: (+54-11) 6632 2760 - Argentina
> M?vil : (+56-09) 8827 0342



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss




From schlegel at riege.com  Mon Aug  4 22:16:56 2008
From: schlegel at riege.com (Gunther Schlegel)
Date: Tue, 05 Aug 2008 00:16:56 +0200
Subject: [Linux-cluster] How to fence a virtual machine in a virtual cluster?
In-Reply-To: <200808042257.07193.grimme@atix.de>
References: <23892998.97721217871004822.JavaMail.root@lisa.itlinux.cl>
	<200808042257.07193.grimme@atix.de>
Message-ID: <48977FD8.7000508@riege.com>

Hi,

I am running RHEL Virtual Machines as cluster services on RedHat 5.2 
Dom0 nodes. The virtual machines use clustered logical volumes for 
storage, /etc/xen is located on a gfs filesystem. Cluster management 
using luci from a dedicated admin server. (The entire system works quite 
well. Some load balancing mechanism on the Dom0 nodes would be fine, but 
that is another issue...)


Now I need a second cluster, in fact I need a gfs filesystem shared 
among some of the virtual machines. In general this should not be an 
issue, but how can I fence a virtual machine inside of a virtual 
cluster? Technically 'virsh destroy' on the Dom0 host will do the job.

Though:
a) I cannot define a script for fencing (at least using luci).
b) There is a fencing method for virtual machines in the RHEL 5.2 
cluster, but it is only meant to fence virtual nodes that are part of a 
mixed cluster of physical and virtual nodes.
c) Inside a virtual cluster the "Virtual Machine Fencing" is of no use, 
because the virtual machine itself is a service in *another* cluster. 
One would need an option to define the Dom0 host or cluster.


I somehow object to mix physical with virtual machines inside of a 
cluster (and I do not want to take virtual machines part in the quorum 
of the physical machines. Hypothetically the virtual machines may fence 
the physical nodes, thereby shutting down the entire cluster...) . The 
Dom0-cluster is intended to run VMs only and no other services. The VMs 
are to provide services, and if they need cluster services, I prefer to 
define aditional clusters.

Am I missing something? In fact the ability to define a script for 
fencing would be sufficient from my point of view.

Or is the only real solution to join the VMs in the Dom0 cluster and 
assign a dedicated failover group to them?

any hint is highly appreciated.

best regards, Gunther





-- 
.............................................................
Riege Software International GmbH  Fon: +49 (2159) 9148 0
Mollsfeld 10                       Fax: +49 (2159) 9148 11
40670 Meerbusch                    Web: www.riege.com
Germany                            E-Mail: schlegel at riege.com
---                                ---
Handelsregister:                   Managing Directors:
Amtsgericht Neuss HRB-NR 4207      Christian Riege
USt-ID-Nr.: DE120585842            Gabriele  Riege
                                   Johannes  Riege
.............................................................
           YOU CARE FOR FREIGHT, WE CARE FOR YOU          



-------------- next part --------------
A non-text attachment was scrubbed...
Name: schlegel.vcf
Type: text/x-vcard
Size: 344 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080805/9ffaba73/attachment.vcf>

From Bevan.Broun at ardec.com.au  Mon Aug  4 23:11:49 2008
From: Bevan.Broun at ardec.com.au (Bevan Broun)
Date: Tue, 5 Aug 2008 09:11:49 +1000
Subject: [Linux-cluster] luci cant perform cluster actions. How to debug?
In-Reply-To: <48977FD8.7000508@riege.com>
Message-ID: <6008E5CED89FD44A86D3C376519E1DB2010ACB4260@megatron.ms.a2end.com>


Hi All

I have cluster where luci/ricci seems partially broken. Luci reports the cluster members and cluster state but it will not have a node leave or shutdown the cluster.

Performing shutdown/startup of cluster via the init scripts all works.

We had changed the node names manually in cluster.conf from fully qualified host names to IP addresses (with change added to cman init script to make this work). That's about the only thing I can think that may have caused the issue.

Would a non functioning DNS system cause issues?

How can I go about debugging where the issue is?

Thanks


Bevan Broun
Solutions Architect
Ardec International

http://www.ardec.com.au
http://www.lisasoft.com
http://www.terrapages.com
Sydney
-----------------------
Suite 112,The Lower Deck
19-21 Jones Bay Wharf
Pirrama Road, Pyrmont 2009
Ph:  +61 2 8570 5000
Fax: +61 2 8570 5099



From satoru.satoh at gmail.com  Tue Aug  5 04:02:03 2008
From: satoru.satoh at gmail.com (Satoru SATOH)
Date: Tue, 5 Aug 2008 13:02:03 +0900
Subject: [Linux-cluster] [PATCH] Add network interface select option for
	fence_xvmd
Message-ID: <20080805040202.GA14134@gescom.nrt.redhat.com>

Hello,


# I sent this before but it looks disappered somewhere so that resend it
# again. Excuse me if you received the same mail twice.

It should be useful that fence_xvmd listen on a certain network
interface which manually specified under some conditions such as a
system has multiple network interfaces and the one to default route is
not prefered choice, I think.

The following patch adds the option "-I <interface_name>" to select
network interface fence_xvmd to listen on.

- satoru


 fence/agents/xvm/fence_xvmd.c |    8 ++++----
 fence/agents/xvm/mcast.c      |   21 ++++++++++++++++++---
 fence/agents/xvm/mcast.h      |    4 ++--
 fence/agents/xvm/options.c    |   13 +++++++++++++
 fence/agents/xvm/options.h    |    1 +
 fence/man/fence_xvmd.8        |    3 +++
 6 files changed, 41 insertions(+), 9 deletions(-)
diff --git a/fence/agents/xvm/fence_xvmd.c b/fence/agents/xvm/fence_xvmd.c
index 888f24b..1dc5eba 100644
--- a/fence/agents/xvm/fence_xvmd.c
+++ b/fence/agents/xvm/fence_xvmd.c
@@ -921,7 +921,7 @@ main(int argc, char **argv)
 	unsigned int logmode = 0;
 	char key[MAX_KEY_LEN];
 	int key_len = 0, x;
-	char *my_options = "dfi:a:p:C:U:c:k:u?hLXV";
+	char *my_options = "dfi:a:I:p:C:U:c:k:u?hLXV";
 	cman_handle_t ch = NULL;
 	void *h = NULL;
 
@@ -1031,9 +1031,9 @@ main(int argc, char **argv)
 	}
 
 	if (args.family == PF_INET)
-		mc_sock = ipv4_recv_sk(args.addr, args.port);
+		mc_sock = ipv4_recv_sk(args.addr, args.port, args.ifindex);
 	else
-		mc_sock = ipv6_recv_sk(args.addr, args.port);
+		mc_sock = ipv6_recv_sk(args.addr, args.port, args.ifindex);
 	if (mc_sock < 0) {
 		log_printf(LOG_ERR,
 			   "Could not set up multicast listen socket\n");
@@ -1049,5 +1049,5 @@ main(int argc, char **argv)
 
 	//malloc_dump_table();
 
-	return 0;
+	exit(errno);
 }
diff --git a/fence/agents/xvm/mcast.c b/fence/agents/xvm/mcast.c
index db46328..001e3ac 100644
--- a/fence/agents/xvm/mcast.c
+++ b/fence/agents/xvm/mcast.c
@@ -31,11 +31,12 @@ LOGSYS_DECLARE_SUBSYS ("XVM", SYSLOGLEVEL);
   Sets up a multicast receive socket
  */
 int
-ipv4_recv_sk(char *addr, int port)
+ipv4_recv_sk(char *addr, int port, unsigned int ifindex)
 {
 	int sock;
 	struct ip_mreq mreq;
 	struct sockaddr_in sin;
+	struct ifreq ifreq;
 
 	/* Store multicast address */
 	if (inet_pton(PF_INET, addr,
@@ -74,7 +75,20 @@ ipv4_recv_sk(char *addr, int port)
 	 * Join multicast group
 	 */
 	/* mreq.imr_multiaddr.s_addr is set above */
-	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
+	if (ifindex > 0 && if_indextoname(ifindex, ifreq.ifr_name) != NULL) {
+		ifreq.ifr_addr.sa_family = AF_INET;
+		if (ioctl(sock, SIOCGIFADDR, &ifreq) < 0) {
+			printf("Failed to get address of the interface %d\n",
+				ifindex);
+			mreq.imr_interface.s_addr = htonl(INADDR_ANY);
+		} else {
+			memcpy(&mreq.imr_interface,
+				&((struct sockaddr_in *) &ifreq.ifr_addr)->sin_addr,
+				sizeof(struct in_addr));
+		}
+	} else {
+		mreq.imr_interface.s_addr = htonl(INADDR_ANY);
+	}
 	dbg_printf(4, "Joining multicast group\n");
 	if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
 		       &mreq, sizeof(mreq)) == -1) {
@@ -184,7 +198,7 @@ ipv4_send_sk(char *send_addr, char *addr, int port, struct sockaddr *tgt,
   Sets up a multicast receive (ipv6) socket
  */
 int
-ipv6_recv_sk(char *addr, int port)
+ipv6_recv_sk(char *addr, int port, unsigned int ifindex)
 {
 	int sock, val;
 	struct ipv6_mreq mreq;
@@ -203,6 +217,7 @@ ipv6_recv_sk(char *addr, int port)
 	memcpy(&mreq.ipv6mr_multiaddr, &sin.sin6_addr,
 	       sizeof(struct in6_addr));
 
+	mreq.ipv6mr_interface = (ifindex > 0) ? ifindex : 0;
 
 	/********************************
 	 * SET UP MULTICAST RECV SOCKET *
diff --git a/fence/agents/xvm/mcast.h b/fence/agents/xvm/mcast.h
index 5113f04..08fd6de 100644
--- a/fence/agents/xvm/mcast.h
+++ b/fence/agents/xvm/mcast.h
@@ -4,11 +4,11 @@
 #define IPV4_MCAST_DEFAULT "225.0.0.12"
 #define IPV6_MCAST_DEFAULT "ff05::3:1"
 
-int ipv4_recv_sk(char *addr, int port);
+int ipv4_recv_sk(char *addr, int port, unsigned int ifindex);
 int ipv4_send_sk(char *src_addr, char *addr, int port,
 		 struct sockaddr *src, socklen_t slen,
 		 int ttl);
-int ipv6_recv_sk(char *addr, int port);
+int ipv6_recv_sk(char *addr, int port, unsigned int ifindex);
 int ipv6_send_sk(char *src_addr, char *addr, int port,
 		 struct sockaddr *src, socklen_t slen,
 		 int ttl);
diff --git a/fence/agents/xvm/options.c b/fence/agents/xvm/options.c
index 969ca8d..519f57e 100644
--- a/fence/agents/xvm/options.c
+++ b/fence/agents/xvm/options.c
@@ -82,6 +82,13 @@ assign_address(fence_xvm_args_t *args, struct arg_info *arg, char *value)
 
 
 static inline void
+assign_interface(fence_xvm_args_t *args, struct arg_info *arg, char *value)
+{
+	args->ifindex = if_nametoindex(value);
+}
+
+
+static inline void
 assign_ttl(fence_xvm_args_t *args, struct arg_info *arg, char *value)
 {
 	int ttl;
@@ -299,6 +306,10 @@ static struct arg_info _arg_info[] = {
 	  "Multicast address (default=225.0.0.12 / ff02::3:1)",
 	  assign_address },
 
+	{ 'I', "-I <interface>", NULL,
+	  "Network interface to listen on (default=auto; kernel selects)",
+	  assign_interface },
+
 	{ 'T', "-T <ttl>", "multicast_ttl",
 	  "Multicast time-to-live (in hops; default=2)",
 	  assign_ttl },
@@ -422,6 +433,7 @@ args_init(fence_xvm_args_t *args)
 	args->flags = 0;
 	args->debug = 0;
 	args->ttl = DEFAULT_TTL;
+	args->ifindex = 0;
 }
 
 
@@ -439,6 +451,7 @@ args_print(fence_xvm_args_t *args)
 {
 	dbg_printf(1, "-- args @ %p --\n", args);
 	_pr_str(args->addr);
+	_pr_int(args->ifindex);
 	_pr_str(args->domain);
 	_pr_str(args->key_file);
 	_pr_int(args->op);
diff --git a/fence/agents/xvm/options.h b/fence/agents/xvm/options.h
index 7a2dcca..8720366 100644
--- a/fence/agents/xvm/options.h
+++ b/fence/agents/xvm/options.h
@@ -29,6 +29,7 @@ typedef struct {
 	arg_flags_t flags;
 	int debug;
 	int ttl;
+	unsigned int ifindex;
 } fence_xvm_args_t;
 
 /* Private structure for commandline / stdin fencing args */
diff --git a/fence/man/fence_xvmd.8 b/fence/man/fence_xvmd.8
index 5a47211..05d4720 100644
--- a/fence/man/fence_xvmd.8
+++ b/fence/man/fence_xvmd.8
@@ -36,6 +36,9 @@ IP family to use (auto, ipv4, or ipv6; default = auto)
 Multicast address to listen on (default=225.0.0.12 for ipv4, ff02::3:1
 for ipv6)
 .TP
+\fB-I\fP \fIinterface\fP
+Network interface to use; e.g. eth0 (default: one[s] kernel choosed)
+.TP
 \fB-p\fP \fIport\fP
 Port to use (default=1229)
 .TP



From satoru.satoh at gmail.com  Tue Aug  5 05:06:30 2008
From: satoru.satoh at gmail.com (Satoru SATOH)
Date: Tue, 5 Aug 2008 14:06:30 +0900
Subject: [Linux-cluster] [PATCH] trivial debug print fix for
	fence/agents/xvm/
Message-ID: <20080805050628.GC14134@gescom.nrt.redhat.com>

Hello,

Here is a trivial patch to add missing line breaks in some debug print
lines for fence_xvm*.

- satoru


 fence/agents/xvm/simple_auth.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fence/agents/xvm/simple_auth.c b/fence/agents/xvm/simple_auth.c
index 07bc261..a1f04f4 100644
--- a/fence/agents/xvm/simple_auth.c
+++ b/fence/agents/xvm/simple_auth.c
@@ -381,7 +381,7 @@ read_key_file(char *file, char *key, size_t max_len)
 		}
 
 		if (nread == 0) {
-			dbg_printf(3, "Stopped reading @ %d bytes",
+			dbg_printf(3, "Stopped reading @ %d bytes\n",
 				(int)max_len-remain);
 			break;
 		}
@@ -391,7 +391,7 @@ read_key_file(char *file, char *key, size_t max_len)
 	}
 
 	close(fd);	
-	dbg_printf(3, "Actual key length = %d bytes", (int)max_len-remain);
+	dbg_printf(3, "Actual key length = %d bytes\n", (int)max_len-remain);
 	
 	return (int)(max_len - remain);
 }



From pedroche5 at gmail.com  Tue Aug  5 09:44:47 2008
From: pedroche5 at gmail.com (Pedro Gonzalez Zamora)
Date: Tue, 5 Aug 2008 11:44:47 +0200
Subject: [Linux-cluster] How can I re-assign cluster id
Message-ID: <47311dd20808050244w1de4d3c4i4e4cb14f6ba2bde5@mail.gmail.com>

Dear all


I have two clusters each cluster has two nodes, the first cluster1 starts ok
but de second cluster2 can't start because it gets the same cluster ID that
cluster1 and I don't know why??
I have set diferent cluster name in cluster.conf.

Best Regards
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080805/22b87ec8/attachment.htm>

From ccaulfie at redhat.com  Tue Aug  5 10:02:43 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 05 Aug 2008 11:02:43 +0100
Subject: [Linux-cluster] How can I re-assign cluster id
In-Reply-To: <47311dd20808050244w1de4d3c4i4e4cb14f6ba2bde5@mail.gmail.com>
References: <47311dd20808050244w1de4d3c4i4e4cb14f6ba2bde5@mail.gmail.com>
Message-ID: <48982543.6030607@redhat.com>

Pedro Gonzalez Zamora wrote:
> Dear all
> 
> 
> I have two clusters each cluster has two nodes, the first cluster1 
> starts ok but de second cluster2 can't start because it gets the same 
> cluster ID that cluster1 and I don't know why??
> I have set diferent cluster name in cluster.conf.
> 

It's probably that you've hit a weakness with the cluster name hash, 
it's not perfect by any means. Your options are to change one of the 
cluster names so that they hash to different values or (easier) add

<cman cluster_id="XXX/>

to one or both of the cluster.conf files to force a different cluster ID 
number.


Chrissie



From david at craigon.co.uk  Tue Aug  5 10:55:11 2008
From: david at craigon.co.uk (David J Craigon)
Date: Tue, 5 Aug 2008 11:55:11 +0100
Subject: [Linux-cluster] Fencing using iDRAC/ Dell M600
In-Reply-To: <1217451390.3371.3.camel@localhost.localdomain>
References: <c1ace6860807290842l5c17ad3bt65319843994c8b65@mail.gmail.com>
	<4890631E.2060305@redhat.com>
	<c1ace6860807300837q460a988bmeb99c40897718400@mail.gmail.com>
	<1217451390.3371.3.camel@localhost.localdomain>
Message-ID: <c1ace6860808050355v5f8bbedh7d35d485265e3067@mail.gmail.com>

Hello,

Yes, it works on a DRAC 5 in a 1950 too. Yay!

I'm going to resend my patch in a mo- I've added a bit to the documentation.

David

2008/7/30 jim parsons <jparsons at redhat.com>:
> On Wed, 2008-07-30 at 16:37 +0100, David J Craigon wrote:
>> It turns out that the right way to do this is use what Dell call
>> "CMC"- a device that manages all the blades, not just one (just like
>> the DRAC/MC). This is like a mix of the Dell DRAC/MC and DRAC 5 in
>> fence_drac.
>>
>> I've written a patch that adds support for the CMC to fence_drac. This
>> is my first patch ever using git, so hopefully it's good for you.
>>
>> This has been tested on a CMC, but it also changes the code for a Dell
>> 1950. I'm going to get a 1950 and test it tomorrow.
>>
>> Feedback welcomed!
> THANK YOU. SINCERELY. Please update us with test results. If no
> regressions pop up, this is going into the agent ASAP.
>
> THANK YOU.
>
> :)
>
> -Jim, who often feels fenced in
>
>>
>> David
>>
>> ---
>>  fence/agents/drac/fence_drac.pl |   36 +++++++++++++++++++++++++++++-------
>>  1 files changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/fence/agents/drac/fence_drac.pl b/fence/agents/drac/fence_drac.pl
>> index f199814..f96ef22 100644
>> --- a/fence/agents/drac/fence_drac.pl
>> +++ b/fence/agents/drac/fence_drac.pl
>> @@ -38,6 +38,7 @@ my $DRAC_VERSION_MC                 = 'DRAC/MC';
>>  my $DRAC_VERSION_4I                  = 'DRAC 4/I';
>>  my $DRAC_VERSION_4P                  = 'DRAC 4/P';
>>  my $DRAC_VERSION_5                   = 'DRAC 5';
>> +my $DRAC_VERSION_CMC                         = 'CMC';
>>
>>  my $PWR_CMD_SUCCESS                  = "/^OK/";
>>  my $PWR_CMD_SUCCESS_DRAC5    = "/^Server power operation successful$/";
>> @@ -192,10 +193,15 @@ sub login
>>       # DRAC5 prints version controller version info
>>       # only after you've logged in.
>>       if ($drac_version eq $DRAC_VERSION_UNKNOWN) {
>> -             if ($t->waitfor(Match => "/.*\($DRAC_VERSION_5\)/m")) {
>> +
>> +             if (my ($prematch,$match)=$t->waitfor(Match =>
>> "/.*(\($DRAC_VERSION_5\)|$DRAC_VERSION_CMC)/m")) {
>> +                 if ($match=~/$DRAC_VERSION_CMC/) {
>> +                     $drac_version =  $DRAC_VERSION_CMC;
>> +                 } else {
>>                       $drac_version = $DRAC_VERSION_5;
>> +                 }
>>                       $cmd_prompt = "/\\\$ /";
>> -                     $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5;
>> +                 $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5;
>>               } else {
>>                       print "WARNING: unable to detect DRAC version '$_'\n";
>>               }
>> @@ -228,8 +234,10 @@ sub set_power_status
>>       }
>>       elsif ($drac_version eq $DRAC_VERSION_5) {
>>               $cmd = "racadm serveraction $svr_action";
>> -     } else
>> -     {
>> +     }
>> +     elsif ($drac_version eq $DRAC_VERSION_CMC) {
>> +             $cmd = "racadm serveraction -m $modulename $svr_action";
>> +     } else {
>>               $cmd = "serveraction -d 0 $svr_action";
>>       }
>>
>> @@ -271,6 +279,11 @@ sub set_power_status
>>               }
>>       }
>>       fail "failed: unexpected response: '$err'" if defined $err;
>> +
>> +     # on M600 blade systems, after power on or power off, status takes a
>> couple of seconds to report correctly. Wait here before checking
>> status again
>> +     sleep 5;
>> +
>> +
>>  }
>>
>>
>> @@ -285,6 +298,8 @@ sub get_power_status
>>
>>       if ($drac_version eq $DRAC_VERSION_5) {
>>               $cmd = "racadm serveraction powerstatus";
>> +     } elsif ($drac_version eq $DRAC_VERSION_CMC) {
>> +         $cmd = "racadm serveraction powerstatus -m $modulename";
>>       } else {
>>               $cmd = "getmodinfo";
>>       }
>> @@ -306,7 +321,7 @@ sub get_power_status
>>
>>       fail "failed: unkown dialog exception: '$_'" unless (/^$cmd$/);
>>
>> -     if ($drac_version ne $DRAC_VERSION_5) {
>> +     if ($drac_version ne $DRAC_VERSION_5 && $drac_version ne $DRAC_VERSION_CMC) {
>>               #Expect:
>>               #  #<group>     <module>    <presence>  <pwrState>  <health>  <svcTag>
>>               #   1  ---->     chassis    Present         ON      Normal    CQXYV61
>> @@ -335,6 +350,11 @@ sub get_power_status
>>                       if(m/^Server power status: (\w+)/) {
>>                               $status = lc($1);
>>                       }
>> +             }
>> +             elsif ($drac_version eq $DRAC_VERSION_CMC) {
>> +                     if(m/^(\w+)/) {
>> +                         $status = lc($1);
>> +                     }
>>               } else {
>>                       my ($group,$arrow,$module,$presence,$pwrstate,$health,
>>                               $svctag,$junk) = split /\s+/;
>> @@ -364,7 +384,8 @@ sub get_power_status
>>       }
>>
>>       $_=$status;
>> -     if(/^(on|off)$/i)
>> +
>> +     if (/^(on|off)$/i)
>>       {
>>               # valid power states
>>       }
>> @@ -440,6 +461,7 @@ sub do_action
>>               }
>>
>>               set_power_status on;
>> +
>>               fail "failed: $_" unless wait_power_status on;
>>
>>               msg "success: powered on";
>> @@ -641,7 +663,7 @@ if ($drac_version eq $DRAC_VERSION_III_XT)
>>       fail "failed: option 'modulename' not compatilble with DRAC version
>> '$drac_version'"
>>               if defined $modulename;
>>  }
>> -elsif ($drac_version eq $DRAC_VERSION_MC)
>> +elsif ($drac_version eq $DRAC_VERSION_MC || $drac_version eq $DRAC_VERSION_CMC)
>>  {
>>       fail "failed: option 'modulename' required for DRAC version '$drac_version'"
>>               unless  defined $modulename;
>> --
>> 1.5.5.1
>>
>>
>> >From 2899ae4468a69b89346cafba13022a9b214404f2 Mon Sep 17 00:00:00 2001
>> From: David J Craigon <david at craigon.co.uk>
>> Date: Wed, 30 Jul 2008 16:34:24 +0100
>> Subject: Add a comment to state the CMC version this script works on
>>
>> ---
>>  fence/agents/drac/fence_drac.pl |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/fence/agents/drac/fence_drac.pl b/fence/agents/drac/fence_drac.pl
>> index f96ef22..11cc771 100644
>> --- a/fence/agents/drac/fence_drac.pl
>> +++ b/fence/agents/drac/fence_drac.pl
>> @@ -13,6 +13,7 @@
>>  #  PowerEdge 1850    DRAC 4/I        1.35 (Build 09.27)
>>  #  PowerEdge 1850    DRAC 4/I        1.40 (Build 08.24)
>>  #  PowerEdge 1950    DRAC 5          1.0  (Build 06.05.12)
>> +#  PowerEdge M600    CMC             1.01.A05.200803072107
>>  #
>>
>>  use Getopt::Std;
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From pedroche5 at gmail.com  Tue Aug  5 11:03:31 2008
From: pedroche5 at gmail.com (Pedro Gonzalez Zamora)
Date: Tue, 5 Aug 2008 13:03:31 +0200
Subject: [Linux-cluster] How can I re-assign cluster id
In-Reply-To: <48982543.6030607@redhat.com>
References: <47311dd20808050244w1de4d3c4i4e4cb14f6ba2bde5@mail.gmail.com>
	<48982543.6030607@redhat.com>
Message-ID: <47311dd20808050403h7c9ef563re09178bbc47a6eb5@mail.gmail.com>

Dear Christine

I have set <cman cluster_id="1470"/> and I trying again but I get this
error:

cman: unable to set cluster_id

Could you tell me please more about cluster name hash, how it works and how
can I change the values?

Best Regards


2008/8/5 Christine Caulfield <ccaulfie at redhat.com>

> Pedro Gonzalez Zamora wrote:
>
>> Dear all
>>
>>
>> I have two clusters each cluster has two nodes, the first cluster1 starts
>> ok but de second cluster2 can't start because it gets the same cluster ID
>> that cluster1 and I don't know why??
>> I have set diferent cluster name in cluster.conf.
>>
>>
> It's probably that you've hit a weakness with the cluster name hash, it's
> not perfect by any means. Your options are to change one of the cluster
> names so that they hash to different values or (easier) add
>
> <cman cluster_id="XXX/>
>
> to one or both of the cluster.conf files to force a different cluster ID
> number.
>
>
> Chrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080805/fc8aa859/attachment.htm>

From david at craigon.co.uk  Tue Aug  5 11:04:43 2008
From: david at craigon.co.uk (David J Craigon)
Date: Tue,  5 Aug 2008 12:04:43 +0100
Subject: [Linux-cluster] [iDRAC/ Dell M600 3/3] Documentation fix
In-Reply-To: <1217934283-10326-2-git-send-email-david@craigon.co.uk>
References: <1217934283-10326-1-git-send-email-david@craigon.co.uk>
	<1217934283-10326-2-git-send-email-david@craigon.co.uk>
Message-ID: <1217934283-10326-3-git-send-email-david@craigon.co.uk>

---
 fence/agents/drac/fence_drac.pl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fence/agents/drac/fence_drac.pl b/fence/agents/drac/fence_drac.pl
index 11cc771..2af0b22 100644
--- a/fence/agents/drac/fence_drac.pl
+++ b/fence/agents/drac/fence_drac.pl
@@ -66,8 +66,8 @@ sub usage
 	print "  -D <debugfile>   debugging output file\n";
 	print "  -h               usage\n";
 	print "  -l <name>        Login name\n";
-	print "  -m <modulename>  DRAC/MC module name\n";
-	print "  -o <string>      Action: reboot (default), off or on\n";
+	print "  -m <modulename>  DRAC/MC or CMC module name\n";
+	print "  -o <string>      Action: reboot (default), off, on or status\n";
 	print "  -p <string>      Login password\n";
 	print "  -S <path>        Script to run to retrieve password\n";
 	print "  -q               quiet mode\n";
-- 
1.5.5.1



From david at craigon.co.uk  Tue Aug  5 11:04:42 2008
From: david at craigon.co.uk (David J Craigon)
Date: Tue,  5 Aug 2008 12:04:42 +0100
Subject: [Linux-cluster] [iDRAC/ Dell M600 2/3] Add a comment to state the
	CMC version this script works on
In-Reply-To: <1217934283-10326-1-git-send-email-david@craigon.co.uk>
References: <1217934283-10326-1-git-send-email-david@craigon.co.uk>
Message-ID: <1217934283-10326-2-git-send-email-david@craigon.co.uk>

---
 fence/agents/drac/fence_drac.pl |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fence/agents/drac/fence_drac.pl b/fence/agents/drac/fence_drac.pl
index f96ef22..11cc771 100644
--- a/fence/agents/drac/fence_drac.pl
+++ b/fence/agents/drac/fence_drac.pl
@@ -13,6 +13,7 @@
 #  PowerEdge 1850	DRAC 4/I	1.35 (Build 09.27)
 #  PowerEdge 1850	DRAC 4/I	1.40 (Build 08.24)
 #  PowerEdge 1950	DRAC 5		1.0  (Build 06.05.12)
+#  PowerEdge M600	CMC		1.01.A05.200803072107
 #
 
 use Getopt::Std;
-- 
1.5.5.1



From david at craigon.co.uk  Tue Aug  5 11:04:41 2008
From: david at craigon.co.uk (David J Craigon)
Date: Tue,  5 Aug 2008 12:04:41 +0100
Subject: [Linux-cluster] [iDRAC/ Dell M600 1/3] Fencing support for Dell
	M600 CMC (a DRAC in diguise)
In-Reply-To: <1217451390.3371.3.camel@localhost.localdomain>
References: <1217451390.3371.3.camel@localhost.localdomain>
Message-ID: <1217934283-10326-1-git-send-email-david@craigon.co.uk>

---
 fence/agents/drac/fence_drac.pl |   36 +++++++++++++++++++++++++++++-------
 1 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fence/agents/drac/fence_drac.pl b/fence/agents/drac/fence_drac.pl
index f199814..f96ef22 100644
--- a/fence/agents/drac/fence_drac.pl
+++ b/fence/agents/drac/fence_drac.pl
@@ -38,6 +38,7 @@ my $DRAC_VERSION_MC			= 'DRAC/MC';
 my $DRAC_VERSION_4I			= 'DRAC 4/I';
 my $DRAC_VERSION_4P			= 'DRAC 4/P';
 my $DRAC_VERSION_5			= 'DRAC 5';
+my $DRAC_VERSION_CMC   			= 'CMC';
 
 my $PWR_CMD_SUCCESS			= "/^OK/";
 my $PWR_CMD_SUCCESS_DRAC5	= "/^Server power operation successful$/";
@@ -192,10 +193,15 @@ sub login
 	# DRAC5 prints version controller version info
 	# only after you've logged in.
 	if ($drac_version eq $DRAC_VERSION_UNKNOWN) {
-		if ($t->waitfor(Match => "/.*\($DRAC_VERSION_5\)/m")) {
+	    
+		if (my ($prematch,$match)=$t->waitfor(Match => "/.*(\($DRAC_VERSION_5\)|$DRAC_VERSION_CMC)/m")) {
+		    if ($match=~/$DRAC_VERSION_CMC/) {
+			$drac_version =  $DRAC_VERSION_CMC;
+		    } else {
 			$drac_version = $DRAC_VERSION_5;
+		    }
 			$cmd_prompt = "/\\\$ /";
-			$PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5;
+		    $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5;
 		} else {
 			print "WARNING: unable to detect DRAC version '$_'\n";
 		}
@@ -228,8 +234,10 @@ sub set_power_status
 	}
 	elsif ($drac_version eq $DRAC_VERSION_5) {
 		$cmd = "racadm serveraction $svr_action";
-	} else
-	{
+	} 
+	elsif ($drac_version eq $DRAC_VERSION_CMC) {
+		$cmd = "racadm serveraction -m $modulename $svr_action";
+	} else {
 		$cmd = "serveraction -d 0 $svr_action";
 	}
 
@@ -271,6 +279,11 @@ sub set_power_status
 		}
 	}
 	fail "failed: unexpected response: '$err'" if defined $err;
+
+	# on M600 blade systems, after power on or power off, status takes a couple of seconds to report correctly. Wait here before checking status again
+	sleep 5;
+
+
 }
 
 
@@ -285,6 +298,8 @@ sub get_power_status
 
 	if ($drac_version eq $DRAC_VERSION_5) {
 		$cmd = "racadm serveraction powerstatus";
+	} elsif ($drac_version eq $DRAC_VERSION_CMC) {
+	    $cmd = "racadm serveraction powerstatus -m $modulename";
 	} else {
 		$cmd = "getmodinfo";
 	}
@@ -306,7 +321,7 @@ sub get_power_status
 
 	fail "failed: unkown dialog exception: '$_'" unless (/^$cmd$/);
 
-	if ($drac_version ne $DRAC_VERSION_5) {
+	if ($drac_version ne $DRAC_VERSION_5 && $drac_version ne $DRAC_VERSION_CMC) {
 		#Expect:
 		#  #<group>     <module>    <presence>  <pwrState>  <health>  <svcTag>
 		#   1  ---->     chassis    Present         ON      Normal    CQXYV61
@@ -335,6 +350,11 @@ sub get_power_status
 			if(m/^Server power status: (\w+)/) {
 				$status = lc($1);
 			}
+		}
+		elsif ($drac_version eq $DRAC_VERSION_CMC) {
+			if(m/^(\w+)/) {
+			    $status = lc($1);
+			}
 		} else {
 			my ($group,$arrow,$module,$presence,$pwrstate,$health,
 				$svctag,$junk) = split /\s+/;
@@ -364,7 +384,8 @@ sub get_power_status
 	}
 
 	$_=$status;
-	if(/^(on|off)$/i)
+	
+	if (/^(on|off)$/i)
 	{
 		# valid power states 
 	}
@@ -440,6 +461,7 @@ sub do_action
 		}
 			
 		set_power_status on;
+		
 		fail "failed: $_" unless wait_power_status on;
 
 		msg "success: powered on";
@@ -641,7 +663,7 @@ if ($drac_version eq $DRAC_VERSION_III_XT)
 	fail "failed: option 'modulename' not compatilble with DRAC version '$drac_version'" 
 		if defined $modulename;
 }
-elsif ($drac_version eq $DRAC_VERSION_MC)
+elsif ($drac_version eq $DRAC_VERSION_MC || $drac_version eq $DRAC_VERSION_CMC)
 {
 	fail "failed: option 'modulename' required for DRAC version '$drac_version'"
 		unless  defined $modulename;
-- 
1.5.5.1



From ccaulfie at redhat.com  Tue Aug  5 11:44:16 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 05 Aug 2008 12:44:16 +0100
Subject: [Linux-cluster] How can I re-assign cluster id
In-Reply-To: <47311dd20808050403h7c9ef563re09178bbc47a6eb5@mail.gmail.com>
References: <47311dd20808050244w1de4d3c4i4e4cb14f6ba2bde5@mail.gmail.com>	<48982543.6030607@redhat.com>
	<47311dd20808050403h7c9ef563re09178bbc47a6eb5@mail.gmail.com>
Message-ID: <48983D10.5000308@redhat.com>

Pedro Gonzalez Zamora wrote:
> Dear Christine
> 
> I have set <cman cluster_id="1470"/> and I trying again but I get this 
> error:
> 
> cman: unable to set cluster_id
> 
> Could you tell me please more about cluster name hash, how it works and 
> how can I change the values?
> 

It sounds like you must have a rather old RHEL4 installation - the 
cluster_id changing code has been in there for a very long time now.

The 'trick' to making the cluster names hash to unique values is simply 
to make the names very different really. Avoid long, similar, names that 
end in numbers for example.

> 
> 2008/8/5 Christine Caulfield <ccaulfie at redhat.com 
> <mailto:ccaulfie at redhat.com>>
> 
>     Pedro Gonzalez Zamora wrote:
> 
>         Dear all
> 
> 
>         I have two clusters each cluster has two nodes, the first
>         cluster1 starts ok but de second cluster2 can't start because it
>         gets the same cluster ID that cluster1 and I don't know why??
>         I have set diferent cluster name in cluster.conf.
> 
> 
>     It's probably that you've hit a weakness with the cluster name hash,
>     it's not perfect by any means. Your options are to change one of the
>     cluster names so that they hash to different values or (easier) add
> 
>     <cman cluster_id="XXX/>
> 
>     to one or both of the cluster.conf files to force a different
>     cluster ID number.
> 
> 
>     Chrissie
> 
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 

Chrissie



From rhurst at bidmc.harvard.edu  Tue Aug  5 12:31:34 2008
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Tue, 5 Aug 2008 08:31:34 -0400
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <c0773fd30808040111w1fb8628h5ab1a99730204cc@mail.gmail.com>
References: <4892E039.3050701@midascomm.com>
	<c0773fd30808040111w1fb8628h5ab1a99730204cc@mail.gmail.com>
Message-ID: <1217939494.2704.20.camel@WSBID06223.bidmc.harvard.edu>

Does not sound like you are having a fencing issue, but I can share our
configuration / implementation and experiences with it.

We have been using fencing configured for HP iLO and iLO2 for the better
of 2-years, with almost a full year in production now.  It is slow (42+
seconds per fencing attempt) and always problematic.  We are piloting
IBM BladeCenter as a replacement form factor over the incumbent HP DL385
servers.  Today, we use Nagios to monitor iLO's ssh / https services,
because they can go down without warning -- requiring you to reset
services via their firmware interface at BIOS boot-up time -- ugh!

Also, make certain you are running v1.89 on iLO cards; I forget what we
are using for iLO2 (v1.43 jumps into mind).

Simple to configure (password is in plaintext, but my iLO cards are
wired to an internal switch in the rack, hidden behind Cluster Suite's
network director):

<clusternodes>
		<clusternode name="net1" votes="5">
			<fence>
				<method name="1">
					<device name="net1ilo"/>
				</method>
			</fence>
		</clusternode>

   ... 10 more entries ...

</clusternodes>

<fencedevices>
	<fencedevice agent="fence_ilo" hostname="net1ilo.cad.rack"
login="Administrator" name="net1ilo" passwd="cadnet1tendac"/>

   ... 10 more entries ...

</fencedevices>

I also wrote a wrapper script named "power" around fence_ilo for
testing, and for other maintenance scripts, i.e.,
power reboot net1

#!/bin/bash
# 
# rev 07-May-2007 RHurst
# 

CHOICES=( "off on reboot status" )

COMMAND=$1
while [ -z "${COMMAND}" ]; do
	echo -n "Command (${CHOICES[@]})? "
	read COMMAND
	[ -z "${COMMAND}" ] && exit
done

HOST=${2}
HOSTIP="`dig +short ${HOST}ilo.cad.rack`"
while [ -z "${HOSTIP}" ]; do
	echo -n "host? "
	read HOST
	[ -z "${HOST}" ] && exit
	HOSTIP="`dig +short ${HOST}ilo.cad.rack`"
done

PASSWD=
[ "${HOST:0:3}" = "net" ] && PASSWD="cad${HOST}tendac"
[ "${HOST:0:3}" = "app" ] && PASSWD="cad${HOST}ppadac"
[ "${HOST:0:2}" = "db" ] && PASSWD="cad${HOST}bddac"
[ -z "${PASSWD}" ] && exit

[ $# -lt 2 ] && echo -n "Sending '${COMMAND}' to ${HOSTIP} iLO  :  "
fence_ilo -a ${HOSTIP} -l Administrator -p ${PASSWD} -o ${COMMAND}



________________________________________________________________________
??Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.


On Mon, 2008-08-04 at 10:11 +0200, Brett Cave wrote:

> On Fri, Aug 1, 2008 at 12:06 PM, Balaji <balajisundar at midascomm.com> wrote:
> > Dear All,
> >
> >  Currently i am using HP x6600 Server and I have installed RHEL4 Update 4 AS
> > Linux and
> >  RHEL4 Update 4 Support Cluster Suite in my server
> >  I am new in fence and can any one help me how to configure HP ILO fence in
> > my server
> >  and HP ILO Fence Functionality
> 
> I have just set it up, have not tested 100%, but what I have so far is:
> 1) create fence usernames and passwords ILO on each of your devices.
> 2) Update cluster.conf as follows:
> <clusternodes>
>         <clusternode name="worker1" nodeid="1">
>                 <fence>
>                         <method name="fabric">
>                                 <device name="ilo-worker1"/>
>                         </method>
>                 </fence>
>         </clusternode>
>        <!-- the rest of your nodes here -->
> <fencedevices>
>         <fencedevice name="ilo-worker1" agent="fence_ilo"
> hostname="192.168.0.101" login="fence" passwd="fencepassword"/>
>         <!-- 1 fence device for each clusternode, with corrent hostname info -->
> </fencedevices>
> 
> According to the docs, that SHOULD work, I am still having hanging
> issues on access to certain files / directories on GFS, but still
> pretty new to it, so not 100% sure whether its related to fencing or
> not.
> 
> > Regards
> > -S.Balaji
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080805/307912e7/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2178 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080805/307912e7/attachment.p7s>

From bobby.m.dalton at nasa.gov  Tue Aug  5 21:56:04 2008
From: bobby.m.dalton at nasa.gov (Dalton, Maurice)
Date: Tue, 5 Aug 2008 16:56:04 -0500
Subject: [Linux-cluster] 3 node cluster crashes
Message-ID: <EB190CD1E73E1146ACB7694746E205A805B1AE55@hx1.ums.msfc.nasa.gov>

 

I have a 3 node cluster running cman-2.0.84-2.el5.  At times we have
spanning tree events that cause network storms up to 9 seconds. 

When these events  occur (today we caused them twice to verify this
issue). All three nodes go down within seconds of this event.

 

The second time we tried it I added the totem token statement shown
below. Same problem.

 

 

 

 

 

<cman>

                <multicast addr="225.0.0.11"/>

                <totem token="21000"/>

        </cman>

 

 

 

Aug  5 16:41:18 csarcsys2-eth0 ntpd[3484]: kernel time sync enabled 0001

Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] The token was lost
in the OPERATIONAL state.

Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Receive multicast
socket recv buffer size (288000 bytes).

Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).

Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER
state from 2.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER
state from 0.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Creating commit
token because I am the rep.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Saving state aru
46 high seq received 46

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Storing new
sequence id for ring b50

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering COMMIT
state.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering RECOVERY
state.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] position [0]
member 172.xx.xx.xxx:

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] previous ring seq
2892 rep 172.xx.xxx.xx

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] aru 46 high
delivered 46 received flag 1

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Did not need to
originate any messages in recovery.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Sending initial
ORF token

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] CLM CONFIGURATION
CHANGE

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] New Configuration:

Aug  5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 1

Aug  5 16:41:24 csarcsys2-eth0 clurgmgrd[3750]: <emerg> #1: Quorum
Dissolved

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.
xx.xxx.xx)

Aug  5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 3

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Left:

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.
xx.xxx.xx)

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.
xx.xxx.xx)

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Joined:

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CMAN ] quorum lost,
blocking activity

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] CLM CONFIGURATION
CHANGE

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] New Configuration:

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.
xx.xxx.xx)

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate.
Refusing connection.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Left:

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing
connect: Connection refused

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Joined:

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified
(-111).

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [SYNC ] This node is
within the primary component and will provide service.

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Someone may be attempting
something evil.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering
OPERATIONAL state.

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing get:
Invalid request descriptor

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] got nodejoin
message 172.24.86.143

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate.
Refusing connection.

Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CPG  ] got joinlist
message from node 2

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing
connect: Connection refused

Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified
(-111).

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080805/6e50c080/attachment.htm>

From Norbert.Nemeth at mscibarra.com  Wed Aug  6 09:30:27 2008
From: Norbert.Nemeth at mscibarra.com (Nemeth, Norbert)
Date: Wed, 6 Aug 2008 11:30:27 +0200
Subject: [Linux-cluster] RE: 3 node cluster crashes
In-Reply-To: <EB190CD1E73E1146ACB7694746E205A805B1AE55@hx1.ums.msfc.nasa.gov>
References: <EB190CD1E73E1146ACB7694746E205A805B1AE55@hx1.ums.msfc.nasa.gov>
Message-ID: <CD5897CBBD759345BB46FDE6A5616C2C10D299A8D1@EU-MAILBOX001.int.msci.com>

Hi,

I have a problem with rgmanager's script resource.
My script uses $OCF_RESKEY_service_name in a following way:

<script file="/usr/local/sbin/cl2r.sh" name="script-VG" service_name="VOLUME GROUP">

It works on volume group: VOLUME GROUP defined in service_name.

If I have multiple services defined using the same script, I got:
clurgmgrd[10143]: <err> Unique attribute collision. type=script attr=file value=/usr/local/sbin/cl2r.sh

Checking /usr/share/cluster/script.sh I found:

        <parameter name="file" unique="1" required="1">
            <longdesc lang="en">
                Path to script
            </longdesc>
            <shortdesc lang="en">
                Path to script
            </shortdesc>
            <content type="string"/>
        </parameter>

Checking latest: (line 40)

http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=blob;f=rgmanager/src/resources/script.sh;h=41298115ccd39863f9f45d5f889e3b6299b3659d;hb=refs/heads/STABLE2#l40

Do you know why this file parameter for script resource has been set to unique?
May I ask to change it to unique="0"?

Best regards,
Norbert N?meth

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dalton, Maurice
Sent: Tuesday, August 05, 2008 11:56 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] 3 node cluster crashes


I have a 3 node cluster running cman-2.0.84-2.el5.  At times we have spanning tree events that cause network storms up to 9 seconds.
When these events  occur (today we caused them twice to verify this issue). All three nodes go down within seconds of this event.

The second time we tried it I added the totem token statement shown below. Same problem.





<cman>
                <multicast addr="225.0.0.11"/>
                <totem token="21000"/>
        </cman>



Aug  5 16:41:18 csarcsys2-eth0 ntpd[3484]: kernel time sync enabled 0001
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 2.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 0.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Creating commit token because I am the rep.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Saving state aru 46 high seq received 46
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Storing new sequence id for ring b50
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering COMMIT state.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering RECOVERY state.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] position [0] member 172.xx.xx.xxx:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] previous ring seq 2892 rep 172.xx.xxx.xx
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] aru 46 high delivered 46 received flag 1
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Did not need to originate any messages in recovery.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Sending initial ORF token
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] New Configuration:
Aug  5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 1
Aug  5 16:41:24 csarcsys2-eth0 clurgmgrd[3750]: <emerg> #1: Quorum Dissolved
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172. xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 3
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Left:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172. xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172. xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Joined:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CMAN ] quorum lost, blocking activity
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] New Configuration:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172. xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate.  Refusing connection.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Left:
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing connect: Connection refused
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Joined:
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified (-111).
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [SYNC ] This node is within the primary component and will provide service.
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Someone may be attempting something evil.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering OPERATIONAL state.
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing get: Invalid request descriptor
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] got nodejoin message 172.24.86.143
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate.  Refusing connection.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CPG  ] got joinlist message from node 2
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing connect: Connection refused
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified (-111).

________________________________
NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.

Local registered entity: MSCI KFT
Metropolitan Court acting as the Court of Registry
Registered office: 1138 Budapest, N?pf?rdo utca 22, Hungary
Registration No. 01-09-885383
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080806/cd6c9d58/attachment.htm>

From jakub.suchy at enlogit.cz  Thu Aug  7 07:33:46 2008
From: jakub.suchy at enlogit.cz (Jakub Suchy)
Date: Thu, 7 Aug 2008 09:33:46 +0200 (CEST)
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <1217939494.2704.20.camel@WSBID06223.bidmc.harvard.edu>
References: <4892E039.3050701@midascomm.com>
	<c0773fd30808040111w1fb8628h5ab1a99730204cc@mail.gmail.com>
	<1217939494.2704.20.camel@WSBID06223.bidmc.harvard.edu>
Message-ID: <11505.217.77.161.17.1218094426.squirrel@www.hal9000.cz>

> Does not sound like you are having a fencing issue, but I can share our
> configuration / implementation and experiences with it.
>
> We have been using fencing configured for HP iLO and iLO2 for the better
> of 2-years, with almost a full year in production now.  It is slow (42+
> seconds per fencing attempt) and always problematic.  We are piloting

Hi,
I am currently implementing cluster using HP iLO and i am experiencing
this slowness too. As far as I have dug into the fence_ilo perl script, it
seems that longest time it takes is for opening a SSL socket to the card.
Also, the script for reboot works like this (pseudocode):

if ($action == reboot) {
  check_status();
  if (status == on) {
    power_off();
    check_status(); // if error...
  }
}

this means 3 operations = 3 sockets = lots of time.
If the script could be rewritten to reuse existing socket, it will be a
lot faster. I just don't know how to determine if the socket is still
alive (then we need to reconnect). Anyone?

(Also, fence_ilo depends on perl-Crypt-SSLeay, but this is not marked as a
dependency in relevant channel, so you have to install it manually - i
should post a bug report to bugzilla).

Jakub



From grimme at atix.de  Thu Aug  7 07:57:38 2008
From: grimme at atix.de (Marc Grimme)
Date: Thu, 7 Aug 2008 09:57:38 +0200
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <11505.217.77.161.17.1218094426.squirrel@www.hal9000.cz>
References: <4892E039.3050701@midascomm.com>
	<1217939494.2704.20.camel@WSBID06223.bidmc.harvard.edu>
	<11505.217.77.161.17.1218094426.squirrel@www.hal9000.cz>
Message-ID: <200808070957.38394.grimme@atix.de>

You might want to take a look at this ILO Fence Agent.
http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-18.noarch.rpm
I think when I wrote it I detected the same problem and fixed it there.
Marc.
On Thursday 07 August 2008 09:33:46 Jakub Suchy wrote:
> > Does not sound like you are having a fencing issue, but I can share our
> > configuration / implementation and experiences with it.
> >
> > We have been using fencing configured for HP iLO and iLO2 for the better
> > of 2-years, with almost a full year in production now.  It is slow (42+
> > seconds per fencing attempt) and always problematic.  We are piloting
>
> Hi,
> I am currently implementing cluster using HP iLO and i am experiencing
> this slowness too. As far as I have dug into the fence_ilo perl script, it
> seems that longest time it takes is for opening a SSL socket to the card.
> Also, the script for reboot works like this (pseudocode):
>
> if ($action == reboot) {
>   check_status();
>   if (status == on) {
>     power_off();
>     check_status(); // if error...
>   }
> }
>
> this means 3 operations = 3 sockets = lots of time.
> If the script could be rewritten to reuse existing socket, it will be a
> lot faster. I just don't know how to determine if the socket is still
> alive (then we need to reconnect). Anyone?
>
> (Also, fence_ilo depends on perl-Crypt-SSLeay, but this is not marked as a
> dependency in relevant channel, so you have to install it manually - i
> should post a bug report to bugzilla).
>
> Jakub
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From sunhux at gmail.com  Thu Aug  7 08:12:02 2008
From: sunhux at gmail.com (sunhux G)
Date: Thu, 7 Aug 2008 16:12:02 +0800
Subject: [Linux-cluster] "last" in Linux do not give info like in Solaris'
	"last" (& how to capture 35 days worth of wtmp)
Message-ID: <60f08e700808070112x81f5abfo38663991223a56d8@mail.gmail.com>

 Hi,

In Solaris, "last" command can give me a lot of info including
the type of login (the source IP addr I login from, whether its
telnet/ssh/ftp).

I can't seem to find anything on this in my Linux
(uname -a    gives 2.4.20-8smp)

Also, how can I tuned wtmp to capture up to 35 days' worth
of data?


TIA
U
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080807/d90d31d1/attachment.htm>

From ajeet.singh.raina at logica.com  Thu Aug  7 09:07:33 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 7 Aug 2008 14:37:33 +0530
Subject: [Linux-cluster] NFS + Cluster
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A09@in-ex004.groupinfra.com>

Hello Guys,

I need your help regarding the Cluster configuration.All I have setup is
Two Node Cluster Configuration with one node as Slave which is outside
the cluster.All I want is mount point be active all the time on the
three nodes for High Availability.
I heard about NFS can do the trick.That is, One need to setup NFS mount
point on all the three nodes.
On Failover cluster mount point may shift but NFS mount point will still
prevails.

Is it true?

Thanks in advance,


Ajeet



This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080807/bb40733b/attachment.htm>

From linux at vfemail.net  Thu Aug  7 12:28:42 2008
From: linux at vfemail.net (Alex)
Date: Thu, 7 Aug 2008 15:28:42 +0300
Subject: [Linux-cluster] SAN, software raid, iscsi or GNBD, GFS
Message-ID: <200808071528.43149.linux@vfemail.net>

Hello all,

I would like to build a SAN using cheap hardware.

Let say that we have N computers (N>8) exporting their volumes (volX, where 
X=N) using ataoe or iscsi or gnbd protocol. Each volX is arround 120GB.

Now, I want:
- to build a GFS cluster filesystem using imported volumes (vol1, 
vol2, ... volN) with high data availability without a single point of failure.
- resulted volume to be used on SERVER1, SERVER2 and SERVER3.

First scenario: 
- ATAoE or ISCSI on computer1 up to computerN to export vol1 up to volN to 
SERVER1, SERVER2 and SERVER3.
- SERVER1 up to SERVER3 are forming my cluster (3 nodes)

1. create mirrors for each imported volumes using mdadm (here we consider each 
volN=120GB)

mdadm -C /dev/md1 -l 1 -n 2 /dev/vol1 /dev/vol2
mdadm -C /dev/md2 -l 1 -n 2 /dev/vol3 /dev/vol4
mdadm -C /dev/md3 -l 1 -n 2 /dev/vol5 /dev/vol6
mdadm -C /dev/md4 -l 1 -n 2 /dev/vol7 /dev/vol8

2. join resulted volumes together using lvm and create a logical volume 
(480GB)
pvcreate /dev/md1 /dev/md2 /dev/md3 /dev/md4
vgcreate myvg ...
lvcreate mylvm ...
mylvm = md1+md2+md3+md4

3. format /dev/myvg/mylv using GFS:
mkfs.gfs -p lock_dlm -t cluster:data -j 3 /dev/myvg/mylv

4. mount /dev/myvg/mylv on all our servers.

I read about software raid problems when used in conjunction with GFS. Is that 
correct?

Also, using ATAoE and ISCSI, i have no fencing mechanism for volumes exported 
by each computerN. How can be implemented fencing in this case?

Second scenario:
- GNBD server installed on each of our computerN to export vol1 up to volN to 
SERVER1, SERVER2 and SERVER3.
- GNBD client installed on SERVER1, SERVER2 and SERVER3
- SERVER1 up to SERVER3 toghether with computerN are forming now my cluster 
(11 nodes)

Pros:
- I read that, real advantage of using GNBD is that it has built in fencing 
so, second scenario seems to be better

Cons:
- Using iSCSI also allows a much more seemless transition to a hardware shared 
storage solution later on
- GNBD seems to be slower then ISCSI and lot of work that needs to be done for 
GNBD to reach its full speed potential.

On this scenario, we will have on each of our SERVERX:
 /dev/gnbd/vol1
 /dev/gnbd/vol2
...
 /dev/gnbd/vol8.

Now, on SERVER1, can i use mdadm to group volumes as above? Is safest then in 
my first scenario?

mdadm -C /dev/md1 -l 1 -n 2 /dev/gnbd/vol1 /dev/gnbd/vol2
mdadm -C /dev/md2 -l 1 -n 2 /dev/gnbd/vol3 /dev/gnbd/vol4
mdadm -C /dev/md3 -l 1 -n 2 /dev/gnbd/vol5 /dev/gnbd/vol6
mdadm -C /dev/md4 -l 1 -n 2 /dev/vgnbd/ol7 /dev/gnbd/vol8

Will be ok too create using lvm, mylvm=md1+md2+md3+md4 && mkfs.gfs && mount it 
on our servers?

So, what to do...?

Regards,
Alx



From jparsons at redhat.com  Thu Aug  7 14:20:34 2008
From: jparsons at redhat.com (jim parsons)
Date: Thu, 07 Aug 2008 10:20:34 -0400
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <200808070957.38394.grimme@atix.de>
References: <4892E039.3050701@midascomm.com>
	<1217939494.2704.20.camel@WSBID06223.bidmc.harvard.edu>
	<11505.217.77.161.17.1218094426.squirrel@www.hal9000.cz>
	<200808070957.38394.grimme@atix.de>
Message-ID: <1218118834.3395.5.camel@localhost.localdomain>

The problem here is that the RIBCL interface insists on a new connection
each time you contact it - so there is alot of buildup/teardown time. We
tried to help this agent's speed by adding a 'force="1"'attribute in the
cluster.conf fencedevice section for it. This attribute just kills it as
fast as possible - and does not do the initial status check. Fencing was
done in less than 7 seconds using this attribute. I'm pretty sure this
is doc'd on the schema page and should be in the man page for fence_ilo
as well.

Hope this helps...

-J

On Thu, 2008-08-07 at 09:57 +0200, Marc Grimme wrote:
> You might want to take a look at this ILO Fence Agent.
> http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-18.noarch.rpm
> I think when I wrote it I detected the same problem and fixed it there.
> Marc.
> On Thursday 07 August 2008 09:33:46 Jakub Suchy wrote:
> > > Does not sound like you are having a fencing issue, but I can share our
> > > configuration / implementation and experiences with it.
> > >
> > > We have been using fencing configured for HP iLO and iLO2 for the better
> > > of 2-years, with almost a full year in production now.  It is slow (42+
> > > seconds per fencing attempt) and always problematic.  We are piloting
> >
> > Hi,
> > I am currently implementing cluster using HP iLO and i am experiencing
> > this slowness too. As far as I have dug into the fence_ilo perl script, it
> > seems that longest time it takes is for opening a SSL socket to the card.
> > Also, the script for reboot works like this (pseudocode):
> >
> > if ($action == reboot) {
> >   check_status();
> >   if (status == on) {
> >     power_off();
> >     check_status(); // if error...
> >   }
> > }
> >
> > this means 3 operations = 3 sockets = lots of time.
> > If the script could be rewritten to reuse existing socket, it will be a
> > lot faster. I just don't know how to determine if the socket is still
> > alive (then we need to reconnect). Anyone?
> >
> > (Also, fence_ilo depends on perl-Crypt-SSLeay, but this is not marked as a
> > dependency in relevant channel, so you have to install it manually - i
> > should post a bug report to bugzilla).
> >
> > Jakub
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 



From miolinux at libero.it  Thu Aug  7 15:55:43 2008
From: miolinux at libero.it (Miolinux)
Date: Thu, 07 Aug 2008 17:55:43 +0200
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <200808070957.38394.grimme@atix.de>
References: <4892E039.3050701@midascomm.com>
	<1217939494.2704.20.camel@WSBID06223.bidmc.harvard.edu>
	<11505.217.77.161.17.1218094426.squirrel@www.hal9000.cz>
	<200808070957.38394.grimme@atix.de>
Message-ID: <1218124543.10161.2.camel@GD-P2-093>

On Thu, 2008-08-07 at 09:57 +0200, Marc Grimme wrote:
> You might want to take a look at this ILO Fence Agent.
> http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-18.noarch.rpm
> I think when I wrote it I detected the same problem and fixed it there.
> Marc.

Wow, pretty impressive,
thank you very much, your script saves me a lot of headache.

# time ./fence_ilo -a xx.xx.xx.xx -l admin -p xxxxxxxx
success

real    0m49.406s
user    0m0.138s
sys     0m0.015s

# time ./fence_fast_ilo -a xx.xx.xx.xx -l admin -p xxxxxxxx

real    0m8.713s
user    0m0.041s
sys     0m0.011s

Bye
Miolinux



From grimme at atix.de  Thu Aug  7 16:20:32 2008
From: grimme at atix.de (Marc Grimme)
Date: Thu, 7 Aug 2008 18:20:32 +0200
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <1218124543.10161.2.camel@GD-P2-093>
References: <4892E039.3050701@midascomm.com>
	<200808070957.38394.grimme@atix.de>
	<1218124543.10161.2.camel@GD-P2-093>
Message-ID: <200808071820.32538.grimme@atix.de>

On Thursday 07 August 2008 17:55:43 Miolinux wrote:
> On Thu, 2008-08-07 at 09:57 +0200, Marc Grimme wrote:
> > You might want to take a look at this ILO Fence Agent.
> > http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/c
> >omoonics-bootimage-fenceclient-ilo-0.1-18.noarch.rpm I think when I wrote
> > it I detected the same problem and fixed it there. Marc.
>
> Wow, pretty impressive,
> thank you very much, your script saves me a lot of headache.
>
> # time ./fence_ilo -a xx.xx.xx.xx -l admin -p xxxxxxxx
> success
>
> real    0m49.406s
> user    0m0.138s
> sys     0m0.015s
>
> # time ./fence_fast_ilo -a xx.xx.xx.xx -l admin -p xxxxxxxx
>
> real    0m8.713s
> user    0m0.041s
> sys     0m0.011s
>
> Bye
> Miolinux
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

You can also use this agent for mapping iso images as cds and the like. Look 
at the -x/--xmlfile option. We use it very often for such things like 
creating users/mapping isos ...

<RIB_INFO MODE = "write">
	<INSERT_VIRTUAL_MEDIA DEVICE="CDROM"
	IMAGE_URL="http://servername/atix/livecd.img"/>
	<SET_VM_STATUS DEVICE = "CDROM">
   		<VM_BOOT_OPTION value = "BOOT_ONCE"/>
   		<VM_WRITE_PROTECT value = "Y"/>
	</SET_VM_STATUS>
</RIB_INFO>

or for resetting the ilo which is not so bad from time to time:

  <RIB_INFO MODE="write">
    <RESET_RIB/>
  </RIB_INFO>

-- 
Gruss / Regards,

Marc Grimme
http://www.atix.de/               http://www.open-sharedroot.org/



From grigorygor at gmail.com  Thu Aug  7 16:38:58 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Thu, 7 Aug 2008 18:38:58 +0200
Subject: [Linux-cluster] need help with cluster
Message-ID: <c3c0440e0808070938x59ae7313p87efd85852a080d4@mail.gmail.com>

Hi
I don't know if its the right address to send this mail to but I really need
help so I'm gonna try my luck anyway

I'm trying to configure a RHEL HA cluster with MySQl and GFS,

I have 2 virtual machines running centos5.2 virt1.xen and virt2.xen
on them I have sheared iscsi device with GFS2 that i want MySQL database to
be on.

I installed MySQL servers on both virtual machines, but I can't configure
the cluster service properly so it will be a HA MyQSL cluster

Please help I can't find any proper howtos about that.

I get this errors:
clurgmgrd[4070]: Stopping service service:MySQL
clurgmgrd[4070]: stop on mysql "mysql" returned 1 (generic error)
clurgmgrd: [4070]: Checking Existence Of File
/var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't
Exist
clurgmgrd: [4070]: Stopping Service mysql:mysql > Failed
clurgmgrd[4070]: Marking service:MySQL as 'disabled', but some resources may
still be allocated!
clurgmgrd[4070]: Service service:MySQL is disabled
clurgmgrd[4070]: stop on clusterfs "GFS" returned 2 (invalid argument(s))
clurgmgrd: [4070]: stop: Could not match /dev/vgoo/iscsi with a real device
clurgmgrd[4070]: Marking service:MySQL as 'disabled', but some resources may
still be allocated!
clurgmgrd[4070]: Service service:MySQL is disabled
clurgmgrd[4070]: Stopping service service:MySQL
clurgmgrd[4070]: stop on clusterfs "GFS" returned 2 (invalid argument(s))

This is my conf files:
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="virt-cluster" config_version="15" name="virt-cluster">
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="virt2.xen" nodeid="1" votes="1">
<fence/>
<multicast addr="192.168.122.188"/>
</clusternode>
<clusternode name="virt1.xen" nodeid="2" votes="1">
<fence/>
<multicast addr="192.168.122.188"/>
</clusternode>
</clusternodes>
<cman expected_votes="3" two_node="0">
<multicast addr="192.168.122.188"/>
</cman>
<fencedevices/>
<rm>
<failoverdomains>
<failoverdomain name="virt12" nofailback="0" ordered="0" restricted="0"/>
<failoverdomain name="FD" nofailback="0" ordered="1" restricted="0">
<failoverdomainnode name="virt2.xen" priority="2"/>
<failoverdomainnode name="virt1.xen" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="192.168.122.188" monitor_link="0"/>
</resources>
<service autostart="1" domain="FD" exclusive="0" name="MySQL"
recovery="relocate">
<mysql config_file="/etc/my.cnf" listen_address="192.168.9.36" name="mysql"
shutdown_wait="0"/>
<ip ref="192.168.122.188"/>
<clusterfs device="/dev/sda1" force_unmount="0" fsid="59408" fstype="gfs"
mountpoint="/data" name="gfs" self_fence="0"/>
</service>
</rm>
<quorumd interval="3" label="rac_qdisk" min_score="1" tko="20" votes="1"/>
</cluster>



# cat /etc/my.cnf
[mysqld]
datadir=/data
socket=/var/lib/mysql/mysql.sock
user=mysql
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid



# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

hostname = virt1.xen

192.168.122.24 virt2.xen

all the conf file are the same on both virt machines

with this configurations the virtual machine didn't load after reboot it got
stuck on loading cman

I can't find any proper howtos on configuring the services and resources in
the cluster, and I realy need to do this

Thank for your help

Grigory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080807/3d79da4c/attachment.htm>

From grigorygor at gmail.com  Thu Aug  7 16:47:54 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Thu, 7 Aug 2008 18:47:54 +0200
Subject: [Linux-cluster] need help with redhat cluster + mysql
Message-ID: <c3c0440e0808070947n67a2c239o713ce8b56466fd94@mail.gmail.com>

http://www.linuxquestions.org/questions/linux-enterprise-47/rhel-cluster-with-mysql-660927/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080807/05418b2e/attachment.htm>

From dake at novatec.de  Thu Aug  7 17:44:10 2008
From: dake at novatec.de (dake at novatec.de)
Date: Thu, 07 Aug 2008 19:44:10 +0200
Subject: [Linux-cluster] Problems with logging and cluster instability
Message-ID: <20080807194410.w91snaqpsom8oggs@novatec.dnsalias.net>

Hello folks,

we've been having two nasty problems with a GFS cluster, currently  
running version 2.03.03 of cluster suite and 0.80.3 of OpenAIS.

The first is that for some time now, logging has been broken. We're  
getting kernel log messages from the DLM and GFS modules, but the  
userlnd utilities (i.e. OpenAIS) refuses to log at all when used with  
the cluster suite. Logging is fine when started without it (i.e.  
Default OpenAIS config file), so I'm pretty sure it's not the logging  
setup. Somehow, it seems that OpenAIS is not being given correct  
logging parameters by CMAN, and I really don't know why. I've tried  
including extra logging directives in cluster.conf, in various  
different forms, but to no avail. The cluster.conf we're using now is  
as follows:

<?xml version="1.0"?>
<cluster name="gfscluster" config_version="6">

   <clusternodes>
     <clusternode name="smb1-cluster" nodeid="1">
       <fence>
         <method name="powerswitch">
           <device name="powerswitch" port="1"/>
         </method>
         <method name="last_resort">
           <device name="manual" nodename="smb1"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="smb2-cluster" nodeid="2">
       <fence>
         <method name="powerswitch">
           <device name="powerswitch" port="2"/>
         </method>
         <method name="last_resort">
           <device name="manual" nodename="smb2"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="mail-cluster" nodeid="3">
       <fence>
         <method name="powerswitch">
           <device name="powerswitch" port="3"/>
         </method>
         <method name="last_resort">
           <device name="manual" nodename="mail"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="backup-cluster" nodeid="4">
       <fence>
         <method name="powerswitch">
           <device name="powerswitch" port="4"/>
         </method>
         <method name="last_resort">
           <device name="manual" nodename="backup"/>
         </method>
       </fence>
     </clusternode>
   </clusternodes>

   <fencedevices>
     <fencedevice name="powerswitch" agent="fence_epc"  
host="192.168.10.xx" passwd="xxx" action="4"/>
     <fencedevice name="manual" agent="fence_manual"/>
   </fencedevices>

   <fence_daemon post_join_delay="30">
   </fence_daemon>

   <logging to_syslog="yes" syslog_facility="local3">
     <logger ident="CPG" to_syslog="yes">
     </logger>
     <logger ident="CMAN" to_syslog="yes">
     </logger>
     <logger ident="CLM" to_syslog="yes">
     </logger>
   </logging>

</cluster>

Any idea why this might not be working?

The second problem is that once quorum is reached, any additional  
nodes joining will make the existing quorate cluster break apart. This  
behaviour has been seen in a three-node config with the third node  
joining, and in a four-node config with the fourth node joining. WHICH  
node is the last to join doesn't seem to make a difference. The  
"breaking apart" means that the newly joined node dies ("joining  
cluster with disallowed nodes, must die"), one of the existing nodes  
dies, and two of the other existing nodes keep running, but desynced -  
both show differing cluster membership and differing disallowed nodes.  
This is after a fresh reboot, so there is NO state in any node before  
joining. The crash occurs at the cman_tool join stage.

I have a gut feeling it might have something to do with our network  
config, which has a total of four ethernet interfaces in three of the  
nodes, and two in the fourth. The first three have two iSCSI  
interfaces, one for cluster use and one for regular LAN access. The  
last has only one iSCSI interface and no LAN access for now. Routing  
tables etc. should be setup properly; as you can see above,  
cluster.conf uses special hostnames for the cluster interfaces, which  
are resolved to IPs using hosts files which are identical on all four  
machines. I have yet to do any packet sniffing, and I have very little  
information log-wise due to the first problem, so I'm sure this is not  
a lot of info; but I thought I might include it anyway, in case  
someone can immediately point out the problem.

Thanks in advance,
Daniel



From shawnlhood at gmail.com  Thu Aug  7 20:01:20 2008
From: shawnlhood at gmail.com (Shawn Hood)
Date: Thu, 7 Aug 2008 16:01:20 -0400
Subject: [Linux-cluster] Problems after gfs_resize
Message-ID: <cfe2fc960808071301o2aab4ee2jc29484b4a3684d13@mail.gmail.com>

Yesterday, I resized (+300GB) a clustered logical volume using lvresize.  I
then executed gfs_grow.  All went well.  Today, however, I'm getting the
error message 'attempt to access beyond end of device.'  See below.  I
wanted to give some cursory information in case this is a known issue with a
quick fix.  I'll follow up with additional information.  Any ideas??

 --- Logical volume ---
  LV Name                /dev/hq-san/svn_users
  VG Name                hq-san
  LV UUID                jKaFXu-yixW-5fOJ-uzeg-x1SP-IL0k-LFqBym
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.89 TB
  Current LE             495616
  Segments               2
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:9


attempt to access beyond end of device
dm-9: rw=0, want=3221225552, limit=3221225472
attempt to access beyond end of device
dm-9: rw=0, want=3221225560, limit=3221225472
attempt to access beyond end of device
dm-9: rw=0, want=3221225568, limit=3221225472
GFS: fsid=qualia:svn_users.3: fatal: I/O error
GFS: fsid=qualia:svn_users.3:   block = 402653195
GFS: fsid=qualia:svn_users.3:   function = gfs_dreread
GFS: fsid=qualia:svn_users.3:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-75/smp/src/gfs/dio.c, line = 576
GFS: fsid=qualia:svn_users.3:   time = 1218137345
GFS: fsid=qualia:svn_users.3: about to withdraw from the cluster
GFS: fsid=qualia:svn_users.3: waiting for outstanding I/O
GFS: fsid=qualia:svn_users.3: telling LM to withdraw
lock_dlm: withdraw abandoned memory


Shawn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080807/67cda358/attachment.htm>

From rpeterso at redhat.com  Thu Aug  7 20:52:50 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 07 Aug 2008 15:52:50 -0500
Subject: [Linux-cluster] Problems after gfs_resize
In-Reply-To: <cfe2fc960808071301o2aab4ee2jc29484b4a3684d13@mail.gmail.com>
References: <cfe2fc960808071301o2aab4ee2jc29484b4a3684d13@mail.gmail.com>
Message-ID: <1218142370.9521.87.camel@technetium.msp.redhat.com>

On Thu, 2008-08-07 at 16:01 -0400, Shawn Hood wrote:
> Yesterday, I resized (+300GB) a clustered logical volume using
> lvresize.  I then executed gfs_grow.  All went well.  Today, however,
> I'm getting the error message 'attempt to access beyond end of
> device.'  See below.  I wanted to give some cursory information in
> case this is a known issue with a quick fix.  I'll follow up with
> additional information.  Any ideas??

Hi Shawn,

This is most likely bugzilla bug #436383.  Assuming it is,
there is a fix to gfs-kernel that resolves the problem so it can't
happen again.  Unfortunately, the bug causes damage to the gfs
file system.  The latest version of gfs_fsck knows how to repair
the damage.  The fix to gfs_fsck was done for bugzilla bug 440896.
My apologies if you can't view either of these bug records, but
they may be marked private to protect the customer who reported the
problem.

Of course, without actually seeing your file system, I can't be
sure that it's the same problem.  It's likely it though.

The fix to both gfs and gfs_fsck were posted to RHN.  Here is a
link to the gfs_fsck information:
http://rhn.redhat.com/errata/RHBA-2008-0804.html

(There are RHEL4 and RHEL5 fixes for both of these, but I'm posting
the RHEL4 version because your kernel indicated you were using RHEL4.)

Regards,

Bob Peterson
Red Hat Clustering & GFS




From rauch at atix.de  Fri Aug  8 06:37:20 2008
From: rauch at atix.de (Michael Rauch (ATIX AG))
Date: Fri, 8 Aug 2008 08:37:20 +0200
Subject: [Linux-cluster] Problems with logging and cluster instability
In-Reply-To: <20080807194410.w91snaqpsom8oggs@novatec.dnsalias.net>
References: <20080807194410.w91snaqpsom8oggs@novatec.dnsalias.net>
Message-ID: <200808080837.20494.rauch@atix.de>

Hello Daniel,

the first issue sounds like a network problem.
For RHEL5 you have to enable IGMP and multicast traffic forwarding
on some network-switches (on the logging network).

Regards, Michael



On Thursday 07 August 2008, dake at novatec.de wrote:
> Hello folks,
>
> we've been having two nasty problems with a GFS cluster, currently
> running version 2.03.03 of cluster suite and 0.80.3 of OpenAIS.
>
> The first is that for some time now, logging has been broken. We're
> getting kernel log messages from the DLM and GFS modules, but the
> userlnd utilities (i.e. OpenAIS) refuses to log at all when used with
> the cluster suite. Logging is fine when started without it (i.e.
> Default OpenAIS config file), so I'm pretty sure it's not the logging
> setup. Somehow, it seems that OpenAIS is not being given correct
> logging parameters by CMAN, and I really don't know why. I've tried
> including extra logging directives in cluster.conf, in various
> different forms, but to no avail. The cluster.conf we're using now is
> as follows:
>
> <?xml version="1.0"?>
> <cluster name="gfscluster" config_version="6">
>
>    <clusternodes>
>      <clusternode name="smb1-cluster" nodeid="1">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="1"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="smb1"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="smb2-cluster" nodeid="2">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="2"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="smb2"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="mail-cluster" nodeid="3">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="3"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="mail"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="backup-cluster" nodeid="4">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="4"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="backup"/>
>          </method>
>        </fence>
>      </clusternode>
>    </clusternodes>
>
>    <fencedevices>
>      <fencedevice name="powerswitch" agent="fence_epc"
> host="192.168.10.xx" passwd="xxx" action="4"/>
>      <fencedevice name="manual" agent="fence_manual"/>
>    </fencedevices>
>
>    <fence_daemon post_join_delay="30">
>    </fence_daemon>
>
>    <logging to_syslog="yes" syslog_facility="local3">
>      <logger ident="CPG" to_syslog="yes">
>      </logger>
>      <logger ident="CMAN" to_syslog="yes">
>      </logger>
>      <logger ident="CLM" to_syslog="yes">
>      </logger>
>    </logging>
>
> </cluster>
>
> Any idea why this might not be working?
>
> The second problem is that once quorum is reached, any additional
> nodes joining will make the existing quorate cluster break apart. This
> behaviour has been seen in a three-node config with the third node
> joining, and in a four-node config with the fourth node joining. WHICH
> node is the last to join doesn't seem to make a difference. The
> "breaking apart" means that the newly joined node dies ("joining
> cluster with disallowed nodes, must die"), one of the existing nodes
> dies, and two of the other existing nodes keep running, but desynced -
> both show differing cluster membership and differing disallowed nodes.
> This is after a fresh reboot, so there is NO state in any node before
> joining. The crash occurs at the cman_tool join stage.
>
> I have a gut feeling it might have something to do with our network
> config, which has a total of four ethernet interfaces in three of the
> nodes, and two in the fourth. The first three have two iSCSI
> interfaces, one for cluster use and one for regular LAN access. The
> last has only one iSCSI interface and no LAN access for now. Routing
> tables etc. should be setup properly; as you can see above,
> cluster.conf uses special hostnames for the cluster interfaces, which
> are resolved to IPs using hosts files which are identical on all four
> machines. I have yet to do any packet sniffing, and I have very little
> information log-wise due to the first problem, so I'm sure this is not
> a lot of info; but I thought I might include it anyway, in case
> someone can immediately point out the problem.
>
> Thanks in advance,
> Daniel
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From rhurst at bidmc.harvard.edu  Fri Aug  8 11:58:37 2008
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Fri, 8 Aug 2008 07:58:37 -0400
Subject: [Linux-cluster] HP ILO Fence Configuration
In-Reply-To: <200808071820.32538.grimme@atix.de>
References: <4892E039.3050701@midascomm.com>
	<200808070957.38394.grimme@atix.de>
	<1218124543.10161.2.camel@GD-P2-093>
	<200808071820.32538.grimme@atix.de>
Message-ID: <1218196717.2579.1.camel@WSBID06223.bidmc.harvard.edu>

Excellent work and explanation!!  It would be wonderful to see these
useful techniques get integrated back into Red Hat's Cluster Suite, with
appropriate credit acknowledging the author's work.


________________________________________________________________________
??Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.


On Thu, 2008-08-07 at 18:20 +0200, Marc Grimme wrote:

> On Thursday 07 August 2008 17:55:43 Miolinux wrote:
> > On Thu, 2008-08-07 at 09:57 +0200, Marc Grimme wrote:
> > > You might want to take a look at this ILO Fence Agent.
> > > http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/c
> > >omoonics-bootimage-fenceclient-ilo-0.1-18.noarch.rpm I think when I wrote
> > > it I detected the same problem and fixed it there. Marc.
> >
> > Wow, pretty impressive,
> > thank you very much, your script saves me a lot of headache.
> >
> > # time ./fence_ilo -a xx.xx.xx.xx -l admin -p xxxxxxxx
> > success
> >
> > real    0m49.406s
> > user    0m0.138s
> > sys     0m0.015s
> >
> > # time ./fence_fast_ilo -a xx.xx.xx.xx -l admin -p xxxxxxxx
> >
> > real    0m8.713s
> > user    0m0.041s
> > sys     0m0.011s
> >
> > Bye
> > Miolinux
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> You can also use this agent for mapping iso images as cds and the like. Look 
> at the -x/--xmlfile option. We use it very often for such things like 
> creating users/mapping isos ...
> 
> <RIB_INFO MODE = "write">
> 	<INSERT_VIRTUAL_MEDIA DEVICE="CDROM"
> 	IMAGE_URL="http://servername/atix/livecd.img"/>
> 	<SET_VM_STATUS DEVICE = "CDROM">
>    		<VM_BOOT_OPTION value = "BOOT_ONCE"/>
>    		<VM_WRITE_PROTECT value = "Y"/>
> 	</SET_VM_STATUS>
> </RIB_INFO>
> 
> or for resetting the ilo which is not so bad from time to time:
> 
>   <RIB_INFO MODE="write">
>     <RESET_RIB/>
>   </RIB_INFO>
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080808/8ed10fac/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2178 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080808/8ed10fac/attachment.p7s>

From raju.rajsand at gmail.com  Sat Aug  9 07:38:09 2008
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Sat, 9 Aug 2008 07:38:09 +0000
Subject: [Linux-cluster] NFS + Cluster
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B17A09@in-ex004.groupinfra.com>
References: <Acj4bQcCKqtlMTw3SICJyufYAXItbg==>
	<0139539A634FD04A99C9B8880AB70CB209B17A09@in-ex004.groupinfra.com>
Message-ID: <8786b91c0808090038n53c5d701s5eb7a9adbe49381b@mail.gmail.com>

Greetings,

2008/8/7 Singh Raina, Ajeet <ajeet.singh.raina at logica.com>

>  Hello Guys,
>
Have you checked out the NFS+GFS in the linux cluster wiki?

Regards

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080809/4eacec5f/attachment.htm>

From raju.rajsand at gmail.com  Sat Aug  9 08:09:41 2008
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Sat, 9 Aug 2008 08:09:41 +0000
Subject: [Linux-cluster] need help with cluster
In-Reply-To: <c3c0440e0808070938x59ae7313p87efd85852a080d4@mail.gmail.com>
References: <c3c0440e0808070938x59ae7313p87efd85852a080d4@mail.gmail.com>
Message-ID: <8786b91c0808090109j35f39e5fpe679006d8e6d0a30@mail.gmail.com>

Greetings,

2008/8/7 Grisha G. <grigorygor at gmail.com>

>
> I'm trying to configure a RHEL HA cluster with MySQl and GFS,
>
> I have 2 virtual machines running centos5.2 virt1.xen and virt2.xen
> on them I have sheared iscsi device with GFS2 that i want MySQL database to
> be on.
>
> Have u tried the following?

http://www.howtoforge.com/loadbalanced_mysql_cluster_debian

It can give you certain pointers

HTH

Regards

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080809/8f9aa865/attachment.htm>

From peter.gustafsson1 at se.ibm.com  Sat Aug  9 16:20:29 2008
From: peter.gustafsson1 at se.ibm.com (Peter Gustafsson1)
Date: Sat, 9 Aug 2008 18:20:29 +0200
Subject: [Linux-cluster] Multiple LV's in a VG when running HA LVM.
Message-ID: <OF3818E408.0C10041E-ONC12574A0.0057031C-C12574A0.0059B4A3@se.ibm.com>

Hi,

Is there any one who knowes how to setup HA LVM with multiple lv's in a 
VG.

I'm running RHEL 4 AS update 7.

H?lsningar/Regards Peter
___________________________________________________________________________________ 

Peter Gustafsson

UNIX/LINUX systems programmer

SDO/SSO/Server Management/Operating Systems/Unix

.   Nordic Processor, Oddegatan 5, SE-164 92 Stockholm
%   Phone:+46 (0)8 793 39 23 Mobile:+46 (0)70 793 39 23
)   mailto:peter.gustafsson1 at se.ibm.com 
Red Hat Enterprise Linux Certified Engineer

S?vida annat inte anges ovan: / Unless stated otherwise above:
IBM Svenska AB
Organisationsnummer: 556026-6883
Adress: 164 92 Stockholm
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080809/ac76aa60/attachment.htm>

From vikash at netvigator.com  Mon Aug 11 15:34:00 2008
From: vikash at netvigator.com (Vikash Khatuwala)
Date: Mon, 11 Aug 2008 23:34:00 +0800
Subject: [Linux-cluster] GFS leak lock
Message-ID: <20080811153402.DHSG2335.imsm058dat.netvigator.com@cch-vikash.netvigator.com>

Hello,

I am working on setting up a redhat 5.2 cluster with GFS filesystem 
on an iSCSI SAN. I have configured everything according to the 
manuals but now I am constantly getting a message on the console and 
in the output of "dmesg" as below:

GFS: would leak glock ffff8101f09ef408
GFS: would leak glock ffff810185d22258
GFS: would leak glock ffff81015dd930a0
GFS: would leak glock ffff8101e4e184c8
GFS: would leak glock ffff81008785d580

This message appears every few minutes and I am not sure what it 
means and how to resolve this if its a problem. The FS is gfs1 with 4 
journals but there is only 1 node at the moment.

Also when I issue a "shutdown -h now". The node will not halt and 
will always get stuck at the point when shutting down clvmd. I always 
have to manually power down and power up again.

Please let me know what other info would be useful to help diagnose. 
Waiting for any advice.

Thanks,
Vikash.



From tedley at gmail.com  Tue Aug 12 00:25:09 2008
From: tedley at gmail.com (ted)
Date: Mon, 11 Aug 2008 20:25:09 -0400
Subject: [Linux-cluster] Some nodes won't join after being fenced
In-Reply-To: <a5f85ede0807311801x2d1ddc55vd9c83de7cf8be3ef@mail.gmail.com>
References: <48920785.4060300@adamdein.com>
	<824ffea00807311325u186e8129kf5218e6dbc2a4d06@mail.gmail.com>
	<a5f85ede0807311801x2d1ddc55vd9c83de7cf8be3ef@mail.gmail.com>
Message-ID: <a5f85ede0808111725t74423cbdt1e49eae8c7121d8c@mail.gmail.com>

We ended up fixing this issue by fixing our network.  Our switch was
attempting to do something interesting with multicast traffic even though we
had everything turned off.

We moved the cluster to a switch that doesn't try to be as cool about
multicast traffic and it's rock solid now.

-ted
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080811/39ebcfe6/attachment.htm>

From fdinitto at redhat.com  Tue Aug 12 09:07:17 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 12 Aug 2008 11:07:17 +0200 (CEST)
Subject: [Linux-cluster] Cluster 2.99.08 (development snapshot) released
Message-ID: <Pine.LNX.4.64.0808121105180.14806@trider-g7>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The cluster team and its community are proud to announce the 2.99.08
release from the master branch.

The development cycle for 3.0 is proceeding at a very good speed and
mostlikely one of the next releases will be 3.0alpha1. All features
designed for 3.0 are being completed and taking a proper shape, the
library API has been stable for sometime (and will soon be marked as 3.0
soname). Stay tuned for upcoming updates!

The 2.99.XX releases are _NOT_ meant to be used for production
environments.. yet.

The master branch is the main development tree that receives all new
features, code, clean up and a whole brand new set of bugs,

At some point in time this code will become the 3.0 stable release.

Everybody with test equipment and time to spare, is highly encouraged to
download, install and test the 2.99 releases and more important report
problems.

In order to build the 2.99.08 release you will need:

- - openais svn r1579. Porting to corosync is a work in progress.
- - linux kernel (2.6.26) from
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git
(but userland can run on 2.6.25 in compatibility mode)

NOTE to packagers: the library API/ABI's are _NOT_ stable (hence 2.9). We
are still shipping shared libraries but remember that they can change
anytime without warning. A bunch of new shared libraries have been added.

The new source tarball can be downloaded here:

   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.99.08.tar.gz
   https://fedorahosted.org/releases/c/l/cluster/cluster-2.99.08.tar.gz

In order to use GFS1, the Linux kernel requires a minimal patch:

   ftp://sources.redhat.com/pub/cluster/releases/lockproto-exports.patch
   https://fedorahosted.org/releases/c/l/cluster/lockproto-exports.patch

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 2.99.07):

Bob Peterson (1):
       mkfs.gfs2: should have an optional fs size parm

Christine Caulfield (3):
       cman: Revert dirty patch
       cman: exit if configuration check fails.
       cman: tidy objdb_get_int

David Teigland (7):
       dlm_controld: allow early fs_register
       gfs_controld: register with dlm_controld earlier
       libdlm: remove device node creation/removal
       dlm_tool: handle all join flags
       group_tool: use mode from groupd
       fenced: finishing off query stuff
       dlm_controld: queries in libgroup mode

Fabio M. Di Nitto (12):
       fence: remove unrequired headers from rackswitch
       build: fix several issues related to install and build targets
       build: drop "all" dependency from install: targets
       ccs: move to the new logsys init API
       qdisk: port to new logsys api
       build: properly respect non standard libdir and incdir
       ccs: turn more ccs_tool code into legacy code
       build: fix ccs_test symlink install target
       cman: make sure not to umount configfs when there are other users
       config: fix objdb2xml filtering
       rgmanager: unbreak locking in clulib
       libccs: add support for /child::*[%d]/ for xpathlite

Lon Hohberger (2):
       [rgmanager] Fix resource agent metadata and un-break 'make check' target
       [rgmanager] Re-fix permissions bits broken in last commit

Ryan O'Hara (1):
       ip.sh: add sleeptime parameter

  Makefile                                      |    2 +-
  bindings/perl/ccs/Makefile.bindings           |    6 +-
  ccs/ccs_tool/Makefile                         |   18 ++++--
  ccs/ccs_tool/ccs_tool.c                       |    2 +-
  ccs/ccs_tool/editconf.c                       |    4 +
  ccs/ccsais/Makefile                           |    5 +-
  ccs/daemon/Makefile                           |    6 +-
  ccs/daemon/ccsd.c                             |   10 +--
  ccs/daemon/cluster_mgr.c                      |    2 -
  ccs/daemon/cnx_mgr.c                          |    2 -
  ccs/daemon/misc.c                             |    2 -
  cman/cman_tool/Makefile                       |    2 +-
  cman/daemon/Makefile                          |    2 +
  cman/daemon/ais.c                             |   13 +++--
  cman/daemon/cman-preconfig.c                  |   16 +----
  cman/daemon/cman.h                            |    9 +++
  cman/daemon/cmanconfig.c                      |   27 ++++-----
  cman/daemon/cmanconfig.h                      |    4 +-
  cman/daemon/commands.c                        |    9 +--
  cman/daemon/daemon.c                          |    2 +-
  cman/init.d/cman                              |    2 +-
  cman/qdisk/Makefile                           |    4 +-
  cman/qdisk/daemon_init.c                      |    2 -
  cman/qdisk/disk.c                             |    2 -
  cman/qdisk/disk_util.c                        |    2 -
  cman/qdisk/main.c                             |   13 +---
  cman/qdisk/mkqdisk.c                          |   14 +++--
  cman/qdisk/proc.c                             |    2 -
  cman/qdisk/score.c                            |    2 -
  cman/tests/Makefile                           |    4 +-
  config/libs/libccsconfdb/Makefile             |    1 +
  config/libs/libccsconfdb/libccs.c             |   33 +++++++++--
  config/plugins/ldap/Makefile                  |    1 +
  config/plugins/xml/Makefile                   |    1 +
  config/tools/ldap/Makefile                    |    4 +-
  dlm/libdlm/Makefile                           |    2 +
  dlm/libdlm/libdlm.c                           |   65 ++------------------
  dlm/libdlm/libdlm.h                           |    2 +
  dlm/libdlmcontrol/libdlmcontrol.h             |    1 +
  dlm/tests/usertest/Makefile                   |    3 +-
  dlm/tool/Makefile                             |    2 +-
  dlm/tool/main.c                               |   80 ++++++++++++++++++++-----
  fence/agents/gnbd/Makefile                    |    2 +
  fence/agents/ipmilan/Makefile                 |    1 +
  fence/agents/rackswitch/Makefile              |    1 +
  fence/agents/rackswitch/do_rack.h             |    4 -
  fence/agents/rps10/Makefile                   |    1 +
  fence/agents/xvm/Makefile                     |    8 +-
  fence/fence_node/Makefile                     |    5 +-
  fence/fence_tool/Makefile                     |    5 +-
  fence/fenced/Makefile                         |    5 +-
  fence/fenced/cpg.c                            |    3 +
  fence/fenced/fd.h                             |    1 +
  fence/fenced/group.c                          |    7 ++-
  fence/fenced/main.c                           |    3 +
  fence/fenced/recover.c                        |    5 ++
  fence/libfence/Makefile                       |    1 +
  fence/libfenced/libfenced.h                   |    1 +
  gfs/gfs_debug/Makefile                        |    1 +
  gfs/gfs_fsck/Makefile                         |    1 +
  gfs/gfs_grow/Makefile                         |    6 +-
  gfs/gfs_jadd/Makefile                         |    6 +-
  gfs/gfs_mkfs/Makefile                         |    6 +-
  gfs/gfs_quota/Makefile                        |    1 +
  gfs/gfs_tool/Makefile                         |    1 +
  gfs/tests/filecon2/Makefile                   |    2 +
  gfs/tests/mmdd/Makefile                       |    3 +
  gfs2/convert/Makefile                         |    6 +-
  gfs2/edit/Makefile                            |    6 +-
  gfs2/fsck/Makefile                            |    8 +-
  gfs2/mkfs/Makefile                            |    6 +-
  gfs2/mkfs/main_mkfs.c                         |   19 ++++--
  gfs2/mount/Makefile                           |    7 ++-
  gfs2/quota/Makefile                           |    6 +-
  gfs2/tool/Makefile                            |    6 +-
  gnbd/client/Makefile                          |    3 +-
  gnbd/server/Makefile                          |    3 +-
  gnbd/tools/gnbd_export/Makefile               |    3 +-
  gnbd/tools/gnbd_import/Makefile               |    3 +-
  group/dlm_controld/Makefile                   |    8 ++-
  group/dlm_controld/dlm_daemon.h               |    5 ++
  group/dlm_controld/group.c                    |   64 +++++++++++++++++++-
  group/dlm_controld/main.c                     |   58 +++++++++++++++++-
  group/gfs_control/Makefile                    |    5 +-
  group/gfs_controld/Makefile                   |    8 ++-
  group/gfs_controld/cpg-new.c                  |   23 ++++----
  group/test/Makefile                           |    3 +
  group/tool/Makefile                           |   11 ++-
  group/tool/main.c                             |   70 ++++++++++++++--------
  make/install.mk                               |    2 +-
  make/perl-binding-common.mk                   |    2 +-
  rgmanager/src/clulib/Makefile                 |    2 +-
  rgmanager/src/clulib/cman.c                   |    8 +-
  rgmanager/src/daemons/Makefile                |   28 +++++----
  rgmanager/src/resources/Makefile              |    6 +-
  rgmanager/src/resources/SAPDatabase           |    4 +-
  rgmanager/src/resources/SAPInstance           |    4 +-
  rgmanager/src/resources/apache.metadata       |    2 +-
  rgmanager/src/resources/ip.sh                 |   19 ++++++-
  rgmanager/src/resources/lvm.metadata          |    2 +-
  rgmanager/src/resources/lvm.sh                |    2 +-
  rgmanager/src/resources/mysql.metadata        |    2 +-
  rgmanager/src/resources/named.metadata        |    2 +-
  rgmanager/src/resources/openldap.metadata     |    2 +-
  rgmanager/src/resources/oracledb.sh           |   67 ++++++++++++++------
  rgmanager/src/resources/postgres-8.metadata   |    2 +-
  rgmanager/src/resources/ra-api-1-modified.dtd |    7 +-
  rgmanager/src/resources/samba.metadata        |    2 +-
  rgmanager/src/resources/service.sh            |    2 +-
  rgmanager/src/resources/tomcat-5.metadata     |    2 +-
  rgmanager/src/utils/Makefile                  |   24 ++++----
  111 files changed, 626 insertions(+), 363 deletions(-)

- --
I'm going to make him an offer he can't refuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQIVAwUBSKFSzAgUGcMLQ3qJAQJQfw/+IoOL12/bSniC/M4mpPVa0+6O3Y4/Vh03
rI7RdoHMI/UJy+uLvCwXdOxa7M2jPck8M+qzS/joO6vw2sphP1pLVkOkmrpBysv9
iOvcy93pguu/0XOtONrvd4EU7bgSdeJoXm5XKLwyMHQYKF1P+t4OAa6M99XkwIBZ
Yntpakj5wAg34Rtie+zjFijQeeXma/V+yENSXL+6xEt+DltyuFsA4SdyoLXFa92u
UkZf7qEliGY0IxeaS+aV9eVohWd61qHkms6S8bDT5ZR4oiU0pJAiTl2hIg69rK2+
NFT/PBXj8LacZcA04S3lt6boP/a7wYHKU8ltRk30vqaVtZmrjTPHuTBn0QV+FqE5
56VFPvh5D5vVePRn+jNoZucvy2vXCOK8tJ7YiiOo/Ndcy+hCzC5FJjSHggVjjD/x
Kn7NE9EVUoeZohPQWcnCajnrLP9jZz4koBhNzqUh1nPvRsYYHPN5f6TmRFjDqjvM
gU7U5Jwp+ob1A/jwEFbr8uMN7s8RCI1aBAy5MMJhohJVNMi8j2vQ9rZz4PCMhDCw
7JLiVdesxjLdBei4s7JtJpU5iOawAOCbt62+on8iP2Rj+rRkoWXPOL6C3Q8jCnVR
Y9VEC/rJ5LpponExUEsTnRPoNlmN5a1EjjKpVBTm/0+f0IAY25XUd+GGwaAgolzE
PGHv8bVILgA=
=PIjI
-----END PGP SIGNATURE-----



From linux at vfemail.net  Tue Aug 12 10:26:44 2008
From: linux at vfemail.net (Alex)
Date: Tue, 12 Aug 2008 13:26:44 +0300
Subject: [Linux-cluster] gfs over raid/lvm or any other option?
Message-ID: <200808121326.44254.linux@vfemail.net>

Hello experts,

If on one machine i have 8 block devices (/deb/sda, /dev/sdb, ... /dev/sdg, 
dev/sdh) imported via iscsi from 8 different computers (computerX) can i 
group volumes two by two using raid1 and after that to join resulted /dev/md0 
up to md3 in one logical volume and run GFS on top?

If this thing is not possible, how can i join together all 8 sd[a-h] in order 
to have fault-tolerance shared storage (failure of a single drive sd[a-h] or 
server (computerX) should not bring down the GFS)? I read about gnbd but is 
working only on 2 nodes. Any idea about how to construct a shared storage 
volume using all 8 devices?

Lustre can be a solution to my problem? I have one doubt regarding lustre: i 
saw that they are using EXT3 on top, which I know that is a LOCAL FILE SYSTEM 
not suitable for SHARED STORAGE (different computers accesing the same volume 
and write at the same time on it).

So, using their patched kernels, ext3 become suitable for SHARED STORAGE?

Regards,
Alx



From ccaulfie at redhat.com  Tue Aug 12 10:50:33 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 12 Aug 2008 11:50:33 +0100
Subject: [Linux-cluster] Cluster Shutdown - ideas?
Message-ID: <48A16AF9.8030009@redhat.com>

One thing that cman does rather badly is a full cluster shutdown. With 
the RHEL4 code you would shut each node down in turn using the init 
scripts and found that everything hung as it lost quorum when the N/2th 
node went down.

With RHEL5 the init script was changed to do a "cman_tool leave remove" 
which tells the remaining nodes to reduce quorum to allow for the 
missing node(s).

I don't really like either of these solutions. The RHEL4 way is 
obviously a nuisance, but even the RHEL5 system is wrong IMHO. A normal 
node shutdown should not reduce quorum. If other nodes fail while that 
node is down the cluster runs the risk of a split brain due to reduced 
quorum.

Those of you who have worked with VMS systems know that that OS has a 
CLUSTER_SHUTDOWN option which causes the cluster software to wait until 
all nodes have reached a shutdown barrier and then allows all of them to 
go down at the same time. We could do this with Linux, but I'm not 
really sure how much use it would be, mainly because the cluster 
software sits at a higher level in the OS than with VMS and there is a 
lot more for the computer to do after the cluster software has shut 
down. It is an option though.

The other option is simply to set a flag (either in CMAN or locally) to 
tell the node or the whole cluster that everyone is being shut down. 
There are a few ways of doing this, the simplest is to add a flag to the 
cman init script (basically the opposite of what happens now in RHEL5) 
that causes "cman_tool leave remove". But that requires the cluster 
software to be shut down independently of the rest of the software thus 
destroying the point of ordered init scripts.

So, the flag could be an environment variable that is checked by the 
init script perhaps (do those get passed through?), or perhaps a flag 
inside cman itself that changes the "leave" behaviour to either do a 
"leave remove" or the synchronised cluster shutdown I mentioned earlier.

Does anyone have any preferences, ideas or other options we might consider?

Chrissie



From shawnlhood at gmail.com  Tue Aug 12 11:01:25 2008
From: shawnlhood at gmail.com (Shawn Hood)
Date: Tue, 12 Aug 2008 07:01:25 -0400
Subject: [Linux-cluster] Cluster Shutdown - ideas?
In-Reply-To: <48A16AF9.8030009@redhat.com>
References: <48A16AF9.8030009@redhat.com>
Message-ID: <cfe2fc960808120401x26709dddp85e0d17fe6c934d1@mail.gmail.com>

Whatever you do, please make sure it backports to the RHEL4 packages.  This
causes much woe!  I'm interested in hearing what others say about this
problem.

Shawn


On Tue, Aug 12, 2008 at 6:50 AM, Christine Caulfield <ccaulfie at redhat.com>wrote:

> One thing that cman does rather badly is a full cluster shutdown. With the
> RHEL4 code you would shut each node down in turn using the init scripts and
> found that everything hung as it lost quorum when the N/2th node went down.
>
> With RHEL5 the init script was changed to do a "cman_tool leave remove"
> which tells the remaining nodes to reduce quorum to allow for the missing
> node(s).
>
> I don't really like either of these solutions. The RHEL4 way is obviously a
> nuisance, but even the RHEL5 system is wrong IMHO. A normal node shutdown
> should not reduce quorum. If other nodes fail while that node is down the
> cluster runs the risk of a split brain due to reduced quorum.
>
> Those of you who have worked with VMS systems know that that OS has a
> CLUSTER_SHUTDOWN option which causes the cluster software to wait until all
> nodes have reached a shutdown barrier and then allows all of them to go down
> at the same time. We could do this with Linux, but I'm not really sure how
> much use it would be, mainly because the cluster software sits at a higher
> level in the OS than with VMS and there is a lot more for the computer to do
> after the cluster software has shut down. It is an option though.
>
> The other option is simply to set a flag (either in CMAN or locally) to
> tell the node or the whole cluster that everyone is being shut down. There
> are a few ways of doing this, the simplest is to add a flag to the cman init
> script (basically the opposite of what happens now in RHEL5) that causes
> "cman_tool leave remove". But that requires the cluster software to be shut
> down independently of the rest of the software thus destroying the point of
> ordered init scripts.
>
> So, the flag could be an environment variable that is checked by the init
> script perhaps (do those get passed through?), or perhaps a flag inside cman
> itself that changes the "leave" behaviour to either do a "leave remove" or
> the synchronised cluster shutdown I mentioned earlier.
>
> Does anyone have any preferences, ideas or other options we might consider?
>
> Chrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


--
Shawn Hood
910.670.1819 m
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080812/7c7779cd/attachment.htm>

From fdinitto at redhat.com  Tue Aug 12 11:44:44 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 12 Aug 2008 13:44:44 +0200 (CEST)
Subject: [Linux-cluster] Info about building master branch
Message-ID: <Pine.LNX.4.64.0808121340470.16551@trider-g7>


hi guys,

Chrissie landed today the awesome port to corosync for cman and related 
bits. Shortly after also some of the usual build bits have landed.

There is a bug in corosync headers that break cman build.

In order to workaround the problem you can just use:

CFLAGS="-I/path/to/openais/include/files" make

path being the one the contains saAis.h

This problem will be fixed shortly in corosync (today or tomorrow).

Fabio

--
I'm going to make him an offer he can't refuse.



From swhiteho at redhat.com  Tue Aug 12 12:05:11 2008
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 12 Aug 2008 13:05:11 +0100
Subject: [Linux-cluster] Cluster Shutdown - ideas?
In-Reply-To: <48A16AF9.8030009@redhat.com>
References: <48A16AF9.8030009@redhat.com>
Message-ID: <1218542711.20622.25.camel@quoit>

Hi,

On Tue, 2008-08-12 at 11:50 +0100, Christine Caulfield wrote:
> One thing that cman does rather badly is a full cluster shutdown. With 
> the RHEL4 code you would shut each node down in turn using the init 
> scripts and found that everything hung as it lost quorum when the N/2th 
> node went down.
> 
> With RHEL5 the init script was changed to do a "cman_tool leave remove" 
> which tells the remaining nodes to reduce quorum to allow for the 
> missing node(s).
> 
> I don't really like either of these solutions. The RHEL4 way is 
> obviously a nuisance, but even the RHEL5 system is wrong IMHO. A normal 
> node shutdown should not reduce quorum. If other nodes fail while that 
> node is down the cluster runs the risk of a split brain due to reduced 
> quorum.
> 
> Those of you who have worked with VMS systems know that that OS has a 
> CLUSTER_SHUTDOWN option which causes the cluster software to wait until 
> all nodes have reached a shutdown barrier and then allows all of them to 
> go down at the same time. We could do this with Linux, but I'm not 
> really sure how much use it would be, mainly because the cluster 
> software sits at a higher level in the OS than with VMS and there is a 
> lot more for the computer to do after the cluster software has shut 
> down. It is an option though.
> 
> The other option is simply to set a flag (either in CMAN or locally) to 
> tell the node or the whole cluster that everyone is being shut down. 
> There are a few ways of doing this, the simplest is to add a flag to the 
> cman init script (basically the opposite of what happens now in RHEL5) 
> that causes "cman_tool leave remove". But that requires the cluster 
> software to be shut down independently of the rest of the software thus 
> destroying the point of ordered init scripts.
> 
> So, the flag could be an environment variable that is checked by the 
> init script perhaps (do those get passed through?), or perhaps a flag 
> inside cman itself that changes the "leave" behaviour to either do a 
> "leave remove" or the synchronised cluster shutdown I mentioned earlier.
> 
> Does anyone have any preferences, ideas or other options we might consider?
> 
> Chrissie
> 
I think this is part of a larger problem that we have. Currently GFS2
has a shutdown issue where filesystems which were mounted by means other
than the init scripts cause hangs at node shutdown time. This is due to
the ordering of the kernel's shutdown scripts (kill off all userland
processes, then umount filesystems). There are pending bzs relating to
this issue, #435906 and #207697

I have to say that I still like the idea of a cluster run-level. Upon
reaching the "normal" multiuser run level, a process would run to join
the cluster and when quorum was reached, we'd go into a special cluster
run level, and drop out of it again when quorum is lost. I think that
would be similar to what VMS used to do from your description. When the
cluster lost quorum, then the remaining nodes would drop back into the
"normal" run level.

Steve.

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From brettcave at gmail.com  Tue Aug 12 12:46:48 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 12 Aug 2008 14:46:48 +0200
Subject: [Linux-cluster] creating file locks on a gfs volume
Message-ID: <c0773fd30808120546sd284d30ucd871f41af08dc45@mail.gmail.com>

Hi,

Not sure if this is the correct place for discussing this, but here it is.
On a gfs cluster running dlm locking, is it possible to create a file
lock on a file on the volume? (e.g. in C).
Have seen that there is a libdlm library, and have not tested yet. By
calling dlm_lock, would gfs be aware of the file lock?

This is to allow an application we are developing to create file locks
with different lock modes, based on what the app is doing (e.g. CR /
CW / PW / EX locks).



From shawnlhood at gmail.com  Tue Aug 12 13:12:42 2008
From: shawnlhood at gmail.com (Shawn Hood)
Date: Tue, 12 Aug 2008 09:12:42 -0400
Subject: [Linux-cluster] creating file locks on a gfs volume
In-Reply-To: <c0773fd30808120546sd284d30ucd871f41af08dc45@mail.gmail.com>
References: <c0773fd30808120546sd284d30ucd871f41af08dc45@mail.gmail.com>
Message-ID: <cfe2fc960808120612j3d957459j7a6ec7b6e829489c@mail.gmail.com>

Christine Caulfield has blessed us with an excellent reference that answers
these questions:

http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf



On Tue, Aug 12, 2008 at 8:46 AM, Brett Cave <brettcave at gmail.com> wrote:

> Hi,
>
> Not sure if this is the correct place for discussing this, but here it is.
> On a gfs cluster running dlm locking, is it possible to create a file
> lock on a file on the volume? (e.g. in C).
> Have seen that there is a libdlm library, and have not tested yet. By
> calling dlm_lock, would gfs be aware of the file lock?
>
> This is to allow an application we are developing to create file locks
> with different lock modes, based on what the app is doing (e.g. CR /
> CW / PW / EX locks).
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



--
Shawn Hood
910.670.1819 m
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080812/6730ea8f/attachment.htm>

From brettcave at gmail.com  Tue Aug 12 13:55:28 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 12 Aug 2008 15:55:28 +0200
Subject: [Linux-cluster] creating file locks on a gfs volume
In-Reply-To: <cfe2fc960808120612j3d957459j7a6ec7b6e829489c@mail.gmail.com>
References: <c0773fd30808120546sd284d30ucd871f41af08dc45@mail.gmail.com>
	<cfe2fc960808120612j3d957459j7a6ec7b6e829489c@mail.gmail.com>
Message-ID: <c0773fd30808120655r638d8d4dn33ef23d74240a1bf@mail.gmail.com>

2008/8/12 Shawn Hood <shawnlhood at gmail.com>:
> Christine Caulfield has blessed us with an excellent reference that answers
> these questions:
>
> http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf

Thanks, going through it now.



From hdghoghari at yahoo.com  Tue Aug 12 14:27:54 2008
From: hdghoghari at yahoo.com (haresh ghoghari)
Date: Tue, 12 Aug 2008 07:27:54 -0700 (PDT)
Subject: [Linux-cluster] Can I create Redhat Custer without Fence and SAN
In-Reply-To: <c0773fd30808120546sd284d30ucd871f41af08dc45@mail.gmail.com>
Message-ID: <601528.6603.qm@web36102.mail.mud.yahoo.com>


Dear Friends,

I have Redhat 5 , running on 3 Servers.

I am creating Redhat Custer for NFS.

Thanks ?????? 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080812/a0a281d9/attachment.htm>

From shawnlhood at gmail.com  Tue Aug 12 14:46:22 2008
From: shawnlhood at gmail.com (Shawn Hood)
Date: Tue, 12 Aug 2008 10:46:22 -0400
Subject: [Linux-cluster] Can I create Redhat Custer without Fence and SAN
In-Reply-To: <601528.6603.qm@web36102.mail.mud.yahoo.com>
References: <c0773fd30808120546sd284d30ucd871f41af08dc45@mail.gmail.com>
	<601528.6603.qm@web36102.mail.mud.yahoo.com>
Message-ID: <cfe2fc960808120746s6a276b25g73750e4fe103f124@mail.gmail.com>

See http://sources.redhat.com/cluster/faq.html


2008/8/12 haresh ghoghari <hdghoghari at yahoo.com>

>
> Dear Friends,
>
> I have Redhat 5 , running on 3 Servers.
>
> I am creating Redhat Custer for NFS.
>
> Thanks
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080812/6eb7575e/attachment.htm>

From brettcave at gmail.com  Tue Aug 12 14:59:03 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 12 Aug 2008 16:59:03 +0200
Subject: [Linux-cluster] How to configure a cluster to remain up in the
	event of node failure
Message-ID: <c0773fd30808120759g258168d1u99d8bed8b22da32d@mail.gmail.com>

With a 3node gfs1 cluster, and if i hard reset 1 node, it hangs on
startup, although the cluster seems to return to normal.
Nodes: node2, node3, node4
each node has 1 vote, and a qdisk has 2 votes.

If I reset node3, gfs on node2 and node4 is blocked while node3
restarts. First question: is there a config that will allow the
cluster to continue operating while 1 node is down? My quorum is 3 and
total votes is 4 while node3 is restarting, but my gfs mountpoints are
inaccessible until my cman services start up on node3.

Secondly, when node3 restarts, it hangs when trying to remount gfs file systems.
Starting cman
Mounting configfs...done
Starting ccsd...done
Starting cman...done
Starting daemons...done
Starting fencing...done
                   OK
qdiskd        OK

"Mounting other file systems..." OK

Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
Trying to join cluster "lock_dlm","jemdevcluster:cache1"
dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection to 2
dlm: connecting to 2
dlm: got connection from 4

After that, system just hangs.

>From nodes 2 & 4, i can run cman_tool, and everything shows that the
cluster is up, except for some services:
[root at node2 cache1]# cman_tool services
type             level name     id       state
fence            0     default  00010004 none
[2 3 4]
dlm              1     cache1   00010003 none
[2 3 4]
dlm              1     storage  00030003 none
[2 4]
gfs              2     cache1   00000000 none
[2 3 4]
gfs              2     storage  00020003 none
[2 4]

[root at node2 cache1]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2008-08-12 16:11:46  /dev/sda5
   2   M    336   2008-08-12 16:11:12  node2
   3   M    352   2008-08-12 16:44:31  node3
   4   M    344   2008-08-12 16:11:12  node4

I have 2 gfs partitions
[root at node4 CentOS]# grep gfs /etc/fstab
/dev/sda1       /gfs/cache1                     gfs     defaults
                         0 0
/dev/sda2       /gfs/storage                    gfs     defaults
                         0 0


At this point, I am unable to unmount /gfs/cache1 from any of my nodes
(node2 or node4) - it just hangs. I can unmount storage with no
problem.

Is there something I am overlooking? Any and all advice welcome :)

Regards,
Brett



From Markus at hochholdinger.net  Tue Aug 12 18:04:46 2008
From: Markus at hochholdinger.net (Markus Hochholdinger)
Date: Tue, 12 Aug 2008 20:04:46 +0200
Subject: [Linux-cluster] SAN, software raid, iscsi or GNBD, GFS
In-Reply-To: <200808071528.43149.linux@vfemail.net>
References: <200808071528.43149.linux@vfemail.net>
Message-ID: <200808122004.50644.Markus@hochholdinger.net>

Hi,

Am Donnerstag, 7. August 2008 14:28 schrieb Alex:
> I would like to build a SAN using cheap hardware.
> Let say that we have N computers (N>8) exporting their volumes (volX, where
> X=N) using ataoe or iscsi or gnbd protocol. Each volX is arround 120GB.

i'm in a similar situation. I've a setup with gnbd (with "No Cluster" 
configuration) and i'm trying to switch to aoe, because i read that aoe 
should perform better.

Well, my tests indicate a very poor performance for aoe! I setup gnbd without 
great hassle and tweaking. Performance is as fast as the network and the 
disks can handle.

First, with aoe i had very poor performance. I had to use latest aoe module, 
latest vblade and a lot of network tweaking (jumbo frames, ..) to have nearly 
the performance from gnbd. OK, with reads i'm a litter faster with aoe.
But the drawback is that vblade needs a lot more cpu than gnbd_serv! So if you 
export more than one block device with aoe your performance will drop.


[..]
> Cons:
> - Using iSCSI also allows a much more seemless transition to a hardware
> shared storage solution later on
> - GNBD seems to be slower then ISCSI and lot of work that needs to be done
> for GNBD to reach its full speed potential.

Have you tested this? As far as i can tell (my tests are over two years old) 
iscsi needed much more cpu and had less performance than gnbd.


-- 
greetings

eMHa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080812/5e5d6a06/attachment.sig>

From ajeet.singh.raina at logica.com  Wed Aug 13 05:48:53 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Wed, 13 Aug 2008 11:18:53 +0530
Subject: [Linux-cluster] Cluster Setup for Video Application?
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A4B@in-ex004.groupinfra.com>

Hello Guys,

I need help of you guys in setting up of Video Application Cluster.I
need to setup OSCAR Cluster in which two Nodes has to be in cluster and
video application should be running actively on one server.When one
server goes down other should be failedover with out any interruption to
the customer.

Any hint how can I setup the same?




This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/566ba711/attachment.htm>

From fdinitto at redhat.com  Wed Aug 13 06:55:00 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 13 Aug 2008 08:55:00 +0200 (CEST)
Subject: [Linux-cluster] Re: [Cluster-devel] HA Cluster Developer Summit
 2008: call for participants
In-Reply-To: <1216700198.26234.36.camel@daitarn-fedora.int.fabbione.net>
References: <1216700198.26234.36.camel@daitarn-fedora.int.fabbione.net>
Message-ID: <Pine.LNX.4.64.0808130848500.16551@trider-g7>


Hi everybody,

I am pleased to announce that the we have achieved quorum and that the 
Summit is confirmed to take place as previously announced.

I urge all the participant to update the wiki page (1), specially if you 
need help with the hotel room.

Please make sure to do it as soon as possible because another event will 
take place in Prague at the same time and almost all hotels are alredy 
full.

Details on transport, food and all the arrangements will be posted on the 
wiki _only_. So keep an eye for them. For any question please mail me any 
time.

Regards
Fabio

(1) http://sources.redhat.com/cluster/wiki/ClusterSummit2008

--
I'm going to make him an offer he can't refuse.



From brettcave at gmail.com  Wed Aug 13 09:32:34 2008
From: brettcave at gmail.com (Brett Cave)
Date: Wed, 13 Aug 2008 11:32:34 +0200
Subject: [Linux-cluster] Re: How to configure a cluster to remain up in the
	event of node failure
In-Reply-To: <c0773fd30808120759g258168d1u99d8bed8b22da32d@mail.gmail.com>
References: <c0773fd30808120759g258168d1u99d8bed8b22da32d@mail.gmail.com>
Message-ID: <c0773fd30808130232k390dddbbgb582464fa8918e84@mail.gmail.com>

I think I found a problem with the way it starts up...  See just below
the startup output for more info...

On Tue, Aug 12, 2008 at 4:59 PM, Brett Cave <brettcave at gmail.com> wrote:
> With a 3node gfs1 cluster, and if i hard reset 1 node, it hangs on
> startup, although the cluster seems to return to normal.
> Nodes: node2, node3, node4
> each node has 1 vote, and a qdisk has 2 votes.
>
> If I reset node3, gfs on node2 and node4 is blocked while node3
> restarts. First question: is there a config that will allow the
> cluster to continue operating while 1 node is down? My quorum is 3 and
> total votes is 4 while node3 is restarting, but my gfs mountpoints are
> inaccessible until my cman services start up on node3.
>
> Secondly, when node3 restarts, it hangs when trying to remount gfs file systems.
> Starting cman
> Mounting configfs...done
> Starting ccsd...done
> Starting cman...done
> Starting daemons...done
> Starting fencing...done
>                   OK
> qdiskd        OK
>
> "Mounting other file systems..." OK
>
> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
> dlm: Using TCP for communications
> dlm: connecting to 2
> dlm: got connection to 2
> dlm: connecting to 2
> dlm: got connection from 4

Could this be the problem?

When GFS is set to auto start via chkconfig, it first tries to
connnect to 2, gets connection and then tries to connect to 2 again.
It gets a connection from 4, and hangs.

However, if I chkconfig --levels 3 gfs off and then run service gfs
start once system has booted, i get:
dlm: connecting to 2
dlm: got connection from 2
dlm: connecting to 4
dlm: got connection from 4
mounting gfs mountpoints.

This works exactly as expected - gfs mounts, and cluster is back to
normal. This means that for some reason, when gfs is starting as an
automatic boot service, it doesnt connect to nodes properly - trying
to connect to node2 twice, rather than node2 and then node4 as it
should.

Why would it be doing this? where would i start for troubleshooting
something like.

>
> After that, system just hangs.
>
> From nodes 2 & 4, i can run cman_tool, and everything shows that the
> cluster is up, except for some services:
> [root at node2 cache1]# cman_tool services
> type             level name     id       state
> fence            0     default  00010004 none
> [2 3 4]
> dlm              1     cache1   00010003 none
> [2 3 4]
> dlm              1     storage  00030003 none
> [2 4]
> gfs              2     cache1   00000000 none
> [2 3 4]
> gfs              2     storage  00020003 none
> [2 4]
>
> [root at node2 cache1]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>   0   M      0   2008-08-12 16:11:46  /dev/sda5
>   2   M    336   2008-08-12 16:11:12  node2
>   3   M    352   2008-08-12 16:44:31  node3
>   4   M    344   2008-08-12 16:11:12  node4
>
> I have 2 gfs partitions
> [root at node4 CentOS]# grep gfs /etc/fstab
> /dev/sda1       /gfs/cache1                     gfs     defaults
>                         0 0
> /dev/sda2       /gfs/storage                    gfs     defaults
>                         0 0
>
>
> At this point, I am unable to unmount /gfs/cache1 from any of my nodes
> (node2 or node4) - it just hangs. I can unmount storage with no
> problem.
>
> Is there something I am overlooking? Any and all advice welcome :)
>
> Regards,
> Brett
>



From ccaulfie at redhat.com  Wed Aug 13 10:09:13 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 13 Aug 2008 11:09:13 +0100
Subject: [Linux-cluster] Re: How to configure a cluster to remain up in
	the	event of node failure
In-Reply-To: <c0773fd30808130232k390dddbbgb582464fa8918e84@mail.gmail.com>
References: <c0773fd30808120759g258168d1u99d8bed8b22da32d@mail.gmail.com>
	<c0773fd30808130232k390dddbbgb582464fa8918e84@mail.gmail.com>
Message-ID: <48A2B2C9.2040600@redhat.com>

Brett Cave wrote:
> I think I found a problem with the way it starts up...  See just below
> the startup output for more info...
> 
> On Tue, Aug 12, 2008 at 4:59 PM, Brett Cave <brettcave at gmail.com> wrote:
>> With a 3node gfs1 cluster, and if i hard reset 1 node, it hangs on
>> startup, although the cluster seems to return to normal.
>> Nodes: node2, node3, node4
>> each node has 1 vote, and a qdisk has 2 votes.
>>
>> If I reset node3, gfs on node2 and node4 is blocked while node3
>> restarts. First question: is there a config that will allow the
>> cluster to continue operating while 1 node is down? My quorum is 3 and
>> total votes is 4 while node3 is restarting, but my gfs mountpoints are
>> inaccessible until my cman services start up on node3.
>>
>> Secondly, when node3 restarts, it hangs when trying to remount gfs file systems.
>> Starting cman
>> Mounting configfs...done
>> Starting ccsd...done
>> Starting cman...done
>> Starting daemons...done
>> Starting fencing...done
>>                   OK
>> qdiskd        OK
>>
>> "Mounting other file systems..." OK
>>
>> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
>> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
>> dlm: Using TCP for communications
>> dlm: connecting to 2
>> dlm: got connection to 2
>> dlm: connecting to 2
>> dlm: got connection from 4
> 
> Could this be the problem?

Yes, that's bad! You should only get one "connecting to" message per 
node. If you're getting two it looks like the connection is being closed 
by the remote node for some reason. Are there any messages on node 2 
that might give a clue as to what's happening ?


Chrissie



From brettcave at gmail.com  Wed Aug 13 11:18:03 2008
From: brettcave at gmail.com (Brett Cave)
Date: Wed, 13 Aug 2008 13:18:03 +0200
Subject: [Linux-cluster] Re: How to configure a cluster to remain up in
	the event of node failure
In-Reply-To: <48A2B2C9.2040600@redhat.com>
References: <c0773fd30808120759g258168d1u99d8bed8b22da32d@mail.gmail.com>
	<c0773fd30808130232k390dddbbgb582464fa8918e84@mail.gmail.com>
	<48A2B2C9.2040600@redhat.com>
Message-ID: <c0773fd30808130418l7891b731p6fa04708c74f0ba4@mail.gmail.com>

Hi Christine, thanks for the feedback (and while im thanking you, also
for the programming locking applications book :)

On Wed, Aug 13, 2008 at 12:09 PM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
>> I think I found a problem with the way it starts up...  See just below
>> the startup output for more info...
>>> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
>>> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
>>> dlm: Using TCP for communications
>>> dlm: connecting to 2
>>> dlm: got connection to 2
>>> dlm: connecting to 2
>>> dlm: got connection from 4
>>
>> Could this be the problem?
>
> Yes, that's bad! You should only get one "connecting to" message per node.
> If you're getting two it looks like the connection is being closed by the
> remote node for some reason. Are there any messages on node 2 that might
> give a clue as to what's happening ?

That was it. qdiskd service was not running on all nodes, and I had
restarted it a few times. In addition to that, I had run a few config
updates with ccs_tool and also cman_tool expected 4 to lower my
quorum, as cluster was locking up due to loosing it. Obviously, it was
loosing quorum because the qdiskd service was not running and the
cluster was 2 votes short. "Cluster is not quorate, refusing
connection" in node2 logs.  Eventually had to restart the entire
cluster to get things running, seems that gfs does not recover that
well once it looses quorum.

* 1st rule of troubleshooting - check the logs.



From theophanis_kontogiannis at yahoo.gr  Wed Aug 13 13:03:27 2008
From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis)
Date: Wed, 13 Aug 2008 16:03:27 +0300
Subject: [Linux-cluster] relation between cman and ais?
Message-ID: <009b01c8fd44$fbd28360$f3778a20$@gr>

Hello All,

 

Sorry for this but my mind has stopped completely and googling gives results
I cannot process right now!!!

 

AIS should start prior to cman or vice versa?

 

Thank you All and I apologize for the trivial question.

 

Theophanis Kontogiannis

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/ee4713d7/attachment.htm>

From ajeet.singh.raina at logica.com  Wed Aug 13 13:18:21 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Wed, 13 Aug 2008 18:48:21 +0530
Subject: [Linux-cluster] Managing Cluster without GUI
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A58@in-ex004.groupinfra.com>

Hi,

X Window system is not installed on my RHEL machine(as it is clustomized
machine).Can I see configure and add service through command line.

I am facing problem installing X Window in RHEL.
Pls Advise


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/49115a3c/attachment.htm>

From ajeet.singh.raina at logica.com  Wed Aug 13 13:18:21 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Wed, 13 Aug 2008 18:48:21 +0530
Subject: [Linux-cluster] Managing Cluster without GUI
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A58@in-ex004.groupinfra.com>

Hi,

X Window system is not installed on my RHEL machine(as it is clustomized
machine).Can I see configure and add service through command line.

I am facing problem installing X Window in RHEL.
Pls Advise


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/49115a3c/attachment-0001.htm>

From ccaulfie at redhat.com  Wed Aug 13 13:23:54 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 13 Aug 2008 14:23:54 +0100
Subject: [Linux-cluster] relation between cman and ais?
In-Reply-To: <009b01c8fd44$fbd28360$f3778a20$@gr>
References: <009b01c8fd44$fbd28360$f3778a20$@gr>
Message-ID: <48A2E06A.4040204@redhat.com>

Theophanis Kontogiannis wrote:
> Hello All,
> 
>  
> 
> Sorry for this but my mind has stopped completely and googling gives 
> results I cannot process right now!!!
> 
>  
> 
> AIS should start prior to cman or vice versa?
> 
>
cman starts openais. You should not try to start both.

If you want the details try this:

http://people.redhat.com/ccaulfie/docs/aiscman.pdf
-- 

Chrissie



From swoehrl at mpe.mpg.de  Wed Aug 13 13:49:51 2008
From: swoehrl at mpe.mpg.de (Sebastian Woehrl)
Date: Wed, 13 Aug 2008 15:49:51 +0200
Subject: [Linux-cluster] CentOS + Conga: luci shows incorrect service status
	+ xen vm service fails
Message-ID: <200808131549.51284.swoehrl@mpe.mpg.de>

Hello,

I am currently experimenting with Conga on CentOS 5.2 with two cluster nodes 
(centos1, centos2) and a management machine (centos0).

I created a service(service1) with a script resource. But when I try to start 
the service through the luci webinterface on centos0 the service is always 
shown as stopped afterwards but the service is running and I can control it 
through clusvcadm on the nodes. Also running clustat on the nodes confirms 
the service is running.

I also tried to create a virtual service (mpevm1). For this I created a xen vm 
with a config file and disk file on a nfs mount accessible by both nodes and 
added it as a virtual service to the cluster. But when I try to start the 
service it fails. 
/var/log/messages shows:
Aug 13 15:32:50 centos2 clurgmgrd[17230]: <notice> Starting stopped service 
vm:mpevm1
Aug 13 15:32:50 centos2 clurgmgrd[17230]: <notice> start on vm "mpevm1" 
returned 1 (generic error)
Aug 13 15:32:50 centos2 clurgmgrd[17230]: <warning> #68: Failed to start 
vm:mpevm1; return value: 1
Aug 13 15:32:50 centos2 clurgmgrd[17230]: <notice> Stopping service vm:mpevm1
Aug 13 15:32:56 centos2 clurgmgrd[17230]: <notice> Service vm:mpevm1 is 
recovering

The same if I try to start it using clusvcadm. Xend is configured for 
relocation and manually starting the xen vm with "xm create" works. Also 
migrating with "xm migrate" is successful. It just doesn't work using the 
conga tools.

Does anyone have any ideas on these two problems?

Greetings
Sebastian Woehrl


PS: My cluster.conf:

<?xml version="1.0"?>
<cluster alias="cluster1" config_version="21" name="cluster1">
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
    <clusternode name="centos1.mpe.mpg.de" nodeid="1" votes="1">
        <fence/>
    </clusternode>
    <clusternode name="centos2.mpe.mpg.de" nodeid="2" votes="1">
        <fence/>
    </clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices/>
<rm>
    <failoverdomains>
        <failoverdomain name="failover1" nofailback="0" ordered="1" 
restricted="1">
           <failoverdomainnode name="centos1.mpe.mpg.de" priority="1"/>
           <failoverdomainnode name="centos2.mpe.mpg.de" priority="2"/>
        </failoverdomain>
    </failoverdomains>
    <resources/>
    <service autostart="0" domain="failover1" exclusive="1" name="service1" 
recovery="relocate">
        <script file="/shell0" name="shell0"/>
    </service>
    <vm autostart="0" domain="failover1" exclusive="1" migrate="live" 
name="mpevm1" path="/var/xen/mpevm1" recovery="relocate"/>
</rm>
</cluster>





From ajeet.singh.raina at logica.com  Wed Aug 13 14:22:45 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Wed, 13 Aug 2008 19:52:45 +0530
Subject: [Linux-cluster] Managing Cluster without GUI
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A59@in-ex004.groupinfra.com>

I want to add a new service called nfsmount on the redhat cluster and
dont see any option.I should be able to perform all those activities
which could be done through system-config-cluster.


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/219d95e6/attachment.htm>

From cedwards at smartechcorp.net  Wed Aug 13 15:44:33 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Wed, 13 Aug 2008 11:44:33 -0400
Subject: [Linux-cluster] Fencing
Message-ID: <334701c8fd5b$7b7beeb0$7273cc10$@net>

Does anyone know where would I find a good technical explanation on how
fencing works and how to set it up?

 

Thanks!

 

 

---

 

Chris Edwards
Smartech Corp.
Div. of AirNet Group

 <http://www.airnetgroup.com/> http://www.airnetgroup.com

 <http://www.smartechcorp.net/> http://www.smartechcorp.net

 <mailto:agarrison at smartechcorp.net> cedwards at smartechcorp.net
P:  423-664-7678 x114

C:  423-593-6964

F:  423-664-7680

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/ea1aa555/attachment.htm>

From rpeterso at redhat.com  Wed Aug 13 15:45:55 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Wed, 13 Aug 2008 10:45:55 -0500
Subject: [Linux-cluster] Fencing
In-Reply-To: <334701c8fd5b$7b7beeb0$7273cc10$@net>
References: <334701c8fd5b$7b7beeb0$7273cc10$@net>
Message-ID: <1218642355.9521.132.camel@technetium.msp.redhat.com>

On Wed, 2008-08-13 at 11:44 -0400, Chris Edwards wrote:
> Does anyone know where would I find a good technical explanation on
> how fencing works and how to set it up?
> 
>  
> 
> Thanks!

Hi Chris,

Here's a place to start:

http://sources.redhat.com/cluster/wiki/FAQ/Fencing

Regards,

Bob Peterson
Red Hat Clustering & GFS




From cedwards at smartechcorp.net  Wed Aug 13 17:41:44 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Wed, 13 Aug 2008 13:41:44 -0400
Subject: [Linux-cluster] Fencing
In-Reply-To: <1218642355.9521.132.camel@technetium.msp.redhat.com>
References: <334701c8fd5b$7b7beeb0$7273cc10$@net>
	<1218642355.9521.132.camel@technetium.msp.redhat.com>
Message-ID: <001801c8fd6b$da7dbaa0$8f792fe0$@net>

Awesome!  That was just what I needed to read.  When a server in the cluster
gets fenced how do you debug it?  How do you find out the nodes status?  And
how do you get a fenced node to rejoin the cluster?

Thanks for your help!

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson
Sent: Wednesday, August 13, 2008 11:46 AM
To: linux clustering
Subject: Re: [Linux-cluster] Fencing

On Wed, 2008-08-13 at 11:44 -0400, Chris Edwards wrote:
> Does anyone know where would I find a good technical explanation on
> how fencing works and how to set it up?
> 
>  
> 
> Thanks!

Hi Chris,

Here's a place to start:

http://sources.redhat.com/cluster/wiki/FAQ/Fencing

Regards,

Bob Peterson
Red Hat Clustering & GFS


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From brettcave at gmail.com  Wed Aug 13 18:17:58 2008
From: brettcave at gmail.com (Brett Cave)
Date: Wed, 13 Aug 2008 20:17:58 +0200
Subject: [Linux-cluster] node joining still not working 100% in 3node cluster
Message-ID: <c0773fd30808131117m438ea2a8j60319b14428d86f0@mail.gmail.com>

Thought I had it worked out, but things still not working 100%.

Setup: 3node gfs cluster, each node has 1 vote, and quorum disk has 2
votes. Cluster is up and running with no problem. I then reboot 1
node. For troubleshooting purposes, I turned gfs off from default
startup so i can start it manually (cman and qdiskd is still
automatically started).
nodes are 2,3 and 4. Node2 is being restarted.

The logs all show node leaving successfully. Quorum is 3, expected
votes 5, total votes 4 (once node has shut down).
Node2 restarted, cman and qdiskd starts up. at this point, cluster
services show everything back to normal. Output from cman_tool status
on all 3 nodes is the same, with no errors (output abbreviated here).

# cman_tool status
Config Version: 5
Nodes: 3
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7

However, when I run service gfs start (or try and mount my first gfs
volume), it just hangs. My logs on node 2 show the following:
Aug 13 19:51:42 blade2 gfs_controld[2825]: retrieve_plocks: ckpt open
error 12 cache1
Aug 13 19:51:42 blade2 kernel: GFS 0.1.19-7.el5 (built Nov 12 2007
14:43:37) installed
Aug 13 19:51:42 blade2 kernel: Trying to join cluster "lock_dlm",
"jemdevcluster:cache1"
Aug 13 19:51:42 blade2 kernel: dlm: Using TCP for communications
Aug 13 19:51:42 blade2 kernel: dlm: got connection from 3
Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4
Aug 13 19:51:42 blade2 kernel: dlm: got connection from 4
Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4

At this point, mount.gfs just hangs. Restarting node2 causes the same
thing to happen over and over, and am not able to get the 2 gfs
volumes mounted. Nodes3 & 4 can still access the filesystem however.

After a 2nd reboot, my logs show...
Aug 13 20:13:08 blade2 qdiskd[2873]: <info> Node 3 is the master
Aug 13 20:13:09 blade2 gfs_controld[2834]: retrieve_plocks: ckpt open
error 12 cache1
Aug 13 20:13:09 blade2 kernel: GFS 0.1.19-7.el5 (built Nov 12 2007
14:43:37) installed
Aug 13 20:13:09 blade2 kernel: Trying to join cluster "lock_dlm",
"jemdevcluster:cache1"
Aug 13 20:13:09 blade2 kernel: dlm: Using TCP for communications
Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3
Aug 13 20:13:09 blade2 kernel: dlm: got connection from 3
Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3
Aug 13 20:13:09 blade2 kernel: dlm: got connection from 4

Both 3 & 4 show the following in the logs:
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] Members Joined:
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 13 20:14:17 blade4 openais[2554]: [SYNC ] This node is within the
primary component and will provide service.
Aug 13 20:14:17 blade4 openais[2554]: [TOTEM] entering OPERATIONAL state.
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] got nodejoin message
192.168.70.102
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] got nodejoin message
192.168.70.103
Aug 13 20:14:17 blade4 openais[2554]: [CLM  ] got nodejoin message
192.168.70.104
Aug 13 20:14:17 blade4 openais[2554]: [CPG  ] got joinlist message from node 4
Aug 13 20:14:17 blade4 openais[2554]: [CPG  ] got joinlist message from node 3
Aug 13 20:14:33 blade4 kernel: dlm: connecting to 2

Surely node2 should connect to 3, get connection from 3 and then
connect to 4 and get connection from 4?
Could this possibly be a gfs bug?

Brett



From jparsons at redhat.com  Wed Aug 13 19:27:12 2008
From: jparsons at redhat.com (jim parsons)
Date: Wed, 13 Aug 2008 15:27:12 -0400
Subject: [Linux-cluster] Managing Cluster without GUI
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B17A58@in-ex004.groupinfra.com>
References: <0139539A634FD04A99C9B8880AB70CB209B17A58@in-ex004.groupinfra.com>
Message-ID: <1218655632.3426.3.camel@localhost.localdomain>

On Wed, 2008-08-13 at 18:48 +0530, Singh Raina, Ajeet wrote:
> Hi,
> 
> X Window system is not installed on my RHEL machine(as it is
> clustomized machine).Can I see configure and add service through
> command line.
> 
> I am facing problem installing X Window in RHEL.
> 
> Pls Advise
Look into Conga for administering your cluster. Then you can use a
browser to manage it.

-j
> 




From jparsons at redhat.com  Wed Aug 13 19:55:35 2008
From: jparsons at redhat.com (jim parsons)
Date: Wed, 13 Aug 2008 15:55:35 -0400
Subject: [Linux-cluster] Managing Cluster without GUI
In-Reply-To: <1218655632.3426.3.camel@localhost.localdomain>
References: <0139539A634FD04A99C9B8880AB70CB209B17A58@in-ex004.groupinfra.com>
	<1218655632.3426.3.camel@localhost.localdomain>
Message-ID: <1218657335.3426.5.camel@localhost.localdomain>

On Wed, 2008-08-13 at 15:27 -0400, jim parsons wrote:
> On Wed, 2008-08-13 at 18:48 +0530, Singh Raina, Ajeet wrote:
> > Hi,
> > 
> > X Window system is not installed on my RHEL machine(as it is
> > clustomized machine).Can I see configure and add service through
> > command line.
> > 
> > I am facing problem installing X Window in RHEL.
> > 
> > Pls Advise
> Look into Conga for administering your cluster. Then you can use a
> browser to manage it.

That is, a browser from any machine that can see your cluster network :)
> 
> -j
> > 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From gregory at steulet.org  Wed Aug 13 21:12:31 2008
From: gregory at steulet.org (gregory steulet)
Date: Wed, 13 Aug 2008 23:12:31 +0200
Subject: [Linux-cluster] Compiling cluster-2.03.06 on a 64 bits arch
Message-ID: <1218661951-2d2653a9313032c43ec35049ca6c9872@steulet.org>

Hi folks,

My configuration :
-------------------

- Linux emperor02  2.6.26 #1 SMP Tue Aug 12 12:50:06 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux

- openais-0.80.3-15.el5

What I try to do :
-------------------

I'm currently trying to install cluster-2.03.06 because I'm facing the problem with lock_dlm of gfs_controld.

My problem :
---------------

My problem is the follwing after making :

cd /home/src/cluster-2.03.06 
./configure --kernel_src=/home/src/linux-2.6.26 --libdir=/usr/lib64
make install

I got lot of this kind of errors :

/home/src/cluster-2.03.06/gfs-kernel/src/gfs/dio.c:72: warning: format expects type long unsigned int, but argument 3 has type long long unsigned int
/home/src/cluster-2.03.06/gfs-kernel/src/gfs/dio.c:83: warning: format expects type long unsigned int, but argument 4 has type u64
...
/home/src/cluster-2.03.06/gfs-kernel/src/gfs/dio.c:109: warning: format expects type long unsigned int, but argument 4 has type int64

and finish with :

/home/src/cluster-2.03.06/cman/daemon/daemon.c:514: error: too many arguments to function ?open_local_sock?
make[2]: *** [daemon.o] Error 1
make[2]: Leaving directory `/home/src/cluster-2.03.06/cman/daemon'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/home/src/cluster-2.03.06/cman'
make: *** [cman] Error 2

Any help would be really appreciated ! Best regards

Greg



 

 


 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080813/aa86bb02/attachment.htm>

From chris-m-lists at joelly.net  Wed Aug 13 21:13:32 2008
From: chris-m-lists at joelly.net (Chris Joelly)
Date: Wed, 13 Aug 2008 23:13:32 +0200
Subject: [Linux-cluster] GFS vs. GFS2: system-config-cluster, locking ...
Message-ID: <20080813211332.GA10496@joysn.joelly.net>


Hello,

i'm playing with RHCS which is delivered with ubuntu 8.04 and i tried to
use GFS2. But i saw that using system-config-cluster it is not possible
to configure a GFS2 resource for use with a service. Sure i can change
the fstype manually in cluster.conf, but i think as GFS2 is still
considered experimental it should not be considered for a production
machine.

When is GFS2 planned to be used in production scenarios?

As i read long ago and if i remember correct for GFS there was a lock
server required (gulm, or similar...)

Is this still true? How do i correctly install GFS? Is GFS able to use
lock_dlm?

And whats with GFS kernel module depends on GFS2 module as seen in:

root at srv02:/storage# modinfo gfs
filename:       /lib/modules/2.6.24-19-server/ubuntu/fs/gfs/gfs.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System <CVS>
srcversion:     4695C79080CC395FA4E74E5
depends:        gfs2
vermagic:       2.6.24-19-server SMP mod_unload 686 
root at srv02:/storage#

Does this mean that GFS is using the locking infrastructure of GFS2
already?

Or is Ubuntu 8.04 somehow inconsistent regarding RHCS?

Chris



From ajeet.singh.raina at logica.com  Thu Aug 14 05:27:05 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 14 Aug 2008 10:57:05 +0530
Subject: [Linux-cluster] Adding Service to Cluster through Command Line..
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A5A@in-ex004.groupinfra.com>

Hello Guys,

I am in verse to add service to cluster through only command line.
My Clustat output is:
[code]
edwin2# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  edwin1-cluster                           Online, rgmanager
  edwin2-cluster                           Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  bang                 edwin2-cluster                 started

[/code]

I have been provided with the script called cluster_python.sh which has
the following options:
[code]
edwin2# python cluster.py
Syntax: cluster.py <command> <args>
        getversion
        incrementversion
        setversion <version>
        listfodomains
        listservices
        listfodnodes <domain_name>
        addservice <service_name> <autostart> <domain>
        removeservice <service_name>
        addserviceip <service_name> <ip_address> <monitor_link>
        addservicescript <service_name> <script_name> <script_file>
        addservicefs <service_name> <device> <mountpoint>
<force_unmount=1/0> <fs_name> <fs_type>
        addfodomain <domain_name> <restricted/unrestricted>
<ordered/unordered>
        addfodnode <domain_name> <node_name> <node_priority>
        addquorumd <interval> <tko> <votes> <label> <log_level>
<log_facility> <device>
        changelogging <log_level> <log_facility>
        setfencedcleanstart
        addhardwarefencing
edwin2#
[/code]

Now All I want is to include NFS Service be created for my project
work.I need 3 conditions- Status, Start and Stop process so I included
it in one script and ran the command:
[code]




This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080814/e1f08357/attachment.htm>

From ajeet.singh.raina at logica.com  Thu Aug 14 05:33:34 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 14 Aug 2008 11:03:34 +0530
Subject: [Linux-cluster] Adding Service to Cluster..
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A5B@in-ex004.groupinfra.com>

Hi Guys,

I want to add a service on already setup red hat cluster. Due to no X
window installed on my machine,I am working completely on text mode.
All My Clustat says:

[terminal]
edwin2# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  edwin1-cluster                           Online, rgmanager
  edwin2-cluster                           Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  bang                 edwin2-cluster                 started
[terminal]

I have been provided a script by cluster experts which will help me in
including service and script to the cluster.iT has the following
options:
{Terminal}

Syntax:  ../cluster/cluster.py <command> <args>
        getversion
        incrementversion
        setversion <version>
        listfodomains
        listservices
        listfodnodes <domain_name>
        addservice <service_name> <autostart> <domain>
        removeservice <service_name>
        addserviceip <service_name> <ip_address> <monitor_link>
        addservicescript <service_name> <script_name> <script_file>
        addservicefs <service_name> <device> <mountpoint>
<force_unmount=1/0> <fs_name> <fs_type>
        addfodomain <domain_name> <restricted/unrestricted>
<ordered/unordered>
        addfodnode <domain_name> <node_name> <node_priority>
        addquorumd <interval> <tko> <votes> <label> <log_level>
<log_facility> <device>
        changelogging <log_level> <log_facility>
        setfencedcleanstart
        addhardwarefencing
edwin1# ../cluster/cluster.py listservices
bang
nfs
edwin1# vi nfsdeamon.sh
edwin1# ../cluster/cluster.py addservicescript nfs nfscontrol
nfsdeamon.sh
Script has been added successfully.
edwin1# clustat


I added nfs script written ownself.
I too added the service before that.
Now When I run 
#../cluster/cluster.py listfoservices
Bang
Nfs

It means that service is successfully added with the script.

Now when I see clustat I acnt see NFS Service being displayed.
That's Why its difficult for me to run the following command instead:

[code]
edwin2# clusvcadm -r nfs -m edwin2-cluster
Trying to relocate nfs to edwin2-cluster...failed: Service does not
exist
[/code]


What should I do to include nfs in clustat.


Pls help



This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080814/1a040901/attachment.htm>

From brettcave at gmail.com  Thu Aug 14 05:48:23 2008
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 14 Aug 2008 07:48:23 +0200
Subject: [Linux-cluster] GFS vs. GFS2: system-config-cluster, locking ...
In-Reply-To: <20080813211332.GA10496@joysn.joelly.net>
References: <20080813211332.GA10496@joysn.joelly.net>
Message-ID: <c0773fd30808132248l66bdf043x7b03236f2144597f@mail.gmail.com>

On Wed, Aug 13, 2008 at 11:13 PM, Chris Joelly <chris-m-lists at joelly.net> wrote:
>
> Hello,
Hi

>
> As i read long ago and if i remember correct for GFS there was a lock
> server required (gulm, or similar...)

GULM is 1 option, DLM is another. I have been using lock_dlm
(distributed lock manager), along with cman (Cluster manager). It
ships standard with Centos, and also have it running on Gentoo.

>
> Is this still true? How do i correctly install GFS? Is GFS able to use
> lock_dlm?

Yes. Dont need a GULM server if you are using dlm though.

> Does this mean that GFS is using the locking infrastructure of GFS2
> already?
>
> Or is Ubuntu 8.04 somehow inconsistent regarding RHCS?

Centos modinfo on gfs also shows gfs2 as dependancy. locking is a
seperate service from gfs, and from what I understand, both gfs and
gfs2 use locking (lock_dlm). Not sure why there is this module
dependancy. Have no idea why there is this depedancy though.



From brettcave at gmail.com  Thu Aug 14 05:50:37 2008
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 14 Aug 2008 07:50:37 +0200
Subject: [Linux-cluster] relation between cman and ais?
In-Reply-To: <009b01c8fd44$fbd28360$f3778a20$@gr>
References: <009b01c8fd44$fbd28360$f3778a20$@gr>
Message-ID: <c0773fd30808132250s7b20efar75d8202da89f8403@mail.gmail.com>

2008/8/13 Theophanis Kontogiannis <theophanis_kontogiannis at yahoo.gr>:
> Hello All,

> Sorry for this but my mind has stopped completely and googling gives results
> I cannot process right now!!!
>
>
>
> AIS should start prior to cman or vice versa?

If you have openais service installed, and start it prior to starting
cman, your cluster will not start up. Spent a few hours trying to
figure out what the problem was with this one.

Brett



From ajeet.singh.raina at logica.com  Thu Aug 14 08:35:55 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 14 Aug 2008 14:05:55 +0530
Subject: FW: [Linux-cluster] Managing Cluster without GUI
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A5C@in-ex004.groupinfra.com>


Hi,

The Congo Setup is ready.I installed ricci and luci package and included
on service called Bang on that.
Now I want to add NFS as service but I can see 3 options:
1.	NFS Mount
2.	NFS Export
3.	NFS Client

Can anyone help me understanding those options under the interface.





This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080814/4633e4b7/attachment.htm>

From jakub.suchy at enlogit.cz  Thu Aug 14 08:42:59 2008
From: jakub.suchy at enlogit.cz (Jakub Suchy)
Date: Thu, 14 Aug 2008 10:42:59 +0200 (CEST)
Subject: [Linux-cluster] rgmanager timeout of script.sh
Message-ID: <32628.217.77.161.17.1218703379.squirrel@www.hal9000.cz>

Hello,
i am trying to solve a problem with timeouting a status script.

We are using custom init script for our service, which is doing some
operations in status section. However, one of the options when  this
service is down is, that it hangs. Then, a status script may hang too
because it is waiting for an operation to complete.

Is there a way to instruct rgmanager to automatically treat the service as
failed if statut script doesn't return OK in say 30 seconds?

I've found /usr/share/cluster/script.sh with:

<action name="status" interval="30s" timeout="0"/>

but i don't really want to mess with this as it's a part of cluster
scripts (means red hat support). anyway, i tried timeout="30" but it
didn't work (guess "30s" is correct?)

Also, I don't like doing something like:
status: nohup operation &;
test if completed ...


Thank you,
Jakub



From brettcave at gmail.com  Thu Aug 14 11:06:13 2008
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 14 Aug 2008 13:06:13 +0200
Subject: [Linux-cluster] gfs cluster with qdisk and no heuristics
Message-ID: <c0773fd30808140406q6acc5ebby1498ff1c6e03ea76@mail.gmail.com>

Hi,

My gfs cluster has a quorum disk, but the qdisk has no heuristics.
GFS is used for SAN storage - 6 nodes connected via FC to storage. The
qdisk has its own vdisk (san = hp eva).

Is it ok to run quorumd with no heuristics in this scenario?



From chris-m-lists at joelly.net  Thu Aug 14 11:54:22 2008
From: chris-m-lists at joelly.net (Chris Joelly)
Date: Thu, 14 Aug 2008 13:54:22 +0200
Subject: [Linux-cluster] Setting up samba resource with system-config-cluster
Message-ID: <20080814115422.GA5513@joysn.joelly.net>


Hello,

i'm setting up a Samba based fileserver and started using 
system-config-cluster. The initial setup goes fine and all is working as
expected. But when i played around with a Samba service i noticed that 
the smb.conf file is deleted and regenerated every time i remove and 
define a new Samba service. 

Also the behaviour of how the smb.conf file is written (interfaces, 
filesystems) depends on how the additional resources ip address and 
file system is added to the service.

Is there a description of system-config-cluster and its behaviour
besides the RH documentation?

As the scc capabilities are very limited, at least in the version Ubuntu 
8.04 ships (1.0.46), and Conga is not available nor an option, what is
the preferred way of managing the cluster config? 

Is it recommended to manually edit cluster.conf and then propagate the 
changes using ccs_tool? Is there a possibility to verify the cluster.conf 
file if it contains errors or should this be done using a validating XML 
parser? Is a XML schema or DTD available for the various cluster.conf 
versions and how can i determine which version of the cluster.conf schema 
is appropriate for which version of e.g. cman?

Chris



From ajeet.singh.raina at logica.com  Thu Aug 14 12:32:41 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 14 Aug 2008 18:02:41 +0530
Subject: [Linux-cluster] NFS Issue in Cluster??
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A60@in-ex004.groupinfra.com>

Hello Guys,

I have a doubt and I hope you people gonna Help me with.
I have been stucked with including NFS Resource and Services.
Let me tell you....I want to run NFS Service on one of the red hat
cluster node .ON failover to the second cluster node the NFS should come
up.
But I donno why the way to configure that has been so complicated
written.
All I can see is:

[code]
1.	NFS Mount 
		Name - Create a symbolic name for the NFS mount. 
		Mount Point - Choose the path to which the file system
resource is mounted. 
		Host - Specify the NFS server name. 
		Export Path - NFS export on the server. 
		NFS version - Specify NFS protocol: 
o	NFS3 - Specifies using NFSv3 protocol. The default setting is
NFS. 
o	NFS4 - Specifies using NFSv4 protocol. 
		Options - Mount options. For more information, refer to
the nfs(5) man page. 
		Force Unmount checkbox - If checked, forces the file
system to unmount. The default setting is unchecked. Force Unmount kills
all processes using the mount point to free up the mount when it tries
to unmount. 
			NFS Client 
		Name - Enter a name for the NFS client resource. 
		Target - Enter a target for the NFS client resource.
Supported targets are hostnames, IP addresses (with wild-card support),
and netgroups. 
		Options - Additional client access rights. For more
information, refer to the exports(5) man page, General Options 
				NFS Export 
		Name - Enter a name for the NFS export resource. 
Please Help me understanding what's the difference between NFS Mount,NFS
Export and NFS Client in this context.
I just want to do failover ie. When one node has NFS running then on
being stopped NFS Should be starting at the other end.


Pls Help


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080814/629cd5fe/attachment.htm>

From vikash at netvigator.com  Thu Aug 14 12:40:09 2008
From: vikash at netvigator.com (Vikash Khatuwala)
Date: Thu, 14 Aug 2008 20:40:09 +0800
Subject: [Linux-cluster] GFS leak glock
Message-ID: <20080814124022.LJGI2335.imsm058dat.netvigator.com@cch-vikash.netvigator.com>

Hi,

Does anyone have any idea what these messages mean? I see them in 
/var/log/messages and on the console too.

GFS: would leak glock ffff8101f09ef408
GFS: would leak glock ffff810185d22258
GFS: would leak glock ffff81015dd930a0
GFS: would leak glock ffff8101e4e184c8
GFS: would leak glock ffff81008785d580

Using CentOS 5.2 with all latest updates.

This message appears every few minutes and I am not sure what it 
means and how to resolve this if its a problem. The FS is gfs with 4 
journals but there is only 1 node at the moment. However the GFS 
seems to be working normally.

Thanks,
Vikash.



From fdinitto at redhat.com  Thu Aug 14 12:46:30 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 14 Aug 2008 14:46:30 +0200 (CEST)
Subject: [Linux-cluster] Cluster 2.03.07 released
Message-ID: <Pine.LNX.4.64.0808141442580.16551@trider-g7>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The cluster team and its vibrant community are proud to announce the 
2.03.07 release from the STABLE2 branch.

The STABLE2 branch collects, on a daily base, all bug fixes and the bare
minimal changes required to run the cluster on top of the most recent Linux
kernel (2.6.26) and rock solid openais (0.80.3).

The 2.03.07 has a hot fix for rgmanager daemon that was preventing it from 
starting up properly in certain environments.

The new source tarball can be downloaded here:

   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.07.tar.gz
   https://fedorahosted.org/releases/c/l/cluster/cluster-2.03.07.tar.gz

In order to use GFS1, the Linux kernel requires a minimal patch:

   ftp://sources.redhat.com/pub/cluster/releases/lockproto-exports.patch
   https://fedorahosted.org/releases/c/l/cluster/lockproto-exports.patch

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 2.03.06):

Bob Peterson (1):
       mkfs.gfs2: should have an optional fs size parm

Christine Caulfield (1):
       cman: fix uninitialised variable warning.

Fabio M. Di Nitto (2):
       cman: make sure not to umount configfs when there are other users
       rgmanager: unbreak locking in clulib

Lon Hohberger (2):
       [rgmanager] Fix resource agent metadata and un-break 'make check' target
       [rgmanager] Re-fix permissions bits broken in last commit

Ryan O'Hara (1):
       ip.sh: add sleeptime parameter

  cman/init.d/cman.in                           |    2 +-
  cman/qdisk/main.c                             |    2 +-
  gfs2/mkfs/main_mkfs.c                         |   19 +++++--
  rgmanager/src/clulib/cman.c                   |    8 ++--
  rgmanager/src/resources/Makefile              |    4 +-
  rgmanager/src/resources/SAPDatabase           |    4 +-
  rgmanager/src/resources/SAPInstance           |    4 +-
  rgmanager/src/resources/apache.metadata       |    2 +-
  rgmanager/src/resources/ip.sh                 |   19 +++++++-
  rgmanager/src/resources/lvm.metadata          |    2 +-
  rgmanager/src/resources/lvm.sh                |    2 +-
  rgmanager/src/resources/mysql.metadata        |    2 +-
  rgmanager/src/resources/named.metadata        |    2 +-
  rgmanager/src/resources/openldap.metadata     |    2 +-
  rgmanager/src/resources/oracledb.sh           |   67 +++++++++++++++++-------
  rgmanager/src/resources/postgres-8.metadata   |    2 +-
  rgmanager/src/resources/ra-api-1-modified.dtd |    7 ++-
  rgmanager/src/resources/samba.metadata        |    2 +-
  rgmanager/src/resources/service.sh            |    2 +-
  rgmanager/src/resources/tomcat-5.metadata     |    2 +-
  20 files changed, 106 insertions(+), 50 deletions(-)

- --
I'm going to make him an offer he can't refuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQIVAwUBSKQpLAgUGcMLQ3qJAQJirg//UyPBfXzfjpdafGHpV3xfEakMW21C/gMk
LjBQZszDk51WgE2ty9S/gWYWpGvjRunE1BJ5qr2SXViOmp598Fgi346IosuiFQgv
RveWUE3fNzG9Jg8Jj1PBy+wHzSZRPa78DChWOxxIpSVezzL8sv94ssDdNxtn5UT0
YwiSm82qdg8xfoGxo6UCse+Z9Pws0lCA+1aLIK/XRelno66YZJITr7SEvvRdjifl
Kb9NfRp+ZqbHbRxL9gqy91FYt2T/Dkt/zQ99K0xbQ34nmJzaytFgM6o1CN0q8AzB
3TqPgl0WT55GuNahdVscPfKSpmGgLkgE80YXGFXHfCJ84MFewKCDwBgQ6NveFB9j
DpkLMuI0eCdjnv4pw1tU7qANBExsCWP+4xBCtz3gKLhTaEMw28IWqKey/6axmi7O
N1z+5zlgdOiiKZG7hUtREo9xML8zsR4wLeWmjQtMDIYu2Q66b8Gb6HcjqM7T/0f2
cy9i3jTUS6ctxkp7/15tJyzjTR+6Iq0mmQFJ2A30Bcf/l2sWd74/935J28HiOvzE
42y6glnwagbPfxBcmI2cVFCr8mMz/HC5IlvCUNmgXSuRI0GvrlZdJqsCLrw4i2xG
mvO109pbD6HYEW+hIbVQAYFbpk3AyfArz4NsAr46rfDjXWcoJpaAU5raCdTgBgkH
8FgkqVOBwGg=
=oSXp
-----END PGP SIGNATURE-----



From ajeet.singh.raina at logica.com  Thu Aug 14 13:14:27 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 14 Aug 2008 18:44:27 +0530
Subject: [Linux-cluster] RE: NFS Issue in Cluster??
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B17A60@in-ex004.groupinfra.com>
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A62@in-ex004.groupinfra.com>

Just be more Clear,

I have two Cluster Nodes: Edwin1 and Edwin2.
Edwin2 is NFS Server while Edwin1 is NFS Client.
Anyway,NFS will be running on Active/Active ie both are running NFS
simultanously.

On Edwin2 we can see the following configuration:

# df -h
10.227.167.5:/usr/bang-test/xml on /var/tmp/kunal type nfs
(rw,soft,addr=10.227.167.5)
/dev/sdj1 on /usr/bang-test type ext3 (rw)

[[[[Note: As for Now /dev/sdj1 which is a shared storage partition is
right now mounted on 
Edwin2.As We do Clusvcadm -r bang -m edwin1 command,it will relocate to
edwin1 node.]]]

I have mounted /usr/bang-test/xml to /var/tmp/kunal as seen above.
Corresponding to that I made the entry under NFS Mount on LUCI as:

NFS Mount Resource Configuration

Name  --->  NFSMount   
Mount point   ---> /var/tmp/kunal
Host   --> 10.227.169.3
Export path   --> 10.227.167.5:/usr/bang-test/xml
NFS version NFS3 -> <Default>
NFS4  
Options 

OK.


On Edwin1 Now, The mount point /dev/sdj1 couldnt be seen.(After failover
it will be seen)

#df -h

10.227.167.5:/usr/bang-test/xml
                      206G  360M  195G   1% /var/tmp/kunal

Now When I have written another Service Script called bang which mean
now I have two Script 
in hand.I have added both to the cluster.

As you see output of Clustat:

#clustat

edwin2# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  edwin1-cluster                           Online, rgmanager
  edwin2-cluster                           Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  bang                 edwin2-cluster                 started
  NFSCluster           edwin1-cluster                 started

So Now Both the service are added.When I now perform failover it doesnt
appear working.
Is there anything mischievious I am doing?

Pls Advise

_____________________________________________
From: Singh Raina, Ajeet 
Sent: Thursday, August 14, 2008 6:03 PM
To: 'linux clustering'
Cc: 'piyush yaduvanshi'
Subject: NFS Issue in Cluster??

Hello Guys,

I have a doubt and I hope you people gonna Help me with.
I have been stucked with including NFS Resource and Services.
Let me tell you....I want to run NFS Service on one of the red hat
cluster node .ON failover to the second cluster node the NFS should come
up.
But I donno why the way to configure that has been so complicated
written.
All I can see is:

[code]
1.	NFS Mount 
		Name - Create a symbolic name for the NFS mount. 
		Mount Point - Choose the path to which the file system
resource is mounted. 
		Host - Specify the NFS server name. 
		Export Path - NFS export on the server. 
		NFS version - Specify NFS protocol: 
o	NFS3 - Specifies using NFSv3 protocol. The default setting is
NFS. 
o	NFS4 - Specifies using NFSv4 protocol. 
		Options - Mount options. For more information, refer to
the nfs(5) man page. 
		Force Unmount checkbox - If checked, forces the file
system to unmount. The default setting is unchecked. Force Unmount kills
all processes using the mount point to free up the mount when it tries
to unmount. 
			NFS Client 
		Name - Enter a name for the NFS client resource. 
		Target - Enter a target for the NFS client resource.
Supported targets are hostnames, IP addresses (with wild-card support),
and netgroups. 
		Options - Additional client access rights. For more
information, refer to the exports(5) man page, General Options 
				NFS Export 
		Name - Enter a name for the NFS export resource. 
Please Help me understanding what's the difference between NFS Mount,NFS
Export and NFS Client in this context.
I just want to do failover ie. When one node has NFS running then on
being stopped NFS Should be starting at the other end.


Pls Help


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080814/c18b0abd/attachment.htm>

From rpeterso at redhat.com  Thu Aug 14 13:52:28 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 14 Aug 2008 08:52:28 -0500
Subject: [Linux-cluster] Re: corrupted GFS filesystem
In-Reply-To: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
Message-ID: <1218721948.9521.191.camel@technetium.msp.redhat.com>

On Wed, 2008-08-13 at 23:37 -0500, David Potterveld wrote:
> I have a corrupted GFS filesystem, and gfs_fsck is unable to fix it.
(snip)
> I have the feeling it's only slightly damaged, but I don't know the
> GFS disk structure, and I haven't a clue what to do next. The GFS
> filesystem was created with the default resource group size.
> 
> Any suggestions how to proceed? Is there any way to recover the
> filesystem? This is the storage for a mail server. I have a backup,
> but I would lose 24 hours of inbound mail, and I'd really hate to do
> that.
> 
> Thanks!
> David Potterveld   (davep at core.com)

Hi David,

Can you post the dmesgs or console messages received that reported
the original file system damage?

Also, you never said whether this was RHEL4/Centos4 or RHEL5/Centos5
or equivalent.  That's always helpful.

I can't even imagine what could have happened to delete all your
resource groups.  I've never seen that happen before.
Unless it was a hardware problem or if some other unprotected
process wiped out metadata from the file system.

If the resource groups disappear, that's big trouble.  That's
likely why it thought your root directory was gone.

>From the superblock, I can tell that the gfs_fsck quit abnormally.
So before you can mount again, you'll have to do:
gfs_tool sb /dev/VGsan0/lvsan0 proto "lock_dlm"

Unfortunately, since you ran gfs_fsck several times, you've probably
wiped out all information about how the file system got into that
condition, so we may never know how this happened.
You might have to restore from backup after doing gfs_mkfs again.

If you get a file system withdraw on gfs due to corruption, I
recommend that, before you run gfs_fsck, you save the metadata by
doing this:

gfs2_edit savemeta /dev/your/device /tmp/devicename.metadata

There is a RHEL4 version of the gfs2_edit tool on my people page,
which is here:

http://people.redhat.com/rpeterso/Experimental/RHEL4.x/

(This link contains both source code and 32-bit compiled code).

For RHEL5, I recommend compiling the gfs2_edit code from source
code, because it will be more up to date.  (Older versions of
gfs2_edit don't do as good a job saving GFS (1) metadata.)

The advantage of saving the metadata is that if something goes
wrong, you can restore the file system back to its original
pre-gfs_fsck condition.  Also, if you open up a bugzilla record
so we can try to solve the problem, we can use that metadata to
understand what's wrong with the file system and what happened
to corrupt it.  (Please remember to bzip2 the metadata before
sending it in; it will likely be too big to attach, so you
might need to put it onto a web site or ftp server).

In this case, saving your metadata is probably useless because
it's been changed so much from the original problem by your
running gfs_fsck so many times.

Regards,

Bob Peterson
Red Hat Clustering & GFS




From rpeterso at redhat.com  Thu Aug 14 13:59:45 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 14 Aug 2008 08:59:45 -0500
Subject: [Linux-cluster] GFS vs. GFS2: system-config-cluster, locking ...
In-Reply-To: <c0773fd30808132248l66bdf043x7b03236f2144597f@mail.gmail.com>
References: <20080813211332.GA10496@joysn.joelly.net>
	<c0773fd30808132248l66bdf043x7b03236f2144597f@mail.gmail.com>
Message-ID: <1218722385.9521.198.camel@technetium.msp.redhat.com>

Hi Brett,

On Thu, 2008-08-14 at 07:48 +0200, Brett Cave wrote:
> Centos modinfo on gfs also shows gfs2 as dependancy. locking is a
> seperate service from gfs, and from what I understand, both gfs and
> gfs2 use locking (lock_dlm). Not sure why there is this module
> dependancy. Have no idea why there is this depedancy though.

FYI--For 5.x, the locking infrastructure was common between GFS
and GFS2.  It has to do with the common lock harness "lock_dlm"
that is an interface between both GFS and GFS2 into the "dlm" module.
For 5.3, we're splitting the locking modules apart to get rid of
that dependency.

Regards,

Bob Peterson
Red Hat Clustering & GFS




From cedwards at smartechcorp.net  Thu Aug 14 14:25:01 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Thu, 14 Aug 2008 10:25:01 -0400
Subject: [Linux-cluster] VM Resource Failover
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B17A5C@in-ex004.groupinfra.com>
References: <0139539A634FD04A99C9B8880AB70CB209B17A5C@in-ex004.groupinfra.com>
Message-ID: <036f01c8fe19$89ce1100$9d6a3300$@net>

I have been trying to simulate a xen VM failover,  I have a 2 machine
cluster and 2 vm's running.  If I issue a " xm destroy ID"  the vm will
automatically reboot to the other node.  But if I reboot one of the clusters
to simulate a machine failure the vm never boots back up until the other
machine comes online.  So here are my questions.

 

1.       How do I get the cluster to boot the vm that has failed when one of
the clustered machines are down?

2.       When I do a "xm destroy ID" the cluster always reboots the vm onto
the other cluster machine, is there any way for me to have it boot back to
the machine its supposed to be running on without having to do a manual
migrate?   Can It auto-migrate back to its original machine over time?

 

 

Here is the out put of my clustat during a reboot of one of the clusters.

 

Cluster Status for Xen @ Thu Aug 14 10:11:21 2008

Member Status: Quorate

 Member Name                             ID   Status

 ------ ----                             ---- ------

 xen1.smartechcorp.net                       1 Online, Local, rgmanager

 xen2.smartechcorp.net                       2 Offline

 Service Name                   Owner (Last)                   State


 ------- ----                   ----- ------                   -----


 vm:Linux1                      xen2.smartechcorp.net          stopping


 vm:Windows1                    xen1.smartechcorp.net          started   

 

Here is my cluster.conf..

 

<?xml version="1.0"?>

<cluster alias="Xen" config_version="29" name="Xen">

        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="-1"/>

        <clusternodes>

                <clusternode name="xen1.smartechcorp.net" nodeid="1"
votes="1">

                        <fence>

                                <method name="1">

                                        <device name="manual"
nodename="xen1.smartechcorp.net"/>

                                </method>

                        </fence>

                </clusternode>

                <clusternode name="xen2.smartechcorp.net" nodeid="2"
votes="1">

                        <fence>

                                <method name="1">

                                        <device name="manual"
nodename="xen2.smartechcorp.net"/>

                                </method>

                        </fence>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices>

                <fencedevice agent="fence_manual" name="manual"/>

        </fencedevices>

        <rm>

                <failoverdomains>

                        <failoverdomain name="bias-xen1" nofailback="0"
ordered="1" restricted="0">

                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="1"/>

                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="2"/>

                        </failoverdomain>

                        <failoverdomain name="bias-xen2" nofailback="0"
ordered="1" restricted="0">

                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="2"/>

                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="1"/>

                        </failoverdomain>

                </failoverdomains>

                <resources/>

                <vm autostart="1" domain="bias-xen1" exclusive="0"
migrate="live" name="Windows1" path="/var/lib/xen/images"
recovery="relocate"/>

                <vm autostart="1" domain="bias-xen2" exclusive="0"
migrate="live" name="Linux1" path="/var/lib/xen/images"
recovery="relocate"/>

        </rm>

</cluster>

Thanks for any help, this is driving me crazy!

 

---

 

Chris Edwards
Smartech Corp.
Div. of AirNet Group

 <http://www.airnetgroup.com/> http://www.airnetgroup.com

 <http://www.smartechcorp.net/> http://www.smartechcorp.net

 <mailto:agarrison at smartechcorp.net> cedwards at smartechcorp.net
P:  423-664-7678 x114

C:  423-593-6964

F:  423-664-7680

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080814/725fd77c/attachment.htm>

From kanderso at redhat.com  Thu Aug 14 14:36:07 2008
From: kanderso at redhat.com (Kevin Anderson)
Date: Thu, 14 Aug 2008 09:36:07 -0500
Subject: [Linux-cluster] VM Resource Failover
In-Reply-To: <036f01c8fe19$89ce1100$9d6a3300$@net>
References: <0139539A634FD04A99C9B8880AB70CB209B17A5C@in-ex004.groupinfra.com>
	<036f01c8fe19$89ce1100$9d6a3300$@net>
Message-ID: <1218724567.4068.5.camel@dhcp80-204.msp.redhat.com>

Since you are using manual fencing, did you run fence_ack_manual after
killing the machine?  When the machine comes back up, the ack is
implied, but rgmanager will not be able to perform recovery operations
until fencing is complete.  Suggest you utilize a real fencing agent if
you want this to work seamlessly.

Kevin

On Thu, 2008-08-14 at 10:25 -0400, Chris Edwards wrote:
> I have been trying to simulate a xen VM failover,  I have a 2 machine
> cluster and 2 vm?s running.  If I issue a ? xm destroy ID?  the vm
> will automatically reboot to the other node.  But if I reboot one of
> the clusters to simulate a machine failure the vm never boots back up
> until the other machine comes online.  So here are my questions?
> 
>  
> 
> 1.      How do I get the cluster to boot the vm that has failed when
> one of the clustered machines are down?
> 
> 2.      When I do a ?xm destroy ID? the cluster always reboots the vm
> onto the other cluster machine, is there any way for me to have it
> boot back to the machine its supposed to be running on without having
> to do a manual migrate?   Can It auto-migrate back to its original
> machine over time?
> 
>  
> 
>  
> 
> Here is the out put of my clustat during a reboot of one of the
> clusters?
> 
>  
> 
> Cluster Status for Xen @ Thu Aug 14 10:11:21 2008
> 
> Member Status: Quorate
> 
>  Member Name                             ID   Status
> 
>  ------ ----                             ---- ------
> 
>  xen1.smartechcorp.net                       1 Online, Local,
> rgmanager
> 
>  xen2.smartechcorp.net                       2 Offline
> 
>  Service Name                   Owner (Last)
> State         
> 
>  ------- ----                   ----- ------
> -----         
> 
>  vm:Linux1                      xen2.smartechcorp.net
> stopping      
> 
>  vm:Windows1                    xen1.smartechcorp.net
> started   
> 
>  
> 
> Here is my cluster.conf?.
> 
>  
> 
> <?xml version="1.0"?>
> 
> <cluster alias="Xen" config_version="29" name="Xen">
> 
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="-1"/>
> 
>         <clusternodes>
> 
>                 <clusternode name="xen1.smartechcorp.net" nodeid="1"
> votes="1">
> 
>                         <fence>
> 
>                                 <method name="1">
> 
>                                         <device name="manual"
> nodename="xen1.smartechcorp.net"/>
> 
>                                 </method>
> 
>                         </fence>
> 
>                 </clusternode>
> 
>                 <clusternode name="xen2.smartechcorp.net" nodeid="2"
> votes="1">
> 
>                         <fence>
> 
>                                 <method name="1">
> 
>                                         <device name="manual"
> nodename="xen2.smartechcorp.net"/>
> 
>                                 </method>
> 
>                         </fence>
> 
>                 </clusternode>
> 
>         </clusternodes>
> 
>         <cman expected_votes="1" two_node="1"/>
> 
>         <fencedevices>
> 
>                 <fencedevice agent="fence_manual" name="manual"/>
> 
>         </fencedevices>
> 
>         <rm>
> 
>                 <failoverdomains>
> 
>                         <failoverdomain name="bias-xen1"
> nofailback="0" ordered="1" restricted="0">
> 
>                                 <failoverdomainnode
> name="xen1.smartechcorp.net" priority="1"/>
> 
>                                 <failoverdomainnode
> name="xen2.smartechcorp.net" priority="2"/>
> 
>                         </failoverdomain>
> 
>                         <failoverdomain name="bias-xen2"
> nofailback="0" ordered="1" restricted="0">
> 
>                                 <failoverdomainnode
> name="xen1.smartechcorp.net" priority="2"/>
> 
>                                 <failoverdomainnode
> name="xen2.smartechcorp.net" priority="1"/>
> 
>                         </failoverdomain>
> 
>                 </failoverdomains>
> 
>                 <resources/>
> 
>                 <vm autostart="1" domain="bias-xen1" exclusive="0"
> migrate="live" name="Windows1" path="/var/lib/xen/images"
> recovery="relocate"/>
> 
>                 <vm autostart="1" domain="bias-xen2" exclusive="0"
> migrate="live" name="Linux1" path="/var/lib/xen/images"
> recovery="relocate"/>
> 
>         </rm>
> 
> </cluster>
> 
> Thanks for any help, this is driving me crazy!
> 
>  
> 
> ---
> 
>  
> 
> Chris Edwards
> Smartech Corp.
> Div. of AirNet Group
> 
> http://www.airnetgroup.com
> 
> http://www.smartechcorp.net
> 
> cedwards at smartechcorp.net
> P:  423-664-7678 x114
> 
> C:  423-593-6964
> 
> F:  423-664-7680
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From rpeterso at redhat.com  Thu Aug 14 14:43:18 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 14 Aug 2008 09:43:18 -0500
Subject: [Linux-cluster] GFS leak glock
In-Reply-To: <20080814124022.LJGI2335.imsm058dat.netvigator.com@cch-vikash.netvigator.com>
References: <20080814124022.LJGI2335.imsm058dat.netvigator.com@cch-vikash.netvigator.com>
Message-ID: <1218724998.9521.205.camel@technetium.msp.redhat.com>

Hi Vikash,

On Thu, 2008-08-14 at 20:40 +0800, Vikash Khatuwala wrote:
> Hi,
> 
> Does anyone have any idea what these messages mean? I see them in 
> /var/log/messages and on the console too.
> 
> GFS: would leak glock ffff8101f09ef408
> GFS: would leak glock ffff810185d22258
> GFS: would leak glock ffff81015dd930a0
> GFS: would leak glock ffff8101e4e184c8
> GFS: would leak glock ffff81008785d580
> 
> Using CentOS 5.2 with all latest updates.

This sounds like debug code is installed on your system to
debug some problem.  That message does not seem to be in the
most recent source code for GFS, dlm or lock_dlm.  I can't
seem to find a reference to it anywhere offhand.  I can't
imagine Centos would have built GFS from a debug copy of the
source code, so are you sure you don't have debug code on your
system?

Regards,

Bob Peterson
Red Hat Clustering & GFS




From cedwards at smartechcorp.net  Thu Aug 14 14:49:30 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Thu, 14 Aug 2008 10:49:30 -0400
Subject: [Linux-cluster] VM Resource Failover
In-Reply-To: <1218724567.4068.5.camel@dhcp80-204.msp.redhat.com>
References: <0139539A634FD04A99C9B8880AB70CB209B17A5C@in-ex004.groupinfra.com>	<036f01c8fe19$89ce1100$9d6a3300$@net>
	<1218724567.4068.5.camel@dhcp80-204.msp.redhat.com>
Message-ID: <038b01c8fe1c$f4d6b3f0$de841bd0$@net>

This is my first round of playing with clustering, I don't have a fencing agent.  So I need to run fence_ack_manual after I shut one of the nodes down?  This is where I get REALLY confused and don't understand how this works.   When I reboot one of the nodes it seems to hang on fencing for a few minutes, then I have a 50-50 chance that the node will come back up and rejoin the cluster and everything is fine or it will not rejoin the cluster and then I manually try to rejoin it and usually end up rebooting both machines at the same time.  Also should I add my iSCSI GFS shared space as a resource?  Will this automount at bootup?

Thanks for the help!

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kevin Anderson
Sent: Thursday, August 14, 2008 10:36 AM
To: linux clustering
Subject: Re: [Linux-cluster] VM Resource Failover

Since you are using manual fencing, did you run fence_ack_manual after
killing the machine?  When the machine comes back up, the ack is
implied, but rgmanager will not be able to perform recovery operations
until fencing is complete.  Suggest you utilize a real fencing agent if
you want this to work seamlessly.

Kevin

On Thu, 2008-08-14 at 10:25 -0400, Chris Edwards wrote:
> I have been trying to simulate a xen VM failover,  I have a 2 machine
> cluster and 2 vm?s running.  If I issue a ? xm destroy ID?  the vm
> will automatically reboot to the other node.  But if I reboot one of
> the clusters to simulate a machine failure the vm never boots back up
> until the other machine comes online.  So here are my questions?
> 
>  
> 
> 1.      How do I get the cluster to boot the vm that has failed when
> one of the clustered machines are down?
> 
> 2.      When I do a ?xm destroy ID? the cluster always reboots the vm
> onto the other cluster machine, is there any way for me to have it
> boot back to the machine its supposed to be running on without having
> to do a manual migrate?   Can It auto-migrate back to its original
> machine over time?
> 
>  
> 
>  
> 
> Here is the out put of my clustat during a reboot of one of the
> clusters?
> 
>  
> 
> Cluster Status for Xen @ Thu Aug 14 10:11:21 2008
> 
> Member Status: Quorate
> 
>  Member Name                             ID   Status
> 
>  ------ ----                             ---- ------
> 
>  xen1.smartechcorp.net                       1 Online, Local,
> rgmanager
> 
>  xen2.smartechcorp.net                       2 Offline
> 
>  Service Name                   Owner (Last)
> State         
> 
>  ------- ----                   ----- ------
> -----         
> 
>  vm:Linux1                      xen2.smartechcorp.net
> stopping      
> 
>  vm:Windows1                    xen1.smartechcorp.net
> started   
> 
>  
> 
> Here is my cluster.conf?.
> 
>  
> 
> <?xml version="1.0"?>
> 
> <cluster alias="Xen" config_version="29" name="Xen">
> 
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="-1"/>
> 
>         <clusternodes>
> 
>                 <clusternode name="xen1.smartechcorp.net" nodeid="1"
> votes="1">
> 
>                         <fence>
> 
>                                 <method name="1">
> 
>                                         <device name="manual"
> nodename="xen1.smartechcorp.net"/>
> 
>                                 </method>
> 
>                         </fence>
> 
>                 </clusternode>
> 
>                 <clusternode name="xen2.smartechcorp.net" nodeid="2"
> votes="1">
> 
>                         <fence>
> 
>                                 <method name="1">
> 
>                                         <device name="manual"
> nodename="xen2.smartechcorp.net"/>
> 
>                                 </method>
> 
>                         </fence>
> 
>                 </clusternode>
> 
>         </clusternodes>
> 
>         <cman expected_votes="1" two_node="1"/>
> 
>         <fencedevices>
> 
>                 <fencedevice agent="fence_manual" name="manual"/>
> 
>         </fencedevices>
> 
>         <rm>
> 
>                 <failoverdomains>
> 
>                         <failoverdomain name="bias-xen1"
> nofailback="0" ordered="1" restricted="0">
> 
>                                 <failoverdomainnode
> name="xen1.smartechcorp.net" priority="1"/>
> 
>                                 <failoverdomainnode
> name="xen2.smartechcorp.net" priority="2"/>
> 
>                         </failoverdomain>
> 
>                         <failoverdomain name="bias-xen2"
> nofailback="0" ordered="1" restricted="0">
> 
>                                 <failoverdomainnode
> name="xen1.smartechcorp.net" priority="2"/>
> 
>                                 <failoverdomainnode
> name="xen2.smartechcorp.net" priority="1"/>
> 
>                         </failoverdomain>
> 
>                 </failoverdomains>
> 
>                 <resources/>
> 
>                 <vm autostart="1" domain="bias-xen1" exclusive="0"
> migrate="live" name="Windows1" path="/var/lib/xen/images"
> recovery="relocate"/>
> 
>                 <vm autostart="1" domain="bias-xen2" exclusive="0"
> migrate="live" name="Linux1" path="/var/lib/xen/images"
> recovery="relocate"/>
> 
>         </rm>
> 
> </cluster>
> 
> Thanks for any help, this is driving me crazy!
> 
>  
> 
> ---
> 
>  
> 
> Chris Edwards
> Smartech Corp.
> Div. of AirNet Group
> 
> http://www.airnetgroup.com
> 
> http://www.smartechcorp.net
> 
> cedwards at smartechcorp.net
> P:  423-664-7678 x114
> 
> C:  423-593-6964
> 
> F:  423-664-7680
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





From vikash at netvigator.com  Thu Aug 14 15:08:38 2008
From: vikash at netvigator.com (Vikash Khatuwala)
Date: Thu, 14 Aug 2008 23:08:38 +0800
Subject: [Linux-cluster] GFS leak glock
In-Reply-To: <1218724998.9521.205.camel@technetium.msp.redhat.com>
References: <20080814124022.LJGI2335.imsm058dat.netvigator.com@cch-vikash.netvigator.com>
	<1218724998.9521.205.camel@technetium.msp.redhat.com>
Message-ID: <20080814150840.MIVA2335.imsm058dat.netvigator.com@cch-vikash.netvigator.com>

Dear Bob,

Thanks for your reply.

I am using a modified version of the kernel from Parallels's product 
virtuozzo, version "2.6.18-028stab057.2". So Now I think it is 
probably related to their kernel. I will post your reply to them and 
will update the list when I have something.

Regards,
Vikash.

At 10:43 PM 14-08-08, Bob Peterson wrote:
>Hi Vikash,
>
>On Thu, 2008-08-14 at 20:40 +0800, Vikash Khatuwala wrote:
> > Hi,
> >
> > Does anyone have any idea what these messages mean? I see them in
> > /var/log/messages and on the console too.
> >
> > GFS: would leak glock ffff8101f09ef408
> > GFS: would leak glock ffff810185d22258
> > GFS: would leak glock ffff81015dd930a0
> > GFS: would leak glock ffff8101e4e184c8
> > GFS: would leak glock ffff81008785d580
> >
> > Using CentOS 5.2 with all latest updates.
>
>This sounds like debug code is installed on your system to
>debug some problem.  That message does not seem to be in the
>most recent source code for GFS, dlm or lock_dlm.  I can't
>seem to find a reference to it anywhere offhand.  I can't
>imagine Centos would have built GFS from a debug copy of the
>source code, so are you sure you don't have debug code on your
>system?
>
>Regards,
>
>Bob Peterson
>Red Hat Clustering & GFS
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster



From rpeterso at redhat.com  Thu Aug 14 17:47:36 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 14 Aug 2008 12:47:36 -0500
Subject: [Linux-cluster] Re: corrupted GFS filesystem
In-Reply-To: <a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<1218721948.9521.191.camel@technetium.msp.redhat.com>
	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
Message-ID: <1218736056.9521.246.camel@technetium.msp.redhat.com>

Hi David,

On Thu, 2008-08-14 at 10:44 -0500, David Potterveld wrote:
> The system in question is RHEL4
> 
> The gfs2_edit tool is quite informative. The root directory (block 26)
> in pointer mode displays the correct directory pointers, and I can
> navigate down that tree. My guess is it's OK.

That's good.

> Block 23 (rindex) seems to be damaged. In structure mode, it is listed
> as unknown block type. Here is the listing in Hex mode:
> 
> Block #23    (0x17)           of 393214976 (0x176FFC00)
> (p.1 of 16)--------------------- rindex file -------------------
> 00017000 00000000 00000000 00000000 00000000 [................]
> 00017010 000002B0 00000002 000025DE 3E6DF93C [..........%.>m.<]
> 00017020 3E6E2420 000061AC 00000279 00000279 [>n$ ..a....y...y]
> 00017030 00B46690 48A30A6D 00000000 00000000 [..f.H..m........]
> 00017040 00000000 00000000 00000000 000002B0 [................]
> 00017050 00000002 000025DF 3E6DF4C8 3E6E2420 [......%.>m..>n$ ]
> 00017060 000093D2 00000235 00000235 00B46B6C [.......5...5..kl]
> 00017070 48A30A6D 00000000 00000000 00000000 [H..m............]
> 00017080 00000000 00000000 000004E2 00000002 [................]
> 00017090 000025E0 3E6DF4C8 3E6E2420 000093D2 [..%.>m..>n$ ....]
> 000170A0 00000235 00000235 00B46FE0 48A30A6D [...5...5..o.H..m]
> 000170B0 00000000 00000000 00000000 00000000 [................]
> 000170C0 00000000 000004E2 00000002 000025E1 [..............%.]
> 000170D0 3E6DF48C 3E6E2420 00009423 00000201 [>m..>n$ ...#....]
> 000170E0 00000201 00B47454 48A30A6D 00000000 [......tTH..m....]
> 000170F0 00000000 00000000 00000000 00000000 [................]
>          *** This seems to be a GFS-1 file system ***

That is definitely trashed.  This block should look like an
inode and it's not even close.  It doesn't look like anything
I'm used to seeing as far as GFS metadata is concerned.
It looks like data.  I'm guessing the hex numbers that look
like 3E6DF48C are some kind of time stamp, but it doesn't
resemble gfs metadata.

I've got a bugzilla record open so I can do some improvements
to the rindex repair code for RHEL5.  The bug is:

https://bugzilla.redhat.com/show_bug.cgi?id=442271

Right now it's slotted for 5.4.  It's been suggested that I add
an option to gfs_fsck to force a rindex repair.  In your case,
you would need a complete rindex rebuild, which the code currently
doesn't do.

If your file system has ever been extended with gfs_grow, it
makes it extremely difficult for gfs_fsck to figure out how
the rindex should look.  Most of the code is already in
gfs_fsck to do this, but as a safety measure, it won't
overwrite your current rindex if it finds more than five
problems (iirc).

So it would take a while to get this working properly,
and even so, it's designed for RHEL5.4, not RHEL4.

I'm guessing that the resource groups are intact, so it should
be able to rebuild the rindex.  But there are obstacles to
overcome.  You would need a special program or a special version
of gfs_fsck that could do this for you.  It would be faster and
more reliable for you to mkfs the file system and restore from
backups if you got them.

> There should be nearly 6000 resource groups.
> 
> Perhaps I'll learn more if I can change the field for the block type
> so it's recognized. Can you tell me where it is and what should be in
> it?

No, that rindex block looks totally trashed.  You would have to
patch in the whole block.  It should look something like this:

Block #23    (0x17)           of 97255424 (0x5CC0000)  (disk inode)
(p.1 of 8)--------------------- rindex file -------------------
00017000 01161970 00000004 00000000 00000000 [...p............]
00017010 00000190 00000000 00000000 00000017 [................]
00017020 00000000 00000017 00000180 00000000 [................]
00017030 00000000 00000001 00000000 00022C80 [..............,.]
00017040 00000000 00000024 00000000 48A44678 [.......$....H.Fx]
00017050 00000000 48A44678 00000000 48A44678 [....H.Fx....H.Fx]
00017060 00000000 00000000 00000000 00000000 [................]
00017070 00000000 00000000 00000000 00000000 [................]
00017080 00000001 0000044C 00010001 00000000 [.......L........]
00017090 00000000 00000000 00000000 00000000 [................]
000170A0 00000000 00000000 00000000 00000000 [................]
000170B0 00000000 00000000 00000000 00000000 [................]
000170C0 00000000 00000000 00000000 00000000 [................]
000170D0 00000000 00000000 00000000 00000000 [................]
000170E0 00000000 00000000 00000000 0000001B [................] 
000170F0 00000000 0000001C 00000000 0000001D [................] 
00017100 00000000 0000001E 00000000 0000001F [................] 
00017110 00000000 00000020 00000000 00000021 [....... .......!] 
00017120 00000000 00000022 00000000 00000023 [.......".......#] 
00017130 00000000 00000024 00000000 00000025 [.......$.......%] 
00017140 00000000 00000026 00000000 00000027 [.......&.......'] 
00017150 00000000 00000028 00000000 00000029 [.......(.......)] 
00017160 00000000 0000002A 00000000 0000002B [.......*.......+] 
00017170 00000000 0000002C 00000000 0000002D [.......,.......-] 
00017180 00000000 0000002E 00000000 0000002F [.............../] 
00017190 00000000 00000030 00000000 00000031 [.......0.......1] 
000171A0 00000000 00000032 00000000 00000033 [.......2.......3] 
000171B0 00000000 00000034 00000000 00000035 [.......4.......5] 
000171C0 00000000 00000036 00000000 00000037 [.......6.......7] 
000171D0 00000000 00000038 00000000 00000039 [.......8.......9] 
000171E0 00000000 0000003A 00000000 0000003B [.......:.......;] 
000171F0 00000000 0000003C 00000000 0000003D [.......<.......=] 
         *** This seems to be a GFS-1 file system ***

The first pointer, at 000170e8 is "00000000 0000001B" which is
the first indirect block pointer.  That block should look something
like this:

Block #27    (0x1b)           of 97255424 (0x5CC0000)  (journal data)
(p.1 of 8)
0001B000 01161970 00000007 00000000 00000000 [...p............]
0001B010 000002BC 00000000 00000000 00000011 [................] 
0001B020 00000005 00000000 00000000 00000016 [................] 
0001B030 000101D8 00004076 00000000 00000000 [...... at v........] 
0001B040 00000000 00000000 00000000 00000000 [................] 
0001B050 00000000 00000000 00000000 00000000 [................] 
0001B060 00000000 00000000 00000000 00000000 [................] 
0001B070 00000000 00000000 00000000 000101EF [................] 
0001B080 00000005 00000000 00000000 000101F4 [................] 
0001B090 0000FFB8 00003FEE 00000000 00000000 [......?.........] 
0001B0A0 00000000 00000000 00000000 00000000 [................] 
0001B0B0 00000000 00000000 00000000 00000000 [................] 
0001B0C0 00000000 00000000 00000000 00000000 [................] 
etc.

If the first 20x bytes look like that, namely,
0001B000 01161970 00000007 00000000 00000000 [...p............]
0001B010 000002BC 00000000 00000000 00000011 [................] 

then that's good news and the indirect blocks are okay.
Each rindex record should be 0x60 bytes long.
The second entry in my example (0001b078) starts with the pointer
to "00000000 000101EF".  On your system it will be different
because it's a different file system.

If your block #27 looks like this, and you don't mind playing
with fire, you can try to patch the rindex block 23 to look
like I've got it above, but page down and fill in all the block
pointers starting with 1b, 1c, 1d, 1e, and so forth.  I'd use 
the 'f' key on block 1b to page forward until you find the end
of the rindex to figure out where to stop.  If you've really got
2000 resource groups, it will take quite a few blocks (30?) before
you hit the end of them.

You'll also need to patch the rindex inode value for file size
to be 0x2ee00 (corresponding to 2000 rg's * 0x60 bytes per rg).
That's at offset 0x38 thru 0x3f.

So you can try to patch them in, then I'd try to save the
metadata with gfs2_edit savemeta, then I'd run gfs_fsck to see
if it can figure it out from there.

This is just something you can try if you're desperate and have
no backups to restore from; I can't be held accountable for
problems, nor can Red Hat.

> Thanks,
> David Potterveld

Regards,

Bob Peterson
Red Hat Clustering & GFS




From cedwards at smartechcorp.net  Thu Aug 14 20:26:07 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Thu, 14 Aug 2008 16:26:07 -0400
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <1218736056.9521.246.camel@technetium.msp.redhat.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<1218721948.9521.191.camel@technetium.msp.redhat.com>	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
	<1218736056.9521.246.camel@technetium.msp.redhat.com>
Message-ID: <04cb01c8fe4b$fb40f640$f1c2e2c0$@net>

Hello all again,

I am really trying hard to figure out how this clustering works and I
appreciate any help that is given.

What happens if I set GFS up as a resource?   Should it automatically mount
my GFS file system?

I am using GFS over an iscsi target.

Thanks!

---

Chris Edwards





From brettcave at gmail.com  Fri Aug 15 08:52:04 2008
From: brettcave at gmail.com (Brett Cave)
Date: Fri, 15 Aug 2008 10:52:04 +0200
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<1218721948.9521.191.camel@technetium.msp.redhat.com>
	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
	<1218736056.9521.246.camel@technetium.msp.redhat.com>
	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
Message-ID: <c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>

On Thu, Aug 14, 2008 at 10:26 PM, Chris Edwards
<cedwards at smartechcorp.net> wrote:
> Hello all again,
>
> I am really trying hard to figure out how this clustering works and I
> appreciate any help that is given.
>
There are some good docs out there - read through
http://www.centos.org/docs/5/html/Global_File_System/

> What happens if I set GFS up as a resource?   Should it automatically mount
> my GFS file system?

there is a GFS service. If you set it to autostart, it will mount any
gfs file systems defined in /etc/fstab.

You would have to set up iSCSI first, and have your volumes accessible
- to GFS, these just block devices, regardless of the underlying
technology - DRBD / iSCSI / SAN.

The order of startup should be iSCSI first, then cluster services
(cman, locking service such as DLM, fencing, etc) from the cman
service and once the cluster services are up, you can then mount gfs
file systems, via "service gfs start". Also look into using a quorum
disk in your cluster, it can help with the cluster maintaining quorum
when a few nodes go down.

Here is a sample gfs entry from my fs tab.
/dev/sda1               /gfs/cache1             gfs
num_glockd=6,noatime,noquota,nodiratime         0 0

You would replace /dev/sda1 with the path to your iSCSI device.

Brett
>
> I am using GFS over an iscsi target.
>
> Thanks!
>
> ---
>
> Chris Edwards
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From cedwards at smartechcorp.net  Fri Aug 15 14:01:35 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Fri, 15 Aug 2008 10:01:35 -0400
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<1218721948.9521.191.camel@technetium.msp.redhat.com>	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>	<1218736056.9521.246.camel@technetium.msp.redhat.com>	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
Message-ID: <069601c8fedf$6e12ff40$4a38fdc0$@net>

Perfet!  Thanks a million, I didn't realize that GFS needed a fstab entry
with a noauto entry to get it to work.

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
Sent: Friday, August 15, 2008 4:52 AM
To: linux clustering
Subject: Re: [Linux-cluster] GFS as a Resource

On Thu, Aug 14, 2008 at 10:26 PM, Chris Edwards
<cedwards at smartechcorp.net> wrote:
> Hello all again,
>
> I am really trying hard to figure out how this clustering works and I
> appreciate any help that is given.
>
There are some good docs out there - read through
http://www.centos.org/docs/5/html/Global_File_System/

> What happens if I set GFS up as a resource?   Should it automatically
mount
> my GFS file system?

there is a GFS service. If you set it to autostart, it will mount any
gfs file systems defined in /etc/fstab.

You would have to set up iSCSI first, and have your volumes accessible
- to GFS, these just block devices, regardless of the underlying
technology - DRBD / iSCSI / SAN.

The order of startup should be iSCSI first, then cluster services
(cman, locking service such as DLM, fencing, etc) from the cman
service and once the cluster services are up, you can then mount gfs
file systems, via "service gfs start". Also look into using a quorum
disk in your cluster, it can help with the cluster maintaining quorum
when a few nodes go down.

Here is a sample gfs entry from my fs tab.
/dev/sda1               /gfs/cache1             gfs
num_glockd=6,noatime,noquota,nodiratime         0 0

You would replace /dev/sda1 with the path to your iSCSI device.

Brett
>
> I am using GFS over an iscsi target.
>
> Thanks!
>
> ---
>
> Chris Edwards
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From cedwards at smartechcorp.net  Fri Aug 15 17:18:02 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Fri, 15 Aug 2008 13:18:02 -0400
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <069601c8fedf$6e12ff40$4a38fdc0$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<1218721948.9521.191.camel@technetium.msp.redhat.com>	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>	<1218736056.9521.246.camel@technetium.msp.redhat.com>	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
	<069601c8fedf$6e12ff40$4a38fdc0$@net>
Message-ID: <07b601c8fefa$dfc46b90$9f4d42b0$@net>

Whoops, scratch that last post.   I now have it working by leaving the entry
in fstab without the noauto and turning GFS off with chkconfig and allowing
the cluster service to turn it on.
Thanks again!

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chris Edwards
Sent: Friday, August 15, 2008 10:02 AM
To: 'linux clustering'
Subject: RE: [Linux-cluster] GFS as a Resource

Perfet!  Thanks a million, I didn't realize that GFS needed a fstab entry
with a noauto entry to get it to work.

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
Sent: Friday, August 15, 2008 4:52 AM
To: linux clustering
Subject: Re: [Linux-cluster] GFS as a Resource

On Thu, Aug 14, 2008 at 10:26 PM, Chris Edwards
<cedwards at smartechcorp.net> wrote:
> Hello all again,
>
> I am really trying hard to figure out how this clustering works and I
> appreciate any help that is given.
>
There are some good docs out there - read through
http://www.centos.org/docs/5/html/Global_File_System/

> What happens if I set GFS up as a resource?   Should it automatically
mount
> my GFS file system?

there is a GFS service. If you set it to autostart, it will mount any
gfs file systems defined in /etc/fstab.

You would have to set up iSCSI first, and have your volumes accessible
- to GFS, these just block devices, regardless of the underlying
technology - DRBD / iSCSI / SAN.

The order of startup should be iSCSI first, then cluster services
(cman, locking service such as DLM, fencing, etc) from the cman
service and once the cluster services are up, you can then mount gfs
file systems, via "service gfs start". Also look into using a quorum
disk in your cluster, it can help with the cluster maintaining quorum
when a few nodes go down.

Here is a sample gfs entry from my fs tab.
/dev/sda1               /gfs/cache1             gfs
num_glockd=6,noatime,noquota,nodiratime         0 0

You would replace /dev/sda1 with the path to your iSCSI device.

Brett
>
> I am using GFS over an iscsi target.
>
> Thanks!
>
> ---
>
> Chris Edwards
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From maurizio.rottin at gmail.com  Fri Aug 15 18:05:48 2008
From: maurizio.rottin at gmail.com (Maurizio Rottin)
Date: Fri, 15 Aug 2008 20:05:48 +0200
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <07b601c8fefa$dfc46b90$9f4d42b0$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<1218721948.9521.191.camel@technetium.msp.redhat.com>
	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
	<1218736056.9521.246.camel@technetium.msp.redhat.com>
	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
	<069601c8fedf$6e12ff40$4a38fdc0$@net>
	<07b601c8fefa$dfc46b90$9f4d42b0$@net>
Message-ID: <e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>

2008/8/15 Chris Edwards <cedwards at smartechcorp.net>:
> Whoops, scratch that last post.   I now have it working by leaving the entry
> in fstab without the noauto and turning GFS off with chkconfig and allowing
> the cluster service to turn it on.
> Thanks again!

i believe thats the wrong way.
I know it works in that way, but:
- if you have only one node, do not use gfs, it's slow!
- if you have more than one node, use it -- and if you can, test gfs2
as weel (it should be more and more fast) -- but do not mount it (only
- i mean, you don't need it to be listed on a fstab) in fstab.
gfs works if only all the nodes are "up and running", which means, if
one node can't be reached, but is up (network or other problems
inolved) no one will use the gfs filesystem.
You must use it as a resorce, and you must have at least one fencing
method for each node in the cluster.
In this way, once a node becomes unreachable, it will be fenced and
the other nodes can write happily on the filesystem. This is because
if one node "can be considered up and maybe running" it may be writing
on the filesystem, or it can maybe think that it is the only one node
in the cluster (think ebout switch problem, or arp spoofing) than if
you try a "clust" command on that node you will see al  the other
nodes down and only that one up....this is why you must have  a
fencing method! that node HAS TO be shut down or reloaded, otherwise
the filesystem will be blocked, and no read o write can be issued by
any of the nodes in the cluster".

I am not talking about what it is in theory(never attended a RH
session), but believe me, in practice it works like that!

create a global resource (and always create a global resource even if
it is a fencing, or a vsftpd resource that every node has in common)
aqnd mount it in every node you need as a service. Do not think an
fstab entry is the better thing you can have, it is not, it can lock
you filesystem till all the nodes are really working and talking one
each other.

-- 
mr



From brettcave at gmail.com  Fri Aug 15 21:21:23 2008
From: brettcave at gmail.com (Brett Cave)
Date: Fri, 15 Aug 2008 23:21:23 +0200
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<1218721948.9521.191.camel@technetium.msp.redhat.com>
	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
	<1218736056.9521.246.camel@technetium.msp.redhat.com>
	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
	<069601c8fedf$6e12ff40$4a38fdc0$@net>
	<07b601c8fefa$dfc46b90$9f4d42b0$@net>
	<e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>
Message-ID: <c0773fd30808151421o6dee0748w7ba953bb4accc028@mail.gmail.com>

On Fri, Aug 15, 2008 at 8:05 PM, Maurizio Rottin
<maurizio.rottin at gmail.com> wrote:
> 2008/8/15 Chris Edwards <cedwards at smartechcorp.net>:
>> Whoops, scratch that last post.   I now have it working by leaving the entry
>> in fstab without the noauto and turning GFS off with chkconfig and allowing
>> the cluster service to turn it on.
>> Thanks again!
>
> i believe thats the wrong way.
> I know it works in that way, but:
> - if you have only one node, do not use gfs, it's slow!
> - if you have more than one node, use it -- and if you can, test gfs2
> as weel (it should be more and more fast) -- but do not mount it (only
> - i mean, you don't need it to be listed on a fstab) in fstab.

Not sure what the first "more" is referring to above - stable perhaps?

After a week of gfs2, i reverted back to gfs1 - found it to be more
stable - gfs2 is still experimental. Then again, I am still getting
buggy behaviour from gfs1.

On that note - anyone have any ideas as to why a node trying to mount
gfs after a hard reset has the following connection order (as per
logs) - gfs1.

dlm: connecting to 3
dlm: got connection from 4
dlm: connecting to 4
dlm: got connection from 4

GFS system hang at this point

or connecting to 3, got connection from 3, connecting to 3, got
connection from 4.

This happens quite often, and i have to restart all nodes to get gfs
back up... ideas and suggestions are much appreciated.



> gfs works if only all the nodes are "up and running", which means, if
> one node can't be reached, but is up (network or other problems
> inolved) no one will use the gfs filesystem.
> You must use it as a resorce, and you must have at least one fencing
> method for each node in the cluster.
> In this way, once a node becomes unreachable, it will be fenced and
> the other nodes can write happily on the filesystem. This is because
> if one node "can be considered up and maybe running" it may be writing
> on the filesystem, or it can maybe think that it is the only one node
> in the cluster (think ebout switch problem, or arp spoofing) than if
> you try a "clust" command on that node you will see al  the other
> nodes down and only that one up....this is why you must have  a
> fencing method! that node HAS TO be shut down or reloaded, otherwise
> the filesystem will be blocked, and no read o write can be issued by
> any of the nodes in the cluster".
>
> I am not talking about what it is in theory(never attended a RH
> session), but believe me, in practice it works like that!
>
> create a global resource (and always create a global resource even if
> it is a fencing, or a vsftpd resource that every node has in common)
> aqnd mount it in every node you need as a service. Do not think an
> fstab entry is the better thing you can have, it is not, it can lock
> you filesystem till all the nodes are really working and talking one
> each other.
>
> --
> mr
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From dake at novatec.de  Sat Aug 16 02:21:26 2008
From: dake at novatec.de (dake at novatec.de)
Date: Sat, 16 Aug 2008 04:21:26 +0200
Subject: [Linux-cluster] Logging not working
Message-ID: <20080816042126.k5eq990a3jq8occc@novatec.dnsalias.net>

Hello guys,

we're still having trouble with logging not working - now using  
Cluster 2.03.07. The network logging setup is fine, and running  
aisexec directly (i.e. with the default config) produces logs. Running  
it via "cman_tool join" does not, except for a line that ccsd produces:

(tail end of aisexec direct startup)
Aug 16 04:07:46 backup openais[6365]: [CLM  ] 	r(0) ip(127.0.0.1)
Aug 16 04:07:46 backup openais[6365]: [CLM  ] Members Left:
Aug 16 04:07:46 backup openais[6365]: [CLM  ] Members Joined:
Aug 16 04:07:46 backup openais[6365]: [CLM  ] 	r(0) ip(127.0.0.1)
Aug 16 04:07:46 backup openais[6365]: [SYNC ] This node is within the  
primary component and will provide service.
Aug 16 04:07:46 backup openais[6365]: [TOTEM] entering OPERATIONAL state.
Aug 16 04:07:46 backup openais[6365]: [CLM  ] got nodejoin message 127.0.0.1
Aug 16 04:07:47 backup ccsd[6261]: Unable to connect to cluster  
infrastructure after 120 seconds.
Aug 16 04:08:17 backup ccsd[6261]: Unable to connect to cluster  
infrastructure after 150 seconds.
Aug 16 04:08:39 backup ccsd[6261]: Initial status:: Inquorate
(this is after I'd killed "vanilla" aisexec and done cman_tool join)

I've tried with and without logging directives in cluster.conf,  
different log facilities, with and without debug on; I've even tried  
adding to_stderr="yes" to it, with no results.

Using netstat, I could determine that aisexec doesn't even open the  
log socket when started per "cman_tool join", while it does when  
started directly. It seems almost like cman mangles the log settings  
somehow and always set all logging to off, no matter the config file  
directives. Why would this happen? It would be really nice to get this  
working, since it may help me figure out why other stuff isn't working  
properly.

Cheers,
Daniel



From jeff.sturm at eprize.com  Sun Aug 17 21:03:31 2008
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Sun, 17 Aug 2008 17:03:31 -0400
Subject: [Linux-cluster] Load balancing clustered services
Message-ID: <64D0546C5EBBD147B75DE133D798665FE9279C@hugo.eprize.local>

Greetings,
 
The Red Hat Cluster Suite page says the following:


 "For applications that require maximum uptime, a Red Hat Enterprise
Linux cluster with Red Hat Cluster Suite is the answer. Specifically
designed for Red Hat Enterprise Linux, Red Hat Cluster Suite provides
two distinct types of clustering:

    * Application/Service Failover - Create n-node server clusters for
failover of key applications and services
    * IP Load Balancing - Load balance incoming IP network requests
across a farm of servers"


The implication seems to be that the first type addresses high
availability, and the second scalability.  What is the optimal way to
get both?

Please understand that I am already a user of GFS and LVS.  I'm asking
the question because the two seemingly have nothing in common.  For
example, cman knows about cluster membership and can immediately react
when a node leaves the cluster or is fenced.  On the other hand, LVS
(together with either piranha or ldirectord) keeps a list of real
servers, periodically checking each and removing any found to be
unresponsive.

It seems like there are a couple drawbacks to this bifurcated design:

- once cman realizes a node has left the cluster, there is a delay
before ipvs updates its configuration, during which user requests can be
routed to a dead server
- two distinct sets of cluster configurations have to be maintained

Am I misunderstanding something fundamental, or is that the way it is?


-Jeff



From linux at vfemail.net  Mon Aug 18 10:27:18 2008
From: linux at vfemail.net (Alex)
Date: Mon, 18 Aug 2008 13:27:18 +0300
Subject: [Linux-cluster] conga bug or my mistake?
Message-ID: <200808181327.18655.linux@vfemail.net>

Hello all,

My current setup si similar with one described here:
http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt
excepting the fact that i'm having 3 clients and 3 gnbd servers (exporting 
block devices using gnbd).

our gnbd servers have the following IPs: 192.168.113.6 and 192.168.113.7
our gnbd clients have the following IPs: 192.168.113.3 and 192.168.113.4 and 
192.168.113.5

On our management machine (other then above gnbd clients and servers) is 
running:
[root at rhclm ~]# rpm -q luci
luci-0.12.0-7.el5.centos.3
[root at rhclm ~]#

On our gnbd clients is running:
[root at rs1 ~]# rpm -q ricci
ricci-0.12.0-7.el5.centos.3
[root at rs1 ~]#

Now, i'm trying to do the following operations using conga:
Cluster -> Shared Fence Devices -> Add Fence Device

added successfully:

Fence Type: GNBD
Name: gnbd_from_shds
Servers: 192.168.113.6 192.168.113.7

This will add in our cluster.conf:
<fencedevices>
        <fencedevice agent="fence_gnbd" name="gnbd_from_shds" 
servers="192.168.113.6 192.168.113.7"/>
</fencedevices>

Let's try to use it: Cluster -> Nodes hit on 192.168.113.3 and select option 
"Manage Fencing for this Node" -> "Main Fencing Method" -> "Add a fence 
device to this level" -> select gnbd_from_shds ->  and hit "Update main fence 
properties"

Is not working, all the time i'm getting a javascript window error saying the 
following:

[snip]
The following errors were found:
An unknown device type was given: "gnbd."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[end snip]

You see, is a dot after "gnbd" which i suppose it causing that error.

How can be fixed this error?

Now, I edited manually our cluster.conf as following:
<clusternode name="192.168.113.3" nodeid="3" votes="1">
        <fence>
                <method name="1">
                        <device name="gnbd_from_shds" 
nodename="192.168.113.3"/>
                </method>
        </fence>
</clusternode>

First Question: In docs, i cannot find any explanation about name="value" in 
<method> tag. As you see, value is "1": <method name="1">. Is this value 
valid only inside of <clusternode> section or has global semnification in 
cluster.conf? Can i name it for example "one" or 
"first_fence_method_for_this_node"?

and run:
[root at rs1 ~]# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 28 to 29

Update complete.
[root at rs1 ~]#

Now, i can see using conga in "Shared Fence Devices" section:

Shared Fence Devices for Cluster: httpcluster
Agent type: Global Network Block Device
Name: gnbd_from_shds
Nodes using this device for fencing: 192.168.113.3

but, if i'm hitting 192.168.113.3 link, i'll get other error:

Site error

This site encountered an error trying to fulfill your request. The errors 
were:

Error Type
    KeyError
Error Value
    'fence-instance-form-gnbd'
Request made at
    2008/08/18 12:42:45.164 GMT+3

Any ideas how to fix it? Is my mistake or is a bug in conga?

Second Question: Is correct to add and use for the rest of our client nodes 
below sintax?

For: 192.168.113.4 and 192.168.113.5 client nodes:

<clusternode name="192.168.113.4" nodeid="2" votes="1">
        <fence>
                <method name="1">
                        <device name="gnbd_from_shds" 
nodename="192.168.113.4"/>
                </method>
        </fence>
</clusternode>

and

<clusternode name="192.168.113.5" nodeid="1" votes="1">
        <fence>
                <method name="1">
                        <device name="gnbd_from_shds" 
nodename="192.168.113.5"/>
                </method>
        </fence>
</clusternode>

For conformity, i am posting below my present cluster.conf file:

<?xml version="1.0"?>
<cluster alias="httpcluster" config_version="29" name="httpcluster">
        <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="192.168.113.5" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="192.168.113.4" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="192.168.113.3" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="gnbd_from_shds" 
nodename="192.168.113.3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="192.168.113.6" nodeid="4" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="192.168.113.7" nodeid="5" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_gnbd" name="gnbd_from_shds" 
servers="192.168.113.6 192.168.113.7"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

Regards,
Alx



From grigorygor at gmail.com  Mon Aug 18 13:05:37 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Mon, 18 Aug 2008 16:05:37 +0300
Subject: [Linux-cluster] conga bug or my mistake?
In-Reply-To: <200808181327.18655.linux@vfemail.net>
References: <200808181327.18655.linux@vfemail.net>
Message-ID: <c3c0440e0808180605n2418844boc0e1bf998746ee82@mail.gmail.com>

Show us the logs from your nodes

On Mon, Aug 18, 2008 at 1:27 PM, Alex <linux at vfemail.net> wrote:

> Hello all,
>
> My current setup si similar with one described here:
> http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt
> excepting the fact that i'm having 3 clients and 3 gnbd servers (exporting
> block devices using gnbd).
>
> our gnbd servers have the following IPs: 192.168.113.6 and 192.168.113.7
> our gnbd clients have the following IPs: 192.168.113.3 and 192.168.113.4and
> 192.168.113.5
>
> On our management machine (other then above gnbd clients and servers) is
> running:
> [root at rhclm ~]# rpm -q luci
> luci-0.12.0-7.el5.centos.3
> [root at rhclm ~]#
>
> On our gnbd clients is running:
> [root at rs1 ~]# rpm -q ricci
> ricci-0.12.0-7.el5.centos.3
> [root at rs1 ~]#
>
> Now, i'm trying to do the following operations using conga:
> Cluster -> Shared Fence Devices -> Add Fence Device
>
> added successfully:
>
> Fence Type: GNBD
> Name: gnbd_from_shds
> Servers: 192.168.113.6 192.168.113.7
>
> This will add in our cluster.conf:
> <fencedevices>
>        <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> servers="192.168.113.6 192.168.113.7"/>
> </fencedevices>
>
> Let's try to use it: Cluster -> Nodes hit on 192.168.113.3 and select
> option
> "Manage Fencing for this Node" -> "Main Fencing Method" -> "Add a fence
> device to this level" -> select gnbd_from_shds ->  and hit "Update main
> fence
> properties"
>
> Is not working, all the time i'm getting a javascript window error saying
> the
> following:
>
> [snip]
> The following errors were found:
> An unknown device type was given: "gnbd."
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [end snip]
>
> You see, is a dot after "gnbd" which i suppose it causing that error.
>
> How can be fixed this error?
>
> Now, I edited manually our cluster.conf as following:
> <clusternode name="192.168.113.3" nodeid="3" votes="1">
>        <fence>
>                <method name="1">
>                        <device name="gnbd_from_shds"
> nodename="192.168.113.3"/>
>                </method>
>        </fence>
> </clusternode>
>
> First Question: In docs, i cannot find any explanation about name="value"
> in
> <method> tag. As you see, value is "1": <method name="1">. Is this value
> valid only inside of <clusternode> section or has global semnification in
> cluster.conf? Can i name it for example "one" or
> "first_fence_method_for_this_node"?
>
> and run:
> [root at rs1 ~]# ccs_tool update /etc/cluster/cluster.conf
> Config file updated from version 28 to 29
>
> Update complete.
> [root at rs1 ~]#
>
> Now, i can see using conga in "Shared Fence Devices" section:
>
> Shared Fence Devices for Cluster: httpcluster
> Agent type: Global Network Block Device
> Name: gnbd_from_shds
> Nodes using this device for fencing: 192.168.113.3
>
> but, if i'm hitting 192.168.113.3 link, i'll get other error:
>
> Site error
>
> This site encountered an error trying to fulfill your request. The errors
> were:
>
> Error Type
>    KeyError
> Error Value
>    'fence-instance-form-gnbd'
> Request made at
>    2008/08/18 12:42:45.164 GMT+3
>
> Any ideas how to fix it? Is my mistake or is a bug in conga?
>
> Second Question: Is correct to add and use for the rest of our client nodes
> below sintax?
>
> For: 192.168.113.4 and 192.168.113.5 client nodes:
>
> <clusternode name="192.168.113.4" nodeid="2" votes="1">
>        <fence>
>                <method name="1">
>                        <device name="gnbd_from_shds"
> nodename="192.168.113.4"/>
>                </method>
>        </fence>
> </clusternode>
>
> and
>
> <clusternode name="192.168.113.5" nodeid="1" votes="1">
>        <fence>
>                <method name="1">
>                        <device name="gnbd_from_shds"
> nodename="192.168.113.5"/>
>                </method>
>        </fence>
> </clusternode>
>
> For conformity, i am posting below my present cluster.conf file:
>
> <?xml version="1.0"?>
> <cluster alias="httpcluster" config_version="29" name="httpcluster">
>        <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="192.168.113.5" nodeid="1" votes="1">
>                        <fence/>
>                </clusternode>
>                <clusternode name="192.168.113.4" nodeid="2" votes="1">
>                        <fence/>
>                </clusternode>
>                <clusternode name="192.168.113.3" nodeid="3" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="gnbd_from_shds"
> nodename="192.168.113.3"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="192.168.113.6" nodeid="4" votes="1">
>                        <fence/>
>                </clusternode>
>                <clusternode name="192.168.113.7" nodeid="5" votes="1">
>                        <fence/>
>                </clusternode>
>        </clusternodes>
>        <cman/>
>        <fencedevices>
>                <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> servers="192.168.113.6 192.168.113.7"/>
>        </fencedevices>
>        <rm>
>                <failoverdomains/>
>                <resources/>
>        </rm>
> </cluster>
>
> Regards,
> Alx
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/36ba1d27/attachment.htm>

From rhurst at bidmc.harvard.edu  Mon Aug 18 13:22:54 2008
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Mon, 18 Aug 2008 09:22:54 -0400
Subject: [Linux-cluster] Load balancing clustered services
In-Reply-To: <64D0546C5EBBD147B75DE133D798665FE9279C@hugo.eprize.local>
References: <64D0546C5EBBD147B75DE133D798665FE9279C@hugo.eprize.local>
Message-ID: <1219065774.4957.46.camel@WSBID06223.bidmc.harvard.edu>

Both the RHCS page and your assessment are correct.  Keep in mind that
RHCS / GFS provide the host framework for applications to leverage for
high availability and/or scalability  -- simply installing and running
them alone are not enough.

What you use for hardware, what your application is capable of doing
within this environment, and its IMPLEMENTATION determine whether you
are attempting to achieve high availability and/or scalability.  At the
very least, you will have a better solution in place than running a
single monolithic server.

A simple example, if you run a monolithic database instance, and simply
want fail-over to another node, the resource group manager's policy for
that clustered services can do this for you -- without any manual
intervention -- such as moving its IP and disk resources and restarting
the database.  That IS THE BEST availability you can get out of such a
design.  And this does nothing to increase scalability.

But expand this use-case by implementing a database that was built for
high availability -- such as Cache ECP or Oracle RAC -- then such an
outage on one node (planned or unplanned) will be managed by RHCS / GFS
architecture to provide for 100% uptime.  But, you also get scalability
as a positive outcome from this same infrastructure AND implementation.

We are using RHCS / GFS to manage a Cache ECP environment.  The
production application / database is not split yet into multiple tiers,
but it is shadowed for quick fail-over.  Until it is broken up over
several servers, we will never achieve ~100% uptime ... there will
always be that downtime during service transitioning, planned or
unplanned.

We are also using RHCS only to front-end Peoplesoft (BEA WebLogic) for
high-availability, but implemented as an active-active server pair.
True, it also serves for scalability, even though a single server can
easily handle our load.  But if something happens to one server (runaway
process(es) from a bad script, bad application, etc.), we can shut it
down without interrupting service.

Hope this helps.


________________________________________________________________________
??Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.


On Sun, 2008-08-17 at 17:03 -0400, Jeff Sturm wrote:

> Greetings,
>  
> The Red Hat Cluster Suite page says the following:
> 
> 
>  "For applications that require maximum uptime, a Red Hat Enterprise
> Linux cluster with Red Hat Cluster Suite is the answer. Specifically
> designed for Red Hat Enterprise Linux, Red Hat Cluster Suite provides
> two distinct types of clustering:
> 
>     * Application/Service Failover - Create n-node server clusters for
> failover of key applications and services
>     * IP Load Balancing - Load balance incoming IP network requests
> across a farm of servers"
> 
> 
> The implication seems to be that the first type addresses high
> availability, and the second scalability.  What is the optimal way to
> get both?
> 
> Please understand that I am already a user of GFS and LVS.  I'm asking
> the question because the two seemingly have nothing in common.  For
> example, cman knows about cluster membership and can immediately react
> when a node leaves the cluster or is fenced.  On the other hand, LVS
> (together with either piranha or ldirectord) keeps a list of real
> servers, periodically checking each and removing any found to be
> unresponsive.
> 
> It seems like there are a couple drawbacks to this bifurcated design:
> 
> - once cman realizes a node has left the cluster, there is a delay
> before ipvs updates its configuration, during which user requests can be
> routed to a dead server
> - two distinct sets of cluster configurations have to be maintained
> 
> Am I misunderstanding something fundamental, or is that the way it is?
> 
> 
> -Jeff
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/83adc338/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2178 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/83adc338/attachment.p7s>

From linux at vfemail.net  Mon Aug 18 13:35:50 2008
From: linux at vfemail.net (Alex)
Date: Mon, 18 Aug 2008 16:35:50 +0300
Subject: [Linux-cluster] conga bug or my mistake?
In-Reply-To: <c3c0440e0808180605n2418844boc0e1bf998746ee82@mail.gmail.com>
References: <200808181327.18655.linux@vfemail.net>
	<c3c0440e0808180605n2418844boc0e1bf998746ee82@mail.gmail.com>
Message-ID: <200808181635.50874.linux@vfemail.net>

On Monday 18 August 2008 16:05, Grisha G. wrote:
> Show us the logs from your nodes

All right, here comes output from /var/lib/luci/log/event.log when i am 
accessing Cluster -> select my cluster (httpcluster) -> Shared Fence Devices 
link:

Cluster: httpcluster
Agent type: Global Network Block Device
Name: gnbd_from_shds
Nodes using this device for fencing: 192.168.113.3

and now, hitting 192.168.113.3 link, i'll get the following:

[root at rhclm ~]# tail -f /var/lib/luci/log/event.log
[some old output omitted]
...
2008-08-18T16:25:07 ERROR Zope.SiteErrorLog 
https://192.168.113.8:8084/luci/cluster/index_html
Traceback (innermost last):
  Module ZPublisher.Publish, line 115, in publish
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 41, in call_object
  Module Shared.DC.Scripts.Bindings, line 311, in __call__
  Module Shared.DC.Scripts.Bindings, line 348, in _bindAndExec
  Module Products.PageTemplates.ZopePageTemplate, line 255, in _exec
  Module Products.PageTemplates.PageTemplate, line 104, in pt_render
   - <ZopePageTemplate at /luci/cluster/index_html>
  Module TAL.TALInterpreter, line 238, in __call__
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 728, in do_defineMacro
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 780, in do_defineSlot
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 728, in do_defineMacro
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 749, in do_useMacro
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 715, in do_condition
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 715, in do_condition
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 749, in do_useMacro
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 715, in do_condition
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 691, in do_loop_tal
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 715, in do_condition
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 691, in do_loop_tal
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 749, in do_useMacro
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 715, in do_condition
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 715, in do_condition
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 457, in do_optTag_tal
  Module TAL.TALInterpreter, line 442, in do_optTag
  Module TAL.TALInterpreter, line 437, in no_tag
  Module TAL.TALInterpreter, line 281, in interpret
  Module TAL.TALInterpreter, line 735, in do_useMacro
  Module Products.PageTemplates.TALES, line 221, in evaluate
   - URL: /luci/cluster/fence-macros
   - Line 2034, Column 2
   - Expression: standard:'here/fence-macros/macros/fence-instance-form-gnbd'
   - Names:
      {'container': <Folder at /luci/cluster>,
       'context': <Folder at /luci/cluster>,
       'default': <Products.PageTemplates.TALES.Default instance at 
0xb75707ec>,       'here': <Folder at /luci/cluster>,
       'loop': <Products.PageTemplates.TALES.SafeMapping object at 0xdcde8ac>,
       'modules': <Products.PageTemplates.ZRPythonExpr._SecureModuleImporter 
instance at 0xb75123ac>,
       'nothing': None,
       'options': {'args': ()},
       'repeat': <Products.PageTemplates.TALES.SafeMapping object at 
0xdcde8ac>,       'request': <HTTPRequest, 
URL=https://192.168.113.8:8084/luci/cluster/index_html>,
       'root': <Application at >,
       'template': <ZopePageTemplate at /luci/cluster/index_html>,
       'traverse_subpath': [],
       'user': <PropertiedUser 'admin'>}
  Module Products.PageTemplates.Expressions, line 185, in __call__
  Module Products.PageTemplates.Expressions, line 173, in _eval
  Module Products.PageTemplates.Expressions, line 127, in _eval
   - __traceback_info__: here
  Module Products.PageTemplates.Expressions, line 320, in restrictedTraverse
   - __traceback_info__: {'path': ['fence-macros', 'macros', 
'fence-instance-form-gnbd'], 'TraversalRequestNameStack': []}
KeyError: 'fence-instance-form-gnbd'

What other logs you need? In syslog (/var/log/messages) does not appear 
something related to this event!

Regards,
Alx

>
> On Mon, Aug 18, 2008 at 1:27 PM, Alex <linux at vfemail.net> wrote:
> > Hello all,
> >
> > My current setup si similar with one described here:
> > http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt
> > excepting the fact that i'm having 3 clients and 3 gnbd servers
> > (exporting block devices using gnbd).
> >
> > our gnbd servers have the following IPs: 192.168.113.6 and 192.168.113.7
> > our gnbd clients have the following IPs: 192.168.113.3 and
> > 192.168.113.4and 192.168.113.5
> >
> > On our management machine (other then above gnbd clients and servers) is
> > running:
> > [root at rhclm ~]# rpm -q luci
> > luci-0.12.0-7.el5.centos.3
> > [root at rhclm ~]#
> >
> > On our gnbd clients is running:
> > [root at rs1 ~]# rpm -q ricci
> > ricci-0.12.0-7.el5.centos.3
> > [root at rs1 ~]#
> >
> > Now, i'm trying to do the following operations using conga:
> > Cluster -> Shared Fence Devices -> Add Fence Device
> >
> > added successfully:
> >
> > Fence Type: GNBD
> > Name: gnbd_from_shds
> > Servers: 192.168.113.6 192.168.113.7
> >
> > This will add in our cluster.conf:
> > <fencedevices>
> >        <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> > servers="192.168.113.6 192.168.113.7"/>
> > </fencedevices>
> >
> > Let's try to use it: Cluster -> Nodes hit on 192.168.113.3 and select
> > option
> > "Manage Fencing for this Node" -> "Main Fencing Method" -> "Add a fence
> > device to this level" -> select gnbd_from_shds ->  and hit "Update main
> > fence
> > properties"
> >
> > Is not working, all the time i'm getting a javascript window error saying
> > the
> > following:
> >
> > [snip]
> > The following errors were found:
> > An unknown device type was given: "gnbd."
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > [end snip]
> >
> > You see, is a dot after "gnbd" which i suppose it causing that error.
> >
> > How can be fixed this error?
> >
> > Now, I edited manually our cluster.conf as following:
> > <clusternode name="192.168.113.3" nodeid="3" votes="1">
> >        <fence>
> >                <method name="1">
> >                        <device name="gnbd_from_shds"
> > nodename="192.168.113.3"/>
> >                </method>
> >        </fence>
> > </clusternode>
> >
> > First Question: In docs, i cannot find any explanation about name="value"
> > in
> > <method> tag. As you see, value is "1": <method name="1">. Is this value
> > valid only inside of <clusternode> section or has global semnification in
> > cluster.conf? Can i name it for example "one" or
> > "first_fence_method_for_this_node"?
> >
> > and run:
> > [root at rs1 ~]# ccs_tool update /etc/cluster/cluster.conf
> > Config file updated from version 28 to 29
> >
> > Update complete.
> > [root at rs1 ~]#
> >
> > Now, i can see using conga in "Shared Fence Devices" section:
> >
> > Shared Fence Devices for Cluster: httpcluster
> > Agent type: Global Network Block Device
> > Name: gnbd_from_shds
> > Nodes using this device for fencing: 192.168.113.3
> >
> > but, if i'm hitting 192.168.113.3 link, i'll get other error:
> >
> > Site error
> >
> > This site encountered an error trying to fulfill your request. The errors
> > were:
> >
> > Error Type
> >    KeyError
> > Error Value
> >    'fence-instance-form-gnbd'
> > Request made at
> >    2008/08/18 12:42:45.164 GMT+3
> >
> > Any ideas how to fix it? Is my mistake or is a bug in conga?
> >
> > Second Question: Is correct to add and use for the rest of our client
> > nodes below sintax?
> >
> > For: 192.168.113.4 and 192.168.113.5 client nodes:
> >
> > <clusternode name="192.168.113.4" nodeid="2" votes="1">
> >        <fence>
> >                <method name="1">
> >                        <device name="gnbd_from_shds"
> > nodename="192.168.113.4"/>
> >                </method>
> >        </fence>
> > </clusternode>
> >
> > and
> >
> > <clusternode name="192.168.113.5" nodeid="1" votes="1">
> >        <fence>
> >                <method name="1">
> >                        <device name="gnbd_from_shds"
> > nodename="192.168.113.5"/>
> >                </method>
> >        </fence>
> > </clusternode>
> >
> > For conformity, i am posting below my present cluster.conf file:
> >
> > <?xml version="1.0"?>
> > <cluster alias="httpcluster" config_version="29" name="httpcluster">
> >        <fence_daemon clean_start="0" post_fail_delay="0"
> > post_join_delay="3"/>
> >        <clusternodes>
> >                <clusternode name="192.168.113.5" nodeid="1" votes="1">
> >                        <fence/>
> >                </clusternode>
> >                <clusternode name="192.168.113.4" nodeid="2" votes="1">
> >                        <fence/>
> >                </clusternode>
> >                <clusternode name="192.168.113.3" nodeid="3" votes="1">
> >                        <fence>
> >                                <method name="1">
> >                                        <device name="gnbd_from_shds"
> > nodename="192.168.113.3"/>
> >                                </method>
> >                        </fence>
> >                </clusternode>
> >                <clusternode name="192.168.113.6" nodeid="4" votes="1">
> >                        <fence/>
> >                </clusternode>
> >                <clusternode name="192.168.113.7" nodeid="5" votes="1">
> >                        <fence/>
> >                </clusternode>
> >        </clusternodes>
> >        <cman/>
> >        <fencedevices>
> >                <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> > servers="192.168.113.6 192.168.113.7"/>
> >        </fencedevices>
> >        <rm>
> >                <failoverdomains/>
> >                <resources/>
> >        </rm>
> > </cluster>
> >
> > Regards,
> > Alx
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster



From linux at vfemail.net  Mon Aug 18 14:11:39 2008
From: linux at vfemail.net (Alex)
Date: Mon, 18 Aug 2008 17:11:39 +0300
Subject: [Linux-cluster] conga bug or my mistake?
In-Reply-To: <c3c0440e0808180605n2418844boc0e1bf998746ee82@mail.gmail.com>
References: <200808181327.18655.linux@vfemail.net>
	<c3c0440e0808180605n2418844boc0e1bf998746ee82@mail.gmail.com>
Message-ID: <200808181711.39127.linux@vfemail.net>

On Monday 18 August 2008 16:05, Grisha G. wrote:
> Show us the logs from your nodes
>

It seems really a bug. I found another post with the same problem here:

http://www.mail-archive.com/linux-cluster at redhat.com/msg03911.html

Can somebody tell me how to fix?

Regards,
Alx

> On Mon, Aug 18, 2008 at 1:27 PM, Alex <linux at vfemail.net> wrote:
> > Hello all,
> >
> > My current setup si similar with one described here:
> > http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt
> > excepting the fact that i'm having 3 clients and 3 gnbd servers
> > (exporting block devices using gnbd).
> >
> > our gnbd servers have the following IPs: 192.168.113.6 and 192.168.113.7
> > our gnbd clients have the following IPs: 192.168.113.3 and
> > 192.168.113.4and 192.168.113.5
> >
> > On our management machine (other then above gnbd clients and servers) is
> > running:
> > [root at rhclm ~]# rpm -q luci
> > luci-0.12.0-7.el5.centos.3
> > [root at rhclm ~]#
> >
> > On our gnbd clients is running:
> > [root at rs1 ~]# rpm -q ricci
> > ricci-0.12.0-7.el5.centos.3
> > [root at rs1 ~]#
> >
> > Now, i'm trying to do the following operations using conga:
> > Cluster -> Shared Fence Devices -> Add Fence Device
> >
> > added successfully:
> >
> > Fence Type: GNBD
> > Name: gnbd_from_shds
> > Servers: 192.168.113.6 192.168.113.7
> >
> > This will add in our cluster.conf:
> > <fencedevices>
> >        <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> > servers="192.168.113.6 192.168.113.7"/>
> > </fencedevices>
> >
> > Let's try to use it: Cluster -> Nodes hit on 192.168.113.3 and select
> > option
> > "Manage Fencing for this Node" -> "Main Fencing Method" -> "Add a fence
> > device to this level" -> select gnbd_from_shds ->  and hit "Update main
> > fence
> > properties"
> >
> > Is not working, all the time i'm getting a javascript window error saying
> > the
> > following:
> >
> > [snip]
> > The following errors were found:
> > An unknown device type was given: "gnbd."
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > [end snip]
> >
> > You see, is a dot after "gnbd" which i suppose it causing that error.
> >
> > How can be fixed this error?
> >
> > Now, I edited manually our cluster.conf as following:
> > <clusternode name="192.168.113.3" nodeid="3" votes="1">
> >        <fence>
> >                <method name="1">
> >                        <device name="gnbd_from_shds"
> > nodename="192.168.113.3"/>
> >                </method>
> >        </fence>
> > </clusternode>
> >
> > First Question: In docs, i cannot find any explanation about name="value"
> > in
> > <method> tag. As you see, value is "1": <method name="1">. Is this value
> > valid only inside of <clusternode> section or has global semnification in
> > cluster.conf? Can i name it for example "one" or
> > "first_fence_method_for_this_node"?
> >
> > and run:
> > [root at rs1 ~]# ccs_tool update /etc/cluster/cluster.conf
> > Config file updated from version 28 to 29
> >
> > Update complete.
> > [root at rs1 ~]#
> >
> > Now, i can see using conga in "Shared Fence Devices" section:
> >
> > Shared Fence Devices for Cluster: httpcluster
> > Agent type: Global Network Block Device
> > Name: gnbd_from_shds
> > Nodes using this device for fencing: 192.168.113.3
> >
> > but, if i'm hitting 192.168.113.3 link, i'll get other error:
> >
> > Site error
> >
> > This site encountered an error trying to fulfill your request. The errors
> > were:
> >
> > Error Type
> >    KeyError
> > Error Value
> >    'fence-instance-form-gnbd'
> > Request made at
> >    2008/08/18 12:42:45.164 GMT+3
> >
> > Any ideas how to fix it? Is my mistake or is a bug in conga?
> >
> > Second Question: Is correct to add and use for the rest of our client
> > nodes below sintax?
> >
> > For: 192.168.113.4 and 192.168.113.5 client nodes:
> >
> > <clusternode name="192.168.113.4" nodeid="2" votes="1">
> >        <fence>
> >                <method name="1">
> >                        <device name="gnbd_from_shds"
> > nodename="192.168.113.4"/>
> >                </method>
> >        </fence>
> > </clusternode>
> >
> > and
> >
> > <clusternode name="192.168.113.5" nodeid="1" votes="1">
> >        <fence>
> >                <method name="1">
> >                        <device name="gnbd_from_shds"
> > nodename="192.168.113.5"/>
> >                </method>
> >        </fence>
> > </clusternode>
> >
> > For conformity, i am posting below my present cluster.conf file:
> >
> > <?xml version="1.0"?>
> > <cluster alias="httpcluster" config_version="29" name="httpcluster">
> >        <fence_daemon clean_start="0" post_fail_delay="0"
> > post_join_delay="3"/>
> >        <clusternodes>
> >                <clusternode name="192.168.113.5" nodeid="1" votes="1">
> >                        <fence/>
> >                </clusternode>
> >                <clusternode name="192.168.113.4" nodeid="2" votes="1">
> >                        <fence/>
> >                </clusternode>
> >                <clusternode name="192.168.113.3" nodeid="3" votes="1">
> >                        <fence>
> >                                <method name="1">
> >                                        <device name="gnbd_from_shds"
> > nodename="192.168.113.3"/>
> >                                </method>
> >                        </fence>
> >                </clusternode>
> >                <clusternode name="192.168.113.6" nodeid="4" votes="1">
> >                        <fence/>
> >                </clusternode>
> >                <clusternode name="192.168.113.7" nodeid="5" votes="1">
> >                        <fence/>
> >                </clusternode>
> >        </clusternodes>
> >        <cman/>
> >        <fencedevices>
> >                <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> > servers="192.168.113.6 192.168.113.7"/>
> >        </fencedevices>
> >        <rm>
> >                <failoverdomains/>
> >                <resources/>
> >        </rm>
> > </cluster>
> >
> > Regards,
> > Alx
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster



From shakerqawasmi at gmail.com  Mon Aug 18 14:21:31 2008
From: shakerqawasmi at gmail.com (Shaker Qawasmi)
Date: Mon, 18 Aug 2008 17:21:31 +0300
Subject: [Linux-cluster] Problem Running CMAN, GNBD and GFS on RHEL 5.2
Message-ID: <e22fb93a0808180721g2c8fa0b9p5bf464af1f5f4909@mail.gmail.com>

I have the following setup:
    1 node, runs as GNBD server, it's export GFS (v1) partition, doesn't
mount it localaly.
    2 nodes (node1,node2), import the GFS (v1) partition from gnbd node and
mount it as DocumentRoot for apache.

all the 3 nodes located on a Hosting Company, they have public IPs so i have
ssh login for this servers without NATing or forwarding.., all located in
one Rack with one subnet and one switch connecting them.
this setup was running for about 3 months, but unfortuntally it's now
stopped for unknown resone,
The problem accour by starting the clustering manager (cman service), then i
can not ping/access the gnbd server outside its subnet, becuase almost all
of the packets were droped, although I can ping it from any server on its
subnet without any packet lose. other nodes are accessible without any
problem, also cman_tool status shows all nodes as connected without any
problem (M).

I solved it by preventing INPUT/OUTPUT traffic for the mulitcast address (
239.192.30.102) from the server using its firewall (iptables) while keeping
the cman service running, it is not the proper solution because this
mulitcast address demand by clustering service to talk with the other
clustering servers, which they are already using the same multicast address
without any problem.

the hosting company say it's switch does not prevent multicasting.

any idea about this problem?

sorry for my english.

my cluster.conf:
<?xml version="1.0"?>
<cluster name="ex-cluster" config_version="6">

<cman expected_votes="1">
</cman>

<fence_deamon post_join_delay="60">
</fence_deamon>

<clusternodes>
<clusternode name="gnbd1.example.com" nodeid="1">
        <fence>
                <method name="single">
                        <device name="gnbd" ipaddr="gnbd1.example.com"/>
                </method>
        </fence>
</clusternode>
<clusternode name="node1.example.com" nodeid="2">
        <fence>
                <method name="single">
                        <device name="gnbd" ipaddr="node1.example.com"/>
                </method>
        </fence>
</clusternode>
<clusternode name="node2.example.com" nodeid="3">
        <fence>
                <method name="single">
                        <device name="gnbd" ipaddr="node2.example.com"/>
                </method>
        </fence>
</clusternode>
</clusternodes>

<fencedevices>
        <fencedevice name="gnbd" agent="fence_gnbd" servers="
gnbd1.example.com"/>
</fencedevices>
<rm>
    <resources>
        <clusterfs device="/dev/sdb1" force_unmount="0" fsid="5391"
fstype="gfs" mountpoint="/www/www-data" name="www" options=""/>
    </resources>
</rm>
</cluster>


-- 
Shaker Qawasmi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/afb66e4b/attachment.htm>

From anujhere at gmail.com  Mon Aug 18 14:25:53 2008
From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)?=)
Date: Mon, 18 Aug 2008 19:55:53 +0530
Subject: [Linux-cluster] rhel5 gfs1 or gfs2, how to prove that i have gfs1
Message-ID: <3120c9e30808180725k942651bt1f674dcba6d32536@mail.gmail.com>

Hello everyone,
I created a cluster on rhel5,
To make the gfs1 file system I used following command.

' gfs_mkfs -t new_cluster:GFS -p lock_dlm -j2 /dev/MyVol0/MyLV0'

Mounted /dev/MyVol0/MyLV0  on  /image.
'mount -t gfs /dev/MyVol0/MyLV0 /image '
No error I got.

Mount command gives me.
/dev/mapper/MyVol0-MyLV0 on /image type gfs
(rw,hostdata=jid=0:id=131073:first=1)

My questions:
1) How to prove that I have gfs1 file system in use.
2) Ismod |grep gfs gives me.
[root at pr0031 CL]# lsmod |grep gfs
gfs                   252740  1
gfs2                  341965  2 gfs,lock_dlm
configfs               28753  2 dlm
3) rmmod gfs2
ERROR: Module gfs2 is in use by gfs,lock_dlm
^^ why gfs2 module is in use?
as lock_dlm is associated with gfs2, does it mean it's not production ready?


So friends what is the technical answer about whether I have gfs1 in use? I
installed gfs2 as a dependency does it means I have gfs2? even though I
created gfs1 file system. :P?

Now a question about gfs2 file system.
As redhat says they support what they give.

1). Does RHEL5 from redhat gives gfs2 only? which is not yet production
ready??
2). Does RHEL5 provides gfs1?
3). Is the support for gfs1 on RHEL5 available?

I have these rpm's installed over my cluster.
gfs-utils-0.1.17-1.el5.i386.rpm
kmod-gfs-0.1.23-5.el5.i686.rpm
gfs2-utils-0.1.44-1.el5.i386.rpm

Thanks and regards
Anuj Singh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/f031af7e/attachment.htm>

From kanderso at redhat.com  Mon Aug 18 14:38:54 2008
From: kanderso at redhat.com (Kevin Anderson)
Date: Mon, 18 Aug 2008 09:38:54 -0500
Subject: [Linux-cluster] rhel5 gfs1 or gfs2, how to prove that i have gfs1
In-Reply-To: <3120c9e30808180725k942651bt1f674dcba6d32536@mail.gmail.com>
References: <3120c9e30808180725k942651bt1f674dcba6d32536@mail.gmail.com>
Message-ID: <1219070334.4134.13.camel@dhcp80-204.msp.redhat.com>

On Mon, 2008-08-18 at 19:55 +0530, Anuj Singh (????) wrote:
> Hello everyone, 
> I created a cluster on rhel5,
> To make the gfs1 file system I used following command.
> 
> ' gfs_mkfs -t new_cluster:GFS -p lock_dlm -j2 /dev/MyVol0/MyLV0'
> 
> Mounted /dev/MyVol0/MyLV0  on  /image.
> 'mount -t gfs /dev/MyVol0/MyLV0 /image '
> No error I got.
> 
> Mount command gives me. 
> /dev/mapper/MyVol0-MyLV0 on /image type gfs
> (rw,hostdata=jid=0:id=131073:first=1)

Looks good to me.

> 
> My questions:
> 1) How to prove that I have gfs1 file system in use. 
> 2) Ismod |grep gfs gives me.
> [root at pr0031 CL]# lsmod |grep gfs
> gfs                   252740  1 
> gfs2                  341965  2 gfs,lock_dlm
> configfs               28753  2 dlm
> 3) rmmod gfs2
> ERROR: Module gfs2 is in use by gfs,lock_dlm
> ^^ why gfs2 module is in use? 
> as lock_dlm is associated with gfs2, does it mean it's not production
> ready?

There lock modules are including in the upstream kernel as part of the
gfs2 tree.  There are three symbols that gfs2 and gfs share, with gfs2
being the owner of the symbols that provides access to the lock modules.
This requires gfs2 module to be loaded in order for gfs to access the
lock modules.  That is the only part of gfs2 being leveraged for gfs.
We are working to remove that dependency so as not to confuse people.
Net result is that if you use mkfs -t gfs or gfs_mkfs  - you get a gfs
filesystem. 

So, you are good to go with a supported gfs configuration on RHEL5.

Kevin





From grigorygor at gmail.com  Mon Aug 18 14:48:40 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Mon, 18 Aug 2008 17:48:40 +0300
Subject: [Linux-cluster] mysql resource
Message-ID: <c3c0440e0808180748y1c7968ecs42528302f0c465cd@mail.gmail.com>

Hi all.

Can  anybody tell my how to configure mysql resource in redhat 5.2

I get this error :

 RG service:mysql failed to stop; intervention required
 Service service:mysql failed to stop cleanly
 Service service:mysql is failed
 Stopping service service:mysql
 start on mysql "mysql" returned 1 (generic error)
 stop on mysql "mysql" returned 1 (generic error)
 Failed to start service:mysql; return value: 1
 Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid
[mysql:mysql] > Failed - File Doesn't Exist
 Starting Service mysql:mysql > Failed - Timeout Error
 Stopping Service mysql:mysql > Failed
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/fc41ddad/attachment.htm>

From brettcave at gmail.com  Mon Aug 18 15:26:23 2008
From: brettcave at gmail.com (Brett Cave)
Date: Mon, 18 Aug 2008 17:26:23 +0200
Subject: [Linux-cluster] rhel5 gfs1 or gfs2, how to prove that i have gfs1
In-Reply-To: <3120c9e30808180725k942651bt1f674dcba6d32536@mail.gmail.com>
References: <3120c9e30808180725k942651bt1f674dcba6d32536@mail.gmail.com>
Message-ID: <c0773fd30808180826w11dd8390y6a3b3d55572ce258@mail.gmail.com>

2008/8/18 Anuj Singh (????) <anujhere at gmail.com>:
> Hello everyone,
> I created a cluster on rhel5,
> To make the gfs1 file system I used following command.
>
> ' gfs_mkfs -t new_cluster:GFS -p lock_dlm -j2 /dev/MyVol0/MyLV0'
>
> Mounted /dev/MyVol0/MyLV0  on  /image.
> 'mount -t gfs /dev/MyVol0/MyLV0 /image '
> No error I got.
>
> Mount command gives me.
> /dev/mapper/MyVol0-MyLV0 on /image type gfs
> (rw,hostdata=jid=0:id=131073:first=1)
>
> My questions:
> 1) How to prove that I have gfs1 file system in use.
> 2) Ismod |grep gfs gives me.
> [root at pr0031 CL]# lsmod |grep gfs
> gfs                   252740  1
> gfs2                  341965  2 gfs,lock_dlm
> configfs               28753  2 dlm
> 3) rmmod gfs2
> ERROR: Module gfs2 is in use by gfs,lock_dlm
> ^^ why gfs2 module is in use?
> as lock_dlm is associated with gfs2, does it mean it's not production ready?

from	Bob Peterson <rpeterso at redhat.com>
to	linux clustering <linux-cluster at redhat.com>
date	Thu, Aug 14, 2008 at 3:59 PM
subject	Re: [Linux-cluster] GFS vs. GFS2: system-config-cluster, locking ...
		
FYI--For 5.x, the locking infrastructure was common between GFS
and GFS2.  It has to do with the common lock harness "lock_dlm"
that is an interface between both GFS and GFS2 into the "dlm" module.
For 5.3, we're splitting the locking modules apart to get rid of
that dependency.



From brettcave at gmail.com  Mon Aug 18 15:35:29 2008
From: brettcave at gmail.com (Brett Cave)
Date: Mon, 18 Aug 2008 17:35:29 +0200
Subject: [Linux-cluster] GFS frozen again
Message-ID: <c0773fd30808180835u5bb0ca91iee43a9f518fb5b62@mail.gmail.com>

GFS has frozen again - after reconfiguring and running GFS for almost
a month now, have not been able to get GFS running stably.

[root at blade2 ~]# cat /etc/issue
CentOS release 5 (Final)
Kernel \r on an \m

[root at blade2 ~]# uname -a
Linux blade2 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64
x86_64 x86_64 GNU/Linux

[root at blade2 ~]# rpm -qa | grep gfs
gfs2-utils-0.1.38-1.el5
kmod-gfs-0.1.19-7.el5
gfs-utils-0.1.12-1.el5

[root at blade2 ~]# modinfo gfs
filename:       /lib/modules/2.6.18-53.el5/extra/gfs/gfs.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System 0.1.19-7.el5
srcversion:     18B81D3FD6ECDCCFA53D745
depends:        gfs2
vermagic:       2.6.18-53.el5 SMP mod_unload gcc-4.1


Is anyone actually running GFS on Centos5 stably? Was running gfs2,
but was also unstable, hence the move back to gfs.

Setup: 3node cluster with 1 vote each and 1 quorum disk.
Each node has 1 x dual port hba connected to a fibra san (no
multipath, only single port on each card connected to SAN). SAN is
MSA1500. 2 GFS partitions, 1 qdisk partition on SAN.

System runs fine for a few days, and then will notice that some
mountpoints become unavailable. The entire system locks up when this
happens, and the only option I have is to reset all nodes in the
cluster to start up the cluster again. no errors in logs, nothing out
of the ordinary that i can see.



From shawnlhood at gmail.com  Mon Aug 18 15:43:00 2008
From: shawnlhood at gmail.com (Shawn Hood)
Date: Mon, 18 Aug 2008 11:43:00 -0400
Subject: [Linux-cluster] GFS frozen again
In-Reply-To: <c0773fd30808180835u5bb0ca91iee43a9f518fb5b62@mail.gmail.com>
References: <c0773fd30808180835u5bb0ca91iee43a9f518fb5b62@mail.gmail.com>
Message-ID: <cfe2fc960808180843p4da1f74eyb7a5a9e616bbca42@mail.gmail.com>

Could you post the errors from syslog/dmesg?

Shawn

On Mon, Aug 18, 2008 at 11:35 AM, Brett Cave <brettcave at gmail.com> wrote:

> GFS has frozen again - after reconfiguring and running GFS for almost
> a month now, have not been able to get GFS running stably.
>
> [root at blade2 ~]# cat /etc/issue
> CentOS release 5 (Final)
> Kernel \r on an \m
>
> [root at blade2 ~]# uname -a
> Linux blade2 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64
> x86_64 x86_64 GNU/Linux
>
> [root at blade2 ~]# rpm -qa | grep gfs
> gfs2-utils-0.1.38-1.el5
> kmod-gfs-0.1.19-7.el5
> gfs-utils-0.1.12-1.el5
>
> [root at blade2 ~]# modinfo gfs
> filename:       /lib/modules/2.6.18-53.el5/extra/gfs/gfs.ko
> license:        GPL
> author:         Red Hat, Inc.
> description:    Global File System 0.1.19-7.el5
> srcversion:     18B81D3FD6ECDCCFA53D745
> depends:        gfs2
> vermagic:       2.6.18-53.el5 SMP mod_unload gcc-4.1
>
>
> Is anyone actually running GFS on Centos5 stably? Was running gfs2,
> but was also unstable, hence the move back to gfs.
>
> Setup: 3node cluster with 1 vote each and 1 quorum disk.
> Each node has 1 x dual port hba connected to a fibra san (no
> multipath, only single port on each card connected to SAN). SAN is
> MSA1500. 2 GFS partitions, 1 qdisk partition on SAN.
>
> System runs fine for a few days, and then will notice that some
> mountpoints become unavailable. The entire system locks up when this
> happens, and the only option I have is to reset all nodes in the
> cluster to start up the cluster again. no errors in logs, nothing out
> of the ordinary that i can see.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Shawn Hood
910.670.1819 m
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/d17d2ddd/attachment.htm>

From shawnlhood at gmail.com  Mon Aug 18 15:44:40 2008
From: shawnlhood at gmail.com (Shawn Hood)
Date: Mon, 18 Aug 2008 11:44:40 -0400
Subject: [Linux-cluster] GFS frozen again
In-Reply-To: <cfe2fc960808180843p4da1f74eyb7a5a9e616bbca42@mail.gmail.com>
References: <c0773fd30808180835u5bb0ca91iee43a9f518fb5b62@mail.gmail.com>
	<cfe2fc960808180843p4da1f74eyb7a5a9e616bbca42@mail.gmail.com>
Message-ID: <cfe2fc960808180844o1c87bae5q8ca13ac998bf40ff@mail.gmail.com>

Doh!  I missed the part about there being nothing in the logs.  ;)

Shawn

On Mon, Aug 18, 2008 at 11:43 AM, Shawn Hood <shawnlhood at gmail.com> wrote:

> Could you post the errors from syslog/dmesg?
>
> Shawn
>
>
> On Mon, Aug 18, 2008 at 11:35 AM, Brett Cave <brettcave at gmail.com> wrote:
>
>> GFS has frozen again - after reconfiguring and running GFS for almost
>> a month now, have not been able to get GFS running stably.
>>
>> [root at blade2 ~]# cat /etc/issue
>> CentOS release 5 (Final)
>> Kernel \r on an \m
>>
>> [root at blade2 ~]# uname -a
>> Linux blade2 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> [root at blade2 ~]# rpm -qa | grep gfs
>> gfs2-utils-0.1.38-1.el5
>> kmod-gfs-0.1.19-7.el5
>> gfs-utils-0.1.12-1.el5
>>
>> [root at blade2 ~]# modinfo gfs
>> filename:       /lib/modules/2.6.18-53.el5/extra/gfs/gfs.ko
>> license:        GPL
>> author:         Red Hat, Inc.
>> description:    Global File System 0.1.19-7.el5
>> srcversion:     18B81D3FD6ECDCCFA53D745
>> depends:        gfs2
>> vermagic:       2.6.18-53.el5 SMP mod_unload gcc-4.1
>>
>>
>> Is anyone actually running GFS on Centos5 stably? Was running gfs2,
>> but was also unstable, hence the move back to gfs.
>>
>> Setup: 3node cluster with 1 vote each and 1 quorum disk.
>> Each node has 1 x dual port hba connected to a fibra san (no
>> multipath, only single port on each card connected to SAN). SAN is
>> MSA1500. 2 GFS partitions, 1 qdisk partition on SAN.
>>
>> System runs fine for a few days, and then will notice that some
>> mountpoints become unavailable. The entire system locks up when this
>> happens, and the only option I have is to reset all nodes in the
>> cluster to start up the cluster again. no errors in logs, nothing out
>> of the ordinary that i can see.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Shawn Hood
> 910.670.1819 m
>
>


-- 
--
Shawn Hood
910.670.1819 m
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080818/71c1f7d3/attachment.htm>

From brettcave at gmail.com  Mon Aug 18 16:06:02 2008
From: brettcave at gmail.com (Brett Cave)
Date: Mon, 18 Aug 2008 18:06:02 +0200
Subject: [Linux-cluster] GFS frozen again
In-Reply-To: <cfe2fc960808180843p4da1f74eyb7a5a9e616bbca42@mail.gmail.com>
References: <c0773fd30808180835u5bb0ca91iee43a9f518fb5b62@mail.gmail.com>
	<cfe2fc960808180843p4da1f74eyb7a5a9e616bbca42@mail.gmail.com>
Message-ID: <c0773fd30808180906l38d602dcjbb0a6579fb422405@mail.gmail.com>

2008/8/18 Shawn Hood <shawnlhood at gmail.com>:
> Could you post the errors from syslog/dmesg?

as i was finishing off this email, i just noticed this from the logs
near the end of blade2:
Aug 17 19:50:24 blade2 gfs_controld[2839]: retrieve_plocks: ckpt open
error 12 cache1
That happens after blade2 has been fenced, has sucessfully rejoined
the fence and cman domains, and is now trying to mount gfs
filesystems. The first gfs file system it tries to mount causes a lock
up.

:) well, 1 node lost connectivity (blade2) to the cluster and was
fenced (fence_ilo) - the environment is HP bladesystem with x86_64
blades (2 intel, 1 amd). Here are logs from blade3:

Aug 17 19:48:55 blade3 openais[2696]: [TOTEM] The token was lost in
the OPERATIONAL state.
Aug 17 19:48:55 blade3 openais[2696]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Aug 17 19:48:55 blade3 openais[2696]: [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).
Aug 17 19:48:55 blade3 openais[2696]: [TOTEM] entering GATHER state from 2.
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] entering GATHER state from 11.
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] Creating commit token
because I am the rep.
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] Saving state aru 1fd
high seq received 1fd
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] Storing new sequence id
for ring 25c
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] entering COMMIT state.
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] entering RECOVERY state.
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] position [0] member
192.168.70.103:
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] previous ring seq 600
rep 192.168.70.102
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] aru 1fd high delivered
1fd received flag 1
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] position [1] member
192.168.70.104:
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] previous ring seq 600
rep 192.168.70.102
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] aru 1fd high delivered
1fd received flag 1
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] Did not need to
originate any messages in recovery.
Aug 17 19:49:00 blade3 openais[2696]: [TOTEM] Sending initial ORF token
Aug 17 19:49:00 blade3 kernel: dlm: closing connection to node 2
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] New Configuration:
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.103)
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.104)
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] Members Left:
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] Members Joined:
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] New Configuration:
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.103)
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.104)
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] Members Left:
Aug 17 19:49:00 blade3 openais[2696]: [CLM  ] Members Joined:
Aug 17 19:49:00 blade3 openais[2696]: [SYNC ] This node is within the
primary component and will provide service.
Aug 17 19:49:01 blade3 openais[2696]: [TOTEM] entering OPERATIONAL state.
Aug 17 19:49:01 blade3 openais[2696]: [CLM  ] got nodejoin message
192.168.70.103
Aug 17 19:49:01 blade3 openais[2696]: [CLM  ] got nodejoin message
192.168.70.104
Aug 17 19:49:01 blade3 openais[2696]: [CPG  ] got joinlist message from node 4
Aug 17 19:49:01 blade3 openais[2696]: [CPG  ] got joinlist message from node 3
Aug 17 19:49:03 blade3 fenced[2712]: blade2 not a cluster member after
3 sec post_fail_delay
Aug 17 19:49:03 blade3 fenced[2712]: fencing node "blade2"
Aug 17 19:49:16 blade3 fenced[2712]: fence "blade2" success
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: Trying to acquire journal lock...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: Trying to acquire journal lock...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: Looking at journal...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: Looking at journal...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: Acquiring the transaction lock...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: Acquiring the transaction lock...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: Replaying journal...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: Replayed 0 of 1 blocks
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: replays = 0, skips = 0, sames = 1
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: Replaying journal...
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1:
jid=2: Journal replayed in 1s
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:cache1.1: jid=2: Done
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: Replayed 0 of 9 blocks
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: replays = 0, skips = 0, sames = 9
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1:
jid=2: Journal replayed in 1s
Aug 17 19:49:21 blade3 kernel: GFS: fsid=jemdevcluster:storage.1: jid=2: Done
Aug 17 19:49:30 blade3 openais[2696]: [CMAN ] lost contact with quorum device
Aug 17 19:49:30 blade3 openais[2696]: [CMAN ] quorum lost, blocking activity
Aug 17 19:49:30 blade3 kernel: dlm: closing connection to node 0
Aug 17 19:49:30 blade3 qdiskd[2765]: <info> Assuming master role
Aug 17 19:49:33 blade3 qdiskd[2765]: <notice> Writing eviction notice for node 2
Aug 17 19:49:33 blade3 openais[2696]: [CMAN ] quorum regained, resuming activity
Aug 17 19:49:36 blade3 qdiskd[2765]: <notice> Node 2 evicted
Aug 17 19:51:25 blade3 openais[2696]: [TOTEM] entering GATHER state from 11.
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] Saving state aru 28 high
seq received 28
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] Storing new sequence id
for ring 260
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] entering COMMIT state.
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] entering RECOVERY state.
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] position [0] member
192.168.70.102:
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] previous ring seq 604
rep 192.168.70.102
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] aru 9 high delivered 9
received flag 1
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] position [1] member
192.168.70.103:
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] previous ring seq 604
rep 192.168.70.103
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] aru 28 high delivered 28
received flag 1
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] position [2] member
192.168.70.104:
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] previous ring seq 604
rep 192.168.70.103
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] aru 28 high delivered 28
received flag 1
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] Did not need to
originate any messages in recovery.
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] New Configuration:
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.103)
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.104)
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] Members Left:
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] Members Joined:
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] New Configuration:
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.103)
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.104)
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] Members Left:
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] Members Joined:
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:51:26 blade3 openais[2696]: [SYNC ] This node is within the
primary component and will provide service.
Aug 17 19:51:26 blade3 openais[2696]: [TOTEM] entering OPERATIONAL state.
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] got nodejoin message
192.168.70.102
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] got nodejoin message
192.168.70.103
Aug 17 19:51:26 blade3 openais[2696]: [CLM  ] got nodejoin message
192.168.70.104
Aug 17 19:51:26 blade3 openais[2696]: [CPG  ] got joinlist message from node 3
Aug 17 19:51:26 blade3 openais[2696]: [CPG  ] got joinlist message from node 4
Aug 17 19:51:44 blade3 kernel: dlm: connecting to 2
Aug 18 17:16:03 blade3 openais[2696]: [TOTEM] The token was lost in
the OPERATIONAL state.
Aug 18 17:16:03 blade3 openais[2696]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Aug 18 17:16:03 blade3 openais[2696]: [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).
Aug 18 17:16:03 blade3 openais[2696]: [TOTEM] entering GATHER state from 2.


blade2 logs are as follows:
19:46 - last log entry unrelated to gfs. system freezes.
19:50 - boot log entries - system has been fenced and is now starting up.
Aug 17 19:50:03 blade2 ccsd[2804]: Starting ccsd 2.0.73:
Aug 17 19:50:03 blade2 ccsd[2804]:  Built: Nov 12 2007 13:07:35
Aug 17 19:50:03 blade2 ccsd[2804]:  Copyright (C) Red Hat, Inc.  2004
All rights reserved.
Aug 17 19:50:03 blade2 ccsd[2804]: cluster.conf (cluster name =
jemdevcluster, version = 5) found.
Aug 17 19:50:03 blade2 ccsd[2804]: Remote copy of cluster.conf is from
quorate node.
Aug 17 19:50:03 blade2 ccsd[2804]:  Local version # : 5
Aug 17 19:50:03 blade2 ccsd[2804]:  Remote version #: 5
Aug 17 19:50:04 blade2 ccsd[2804]: Remote copy of cluster.conf is from
quorate node.
Aug 17 19:50:04 blade2 ccsd[2804]:  Local version # : 5
Aug 17 19:50:04 blade2 ccsd[2804]:  Remote version #: 5
Aug 17 19:50:04 blade2 ccsd[2804]: Remote copy of cluster.conf is from
quorate node.
Aug 17 19:50:04 blade2 ccsd[2804]:  Local version # : 5
Aug 17 19:50:04 blade2 ccsd[2804]:  Remote version #: 5
Aug 17 19:50:04 blade2 ccsd[2804]: Remote copy of cluster.conf is from
quorate node.
Aug 17 19:50:04 blade2 ccsd[2804]:  Local version # : 5
Aug 17 19:50:04 blade2 ccsd[2804]:  Remote version #: 5
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] AIS Executive Service
RELEASE 'subrev 1358 version 0.80.3'
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] Copyright (C) 2002-2006
MontaVista Software, Inc and contributors.
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] AIS Executive Service:
started and ready to provide service.
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] Using default multicast
address of 239.192.24.76
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] openais component
openais_cpg loaded.
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais cluster closed process group service v1.01'
Aug 17 19:50:04 blade2 openais[2811]: [MAIN ] openais component
openais_cfg loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais configuration service'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_msg loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais message service B.01.01'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_lck loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais distributed locking service B.01.01'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_evt loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais event service B.01.01'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_ckpt loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais checkpoint service B.01.01'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_amf loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais availability management framework B.01.01'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_clm loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais cluster membership service B.01.01'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_evs loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais extended virtual synchrony service'
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] openais component
openais_cman loaded.
Aug 17 19:50:05 blade2 openais[2811]: [MAIN ] Registering service
handler 'openais CMAN membership service 2.01'
Aug 17 19:50:05 blade2 openais[2811]: [TOTEM] Token Timeout (10000 ms)
retransmit timeout (495 ms)
Aug 17 19:50:05 blade2 openais[2811]: [TOTEM] token hold (386 ms)
retransmits before loss (20 retrans)
Aug 17 19:50:05 blade2 openais[2811]: [TOTEM] join (60 ms) send_join
(0 ms) consensus (4800 ms) merge (200 ms)
Aug 17 19:50:05 blade2 openais[2811]: [TOTEM] downcheck (1000 ms) fail
to recv const (50 msgs)
Aug 17 19:50:05 blade2 openais[2811]: [TOTEM] seqno unchanged const
(30 rotations) Maximum network MTU 1500
Aug 17 19:50:05 blade2 openais[2811]: [TOTEM] window size per rotation
(50 messages) maximum messages per rotation (17 messages)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] send threads (0 threads)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] RRP token expired timeout (495 ms)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] RRP token problem
counter (2000 ms)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] RRP threshold (10 problem count)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] RRP mode set to none.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] heartbeat_failures_allowed (0)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] max_network_delay (50 ms)
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] HeartBeat is Disabled.
To enable set heartbeat_failures_allowed > 0
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Receive multicast socket
recv buffer size (262142 bytes).
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] The network interface
[192.168.70.102] is now up.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Created or loaded
sequence id 600.192.168.70.102 for this ring.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] entering GATHER state from 15.
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais extended virtual synchrony service'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais cluster membership service B.01.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais availability management framework B.01.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais checkpoint service B.01.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais event service B.01.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais distributed locking service B.01.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais message service B.01.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais configuration service'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais cluster closed process group service v1.01'
Aug 17 19:50:06 blade2 openais[2811]: [SERV ] Initialising service
handler 'openais CMAN membership service 2.01'
Aug 17 19:50:06 blade2 openais[2811]: [CMAN ] CMAN 2.0.73 (built Nov
12 2007 13:07:39) started
Aug 17 19:50:06 blade2 openais[2811]: [SYNC ] Not using a virtual
synchrony filter.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Creating commit token
because I am the rep.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Saving state aru 0 high
seq received 0
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Storing new sequence id
for ring 25c
Aug 17 19:50:06 blade2 ccsd[2804]: Initial status:: Quorate
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] entering COMMIT state.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] entering RECOVERY state.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] position [0] member
192.168.70.102:
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] previous ring seq 600
rep 192.168.70.102
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] aru 0 high delivered 0
received flag 1
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Did not need to
originate any messages in recovery.
Aug 17 19:50:06 blade2 openais[2811]: [TOTEM] Sending initial ORF token
Aug 17 19:50:06 blade2 openais[2811]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] New Configuration:
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] Members Left:
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] Members Joined:
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] New Configuration:
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] Members Left:
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] Members Joined:
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:50:07 blade2 openais[2811]: [SYNC ] This node is within the
primary component and will provide service.
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] entering OPERATIONAL state.
Aug 17 19:50:07 blade2 openais[2811]: [CLM  ] got nodejoin message
192.168.70.102
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] entering GATHER state from 11.
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] Creating commit token
because I am the rep.
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] Saving state aru 9 high
seq received 9
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] Storing new sequence id
for ring 260
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] entering COMMIT state.
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] entering RECOVERY state.
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] position [0] member
192.168.70.102:
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] previous ring seq 604
rep 192.168.70.102
Aug 17 19:50:07 blade2 openais[2811]: [TOTEM] aru 9 high delivered 9
received flag 1
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] position [1] member
192.168.70.103:
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] previous ring seq 604
rep 192.168.70.103
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] aru 28 high delivered 28
received flag 1
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] position [2] member
192.168.70.104:
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] previous ring seq 604
rep 192.168.70.103
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] aru 28 high delivered 28
received flag 1
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] Did not need to
originate any messages in recovery.
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] Sending initial ORF token
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] New Configuration:
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] Members Left:
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] Members Joined:
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] New Configuration:
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.102)
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.103)
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.104)
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] Members Left:
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ] Members Joined:
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.103)
Aug 17 19:50:08 blade2 openais[2811]: [CLM  ]   r(0) ip(192.168.70.104)
Aug 17 19:50:08 blade2 openais[2811]: [SYNC ] This node is within the
primary component and will provide service.
Aug 17 19:50:08 blade2 openais[2811]: [TOTEM] entering OPERATIONAL state.
Aug 17 19:50:09 blade2 openais[2811]: [CMAN ] quorum regained, resuming activity
Aug 17 19:50:09 blade2 openais[2811]: [CLM  ] got nodejoin message
192.168.70.102
Aug 17 19:50:09 blade2 openais[2811]: [CLM  ] got nodejoin message
192.168.70.103
Aug 17 19:50:09 blade2 openais[2811]: [CLM  ] got nodejoin message
192.168.70.104
Aug 17 19:50:09 blade2 openais[2811]: [CPG  ] got joinlist message from node 3
Aug 17 19:50:09 blade2 openais[2811]: [CPG  ] got joinlist message from node 4
Aug 17 19:50:09 blade2 qdiskd[2877]: <info> Quorum Partition:
/dev/sda5 Label: jemqdisk
Aug 17 19:50:09 blade2 qdiskd[2878]: <info> Quorum Daemon Initializing
Aug 17 19:50:22 blade2 qdiskd[2878]: <info> Node 3 is the master
Aug 17 19:50:24 blade2 gfs_controld[2839]: retrieve_plocks: ckpt open
error 12 cache1
Aug 17 19:50:40 blade2 qdiskd[2878]: <info> Initial score 1/1
Aug 17 19:50:40 blade2 qdiskd[2878]: <info> Initialization complete
Aug 17 19:50:40 blade2 openais[2811]: [CMAN ] quorum device registered
Aug 17 19:50:40 blade2 qdiskd[2878]: <notice> Score sufficient for
master operation (1/1; required=1); upgrading

>
> Shawn
>
> On Mon, Aug 18, 2008 at 11:35 AM, Brett Cave <brettcave at gmail.com> wrote:
>>
>> GFS has frozen again - after reconfiguring and running GFS for almost
>> a month now, have not been able to get GFS running stably.
>>
>> [root at blade2 ~]# cat /etc/issue
>> CentOS release 5 (Final)
>> Kernel \r on an \m
>>
>> [root at blade2 ~]# uname -a
>> Linux blade2 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> [root at blade2 ~]# rpm -qa | grep gfs
>> gfs2-utils-0.1.38-1.el5
>> kmod-gfs-0.1.19-7.el5
>> gfs-utils-0.1.12-1.el5
>>
>> [root at blade2 ~]# modinfo gfs
>> filename:       /lib/modules/2.6.18-53.el5/extra/gfs/gfs.ko
>> license:        GPL
>> author:         Red Hat, Inc.
>> description:    Global File System 0.1.19-7.el5
>> srcversion:     18B81D3FD6ECDCCFA53D745
>> depends:        gfs2
>> vermagic:       2.6.18-53.el5 SMP mod_unload gcc-4.1
>>
>>
>> Is anyone actually running GFS on Centos5 stably? Was running gfs2,
>> but was also unstable, hence the move back to gfs.
>>
>> Setup: 3node cluster with 1 vote each and 1 quorum disk.
>> Each node has 1 x dual port hba connected to a fibra san (no
>> multipath, only single port on each card connected to SAN). SAN is
>> MSA1500. 2 GFS partitions, 1 qdisk partition on SAN.
>>
>> System runs fine for a few days, and then will notice that some
>> mountpoints become unavailable. The entire system locks up when this
>> happens, and the only option I have is to reset all nodes in the
>> cluster to start up the cluster again. no errors in logs, nothing out
>> of the ordinary that i can see.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Shawn Hood
> 910.670.1819 m
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From cedwards at smartechcorp.net  Mon Aug 18 17:36:41 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Mon, 18 Aug 2008 13:36:41 -0400
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<1218721948.9521.191.camel@technetium.msp.redhat.com>	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>	<1218736056.9521.246.camel@technetium.msp.redhat.com>	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>	<069601c8fedf$6e12ff40$4a38fdc0$@net>	<07b601c8fefa$dfc46b90$9f4d42b0$@net>
	<e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>
Message-ID: <110c01c90158$f9a352f0$ece9f8d0$@net>

Here is my clustat....

Cluster Status for Xen @ Mon Aug 18 13:28:37 2008
Member Status: Quorate

 Member Name                              ID   Status
 ------ ----                              ---- ------
 xen1.smartechcorp.net                        1 Online, Local, rgmanager
 xen2.smartechcorp.net                        2 Online, rgmanager

 Service Name                    Owner (Last)                    State

 ------- ----                    ----- ------                    -----

 service:GFS Mount Xen1          xen1.smartechcorp.net           started

 service:GFS Mount Xen2          xen2.smartechcorp.net           started


Here is my cluster.conf...

<?xml version="1.0"?>
<cluster alias="Xen" config_version="53" name="Xen">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="-1"/>
        <clusternodes>
                <clusternode name="xen1.smartechcorp.net" nodeid="1"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="manual"
nodename="xen1.smartechcorp.net"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xen2.smartechcorp.net" nodeid="2"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="manual"
nodename="xen2.smartechcorp.net"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="manual"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="bias-xen1" nofailback="0"
ordered="1" restricted="0">

                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="1"/>

                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="bias-xen2" nofailback="0"
ordered="1" restricted="0">
                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="2"/>
                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="gfs-xen1" nofailback="0"
ordered="0" restricted="1">
                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="gfs-xen2" nofailback="0"
ordered="0" restricted="1">
                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <service autostart="1" domain="gfs-xen2" exclusive="0"
name="GFS Mount Xen2" recovery="restart"/>
                <service autostart="1" domain="gfs-xen1" exclusive="0"
name="GFS Mount Xen1" recovery="restart"/>
        </rm>
        <quorumd device="/dev/sdb5" interval="1" min_score="1" tko="10"
votes="1">
                <heuristic interval="2" program="ping -c3 -t2 10.10.10.1"
score="1"/>
        </quorumd>
</cluster>

Without a entry in fstab my gfs file systems never mount.   So I am
wondering how I can leave out entries in my fstab.

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maurizio Rottin
Sent: Friday, August 15, 2008 2:06 PM
To: linux clustering
Subject: Re: [Linux-cluster] GFS as a Resource

2008/8/15 Chris Edwards <cedwards at smartechcorp.net>:
> Whoops, scratch that last post.   I now have it working by leaving the
entry
> in fstab without the noauto and turning GFS off with chkconfig and
allowing
> the cluster service to turn it on.
> Thanks again!

i believe thats the wrong way.
I know it works in that way, but:
- if you have only one node, do not use gfs, it's slow!
- if you have more than one node, use it -- and if you can, test gfs2
as weel (it should be more and more fast) -- but do not mount it (only
- i mean, you don't need it to be listed on a fstab) in fstab.
gfs works if only all the nodes are "up and running", which means, if
one node can't be reached, but is up (network or other problems
inolved) no one will use the gfs filesystem.
You must use it as a resorce, and you must have at least one fencing
method for each node in the cluster.
In this way, once a node becomes unreachable, it will be fenced and
the other nodes can write happily on the filesystem. This is because
if one node "can be considered up and maybe running" it may be writing
on the filesystem, or it can maybe think that it is the only one node
in the cluster (think ebout switch problem, or arp spoofing) than if
you try a "clust" command on that node you will see al  the other
nodes down and only that one up....this is why you must have  a
fencing method! that node HAS TO be shut down or reloaded, otherwise
the filesystem will be blocked, and no read o write can be issued by
any of the nodes in the cluster".

I am not talking about what it is in theory(never attended a RH
session), but believe me, in practice it works like that!

create a global resource (and always create a global resource even if
it is a fencing, or a vsftpd resource that every node has in common)
aqnd mount it in every node you need as a service. Do not think an
fstab entry is the better thing you can have, it is not, it can lock
you filesystem till all the nodes are really working and talking one
each other.

-- 
mr

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From brettcave at gmail.com  Mon Aug 18 17:53:18 2008
From: brettcave at gmail.com (Brett Cave)
Date: Mon, 18 Aug 2008 19:53:18 +0200
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <110c01c90158$f9a352f0$ece9f8d0$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<1218721948.9521.191.camel@technetium.msp.redhat.com>
	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>
	<1218736056.9521.246.camel@technetium.msp.redhat.com>
	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
	<069601c8fedf$6e12ff40$4a38fdc0$@net>
	<07b601c8fefa$dfc46b90$9f4d42b0$@net>
	<e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>
	<110c01c90158$f9a352f0$ece9f8d0$@net>
Message-ID: <c0773fd30808181053k177d01f5y36546163ae2d6599@mail.gmail.com>

On Mon, Aug 18, 2008 at 7:36 PM, Chris Edwards
<cedwards at smartechcorp.net> wrote:
> Without a entry in fstab my gfs file systems never mount.   So I am
> wondering how I can leave out entries in my fstab.

AFAIK, fstab is needed, unless u want to run gfs_mount with all the
params. Why would you specifically not want to have fstab entries?


Brett



From cedwards at smartechcorp.net  Mon Aug 18 18:00:18 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Mon, 18 Aug 2008 14:00:18 -0400
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <c0773fd30808181053k177d01f5y36546163ae2d6599@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<1218721948.9521.191.camel@technetium.msp.redhat.com>	<a60f9340808140844lf966afey671f1eeb782fcbb2@mail.gmail.com>	<1218736056.9521.246.camel@technetium.msp.redhat.com>	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>	<069601c8fedf$6e12ff40$4a38fdc0$@net>	<07b601c8fefa$dfc46b90$9f4d42b0$@net>	<e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>	<110c01c90158$f9a352f0$ece9f8d0$@net>
	<c0773fd30808181053k177d01f5y36546163ae2d6599@mail.gmail.com>
Message-ID: <111301c9015c$46994a80$d3cbdf80$@net>

I don't care either way, I was just responding to the last reply to my
message.  The only reason I was thinking that you wouldn't want an fstab
entry is so that a server doesn't mount the GFS file systems until its apart
of the cluster.

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
Sent: Monday, August 18, 2008 1:53 PM
To: linux clustering
Subject: Re: [Linux-cluster] GFS as a Resource

On Mon, Aug 18, 2008 at 7:36 PM, Chris Edwards
<cedwards at smartechcorp.net> wrote:
> Without a entry in fstab my gfs file systems never mount.   So I am
> wondering how I can leave out entries in my fstab.

AFAIK, fstab is needed, unless u want to run gfs_mount with all the
params. Why would you specifically not want to have fstab entries?


Brett

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From anujhere at gmail.com  Mon Aug 18 18:34:42 2008
From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)?=)
Date: Tue, 19 Aug 2008 00:04:42 +0530
Subject: [Linux-cluster] rhel5 gfs1 or gfs2, how to prove that i have gfs1
In-Reply-To: <c0773fd30808180826w11dd8390y6a3b3d55572ce258@mail.gmail.com>
References: <3120c9e30808180725k942651bt1f674dcba6d32536@mail.gmail.com>
	<c0773fd30808180826w11dd8390y6a3b3d55572ce258@mail.gmail.com>
Message-ID: <3120c9e30808181134k3035c644s46b15727d9193f3c@mail.gmail.com>

Thanks kevin and Bob for the information.
When can we expect GFS2 for production?





On Mon, Aug 18, 2008 at 8:56 PM, Brett Cave <brettcave at gmail.com> wrote:

> 2008/8/18 Anuj Singh (????) <anujhere at gmail.com>:
> > Hello everyone,
> > I created a cluster on rhel5,
> > To make the gfs1 file system I used following command.
> >
> > ' gfs_mkfs -t new_cluster:GFS -p lock_dlm -j2 /dev/MyVol0/MyLV0'
> >
> > Mounted /dev/MyVol0/MyLV0  on  /image.
> > 'mount -t gfs /dev/MyVol0/MyLV0 /image '
> > No error I got.
> >
> > Mount command gives me.
> > /dev/mapper/MyVol0-MyLV0 on /image type gfs
> > (rw,hostdata=jid=0:id=131073:first=1)
> >
> > My questions:
> > 1) How to prove that I have gfs1 file system in use.
> > 2) Ismod |grep gfs gives me.
> > [root at pr0031 CL]# lsmod |grep gfs
> > gfs                   252740  1
> > gfs2                  341965  2 gfs,lock_dlm
> > configfs               28753  2 dlm
> > 3) rmmod gfs2
> > ERROR: Module gfs2 is in use by gfs,lock_dlm
> > ^^ why gfs2 module is in use?
> > as lock_dlm is associated with gfs2, does it mean it's not production
> ready?
>
> from    Bob Peterson <rpeterso at redhat.com>
> to      linux clustering <linux-cluster at redhat.com>
> date    Thu, Aug 14, 2008 at 3:59 PM
> subject Re: [Linux-cluster] GFS vs. GFS2: system-config-cluster, locking
> ...
>
> FYI--For 5.x, the locking infrastructure was common between GFS
> and GFS2.  It has to do with the common lock harness "lock_dlm"
> that is an interface between both GFS and GFS2 into the "dlm" module.
> For 5.3, we're splitting the locking modules apart to get rid of
> that dependency.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080819/d8f2550b/attachment.htm>

From lhh at redhat.com  Mon Aug 18 21:00:30 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 18 Aug 2008 17:00:30 -0400
Subject: [Linux-cluster] Fencing
In-Reply-To: <334701c8fd5b$7b7beeb0$7273cc10$@net>
References: <334701c8fd5b$7b7beeb0$7273cc10$@net>
Message-ID: <1219093230.3444.32.camel@ayanami>

On Wed, 2008-08-13 at 11:44 -0400, Chris Edwards wrote:
> Does anyone know where would I find a good technical explanation on
> how fencing works and how to set it up?


http://sources.redhat.com/cluster/wiki/FAQ/Fencing

Device-specific:

http://sources.redhat.com/cluster/wiki/WTI_FencingConfig (see last section)
http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig

Agent design:

http://sources.redhat.com/cluster/wiki/FenceAgentAPI

-- Lon







From lhh at redhat.com  Mon Aug 18 21:04:09 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 18 Aug 2008 17:04:09 -0400
Subject: [Linux-cluster] rgmanager timeout of script.sh
In-Reply-To: <32628.217.77.161.17.1218703379.squirrel@www.hal9000.cz>
References: <32628.217.77.161.17.1218703379.squirrel@www.hal9000.cz>
Message-ID: <1219093449.3444.36.camel@ayanami>

On Thu, 2008-08-14 at 10:42 +0200, Jakub Suchy wrote:
> Hello,
> i am trying to solve a problem with timeouting a status script.
> 
> We are using custom init script for our service, which is doing some
> operations in status section. However, one of the options when  this
> service is down is, that it hangs. Then, a status script may hang too
> because it is waiting for an operation to complete.
> 
> Is there a way to instruct rgmanager to automatically treat the service as
> failed if statut script doesn't return OK in say 30 seconds?

No - there's a bugzilla open about it.

https://bugzilla.redhat.com/show_bug.cgi?id=333161


-- Lon




From brettcave at gmail.com  Tue Aug 19 07:16:55 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 19 Aug 2008 09:16:55 +0200
Subject: [Linux-cluster] GFS as a Resource
In-Reply-To: <111301c9015c$46994a80$d3cbdf80$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<1218736056.9521.246.camel@technetium.msp.redhat.com>
	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>
	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>
	<069601c8fedf$6e12ff40$4a38fdc0$@net>
	<07b601c8fefa$dfc46b90$9f4d42b0$@net>
	<e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>
	<110c01c90158$f9a352f0$ece9f8d0$@net>
	<c0773fd30808181053k177d01f5y36546163ae2d6599@mail.gmail.com>
	<111301c9015c$46994a80$d3cbdf80$@net>
Message-ID: <c0773fd30808190016i1ad45570q997f311cc9768f0d@mail.gmail.com>

On Mon, Aug 18, 2008 at 8:00 PM, Chris Edwards
<cedwards at smartechcorp.net> wrote:
> I don't care either way, I was just responding to the last reply to my
> message.  The only reason I was thinking that you wouldn't want an fstab
> entry is so that a server doesn't mount the GFS file systems until its apart
> of the cluster.

I had issues with this exact problem, so I edited /etc/init.d/gfs and
changed chkconfig startup sequence order to a higher number (but still
lower than all services that depend on gfs mounts).

>
> ---
>
> Chris Edwards
> Smartech Corp.
> Div. of AirNet Group
> http://www.airnetgroup.com
> http://www.smartechcorp.net
> cedwards at smartechcorp.net
> P:  423-664-7678 x114
> C:  423-593-6964
> F:  423-664-7680
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
> Sent: Monday, August 18, 2008 1:53 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS as a Resource
>
> On Mon, Aug 18, 2008 at 7:36 PM, Chris Edwards
> <cedwards at smartechcorp.net> wrote:
>> Without a entry in fstab my gfs file systems never mount.   So I am
>> wondering how I can leave out entries in my fstab.
>
> AFAIK, fstab is needed, unless u want to run gfs_mount with all the
> params. Why would you specifically not want to have fstab entries?
>
>
> Brett
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From brettcave at gmail.com  Tue Aug 19 08:14:24 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 19 Aug 2008 10:14:24 +0200
Subject: [Linux-cluster] gfs_controld[]: retrieve_plocks: ckpt open error 12
	gfsmountpoint
Message-ID: <c0773fd30808190114j22eecefapb21aca1ff8a409e7@mail.gmail.com>

Seems like this is causing problems with the cluster - getting this on
1 node just before cluster hangs.
gfs_controld[]: retrieve_plocks: ckpt open error 12 gfs

The only reference i can find when googling this to plock.c
	rv = saCkptCheckpointOpen(ckpt_handle, &name, NULL,
				  SA_CKPT_CHECKPOINT_READ, 0, &h);
	if (rv == SA_AIS_ERR_TRY_AGAIN) {
		log_group(mg, "retrieve_plocks: ckpt open retry");
		sleep(1);
		goto open_retry;
	}
	if (rv != SA_AIS_OK) {
		log_error("retrieve_plocks: ckpt open error %d %s",
			  rv, mg->name);
		return;
	}

Not quite sure what CkptCheckpoint is, but from seeing the code from
ais, it seems to be some form of fault tolerance.
Found a post about a possible bug in the sackptCheckpointOpen function:
https://lists.linux-foundation.org/pipermail/openais/2006-September/008360.html


Have just installed newer versions of cman, gfs-utils, openais and
kmod-gfs, and upgraded kernel now, going to see if im still getting
hangs. been running for a few hours now with node resets and IO bursts
and seems to be behaving a little better.



From brettcave at gmail.com  Tue Aug 19 08:42:24 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 19 Aug 2008 10:42:24 +0200
Subject: [Linux-cluster] cman_tool flags: dirty follow up question
Message-ID: <c0773fd30808190142h4f6d5136l61c6104cbcbe48b6@mail.gmail.com>

Hi,

After upgrading the RPM's on Centos5, i get Flags: Dirty on cman_tool
status. Found a post on this list from May just before I joined. A
follow up to this topic.

Chrissie: It's a perfectly normal state. in fact it's expected if you
are running services. It simply means that the cluster has some
services running that have state of their own that cannot be recovered
without a full restart. I would be more worried if you did NOT see
this in cman_tool status. It's NOT a warning. don't worry about it :)


Prior to upgrading my cman and gfs-utils, I was getting "flags: ". The
cluster is not running any services other than the internal fence and
dlm ones (cman_tool services shows fence only).

There was an update to cvs last year September where the flag was
added, so I'm guessing that this might resolve the issue. (Think by
Chrissie again).

Don't quite understand "This node has internal state and must not join
a cluster that also has state" description of dirty flag though, does
this mean that because the node is part of a cluster it has state? And
that only a stateless node can join a cluster with state? (or if the
cluster doesn't have state, then the node will be the first one in the
cluster to start up...).



From ccaulfie at redhat.com  Tue Aug 19 08:58:12 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 19 Aug 2008 09:58:12 +0100
Subject: [Linux-cluster] cman_tool flags: dirty follow up question
In-Reply-To: <c0773fd30808190142h4f6d5136l61c6104cbcbe48b6@mail.gmail.com>
References: <c0773fd30808190142h4f6d5136l61c6104cbcbe48b6@mail.gmail.com>
Message-ID: <48AA8B24.6010105@redhat.com>

Brett Cave wrote:
> Hi,
> 
> After upgrading the RPM's on Centos5, i get Flags: Dirty on cman_tool
> status. Found a post on this list from May just before I joined. A
> follow up to this topic.
> 
> Chrissie: It's a perfectly normal state. in fact it's expected if you
> are running services. It simply means that the cluster has some
> services running that have state of their own that cannot be recovered
> without a full restart. I would be more worried if you did NOT see
> this in cman_tool status. It's NOT a warning. don't worry about it :)
> 
> 
> Prior to upgrading my cman and gfs-utils, I was getting "flags: ". The
> cluster is not running any services other than the internal fence and
> dlm ones (cman_tool services shows fence only).
> 
> There was an update to cvs last year September where the flag was
> added, so I'm guessing that this might resolve the issue. (Think by
> Chrissie again).
> 
> Don't quite understand "This node has internal state and must not join
> a cluster that also has state" description of dirty flag though, does
> this mean that because the node is part of a cluster it has state? And
> that only a stateless node can join a cluster with state? (or if the
> cluster doesn't have state, then the node will be the first one in the
> cluster to start up...).

The Dirty flag is set by services when they realise they have a state
they can't combine with an existing cluster. This is usually the DLM or
GFS in a Red Hat cluster. When a node first starts up it has no state
and can join in with other cluster nodes that do. It can then create its
own state quite happily because it can tell the other nodes about it.

The situation this flag is designed to prevent is if two cluster split
up for a short period of time and then rejoin soon - usually less the
the time it takes for fencing to take effect or maybe the cluster is
split evenly so that neither half has quorum. When this occurs each half
does not know what the other half has been up to during the split, and
so the two halves cannot be allowed to join in a cluster together again
for fear of corrupting each other's state.

This results in the dreaded "Disallowed" state that you might see
(though I hope not). It's usually caused by bad network configuration,
or excessive traffic. Fiddling with the totem parameters (CAREFULLY!)
can alleviate it.



Chrissie



From brettcave at gmail.com  Tue Aug 19 09:19:18 2008
From: brettcave at gmail.com (Brett Cave)
Date: Tue, 19 Aug 2008 11:19:18 +0200
Subject: [Linux-cluster] cman_tool flags: dirty follow up question
In-Reply-To: <48AA8B24.6010105@redhat.com>
References: <c0773fd30808190142h4f6d5136l61c6104cbcbe48b6@mail.gmail.com>
	<48AA8B24.6010105@redhat.com>
Message-ID: <c0773fd30808190219u5c58fbb4l90a7bb3b1e25f80b@mail.gmail.com>

On Tue, Aug 19, 2008 at 10:58 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> Brett Cave wrote:
>> Hi,
>>
>> After upgrading the RPM's on Centos5, i get Flags: Dirty on cman_tool
>> status. Found a post on this list from May just before I joined. A
>> follow up to this topic.
>>
>> Chrissie: It's a perfectly normal state. in fact it's expected if you
>> are running services. It simply means that the cluster has some
>> services running that have state of their own that cannot be recovered
>> without a full restart. I would be more worried if you did NOT see
>> this in cman_tool status. It's NOT a warning. don't worry about it :)
>>
>>
>> Prior to upgrading my cman and gfs-utils, I was getting "flags: ". The
>> cluster is not running any services other than the internal fence and
>> dlm ones (cman_tool services shows fence only).
>>
>> There was an update to cvs last year September where the flag was
>> added, so I'm guessing that this might resolve the issue. (Think by
>> Chrissie again).
>>
>> Don't quite understand "This node has internal state and must not join
>> a cluster that also has state" description of dirty flag though, does
>> this mean that because the node is part of a cluster it has state? And
>> that only a stateless node can join a cluster with state? (or if the
>> cluster doesn't have state, then the node will be the first one in the
>> cluster to start up...).
>
> The Dirty flag is set by services when they realise they have a state
> they can't combine with an existing cluster. This is usually the DLM or
> GFS in a Red Hat cluster. When a node first starts up it has no state
> and can join in with other cluster nodes that do. It can then create its
> own state quite happily because it can tell the other nodes about it.
>
> The situation this flag is designed to prevent is if two cluster split
> up for a short period of time and then rejoin soon - usually less the
> the time it takes for fencing to take effect or maybe the cluster is
> split evenly so that neither half has quorum. When this occurs each half
> does not know what the other half has been up to during the split, and
> so the two halves cannot be allowed to join in a cluster together again
> for fear of corrupting each other's state.
>
> This results in the dreaded "Disallowed" state that you might see
> (though I hope not). It's usually caused by bad network configuration,
> or excessive traffic. Fiddling with the totem parameters (CAREFULLY!)
> can alleviate it.
>

Thanks, makes sense now.

The cluster is relatively small (3 - 6 nodes initially), so going to
leave ais totem config for now. will start adjusting when we do load
IO testing if recovery is too slow. would like to see how failure
recovery performs when heartbeat failures is enabled vs disabled.

Brett
>
>
> Chrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From linux at vfemail.net  Tue Aug 19 14:23:49 2008
From: linux at vfemail.net (Alex)
Date: Tue, 19 Aug 2008 17:23:49 +0300
Subject: [Linux-cluster] gnbd fencing
Message-ID: <200808191723.49872.linux@vfemail.net>

I have a cluster with 5 nodes:
- 3 gnbd clients and 
- 2 gnbd servers

our gnbd clients have the following IPs:
192.168.113.{3,4,5}
our gnbd servers have the following IPs:
192.168.113.{6,7}

Our gnbd servers 192.168.113.{6,7} are exporting different block devices, 
let's say: shd1_hdb1 from server1 and shd2_hdc1 from server2

All our clients 192.168.113.{3,4,5} are importing booth devices:
[root at rs1 ~]# ls /dev/gnbd
shd1_hdb1  shd2_hdc1
[root at rs2 ~]# ls /dev/gnbd
shd1_hdb1  shd2_hdc1
[root at rs3 ~]# ls /dev/gnbd
shd1_hdb1  shd2_hdc1

On our client nodes /dev/gnbd/shd1_hdb1 and /dev/gnbd/shd2_hdc1 are forming a 
LVM mirrorored volume which on top is using gfs.

Question1: in cluster.conf, for our gnbd servers is correct to use the same 
fencing method like the one used for our client nodes or should be used 
manual fencing?

For example, for node 4 and 5 is correct to have:

<clusternode name="192.168.113.6" nodeid="4" votes="1">
<fence>
 <method name="first_fence_method_for_this_node">
  <device name="gnbd_from_shds" nodename="192.168.113.6"/>
 </method>
</fence>
</clusternode>

<clusternode name="192.168.113.7" nodeid="5" votes="1">
<fence>
 <method name="first_fence_method_for_this_node">
  <device name="gnbd_from_shds" nodename="192.168.113.7"/>
 </method>
</fence>
</clusternode>

Question2: Our gfs logical volume should be defined as quorum disk or is 
enough to increase votes on node 4 and node 5 (our gnbd servers) from 
votes="1" to votes="2"?

Here comes my current cluster.conf:

<?xml version="1.0"?>
<cluster alias="httpcluster" config_version="32" name="httpcluster">
        <fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="192.168.113.5" nodeid="1" votes="1">
                        <fence>
                                <method 
name="first_fence_method_for_this_node">
                                        <device name="gnbd_from_shds" 
nodename="192.168.113.5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="192.168.113.4" nodeid="2" votes="1">
                        <fence>
                                <method 
name="first_fence_method_for_this_node">
                                        <device name="gnbd_from_shds" 
nodename="192.168.113.4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="192.168.113.3" nodeid="3" votes="1">
                        <fence>
                                <method 
name="first_fence_method_for_this_node">
                                        <device name="gnbd_from_shds" 
nodename="192.168.113.3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="192.168.113.6" nodeid="4" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="192.168.113.7" nodeid="5" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_gnbd" name="gnbd_from_shds" 
servers="192.168.113.6 192.168.113.7"/>
                <fencedevice agent="fence_manual" name="manual_f"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

Regards,
Alx



From cedwards at smartechcorp.net  Tue Aug 19 14:57:15 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Tue, 19 Aug 2008 10:57:15 -0400
Subject: [Linux-cluster] XVM Fence Daemon
In-Reply-To: <c0773fd30808190016i1ad45570q997f311cc9768f0d@mail.gmail.com>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<1218736056.9521.246.camel@technetium.msp.redhat.com>	<04cb01c8fe4b$fb40f640$f1c2e2c0$@net>	<c0773fd30808150152w213658d7va6c75be4bf335651@mail.gmail.com>	<069601c8fedf$6e12ff40$4a38fdc0$@net>	<07b601c8fefa$dfc46b90$9f4d42b0$@net>	<e83473390808151105kabee394gb30cc70b8680fe8d@mail.gmail.com>	<110c01c90158$f9a352f0$ece9f8d0$@net>	<c0773fd30808181053k177d01f5y36546163ae2d6599@mail.gmail.com>	<111301c9015c$46994a80$d3cbdf80$@net>
	<c0773fd30808190016i1ad45570q997f311cc9768f0d@mail.gmail.com>
Message-ID: <158d01c9020b$de28a030$9a79e090$@net>

In Luci it gives me an option to run a XVM fence daemon, does anyone know
what this is?

Thanks!

---

Chris Edwards






From rottmann at atix.de  Tue Aug 19 15:12:19 2008
From: rottmann at atix.de (Reiner Rottmann)
Date: Tue, 19 Aug 2008 17:12:19 +0200
Subject: [Linux-cluster] XVM Fence Daemon
In-Reply-To: <158d01c9020b$de28a030$9a79e090$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<c0773fd30808190016i1ad45570q997f311cc9768f0d@mail.gmail.com>
	<158d01c9020b$de28a030$9a79e090$@net>
Message-ID: <200808191712.22721.rottmann@atix.de>

Hello,

This is from the man page. Always a good start to dive into the fence tools:

# man fence_xvm

---quote---
NAME
       fence_xvm - I/O Fencing agent for Xen virtual machines.
---quote---

-- 
Gruss / Regards,
Dipl.-Ing. (FH) Reiner Rottmann

ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-12     Fax: +49-89 990 1766-0
Email: rottmann at atix.de       PGP Key-ID: 0xCA67C5A6
       www.atix.de  |  www.open-sharedroot.org

Vorstaende: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
Vorsitzender des Aufsichtsrats: Dr. Martin Buss
Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Visit my presentation about Xen Virtualization at FrOSCon 2008
http://programm.froscon.org/2008/events/165.en.htmlOn Tuesday 19 August 2008 
04:57:15 pm Chris Edwards wrote:
> In Luci it gives me an option to run a XVM fence daemon, does anyone know
> what this is?
>
> Thanks!
>
> ---
>
> Chris Edwards
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080819/59efea86/attachment.sig>

From cedwards at smartechcorp.net  Tue Aug 19 16:49:29 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Tue, 19 Aug 2008 12:49:29 -0400
Subject: [Linux-cluster] VM Service not rebooting.
In-Reply-To: <200808191723.49872.linux@vfemail.net>
References: <200808191723.49872.linux@vfemail.net>
Message-ID: <160c01c9021b$8be39450$a3aabcf0$@net>

I have basically followed these instructions on building a xen cluster...
http://www.redhatmagazine.com/2007/08/23/automated-failover-and-recovery-of-
virtualized-guests-in-advanced-platform/#comment-121366

When I issue a "xm destroy <ID>" the vm automatically reboots backup, but
when I reboot or shut down a machine the VM's that were on the shutdown node
never come back up on the existing node.

Any idea why this doesn't work?

Thanks!

My cluster.conf...

<?xml version="1.0"?>

<cluster alias="Xen" config_version="67" name="Xen">

        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="300"/>

        <clusternodes>

                <clusternode name="xen1.smartechcorp.net" nodeid="1"
votes="1">

                        <fence>

                                <method name="1">

                                        <device name="manual"
nodename="xen1.smartechcorp.net"/>

                                </method>

                        </fence>

                </clusternode>

                <clusternode name="xen2.smartechcorp.net" nodeid="2"
votes="1">

                        <fence>

                                <method name="1">

                                        <device name="manual"
nodename="xen2.smartechcorp.net"/>

                                </method>

                        </fence>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices>

                <fencedevice agent="fence_manual" name="manual"/>

        </fencedevices>

        <rm>

                <failoverdomains>

                        <failoverdomain name="bias-xen1" nofailback="0"
ordered="0" restricted="0">

                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="1"/>

                        </failoverdomain>

                        <failoverdomain name="bias-xen2" nofailback="0"
ordered="0" restricted="0">

                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="1"/>

                        </failoverdomain>

                        <failoverdomain name="gfs-xen1" nofailback="0"
ordered="0" restricted="1">

                                <failoverdomainnode
name="xen1.smartechcorp.net" priority="1"/>

                        </failoverdomain>

                        <failoverdomain name="gfs-xen2" nofailback="0"
ordered="0" restricted="1">

                                <failoverdomainnode
name="xen2.smartechcorp.net" priority="1"/>

                        </failoverdomain>

                </failoverdomains>

                <resources/>

                <service autostart="1" domain="gfs-xen2" exclusive="0"
name="GFS Mount Xen2" recovery="restart"/>

                <service autostart="1" domain="gfs-xen1" exclusive="0"
name="GFS Mount Xen1" recovery="restart"/>

                <vm autostart="1" domain="bias-xen1" exclusive="0"
migrate="live" name="Windows1" path="/xen/boot" recovery="relocate"/>

                <vm autostart="1" domain="bias-xen2" exclusive="0"
migrate="live" name="Linux1" path="/xen/boot" recovery="relocate"/>

        </rm>

        <quorumd device="/dev/sdb5" interval="1" min_score="1" tko="10"
votes="1">

                <heuristic interval="2" program="ping -c3 -t2 10.10.10.1"
score="1"/>

        </quorumd>

</cluster>

---

Chris Edwards






From cedwards at smartechcorp.net  Tue Aug 19 16:50:06 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Tue, 19 Aug 2008 12:50:06 -0400
Subject: [Linux-cluster] XVM Fence Daemon
In-Reply-To: <200808191712.22721.rottmann@atix.de>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>	<c0773fd30808190016i1ad45570q997f311cc9768f0d@mail.gmail.com>	<158d01c9020b$de28a030$9a79e090$@net>
	<200808191712.22721.rottmann@atix.de>
Message-ID: <160d01c9021b$a25c99c0$e715cd40$@net>

Thanks!  It looks like its for handling xen nodes that are in a cluster.

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Reiner Rottmann
Sent: Tuesday, August 19, 2008 11:12 AM
To: linux clustering
Subject: Re: [Linux-cluster] XVM Fence Daemon

Hello,

This is from the man page. Always a good start to dive into the fence tools:

# man fence_xvm

---quote---
NAME
       fence_xvm - I/O Fencing agent for Xen virtual machines.
---quote---

--
Gruss / Regards,
Dipl.-Ing. (FH) Reiner Rottmann

ATIX Informationstechnologie und Consulting AG Einsteinstr. 10
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-12     Fax: +49-89 990 1766-0
Email: rottmann at atix.de       PGP Key-ID: 0xCA67C5A6
       www.atix.de  |  www.open-sharedroot.org

Vorstaende: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) Vorsitzender
des Aufsichtsrats: Dr. Martin Buss
Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Visit my presentation about Xen Virtualization at FrOSCon 2008
http://programm.froscon.org/2008/events/165.en.htmlOn Tuesday 19 August 2008
04:57:15 pm Chris Edwards wrote:
> In Luci it gives me an option to run a XVM fence daemon, does anyone 
> know what this is?
>
> Thanks!
>
> ---
>
> Chris Edwards
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster






From bruno at redix.com.br  Tue Aug 19 19:16:36 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Tue, 19 Aug 2008 16:16:36 -0300
Subject: [Linux-cluster] Fence Debug
Message-ID: <48AB1C14.8040704@redix.com.br>

Hi
I have problems with my fence(bladecenter) and i want to debug it, how 
can i do this?
I try to run fenced -D but have no success!!


-- 
Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



From jparsons at redhat.com  Tue Aug 19 19:24:02 2008
From: jparsons at redhat.com (jim parsons)
Date: Tue, 19 Aug 2008 15:24:02 -0400
Subject: [Linux-cluster] Fence Debug
In-Reply-To: <48AB1C14.8040704@redix.com.br>
References: <48AB1C14.8040704@redix.com.br>
Message-ID: <1219173842.3267.18.camel@localhost.localdomain>

On Tue, 2008-08-19 at 16:16 -0300, Bruno Frensch Deschamps wrote:
> Hi
> I have problems with my fence(bladecenter) and i want to debug it, how 
> can i do this?
> I try to run fenced -D but have no success!!
> 
> 

Try running fence_bladecenter on one of the nodes from the command line.
Look at the man page for necessary switches. Use the -v verbose setting.

If that works, run fence_node <name_of_node_to_fence> on a node. If this
does not work, likely a cluster.conf error (mispelt attribute...missing
param...etc.)

-j



From jparsons at redhat.com  Tue Aug 19 20:11:34 2008
From: jparsons at redhat.com (jim parsons)
Date: Tue, 19 Aug 2008 16:11:34 -0400
Subject: [Linux-cluster] XVM Fence Daemon
In-Reply-To: <160d01c9021b$a25c99c0$e715cd40$@net>
References: <a60f9340808132137h120c597cja34c108e1fc5958f@mail.gmail.com>
	<c0773fd30808190016i1ad45570q997f311cc9768f0d@mail.gmail.com>
	<158d01c9020b$de28a030$9a79e090$@net>
	<200808191712.22721.rottmann@atix.de>
	<160d01c9021b$a25c99c0$e715cd40$@net>
Message-ID: <1219176694.3267.20.camel@localhost.localdomain>


> 04:57:15 pm Chris Edwards wrote:
> > In Luci it gives me an option to run a XVM fence daemon, does anyone 
> > know what this is?
> >
> > Thanks!
> >
> > ---
> >
> > Chris Edwards
When building a cluster of virtual machines (an inner cluster, if you
will), it is necessary to have a way of fencing a virtual node. This is
where fence_xvm comes in handy. It consists of two parts: The agent
itself which resides in the /sbin dir of each virtual node, and a
fence_xvm daemon (fence_xvmd) which runs on the hardware supporting the
virtual nodes (the outer cluster). The daemon listens for fencing
requests from the agents on the virtual nodes, and then does the deed:
quickly, and without remorse. ;)

Selecting the luci option to run XVM fence daemon places a configuration
setting in the outer clusters' cluster.conf file indicating this daemon
needs to be started when the outer cluster starts. You will still need
to configure the virtual nodes to use fence_xvm though. 

-j
> >
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From fdinitto at redhat.com  Wed Aug 20 07:38:02 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 20 Aug 2008 09:38:02 +0200 (CEST)
Subject: [Linux-cluster] Re: [Cluster-devel] Info about building master
	branch
In-Reply-To: <Pine.LNX.4.64.0808121340470.16551@trider-g7>
References: <Pine.LNX.4.64.0808121340470.16551@trider-g7>
Message-ID: <Pine.LNX.4.64.0808200937280.16551@trider-g7>

On Tue, 12 Aug 2008, Fabio M. Di Nitto wrote:

>
> hi guys,
>
> Chrissie landed today the awesome port to corosync for cman and related bits. 
> Shortly after also some of the usual build bits have landed.
>
> There is a bug in corosync headers that break cman build.
>
> In order to workaround the problem you can just use:
>
> CFLAGS="-I/path/to/openais/include/files" make

If you upgrade to corosync 0.91 or higher, this hack/workaround is no 
longer required.

Fabio

--
I'm going to make him an offer he can't refuse.



From chris_mavin at hotmail.com  Wed Aug 20 10:01:14 2008
From: chris_mavin at hotmail.com (Chris Mavin)
Date: Wed, 20 Aug 2008 11:01:14 +0100
Subject: [Linux-cluster] gnbd export
Message-ID: <BAY126-W181F78E75F476723C84F1D8E680@phx.gbl>


Hi there,
 
Wonder if anyone could shed some light on this topic.
 
I have a running cluster but the gnbd export on the storage server is exporting a logical volume rather then the raw device. 
 
It all seems to work fine but would exporting the raw device improve performance or anything?
 
Thanks
 
Chris.
 
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080820/fca9e199/attachment.htm>

From smaddison at competa.com  Wed Aug 20 11:51:21 2008
From: smaddison at competa.com (Steve Maddison)
Date: Wed, 20 Aug 2008 13:51:21 +0200 (CEST)
Subject: [Linux-cluster] gnbd export
In-Reply-To: <BAY126-W181F78E75F476723C84F1D8E680@phx.gbl>
References: <BAY126-W181F78E75F476723C84F1D8E680@phx.gbl>
Message-ID: <5450.FntSdVURQy0=.1219233081.squirrel@192.168.2.64>

Chris Mavin wrote:
> I have a running cluster but the gnbd export on the storage server is
> exporting a logical volume rather then the raw device.
>
> It all seems to work fine but would exporting the raw device improve
> performance or anything?

If you're asking whether the use of a logical volume as opposed to a raw
device makes any difference in performance, it depends on how the logical
volume was set up. If the LV is spread over several physical devices,
performance can be affected depending on whether the data is striped,
mirrored or appended linearly. If a single device is simply being used as
a "container" for the LV, the additional overhead may ever so slightly
degrade performance (although the difference is usually negligible).

The advantage of using a logical volume is of course that you can easily
manipulate it, for example to add more physical storage space later.

Cheers,

--
Steve Maddison
Sr. Unix/Linux Engineer
Competa IT B.V.
http://www.competa.com/




From bruno at redix.com.br  Wed Aug 20 12:03:40 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Wed, 20 Aug 2008 09:03:40 -0300
Subject: [Linux-cluster] Luci / ricci problems
Message-ID: <48AC081C.7090208@redix.com.br>

Hi
There is a error in my luci server,

I configure a cluster on node 1 but it does not replicate do node2.

How i fix this?


-- 
Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



From ajeet.singh.raina at logica.com  Wed Aug 20 13:04:58 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Wed, 20 Aug 2008 18:34:58 +0530
Subject: [Linux-cluster] Want to Setup NFS with redhat Cluster??
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A8C@in-ex004.groupinfra.com>

Can anyone Help me in setting up NFS with Red Hat Cluster?
I still don't understand what is NFS Mount, NFS Client and NFS Exports?

Please Help me.


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080820/074cc499/attachment.htm>

From cedwards at smartechcorp.net  Wed Aug 20 13:29:50 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Wed, 20 Aug 2008 09:29:50 -0400
Subject: [Linux-cluster] Luci / ricci problems
In-Reply-To: <48AC081C.7090208@redix.com.br>
References: <48AC081C.7090208@redix.com.br>
Message-ID: <19ef01c902c8$d284e7d0$778eb770$@net>

Are you using the web interface to configure the node?

---

Chris Edwards


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno Frensch
Deschamps
Sent: Wednesday, August 20, 2008 8:04 AM
To: linux clustering
Subject: [Linux-cluster] Luci / ricci problems

Hi
There is a error in my luci server,

I configure a cluster on node 1 but it does not replicate do node2.

How i fix this?


-- 
Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





From bruno at redix.com.br  Wed Aug 20 13:52:03 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Wed, 20 Aug 2008 10:52:03 -0300
Subject: [Linux-cluster] Luci / ricci problems
In-Reply-To: <19ef01c902c8$d284e7d0$778eb770$@net>
References: <48AC081C.7090208@redix.com.br>
	<19ef01c902c8$d284e7d0$778eb770$@net>
Message-ID: <48AC2183.4020704@redix.com.br>

Yes

Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



Chris Edwards escreveu:
> Are you using the web interface to configure the node?
>
> ---
>
> Chris Edwards
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno Frensch
> Deschamps
> Sent: Wednesday, August 20, 2008 8:04 AM
> To: linux clustering
> Subject: [Linux-cluster] Luci / ricci problems
>
> Hi
> There is a error in my luci server,
>
> I configure a cluster on node 1 but it does not replicate do node2.
>
> How i fix this?
>
>
>   



From cedwards at smartechcorp.net  Wed Aug 20 14:01:26 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Wed, 20 Aug 2008 10:01:26 -0400
Subject: [Linux-cluster] Luci / ricci problems
In-Reply-To: <48AC2183.4020704@redix.com.br>
References: <48AC081C.7090208@redix.com.br>	<19ef01c902c8$d284e7d0$778eb770$@net>
	<48AC2183.4020704@redix.com.br>
Message-ID: <1a5a01c902cd$3cab72b0$b6025810$@net>

Weird,  as long as I have joined both nodes to the cluster they should
update.  Make sure both nodes are identical to start with and have the same
config_version.

---

Chris Edwards


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno Frensch
Deschamps
Sent: Wednesday, August 20, 2008 9:52 AM
To: linux clustering
Subject: Re: [Linux-cluster] Luci / ricci problems

Yes

Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



Chris Edwards escreveu:
> Are you using the web interface to configure the node?
>
> ---
>
> Chris Edwards
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno Frensch
> Deschamps
> Sent: Wednesday, August 20, 2008 8:04 AM
> To: linux clustering
> Subject: [Linux-cluster] Luci / ricci problems
>
> Hi
> There is a error in my luci server,
>
> I configure a cluster on node 1 but it does not replicate do node2.
>
> How i fix this?
>
>
>   

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





From cedwards at smartechcorp.net  Wed Aug 20 14:23:43 2008
From: cedwards at smartechcorp.net (Chris Edwards)
Date: Wed, 20 Aug 2008 10:23:43 -0400
Subject: [Linux-cluster] Luci / ricci problems
In-Reply-To: <1a5a01c902cd$3cab72b0$b6025810$@net>
References: <48AC081C.7090208@redix.com.br>	<19ef01c902c8$d284e7d0$778eb770$@net>	<48AC2183.4020704@redix.com.br>
	<1a5a01c902cd$3cab72b0$b6025810$@net>
Message-ID: <1a9001c902d0$59c9cba0$0d5d62e0$@net>

I meant both cluster.conf files are identical.

---

Chris Edwards
Smartech Corp.
Div. of AirNet Group
http://www.airnetgroup.com
http://www.smartechcorp.net
cedwards at smartechcorp.net
P:  423-664-7678 x114
C:  423-593-6964
F:  423-664-7680


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chris Edwards
Sent: Wednesday, August 20, 2008 10:01 AM
To: 'linux clustering'
Subject: RE: [Linux-cluster] Luci / ricci problems

Weird,  as long as I have joined both nodes to the cluster they should
update.  Make sure both nodes are identical to start with and have the same
config_version.

---

Chris Edwards


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno Frensch
Deschamps
Sent: Wednesday, August 20, 2008 9:52 AM
To: linux clustering
Subject: Re: [Linux-cluster] Luci / ricci problems

Yes

Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



Chris Edwards escreveu:
> Are you using the web interface to configure the node?
>
> ---
>
> Chris Edwards
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno Frensch
> Deschamps
> Sent: Wednesday, August 20, 2008 8:04 AM
> To: linux clustering
> Subject: [Linux-cluster] Luci / ricci problems
>
> Hi
> There is a error in my luci server,
>
> I configure a cluster on node 1 but it does not replicate do node2.
>
> How i fix this?
>
>
>   

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





From tom at netspot.com.au  Thu Aug 21 03:08:02 2008
From: tom at netspot.com.au (Tom Lanyon)
Date: Thu, 21 Aug 2008 12:38:02 +0930
Subject: [Linux-cluster] Common GNBD boot/startup procedures?
Message-ID: <3C12B5DA-D6FF-4718-B3AC-A0B48BFD50FA@netspot.com.au>

Hi list,

What are people commonly doing to start gnbd_serv and to export/import  
GNBDs on server startup?

Thanks,
Tom

--
Tom Lanyon
Systems Administrator
NetSpot Pty Ltd



From ajeet.singh.raina at logica.com  Thu Aug 21 07:00:53 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 21 Aug 2008 12:30:53 +0530
Subject: [Linux-cluster] NFS Issue?
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17A8F@in-ex004.groupinfra.com>

Hello Guys,

I have few doubts and I think this gonna be best place to discuss.
I have two REDHAT Cluster nodes.My Clustat says:
[code]
tuxbuddy# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  ed1-cluster                           Online, Local, rgmanager
  ed2-cluster                           Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  bang                 ed2-cluster                 started
tuxbuddy#
[/code]

The Service "bang" is just a test script and its doing the failover
successfully.


I have a third Server called Application Server which is  outside the
cluster.That Application server needs to be continously mounted on one
of the cluster nodes.So for that someone advised me of NFS.

Now I have NFS installed on both the cluster nodes and also in
Application Server.

What I think is When the first node failover the second one comes up.And
NFS too dies(stop) on first node(Correct me if I am wrong?)
Now The connectivity between failed node and Application Server fails?

I want such a setup in which The Application Server remain connected to
First node even if the first node does failover to the secondary cluster
nodes.

Any Suggestions?


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080821/3a606d10/attachment.htm>

From Markus at hochholdinger.net  Thu Aug 21 09:51:38 2008
From: Markus at hochholdinger.net (Markus Hochholdinger)
Date: Thu, 21 Aug 2008 11:51:38 +0200
Subject: [Linux-cluster] 30 secs to login to two-node RHEL 5.2 cluster
In-Reply-To: <60f08e700807221837u71979a20qe771a314a940d5ca@mail.gmail.com>
References: <60f08e700807221837u71979a20qe771a314a940d5ca@mail.gmail.com>
Message-ID: <200808211151.44906.Markus@hochholdinger.net>

Hi,

Am Mittwoch, 23. Juli 2008 03:37 schrieb sunhux G:
> We have a 2 node RHEL V5.2 cluster.
> On both nodes, it takes about 30 seconds from the time
> after I key in the password & hit ENTER   to get to the
> command prompt.  Yes, it consistently took that amount
> of time for "both" nodes.
[..]
> What else should I look out for?

nameserver!?


-- 
greetings

eMHa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080821/a8b2edc8/attachment.sig>

From deepjm at gmail.com  Thu Aug 21 13:43:48 2008
From: deepjm at gmail.com (Bruno Deschamps)
Date: Thu, 21 Aug 2008 10:43:48 -0300
Subject: [Linux-cluster] Cluster Fence not working on all nodes
Message-ID: <54843e600808210643j59c6c77fi26965193bcf4607b@mail.gmail.com>

 Hi

I have testing my cluster nodes on IBM Blade Center, but i have a problem,
the fence does not work correctly.
I have 2 nodes:
node1 : 10.0.20.34
node2 : 10.0.20.35

When i fence it manually with the command fence_node node1 and fence_node
node2 its woks correctly.
When a service is running on node2, and i disconnect node1 form the
network(to force the fence) it works correctly too.
My problem is when is running a service on node1, and i disconnect node2
from the network, is does not fence the machine.

Here is the logs of the servers, then you can see the fence working, On
node2 you can note that fence return success. But not on node1.

Have you ever experienced this kind of a problem?
Have any suggestions on what i have to do?


I run fence_tool dump on the node that dont fence, and show this message:

1219238621 stop default
1219238621 start default 4 members 1
1219238621 do_recovery stop 1 start 4 finish 1
1219238621 add node 2 to list 1
1219238621 averting fence of node 10.0.20.35
1219238621 finish default 4
1219238681 client 4: dump


Someone  know why he show the message "averting fence of node" and dont
fence the node?

Thanks for the help.



node1:
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering GATHER state from 0.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Creating commit token because I
am the rep.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Saving state aru 57 high seq
received 57
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Storing new sequence id for
ring 1edb5c
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering COMMIT state.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering RECOVERY state.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] position [0] member 10.0.20.34:

Aug 18 11:11:46 node1 openais[3515]: [TOTEM] previous ring seq 2022232 rep
10.0.20.34
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] aru 57 high delivered 57
received flag 1
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Did not need to originate any
messages in recovery.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Sending initial ORF token
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] New Configuration:
Aug 18 11:11:46 node1 kernel: dlm: closing connection to node 2
*Aug 18 11:11:46 node1 fenced: 10.0.20.35 not a cluster member after 0 sec
post_fail_delay*
Aug 18 11:11:46 node1 openais[3515]: [CLM  ]    r(0) ip(10.0.20.34)
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Left:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ]    r(0) ip(10.0.20.35)
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Joined:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] New Configuration:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ]    r(0) ip(10.0.20.34)
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Left:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Joined:
Aug 18 11:11:46 node1 openais[3515]: [SYNC ] This node is within the primary
component and will provide service.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering OPERATIONAL state.
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] got nodejoin message 10.0.20.34
Aug 18 11:11:46 node1 openais[3515]: [CPG  ] got joinlist message from node
1



Node2:
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] entering GATHER state from 0.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Creating commit token because I
am the rep.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Saving state aru 53 high seq
received 53
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Storing new sequence id for
ring 1edb7c
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] entering COMMIT state.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] entering RECOVERY state.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] position [0] member 10.0.20.35:

Aug 18 15:55:52 node2 openais[5232]: [TOTEM] previous ring seq 2022264 rep
10.0.20.34
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] aru 53 high delivered 53
received flag 1
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Did not need to originate any
messages in recovery.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Sending initial ORF token
Aug 18 15:55:52 node2 openais[5232]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 15:55:52 node2 openais[5232]: [CLM  ] New Configuration:
Aug 18 15:55:52 node2 kernel: dlm: closing connection to node 1
*Aug 18 15:55:52 node2 fenced[5248]: 10.0.20.34 not a cluster member after 0
sec post_fail_delay*
Aug 18 15:55:52 node2 openais[5232]: [CLM  ]    r(0) ip(10.0.20.35)
Aug 18 15:55:52 node2 fenced[5248]: fencing node "10.0.20.34"
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Left:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ]    r(0) ip(10.0.20.34)
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Joined:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] New Configuration:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ]    r(0) ip(10.0.20.35)
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Left:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Joined:
Aug 18 15:55:53 node2 openais[5232]: [SYNC ] This node is within the primary
component and will provide service.
Aug 18 15:55:53 node2 openais[5232]: [TOTEM] entering OPERATIONAL state.
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] got nodejoin message 10.0.20.35
Aug 18 15:55:53 node2 openais[5232]: [CPG  ] got joinlist message from node
2
*Aug 18 15:55:59 node2 fenced[5248]: fence "10.0.20.34" success*
Aug 18 15:56:00 node2 clurgmgrd[5507]: <notice> Taking over service
service:FirewallClusta from down member 10.0.20.34
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080821/ae1fc1cf/attachment.htm>

From bruno at redix.com.br  Thu Aug 21 13:46:49 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Thu, 21 Aug 2008 10:46:49 -0300
Subject: [Linux-cluster] Cluster Fence dont work on all nodes
Message-ID: <48AD71C9.4050605@redix.com.br>

Hi

I have testing my cluster nodes on IBM Blade Center, but i have a 
problem, the fence does not work correctly.
I have 2 nodes:
node1 : 10.0.20.34
node2 : 10.0.20.35

When i fence it manually with the command fence_node node1 and 
fence_node node2 its woks correctly.
When a service is running on node2, and i disconnect node1 form the 
network(to force the fence) it works correctly too.
My problem is when is running a service on node1, and i disconnect node2 
from the network, is does not fence the machine.

Here is the logs of the servers, then you can see the fence working, On 
node2 you can note that fence return success. But not on node1.

Have you ever experienced this kind of a problem?
Have any suggestions on what i have to do?


I run fence_tool dump on the node that dont fence, and show this message:

1219238621 stop default
1219238621 start default 4 members 1
1219238621 do_recovery stop 1 start 4 finish 1
1219238621 add node 2 to list 1
1219238621 averting fence of node 10.0.20.35
1219238621 finish default 4
1219238681 client 4: dump


Someone  know why he show the message "averting fence of node" and dont 
fence the node?

Thanks for the help.



node1:
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering GATHER state from 0.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Creating commit token 
because I am the rep.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Saving state aru 57 high 
seq received 57
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Storing new sequence id for 
ring 1edb5c
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering COMMIT state.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering RECOVERY state.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] position [0] member 
10.0.20.34:
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] previous ring seq 2022232 
rep 10.0.20.34
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] aru 57 high delivered 57 
received flag 1
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Did not need to originate 
any messages in recovery.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] Sending initial ORF token
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] New Configuration:
Aug 18 11:11:46 node1 kernel: dlm: closing connection to node 2
Aug 18 11:11:46 node1 fenced: 10.0.20.35 not a cluster member after 0 
sec post_fail_delay
Aug 18 11:11:46 node1 openais[3515]: [CLM  ]    r(0) ip(10.0.20.34) 
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Left:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ]    r(0) ip(10.0.20.35) 
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Joined:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] New Configuration:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ]    r(0) ip(10.0.20.34) 
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Left:
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] Members Joined:
Aug 18 11:11:46 node1 openais[3515]: [SYNC ] This node is within the 
primary component and will provide service.
Aug 18 11:11:46 node1 openais[3515]: [TOTEM] entering OPERATIONAL state.
Aug 18 11:11:46 node1 openais[3515]: [CLM  ] got nodejoin message 
10.0.20.34
Aug 18 11:11:46 node1 openais[3515]: [CPG  ] got joinlist message from 
node 1



Node2:
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] entering GATHER state from 0.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Creating commit token 
because I am the rep.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Saving state aru 53 high 
seq received 53
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Storing new sequence id for 
ring 1edb7c
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] entering COMMIT state.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] entering RECOVERY state.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] position [0] member 
10.0.20.35:
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] previous ring seq 2022264 
rep 10.0.20.34
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] aru 53 high delivered 53 
received flag 1
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Did not need to originate 
any messages in recovery.
Aug 18 15:55:52 node2 openais[5232]: [TOTEM] Sending initial ORF token
Aug 18 15:55:52 node2 openais[5232]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 15:55:52 node2 openais[5232]: [CLM  ] New Configuration:
Aug 18 15:55:52 node2 kernel: dlm: closing connection to node 1
Aug 18 15:55:52 node2 fenced[5248]: 10.0.20.34 not a cluster member 
after 0 sec post_fail_delay
Aug 18 15:55:52 node2 openais[5232]: [CLM  ]    r(0) ip(10.0.20.35) 
Aug 18 15:55:52 node2 fenced[5248]: fencing node "10.0.20.34"
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Left:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ]    r(0) ip(10.0.20.34) 
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Joined:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] New Configuration:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ]    r(0) ip(10.0.20.35) 
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Left:
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] Members Joined:
Aug 18 15:55:53 node2 openais[5232]: [SYNC ] This node is within the 
primary component and will provide service.
Aug 18 15:55:53 node2 openais[5232]: [TOTEM] entering OPERATIONAL state.
Aug 18 15:55:53 node2 openais[5232]: [CLM  ] got nodejoin message 
10.0.20.35
Aug 18 15:55:53 node2 openais[5232]: [CPG  ] got joinlist message from 
node 2
Aug 18 15:55:59 node2 fenced[5248]: fence "10.0.20.34" success
Aug 18 15:56:00 node2 clurgmgrd[5507]: <notice> Taking over service 
service:FirewallClusta from down member 10.0.20.34

-- 
Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



From David.Livingstone at cn.ca  Thu Aug 21 16:03:55 2008
From: David.Livingstone at cn.ca (David.Livingstone at cn.ca)
Date: Thu, 21 Aug 2008 10:03:55 -0600
Subject: [Linux-cluster] GFS with iSCSI and HP 2012i
Message-ID: <OF606D0B42.27DE7266-ON872574AB.007AA52B-872574AC.00583C33@cn.ca>

 I'm looking for feedback on two items :

1. Experience with GFS  using  iSCSI 
2. Any experience using the HP 2012i iSCSI array .

My current application runs  AS 3 with clustering on a 2 node HP DL380G3 
Packaged Cluster(SCSI shared array) in a primary/secondary configuration. 
The data is made available to other  local ES3 machines(4 x DL380G3/4's) 
via NFS. 
These machines run reports as well as run apache for user access.  The
shared array holds numerous log files which are written to constantly. 
Currently we have ~200 files which recieve ~4GB of data daily so the write
 loading is not high. 

One option for upgrading is to : 
 - Upgrade to latest RH with clustering and GFS on all the servers.
 - Have the shared data on a HP iSCSI(MSA2012i) which is 
   accessed by all. We use HP equipment so my options are limited.

I have found the following article comparing iSCSI devices including
 HP2012i on both Windows and Centos 5  : 
http://www.networkworld.com/reviews/2008/072808-test-iscsi-sans.html
HP 2012i results : 
http://www.networkworld.com/reviews/2008/072808-test-iscsi-sans-hp.html

Thanks


David K Livingstone
Edmonton, AB
Email: David.Livingstone at cn.ca 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080821/fe3964f3/attachment.htm>

From lbigum at iseek.com.au  Thu Aug 21 22:18:34 2008
From: lbigum at iseek.com.au (Luke Bigum)
Date: Fri, 22 Aug 2008 08:18:34 +1000
Subject: [Linux-cluster] Common GNBD boot/startup procedures?
In-Reply-To: <3C12B5DA-D6FF-4718-B3AC-A0B48BFD50FA@netspot.com.au>
References: <3C12B5DA-D6FF-4718-B3AC-A0B48BFD50FA@netspot.com.au>
Message-ID: <50A3F7088FE1A14FB0CF57A22487388679BF4F8A44@EXCHANGE1.intranet.iseek.com.au>

I wrote this in a few minutes, pretty dodgy and not fully tested:

$ cat /etc/init.d/gnbd_serv
#!/bin/bash
#
# chkconfig: - 23 77
# description: Starts and stops gnbd server
#
#

#lbigum: crappy script to start GNBD server.

case "$1" in
  start)
        gnbd_serv -n
        gnbd_export -c -e gnbd -d /dev/sda5
        ;;
  stop)
        gnbd_serv -k
        sleep 3
        gnbd_serv -K
        ;;
  *)
        echo $"Usage: $0 {start|stop}"
        ;;
esac

and for a client:

$ cat /etc/init.d/gnbd_client
#!/bin/bash
#
# chkconfig: - 23 77
# description: Starts and stops gnbd client
#
#

case "$1" in
  start)
        modprobe gnbd
        gnbd_import -i hostname.example.com
        ;;
  stop)
        gnbd_import -R
        ;;
    *)
        echo $"Usage: $0 {start|stop}"
        ;;
esac

--
Luke Bigum
Systems Administrator
iseek Communications Pty Ltd
Excellence in business data solutions
ph 1300 661 668 fax 1300 661 540
www.iseek.com.au

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tom Lanyon
Sent: Thursday, 21 August 2008 1:08 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Common GNBD boot/startup procedures?

Hi list,

What are people commonly doing to start gnbd_serv and to export/import
GNBDs on server startup?

Thanks,
Tom

--
Tom Lanyon
Systems Administrator
NetSpot Pty Ltd

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From satoru.satoh at gmail.com  Fri Aug 22 08:10:20 2008
From: satoru.satoh at gmail.com (Satoru SATOH)
Date: Fri, 22 Aug 2008 17:10:20 +0900
Subject: [Linux-cluster] [PATCH][RESEND] Add network interface select option
	for fence_xvmd
Message-ID: <20080822081013.GA15358@localhost.localdomain>

Hello,


I updated my patch for fence_xvmd to add network interface select option
posted before.

This patch fixes the following issues ATST:

 1. fence_xvmd selects wrong network interface to listen on if host has
    multiple interfaces and target interface is not for default route.
    As a result, fence_xvmd does not repond to fence_xvm's request.
 2. fence_xvmd cannot start if default route is not set.

The following patch is for cluster-3 HEAD.

The same problem exists in cluster-2 (rhel5's cluster) and I opened
bugzilla bug for that version: rhbz#459720.


Signed-Off-By: Satoru SATOH <satoru.satoh at gmail.com>

 fence/agents/xvm/fence_xvmd.c |    6 +++---
 fence/agents/xvm/mcast.c      |    9 +++++----
 fence/agents/xvm/mcast.h      |    4 ++--
 fence/agents/xvm/options.c    |   13 +++++++++++++
 fence/agents/xvm/options.h    |    1 +
 fence/man/fence_xvmd.8        |    3 +++
 6 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/fence/agents/xvm/fence_xvmd.c b/fence/agents/xvm/fence_xvmd.c
index 888f24b..2746c23 100644
--- a/fence/agents/xvm/fence_xvmd.c
+++ b/fence/agents/xvm/fence_xvmd.c
@@ -921,7 +921,7 @@ main(int argc, char **argv)
 	unsigned int logmode = 0;
 	char key[MAX_KEY_LEN];
 	int key_len = 0, x;
-	char *my_options = "dfi:a:p:C:U:c:k:u?hLXV";
+	char *my_options = "dfi:a:p:I:C:U:c:k:u?hLXV";
 	cman_handle_t ch = NULL;
 	void *h = NULL;
 
@@ -1031,9 +1031,9 @@ main(int argc, char **argv)
 	}
 
 	if (args.family == PF_INET)
-		mc_sock = ipv4_recv_sk(args.addr, args.port);
+		mc_sock = ipv4_recv_sk(args.addr, args.port, args.ifindex);
 	else
-		mc_sock = ipv6_recv_sk(args.addr, args.port);
+		mc_sock = ipv6_recv_sk(args.addr, args.port, args.ifindex);
 	if (mc_sock < 0) {
 		log_printf(LOG_ERR,
 			   "Could not set up multicast listen socket\n");
diff --git a/fence/agents/xvm/mcast.c b/fence/agents/xvm/mcast.c
index db46328..9f20c89 100644
--- a/fence/agents/xvm/mcast.c
+++ b/fence/agents/xvm/mcast.c
@@ -31,10 +31,10 @@ LOGSYS_DECLARE_SUBSYS ("XVM", SYSLOGLEVEL);
   Sets up a multicast receive socket
  */
 int
-ipv4_recv_sk(char *addr, int port)
+ipv4_recv_sk(char *addr, int port, unsigned int ifindex)
 {
 	int sock;
-	struct ip_mreq mreq;
+	struct ip_mreqn mreq;
 	struct sockaddr_in sin;
 
 	/* Store multicast address */
@@ -74,7 +74,7 @@ ipv4_recv_sk(char *addr, int port)
 	 * Join multicast group
 	 */
 	/* mreq.imr_multiaddr.s_addr is set above */
-	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
+	mreq.imr_ifindex = ifindex;
 	dbg_printf(4, "Joining multicast group\n");
 	if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
 		       &mreq, sizeof(mreq)) == -1) {
@@ -184,7 +184,7 @@ ipv4_send_sk(char *send_addr, char *addr, int port, struct sockaddr *tgt,
   Sets up a multicast receive (ipv6) socket
  */
 int
-ipv6_recv_sk(char *addr, int port)
+ipv6_recv_sk(char *addr, int port, unsigned int ifindex)
 {
 	int sock, val;
 	struct ipv6_mreq mreq;
@@ -203,6 +203,7 @@ ipv6_recv_sk(char *addr, int port)
 	memcpy(&mreq.ipv6mr_multiaddr, &sin.sin6_addr,
 	       sizeof(struct in6_addr));
 
+	mreq.ipv6mr_interface = ifindex;
 
 	/********************************
 	 * SET UP MULTICAST RECV SOCKET *
diff --git a/fence/agents/xvm/mcast.h b/fence/agents/xvm/mcast.h
index 5113f04..08fd6de 100644
--- a/fence/agents/xvm/mcast.h
+++ b/fence/agents/xvm/mcast.h
@@ -4,11 +4,11 @@
 #define IPV4_MCAST_DEFAULT "225.0.0.12"
 #define IPV6_MCAST_DEFAULT "ff05::3:1"
 
-int ipv4_recv_sk(char *addr, int port);
+int ipv4_recv_sk(char *addr, int port, unsigned int ifindex);
 int ipv4_send_sk(char *src_addr, char *addr, int port,
 		 struct sockaddr *src, socklen_t slen,
 		 int ttl);
-int ipv6_recv_sk(char *addr, int port);
+int ipv6_recv_sk(char *addr, int port, unsigned int ifindex);
 int ipv6_send_sk(char *src_addr, char *addr, int port,
 		 struct sockaddr *src, socklen_t slen,
 		 int ttl);
diff --git a/fence/agents/xvm/options.c b/fence/agents/xvm/options.c
index 969ca8d..58b2aec 100644
--- a/fence/agents/xvm/options.c
+++ b/fence/agents/xvm/options.c
@@ -104,6 +104,13 @@ assign_port(fence_xvm_args_t *args, struct arg_info *arg, char *value)
 
 
 static inline void
+assign_interface(fence_xvm_args_t *args, struct arg_info *arg, char *value)
+{
+	args->ifindex = if_nametoindex(value);
+}
+
+
+static inline void
 assign_retrans(fence_xvm_args_t *args, struct arg_info *arg, char *value)
 {
 	args->retr_time = atoi(value);
@@ -307,6 +314,10 @@ static struct arg_info _arg_info[] = {
 	  "IP port (default=1229)",
 	  assign_port },
 
+	{ 'I', "-I <interface>", "multicast_address",
+	  "Network interface name to listen on",
+	  assign_interface },
+
 	{ 'r', "-r <retrans>", "retrans", 
 	  "Multicast retransmit time (in 1/10sec; default=20)",
 	  assign_retrans },
@@ -416,6 +427,7 @@ args_init(fence_xvm_args_t *args)
 	args->hash = DEFAULT_HASH;
 	args->auth = DEFAULT_AUTH;
 	args->port = 1229;
+	args->ifindex = 0;
 	args->family = PF_INET;
 	args->timeout = 30;
 	args->retr_time = 20;
@@ -445,6 +457,7 @@ args_print(fence_xvm_args_t *args)
 	_pr_int(args->hash);
 	_pr_int(args->auth);
 	_pr_int(args->port);
+	_pr_int(args->ifindex);
 	_pr_int(args->family);
 	_pr_int(args->timeout);
 	_pr_int(args->retr_time);
diff --git a/fence/agents/xvm/options.h b/fence/agents/xvm/options.h
index 7a2dcca..07f99da 100644
--- a/fence/agents/xvm/options.h
+++ b/fence/agents/xvm/options.h
@@ -23,6 +23,7 @@ typedef struct {
 	fence_hash_t hash;
 	fence_auth_type_t auth;
 	int port;
+	unsigned int ifindex;
 	int family;
 	int timeout;
 	int retr_time;
diff --git a/fence/man/fence_xvmd.8 b/fence/man/fence_xvmd.8
index 5a47211..12af607 100644
--- a/fence/man/fence_xvmd.8
+++ b/fence/man/fence_xvmd.8
@@ -39,6 +39,9 @@ for ipv6)
 \fB-p\fP \fIport\fP
 Port to use (default=1229)
 .TP
+\fB-I\fP \fIinterface\fP
+Network interface to listen on, e.g. eth0.
+.TP
 \fB-C\fP \fIauth\fP
 Authentication type (none, sha1, sha256, sha512; default=sha256).  This
 controls the authentication mechanism used to authenticate clients.  The
-- 
1.5.6.4



From linux at vfemail.net  Fri Aug 22 08:48:12 2008
From: linux at vfemail.net (Alex)
Date: Fri, 22 Aug 2008 11:48:12 +0300
Subject: [Linux-cluster] conga bug or my mistake?
In-Reply-To: <200808181711.39127.linux@vfemail.net>
References: <200808181327.18655.linux@vfemail.net>
	<c3c0440e0808180605n2418844boc0e1bf998746ee82@mail.gmail.com>
	<200808181711.39127.linux@vfemail.net>
Message-ID: <200808221148.12385.linux@vfemail.net>

On Monday 18 August 2008 17:11, Alex wrote:
> On Monday 18 August 2008 16:05, Grisha G. wrote:
> > Show us the logs from your nodes
>
> It seems really a bug. I found another post with the same problem here:
>
> http://www.mail-archive.com/linux-cluster at redhat.com/msg03911.html
>

Indeed, is a bug. See here:

https://bugzilla.redhat.com/show_bug.cgi?id=459469

They say that it has been fixed in cvs. Can somebody indicate where i can 
access conga CVS in order to see modifications and/or location of the 
proposed patch/diff?

Regards,
Alx


> Can somebody tell me how to fix?
>
> Regards,
> Alx
>
> > On Mon, Aug 18, 2008 at 1:27 PM, Alex <linux at vfemail.net> wrote:
> > > Hello all,
> > >
> > > My current setup si similar with one described here:
> > > http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt
> > > excepting the fact that i'm having 3 clients and 3 gnbd servers
> > > (exporting block devices using gnbd).
> > >
> > > our gnbd servers have the following IPs: 192.168.113.6 and
> > > 192.168.113.7 our gnbd clients have the following IPs: 192.168.113.3
> > > and
> > > 192.168.113.4and 192.168.113.5
> > >
> > > On our management machine (other then above gnbd clients and servers)
> > > is running:
> > > [root at rhclm ~]# rpm -q luci
> > > luci-0.12.0-7.el5.centos.3
> > > [root at rhclm ~]#
> > >
> > > On our gnbd clients is running:
> > > [root at rs1 ~]# rpm -q ricci
> > > ricci-0.12.0-7.el5.centos.3
> > > [root at rs1 ~]#
> > >
> > > Now, i'm trying to do the following operations using conga:
> > > Cluster -> Shared Fence Devices -> Add Fence Device
> > >
> > > added successfully:
> > >
> > > Fence Type: GNBD
> > > Name: gnbd_from_shds
> > > Servers: 192.168.113.6 192.168.113.7
> > >
> > > This will add in our cluster.conf:
> > > <fencedevices>
> > >        <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> > > servers="192.168.113.6 192.168.113.7"/>
> > > </fencedevices>
> > >
> > > Let's try to use it: Cluster -> Nodes hit on 192.168.113.3 and select
> > > option
> > > "Manage Fencing for this Node" -> "Main Fencing Method" -> "Add a fence
> > > device to this level" -> select gnbd_from_shds ->  and hit "Update main
> > > fence
> > > properties"
> > >
> > > Is not working, all the time i'm getting a javascript window error
> > > saying the
> > > following:
> > >
> > > [snip]
> > > The following errors were found:
> > > An unknown device type was given: "gnbd."
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > [end snip]
> > >
> > > You see, is a dot after "gnbd" which i suppose it causing that error.
> > >
> > > How can be fixed this error?
> > >
> > > Now, I edited manually our cluster.conf as following:
> > > <clusternode name="192.168.113.3" nodeid="3" votes="1">
> > >        <fence>
> > >                <method name="1">
> > >                        <device name="gnbd_from_shds"
> > > nodename="192.168.113.3"/>
> > >                </method>
> > >        </fence>
> > > </clusternode>
> > >
> > > First Question: In docs, i cannot find any explanation about
> > > name="value" in
> > > <method> tag. As you see, value is "1": <method name="1">. Is this
> > > value valid only inside of <clusternode> section or has global
> > > semnification in cluster.conf? Can i name it for example "one" or
> > > "first_fence_method_for_this_node"?
> > >
> > > and run:
> > > [root at rs1 ~]# ccs_tool update /etc/cluster/cluster.conf
> > > Config file updated from version 28 to 29
> > >
> > > Update complete.
> > > [root at rs1 ~]#
> > >
> > > Now, i can see using conga in "Shared Fence Devices" section:
> > >
> > > Shared Fence Devices for Cluster: httpcluster
> > > Agent type: Global Network Block Device
> > > Name: gnbd_from_shds
> > > Nodes using this device for fencing: 192.168.113.3
> > >
> > > but, if i'm hitting 192.168.113.3 link, i'll get other error:
> > >
> > > Site error
> > >
> > > This site encountered an error trying to fulfill your request. The
> > > errors were:
> > >
> > > Error Type
> > >    KeyError
> > > Error Value
> > >    'fence-instance-form-gnbd'
> > > Request made at
> > >    2008/08/18 12:42:45.164 GMT+3
> > >
> > > Any ideas how to fix it? Is my mistake or is a bug in conga?
> > >
> > > Second Question: Is correct to add and use for the rest of our client
> > > nodes below sintax?
> > >
> > > For: 192.168.113.4 and 192.168.113.5 client nodes:
> > >
> > > <clusternode name="192.168.113.4" nodeid="2" votes="1">
> > >        <fence>
> > >                <method name="1">
> > >                        <device name="gnbd_from_shds"
> > > nodename="192.168.113.4"/>
> > >                </method>
> > >        </fence>
> > > </clusternode>
> > >
> > > and
> > >
> > > <clusternode name="192.168.113.5" nodeid="1" votes="1">
> > >        <fence>
> > >                <method name="1">
> > >                        <device name="gnbd_from_shds"
> > > nodename="192.168.113.5"/>
> > >                </method>
> > >        </fence>
> > > </clusternode>
> > >
> > > For conformity, i am posting below my present cluster.conf file:
> > >
> > > <?xml version="1.0"?>
> > > <cluster alias="httpcluster" config_version="29" name="httpcluster">
> > >        <fence_daemon clean_start="0" post_fail_delay="0"
> > > post_join_delay="3"/>
> > >        <clusternodes>
> > >                <clusternode name="192.168.113.5" nodeid="1" votes="1">
> > >                        <fence/>
> > >                </clusternode>
> > >                <clusternode name="192.168.113.4" nodeid="2" votes="1">
> > >                        <fence/>
> > >                </clusternode>
> > >                <clusternode name="192.168.113.3" nodeid="3" votes="1">
> > >                        <fence>
> > >                                <method name="1">
> > >                                        <device name="gnbd_from_shds"
> > > nodename="192.168.113.3"/>
> > >                                </method>
> > >                        </fence>
> > >                </clusternode>
> > >                <clusternode name="192.168.113.6" nodeid="4" votes="1">
> > >                        <fence/>
> > >                </clusternode>
> > >                <clusternode name="192.168.113.7" nodeid="5" votes="1">
> > >                        <fence/>
> > >                </clusternode>
> > >        </clusternodes>
> > >        <cman/>
> > >        <fencedevices>
> > >                <fencedevice agent="fence_gnbd" name="gnbd_from_shds"
> > > servers="192.168.113.6 192.168.113.7"/>
> > >        </fencedevices>
> > >        <rm>
> > >                <failoverdomains/>
> > >                <resources/>
> > >        </rm>
> > > </cluster>
> > >
> > > Regards,
> > > Alx
> > >
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From fdinitto at redhat.com  Fri Aug 22 06:46:08 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 22 Aug 2008 08:46:08 +0200 (CEST)
Subject: [Linux-cluster] Cluster Infrastructure IRC meeting - Monday 25th of
	Aug 2pm UTC/GMT
Message-ID: <Pine.LNX.4.64.0808220716210.16551@trider-g7>


Hi everybody,

When  : Monday 25th of Aug 2pm UTC/GMT (*)(**)
Where : irc.freenode.net #linux-cluster
Who   : everybody interested is invited to participate
Agenda: http://sources.redhat.com/cluster/wiki/Meetings/2008-Aug-25

This is our first IRC meeting and it is supposed to last approx. 1 hour.
Be prepared that it might last longer.

In order to give more space to open discussion please prepare yourself:

- everybody can add topics to the Agenda in the "Voice to the community"
   section.
- prepare a 5/10 lines summary for your topic before the meeting. Typing
   realtime takes too long.

- How it should work -

For each item on the agenda:

- fabbione (me on IRC) will make a call on the person that should talk
   about the item.
- The intersted person will copy/paste her/his own summary.
- Discussion/question on the specific item. If the discussion takes longer
   than 5 minutes, we will postpone it at the end of the meeting. This is
   to allow everybody to cast her/his own voice in the agenda. The meeting
   moderator will call on the time.

Some golden rules:

- please respect the person that is talking and give her/him time to
   complete what she/he is saying. Communication on IRC can be slow.
- Avoid to interrupt the discussion with Off-Topic items. We don't
   want to use moderation powers if we can avoid it.
- if, for any reason, your internet connection drops during the meeting,
   please message me and I will copy paste the last bits of the
   conversation.

After the meeting:

- please update the wiki for Actions (in case there are any to take)
- mail me with comments/suggestions on how was the meeting. It is
   perfectly ok to say "it's really bad" but it would also be nice to know
   if it was a good experience.

Fabio

(*) To convert from UTC/GMT to your local time:
http://www.timeanddate.com/worldclock/converter.html
http://www.timezoneconverter.com/cgi-bin/tzc.tzc

(**) some tools report UTC, others GMT.
For our simple usage of time, they can be considered the same. 
http://en.wikipedia.org/wiki/Coordinated_Universal_Time for the curious 
people out there.

--
I'm going to make him an offer he can't refuse.



From wferi at niif.hu  Fri Aug 22 17:12:49 2008
From: wferi at niif.hu (Ferenc Wagner)
Date: Fri, 22 Aug 2008 19:12:49 +0200
Subject: [Linux-cluster] gfs over raid/lvm or any other option?
In-Reply-To: <200808121326.44254.linux@vfemail.net> (linux@vfemail.net's
	message of "Tue, 12 Aug 2008 13:26:44 +0300")
References: <200808121326.44254.linux@vfemail.net>
Message-ID: <87sksxht2m.fsf@tac.ki.iif.hu>

Alex <linux at vfemail.net> writes:

> If on one machine i have 8 block devices (/deb/sda, /dev/sdb, ... /dev/sdg, 
> dev/sdh) imported via iscsi from 8 different computers (computerX) can i 
> group volumes two by two using raid1 and after that to join resulted /dev/md0 
> up to md3 in one logical volume and run GFS on top?

You can aggregate iSCSI devices by mdadm only on one node at a time.
That is, the Linux software RAID is not cluster aware.  You can use it
in HA, though.  For what you want, check out ddraid or distributed
file systems like Coda or Lustre.
-- 
Regards,
Feri.



From stephenamadei at hotmail.com  Fri Aug 22 22:57:49 2008
From: stephenamadei at hotmail.com (Stephen Amadei)
Date: Fri, 22 Aug 2008 18:57:49 -0400
Subject: [Linux-cluster] GFS2 stops working... can't umount
Message-ID: <BAY110-W10C3583331004E8F97AECABB6A0@phx.gbl>


 
Hello.
 
I am having a reoccuring problem with my experimental GFS2 setup.  It's running cluster-2.03.06, openair-0.80.3 and the kernel is 2.6.26 with the DRBD and GRSecurity patches.  This is on Slackware 12.1.
 
After about 10 days of having the GFS2 filesystem mounted, I tried to do a simple 'ls' on one of the directories, and the process dies with a D (in ps ax).
 
Of course, I cannot unmount the GFS2 partition due to the hanging processes, so I'll need a reboot to fix the situation, which I cannot do until later tonight.
 
The last time this happened, I had to reboot, run fsck on the partition, and shortly thereafter, I was remounted, and happy. 
 
Keep in mind this is not a partition with heavy usage.  In fact, it is practically a read-only partition, to be used to synchronize the htdocs on web server backends.
 
There was nothing in the logs.  I looked at upgrading to 2.03.07, but the changelog doesn't seem to address any freezing problems.
 
Is this a known problem?
 
Since the data on this partition is not particularly important, I could reformat and start over... is it possible the fsck is not fixing the partition good enough?
 
Thanks in advance.
 
Stephen
 
_________________________________________________________________
Get ideas on sharing photos from people like you.  Find new ways to share.
http://www.windowslive.com/explore/photogallery/posts?ocid=TXT_TAGLM_WL_Photo_Gallery_082008
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080822/abf1c9ac/attachment.htm>

From gregory at steulet.org  Sat Aug 23 08:34:45 2008
From: gregory at steulet.org (gregory steulet)
Date: Sat, 23 Aug 2008 10:34:45 +0200
Subject: [Linux-cluster] Trying to compile cluster-2.03.07
Message-ID: <1219480485-9db09a382c13844d21541f17abca7d60@steulet.org>

Hi folks,

I'm a bit fed up with this project of red hat cluster. At first I tried to compile cluster-2.03.07 on a 64 bits system... well I forgot it, it looks not possible. Now I switched on a centos 32 bits, other errors occured when I try to compile cluster-2.03.07

My configuration is the following : 
- centos 5.2 32 bits, last patches
- kernel 2.6.26.3
-openais-080.3

I used the following procedure to compile cluster-2.03.07

configure --kernel_src=/usr/src/linux-2.6.26.3
make

There I got lot of errors regarding openais (cman/deamon/deamon.c:21:34: error: openais/service/swab.h: no such file or directory, deamon.c:22:35: error: openais/totem/aispool.h: nos such file or directory)

However I had a look on http://sources.redhat.com/cluster/wiki/ and my set up looks good... How do I have to proceed ???

Greg

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080823/0fb4a80e/attachment.htm>

From satoru.satoh at gmail.com  Mon Aug 25 04:44:22 2008
From: satoru.satoh at gmail.com (Satoru SATOH)
Date: Mon, 25 Aug 2008 13:44:22 +0900
Subject: [Linux-cluster] Trying to compile cluster-2.03.07
In-Reply-To: <1219480485-9db09a382c13844d21541f17abca7d60@steulet.org>
References: <1219480485-9db09a382c13844d21541f17abca7d60@steulet.org>
Message-ID: <20080825044419.GA5313@localhost.localdomain>

Hi,


Are you sure openais-devel is installed?

In rhel-5, openais-devel owns that header file. I'm not sure same goes
for centos but same or similar named package should be required, I
guess.

[root at cluster-1@foo ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
[root at cluster-1@foo ~]# rpm -qf $(locate openais/service/swab.h)
openais-devel-0.80.3-15.el5


- satoru

On Sat, Aug 23, 2008 at 10:34:45AM +0200, gregory steulet wrote:
> Hi folks,
> 
> I'm a bit fed up with this project of red hat cluster. At first I tried to
> compile cluster-2.03.07 on a 64 bits system... well I forgot it, it looks not
> possible. Now I switched on a centos 32 bits, other errors occured when I try
> to compile cluster-2.03.07
> 
> My configuration is the following :
> - centos 5.2 32 bits, last patches
> - kernel 2.6.26.3
> -openais-080.3
> 
> I used the following procedure to compile cluster-2.03.07
> 
> configure --kernel_src=/usr/src/linux-2.6.26.3
> make
> 
> There I got lot of errors regarding openais (cman/deamon/deamon.c:21:34: error:
> openais/service/swab.h: no such file or directory, deamon.c:22:35: error:
> openais/totem/aispool.h: nos such file or directory)
> 
> However I had a look on http://sources.redhat.com/cluster/wiki/ and my set up
> looks good... How do I have to proceed ???
> 
> Greg



From ajeet.singh.raina at logica.com  Mon Aug 25 06:10:15 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Mon, 25 Aug 2008 11:40:15 +0530
Subject: [Linux-cluster] PriQuorum Partition?
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17AAF@in-ex004.groupinfra.com>

I want to do partition for pri-quorum disk on MSA 1500 Shared Storage
for Red Hat Clustering.
I don't have mkqdisk package for the same.
All I want is By using fdisk.

How can it be possible?


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080825/cbe2d960/attachment.htm>

From tom at netspot.com.au  Mon Aug 25 06:19:49 2008
From: tom at netspot.com.au (Tom Lanyon)
Date: Mon, 25 Aug 2008 15:49:49 +0930
Subject: [Linux-cluster] Virtual service using CLVM not migrating
Message-ID: <2B9A98D7-A9C6-4210-B1B6-9B6306E21362@netspot.com.au>

Hi list,

(let me know if this should be on the xen list, but I think it's an  
issue with clvm locking a logical volume)

I have a three node RHEL5 cluster running some virtual machines. The  
virtual machines use a LVM LV as their root which is available cluster- 
wide via clvmd.

Live migration between cluster nodes seems to work well when running  
one-vm-per-node exclusively, but fails when a node is running more  
than one virtual machine.

I can migrate my two VMs, "nodea" and "nodeb", onto the same physical  
node and they run fine:

# xm list
Name                                      ID Mem(MiB) VCPUs State    
Time(s)
Domain-0                                0     4120     4 r-----   3398.3
nodea                                      9     5999     1 - 
b----      0.3
nodeb                                      4     5999     1 -b----     
265.9


However, when I try to migrate one of these VMs *away* from this  
physical node to another cluster member (using clusvcadm -M), it  
performs the state transfer and then I get a nasty error on the VMs  
console and I end up with a broken virtual machine on both physical  
nodes:

WARNING: g.e. still in use!
WARNING: leaking g.e. and page still in use!
WARNING: g.e. still in use!
WARNING: leaking g.e. and page still in use!
netif_release_rx_bufs: 0 xfer, 62 noxfer, 194 unused
WARNING: g.e. still in use!
WARNING: leaking g.e. and page still in use!


Sorry for the large email, but I'll also include the xend log on the  
source physical server showing the failure. You can see that device  
51712 is 'still active' while trying to migrate and that device 51712  
is the LV block device; I assume this means it is having trouble  
removing a CLVM lock?


[2008-08-24 00:27:43 xend 5252] DEBUG (balloon:127) Balloon: 26652 KiB  
free; need 25600; done.
[2008-08-24 00:27:43 xend 5252] DEBUG (XendCheckpoint:89) [xc_save]: / 
usr/lib64/xen/bin/xc_save 22 9 0 0 1
[2008-08-24 00:27:43 xend 5252] INFO (XendCheckpoint:351) ERROR  
Internal error: Couldn't enable shadow mode
[2008-08-24 00:27:43 xend 5252] INFO (XendCheckpoint:351) Save exit rc=1
[2008-08-24 00:27:43 xend 5252] ERROR (XendCheckpoint:133) Save failed  
on domain nodea (9).
Traceback (most recent call last):
   File "/usr/lib64/python2.4/site-packages/xen/xend/ 
XendCheckpoint.py", line 110, in save
     forkHelper(cmd, fd, saveInputHandler, False)
   File "/usr/lib64/python2.4/site-packages/xen/xend/ 
XendCheckpoint.py", line 339, in forkHelper
     raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib64/xen/bin/xc_save 22 9 0 0 1 failed
[2008-08-24 00:27:43 xend.XendDomainInfo 5252] DEBUG (XendDomainInfo: 
1601) XendDomainInfo.resumeDomain(9)
[2008-08-24 00:27:43 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] DEBUG (XendDomainInfo: 
1614) XendDomainInfo.resumeDomain: devices released
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] DEBUG (XendDomainInfo: 
791) Storing domain details: {'console/ring-ref': '2057005', 'console/ 
port': '2', 'name': 'migrating-nodea', 'console/limit': '1048576',
  'vm': '/vm/b845f914-33a3-e1cf-551e-01b6d346b92b', 'domid': '9', 'cpu/ 
0/availability': 'online', 'memory/target': '6144000', 'store/ring- 
ref': '2049294', 'store/port': '1'}
[2008-08-24 00:27:44 xend 5252] DEBUG (DevController:110)  
DevController: writing {'backend-id': '0', 'mac': '00:16:3e:6c:ae:9f',  
'handle': '0', 'state': '1', 'backend': '/local/domain/0/backend/vif/ 
9/0'} t
o /local/domain/9/device/vif/0.
[2008-08-24 00:27:44 xend 5252] DEBUG (DevController:112)  
DevController: writing {'bridge': 'br102', 'domain': 'migrating- 
nodea', 'handle': '0', 'script': '/etc/xen/scripts/vif-bridge',  
'state': '1', 'fron
tend': '/local/domain/9/device/vif/0', 'mac': '00:16:3e:6c:ae:9f',  
'online': '1', 'frontend-id': '9'} to /local/domain/0/backend/vif/9/0.
[2008-08-24 00:27:44 xend 5252] DEBUG (blkif:24) exception looking up  
device number for xvda: [Errno 2] No such file or directory: '/dev/xvda'
[2008-08-24 00:27:44 xend 5252] DEBUG (DevController:110)  
DevController: writing {'backend-id': '0', 'virtual-device': '51712',  
'device-type': 'disk', 'state': '1', 'backend': '/local/domain/0/ 
backend/vbd/
9/51712'} to /local/domain/9/device/vbd/51712.
[2008-08-24 00:27:44 xend 5252] DEBUG (DevController:112)  
DevController: writing {'domain': 'migrating-nodea', 'frontend': '/ 
local/domain/9/device/vbd/51712', 'format': 'raw', 'dev': 'xvda',  
'state': '1',
'params': '/dev/int_vg/os_nodea', 'mode': 'w', 'online': '1',  
'frontend-id': '9', 'type': 'phy'} to /local/domain/0/backend/vbd/ 
9/51712.
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] DEBUG (XendDomainInfo: 
1626) XendDomainInfo.resumeDomain: devices created
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] ERROR (XendDomainInfo: 
1631) XendDomainInfo.resume: xc.domain_resume failed on domain 9.
Traceback (most recent call last):
   File "/usr/lib64/python2.4/site-packages/xen/xend/ 
XendDomainInfo.py", line 1628, in resumeDomain
     xc.domain_resume(self.domid, fast)
Error: (1, 'Internal error', "Couldn't map start_info")
[2008-08-24 00:27:44 xend 5252] DEBUG (XendCheckpoint:136)  
XendCheckpoint.save: resumeDomain
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:44 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
[2008-08-24 00:27:45 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1722) Dev 51712 still active, looping...
-------many repeats-------
[2008-08-24 00:28:14 xend.XendDomainInfo 5252] INFO (XendDomainInfo: 
1728) Dev still active but hit max loop timeout




From linux at vfemail.net  Mon Aug 25 09:23:52 2008
From: linux at vfemail.net (Alex)
Date: Mon, 25 Aug 2008 12:23:52 +0300
Subject: [Linux-cluster] gfs over raid/lvm or any other option?
In-Reply-To: <87sksxht2m.fsf@tac.ki.iif.hu>
References: <200808121326.44254.linux@vfemail.net>
	<87sksxht2m.fsf@tac.ki.iif.hu>
Message-ID: <200808251223.52561.linux@vfemail.net>

On Friday 22 August 2008 20:12, Ferenc Wagner wrote:
> Alex <linux at vfemail.net> writes:
> > If on one machine i have 8 block devices (/deb/sda, /dev/sdb, ...
> > /dev/sdg, dev/sdh) imported via iscsi from 8 different computers
> > (computerX) can i group volumes two by two using raid1 and after that to
> > join resulted /dev/md0 up to md3 in one logical volume and run GFS on
> > top?
>
> You can aggregate iSCSI devices by mdadm only on one node at a time.
> That is, the Linux software RAID is not cluster aware.  You can use it
> in HA, though.  For what you want, check out ddraid or distributed
> file systems like Coda or Lustre.

Dear Ferenc,

Many thanks for your reply. Regarding $subj, i know about ddraid. I found it 
before my post. Unfortunatelly, this project is still in alpha (not 
production quality) and seems to be abandoned or freezed ... 

Coda is looking like a dead project, it is moving very slowly (unpredictable 
development on stable kernel-2.6 and probably few commits per year).

Finally, i agree with you: lustre is the solution...

Regards,
Alx



From raju.rajsand at gmail.com  Mon Aug 25 09:51:17 2008
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Mon, 25 Aug 2008 09:51:17 +0000
Subject: [Linux-cluster] GFS with iSCSI and HP 2012i
In-Reply-To: <OF606D0B42.27DE7266-ON872574AB.007AA52B-872574AC.00583C33@cn.ca>
References: <OF606D0B42.27DE7266-ON872574AB.007AA52B-872574AC.00583C33@cn.ca>
Message-ID: <8786b91c0808250251m4c3451d6jc8bf449a93212c9f@mail.gmail.com>

Greetings,

2008/8/21 <David.Livingstone at cn.ca>

>
>  I'm looking for feedback on two items :
>
> 1. Experience with GFS  using  iSCSI
>

I have used GFS over iSCSI on an HP AIO box once under test conditions and
did not have any issues.

Pls note it was for few hours and NOT repeat NOT production environment.

So can't comment on multipathing, performance etc, etc,

Regards

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080825/046d43a6/attachment.htm>

From raju.rajsand at gmail.com  Mon Aug 25 10:06:58 2008
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Mon, 25 Aug 2008 10:06:58 +0000
Subject: [Linux-cluster] Want to Setup NFS with redhat Cluster??
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B17A8C@in-ex004.groupinfra.com>
References: <0139539A634FD04A99C9B8880AB70CB209B17A8C@in-ex004.groupinfra.com>
Message-ID: <8786b91c0808250306u5bd41c7bjcca2772feab48cdf@mail.gmail.com>

Greetings,


2008/8/20 Singh Raina, Ajeet <ajeet.singh.raina at logica.com>

>  Can anyone Help me in setting up NFS with Red Hat Cluster?
>
> I still don't understand what is NFS Mount, NFS Client and NFS Exports?
>

OK Let us say you have 3 machines A, B, C.

A, B are Cluster nodes.

C is a machine which has to make available the data on its local hard disks
to who ever wants it using NFS protocol.
This data is available in file format which are stored under directories say
/XXX, /YYY, /ZZZ

Now our friends A & B -- the cluster members -- wants to access data under
/XXX located in C. They are not interested in /YYY and /ZZZ

Now let us bring another player in this love triangle -- a laptop called as
GADDI which wants the /ZZZ available on the NFS server C.


In this Scenario, A and B are NFS clients.

Our GADDI too is an NFS client.

C is the NFS Server. /XXX, /YYY and /ZZZ are NFS exports.

When A and/or B mounts C:/XXX on a local mount point, say /NFSXXX, it is
called and NFS mount

Now the /YYY as yet unused. (like scripts in the whole bollywood movie....)

:-)


Having explained that can you please rephrase your movie.. err.. scenario
again please using the above analogies
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080825/c320a73d/attachment.htm>

From ajeet.singh.raina at logica.com  Mon Aug 25 10:52:43 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Mon, 25 Aug 2008 16:22:43 +0530
Subject: [Linux-cluster] Want to Setup NFS with redhat Cluster??
In-Reply-To: <8786b91c0808250306u5bd41c7bjcca2772feab48cdf@mail.gmail.com>
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17AB6@in-ex004.groupinfra.com>

Thanks for the Entire Movie. But Few doubts on mind and want your
suggestions.

I have two nodes A and B That's fine.
C is not a cluster node.

As per the project requirement C has to be always be NFS mounted to A.
When A's application goes down, NFS service wont be able to failover as
lots of users are still mounted and accessing from node c to node A
through NFS.I could see there is one command called umount -a but don't
find it efficient.

How can I achieve that?


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal
Swaminathan
Sent: Monday, August 25, 2008 3:37 PM
To: linux clustering
Subject: Re: [Linux-cluster] Want to Setup NFS with redhat Cluster??

Greetings,



2008/8/20 Singh Raina, Ajeet <ajeet.singh.raina at logica.com>


	Can anyone Help me in setting up NFS with Red Hat Cluster?

	I still don't understand what is NFS Mount, NFS Client and NFS
Exports?


OK Let us say you have 3 machines A, B, C.

A, B are Cluster nodes.

C is a machine which has to make available the data on its local hard
disks to who ever wants it using NFS protocol. 
This data is available in file format which are stored under directories
say /XXX, /YYY, /ZZZ

Now our friends A & B -- the cluster members -- wants to access data
under /XXX located in C. They are not interested in /YYY and /ZZZ

Now let us bring another player in this love triangle -- a laptop called
as GADDI which wants the /ZZZ available on the NFS server C.


In this Scenario, A and B are NFS clients.

Our GADDI too is an NFS client.

C is the NFS Server. /XXX, /YYY and /ZZZ are NFS exports.

When A and/or B mounts C:/XXX on a local mount point, say /NFSXXX, it is
called and NFS mount

Now the /YYY as yet unused. (like scripts in the whole bollywood
movie....)

:-)


Having explained that can you please rephrase your movie.. err..
scenario again please using the above analogies


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.





From anujhere at gmail.com  Mon Aug 25 15:26:07 2008
From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)_?=)
Date: Mon, 25 Aug 2008 11:26:07 -0400
Subject: [Linux-cluster] Want to Setup NFS with redhat Cluster??
In-Reply-To: <0139539A634FD04A99C9B8880AB70CB209B17AB6@in-ex004.groupinfra.com>
References: <8786b91c0808250306u5bd41c7bjcca2772feab48cdf@mail.gmail.com>
	<0139539A634FD04A99C9B8880AB70CB209B17AB6@in-ex004.groupinfra.com>
Message-ID: <3120c9e30808250826g1329df55l161af2e0d19c21d8@mail.gmail.com>

AIUI your target is to use A & B (Cluster nodes) to provide high
availability of nfs service to your client machine C.
a) If node A goes down B takes over the nfs service, provided to client C.
b) If node B goes down A takes over the nfs service. provided to client C.
c) At any particular time only A node or B node providing service to C.
c) Client C, should see the same data independent of A and B, whether A is
providing nfs export or B.

Above thing can be achieved easily.
a) Get the DRBD (distributed redundant block device) and replicate data
between A node, B node. RAID1
b) You need to create a virtual IP which floats between A node and B node
along with startup of nfs service.
c) You client C will access the floating virtual IP.
above three things can be achieved by linux-heartbeat as well as RHCS, with
RHCS you only have to manage DRBD primary. which can be handled with some
small script.

If in doubt ask how to.

Anuj Singh.


On Mon, Aug 25, 2008 at 6:52 AM, Singh Raina, Ajeet <
ajeet.singh.raina at logica.com> wrote:

> Thanks for the Entire Movie. But Few doubts on mind and want your
> suggestions.
>
> I have two nodes A and B That's fine.
> C is not a cluster node.
>
> As per the project requirement C has to be always be NFS mounted to A.
> When A's application goes down, NFS service wont be able to failover as
> lots of users are still mounted and accessing from node c to node A
> through NFS.I could see there is one command called umount -a but don't
> find it efficient.
>
> How can I achieve that?
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal
> Swaminathan
> Sent: Monday, August 25, 2008 3:37 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Want to Setup NFS with redhat Cluster??
>
> Greetings,
>
>
>
> 2008/8/20 Singh Raina, Ajeet <ajeet.singh.raina at logica.com>
>
>
>        Can anyone Help me in setting up NFS with Red Hat Cluster?
>
>        I still don't understand what is NFS Mount, NFS Client and NFS
> Exports?
>
>
> OK Let us say you have 3 machines A, B, C.
>
> A, B are Cluster nodes.
>
> C is a machine which has to make available the data on its local hard
> disks to who ever wants it using NFS protocol.
> This data is available in file format which are stored under directories
> say /XXX, /YYY, /ZZZ
>
> Now our friends A & B -- the cluster members -- wants to access data
> under /XXX located in C. They are not interested in /YYY and /ZZZ
>
> Now let us bring another player in this love triangle -- a laptop called
> as GADDI which wants the /ZZZ available on the NFS server C.
>
>
> In this Scenario, A and B are NFS clients.
>
> Our GADDI too is an NFS client.
>
> C is the NFS Server. /XXX, /YYY and /ZZZ are NFS exports.
>
> When A and/or B mounts C:/XXX on a local mount point, say /NFSXXX, it is
> called and NFS mount
>
> Now the /YYY as yet unused. (like scripts in the whole bollywood
> movie....)
>
> :-)
>
>
> Having explained that can you please rephrase your movie.. err..
> scenario again please using the above analogies
>
>
> This e-mail and any attachment is for authorised use by the intended
> recipient(s) only. It may contain proprietary material, confidential
> information and/or be subject to legal privilege. It should not be copied,
> disclosed to, retained or used by, any other party. If you are not an
> intended recipient then please promptly delete this e-mail and any
> attachment and all copies and inform the sender. Thank you.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080825/fc38375f/attachment.htm>

From michael.osullivan at auckland.ac.nz  Mon Aug 25 17:30:03 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Tue, 26 Aug 2008 05:30:03 +1200 (NZST)
Subject: [Linux-cluster] gfs over raid/lvm or any other option?
Message-ID: <20298.128.187.190.8.1219685403.squirrel@mail.esc.auckland.ac.nz>

There are two approaches I have seen that may be suitable:

1) lustre - I didn't like this as it needed two "special" meta-servers and
I was building a smaller storage system;
2) pvfs

I did not use either of these approaches as they focus on keeping the
storage system running, rather than keeping the data highly available.

For my test storage I wanted to build a system that would still present
the stored data even if a single point in the network fails.

I have used iSCSI, mdadm and GFS as follows. I have two storage servers
with alomst 2TB of disk space for storage each. Both of these two servers
present a single logical volume to a 2-node cluster using iSCSI. There are
2 NICs on each storage server, so each volume is accessible via two ports.
There are 2 NICs on each cluster node also. The storage system was
connected to the cluster using some ethernet switches. Using mdadm I have
successfully multipathed each logical volume and then using mdadm again I
have built a RAID-5 device from these two volumes. The raid device is
successfully detected by each cluster node. On this raid device I created
a logical volume using clvm and on that logical volume I built a GFS to
control cluster access to the storage. The GFS has been successfully
mounted on both cluster nodes.

Despite some problems with the cluster (due to my own limited knowledge
about clusters and fencing) I have successfully created and accessed files
on the GFS from both cluster nodes. I am in the process of sorting out the
clustering problems and testing the configuration using IOMeter.

Hope this helps, Mike



From gregory at steulet.org  Mon Aug 25 19:34:52 2008
From: gregory at steulet.org (gregory steulet)
Date: Mon, 25 Aug 2008 21:34:52 +0200
Subject: [Linux-cluster] Trying to compile cluster-2.03.07
In-Reply-To: <20080825044419.GA5313@localhost.localdomain>
Message-ID: <1219692892-deaa494fda200cd7171085477f59ded3@steulet.org>

Hello, 

At first thanks for your answer.

In fact I used the following procedure to install all my components :

 

#get the latest stable kernel realease
[root at localhost ~]# cd /usr/src
[root at localhost src]#wget http://www.eu.kernel.org/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.gz
[root at localhost src]# tar -xvzf linux-2.6.26.3.tar.gz
[root at localhost src]# cd linux-2.6.26.3
[root at localhost src]# cp /boot/configurexxxxxx /usr/src/linux-2.6.26.3/.configure

#configure you kernel
[root at localhost linux-2.6.26.3]# make menuconfig

#get & apply the lockproto patch
[root at localhost linux-2.6.26.3]# wget ftp://sources.redhat.com/pub/cluster/releases/lockproto-exports.patch
[root at localhost linux-2.6.26.3]# patch -p1 < lockproto-exports.patch
patching file fs/gfs2/locking.c

#compiling kernel
[root at localhost linux-2.6.26.3]# make bzImage
[root at localhost linux-2.6.26.3]# make modules
[root at localhost linux-2.6.26.3]# make modules_install
[root at localhost linux-2.6.26.3]# make install
[root at localhost linux-2.6.26.3]# reboot

#installing openais
[root at localhost openais]# tar -xvzf openais-0.80.3.tar.gz
[root at localhost openais-0.80.3]# make
[root at localhost openais-0.80.3]# make install

#trying to install cluster

[root at localhost cluster]# wget ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.07.tar.gz
[root at localhost cluster]# tar -xvzf cluster-2.03.07.tar.gz
[root at localhost src]# cd cluster-2.03.07
[root at localhost cluster-2.03.07]# ./configure kernel_src=/usr/src/linux-2.6.26.3
[root at localhost cluster-2.03.07]# make install

 

In fact I just would like a reliable procedure to install all those components without having problems.
Just a simple procedure with all requirements and commands. I'm simply doing that to know if GFS is usable only on Redhat or if it's possible to install on other distribution

Thanks again for your help, regards


Greg

 

Hi,Are you sure openais-devel is installed?In rhel-5, openais-devel owns that header file. I'm not sure same goesfor centos but same or similar named package should be required, Iguess.[root at cluster-1@foo ~]# cat /etc/redhat-releaseRed Hat Enterprise Linux Server release 5.2 (Tikanga)[root at cluster-1@foo ~]# rpm -qf $(locate openais/service/swab.h)openais-devel-0.80.3-15.el5- satoruOn Sat, Aug 23, 2008 at 10:34:45AM +0200, gregory steulet wrote:> Hi folks,> > I'm a bit fed up with this project of red hat cluster. At first I triedto> compile cluster-2.03.07 on a 64 bits system... well I forgot it, it looksnot> possible. Now I switched on a centos 32 bits, other errors occured when Itry> to compile cluster-2.03.07> > My configuration is the following :> - centos 5.2 32 bits, last patches> - kernel 2.6.26.3> -openais-080.3> > I used the following procedure to compile cluster-2.03.07> > configure --kernel_src=/usr/src/linux-2.6.26.3> make> > There I got lot of errors regarding op!
 enais (cman/deamon/deamon.c:21:34:error:> openais/service/swab.h: no such file or directory, deamon.c:22:35: error:> openais/totem/aispool.h: nos such file or directory)> > However I had a look on http://sources.redhat.com/cluster/wiki/ and my set up> looks good... How do I have to proceed ???> > Greg--Linux-cluster mailing listLinux-cluster at redhat.comhttps://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080825/1e6b6575/attachment.htm>

From jmartin at learningobjects.com  Mon Aug 25 22:22:55 2008
From: jmartin at learningobjects.com (James Martin)
Date: Mon, 25 Aug 2008 18:22:55 -0400
Subject: [Linux-cluster] LVS not not failing over properly
Message-ID: <48B330BF.6010606@learningobjects.com>

I have a LVS-NAT implementation in the lab that sort of works.  I have a 
primary and hot backup lvs node, and two web servers behind it.  I can 
happily point my web browser at the virtual IP and I get the apache test 
page just fine.  I check the httpd access logs on the two real web 
servers and see that the load is being distributed. 

The problem lies when I try to test the failover of the lvs nodes.  I 
shut the primary node down, and I see that it at least attempts to fail 
over, and seems to do so successfully:

Aug 25 18:21:44 lb2 pulse[5064]: partner dead: activating lvs
Aug 25 18:21:44 lb2 lvs[5083]: starting virtual service glassfish active: 80
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record 
for 10.11.12.10 on eth1.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Withdrawing address record for 
10.11.12.10 on eth1.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record 
for 10.11.12.10 on eth1.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record 
for 10.100.13.220 on eth0.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Withdrawing address record for 
10.100.13.220 on eth0.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record 
for 10.100.13.220 on eth0.
Aug 25 18:21:44 lb2 lvs[5083]: create_monitor for glassfish/gf1 running 
as pid 5094
Aug 25 18:21:44 lb2 nanny[5094]: starting LVS client monitor for 
10.100.13.220:80
Aug 25 18:21:44 lb2 nanny[5095]: starting LVS client monitor for 
10.100.13.220:80
Aug 25 18:21:44 lb2 lvs[5083]: create_monitor for glassfish/gf2 running 
as pid 5095
Aug 25 18:21:44 lb2 nanny[5094]: making 10.11.12.1:80 available
Aug 25 18:21:44 lb2 nanny[5095]: making 10.11.12.2:80 available
Aug 25 18:21:49 lb2 pulse[5085]: gratuitous lvs arps finished


The problem is that attempts from my web browser to refresh the page are 
unsuccessful.  The lvs.cf is synchronized between the lvs nodes.  Here's 
a copy of the config:


serial_no = 49
primary = 10.100.13.96
primary_private = 10.11.12.8
service = lvs
backup_active = 1
backup = 10.100.13.87
backup_private = 10.11.12.9
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 10
network = nat
nat_router = 10.11.12.10 eth1:1
nat_nmask = 255.255.255.0
debug_level = NONE
monitor_links = 1
virtual glassfish {
     active = 1
     address = 10.100.13.220 eth0:1
     vip_nmask = 255.255.255.0
     port = 80
     send = "GET / HTTP/1.0\r\n\r\n"
     expect = "HTTP"
     use_regex = 0
     load_monitor = none
     scheduler = wlc
     protocol = tcp
     timeout = 6
     reentry = 15
     quiesce_server = 0
     server gf1 {
         address = 10.11.12.1
         active = 1
         weight = 1
     }
     server gf2 {
         address = 10.11.12.2
         active = 1
         weight = 1
     }
}


I believe the problem lies in arping, but I'm not sure how to diagnose 
this.  There are no firewalls between my browser and the lvs, and I'm 
using a fairly dumb 100mb switch (also tried with a smarter switch).

Any help would be greatly appreciated.

Thanks,

James



From ross at kallisti.us  Mon Aug 25 23:29:41 2008
From: ross at kallisti.us (Ross Vandegrift)
Date: Mon, 25 Aug 2008 19:29:41 -0400
Subject: [Linux-cluster] GFS2 becomes non-responsive, no fencing
Message-ID: <20080825232941.GB6707@kallisti.us>

Hi everyone,

Have run into a strange problem on our RH cluster installation.  We
have a cluster that uses iscsi shared storage for GFS2.  It's been
running for months with no problems.

Today, the app on one node died.  I logged in, assumed things were
fenced, and tried to go about my business of restarting it.  After
some fiddling, I got the box back in the cluster fine.

It just happened again, and I've dug in a bit more.  I was wrong - the
failed node has not been fenced.  The last thing in dmesg on the
failing node is:

GFS2: fsid=: Trying to join cluster "lock_dlm", "sensors:rrd_gfs"
GFS2: fsid=sensors:rrd_gfs.1: Joined cluster. Now mounting FS...
GFS2: fsid=sensors:rrd_gfs.1: jid=1, already locked for use
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Looking at journal...
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Done

Any reads or writes to the mounted filesystem hangs like the DLM can't
get locks.  Connectivity to the storage is good: no interfaces show
dropped packets or errors.  cman_tool reports the node as healthy:

[root at sensor01 ~]# cman_tool status
Version: 6.0.1
Config Version: 14
Cluster Name: sensors
Cluster Id: 14059
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0 11  
Node name: sensor01.dc3
Node ID: 1
Multicast addresses: 239.192.54.34 

The missing vote is a third node that is not yet live, but it's been
in that state of rweeks now with no problems.

[root at sensor01 ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M    360   2008-08-25 16:24:29  sensor01.dc3
       Last fenced:   2008-08-25 16:04:25 by leaf8b-2.dc3
   2   M    364   2008-08-25 16:24:29  sensor02.dc3
   3   X    364                        sensor03.dc3
       Node has not been fenced since it went down

The fencing above is when I rebooted the node - because processes were
hung on GFS I/O, I had to hard reset the box, which caused the other
nodes to fence it.

Cluster LVM operations seem to work fine - I can query all LVM objects
without a problem.  But as soon as I try a filesystem operation, boom,
I hang.

Any hints on where I can start looking?

-- 
Ross Vandegrift
ross at kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37



From johannes.russek at io-consulting.net  Tue Aug 26 00:36:25 2008
From: johannes.russek at io-consulting.net (=?ISO-8859-15?Q?Johannes_Ru=DFek?=)
Date: Tue, 26 Aug 2008 02:36:25 +0200
Subject: [Linux-cluster] apache mod_disk_cache
Message-ID: <48B35009.6030102@io-consulting.net>

hello everybody,
did anyone ever run mod_disk_cache on multiple notes using a shared gfs 
volume as cachedir?
does that even make sense? (except for i could use striping without 
parity on the storage)
regards,
johannes



From lbigum at iseek.com.au  Tue Aug 26 01:42:26 2008
From: lbigum at iseek.com.au (Luke Bigum)
Date: Tue, 26 Aug 2008 11:42:26 +1000
Subject: [Linux-cluster] GFS2 deadlock
Message-ID: <50A3F7088FE1A14FB0CF57A22487388679BF4F8A54@EXCHANGE1.intranet.iseek.com.au>

Hi guys,

I think I've deadlocked a GFS2 cluster I've been testing with. It's a three node cluster with shared storage exported via GNBD from a separate host. I was running an "rm -Rf /mnt/*; rsync -av /etc/ /mnt/foo/" script in an infinite loop on each node just to see what'd happen, now everything's locked up :) There's no traffic to my GNBD server so I don't think one node has got an indefinite lock.

No operations on the file system on any node works, the processes that are hung all look stuck in system calls: nothing can be killed. Any tips on how to resolve this without rebooting the whole cluster? Is there any debugging information I can get that'd help diagnose what caused the problem?

Thanks,

-Luke

--
Luke Bigum
Systems Administrator
iseek Communications Pty Ltd
Excellence in business data solutions
ph 1300 661 668 fax 1300 661 540
www.iseek.com.au<http://www.iseek.com.au/>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080826/8addf1e7/attachment.htm>

From bernard.chew at muvee.com  Tue Aug 26 03:57:34 2008
From: bernard.chew at muvee.com (Bernard Chew)
Date: Tue, 26 Aug 2008 11:57:34 +0800
Subject: [Linux-cluster] GFS with iSCSI and HP 2012i
References: <OF606D0B42.27DE7266-ON872574AB.007AA52B-872574AC.00583C33@cn.ca>
	<8786b91c0808250251m4c3451d6jc8bf449a93212c9f@mail.gmail.com>
Message-ID: <229C73600EB0E54DA818AB599482BCE951EB6D@shadowfax.sg.muvee.net>

 
> From: linux-cluster-bounces at redhat.com on behalf of Rajagopal Swaminathan
> Sent: Mon 8/25/2008 5:51 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS with iSCSI and HP 2012i
>
>
> Greetings,
> 

> 2008/8/21 <David.Livingstone at cn.ca>
> 
> I'm looking for feedback on two items : 
> 
> 1. Experience with GFS  using  iSCSI 
>
>
> I have used GFS over iSCSI on an HP AIO box once under test conditions and did not have any issues.
>
>
> Pls note it was for few hours and NOT repeat NOT production environment.
>
> So can't comment on multipathing, performance etc, etc, 
> 
> Regards
> 
> Rajagopal
 
Hi,
 
I've no experience with HP 2012i but I'm using Red Hat GFS with iSCSI and Dell MD3000i for nearly a year in my production environment; so far so good.
 
Regards,
Bernard

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080826/d66a3363/attachment.htm>

From linux at vfemail.net  Tue Aug 26 07:28:32 2008
From: linux at vfemail.net (Alex)
Date: Tue, 26 Aug 2008 10:28:32 +0300
Subject: [Linux-cluster] gfs over raid/lvm or any other option?
In-Reply-To: <20298.128.187.190.8.1219685403.squirrel@mail.esc.auckland.ac.nz>
References: <20298.128.187.190.8.1219685403.squirrel@mail.esc.auckland.ac.nz>
Message-ID: <200808261028.32699.linux@vfemail.net>

On Monday 25 August 2008 20:30, michael.osullivan at auckland.ac.nz wrote:
> There are two approaches I have seen that may be suitable:
>
> 1) lustre - I didn't like this as it needed two "special" meta-servers and
> I was building a smaller storage system;
> 2) pvfs
>

Hi Mike,

Please, do not erase thread, will be difficult to trace the subject and 
content...

PVFS (Parallel Virtual File System) - has no redundancy - Lose one node lose 
them all. Also their website is down (www.pvfs.org), so you can add no 
reliability too ...

Your setup will work only in case as you are using mkfs.gfs -j 1... else is 
broken. Also, has no scallability.

Regards,
Alx

> I did not use either of these approaches as they focus on keeping the
> storage system running, rather than keeping the data highly available.
>
> For my test storage I wanted to build a system that would still present
> the stored data even if a single point in the network fails.
>
> I have used iSCSI, mdadm and GFS as follows. I have two storage servers
> with alomst 2TB of disk space for storage each. Both of these two servers
> present a single logical volume to a 2-node cluster using iSCSI. There are
> 2 NICs on each storage server, so each volume is accessible via two ports.
> There are 2 NICs on each cluster node also. The storage system was
> connected to the cluster using some ethernet switches. Using mdadm I have
> successfully multipathed each logical volume and then using mdadm again I
> have built a RAID-5 device from these two volumes. The raid device is
> successfully detected by each cluster node. On this raid device I created
> a logical volume using clvm and on that logical volume I built a GFS to
> control cluster access to the storage. The GFS has been successfully
> mounted on both cluster nodes.
>
> Despite some problems with the cluster (due to my own limited knowledge
> about clusters and fencing) I have successfully created and accessed files
> on the GFS from both cluster nodes. I am in the process of sorting out the
> clustering problems and testing the configuration using IOMeter.
>
> Hope this helps, Mike
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From fdinitto at redhat.com  Tue Aug 26 06:57:07 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 26 Aug 2008 08:57:07 +0200 (CEST)
Subject: [Linux-cluster] Trying to compile cluster-2.03.07
In-Reply-To: <1219692892-deaa494fda200cd7171085477f59ded3@steulet.org>
References: <1219692892-deaa494fda200cd7171085477f59ded3@steulet.org>
Message-ID: <Pine.LNX.4.64.0808260855520.16551@trider-g7>

On Mon, 25 Aug 2008, gregory steulet wrote:

> Hello,
>
> At first thanks for your answer.
>
> In fact I used the following procedure to install all my components :
>
>
>
> #get the latest stable kernel realease
> [root at localhost ~]# cd /usr/src
> [root at localhost src]#wget http://www.eu.kernel.org/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.gz
> [root at localhost src]# tar -xvzf linux-2.6.26.3.tar.gz
> [root at localhost src]# cd linux-2.6.26.3
> [root at localhost src]# cp /boot/configurexxxxxx /usr/src/linux-2.6.26.3/.configure
>
> #configure you kernel
> [root at localhost linux-2.6.26.3]# make menuconfig
>
> #get & apply the lockproto patch
> [root at localhost linux-2.6.26.3]# wget ftp://sources.redhat.com/pub/cluster/releases/lockproto-exports.patch
> [root at localhost linux-2.6.26.3]# patch -p1 < lockproto-exports.patch
> patching file fs/gfs2/locking.c
>
> #compiling kernel
> [root at localhost linux-2.6.26.3]# make bzImage
> [root at localhost linux-2.6.26.3]# make modules
> [root at localhost linux-2.6.26.3]# make modules_install
> [root at localhost linux-2.6.26.3]# make install
> [root at localhost linux-2.6.26.3]# reboot
>
> #installing openais
> [root at localhost openais]# tar -xvzf openais-0.80.3.tar.gz
> [root at localhost openais-0.80.3]# make
> [root at localhost openais-0.80.3]# make install
>
> #trying to install cluster
>
> [root at localhost cluster]# wget ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.07.tar.gz
> [root at localhost cluster]# tar -xvzf cluster-2.03.07.tar.gz
> [root at localhost src]# cd cluster-2.03.07
> [root at localhost cluster-2.03.07]# ./configure kernel_src=/usr/src/linux-2.6.26.3

^^  this is probably not enough. On some systems you need to specify 
--libdir and other bits as well.

Please post the build error and not just a snippet of it.

Fabio

--
I'm going to make him an offer he can't refuse.



From fdinitto at redhat.com  Tue Aug 26 06:58:01 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 26 Aug 2008 08:58:01 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] Add network interface select option for
	fence_xvmd
In-Reply-To: <20080805040202.GA14134@gescom.nrt.redhat.com>
References: <20080805040202.GA14134@gescom.nrt.redhat.com>
Message-ID: <Pine.LNX.4.64.0808260857450.16551@trider-g7>


Lon,

any take on this one?

Fabio

On Tue, 5 Aug 2008, Satoru SATOH wrote:

> Hello,
>
>
> # I sent this before but it looks disappered somewhere so that resend it
> # again. Excuse me if you received the same mail twice.
>
> It should be useful that fence_xvmd listen on a certain network
> interface which manually specified under some conditions such as a
> system has multiple network interfaces and the one to default route is
> not prefered choice, I think.
>
> The following patch adds the option "-I <interface_name>" to select
> network interface fence_xvmd to listen on.
>
> - satoru
>
>
> fence/agents/xvm/fence_xvmd.c |    8 ++++----
> fence/agents/xvm/mcast.c      |   21 ++++++++++++++++++---
> fence/agents/xvm/mcast.h      |    4 ++--
> fence/agents/xvm/options.c    |   13 +++++++++++++
> fence/agents/xvm/options.h    |    1 +
> fence/man/fence_xvmd.8        |    3 +++
> 6 files changed, 41 insertions(+), 9 deletions(-)
>
> diff --git a/fence/agents/xvm/fence_xvmd.c b/fence/agents/xvm/fence_xvmd.c
> index 888f24b..1dc5eba 100644
> --- a/fence/agents/xvm/fence_xvmd.c
> +++ b/fence/agents/xvm/fence_xvmd.c
> @@ -921,7 +921,7 @@ main(int argc, char **argv)
> 	unsigned int logmode = 0;
> 	char key[MAX_KEY_LEN];
> 	int key_len = 0, x;
> -	char *my_options = "dfi:a:p:C:U:c:k:u?hLXV";
> +	char *my_options = "dfi:a:I:p:C:U:c:k:u?hLXV";
> 	cman_handle_t ch = NULL;
> 	void *h = NULL;
>
> @@ -1031,9 +1031,9 @@ main(int argc, char **argv)
> 	}
>
> 	if (args.family == PF_INET)
> -		mc_sock = ipv4_recv_sk(args.addr, args.port);
> +		mc_sock = ipv4_recv_sk(args.addr, args.port, args.ifindex);
> 	else
> -		mc_sock = ipv6_recv_sk(args.addr, args.port);
> +		mc_sock = ipv6_recv_sk(args.addr, args.port, args.ifindex);
> 	if (mc_sock < 0) {
> 		log_printf(LOG_ERR,
> 			   "Could not set up multicast listen socket\n");
> @@ -1049,5 +1049,5 @@ main(int argc, char **argv)
>
> 	//malloc_dump_table();
>
> -	return 0;
> +	exit(errno);
> }
> diff --git a/fence/agents/xvm/mcast.c b/fence/agents/xvm/mcast.c
> index db46328..001e3ac 100644
> --- a/fence/agents/xvm/mcast.c
> +++ b/fence/agents/xvm/mcast.c
> @@ -31,11 +31,12 @@ LOGSYS_DECLARE_SUBSYS ("XVM", SYSLOGLEVEL);
>   Sets up a multicast receive socket
>  */
> int
> -ipv4_recv_sk(char *addr, int port)
> +ipv4_recv_sk(char *addr, int port, unsigned int ifindex)
> {
> 	int sock;
> 	struct ip_mreq mreq;
> 	struct sockaddr_in sin;
> +	struct ifreq ifreq;
>
> 	/* Store multicast address */
> 	if (inet_pton(PF_INET, addr,
> @@ -74,7 +75,20 @@ ipv4_recv_sk(char *addr, int port)
> 	 * Join multicast group
> 	 */
> 	/* mreq.imr_multiaddr.s_addr is set above */
> -	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
> +	if (ifindex > 0 && if_indextoname(ifindex, ifreq.ifr_name) != NULL) {
> +		ifreq.ifr_addr.sa_family = AF_INET;
> +		if (ioctl(sock, SIOCGIFADDR, &ifreq) < 0) {
> +			printf("Failed to get address of the interface %d\n",
> +				ifindex);
> +			mreq.imr_interface.s_addr = htonl(INADDR_ANY);
> +		} else {
> +			memcpy(&mreq.imr_interface,
> +				&((struct sockaddr_in *) &ifreq.ifr_addr)->sin_addr,
> +				sizeof(struct in_addr));
> +		}
> +	} else {
> +		mreq.imr_interface.s_addr = htonl(INADDR_ANY);
> +	}
> 	dbg_printf(4, "Joining multicast group\n");
> 	if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
> 		       &mreq, sizeof(mreq)) == -1) {
> @@ -184,7 +198,7 @@ ipv4_send_sk(char *send_addr, char *addr, int port, struct sockaddr *tgt,
>   Sets up a multicast receive (ipv6) socket
>  */
> int
> -ipv6_recv_sk(char *addr, int port)
> +ipv6_recv_sk(char *addr, int port, unsigned int ifindex)
> {
> 	int sock, val;
> 	struct ipv6_mreq mreq;
> @@ -203,6 +217,7 @@ ipv6_recv_sk(char *addr, int port)
> 	memcpy(&mreq.ipv6mr_multiaddr, &sin.sin6_addr,
> 	       sizeof(struct in6_addr));
>
> +	mreq.ipv6mr_interface = (ifindex > 0) ? ifindex : 0;
>
> 	/********************************
> 	 * SET UP MULTICAST RECV SOCKET *
> diff --git a/fence/agents/xvm/mcast.h b/fence/agents/xvm/mcast.h
> index 5113f04..08fd6de 100644
> --- a/fence/agents/xvm/mcast.h
> +++ b/fence/agents/xvm/mcast.h
> @@ -4,11 +4,11 @@
> #define IPV4_MCAST_DEFAULT "225.0.0.12"
> #define IPV6_MCAST_DEFAULT "ff05::3:1"
>
> -int ipv4_recv_sk(char *addr, int port);
> +int ipv4_recv_sk(char *addr, int port, unsigned int ifindex);
> int ipv4_send_sk(char *src_addr, char *addr, int port,
> 		 struct sockaddr *src, socklen_t slen,
> 		 int ttl);
> -int ipv6_recv_sk(char *addr, int port);
> +int ipv6_recv_sk(char *addr, int port, unsigned int ifindex);
> int ipv6_send_sk(char *src_addr, char *addr, int port,
> 		 struct sockaddr *src, socklen_t slen,
> 		 int ttl);
> diff --git a/fence/agents/xvm/options.c b/fence/agents/xvm/options.c
> index 969ca8d..519f57e 100644
> --- a/fence/agents/xvm/options.c
> +++ b/fence/agents/xvm/options.c
> @@ -82,6 +82,13 @@ assign_address(fence_xvm_args_t *args, struct arg_info *arg, char *value)
>
>
> static inline void
> +assign_interface(fence_xvm_args_t *args, struct arg_info *arg, char *value)
> +{
> +	args->ifindex = if_nametoindex(value);
> +}
> +
> +
> +static inline void
> assign_ttl(fence_xvm_args_t *args, struct arg_info *arg, char *value)
> {
> 	int ttl;
> @@ -299,6 +306,10 @@ static struct arg_info _arg_info[] = {
> 	  "Multicast address (default=225.0.0.12 / ff02::3:1)",
> 	  assign_address },
>
> +	{ 'I', "-I <interface>", NULL,
> +	  "Network interface to listen on (default=auto; kernel selects)",
> +	  assign_interface },
> +
> 	{ 'T', "-T <ttl>", "multicast_ttl",
> 	  "Multicast time-to-live (in hops; default=2)",
> 	  assign_ttl },
> @@ -422,6 +433,7 @@ args_init(fence_xvm_args_t *args)
> 	args->flags = 0;
> 	args->debug = 0;
> 	args->ttl = DEFAULT_TTL;
> +	args->ifindex = 0;
> }
>
>
> @@ -439,6 +451,7 @@ args_print(fence_xvm_args_t *args)
> {
> 	dbg_printf(1, "-- args @ %p --\n", args);
> 	_pr_str(args->addr);
> +	_pr_int(args->ifindex);
> 	_pr_str(args->domain);
> 	_pr_str(args->key_file);
> 	_pr_int(args->op);
> diff --git a/fence/agents/xvm/options.h b/fence/agents/xvm/options.h
> index 7a2dcca..8720366 100644
> --- a/fence/agents/xvm/options.h
> +++ b/fence/agents/xvm/options.h
> @@ -29,6 +29,7 @@ typedef struct {
> 	arg_flags_t flags;
> 	int debug;
> 	int ttl;
> +	unsigned int ifindex;
> } fence_xvm_args_t;
>
> /* Private structure for commandline / stdin fencing args */
> diff --git a/fence/man/fence_xvmd.8 b/fence/man/fence_xvmd.8
> index 5a47211..05d4720 100644
> --- a/fence/man/fence_xvmd.8
> +++ b/fence/man/fence_xvmd.8
> @@ -36,6 +36,9 @@ IP family to use (auto, ipv4, or ipv6; default = auto)
> Multicast address to listen on (default=225.0.0.12 for ipv4, ff02::3:1
> for ipv6)
> .TP
> +\fB-I\fP \fIinterface\fP
> +Network interface to use; e.g. eth0 (default: one[s] kernel choosed)
> +.TP
> \fB-p\fP \fIport\fP
> Port to use (default=1229)
> .TP
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


--
I'm going to make him an offer he can't refuse.



From fdinitto at redhat.com  Tue Aug 26 10:02:25 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 26 Aug 2008 12:02:25 +0200 (CEST)
Subject: [Linux-cluster] Cluster 2.99.09 (development snapshot) released
Message-ID: <Pine.LNX.4.64.0808261151330.16551@trider-g7>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The cluster team and its community are proud to announce the 2.99.09
release from the master branch.

This release includes some major changes and features that people are 
going to love:

* First of all our library API can now be considered stable and the soname
   has been bumped to 3.0!

* GFS1 is now totally standalone and does not require GFS2 nor a patched
   upstream kernel to run.

* A full new contributed tool has been added to the repository (askant).
   Thanks to Andrew for his awesome work.
   We look forward for more community work to be merged in the newly
   created contrib/ section.

The 2.99.XX releases are _NOT_ meant to be used for production 
environments yet, but they start to be more and more useable without major 
glitches.

The master branch is the main development tree that receives all new
features, code, clean up and a whole brand new set of bugs,

At some point in time this code will become the 3.0 stable release.

Everybody with test equipment and time to spare, is highly encouraged to
download, install and test the 2.99 releases and more important report
problems.

In order to build the 2.99.09 release you will need:

- - corosync svn r1659.
- - openais svn r1638.
- - linux kernel (2.6.27)

The new source tarball can be downloaded here:

   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.99.09.tar.gz
   https://fedorahosted.org/releases/c/l/cluster/cluster-2.99.09.tar.gz

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 2.99.08):

Abhijith Das (2):
       gfs-kernel: Bug 450209: Create gfs1-specific lock modules + minor fixes to build with 2.6.27
       gfs-kernel: bug 450209 - addendum to previous patch. Removes extraneous lock_dlm_plock.c

Andrew Price (1):
       askant: Import askant into tree

Christine Caulfield (8):
       cman (mainly): use corosync
       cman: Fix find_handle leak
       cman: fix objdb-destroying typo
       cman" load openais services by default
       cman: Silence some compiler warnings.
       cman: add cman_tool -A to disable load of openais services
       cman: Return quorum state in a STATECHANGE notification
       cman: return the correct length of a message

David Teigland (13):
       libdlm: handle truncated device names
       gfs_controld: queries in libgroup mode
       dlm_controld: fs_register and fs_result fixes
       dlm_controld: kill the cluster on misbehaving nodes
       dlm_controld: fix nodeid in fs_result
       gfs_controld: fix fs_notify during recovery
       dlm_controld: open dlm-monitor misc device
       gfs_controld: kill the cluster on misbehaving nodes
       fenced: kill the cluster on misbehaving nodes
       groupd: remove detection of uncontrolled kernel dlm and gfs
       dlm_controld: isolate cman and fence code
       fenced: add skip_undefined option
       gfs_controld: ignore dlm uevents

Fabio M. Di Nitto (23):
       build: add support for corosync
       build: bump kernel requirement to 2.6.27
       cman: make ccsd startup optional and allow override of config loader
       config: move ccs/ccs_tool to config/tools/ccs_tool
       cman: switch default config parser to xmlconfig
       ccs: libccscompat don't include unrequired header
       ccs: move debug.h to ccs/daemon
       ccs: move comm_headers.h to ccs/daemon
       config: move generic documenation and man pages to config/man
       ccs: move libccscompat into config/libs and mark it legacy code
       ccs: move ccsais plugin to config/plugins/ccsais and mark it legacy code
       ccs: move ccs/daemon to config/daemons/ccds and mark it legacy code
       build: define legacy_code=1 on clean target
       libccs: add support for /child::*[%d]/ for xpathlite
       qdisk: allow scan of sysfs to dive into first level symlinks
       qdisk: fix sysfs path diving
       build: create contrib/ top level section
       build: add contrib/Makefile
       build: plugin askant in our build system
       misc: remove exec bits from different files
       build: rename --enable_xen to --enable_virt
       build: add --without_config build option
       build: bump library soname to 3.0

Lon Hohberger (1):
       rgmanager: Ancillary fix for rhbz #453000

  Makefile                                   |   18 +-
  ccs/Makefile                               |    4 -
  ccs/ccs_tool/Makefile                      |   56 --
  ccs/ccs_tool/ccs_tool.c                    |  353 -------
  ccs/ccs_tool/editconf.c                    | 1261 -------------------------
  ccs/ccs_tool/editconf.h                    |    8 -
  ccs/ccs_tool/update.c                      |  673 --------------
  ccs/ccs_tool/update.h                      |    6 -
  ccs/ccsais/Makefile                        |   33 -
  ccs/ccsais/config.c                        |  229 -----
  ccs/daemon/Makefile                        |   34 -
  ccs/daemon/ccsd.c                          |  922 ------------------
  ccs/daemon/cluster_mgr.c                   |  688 --------------
  ccs/daemon/cluster_mgr.h                   |    6 -
  ccs/daemon/cnx_mgr.c                       | 1393 ----------------------------
  ccs/daemon/cnx_mgr.h                       |    7 -
  ccs/daemon/globals.c                       |   19 -
  ccs/daemon/globals.h                       |   23 -
  ccs/daemon/misc.c                          |  288 ------
  ccs/daemon/misc.h                          |   19 -
  ccs/include/comm_headers.h                 |   48 -
  ccs/include/debug.h                        |    9 -
  ccs/libccscompat/Makefile                  |   15 -
  ccs/libccscompat/libccscompat.c            |  753 ---------------
  ccs/libccscompat/libccscompat.h            |   18 -
  ccs/man/Makefile                           |   12 -
  ccs/man/ccs.7                              |   22 -
  ccs/man/ccs_test.8                         |  132 ---
  ccs/man/ccs_tool.8                         |  185 ----
  ccs/man/ccsd.8                             |   74 --
  ccs/man/cluster.conf.5                     |   61 --
  cman/cman_tool/Makefile                    |    2 +-
  cman/cman_tool/cman_tool.h                 |    1 +
  cman/cman_tool/join.c                      |   45 +-
  cman/cman_tool/main.c                      |   11 +-
  cman/daemon/Makefile                       |    2 +-
  cman/daemon/ais.c                          |  103 +-
  cman/daemon/ais.h                          |    2 -
  cman/daemon/barrier.c                      |   14 +-
  cman/daemon/cman-preconfig.c               |  222 +++---
  cman/daemon/cmanconfig.c                   |  103 +--
  cman/daemon/cmanconfig.h                   |   36 +-
  cman/daemon/cnxman-private.h               |    2 -
  cman/daemon/commands.c                     |  140 ++--
  cman/daemon/commands.h                     |    5 +-
  cman/daemon/daemon.c                       |    7 +-
  cman/daemon/logging.c                      |    2 +-
  cman/daemon/logging.h                      |    2 +-
  cman/daemon/nodelist.h                     |   79 ++-
  cman/init.d/cman                           |   64 ++-
  cman/lib/libcman.h                         |    2 +-
  cman/man/cman_tool.8                       |   12 +-
  cman/qdisk/Makefile                        |    4 +-
  cman/qdisk/daemon_init.c                   |    2 +-
  cman/qdisk/disk.c                          |    2 +-
  cman/qdisk/disk_util.c                     |    2 +-
  cman/qdisk/main.c                          |    2 +-
  cman/qdisk/mkqdisk.c                       |    2 +-
  cman/qdisk/proc.c                          |    2 +-
  cman/qdisk/scandisk.c                      |   13 +-
  cman/qdisk/score.c                         |    2 +-
  config/Makefile                            |    2 +-
  config/daemons/Makefile                    |    8 +
  config/daemons/ccsd/Makefile               |   37 +
  config/daemons/ccsd/ccsd.c                 |  922 ++++++++++++++++++
  config/daemons/ccsd/cluster_mgr.c          |  688 ++++++++++++++
  config/daemons/ccsd/cluster_mgr.h          |    6 +
  config/daemons/ccsd/cnx_mgr.c              | 1393 ++++++++++++++++++++++++++++
  config/daemons/ccsd/cnx_mgr.h              |    7 +
  config/daemons/ccsd/comm_headers.h         |   48 +
  config/daemons/ccsd/debug.h                |    9 +
  config/daemons/ccsd/globals.c              |   19 +
  config/daemons/ccsd/globals.h              |   23 +
  config/daemons/ccsd/misc.c                 |  288 ++++++
  config/daemons/ccsd/misc.h                 |   19 +
  config/daemons/man/Makefile                |    9 +
  config/daemons/man/ccsd.8                  |   74 ++
  config/libs/Makefile                       |    5 +-
  config/libs/libccscompat/Makefile          |   15 +
  config/libs/libccscompat/libccscompat.c    |  752 +++++++++++++++
  config/libs/libccscompat/libccscompat.h    |   18 +
  config/libs/libccsconfdb/Makefile          |    4 +-
  config/libs/libccsconfdb/libccs.c          |    3 +-
  config/man/Makefile                        |    5 +
  config/man/ccs.7                           |   22 +
  config/man/cluster.conf.5                  |   61 ++
  config/plugins/Makefile                    |    5 +-
  config/plugins/ccsais/Makefile             |   33 +
  config/plugins/ccsais/config.c             |  224 +++++
  config/plugins/ldap/configldap.c           |   26 +-
  config/plugins/xml/config.c                |   10 +-
  config/tools/Makefile                      |    2 +-
  config/tools/ccs_tool/Makefile             |   53 ++
  config/tools/ccs_tool/ccs_tool.c           |  353 +++++++
  config/tools/ccs_tool/editconf.c           | 1261 +++++++++++++++++++++++++
  config/tools/ccs_tool/editconf.h           |    8 +
  config/tools/ccs_tool/update.c             |  673 ++++++++++++++
  config/tools/ccs_tool/update.h             |    6 +
  config/tools/ldap/Makefile                 |    4 +-
  config/tools/ldap/confdb2ldif.c            |    3 +-
  config/tools/man/Makefile                  |    8 +-
  config/tools/man/ccs_test.8                |  132 +++
  config/tools/man/ccs_tool.8                |  185 ++++
  configure                                  |   61 +-
  contrib/Makefile                           |    6 +
  contrib/askant/INSTALL                     |   42 +
  contrib/askant/Makefile                    |   24 +
  contrib/askant/PLUGINAPI                   |   65 ++
  contrib/askant/README                      |   74 ++
  contrib/askant/askant/about.py             |    5 +
  contrib/askant/askant/askant.py            |   24 +
  contrib/askant/askant/blktrace.py          |   93 ++
  contrib/askant/askant/commands.py          |  333 +++++++
  contrib/askant/askant/sysfs.py             |   86 ++
  contrib/askant/fsplugins/gfs2/gfs2.c       |  405 ++++++++
  contrib/askant/fsplugins/gfs2/gfs2.h       |    3 +
  contrib/askant/fsplugins/gfs2/gfs2module.c |  104 ++
  contrib/askant/scripts/askant              |    6 +
  contrib/askant/setup.py                    |   15 +
  dlm/libdlm/51-dlm.rules                    |    1 +
  dlm/libdlm/libdlm.c                        |  134 +++-
  fence/agents/Makefile                      |    2 +-
  fence/agents/xvm/Makefile                  |    8 +-
  fence/agents/xvm/debug.h                   |    2 +-
  fence/fence_node/Makefile                  |    4 +-
  fence/fence_node/fence_node.c              |    2 +-
  fence/fenced/Makefile                      |    4 +-
  fence/fenced/config.c                      |   47 +-
  fence/fenced/config.h                      |    3 +
  fence/fenced/cpg.c                         |    3 +
  fence/fenced/fd.h                          |    5 +-
  fence/fenced/main.c                        |    8 +-
  fence/fenced/member_cman.c                 |   22 +-
  gfs-kernel/src/gfs/Makefile                |    7 +
  gfs-kernel/src/gfs/acl.c                   |    2 +-
  gfs-kernel/src/gfs/bits.c                  |    2 +-
  gfs-kernel/src/gfs/bmap.c                  |    2 +-
  gfs-kernel/src/gfs/dio.c                   |    2 +-
  gfs-kernel/src/gfs/dir.c                   |    2 +-
  gfs-kernel/src/gfs/eaops.c                 |    2 +-
  gfs-kernel/src/gfs/eattr.c                 |    2 +-
  gfs-kernel/src/gfs/file.c                  |    2 +-
  gfs-kernel/src/gfs/gfs.h                   |    2 +-
  gfs-kernel/src/gfs/glock.c                 |    2 +-
  gfs-kernel/src/gfs/glops.c                 |    2 +-
  gfs-kernel/src/gfs/inode.c                 |   10 +-
  gfs-kernel/src/gfs/ioctl.c                 |    2 +-
  gfs-kernel/src/gfs/lm.c                    |    8 +-
  gfs-kernel/src/gfs/lm_interface.h          |  278 ++++++
  gfs-kernel/src/gfs/lock_dlm.h              |  182 ++++
  gfs-kernel/src/gfs/lock_dlm_lock.c         |  527 +++++++++++
  gfs-kernel/src/gfs/lock_dlm_main.c         |   40 +
  gfs-kernel/src/gfs/lock_dlm_mount.c        |  279 ++++++
  gfs-kernel/src/gfs/lock_dlm_sysfs.c        |  225 +++++
  gfs-kernel/src/gfs/lock_dlm_thread.c       |  367 ++++++++
  gfs-kernel/src/gfs/lock_nolock_main.c      |  230 +++++
  gfs-kernel/src/gfs/locking.c               |  180 ++++
  gfs-kernel/src/gfs/log.c                   |    2 +-
  gfs-kernel/src/gfs/lops.c                  |    2 +-
  gfs-kernel/src/gfs/lvb.c                   |    2 +-
  gfs-kernel/src/gfs/main.c                  |   12 +-
  gfs-kernel/src/gfs/mount.c                 |    2 +-
  gfs-kernel/src/gfs/ondisk.c                |    2 +-
  gfs-kernel/src/gfs/ops_address.c           |    2 +-
  gfs-kernel/src/gfs/ops_dentry.c            |    2 +-
  gfs-kernel/src/gfs/ops_export.c            |    2 +-
  gfs-kernel/src/gfs/ops_file.c              |    6 +-
  gfs-kernel/src/gfs/ops_inode.c             |   16 +-
  gfs-kernel/src/gfs/ops_super.c             |    2 +-
  gfs-kernel/src/gfs/ops_vm.c                |    2 +-
  gfs-kernel/src/gfs/page.c                  |    2 +-
  gfs-kernel/src/gfs/proc.c                  |    2 +-
  gfs-kernel/src/gfs/quota.c                 |    2 +-
  gfs-kernel/src/gfs/recovery.c              |    2 +-
  gfs-kernel/src/gfs/rgrp.c                  |    2 +-
  gfs-kernel/src/gfs/super.c                 |    2 +-
  gfs-kernel/src/gfs/sys.c                   |    2 +-
  gfs-kernel/src/gfs/trans.c                 |    2 +-
  gfs-kernel/src/gfs/unlinked.c              |    2 +-
  gfs-kernel/src/gfs/util.c                  |    2 +-
  group/daemon/Makefile                      |    4 +-
  group/daemon/cman.c                        |    2 +-
  group/daemon/cpg.c                         |    2 +-
  group/daemon/gd_internal.h                 |    4 +-
  group/daemon/main.c                        |   93 --
  group/dlm_controld/Makefile                |    7 +-
  group/dlm_controld/action.c                |  137 +++-
  group/dlm_controld/config.c                |   38 -
  group/dlm_controld/cpg.c                   |   39 +-
  group/dlm_controld/dlm_daemon.h            |   58 +-
  group/dlm_controld/group.c                 |    2 +-
  group/dlm_controld/main.c                  |  128 ++-
  group/dlm_controld/member_cman.c           |   68 ++-
  group/dlm_controld/plock.c                 |  189 +----
  group/gfs_controld/Makefile                |    7 +-
  group/gfs_controld/cpg-new.c               |   11 +-
  group/gfs_controld/cpg-old.c               |   31 +-
  group/gfs_controld/cpg-old.h               |   25 +
  group/gfs_controld/gfs_daemon.h            |    8 +-
  group/gfs_controld/group.c                 |  123 +++-
  group/gfs_controld/main.c                  |   16 +-
  group/gfs_controld/member_cman.c           |   23 +-
  group/gfs_controld/util.c                  |   73 ++
  group/libgfscontrol/libgfscontrol.h        |    1 +
  make/defines.mk.input                      |    9 +-
  make/official_release_version              |    2 +-
  rgmanager/src/resources/ip.sh              |    2 +-
  207 files changed, 12468 insertions(+), 8335 deletions(-)

- --
I'm going to make him an offer he can't refuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQIVAwUBSLPUtggUGcMLQ3qJAQLaOQ/8CMdFJ5724n1Hp9Z67k72FkU8umzCMR0S
MPnCNLn/UOf3Q6PpAxcM2qcoJwsJ/XjWm6Qt0G3tJ8yyDOclZ8OJ9EckujsAT212
F/xbwsXvY3zh4PfDN9OkudAHL9AN4gM7STx2yVKcktZLQ83BgM8NsrMEWhWoYnY8
2m0ZKa7vG9KG/SJnSJVcx8ZfNrTnrm1GOCbCh1ggslE7JrCJdZYfIG8nfBnEv14X
OJRGI+NaRTZznPtWbuG+SBx8F07lVeC3lWqalB2jtu1J4Dmgs307V+q/4q4hhQxc
eT6HbxtLeYj0Py+Mcfkvo1Es6Pbnw5X0vqaW2PfgTLYYTcVi2ZAJ4m4d2rXlGrDQ
P4Bj8iNX3K7WOJIbOg56qRlm+G1WLnVlKDVhPj7bgqeCs9TDDyDnzXdrlrbiKww4
Yxh0lEC2+/nFNkpS5Dpq8f+Oxcn+T9Zk9VGzkll1D0fqJ+syE9LhG+X4jDFgJpTN
lqQkdN83AF+gYOfKPYez8IEPoEnFLhz5NfCHN+UASGaMc0zxPhVLqWieHFZufDbH
qmSJ/Ro8/zkrIVjqs0KdrfQoI14jClAt6I+ZWLQFB67qTqkbaapbe5pdxyM5xHeY
QO4giQbx2QJpBUW/eGIv4sYl+YICr5m2t2jxZkJiKKb+w3RodE6Nu+ZsQnB5zMqy
hnNrRVGrp8g=
=TX7g
-----END PGP SIGNATURE-----



From fdinitto at redhat.com  Tue Aug 26 06:58:26 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 26 Aug 2008 08:58:26 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] trivial debug print fix for
	fence/agents/xvm/
In-Reply-To: <20080805050628.GC14134@gescom.nrt.redhat.com>
References: <20080805050628.GC14134@gescom.nrt.redhat.com>
Message-ID: <Pine.LNX.4.64.0808260858090.16551@trider-g7>


Lon, looks good to me..

Fabio

On Tue, 5 Aug 2008, Satoru SATOH wrote:

> Hello,
>
> Here is a trivial patch to add missing line breaks in some debug print
> lines for fence_xvm*.
>
> - satoru
>
>
> fence/agents/xvm/simple_auth.c |    4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fence/agents/xvm/simple_auth.c b/fence/agents/xvm/simple_auth.c
> index 07bc261..a1f04f4 100644
> --- a/fence/agents/xvm/simple_auth.c
> +++ b/fence/agents/xvm/simple_auth.c
> @@ -381,7 +381,7 @@ read_key_file(char *file, char *key, size_t max_len)
> 		}
>
> 		if (nread == 0) {
> -			dbg_printf(3, "Stopped reading @ %d bytes",
> +			dbg_printf(3, "Stopped reading @ %d bytes\n",
> 				(int)max_len-remain);
> 			break;
> 		}
> @@ -391,7 +391,7 @@ read_key_file(char *file, char *key, size_t max_len)
> 	}
>
> 	close(fd);
> -	dbg_printf(3, "Actual key length = %d bytes", (int)max_len-remain);
> +	dbg_printf(3, "Actual key length = %d bytes\n", (int)max_len-remain);
>
> 	return (int)(max_len - remain);
> }
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


--
I'm going to make him an offer he can't refuse.



From fdinitto at redhat.com  Tue Aug 26 07:07:02 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 26 Aug 2008 09:07:02 +0200 (CEST)
Subject: [Linux-cluster] Re: [Cluster-devel] Cluster Infrastructure IRC
 meeting - Monday 25th of Aug 2pm UTC/GMT
In-Reply-To: <Pine.LNX.4.64.0808220716210.16551@trider-g7>
References: <Pine.LNX.4.64.0808220716210.16551@trider-g7>
Message-ID: <Pine.LNX.4.64.0808260905520.16551@trider-g7>

On Fri, 22 Aug 2008, Fabio M. Di Nitto wrote:

>
> Hi everybody,
>
> When  : Monday 25th of Aug 2pm UTC/GMT (*)(**)
> Where : irc.freenode.net #linux-cluster
> Who   : everybody interested is invited to participate
> Agenda: http://sources.redhat.com/cluster/wiki/Meetings/2008-Aug-25

IRC log of the meeting are now online, also linked from the Agenda.

http://sources.redhat.com/cluster/wiki/Meetings/2008-Aug-25/irclogs

thanks everybody for participating.

Fabio

--
I'm going to make him an offer he can't refuse.



From gspiegl at gmx.at  Tue Aug 26 16:26:47 2008
From: gspiegl at gmx.at (Gerhard Spiegl)
Date: Tue, 26 Aug 2008 18:26:47 +0200
Subject: [Linux-cluster] 2 node cluster gets fenced despite qdisk
Message-ID: <48B42EC7.5090609@gmx.at>

Hi all!

I'm trying to get a 2 node cluster w/ quorum device running, most of its running
fine. 
The cluster has a public net interface (bond0) and a private one (bond1).
When the clusterinterconnect gets lost (ifconfig down the underlying eth devs), 
the two nodes immediatly fence each other and the cluster goes down. Is this some
sort of expected behavior?

I assumed the master node (qdiskd) {w,c,sh}ould stay alive and provide services,
as the cluster still has one communication channel (the quorum disk).

Below is the cluster.conf, OS used is RHEL5.2 with the latest patches.

Or is the only use of the quorum disk not to get a split brain condition?

If one could point me to a good resource for RHCS cluster configuration (eg
comprehensive explaination of cluster.conf options) I would much apreciate this.

kind regards
Gerhard



[root at ols011p yum.repos.d]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="fusi01" config_version="12" name="fusi01">
        <fence_daemon post_fail_delay="0" post_join_delay="10"/>
        <cman expected_votes="3" two_node="0"/>
        <clusternodes>
                <clusternode name="ols011p" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="OLS011-m"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="ols012p" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="OLS012-m"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <quorumd device="/dev/mapper/HDS-00F9p1" votes="1">

            <heuristic interval="2" tko="3" program="ping -c1 -t3 172.27.111.254" score="1"/>

        </quorumd>

        <fencedevices>
                <fencedevice agent="fence_ipmilan" option="off" auth="password" ipaddr="ols011-m" login="root" name="OLS011-m" passwd="changeme"/>
                <fencedevice agent="fence_ipmilan" option="off" auth="password" ipaddr="ols012-m" login="root" name="OLS012-m" passwd="changeme"/>
        </fencedevices>

        <rm>
                <failoverdomains>
                        <failoverdomain name="fusi01_hvm_dom" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="ols011p" priority="2"/>
                                <failoverdomainnode name="ols012p" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="fusi01_pvm_dom" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="ols011p" priority="1"/>
                                <failoverdomainnode name="ols012p" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <vm autostart="1" domain="fusi01_pvm_dom" exclusive="0" migrate="live" name="fusi01pvm" path="/global/xenconfig" recovery="restart"/>
                <vm autostart="1" domain="fusi01_hvm_dom" exclusive="0" migrate="live" name="fusi01hvm" path="/global/xenconfig" recovery="restart"/>
        </rm>
</cluster>



From ross at kallisti.us  Tue Aug 26 17:50:04 2008
From: ross at kallisti.us (Ross Vandegrift)
Date: Tue, 26 Aug 2008 13:50:04 -0400
Subject: [Linux-cluster] GFS2 becomes non-responsive, no fencing
In-Reply-To: <20080825232941.GB6707@kallisti.us>
References: <20080825232941.GB6707@kallisti.us>
Message-ID: <20080826175004.GB18608@kallisti.us>

On Mon, Aug 25, 2008 at 07:29:41PM -0400, Ross Vandegrift wrote:
> Today, the app on one node died.  I logged in, assumed things were
> fenced, and tried to go about my business of restarting it.  After
> some fiddling, I got the box back in the cluster fine.
> 
> It just happened again, and I've dug in a bit more.  I was wrong - the
> failed node has not been fenced.  The last thing in dmesg on the
> failing node is:

Some more information gleaned today.  I left the node running last
night without fixing the GFS2 access.  Today, we noticed that
filesystem access has been restored for new processes - it's slow
(sometimes taking minutes to return an ls for 10 items), but
it eventually responds.  The application threads that are sleeping in D
still haven't received their data from reads issued yesterday
afternoon.

A cursory examination of DLM-related keys in /sys reveal that the
working and broken nodes are configured the same.  No major disparity
in terms of memory use, except the obvious fact that the broken node
shows very litte disk IO.

I'm pretty much at a loss - any ideas would be very welcome.

-- 
Ross Vandegrift
ross at kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37



From billpp at gmail.com  Tue Aug 26 19:03:10 2008
From: billpp at gmail.com (Flavio Junior)
Date: Tue, 26 Aug 2008 16:03:10 -0300
Subject: [Linux-cluster] 2 node cluster gets fenced despite qdisk
In-Reply-To: <48B42EC7.5090609@gmx.at>
References: <48B42EC7.5090609@gmx.at>
Message-ID: <58aa8d780808261203k3ddddd5dk811c95d47308833a@mail.gmail.com>

Hi ;)

I'm a newbie in this subject but, maybe it could help you:

http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html

http://sources.redhat.com/cluster/wiki/FAQ/CMAN#cman_quorum - I'm
reading exactly about it right now :)

Ignores me if you had already read this links.

--

Fl?vio do Carmo J?nior aka waKKu

On Tue, Aug 26, 2008 at 1:26 PM, Gerhard Spiegl <gspiegl at gmx.at> wrote:
> Hi all!
>
> I'm trying to get a 2 node cluster w/ quorum device running, most of its running
> fine.
> The cluster has a public net interface (bond0) and a private one (bond1).
> When the clusterinterconnect gets lost (ifconfig down the underlying eth devs),
> the two nodes immediatly fence each other and the cluster goes down. Is this some
> sort of expected behavior?
>
> I assumed the master node (qdiskd) {w,c,sh}ould stay alive and provide services,
> as the cluster still has one communication channel (the quorum disk).
>
> Below is the cluster.conf, OS used is RHEL5.2 with the latest patches.
>
> Or is the only use of the quorum disk not to get a split brain condition?
>
> If one could point me to a good resource for RHCS cluster configuration (eg
> comprehensive explaination of cluster.conf options) I would much apreciate this.
>
> kind regards
> Gerhard
>
>
>
> [root at ols011p yum.repos.d]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster alias="fusi01" config_version="12" name="fusi01">
>        <fence_daemon post_fail_delay="0" post_join_delay="10"/>
>        <cman expected_votes="3" two_node="0"/>
>        <clusternodes>
>                <clusternode name="ols011p" nodeid="1" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device lanplus="1" name="OLS011-m"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="ols012p" nodeid="2" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device lanplus="1" name="OLS012-m"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <quorumd device="/dev/mapper/HDS-00F9p1" votes="1">
>
>            <heuristic interval="2" tko="3" program="ping -c1 -t3 172.27.111.254" score="1"/>
>
>        </quorumd>
>
>        <fencedevices>
>                <fencedevice agent="fence_ipmilan" option="off" auth="password" ipaddr="ols011-m" login="root" name="OLS011-m" passwd="changeme"/>
>                <fencedevice agent="fence_ipmilan" option="off" auth="password" ipaddr="ols012-m" login="root" name="OLS012-m" passwd="changeme"/>
>        </fencedevices>
>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="fusi01_hvm_dom" nofailback="0" ordered="1" restricted="1">
>                                <failoverdomainnode name="ols011p" priority="2"/>
>                                <failoverdomainnode name="ols012p" priority="1"/>
>                        </failoverdomain>
>                        <failoverdomain name="fusi01_pvm_dom" nofailback="0" ordered="1" restricted="1">
>                                <failoverdomainnode name="ols011p" priority="1"/>
>                                <failoverdomainnode name="ols012p" priority="2"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources/>
>                <vm autostart="1" domain="fusi01_pvm_dom" exclusive="0" migrate="live" name="fusi01pvm" path="/global/xenconfig" recovery="restart"/>
>                <vm autostart="1" domain="fusi01_hvm_dom" exclusive="0" migrate="live" name="fusi01hvm" path="/global/xenconfig" recovery="restart"/>
>        </rm>
> </cluster>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From mharrington at eons.com  Tue Aug 26 19:29:17 2008
From: mharrington at eons.com (Matt Harrington)
Date: Tue, 26 Aug 2008 15:29:17 -0400
Subject: [Linux-cluster] fence_apc unknown screen encountered
Message-ID: <48B4598D.9040901@eons.com>

I am encountering an unknown screen exception from fence_apc when trying 
to fence a system in a 3-node cluster (centos5.2 cman-2.0.84-2.el5).  
What is interesting, is that I can fence the other two nodes in my 
cluster.  I believe the difference is that the problem node has two 
power supplies which means that fence_apc is called with off/on instead 
of restart.  This also requires connecting to two different pdus.  It 
could also be that there is something wrong with the config which was 
taken from an older system and updated with luci.  I am unable to 
descern any differences between the menus of the two pdus.



[root at fs102 ~]# /sbin/fence_node fs103
agent "fence_apc" reports: Traceback (most recent call last):
  File "/sbin/fence_apc", line 829, in ?
    main()
  File "/sbin/fence_apc", line 303, in main
    do_power_off(sock)
  File "/sbin/fence_apc", line 813, in do_power_off
    x = do_power_switch(sock, "off")
  File "/sbi
agent "fence_apc" reports: n/fence_apc", line 611, in do_power_switch
    result_code, response = power_off(txt + ndbuf)
  File "/sbin/fence_apc", line 817, in power_off
    x = power_switch(buffer, False, "2", "3");
  File "/sbin/fence_apc", line 810, in power_switch
    raise "un
agent "fence_apc" reports: known screen encountered in \n" + str(lines) 
+ "\n"
unknown screen encountered in
['', '> 2', '', '', '------- Configure Outlet 
------------------------------------------------------', '', '    #  
State  Ph  Name                     Pwr On Dly  Pwr Off D
agent "fence_apc" reports: ly  Reboot Dur.', '   
----------------------------------------------------------------------------', 
'    2  ON     1   fs103                    0 sec       0 sec        5 
sec', '', '     1- Outlet Name         : fs103', '     2- Power On 
Delay(sec) : 0',
agent "fence_apc" reports:  '     3- Power Off Delay(sec): 0', '     4- 
Reboot Duration(sec): 5', '     5- Accept Changes      : ', '', '     ?- 
Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']


[root at fs102 ~]# /sbin/fence_apc -a 10.10.1.200 -l pdu -p pdu -n 13 -o status
Status check successful. Port 13 is OFF
[root at fs102 ~]# /sbin/fence_apc -a 10.10.1.201 -l pdu -p pdu -n 2 -o status
Status check successful. Port 2 is ON
[root at fs102 ~]# /sbin/fence_apc -a 10.10.1.201 -l pdu -p pdu -n 2 -o off
Traceback (most recent call last):
  File "/sbin/fence_apc", line 829, in ?
    main()
  File "/sbin/fence_apc", line 303, in main
    do_power_off(sock)
  File "/sbin/fence_apc", line 813, in do_power_off
    x = do_power_switch(sock, "off")
  File "/sbin/fence_apc", line 611, in do_power_switch
    result_code, response = power_off(txt + ndbuf)
  File "/sbin/fence_apc", line 817, in power_off
    x = power_switch(buffer, False, "2", "3");
  File "/sbin/fence_apc", line 810, in power_switch
    raise "unknown screen encountered in \n" + str(lines) + "\n"
unknown screen encountered in
['2', '', '', '------- Configure Outlet 
------------------------------------------------------', '', '    #  
State  Ph  Name                     Pwr On Dly  Pwr Off Dly  Reboot 
Dur.', '   
----------------------------------------------------------------------------', 
'    2  ON     1   fs103                    0 sec       0 sec        5 
sec', '', '     1- Outlet Name         : fs103', '     2- Power On 
Delay(sec) : 0', '     3- Power Off Delay(sec): 0', '     4- Reboot 
Duration(sec): 5', '     5- Accept Changes      : ', '', '     ?- Help, 
<ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']




<cluster config_version="143" name="gfs_cluster">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="fs101" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="pdu102.eons.dev" port="12"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="fs102" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="pdu101.eons.dev" port="8"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="fs103" nodeid="3" votes="1">
            <fence>
                <method name="1">
                    <device name="pdu101.eons.dev" option="off" port="13"/>
                    <device name="pdu102.eons.dev" option="off" port="2"/>
                    <device name="pdu101.eons.dev" option="on" port="13"/>
                    <device name="pdu102.eons.dev" option="on" port="2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="10.10.1.200" 
login="pdu" name="pdu101.eons.dev" passwd="pdu"/>
                <fencedevice agent="fence_apc" ipaddr="10.10.1.201" 
login="pdu" name="pdu102.eons.dev" passwd="pdu"/>
        </fencedevices>
...
</cluster>




[root at fs102 ~]# cat /etc/redhat-release
CentOS release 5.2 (Final)
[root at fs102 ~]# rpm -qf /sbin/fence_apc
cman-2.0.84-2.el5
[root at fs102 ~]# rpm -q luci
luci-0.12.0-7.el5.centos.3


pdu101:
American Power Conversion               Network Management Card AOS      
v3.5.9
(c) Copyright 2008 All Rights Reserved  Rack PDU APP                     
v3.5.8

pdu102:
American Power Conversion               Network Management Card AOS      
v3.5.9
(c) Copyright 2008 All Rights Reserved  Rack PDU APP                     
v3.5.8



From mharrington at eons.com  Tue Aug 26 19:43:52 2008
From: mharrington at eons.com (Matt Harrington)
Date: Tue, 26 Aug 2008 15:43:52 -0400
Subject: [Linux-cluster] fence_apc unknown screen encountered
In-Reply-To: <48B4598D.9040901@eons.com>
References: <48B4598D.9040901@eons.com>
Message-ID: <48B45CF8.1030308@eons.com>

I can pinpoint the problem with verbose logging.  For some reason, 
fence_apc repeats the outlet selection menu option.  On an outlet number 
 > 2, this is harmless, but in the case where the outlet <= 2, the 
script horks.  Here is the output on a working call to illustrate the 
duplicate selection; "13" is entered twice:

^M------- Outlet Control/Configuration 
------------------------------------------

     1- Outlet 1                 ON
     2- build                    ON
     3- www103                   ON
     4- www102                   ON
     5- Outlet 5                 ON
     6- Outlet 6                 ON
     7- Outlet 7                 ON
     8- fs102                    ON
     9- build                    ON
    10- app102                   ON
    11- Outlet 11                ON
    12- db103                    ON
    13- fs103                    ON
    14- Outlet 14                ON
    15- Outlet 15                ON
    16- Outlet 16                ON
    17- Master Control/Configuration

     <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log
 > 13

^M------- fs103 
-----------------------------------------------------------------

        Name         : fs103
        Outlet       : 13
        State        : ON

     1- Control Outlet   
     2- Configure Outlet 

     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log
 > 13

^M------- fs103 
-----------------------------------------------------------------

        Name         : fs103
        Outlet       : 13
        State        : ON

     1- Control Outlet   
     2- Configure Outlet 

     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log
 > 1


Matt Harrington wrote:
> I am encountering an unknown screen exception from fence_apc when 
> trying to fence a system in a 3-node cluster (centos5.2 
> cman-2.0.84-2.el5).  What is interesting, is that I can fence the 
> other two nodes in my cluster.  I believe the difference is that the 
> problem node has two power supplies which means that fence_apc is 
> called with off/on instead of restart.  This also requires connecting 
> to two different pdus.  It could also be that there is something wrong 
> with the config which was taken from an older system and updated with 
> luci.  I am unable to descern any differences between the menus of the 
> two pdus.
>
>
>
> [root at fs102 ~]# /sbin/fence_node fs103
> agent "fence_apc" reports: Traceback (most recent call last):
>  File "/sbin/fence_apc", line 829, in ?
>    main()
>  File "/sbin/fence_apc", line 303, in main
>    do_power_off(sock)
>  File "/sbin/fence_apc", line 813, in do_power_off
>    x = do_power_switch(sock, "off")
>  File "/sbi
> agent "fence_apc" reports: n/fence_apc", line 611, in do_power_switch
>    result_code, response = power_off(txt + ndbuf)
>  File "/sbin/fence_apc", line 817, in power_off
>    x = power_switch(buffer, False, "2", "3");
>  File "/sbin/fence_apc", line 810, in power_switch
>    raise "un
> agent "fence_apc" reports: known screen encountered in \n" + 
> str(lines) + "\n"
> unknown screen encountered in
> ['', '> 2', '', '', '------- Configure Outlet 
> ------------------------------------------------------', '', '    #  
> State  Ph  Name                     Pwr On Dly  Pwr Off D
> agent "fence_apc" reports: ly  Reboot Dur.', '   
> ----------------------------------------------------------------------------', 
> '    2  ON     1   fs103                    0 sec       0 sec        5 
> sec', '', '     1- Outlet Name         : fs103', '     2- Power On 
> Delay(sec) : 0',
> agent "fence_apc" reports:  '     3- Power Off Delay(sec): 0', '     
> 4- Reboot Duration(sec): 5', '     5- Accept Changes      : ', '', 
> '     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']
>
>
> [root at fs102 ~]# /sbin/fence_apc -a 10.10.1.200 -l pdu -p pdu -n 13 -o 
> status
> Status check successful. Port 13 is OFF
> [root at fs102 ~]# /sbin/fence_apc -a 10.10.1.201 -l pdu -p pdu -n 2 -o 
> status
> Status check successful. Port 2 is ON
> [root at fs102 ~]# /sbin/fence_apc -a 10.10.1.201 -l pdu -p pdu -n 2 -o off
> Traceback (most recent call last):
>  File "/sbin/fence_apc", line 829, in ?
>    main()
>  File "/sbin/fence_apc", line 303, in main
>    do_power_off(sock)
>  File "/sbin/fence_apc", line 813, in do_power_off
>    x = do_power_switch(sock, "off")
>  File "/sbin/fence_apc", line 611, in do_power_switch
>    result_code, response = power_off(txt + ndbuf)
>  File "/sbin/fence_apc", line 817, in power_off
>    x = power_switch(buffer, False, "2", "3");
>  File "/sbin/fence_apc", line 810, in power_switch
>    raise "unknown screen encountered in \n" + str(lines) + "\n"
> unknown screen encountered in
> ['2', '', '', '------- Configure Outlet 
> ------------------------------------------------------', '', '    #  
> State  Ph  Name                     Pwr On Dly  Pwr Off Dly  Reboot 
> Dur.', '   
> ----------------------------------------------------------------------------', 
> '    2  ON     1   fs103                    0 sec       0 sec        5 
> sec', '', '     1- Outlet Name         : fs103', '     2- Power On 
> Delay(sec) : 0', '     3- Power Off Delay(sec): 0', '     4- Reboot 
> Duration(sec): 5', '     5- Accept Changes      : ', '', '     ?- 
> Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']
>
>
>
>
> <cluster config_version="143" name="gfs_cluster">
>    <fence_daemon clean_start="0" post_fail_delay="0" 
> post_join_delay="3"/>
>    <clusternodes>
>        <clusternode name="fs101" nodeid="1" votes="1">
>            <fence>
>                <method name="1">
>                    <device name="pdu102.eons.dev" port="12"/>
>                </method>
>            </fence>
>        </clusternode>
>        <clusternode name="fs102" nodeid="2" votes="1">
>            <fence>
>                <method name="1">
>                    <device name="pdu101.eons.dev" port="8"/>
>                </method>
>            </fence>
>        </clusternode>
>        <clusternode name="fs103" nodeid="3" votes="1">
>            <fence>
>                <method name="1">
>                    <device name="pdu101.eons.dev" option="off" 
> port="13"/>
>                    <device name="pdu102.eons.dev" option="off" port="2"/>
>                    <device name="pdu101.eons.dev" option="on" port="13"/>
>                    <device name="pdu102.eons.dev" option="on" port="2"/>
>                </method>
>            </fence>
>        </clusternode>
>    </clusternodes>
>        <fencedevices>
>                <fencedevice agent="fence_apc" ipaddr="10.10.1.200" 
> login="pdu" name="pdu101.eons.dev" passwd="pdu"/>
>                <fencedevice agent="fence_apc" ipaddr="10.10.1.201" 
> login="pdu" name="pdu102.eons.dev" passwd="pdu"/>
>        </fencedevices>
> ...
> </cluster>
>
>
>
>
> [root at fs102 ~]# cat /etc/redhat-release
> CentOS release 5.2 (Final)
> [root at fs102 ~]# rpm -qf /sbin/fence_apc
> cman-2.0.84-2.el5
> [root at fs102 ~]# rpm -q luci
> luci-0.12.0-7.el5.centos.3
>
>
> pdu101:
> American Power Conversion               Network Management Card 
> AOS      v3.5.9
> (c) Copyright 2008 All Rights Reserved  Rack PDU 
> APP                     v3.5.8
>
> pdu102:
> American Power Conversion               Network Management Card 
> AOS      v3.5.9
> (c) Copyright 2008 All Rights Reserved  Rack PDU 
> APP                     v3.5.8
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Millard.Matt at principal.com  Tue Aug 26 19:46:59 2008
From: Millard.Matt at principal.com (Millard, Matt)
Date: Tue, 26 Aug 2008 14:46:59 -0500
Subject: [Linux-cluster] Unable to communicate with luci server
Message-ID: <A3980FD020E35F4D966478FA59437C4D16747EB7@PFGDSMMBX010.principalusa.corp.principal.com>

I'm working on setting up a brand new cluster using Conga.  After I initialize Luci and restart it, go into Cluster, Create a New Cluster and enter in fqdn hostnames for each of my two nodes, password and select "View SSL cert fingerprints" I get a popup that says "Unable to communicate with luci server."  I've tried to set this up on two separate Luci servers with the same results.  I've verified that /etc/hosts has the fqdn names for each node, multicast is working (pingable), port 11111 is open (per nmap).  I get no errors in /var/log/messages.  I've tried both the public and private networks to get the nodes setup with no results.

Also I can just be sitting at the "Clusters" page with nothing on it and the same "Unable to communicate with luci server."  Error will pop up randomly in my browser.  Any thoughts on where to start with this? 

RHEL5 - U2 on both nodes and on the Luci server.

Matt?



-----Message Disclaimer-----

This e-mail message is intended only for the use of the individual or
entity to which it is addressed, and may contain information that is
privileged, confidential and exempt from disclosure under applicable law.
If you are not the intended recipient, any dissemination, distribution or
copying of this communication is strictly prohibited. If you have
received this communication in error, please notify us immediately by
reply email to Connect at principal.com and delete or destroy all copies of
the original message and attachments thereto. Email sent to or from the
Principal Financial Group or any of its member companies may be retained
as required by law or regulation.

Nothing in this message is intended to constitute an Electronic signature
for purposes of the Uniform Electronic Transactions Act (UETA) or the
Electronic Signatures in Global and National Commerce Act ("E-Sign")
unless a specific statement to the contrary is included in this message.

While this communication may be used to promote or market a transaction
or an idea that is discussed in the publication, it is intended to provide
general information about the subject matter covered and is provided with
the understanding that The Principal is not rendering legal, accounting,
or tax advice. It is not a marketed opinion and may not be used to avoid
penalties under the Internal Revenue Code. You should consult with
appropriate counsel or other advisors on all matters pertaining to legal,
tax, or accounting obligations and requirements.




From lp at xbe.ch  Tue Aug 26 20:36:52 2008
From: lp at xbe.ch (Lorenz Pfiffner)
Date: Tue, 26 Aug 2008 22:36:52 +0200
Subject: [Linux-cluster] Linux cluster moved to new subdomain
In-Reply-To: <0AB7D520EBDCE743BFAE18CF2B5B04C901EAFECB03@G3W0070.americas.hpqcorp.net>
References: <8ee061010806160945pc2418f8w95749c8bf566d02d@mail.gmail.com>	<20080616191641.GA17965@kallisti.us>	<4856C385.8000800@gmail.com>	<8ee061010806170954y1d6d7555qf7cd5a137b63e018@mail.gmail.com>	<485819DC.90503@gmail.com>	<8ee061010806171522l1d18480em861bffb87f8b8be2@mail.gmail.com>	<8ee061010806181048j2d9e4635n6e0133c855e4ea06@mail.gmail.com>	<485A71ED.8030305@gmail.com>	<485A7E59.8020107@gmail.com>	<0AB7D520EBDCE743BFAE18CF2B5B04C901EAFECAFC@G3W0070.americas.hpqcorp.net>	<001301c8d2cb$9317be60$b9473b20$@gr>
	<0AB7D520EBDCE743BFAE18CF2B5B04C901EAFECB03@G3W0070.americas.hpqcorp.net>
Message-ID: <48B46964.3040708@xbe.ch>

Hi Dave

Did you solve this problem? I have a similar case here. I changed the 
IP's of one of my two node cluster and now I can't get the cluster back 
into luci. I assume luci stores it's config in a binary Zope Data.fs 
file. But how can change something there?

BTW: Changing node IP's was pain! A Howto would be an interesting thing 
for the Knowledge Base or even RedHat Magazine. I think it would really 
be appreciated by many Cluster Suite operators.

Greetz
Lorenz

Harding, David wrote:
>  Yes,  The issue is that when I first built the cluster I used the full qualified host name. I changed the cluster.conf file,
> But some where it is still getting the old fully qualified host name.  I can see it in the log files.  Where else does it
> Store the cluster host names other then the cluster.conf file ?
> 
> dave
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Theophanis Kontogiannis
> Sent: Friday, June 20, 2008 7:49 AM
> To: 'linux clustering'
> Subject: RE: [Linux-cluster] Linux cluster moved to new subdomain
> 
> Hi Dave,
> 
> Did you make the appropriate changes on the iptables, to reflect the new IPs given to the servers?
> 
> Sincerely,
> 
> Theophanis Kontogiannis
> 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Harding, David
> Sent: Thursday, June 19, 2008 11:36 PM
> To: 'linux clustering'
> Subject: [Linux-cluster] Linux cluster moved to new subdomain
> 
> 
> 
> 
> We moved our Linux cluster to a new tcpip subnet.
> It is running Redhat V4 update 6.  After the move I fixed the cluster.conf to reflect the subnet name.
> 
> I then went into luci.  When I select the cluster tab I get an error message stating "an error occurred when trying to contact any of the nodes in the ermmro cluster"
> The systems show up ok in the homepase tab and everything looks correct under the storage tab.
> 
> dave
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From michael.osullivan at auckland.ac.nz  Tue Aug 26 23:06:14 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Wed, 27 Aug 2008 11:06:14 +1200 (NZST)
Subject: [Linux-cluster] gfs over raid/lvm or any other option?
Message-ID: <1624.128.187.190.8.1219791974.squirrel@mail.esc.auckland.ac.nz>

Hi Alex,

Sorry for erasing the thread, I hope this posting is ok.

I am new to implementing storage and clustering, so I may not understand
the issues building the storage system.

I initially used conga to create the GFS2 file system with multiple
journals (-j 2 or more, I don't remember exactly). It seemed to work ok.
However, I have had some problems with conga, so I am trying to control
things from the command line now. It is a learning process...!

In terms of scalibility, couldn't I add a third storage server to the
RAID-5, grow the logical volume and then grow the GFS?

Thanks, Mike

Date: Tue, 26 Aug 2008 10:28:32 +0300
From: Alex <linux at vfemail.net>
Subject: Re: [Linux-cluster] gfs over raid/lvm or any other option?
To: linux clustering <linux-cluster at redhat.com>
Message-ID: <200808261028.32699.linux at vfemail.net>
Content-Type: text/plain;  charset="iso-8859-1"

On Monday 25 August 2008 20:30, michael.osullivan at auckland.ac.nz wrote:

> > There are two approaches I have seen that may be suitable:
> >
> > 1) lustre - I didn't like this as it needed two "special" meta-servers
and
> > I was building a smaller storage system;
> > 2) pvfs
> >
>

Hi Mike,

Please, do not erase thread, will be difficult to trace the subject and
content...

PVFS (Parallel Virtual File System) - has no redundancy - Lose one node lose
them all. Also their website is down (www.pvfs.org), so you can add no
reliability too ...

Your setup will work only in case as you are using mkfs.gfs -j 1... else is
broken. Also, has no scallability.

Regards,
Alx

> > I did not use either of these approaches as they focus on keeping the
> > storage system running, rather than keeping the data highly available.
> >
> > For my test storage I wanted to build a system that would still present
> > the stored data even if a single point in the network fails.
> >
> > I have used iSCSI, mdadm and GFS as follows. I have two storage servers
> > with alomst 2TB of disk space for storage each. Both of these two servers
> > present a single logical volume to a 2-node cluster using iSCSI. There
are
> > 2 NICs on each storage server, so each volume is accessible via two
ports.
> > There are 2 NICs on each cluster node also. The storage system was
> > connected to the cluster using some ethernet switches. Using mdadm I have
> > successfully multipathed each logical volume and then using mdadm again I
> > have built a RAID-5 device from these two volumes. The raid device is
> > successfully detected by each cluster node. On this raid device I created
> > a logical volume using clvm and on that logical volume I built a GFS to
> > control cluster access to the storage. The GFS has been successfully
> > mounted on both cluster nodes.
> >
> > Despite some problems with the cluster (due to my own limited knowledge
> > about clusters and fencing) I have successfully created and accessed
files
> > on the GFS from both cluster nodes. I am in the process of sorting out
the
> > clustering problems and testing the configuration using IOMeter.
> >
> > Hope this helps, Mike
> >



From swhiteho at redhat.com  Wed Aug 27 08:30:52 2008
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 27 Aug 2008 09:30:52 +0100
Subject: [Linux-cluster] GFS2 becomes non-responsive, no fencing
In-Reply-To: <20080826175004.GB18608@kallisti.us>
References: <20080825232941.GB6707@kallisti.us>
	<20080826175004.GB18608@kallisti.us>
Message-ID: <1219825852.20622.191.camel@quoit>

Hi,

On Tue, 2008-08-26 at 13:50 -0400, Ross Vandegrift wrote:
> On Mon, Aug 25, 2008 at 07:29:41PM -0400, Ross Vandegrift wrote:
> > Today, the app on one node died.  I logged in, assumed things were
> > fenced, and tried to go about my business of restarting it.  After
> > some fiddling, I got the box back in the cluster fine.
> > 
> > It just happened again, and I've dug in a bit more.  I was wrong - the
> > failed node has not been fenced.  The last thing in dmesg on the
> > failing node is:
> 
> Some more information gleaned today.  I left the node running last
> night without fixing the GFS2 access.  Today, we noticed that
> filesystem access has been restored for new processes - it's slow
> (sometimes taking minutes to return an ls for 10 items), but
> it eventually responds.  The application threads that are sleeping in D
> still haven't received their data from reads issued yesterday
> afternoon.
> 
> A cursory examination of DLM-related keys in /sys reveal that the
> working and broken nodes are configured the same.  No major disparity
> in terms of memory use, except the obvious fact that the broken node
> shows very litte disk IO.
> 
> I'm pretty much at a loss - any ideas would be very welcome.
> 

There are a few things to check. Firstly compare /proc/slabinfo on a
slow node with that on a node running at normal speed. That will tell
you if there is a problem with memory leaking or not being reclaimed
properly.

If the node seems stuck, then try and echo t >/proc/sysrq-trigger and
look at the backtraces of any process which has called into gfs2 to see
where they are waiting. Also a dump of the glocks (you'll need to have
debugfs mounted) on all nodes should then allow you to work out whether
something on the stuck nodes is waiting for something on one of the
other nodes. Sometimes its useful to look at the DLM locks as well.

If you feel that you'd rather not go through all the details yourself,
then please put the info into a bugzilla entry and we'll take a look,

Steve.




From linux at vfemail.net  Wed Aug 27 10:01:57 2008
From: linux at vfemail.net (Alex)
Date: Wed, 27 Aug 2008 13:01:57 +0300
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
Message-ID: <200808271301.57414.linux@vfemail.net>

Hi all,

I have 3 nodes, forming a cluster. How sould be configured a service in 
cluster.conf file in order to be able to stop or to start httpd daemon on all 
our nodes at the same time? All i can find in docs is related to failover 
scenario (stoping httpd on one node wil cause starting httpd on other node) 
which is not what i need. For nodes management i am using conga, so, i would 
like to have a service to do that? Is possible? If not, should i use other 
external tools (like nagios) to do that?

Regards,
Alx



From bruno at redix.com.br  Wed Aug 27 11:31:49 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Wed, 27 Aug 2008 08:31:49 -0300
Subject: [Linux-cluster] Firewall Failover
Message-ID: <48B53B25.7030701@redix.com.br>

Hi,

I have create a service firewall with some resources like IP, MYSQL, and 
other scripts
My firewall cluster has 2 nodes (node1 and node2)
The firewall service is runnig on node 1, including mysqld, when i kill 
mysqld in node1, the service still running on node1,
My question is:
Should node1 be fenced?
How i test the mysqld in both server to be fenced when it is down?


NOTE: fence are working in both servers when i kill a service like a IP.

-- 
Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



From bruno at redix.com.br  Wed Aug 27 12:04:40 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Wed, 27 Aug 2008 09:04:40 -0300
Subject: [Linux-cluster] Firewall Failover
In-Reply-To: <48B53B25.7030701@redix.com.br>
References: <48B53B25.7030701@redix.com.br>
Message-ID: <48B542D8.6040503@redix.com.br>

My /var/log/messages:

Aug 27 08:00:05 node1 clurgmgrd: [3817]: <err> Checking Existence Of 
File /var/run/cluster/mysql/mysql:MySQL.pid [mysql:MySQL] > Failed
Aug 27 08:00:05 node1 clurgmgrd: [3817]: <err> Monitoring Service 
mysql:MySQL > Service Is Not Running
Aug 27 08:00:05 node1 clurgmgrd[3817]: <notice> status on mysql "MySQL" 
returned 1 (generic error)
Aug 27 08:00:05 node1 clurgmgrd[3817]: <notice> Stopping service 
service:firewall
Aug 27 08:00:05 node1 clurgmgrd: [3817]: <err> Checking Existence Of 
File /var/run/cluster/mysql/mysql:MySQL.pid [mysql:MySQL] > Failed - 
File Doesn't Exist
Aug 27 08:00:05 node1 clurgmgrd: [3817]: <err> Stopping Service 
mysql:MySQL > Failed
Aug 27 08:00:05 node1 clurgmgrd[3817]: <notice> stop on mysql "MySQL" 
returned 1 (generic error)
Aug 27 08:00:05 node1 squid[8578]: Squid Parent: child process 8580 
exited with status 0
Aug 27 08:00:07 node1 openvpn[8554]: event_wait : Interrupted system 
call (code=4)
Aug 27 08:00:07 node1 openvpn[8554]: TCP/UDP: Closing socket
Aug 27 08:00:07 node1 openvpn[8554]: Closing TUN/TAP interface
Aug 27 08:00:07 node1 openvpn[8554]: SIGTERM[hard,] received, process 
exiting
Aug 27 08:00:07 node1 clurgmgrd[3817]: <crit> #12: RG service:firewall 
failed to stop; intervention required
Aug 27 08:00:07 node1 clurgmgrd[3817]: <notice> Service service:firewall 
is failed

its detect that mysqld is down...
how can i start this service again on other node automatic?


clustat shows:

Cluster Status for bruniques @ Wed Aug 27 08:02:32 2008
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 node1                                                               1 
Online, Local, rgmanager
 node2                                                               2 
Online, rgmanager

 Service Name                                                     Owner 
(Last)                                                     State        
 ------- ----                                                     ----- 
------                                                     -----        
 service:firewall                                                 
(node1)                                                          failed



but it not realocate the service to other node.



Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------



Bruno Frensch Deschamps escreveu:
> Hi,
>
> I have create a service firewall with some resources like IP, MYSQL, 
> and other scripts
> My firewall cluster has 2 nodes (node1 and node2)
> The firewall service is runnig on node 1, including mysqld, when i 
> kill mysqld in node1, the service still running on node1,
> My question is:
> Should node1 be fenced?
> How i test the mysqld in both server to be fenced when it is down?
>
>
> NOTE: fence are working in both servers when i kill a service like a IP.
>



From mgrac at redhat.com  Wed Aug 27 13:22:57 2008
From: mgrac at redhat.com (Marek 'marx' Grac)
Date: Wed, 27 Aug 2008 15:22:57 +0200
Subject: [Linux-cluster] fence_apc unknown screen encountered
In-Reply-To: <48B45CF8.1030308@eons.com>
References: <48B4598D.9040901@eons.com> <48B45CF8.1030308@eons.com>
Message-ID: <48B55531.9000404@redhat.com>

Hi,

Can you test our new fence agent for APC? You can find it in git 
repository (available also from web 
http://git.fedorahosted.org/git/cluster.git) in branch RHEL5 (also in 
master, stable2, ...). Just download package pexpect, 
fence/agents/apc/fence_apc.py and fence/agents/lib/fencing.py.py (rename 
to fencing.py); and try to run it from command-line (./fence_apc.py -h). 
Please let me know the results :)

marx,

Matt Harrington wrote:
> I can pinpoint the problem with verbose logging.  For some reason, 
> fence_apc repeats the outlet selection menu option.  On an outlet 
> number > 2, this is harmless, but in the case where the outlet <= 2, 
> the script horks.  Here is the output on a working call to illustrate 
> the duplicate selection; "13" is entered twice:
>
> ^M------- Outlet Control/Configuration 
> ------------------------------------------
>
>     1- Outlet 1                 ON
>     2- build                    ON
>     3- www103                   ON
>     4- www102                   ON
>     5- Outlet 5                 ON
>     6- Outlet 6                 ON
>     7- Outlet 7                 ON
>     8- fs102                    ON
>     9- build                    ON
>    10- app102                   ON
>    11- Outlet 11                ON
>    12- db103                    ON
>    13- fs103                    ON
>    14- Outlet 14                ON
>    15- Outlet 15                ON
>    16- Outlet 16                ON
>    17- Master Control/Configuration
>
>     <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log
> > 13
>
> ^M------- fs103 
> -----------------------------------------------------------------
>
>        Name         : fs103
>        Outlet       : 13
>        State        : ON
>
>     1- Control Outlet       2- Configure Outlet
>     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log
> > 13
>
> ^M------- fs103 
> -----------------------------------------------------------------
>
>        Name         : fs103
>        Outlet       : 13
>        State        : ON
>
>     1- Control Outlet       2- Configure Outlet
>     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log
> > 1
>
>
> Matt Harrington wrote:
>> I am encountering an unknown screen exception from fence_apc when 
>> trying to fence a system in a 3-node cluster (centos5.2 
>> cman-2.0.84-2.el5).  What is interesting, is that I can fence the 
>> other two nodes in my cluster.  I believe the difference is that the 
>> problem node has two power supplies which means that fence_apc is 
>> called with off/on instead of restart.  This also requires connecting 
>> to two different pdus.  It could also be that there is something 
>> wrong with the config which was taken from an older system and 
>> updated with luci.  I am unable to descern any differences between 
>> the menus of the two pdus.
>>
>>
>>
>> [root at fs102 ~]# /sbin/fence_node fs103
>> agent "fence_apc" reports: Traceback (most recent call last):
>>  File "/sbin/fence_apc", line 829, in ?
>>    main()
>>  File "/sbin/fence_apc", line 303, in main
>>    do_power_off(sock)
>>  File "/sbin/fence_apc", line 813, in do_power_off
>>    x = do_power_switch(sock, "off")
>>  File "/sbi
>> agent "fence_apc" reports: n/fence_apc", line 611, in do_power_switch
>>    result_code, response = power_off(txt + ndbuf)
>>  File "/sbin/fence_apc", line 817, in power_off
>>    x = power_switch(buffer, False, "2", "3");
>>  File "/sbin/fence_apc", line 810, in power_switch
>>    raise "un
>> agent "fence_apc" reports: known screen encountered in \n" + 
>> str(lines) + "\n"
>> unknown screen encountered in
>> ['', '> 2', '', '', '------- Configure Outlet 
>> ------------------------------------------------------', '', '    #  
>> State  Ph  Name                     Pwr On Dly  Pwr Off D
>> agent "fence_apc" reports: ly  Reboot Dur.', '   
>> ----------------------------------------------------------------------------', 
>> '    2  ON     1   fs103                    0 sec       0 sec        
>> 5 sec', '', '     1- Outlet Name         : fs103', '     2- Power On 
>> Delay(sec) : 0',
>> agent "fence_apc" reports:  '     3- Power Off Delay(sec): 0', '     
>> 4- Reboot Duration(sec): 5', '     5- Accept Changes      : ', '', 
>> '     ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']
>>
>>
>> [root at fs102 ~]# /sbin/fence_apc -a 10.10.1.200 -l pdu -p pdu -n 13 -o 
>> status
>> Status check successful. Port 13 is OFF
>> [root at fs102 ~]# /sbin/fence_apc -a 10.10.1.201 -l pdu -p pdu -n 2 -o 
>> status
>> Status check successful. Port 2 is ON
>> [root at fs102 ~]# /sbin/fence_apc -a 10.10.1.201 -l pdu -p pdu -n 2 -o off
>> Traceback (most recent call last):
>>  File "/sbin/fence_apc", line 829, in ?
>>    main()
>>  File "/sbin/fence_apc", line 303, in main
>>    do_power_off(sock)
>>  File "/sbin/fence_apc", line 813, in do_power_off
>>    x = do_power_switch(sock, "off")
>>  File "/sbin/fence_apc", line 611, in do_power_switch
>>    result_code, response = power_off(txt + ndbuf)
>>  File "/sbin/fence_apc", line 817, in power_off
>>    x = power_switch(buffer, False, "2", "3");
>>  File "/sbin/fence_apc", line 810, in power_switch
>>    raise "unknown screen encountered in \n" + str(lines) + "\n"
>> unknown screen encountered in
>> ['2', '', '', '------- Configure Outlet 
>> ------------------------------------------------------', '', '    #  
>> State  Ph  Name                     Pwr On Dly  Pwr Off Dly  Reboot 
>> Dur.', '   
>> ----------------------------------------------------------------------------', 
>> '    2  ON     1   fs103                    0 sec       0 sec        
>> 5 sec', '', '     1- Outlet Name         : fs103', '     2- Power On 
>> Delay(sec) : 0', '     3- Power Off Delay(sec): 0', '     4- Reboot 
>> Duration(sec): 5', '     5- Accept Changes      : ', '', '     ?- 
>> Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']
>>
>>
>>
>>
>> <cluster config_version="143" name="gfs_cluster">
>>    <fence_daemon clean_start="0" post_fail_delay="0" 
>> post_join_delay="3"/>
>>    <clusternodes>
>>        <clusternode name="fs101" nodeid="1" votes="1">
>>            <fence>
>>                <method name="1">
>>                    <device name="pdu102.eons.dev" port="12"/>
>>                </method>
>>            </fence>
>>        </clusternode>
>>        <clusternode name="fs102" nodeid="2" votes="1">
>>            <fence>
>>                <method name="1">
>>                    <device name="pdu101.eons.dev" port="8"/>
>>                </method>
>>            </fence>
>>        </clusternode>
>>        <clusternode name="fs103" nodeid="3" votes="1">
>>            <fence>
>>                <method name="1">
>>                    <device name="pdu101.eons.dev" option="off" 
>> port="13"/>
>>                    <device name="pdu102.eons.dev" option="off" 
>> port="2"/>
>>                    <device name="pdu101.eons.dev" option="on" 
>> port="13"/>
>>                    <device name="pdu102.eons.dev" option="on" port="2"/>
>>                </method>
>>            </fence>
>>        </clusternode>
>>    </clusternodes>
>>        <fencedevices>
>>                <fencedevice agent="fence_apc" ipaddr="10.10.1.200" 
>> login="pdu" name="pdu101.eons.dev" passwd="pdu"/>
>>                <fencedevice agent="fence_apc" ipaddr="10.10.1.201" 
>> login="pdu" name="pdu102.eons.dev" passwd="pdu"/>
>>        </fencedevices>
>> ...
>> </cluster>
>>
>>
>>
>>
>> [root at fs102 ~]# cat /etc/redhat-release
>> CentOS release 5.2 (Final)
>> [root at fs102 ~]# rpm -qf /sbin/fence_apc
>> cman-2.0.84-2.el5
>> [root at fs102 ~]# rpm -q luci
>> luci-0.12.0-7.el5.centos.3
>>
>>
>> pdu101:
>> American Power Conversion               Network Management Card 
>> AOS      v3.5.9
>> (c) Copyright 2008 All Rights Reserved  Rack PDU 
>> APP                     v3.5.8
>>
>> pdu102:
>> American Power Conversion               Network Management Card 
>> AOS      v3.5.9
>> (c) Copyright 2008 All Rights Reserved  Rack PDU 
>> APP                     v3.5.8
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
Marek Grac
Red Hat Czech s.r.o.



From mgrac at redhat.com  Wed Aug 27 13:28:55 2008
From: mgrac at redhat.com (Marek 'marx' Grac)
Date: Wed, 27 Aug 2008 15:28:55 +0200
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <200808271301.57414.linux@vfemail.net>
References: <200808271301.57414.linux@vfemail.net>
Message-ID: <48B55697.7020904@redhat.com>

Hi,
Alex wrote:
> Hi all,
>
> I have 3 nodes, forming a cluster. How sould be configured a service in 
> cluster.conf file in order to be able to stop or to start httpd daemon on all 
> our nodes at the same time? All i can find in docs is related to failover 
> scenario (stoping httpd on one node wil cause starting httpd on other node) 
> which is not what i need. For nodes management i am using conga, so, i would 
> like to have a service to do that? Is possible? If not, should i use other 
> external tools (like nagios) to do that?
>   
I don't think that it is possible to do this directly. But it should be 
easy to create several services with
httpd (perhaps with different failover domains) and then run/stop it 
using CLI tool:
        clusvcadm -e service / clusvcadm -d service
You can put this in any script and then you are able to start/stop it 
from anywhere

marx,

-- 
Marek Grac
Red Hat Czech s.r.o.



From johannes.russek at io-consulting.net  Wed Aug 27 14:32:42 2008
From: johannes.russek at io-consulting.net (=?UTF-8?B?Sm9oYW5uZXMgUnXDn2Vr?=)
Date: Wed, 27 Aug 2008 16:32:42 +0200
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <48B55697.7020904@redhat.com>
References: <200808271301.57414.linux@vfemail.net>
	<48B55697.7020904@redhat.com>
Message-ID: <48B5658A.9040607@io-consulting.net>

Hi Alex,
if it's just something you want to do manually, are you aware of 
"http://sourceforge.net/projects/pdsh" ?
regards,
Johannes

Marek 'marx' Grac schrieb:
> Hi,
> Alex wrote:
>> Hi all,
>>
>> I have 3 nodes, forming a cluster. How sould be configured a service 
>> in cluster.conf file in order to be able to stop or to start httpd 
>> daemon on all our nodes at the same time? All i can find in docs is 
>> related to failover scenario (stoping httpd on one node wil cause 
>> starting httpd on other node) which is not what i need. For nodes 
>> management i am using conga, so, i would like to have a service to do 
>> that? Is possible? If not, should i use other external tools (like 
>> nagios) to do that?
>>   
> I don't think that it is possible to do this directly. But it should 
> be easy to create several services with
> httpd (perhaps with different failover domains) and then run/stop it 
> using CLI tool:
>        clusvcadm -e service / clusvcadm -d service
> You can put this in any script and then you are able to start/stop it 
> from anywhere
>
> marx,
>



From ross at kallisti.us  Wed Aug 27 15:13:45 2008
From: ross at kallisti.us (Ross Vandegrift)
Date: Wed, 27 Aug 2008 11:13:45 -0400
Subject: [Linux-cluster] GFS2 becomes non-responsive, no fencing
In-Reply-To: <1219825852.20622.191.camel@quoit>
References: <20080825232941.GB6707@kallisti.us>
	<20080826175004.GB18608@kallisti.us>
	<1219825852.20622.191.camel@quoit>
Message-ID: <20080827151344.GC28004@kallisti.us>

On Wed, Aug 27, 2008 at 09:30:52AM +0100, Steven Whitehouse wrote:
> There are a few things to check. Firstly compare /proc/slabinfo on a
> slow node with that on a node running at normal speed. That will tell
> you if there is a problem with memory leaking or not being reclaimed
> properly.

Nothing interesting at a quick glance over slabinfo.  The active node
has much more consumed memory than the inactive one, which would seem
normal since the busted node hasn't done any work in hours.

Anything large is expected - icache, dcache, buffer_head.  The active
node has more objects then the busted one.

> If the node seems stuck, then try and echo t >/proc/sysrq-trigger and
> look at the backtraces of any process which has called into gfs2 to see
> where they are waiting. Also a dump of the glocks (you'll need to have
> debugfs mounted) on all nodes should then allow you to work out whether
> something on the stuck nodes is waiting for something on one of the
> other nodes. Sometimes its useful to look at the DLM locks as well.

Okay - processes that are hung due to I/O on GFS2 filesystem all have
a similar call stack:

 =======================
ls            D C981824A  2928 11112  11000 (NOTLB)
       e3065e48 00200086 f8cc4b9d c981824a 00008773 f5126a80 00000008 f5876550 
       c20e7550 c990b3d9 00008773 000f318f 00000001 f587665c c20049e0 00000044 
       f8c4b10b f5327ac0 f8c4b83e ffffffff 00000000 00000000 e3065e74 00000000 
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8c4b10b>] gdlm_ast+0x0/0x2 [lock_dlm]
 [<f8c4b83e>] gdlm_bast+0x0/0x76 [lock_dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d24177>] gfs2_glock_nq_atime+0x164/0x2de [gfs2]
 [<f8d2b7dd>] gfs2_readdir+0x47/0x8b [gfs2]
 [<c047f754>] filldir64+0x0/0xc5
 [<f8d2416f>] gfs2_glock_nq_atime+0x15c/0x2de [gfs2]
 [<c047f935>] vfs_readdir+0x63/0x8d
 [<c047f754>] filldir64+0x0/0xc5
 [<c047f9c2>] sys_getdents64+0x63/0xa5
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
python        D 2A0CCE0D  1676 10551  10175 (NOTLB)
       f468fd7c 00000082 00000096 2a0cce0d 00000828 00000001 00000009 f5023550 
       c20e7550 2a0ce9ae 00000828 00001ba1 00000001 f502365c c20049e0 f40af2c4 
       f8cc4b9d 00000000 f40af2c0 ffffffff 00000000 00000000 f468fda8 00000000 
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d2e911>] gfs2_permission+0x69/0xb4 [gfs2]
 [<f8d2e90a>] gfs2_permission+0x62/0xb4 [gfs2]
 [<f8d2e8a8>] gfs2_permission+0x0/0xb4 [gfs2]
 [<c047b557>] permission+0x78/0xb5
 [<c047c9c0>] __link_path_walk+0x141/0xd33
 [<f8d23322>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<c048d67a>] __mark_inode_dirty+0x13d/0x14f
 [<c047d5fb>] link_path_walk+0x49/0xbd
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047d9c8>] do_path_lookup+0x20e/0x25e
 [<c047ded5>] sys_mkdirat+0x36/0xb6
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047df64>] sys_mkdir+0xf/0x13
 [<c0404eff>] syscall_call+0x7/0xb
 =======================


I've got dumps of the glocks from debugfs, but I'm not really familiar
enough to GFS to understand what I'm reading.  I tried to file a bug
in RH Bugzilla, but am getting 503 errors.  I've posted the glock
dumps here:

http://kallisti.us/~ross/working-glocks
http://kallisti.us/~ross/broken-glocks

Can you point me in the direction of a document that explains what
the various things in the output mean?

-- 
Ross Vandegrift
ross at kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37



From linux at vfemail.net  Thu Aug 28 10:20:38 2008
From: linux at vfemail.net (Alex)
Date: Thu, 28 Aug 2008 13:20:38 +0300
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <48B55697.7020904@redhat.com>
References: <200808271301.57414.linux@vfemail.net>
	<48B55697.7020904@redhat.com>
Message-ID: <200808281320.38680.linux@vfemail.net>

On Wednesday 27 August 2008 16:28, Marek 'marx' Grac wrote:
> Hi,
>
> Alex wrote:
> > Hi all,
> >
> > I have 3 nodes, forming a cluster. How sould be configured a service in
> > cluster.conf file in order to be able to stop or to start httpd daemon on
> > all our nodes at the same time? All i can find in docs is related to
> > failover scenario (stoping httpd on one node wil cause starting httpd on
> > other node) which is not what i need. For nodes management i am using
> > conga, so, i would like to have a service to do that? Is possible? If
> > not, should i use other external tools (like nagios) to do that?
>
> I don't think that it is possible to do this directly. But it should be
> easy to create several services with
> httpd (perhaps with different failover domains) and then run/stop it
> using CLI tool:
>         clusvcadm -e service / clusvcadm -d service
> You can put this in any script and then you are able to start/stop it
> from anywhere

Hi marx,

Many thanks for your reply. This is not working for me. let simplify with 2 
nodes. I created a service and a resource, as you suggest:

[snip from cluster.conf]
<service autostart="1" exclusive="0" name="httpd_service">
<script ref="httpd_script"/>
</service>

<resources>
<script file="/etc/rc.d/init.d/httpd" name="httpd_script"/>
</resources>
[end snip]

Now, supposing that httpd_service is running on one node, clusvcadm -d 
httpd_service will disable service and stop httpd on that node, but clusvcadm 
-e httpd_service will enable service on the node where is issued command and 
will not start service also on the second node...so, as i said, is a failover 
configuration. I want to be able to controll httpd in parallel on all nodes! 
So, what is missing from my above cluster.conf setup?

Regards,
Alx



From ortsinfo at gmail.com  Thu Aug 28 10:40:29 2008
From: ortsinfo at gmail.com (Patrick Neuner)
Date: Thu, 28 Aug 2008 12:40:29 +0200
Subject: [Linux-cluster] since GFS we are having issues with load and bad fs
	performance
Message-ID: <33fbbd410808280340i1215a614r92c58d51b8a31ffe@mail.gmail.com>

Hello,

we are using a Virtualisation Virtuozzo and migrated all to GFS File System.

All virtual Environments are within different directories.
Mounted GFS: /vz/
Virutual Environments are unter /vz/private/###
Templates that could be used by all for reading only is under
/vz/templates/###

Hardware are Blades with 2 quadcore, connected to a HP 4100 SAN with
fibrechannel.
Using multipath that manages 4 routes to the storage. VRAID 5 with lots of
HDD's for fast performance on the SAN side.
450 GB.
Servers are connected to each other with GBIT.

Formatting of filesystem was done using conga (luci) with defaults.
Using newest packages from 5.2, all current updates.

Since then (using exaxt same hardware as before), now even another server,
and still,
we are having very slow performances. If talking about about load: All
Servers were between 1-2 behaving normally,
now we are between 4-12, having it jumping all the time, when doing backups,
even more problematic.
We are already in the process of moving environments away from GFS back to
ext3.

The virtual environments usually have Webserver and Mailservers running, so
lot's of small files, usually more read then write.

Opening Webpages, especially using cms systems, are slower and when doing
backups (just reading) very slow.

I knew there is some overhead at GFS, but I have following stat on our
environment:
780 MB File compressed:
copy from ext3 to ext3: taking 6 seconds
copy from ext3 to gfs: taking 12 seconds
copy from gfs to gfs: taking 15 seconds.

(gfs and ext3 are LUN's from the same SAN).

Here I am talking about 1 file, so locking shouldn't be an issue.
I have googled a lot about performance, and found some hints (fast_fs and so
on) which could be changes using gfs_tools, but it seems, they are not
available with standard rpms

Everything else seems stable for us, just the performance of the fs is soo
slow on all servers connected to the gfs, and I know it's not our hardware,
using ext3 on the same hardware is like way faster. not 20 % or so, but
200-300%.

So what I would like to know, as I am sure lot's of people are using GFS
with a bigger environment than we do (only 3 servers on 450 GB GFS), if this
is a normal behaviour, is GFS so much slower and producse so much overhead
due locking,
in comparison with ext3, or could this be any other reason.

Our hardware is not the key here.
It's probably my last approach to see if there is anything else we can do,
or have to move back to ext3.

BTW: How could I do a filesystem-check of a GFS, without having to unmount
the running one.
I thought of doing a snapshot and using this presentation, but my first
approach didn't work, gfs_fsck said, no gfs file system on the presented
LUN.. even it was a live snapshop.
So not sure if that is possible at all to just make a snapshot of a gfs and
then do a filesystem check. Just to get sure, it's not an error of the fs
after formatting it. (read that too someone had this problem.).

Thanks
Patrick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080828/61c765aa/attachment.htm>

From raju.rajsand at gmail.com  Thu Aug 28 11:20:15 2008
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Thu, 28 Aug 2008 11:20:15 +0000
Subject: [Linux-cluster] Want to Setup NFS with redhat Cluster??
In-Reply-To: <3120c9e30808250826g1329df55l161af2e0d19c21d8@mail.gmail.com>
References: <8786b91c0808250306u5bd41c7bjcca2772feab48cdf@mail.gmail.com>
	<0139539A634FD04A99C9B8880AB70CB209B17AB6@in-ex004.groupinfra.com>
	<3120c9e30808250826g1329df55l161af2e0d19c21d8@mail.gmail.com>
Message-ID: <8786b91c0808280420o524b6937wc9670ed75302a14@mail.gmail.com>

Greetings

2008/8/25 Anuj Singh (????) <anujhere at gmail.com>

> AIUI your target is to use A & B (Cluster nodes) to provide high
> availability of nfs service to your client machine C.
> a) If node A goes down B takes over the nfs service, provided to client C.
> b) If node B goes down A takes over the nfs service. provided to client C.
> c) At any particular time only A node or B node providing service to C.
> c) Client C, should see the same data independent of A and B, whether A is
> providing nfs export or B.
>

sources.redhat.com/cluster/doc/*nfs**cookbook*.pdf

Now in the above howto, it is assumed that you are using SAN for storage

If budget constraints are there, One alterntive for two node cluster with
shared storage achived with the storage box having two scsi channels and
both the nodes having one scsi controller each.

else the following is applicable


> Above thing can be achieved easily.
> a) Get the DRBD (distributed redundant block device) and replicate data
> between A node, B node. RAID1
> b) You need to create a virtual IP which floats between A node and B node
> along with startup of nfs service.
> c) You client C will access the floating virtual IP.
> above three things can be achieved by linux-heartbeat as well as RHCS, with
> RHCS you only have to manage DRBD primary. which can be handled with some
> small script.
>

In fact with DRBD also we can have GFS by making both primary. I have worked
in such a setup. to further bolster fault tolerance, you can have Software
RAID1 or 5 in each box. SATA drives are not expensive

Alternatively if you want more nodes to participate in the cluster, you can
build a low cost iSCSI target using openfiler and give-em a handful of
bonded NICs with jumbo frames to achieve decent performance. With this
solution you will have a storage box outside. If you have rack mounting
facility, try Sunfire x4150 it is just 1RU and a lot of thigs like 8 HDD
bays, 4NICs Memory upto 64GB.

Hope that answers the question somewhat

Regards

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080828/6f624ae8/attachment.htm>

From Harri.Paivaniemi at tietoenator.com  Thu Aug 28 11:58:31 2008
From: Harri.Paivaniemi at tietoenator.com (Harri.Paivaniemi at tietoenator.com)
Date: Thu, 28 Aug 2008 14:58:31 +0300
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
References: <200808271301.57414.linux@vfemail.net><48B55697.7020904@redhat.com>
	<200808281320.38680.linux@vfemail.net>
Message-ID: <41E8D4F07FCE154CBEBAA60FFC92F67709FD36@apollo.eu.tieto.com>

You could make httpd_service_node1, httpd_service_node2 and httpd_service_node3- services...

Or just spread keys and "for x in node1 node2 node3; do ssh $x service httpd start|stop"...

-hjp



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Alex
Sent: Thu 8/28/2008 13:20
To: linux clustering
Subject: Re: [Linux-cluster] one click to start httpd on all nodes - possible?
 
On Wednesday 27 August 2008 16:28, Marek 'marx' Grac wrote:
> Hi,
>
> Alex wrote:
> > Hi all,
> >
> > I have 3 nodes, forming a cluster. How sould be configured a service in
> > cluster.conf file in order to be able to stop or to start httpd daemon on
> > all our nodes at the same time? All i can find in docs is related to
> > failover scenario (stoping httpd on one node wil cause starting httpd on
> > other node) which is not what i need. For nodes management i am using
> > conga, so, i would like to have a service to do that? Is possible? If
> > not, should i use other external tools (like nagios) to do that?
>
> I don't think that it is possible to do this directly. But it should be
> easy to create several services with
> httpd (perhaps with different failover domains) and then run/stop it
> using CLI tool:
>         clusvcadm -e service / clusvcadm -d service
> You can put this in any script and then you are able to start/stop it
> from anywhere

Hi marx,

Many thanks for your reply. This is not working for me. let simplify with 2 
nodes. I created a service and a resource, as you suggest:

[snip from cluster.conf]
<service autostart="1" exclusive="0" name="httpd_service">
<script ref="httpd_script"/>
</service>

<resources>
<script file="/etc/rc.d/init.d/httpd" name="httpd_script"/>
</resources>
[end snip]

Now, supposing that httpd_service is running on one node, clusvcadm -d 
httpd_service will disable service and stop httpd on that node, but clusvcadm 
-e httpd_service will enable service on the node where is issued command and 
will not start service also on the second node...so, as i said, is a failover 
configuration. I want to be able to controll httpd in parallel on all nodes! 
So, what is missing from my above cluster.conf setup?

Regards,
Alx

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3954 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080828/1ba51576/attachment.bin>

From linux at vfemail.net  Thu Aug 28 13:36:13 2008
From: linux at vfemail.net (Alex)
Date: Thu, 28 Aug 2008 16:36:13 +0300
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <48B5658A.9040607@io-consulting.net>
References: <200808271301.57414.linux@vfemail.net>
	<48B55697.7020904@redhat.com> <48B5658A.9040607@io-consulting.net>
Message-ID: <200808281636.14208.linux@vfemail.net>

On Wednesday 27 August 2008 17:32, Johannes Ru?ek wrote:
> Hi Alex,
> if it's just something you want to do manually, are you aware of
> "http://sourceforge.net/projects/pdsh" ?
> regards,
> Johannes
>

I take a look over it... Is not ok at all ... It require rsh server to be 
installed on our nodes (is anybody using rsh...?) or require ssh v1 which is 
also disabled by default on all our nodes for security reasons... Finally, 
will do not do more then one line in a script which i coded in 5 seconds:

Stop service:
n=3
for i in `seq 1 $n`
do
ssh -l root node$i /etc/rc.d/init.d/httpd stop && logout;
done

Start service:
n=3
for i in `seq 1 $n`
do
ssh -l root node$i /etc/rc.d/init.d/httpd start && logout;
done

I don't think that this is a modern way and elevated method to be used in 
clustering...

If that is all it can redhat cluster v2, i'm afraid is very bad news....

Regards,
Alx

> Marek 'marx' Grac schrieb:
> > Hi,
> >
> > Alex wrote:
> >> Hi all,
> >>
> >> I have 3 nodes, forming a cluster. How sould be configured a service
> >> in cluster.conf file in order to be able to stop or to start httpd
> >> daemon on all our nodes at the same time? All i can find in docs is
> >> related to failover scenario (stoping httpd on one node wil cause
> >> starting httpd on other node) which is not what i need. For nodes
> >> management i am using conga, so, i would like to have a service to do
> >> that? Is possible? If not, should i use other external tools (like
> >> nagios) to do that?
> >
> > I don't think that it is possible to do this directly. But it should
> > be easy to create several services with
> > httpd (perhaps with different failover domains) and then run/stop it
> > using CLI tool:
> >        clusvcadm -e service / clusvcadm -d service
> > You can put this in any script and then you are able to start/stop it
> > from anywhere
> >
> > marx,
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From mgrac at redhat.com  Thu Aug 28 14:27:21 2008
From: mgrac at redhat.com (Marek 'marx' Grac)
Date: Thu, 28 Aug 2008 16:27:21 +0200
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <200808281320.38680.linux@vfemail.net>
References: <200808271301.57414.linux@vfemail.net>	<48B55697.7020904@redhat.com>
	<200808281320.38680.linux@vfemail.net>
Message-ID: <48B6B5C9.9000801@redhat.com>

Hi,

Alex wrote:
> Hi marx,
>
> Many thanks for your reply. This is not working for me. let simplify with 2 
> nodes. I created a service and a resource, as you suggest:
>
> [snip from cluster.conf]
> <service autostart="1" exclusive="0" name="httpd_service">
> <script ref="httpd_script"/>
> </service>
>
> <resources>
> <script file="/etc/rc.d/init.d/httpd" name="httpd_script"/>
> </resources>
> [end snip]
>
> Now, supposing that httpd_service is running on one node, clusvcadm -d 
> httpd_service will disable service and stop httpd on that node, but clusvcadm 
> -e httpd_service will enable service on the node where is issued command and 
> will not start service also on the second node...so, as i said, is a failover 
> configuration. I want to be able to controll httpd in parallel on all nodes! 
> So, what is missing from my above cluster.conf setup?
>   
Service is defined for whole cluster, so you need different services. 
This is why you usually have to run them with
different commands. But still using your approach you will lost the main 
benefit - high availability - main reason for using
cluster suite for me :) I prefer to use resource 'apache' because then 
you can have several apaches on different IP address on
one server (in any of them fail). Steps for this path should be define 
service for each httpd and define failover domain for each
service (with different priority - so they will run on different nodes 
if possible). After this you can use clusvcadm -e service and it
will run on correct nodes.

marx,



-- 
Marek Grac
Red Hat Czech s.r.o.



From travellig at gmail.com  Thu Aug 28 15:03:11 2008
From: travellig at gmail.com (travellig travellig)
Date: Thu, 28 Aug 2008 16:03:11 +0100
Subject: [Linux-cluster] RHCS suited for split-site configurations.
Message-ID: <694487210808280803h41b5c55cr2d39ef38cb6dc32@mail.gmail.com>

Hi All,

The $Subject came about as part of a comparision exercise between the
best possible solutions for a HA cluster that span across at the very
least two geographical sites.

For example, in a two node cluster (Active-Passive):

Node 1 (N1) located in DataCenter A (DCA) and node 2 (N2) located in
DataCenter B ( DCB). The distance between DCA & DCB is < 200 miles.

The nodes will connect to an DMX800 via 2 HBAs using Brocade-FCSwitches in DCA

N1 runs all the services while N2 is the back-up.

Is the $Subject supported by RH ?

Best regards,

-- 
Fernand



From praman at informatica.com  Thu Aug 28 16:46:11 2008
From: praman at informatica.com (Raman, Pattabhi)
Date: Thu, 28 Aug 2008 22:16:11 +0530
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <200808281636.14208.linux@vfemail.net>
Message-ID: <02E7FA106DF5944BB0571C25A9243DE60195B1DA@in23ex01.informatica.com>

Hi Johannes, 

Pdsh is just great :-). 

I saw this mail below, downloaded, installed and I got the setup and the ssh-trusted connections in just 10 minutes. And the shell is just to cool admin tool 

Thanks for the link 

Regards,
Pattabhi Raman 

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex
Sent: Thursday, August 28, 2008 7:06 PM
To: linux clustering
Subject: Re: [Linux-cluster] one click to start httpd on all nodes - possible?

On Wednesday 27 August 2008 17:32, Johannes Ru?ek wrote:
> Hi Alex,
> if it's just something you want to do manually, are you aware of
> "http://sourceforge.net/projects/pdsh" ?
> regards,
> Johannes
>

I take a look over it... Is not ok at all ... It require rsh server to be 
installed on our nodes (is anybody using rsh...?) or require ssh v1 which is 
also disabled by default on all our nodes for security reasons... Finally, 
will do not do more then one line in a script which i coded in 5 seconds:

Stop service:
n=3
for i in `seq 1 $n`
do
ssh -l root node$i /etc/rc.d/init.d/httpd stop && logout;
done

Start service:
n=3
for i in `seq 1 $n`
do
ssh -l root node$i /etc/rc.d/init.d/httpd start && logout;
done

I don't think that this is a modern way and elevated method to be used in 
clustering...

If that is all it can redhat cluster v2, i'm afraid is very bad news....

Regards,
Alx

> Marek 'marx' Grac schrieb:
> > Hi,
> >
> > Alex wrote:
> >> Hi all,
> >>
> >> I have 3 nodes, forming a cluster. How sould be configured a service
> >> in cluster.conf file in order to be able to stop or to start httpd
> >> daemon on all our nodes at the same time? All i can find in docs is
> >> related to failover scenario (stoping httpd on one node wil cause
> >> starting httpd on other node) which is not what i need. For nodes
> >> management i am using conga, so, i would like to have a service to do
> >> that? Is possible? If not, should i use other external tools (like
> >> nagios) to do that?
> >
> > I don't think that it is possible to do this directly. But it should
> > be easy to create several services with
> > httpd (perhaps with different failover domains) and then run/stop it
> > using CLI tool:
> >        clusvcadm -e service / clusvcadm -d service
> > You can put this in any script and then you are able to start/stop it
> > from anywhere
> >
> > marx,
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From markwag at u.washington.edu  Thu Aug 28 17:17:17 2008
From: markwag at u.washington.edu (Mark Wagner)
Date: Thu, 28 Aug 2008 10:17:17 -0700
Subject: [Linux-cluster] Using clustering to facilitate OS patching
Message-ID: <20080828171716.GA12034@n-its-markwag2.mcis.washington.edu>

One of the ways we use clustering is to allow us to patch a box without
extended downtime. I.e., to patch server A (1) relocate services on A to B
(encountering a small downtime), (2) patch/test A with no time pressure,
then (3) relocate the services back to A. Does anybody do the same or
have comments on this patching technique?

We are now beginning to virtualize and using this method in a virtualized
environment seems awkward at best. It is OK for the dom0. The services
are domUs so the relocation step is relocating them. However, if we
run clustering for the domUs we end up running clustering on top of
clustering. Is that reasonable? I can see one pitfall in that if you
aren't careful you could have a clustered domU pair running on the
same dom0.

Mark

-- 
Mark Wagner <markwag at u.washington.edu>
System Administrator, UW Medicine IT Services
206-616-6119



From Millard.Matt at principal.com  Fri Aug 29 02:51:17 2008
From: Millard.Matt at principal.com (Millard, Matt)
Date: Thu, 28 Aug 2008 21:51:17 -0500
Subject: [Linux-cluster] Unable to communicate with luci server
In-Reply-To: <A3980FD020E35F4D966478FA59437C4D16747EB7@PFGDSMMBX010.principalusa.corp.principal.com>
References: <A3980FD020E35F4D966478FA59437C4D16747EB7@PFGDSMMBX010.principalusa.corp.principal.com>
Message-ID: <A3980FD020E35F4D966478FA59437C4D16842C75@PFGDSMMBX010.principalusa.corp.principal.com>



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-
> bounces at redhat.com] On Behalf Of Millard, Matt
> Sent: Tuesday, August 26, 2008 2:47 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Unable to communicate with luci server
> 
> I'm working on setting up a brand new cluster using Conga.  After I
> initialize Luci and restart it, go into Cluster, Create a New Cluster
and
> enter in fqdn hostnames for each of my two nodes, password and select
> "View SSL cert fingerprints" I get a popup that says "Unable to
> communicate with luci server."  I've tried to set this up on two
separate
> Luci servers with the same results.  I've verified that /etc/hosts has
the
> fqdn names for each node, multicast is working (pingable), port 11111
is
> open (per nmap).  I get no errors in /var/log/messages.  I've tried
both
> the public and private networks to get the nodes setup with no
results.
> 
> Also I can just be sitting at the "Clusters" page with nothing on it
and
> the same "Unable to communicate with luci server."  Error will pop up
> randomly in my browser.  Any thoughts on where to start with this?
> 
> RHEL5 - U2 on both nodes and on the Luci server.
> 
> Matt
> 

This turned out to be a problem with using IE6 as my browser.  I was
able to switch to Firefox 3 and able to setup my cluster without issue.

Just thought I'd send a message to help someone out in the future


-----Message Disclaimer-----

This e-mail message is intended only for the use of the individual or
entity to which it is addressed, and may contain information that is
privileged, confidential and exempt from disclosure under applicable law.
If you are not the intended recipient, any dissemination, distribution or
copying of this communication is strictly prohibited. If you have
received this communication in error, please notify us immediately by
reply email to Connect at principal.com and delete or destroy all copies of
the original message and attachments thereto. Email sent to or from the
Principal Financial Group or any of its member companies may be retained
as required by law or regulation.

Nothing in this message is intended to constitute an Electronic signature
for purposes of the Uniform Electronic Transactions Act (UETA) or the
Electronic Signatures in Global and National Commerce Act ("E-Sign")
unless a specific statement to the contrary is included in this message.

While this communication may be used to promote or market a transaction
or an idea that is discussed in the publication, it is intended to provide
general information about the subject matter covered and is provided with
the understanding that The Principal is not rendering legal, accounting,
or tax advice. It is not a marketed opinion and may not be used to avoid
penalties under the Internal Revenue Code. You should consult with
appropriate counsel or other advisors on all matters pertaining to legal,
tax, or accounting obligations and requirements.




From sean at bruenor.org  Fri Aug 29 04:22:34 2008
From: sean at bruenor.org (Sean E. Millichamp)
Date: Fri, 29 Aug 2008 00:22:34 -0400
Subject: [Linux-cluster] [PATCH] Properly close file descriptors in qdiskd
Message-ID: <1219983754.3117.14.camel@sewt>

While testing qdiskd with a ping heuristic on a cluster system with
SELinux in enforcing mode I noticed some odd AVC denial messages.  It
eventually led me to discover that the qdisk_open function wasn't
properly closing open file descriptors to some of my block devices
before returning on certain errors.  The end result was that the ping
heuristic inherited these open FDs which were in violation of the ping
SELinux policy.

In my specific case the lseek() call was failing for the partitions
corresponding to my extended partition container on my boot drives.

I scanned the rest of the function and noticed a couple of other calls
where it seemed like closing the file descriptor before returning was
also appropriate.

With this patch I haven't been able to reproduce the SELinux denial
messages.

This patch is against the RHEL52 code, but seems to also be applicable
on Fedora 9 and the git master branch.

What is the proper way to submit patches?  Should I open a bug report
somewhere?

Thanks,
Sean

-------------- next part --------------
A non-text attachment was scrubbed...
Name: qdisk-close-fd.patch
Type: text/x-patch
Size: 725 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080829/f60a843d/attachment.bin>

From alexander.vorobiyov at rzn.nex3.ru  Fri Aug 29 10:35:18 2008
From: alexander.vorobiyov at rzn.nex3.ru (Alexander Vorobiyov)
Date: Fri, 29 Aug 2008 14:35:18 +0400
Subject: [Linux-cluster] lvcreate: Error locking on node
Message-ID: <1220006118.3124.14.camel@blackhouse>

[root at xxx ~]# lvmconf --enable-cluster

[root at xxx ~]# vgcreate -c y storage /dev/sda3
  Volume group "storage" successfully created

[root at xxx ~]# lvcreate -n test -l100%VG storage
  Error locking on node yyy.nex3.ru: Volume group for uuid not found:
Zca33zkMWEMsmqi1P5Fswul69ib4rmL0WQkIjaHnkqwx3oFwDTKD6Om5JZC49Ogu
  Aborting. Failed to activate new LV to wipe the start of it.

I trying restart clvmd on all nodes but it does not help.
Centos 5.2
Cluster LVM daemon version: 2.02.32-RHEL5 (2008-03-04)
Protocol version:           0.2.1

-- 
Alexander Vorobiyov
NTC NOC
The engineer of communications
Russia, Ryazan
mailto:alexander.vorobiyov at rzn.nex3.ru



From lp at xbe.ch  Fri Aug 29 10:42:55 2008
From: lp at xbe.ch (Lorenz Pfiffner)
Date: Fri, 29 Aug 2008 12:42:55 +0200
Subject: [Linux-cluster] Linux cluster moved to new subdomain
In-Reply-To: <48B46964.3040708@xbe.ch>
References: <8ee061010806160945pc2418f8w95749c8bf566d02d@mail.gmail.com>	<20080616191641.GA17965@kallisti.us>	<4856C385.8000800@gmail.com>	<8ee061010806170954y1d6d7555qf7cd5a137b63e018@mail.gmail.com>	<485819DC.90503@gmail.com>	<8ee061010806171522l1d18480em861bffb87f8b8be2@mail.gmail.com>	<8ee061010806181048j2d9e4635n6e0133c855e4ea06@mail.gmail.com>	<485A71ED.8030305@gmail.com>	<485A7E59.8020107@gmail.com>	<0AB7D520EBDCE743BFAE18CF2B5B04C901EAFECAFC@G3W0070.americas.hpqcorp.net>	<001301c8d2cb$9317be60$b9473b20$@gr>	<0AB7D520EBDCE743BFAE18CF2B5B04C901EAFECB03@G3W0070.americas.hpqcorp.net>
	<48B46964.3040708@xbe.ch>
Message-ID: <48B7D2AF.7050705@xbe.ch>

Hi

I solved this problem by backing up the luci db (it's binary stored in Data.fs), 'luci_admin backup', saving the luci_backup.xml, removing luci (rpm -e luci), reinstalling it (yum install luci), 
adjust node ip's in luci_backup.xml, copying it back to /var/lib/luci/var/, and restoring it back with 'luci_admin restore'. Of course, luci has to be shutdown during this procedure.

It worked this easy for me because the keys of my nodes didn't change.

I don't know if this is the proper solution, but at least it worked for me and maybe helps other people with the same problem.

Kind regards
Lorenz

Lorenz Pfiffner wrote:
> Hi Dave
> 
> Did you solve this problem? I have a similar case here. I changed the 
> IP's of one of my two node cluster and now I can't get the cluster back 
> into luci. I assume luci stores it's config in a binary Zope Data.fs 
> file. But how can change something there?
> 
> BTW: Changing node IP's was pain! A Howto would be an interesting thing 
> for the Knowledge Base or even RedHat Magazine. I think it would really 
> be appreciated by many Cluster Suite operators.
> 
> Greetz
> Lorenz
> 
> Harding, David wrote:
>>  Yes,  The issue is that when I first built the cluster I used the 
>> full qualified host name. I changed the cluster.conf file,
>> But some where it is still getting the old fully qualified host name.  
>> I can see it in the log files.  Where else does it
>> Store the cluster host names other then the cluster.conf file ?
>>
>> dave
>>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com 
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Theophanis 
>> Kontogiannis
>> Sent: Friday, June 20, 2008 7:49 AM
>> To: 'linux clustering'
>> Subject: RE: [Linux-cluster] Linux cluster moved to new subdomain
>>
>> Hi Dave,
>>
>> Did you make the appropriate changes on the iptables, to reflect the 
>> new IPs given to the servers?
>>
>> Sincerely,
>>
>> Theophanis Kontogiannis
>>
>>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Harding, David
>> Sent: Thursday, June 19, 2008 11:36 PM
>> To: 'linux clustering'
>> Subject: [Linux-cluster] Linux cluster moved to new subdomain
>>
>>
>>
>>
>> We moved our Linux cluster to a new tcpip subnet.
>> It is running Redhat V4 update 6.  After the move I fixed the 
>> cluster.conf to reflect the subnet name.
>>
>> I then went into luci.  When I select the cluster tab I get an error 
>> message stating "an error occurred when trying to contact any of the 
>> nodes in the ermmro cluster"
>> The systems show up ok in the homepase tab and everything looks 
>> correct under the storage tab.
>>
>> dave
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From macbogucki at gmail.com  Fri Aug 29 11:01:58 2008
From: macbogucki at gmail.com (Maciej Bogucki)
Date: Fri, 29 Aug 2008 13:01:58 +0200
Subject: [Linux-cluster] RHCS suited for split-site configurations.
In-Reply-To: <694487210808280803h41b5c55cr2d39ef38cb6dc32@mail.gmail.com>
References: <694487210808280803h41b5c55cr2d39ef38cb6dc32@mail.gmail.com>
Message-ID: <48B7D726.2070005@gmail.com>

travellig travellig wrote:
> Hi All,
>
> The $Subject came about as part of a comparision exercise between the
> best possible solutions for a HA cluster that span across at the very
> least two geographical sites.
>
> For example, in a two node cluster (Active-Passive):
>
> Node 1 (N1) located in DataCenter A (DCA) and node 2 (N2) located in
> DataCenter B ( DCB). The distance between DCA & DCB is < 200 miles.
>
> The nodes will connect to an DMX800 via 2 HBAs using Brocade-FCSwitches in DCA
>
> N1 runs all the services while N2 is the back-up.
>
> Is the $Subject supported by RH ?
>
>
>   
Hello,

If You want to use migration of Resources(Apache,MySQL, ...) if You 
primary DC is down, then RHCS is good option. You need two reliable 
interconnections betwean DCA and DSB, to avoid split-brain. You could 
also try heartbeat[1] instead od RHCS.
Are you going to use GFS?
Do You have DMX800 SAN in DCB? And what with replication of data between 
them?

[1] - http://www.linux-ha.org/


Best Regards
Maciej Bogucki



From jstoner at opsource.net  Fri Aug 29 15:44:17 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Fri, 29 Aug 2008 16:44:17 +0100
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <200808271301.57414.linux@vfemail.net>
References: <200808271301.57414.linux@vfemail.net>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A307F20245@mailxchg01.corp.opsource.net>

If you are running httpd on all nodes, why are you managing httpd with
RHCS in the first place? Simply "chkconfig httpd on" and when the server
starts, httpd will start, too.

If you want to build a simple web server farm, this is more easily
accomplished using a load balancer (hardware or software) in front of
the web servers than with Cluster Services.

Perhaps you could explain in more detail what you are trying to
accomplish. Are there an additional resources (file system mount, ip
address, etc.) associated with httpd. Under what conditions would you
start or stop httpd on a node?


--Jeff
Sr. Systems Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex
> Sent: Wednesday, August 27, 2008 6:02 AM
> To: linux clustering
> Subject: [Linux-cluster] one click to start httpd on all 
> nodes - possible?
> 
> Hi all,
> 
> I have 3 nodes, forming a cluster. How sould be configured a 
> service in 
> cluster.conf file in order to be able to stop or to start 
> httpd daemon on all 
> our nodes at the same time? All i can find in docs is related 
> to failover 
> scenario (stoping httpd on one node wil cause starting httpd 
> on other node) 
> which is not what i need. For nodes management i am using 
> conga, so, i would 
> like to have a service to do that? Is possible? If not, 
> should i use other 
> external tools (like nagios) to do that?
> 
> Regards,
> Alx
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 



From jbrassow at redhat.com  Fri Aug 29 21:08:12 2008
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Fri, 29 Aug 2008 16:08:12 -0500
Subject: [Linux-cluster] lvcreate: Error locking on node
In-Reply-To: <1220006118.3124.14.camel@blackhouse>
References: <1220006118.3124.14.camel@blackhouse>
Message-ID: <60820AA3-5228-4EB4-9E11-94B63CC90C04@redhat.com>

Can all the nodes in your cluster see the storage devices?

  brassow

On Aug 29, 2008, at 5:35 AM, Alexander Vorobiyov wrote:

> [root at xxx ~]# lvmconf --enable-cluster
>
> [root at xxx ~]# vgcreate -c y storage /dev/sda3
>  Volume group "storage" successfully created
>
> [root at xxx ~]# lvcreate -n test -l100%VG storage
>  Error locking on node yyy.nex3.ru: Volume group for uuid not found:
> Zca33zkMWEMsmqi1P5Fswul69ib4rmL0WQkIjaHnkqwx3oFwDTKD6Om5JZC49Ogu
>  Aborting. Failed to activate new LV to wipe the start of it.
>
> I trying restart clvmd on all nodes but it does not help.
> Centos 5.2
> Cluster LVM daemon version: 2.02.32-RHEL5 (2008-03-04)
> Protocol version:           0.2.1
>
> -- 
> Alexander Vorobiyov
> NTC NOC
> The engineer of communications
> Russia, Ryazan
> mailto:alexander.vorobiyov at rzn.nex3.ru
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jamesc at exa.com  Sat Aug 30 01:19:25 2008
From: jamesc at exa.com (James Chamberlain)
Date: Fri, 29 Aug 2008 21:19:25 -0400
Subject: [Linux-cluster] lm_dlm_cancel
Message-ID: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com>

Hi all,

I'm trying to grow a GFS filesystem.  I've grown this filesystem  
before and everything went fine.  However, when I issued gfs_grow this  
time, I saw the following messages in my logs:

Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17 flags  
100
Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock:  
10241 busy 2
Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17  
flags 40080

The last three lines of these log entries repeat themselves once a  
second until I hit ^C.  The filesystem appears to still be up and  
accessible.  Any thoughts on what's going on here and what I can do  
about it?

Thanks,

James



From alexander.vorobiyov at rzn.nex3.ru  Sat Aug 30 19:51:23 2008
From: alexander.vorobiyov at rzn.nex3.ru (Alexander Vorobiyov)
Date: Sat, 30 Aug 2008 23:51:23 +0400
Subject: [Linux-cluster] lvcreate: Error locking on node
Message-ID: <1220125884.3124.28.camel@blackhouse>

I try to create lvm logical volume that it was accessible through any
node of my cluster, at refusal of any cluster node. I try to organise
this storage through gigabit ethernet. clvm can provide these storage in
this case or I am mistaken?

-- 
Alexander Vorobiyov
NTC NOC
The engineer of communications
Russia, Ryazan
+7(4912)901553 ext. 630
mailto:alexander.vorobiyov at rzn.nex3.ru