From linux at vfemail.net  Mon Sep  1 08:54:43 2008
From: linux at vfemail.net (Alex)
Date: Mon, 1 Sep 2008 11:54:43 +0300
Subject: [Linux-cluster] one click to start httpd on all nodes - possible?
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A307F20245@mailxchg01.corp.opsource.net>
References: <200808271301.57414.linux@vfemail.net>
	<38A48FA2F0103444906AD22E14F1B5A307F20245@mailxchg01.corp.opsource.net>
Message-ID: <200809011154.43460.linux@vfemail.net>

On Friday 29 August 2008 18:44, Jeff Stoner wrote:
> If you are running httpd on all nodes, why are you managing httpd with
> RHCS in the first place? Simply "chkconfig httpd on" and when the server
> starts, httpd will start, too.
>

Hello Jeff,

Here, http has been used just as an example. Could be any other service (ftp, 
proxy cache, etc) or shared resurce (a gfs volume mounted on all our 
servers). Here i am talking about functionality...

> If you want to build a simple web server farm, this is more easily
> accomplished using a load balancer (hardware or software) in front of
> the web servers than with Cluster Services.
>
> Perhaps you could explain in more detail what you are trying to
> accomplish. Are there an additional resources (file system mount, ip
> address, etc.) associated with httpd. Under what conditions would you
> start or stop httpd on a node?

Simple: supposing that i have 3 thiered cluster model (1st thier configured 
for HA and load balancing, 2nd thier with N nodes acting as real servers, all 
running the same service and accesing a shared volume via iscsi, and 3rd 
thier acting as SAN), i need a "command center" to control (start/stop) a 
resourse/service globally in 2nd thier, on all N nodes. For example, 
currently i am using fstab to mount at boot time a shared GFS volume on all 
our cluster nodes. There are some cases, when i want that resource to be 
unmounted at the same time on all nodes... Is difficult and take time to ssh 
inside each node and stop a service accessing that resource and after that 
umount resource on each node (eg: service httpd stop && 
umount /dev/my_shared_volume). Maybe, what i want is not possible using 
cluster configuration, but i would like to know, how other peoples are are 
achieving this task.

Regards,
Alx

>
>
> --Jeff
> Sr. Systems Engineer
>
> OpSource, Inc.
> http://www.opsource.net
> "Your Success is Our Success"
>
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex
> > Sent: Wednesday, August 27, 2008 6:02 AM
> > To: linux clustering
> > Subject: [Linux-cluster] one click to start httpd on all
> > nodes - possible?
> >
> > Hi all,
> >
> > I have 3 nodes, forming a cluster. How sould be configured a
> > service in
> > cluster.conf file in order to be able to stop or to start
> > httpd daemon on all
> > our nodes at the same time? All i can find in docs is related
> > to failover
> > scenario (stoping httpd on one node wil cause starting httpd
> > on other node)
> > which is not what i need. For nodes management i am using
> > conga, so, i would
> > like to have a service to do that? Is possible? If not,
> > should i use other
> > external tools (like nagios) to do that?
> >
> > Regards,
> > Alx
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From gpbuono at gmail.com  Mon Sep  1 09:29:49 2008
From: gpbuono at gmail.com (Gian Paolo Buono)
Date: Mon, 1 Sep 2008 11:29:49 +0200
Subject: [Linux-cluster] the cluster don't restart (clvmd)
Message-ID: <c60a46e50809010229k50f77640u122b4b43287336b3@mail.gmail.com>

Hi,
I have a cluster configuration with two node..this is my cluster.conf:

####################cluster.conf####################
<?xml version="1.0"?>
<cluster alias="yoda-cl" config_version="3" name="yoda-cl">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="yoda2.cs.tin.it" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="yoda1.cs.tin.it" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
        <fencedevices/>
</cluster>
####################cluster.conf####################

I have  tried to restart cluster without reboot because the command clustat
on node 2 don't work ... but ther is a problem on fence device..this is the
messages..

[root at yoda1 cluster]# /etc/init.d/cman start
Starting cluster:
   Enabling workaround for Xend bridged networking... done
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... failed

The follow the log:
Sep  1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object clvmd
in /sys/kernel/dlm
Sep  1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1
uncontrolled instances of gfs and/or dlm
Sep  1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2 because
we were killed by cman_tool or other application
Sep  1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it" was
unsuccessful
Sep  1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111
Sep  1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111
Sep  1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111

best regards
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080901/f2fc0478/attachment.htm>

From jakub.suchy at enlogit.cz  Mon Sep  1 15:03:49 2008
From: jakub.suchy at enlogit.cz (Jakub Suchy)
Date: Mon, 1 Sep 2008 17:03:49 +0200
Subject: [Linux-cluster] the cluster don't restart (clvmd)
In-Reply-To: <c60a46e50809010229k50f77640u122b4b43287336b3@mail.gmail.com>
References: <c60a46e50809010229k50f77640u122b4b43287336b3@mail.gmail.com>
Message-ID: <20080901150349.GA28998@nazorne.cz>

Fence may be failing because you have NO fencing devices :)


Jakub Suchy

Gian Paolo Buono wrote:
> Hi,
> I have a cluster configuration with two node..this is my cluster.conf:
> 
> ####################cluster.conf####################
> <?xml version="1.0"?>
> <cluster alias="yoda-cl" config_version="3" name="yoda-cl">
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="yoda2.cs.tin.it" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="yoda1.cs.tin.it" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
>         <fencedevices/>
> </cluster>
> ####################cluster.conf####################
> 
> I have  tried to restart cluster without reboot because the command clustat
> on node 2 don't work ... but ther is a problem on fence device..this is the
> messages..
> 
> [root at yoda1 cluster]# /etc/init.d/cman start
> Starting cluster:
>    Enabling workaround for Xend bridged networking... done
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... done
>    Starting daemons... done
>    Starting fencing... failed
> 
> The follow the log:
> Sep  1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object clvmd
> in /sys/kernel/dlm
> Sep  1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1
> uncontrolled instances of gfs and/or dlm
> Sep  1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2 because
> we were killed by cman_tool or other application
> Sep  1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it" was
> unsuccessful
> Sep  1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111
> Sep  1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111
> Sep  1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111
> 
> best regards

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Jakub Such? <jakub.suchy at enlogit.cz>
GSM: +420 - 777 817 949

Enlogit s.r.o, U Cukrovaru 509/4, 400 07 ?st? nad Labem
tel.: +420 - 474 745 159, fax: +420 - 474 745 160
e-mail: info at enlogit.cz, web: http://www.enlogit.cz

Energy & Logic in IT



From ccaulfie at redhat.com  Mon Sep  1 15:14:15 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 01 Sep 2008 16:14:15 +0100
Subject: [Linux-cluster] the cluster don't restart (clvmd)
In-Reply-To: <c60a46e50809010229k50f77640u122b4b43287336b3@mail.gmail.com>
References: <c60a46e50809010229k50f77640u122b4b43287336b3@mail.gmail.com>
Message-ID: <48BC06C7.7010800@redhat.com>

Gian Paolo Buono wrote:
> Hi,
> I have a cluster configuration with two node..this is my cluster.conf:
> 
> ####################cluster.conf####################
> <?xml version="1.0"?>
> <cluster alias="yoda-cl" config_version="3" name="yoda-cl">
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="yoda2.cs.tin.it
> <http://yoda2.cs.tin.it>" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="yoda1.cs.tin.it
> <http://yoda1.cs.tin.it>" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
>         <fencedevices/>
> </cluster>
> ####################cluster.conf####################
> 
> I have  tried to restart cluster without reboot because the command
> clustat on node 2 don't work ... but ther is a problem on fence
> device..this is the messages..
> 
> [root at yoda1 cluster]# /etc/init.d/cman start
> Starting cluster:
>    Enabling workaround for Xend bridged networking... done
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... done
>    Starting daemons... done
>    Starting fencing... failed
> 
> The follow the log:
> Sep  1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object
> clvmd in /sys/kernel/dlm
> Sep  1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1
> uncontrolled instances of gfs and/or dlm
> Sep  1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2
> because we were killed by cman_tool or other application
> Sep  1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it
> <http://yoda1.cs.tin.it>" was unsuccessful
> Sep  1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111
> Sep  1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111
> Sep  1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111
>

It's all failing to start because the cluster software wasn't shut down
properly originally. ALL the daemons must be shut down and GFS
filesystems mounted etc. Only then can you restart the cluster software.

Looking at the messages I would guess that either clvmd was killed with
-9 (there is a stray clvmd lockspace in existance) or the cluster was
shutdown with "cman_tool leave force". Or maybe the daemons were killed
by hand.

In the event it's often easier to reboot ...

Chrissie



From jamesc at exa.com  Mon Sep  1 23:55:48 2008
From: jamesc at exa.com (James Chamberlain)
Date: Mon, 1 Sep 2008 19:55:48 -0400
Subject: [Linux-cluster] lm_dlm_cancel
In-Reply-To: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com>
References: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com>
Message-ID: <D4B52777-20EC-43AF-BC25-957F63B7CF3A@exa.com>

Hi all,

Since I sent the below, the aforementioned cluster crashed.  Now I  
can't mount the scratch112 filesystem.  Attempts to do so crash the  
node trying to mount it.  If I run gfs_fsck against it, I see the  
following:

# gfs_fsck -nv /dev/s12/scratch112
Initializing fsck
Initializing lists...
Initializing special inodes...
Validating Resource Group index.
Level 1 check.
5834 resource groups found.
(passed)
Setting block ranges...
Can't seek to last block in file system: 4969529913
Unable to determine the boundaries of the file system.
Freeing buffers.

Not being able to determine the boundaries of the file system seems  
like a very bad thing.  However, LVM didn't complain in the slightest  
when I expanded the logical volume.  How can I recover from this?

Thanks,

James


On Aug 29, 2008, at 9:19 PM, James Chamberlain wrote:

> Hi all,
>
> I'm trying to grow a GFS filesystem.  I've grown this filesystem  
> before and everything went fine.  However, when I issued gfs_grow  
> this time, I saw the following messages in my logs:
>
> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17  
> flags 100
> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
> Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock:  
> 10241 busy 2
> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17  
> flags 40080
>
> The last three lines of these log entries repeat themselves once a  
> second until I hit ^C.  The filesystem appears to still be up and  
> accessible.  Any thoughts on what's going on here and what I can do  
> about it?
>
> Thanks,
>
> James
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From gregory at steulet.org  Tue Sep  2 12:49:17 2008
From: gregory at steulet.org (gregory steulet)
Date: Tue, 02 Sep 2008 14:49:17 +0200
Subject: [Linux-cluster] Unable to retrieve batch 576881287 status from xxxx
	Service Manager not running on this node
Message-ID: <1220359757-0f96762ead54f0b961f5fb5c743dba0a@steulet.org>

Hi folks,

I've a problem with luci, maybe did you already encounter this kind of problem. I add an IP ressource and a VIP service. Unfortunately my VIP service is monitored down. if I try to enable this service I get an error message like below 

Sep 2 14:31:51 emperor01 luci[8115]: Unable to retrieve batch 576881287 status from emperor01.high-availability.eu:11111: Service Manager not running on this node


[root at emperor01 ~]# hostname
emperor01.high-availability.eu

[root at emperor01 ~]# vi /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain   localhost
192.168.1.102           emperor02.high-availability.eu emperor02
192.168.1.101           emperor01.high-availability.eu emperor01
1.0.0.1                 emperor01.int
1.0.0.2                 emperor02.int



vi /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster config_version="5" name="new_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="emperor01.high-availability.eu" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="emperor02.high-availability.eu" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="192.168.1.120" monitor_link="0"/>
                </resources>
                <service autostart="1" exclusive="0" name="VIP" recovery="relocate">
                        <ip ref="192.168.1.120"/>
                </service>
        </rm>
</cluster>

Have a great day, best regards

Greg






From teigland at redhat.com  Tue Sep  2 13:55:33 2008
From: teigland at redhat.com (David Teigland)
Date: Tue, 2 Sep 2008 08:55:33 -0500
Subject: [Linux-cluster] lm_dlm_cancel
In-Reply-To: <D4B52777-20EC-43AF-BC25-957F63B7CF3A@exa.com>
References: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com>
	<D4B52777-20EC-43AF-BC25-957F63B7CF3A@exa.com>
Message-ID: <20080902135533.GB21199@redhat.com>

On Mon, Sep 01, 2008 at 07:55:48PM -0400, James Chamberlain wrote:
> Hi all,
> 
> Since I sent the below, the aforementioned cluster crashed.  Now I  
> can't mount the scratch112 filesystem.  Attempts to do so crash the  
> node trying to mount it.  If I run gfs_fsck against it, I see the  
> following:
> 
> # gfs_fsck -nv /dev/s12/scratch112
> Initializing fsck
> Initializing lists...
> Initializing special inodes...
> Validating Resource Group index.
> Level 1 check.
> 5834 resource groups found.
> (passed)
> Setting block ranges...
> Can't seek to last block in file system: 4969529913
> Unable to determine the boundaries of the file system.
> Freeing buffers.
> 
> Not being able to determine the boundaries of the file system seems  
> like a very bad thing.  However, LVM didn't complain in the slightest  
> when I expanded the logical volume.  How can I recover from this?

Looks like the killed gfs_grow left your fs is a bad condition.
I believe Bob Peterson has addressed that recently.

> >I'm trying to grow a GFS filesystem.  I've grown this filesystem  
> >before and everything went fine.  However, when I issued gfs_grow  
> >this time, I saw the following messages in my logs:
> >
> >Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
> >Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17  
> >flags 100
> >Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
> >Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock:  
> >10241 busy 2
> >Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17  
> >flags 40080
> >
> >The last three lines of these log entries repeat themselves once a  
> >second until I hit ^C.  The filesystem appears to still be up and  
> >accessible.  Any thoughts on what's going on here and what I can do  
> >about it?

Should be fixed by
https://bugzilla.redhat.com/show_bug.cgi?id=438268

Dave



From jamesc at exa.com  Tue Sep  2 14:15:25 2008
From: jamesc at exa.com (James Chamberlain)
Date: Tue, 2 Sep 2008 10:15:25 -0400 (EDT)
Subject: [Linux-cluster] lm_dlm_cancel
In-Reply-To: <20080902135533.GB21199@redhat.com>
References: <81D8B57D-B9C8-4AA0-8BEC-F45212795FB6@exa.com>
	<D4B52777-20EC-43AF-BC25-957F63B7CF3A@exa.com>
	<20080902135533.GB21199@redhat.com>
Message-ID: <Pine.LNX.4.64.0809021011590.17880@hawking.exa.com>

On Tue, 2 Sep 2008, David Teigland wrote:

> On Mon, Sep 01, 2008 at 07:55:48PM -0400, James Chamberlain wrote:
>> Hi all,
>>
>> Since I sent the below, the aforementioned cluster crashed.  Now I
>> can't mount the scratch112 filesystem.  Attempts to do so crash the
>> node trying to mount it.  If I run gfs_fsck against it, I see the
>> following:
>>
>> # gfs_fsck -nv /dev/s12/scratch112
>> Initializing fsck
>> Initializing lists...
>> Initializing special inodes...
>> Validating Resource Group index.
>> Level 1 check.
>> 5834 resource groups found.
>> (passed)
>> Setting block ranges...
>> Can't seek to last block in file system: 4969529913
>> Unable to determine the boundaries of the file system.
>> Freeing buffers.
>>
>> Not being able to determine the boundaries of the file system seems
>> like a very bad thing.  However, LVM didn't complain in the slightest
>> when I expanded the logical volume.  How can I recover from this?
>
> Looks like the killed gfs_grow left your fs is a bad condition.
> I believe Bob Peterson has addressed that recently.

I think it was in a bad condition before I hit ^C rather than because I 
did.  As I mentioned, I was getting the lm_dlm_cancel messages before I hit 
^C.  But I'd agree that one way or another, the gfs_grow operation somehow 
left the fs in a bad state.

>>> I'm trying to grow a GFS filesystem.  I've grown this filesystem
>>> before and everything went fine.  However, when I issued gfs_grow
>>> this time, I saw the following messages in my logs:
>>>
>>> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
>>> Aug 29 21:04:13 s12n02 kernel: lock_dlm: lm_dlm_cancel skip 2,17
>>> flags 100
>>> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel 2,17 flags 80
>>> Aug 29 21:04:14 s12n02 kernel: dlm: scratch112: (14239) dlm_unlock:
>>> 10241 busy 2
>>> Aug 29 21:04:14 s12n02 kernel: lock_dlm: lm_dlm_cancel rv -16 2,17
>>> flags 40080
>>>
>>> The last three lines of these log entries repeat themselves once a
>>> second until I hit ^C.  The filesystem appears to still be up and
>>> accessible.  Any thoughts on what's going on here and what I can do
>>> about it?
>
> Should be fixed by
> https://bugzilla.redhat.com/show_bug.cgi?id=438268

Thanks Dave.  Any idea if there's a corresponding patch for RHEL 4?

Regards,

James



From pradhanparas at gmail.com  Tue Sep  2 19:54:52 2008
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 2 Sep 2008 14:54:52 -0500
Subject: [Linux-cluster] VM migration
Message-ID: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com>

Hi,
I am running a cluster having 2 nodes using red hat cluster suite in CentOS
5.2. node 1 has a para virtualized guest(centOS)  running under Xen. My
question is when node1 is rebooted, guest is automatically relocated to node
2 . Instead of relocation, is migration possible in this case which can
result in Zero down time?


Thanks in adv
Paras.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080902/a0a20f3d/attachment.htm>

From macscr at macscr.com  Tue Sep  2 20:03:15 2008
From: macscr at macscr.com (Mark Chaney)
Date: Tue, 2 Sep 2008 15:03:15 -0500
Subject: [Linux-cluster] VM migration
In-Reply-To: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com>
References: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com>
Message-ID: <036001c90d36$f151e140$d3f5a3c0$@com>

Are you using shared storage? Is far as I know, the current cluster suite
wont do an automatic live migration to another server during a reboot, but
it will do the live migration back after the reboot. Does that make sense?

 

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan
Sent: Tuesday, September 02, 2008 2:55 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] VM migration

 

Hi,

 

I am running a cluster having 2 nodes using red hat cluster suite in CentOS
5.2. node 1 has a para virtualized guest(centOS)  running under Xen. My
question is when node1 is rebooted, guest is automatically relocated to node
2 . Instead of relocation, is migration possible in this case which can
result in Zero down time?

 

 

Thanks in adv

Paras.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080902/a0a64d38/attachment.htm>

From pradhanparas at gmail.com  Tue Sep  2 20:08:52 2008
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 2 Sep 2008 15:08:52 -0500
Subject: [Linux-cluster] VM migration
In-Reply-To: <036001c90d36$f151e140$d3f5a3c0$@com>
References: <8b711df40809021254k38457dbdi7ebf6e4c4a6e5df@mail.gmail.com>
	<036001c90d36$f151e140$d3f5a3c0$@com>
Message-ID: <8b711df40809021308k53c511c7n3fb944fa94bbff1@mail.gmail.com>

2008/9/2 Mark Chaney <macscr at macscr.com>

>  Are you using shared storage? Is far as I know, the current cluster suite
> wont do an automatic live migration to another server during a reboot, but
> it will do the live migration back after the reboot. Does that make sense?
>
>
>
> *From:* linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat.com] *On Behalf Of *Paras pradhan
> *Sent:* Tuesday, September 02, 2008 2:55 PM
> *To:* linux-cluster at redhat.com
> *Subject:* [Linux-cluster] VM migration
>
>
>
> Hi,
>
>
>
> I am running a cluster having 2 nodes using red hat cluster suite in CentOS
> 5.2. node 1 has a para virtualized guest(centOS)  running under Xen. My
> question is when node1 is rebooted, guest is automatically relocated to node
> 2 . Instead of relocation, is migration possible in this case which can
> result in Zero down time?
>
>
>
>
>
> Thanks in adv
>
> Paras.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




Exactly as you said. It is relocating and as soon as the original node comes
back it is migrated automatically.

Yes I am using shared storage using SAN/GFS2. Any workaround on this to make
migration possible inserted of relocation

Paras.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080902/3ff3460f/attachment.htm>

From emuller at engineyard.com  Tue Sep  2 22:11:35 2008
From: emuller at engineyard.com (Edward Muller)
Date: Tue, 2 Sep 2008 15:11:35 -0700
Subject: [Linux-cluster] lock_dlm: gdlm_cancel messages?
Message-ID: <590F9A59-C4EC-42F7-B036-34E3245FB9B8@engineyard.com>

We have a customer who we believe is putting excessive locking  
pressure on one of several gfs volumes (9 total across 5 systems).

They've started to get occasional load spikes that seem to show that  
the gfs is "locking" for a minute or two. Without any action on our  
part the load spikes clear and everything continues as normal.

And we've recently seen the following log entries:

Sep  2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
Sep  2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
flags 0
Sep  2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
Sep  2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
flags 0
Sep  2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
Sep  2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
flags 0
Sep  2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
Sep  2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
flags 0
Sep  2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
Sep  2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
flags 0

For all intents and purposes we're running RHCS2 from RHEL 5.2 w/ the  
RHEL 5.2 kernel (2.6.18-92.1.10)

This used to happen to this customer a lot more frequently on RHCS1  
(1.03), but we upgraded them to the above RHCS2 packages and kernel  
and things have been much better.

I'm going to start dumping gfs_tool counters data for the various gfs  
filesystems.

Any advice tracking this down would be useful.

Thanks!


--
Edward Muller
Engine Yard Inc. : Support, Scalability, Reliability
+1.866.518.9273 x209  - Mobile: +1.417.844.2435
IRC: edwardam - XMPP/GTalk: emuller at engineyard.com
Pacific/US



From macscr at macscr.com  Wed Sep  3 10:05:02 2008
From: macscr at macscr.com (Mark Chaney)
Date: Wed, 3 Sep 2008 05:05:02 -0500
Subject: [Linux-cluster] dlm: no local IP address has been set
Message-ID: <001101c90dac$89f858a0$9de909e0$@com>

grr, I am still getting this error every so often:

dlm: no local IP address has been set
dlm: cannot start dlm lowcomms -107

but my hosts files are fine and actually all servers are members of the
clutser
error is only showing on 1 of 3 servers and its always right after they come
back up after being fenced.

I am running CentOS 5.2 and am using the newest stable packages, using yum.
Here is an example of my hosts files:

127.0.0.1       localhost.localdomain   localhost
#::1    localhost6.localdomain6 localhost6
67.xxx.159.xx wheeljac.blah.com wheeljack
67.xxx.159.xx skydive.blah.com skydive
67.xxx.159.xx ratchet.blah.com ratchet
192.168.1.11    wheeljack.local
192.168.1.10    ratchet.local
192.168.1.12    skydive.local

The .local hostnames are the names of my nodes in my cluster.conf. Like I
said, the server joined the cluster fine, but since DLM had the issue, CLVMD
wasn't able to start.

Any help would sincerely be appreciated.

-Mark







From teigland at redhat.com  Wed Sep  3 13:34:48 2008
From: teigland at redhat.com (David Teigland)
Date: Wed, 3 Sep 2008 08:34:48 -0500
Subject: [Linux-cluster] dlm: no local IP address has been set
In-Reply-To: <001101c90dac$89f858a0$9de909e0$@com>
References: <001101c90dac$89f858a0$9de909e0$@com>
Message-ID: <20080903133448.GA22775@redhat.com>

On Wed, Sep 03, 2008 at 05:05:02AM -0500, Mark Chaney wrote:
> grr, I am still getting this error every so often:
> 
> dlm: no local IP address has been set
> dlm: cannot start dlm lowcomms -107

This is generally caused by some previous step failing or not happening
in the whole startup process.  The most proximate cause is dlm_controld
not starting, which would usually be the result of something else like
configfs not being mounted or the dlm kernel module not being loaded.

Dave



From teigland at redhat.com  Wed Sep  3 13:44:33 2008
From: teigland at redhat.com (David Teigland)
Date: Wed, 3 Sep 2008 08:44:33 -0500
Subject: [Linux-cluster] lock_dlm: gdlm_cancel messages?
In-Reply-To: <590F9A59-C4EC-42F7-B036-34E3245FB9B8@engineyard.com>
References: <590F9A59-C4EC-42F7-B036-34E3245FB9B8@engineyard.com>
Message-ID: <20080903134433.GB22775@redhat.com>

On Tue, Sep 02, 2008 at 03:11:35PM -0700, Edward Muller wrote:
> We have a customer who we believe is putting excessive locking  
> pressure on one of several gfs volumes (9 total across 5 systems).
> 
> They've started to get occasional load spikes that seem to show that  
> the gfs is "locking" for a minute or two. Without any action on our  
> part the load spikes clear and everything continues as normal.
> 
> And we've recently seen the following log entries:
> 
> Sep  2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
> Sep  2 12:57:57 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
> flags 0
> Sep  2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
> Sep  2 12:57:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
> flags 0
> Sep  2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
> Sep  2 12:58:40 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
> flags 0
> Sep  2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
> Sep  2 12:58:58 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
> flags 0
> Sep  2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel 1,2 flags 0
> Sep  2 12:59:14 xc88-s00007 kernel: lock_dlm: gdlm_cancel skip 1,2  
> flags 0

FS activity will block while gfs does recovery, and the cancel messages
are also usually due to recovery.  If gfs is doing recovery, you'd see
clear messages about it in /var/log/messages.  Otherwise, I'd check
whether they're using any gfs administrative commands like gfs_tool.

Dave



From macscr at macscr.com  Wed Sep  3 18:33:46 2008
From: macscr at macscr.com (Mark Chaney)
Date: Wed, 3 Sep 2008 13:33:46 -0500
Subject: [Linux-cluster] dlm: no local IP address has been set
In-Reply-To: <20080903133448.GA22775@redhat.com>
References: <001101c90dac$89f858a0$9de909e0$@com>
	<20080903133448.GA22775@redhat.com>
Message-ID: <005f01c90df3$9b30f220$d192d660$@com>

Still not sure what exactly isn't loading, I did find these details quite
odd:

im getting a few different results on each server when I run lsmod:
http://pastebin.ca/1192806

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Wednesday, September 03, 2008 8:35 AM
To: Mark Chaney
Cc: Linux-cluster at redhat.com
Subject: Re: [Linux-cluster] dlm: no local IP address has been set

On Wed, Sep 03, 2008 at 05:05:02AM -0500, Mark Chaney wrote:
> grr, I am still getting this error every so often:
> 
> dlm: no local IP address has been set
> dlm: cannot start dlm lowcomms -107

This is generally caused by some previous step failing or not happening
in the whole startup process.  The most proximate cause is dlm_controld
not starting, which would usually be the result of something else like
configfs not being mounted or the dlm kernel module not being loaded.

Dave



From jbrassow at redhat.com  Wed Sep  3 19:21:56 2008
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Wed, 3 Sep 2008 14:21:56 -0500
Subject: [Linux-cluster] lvcreate: Error locking on node
In-Reply-To: <1220125884.3124.28.camel@blackhouse>
References: <1220125884.3124.28.camel@blackhouse>
Message-ID: <EFA13D07-0A46-4C70-A0F4-FAE670CD6381@redhat.com>

clvm is a way to control/configure your storage in a cluster.  It does  
not provide storage from one machine /to/ a cluster.

There are several ways of making this happen though.  You can use  
GNBD, iSCSI, AOE, fibre channel, or pretty much any other method of  
sharing block devices.  Once all machines in your cluster can see the  
same storage, then you will be able to use CLVM to configure/manage it.

  brassow

On Aug 30, 2008, at 2:51 PM, Alexander Vorobiyov wrote:

> I try to create lvm logical volume that it was accessible through any
> node of my cluster, at refusal of any cluster node. I try to organise
> this storage through gigabit ethernet. clvm can provide these  
> storage in
> this case or I am mistaken?
>
> -- 
> Alexander Vorobiyov
> NTC NOC
> The engineer of communications
> Russia, Ryazan
> +7(4912)901553 ext. 630
> mailto:alexander.vorobiyov at rzn.nex3.ru
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From quickshiftin at gmail.com  Wed Sep  3 21:56:29 2008
From: quickshiftin at gmail.com (Nathan Nobbe)
Date: Wed, 3 Sep 2008 15:56:29 -0600
Subject: [Linux-cluster] practial dlm usage inquiry
Message-ID: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com>

hi all,

my first post on this list.  say, ive been checking out dlm a little bit
lately.  we are trying to build (or leverage) a distributed lock to manage
access to files mounted via nfs.  some of the folks here are telling me that
dlm is not suitable for this, however, my suspicion is that the dlm is
independent of a filesystem.  is this accurate?  for example i imagine the
code could ask for a lock on a given resource, and then if it acquires a
write lock, modify the file.  am i way off base here?  just trying to get my
footing before investing too much time in the wrong approach...

tia,

-nathan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080903/1828225d/attachment.htm>

From egerlach at feds.uwaterloo.ca  Thu Sep  4 01:19:04 2008
From: egerlach at feds.uwaterloo.ca (Eric Gerlach)
Date: Wed, 03 Sep 2008 21:19:04 -0400
Subject: [Linux-cluster] Can't create files bigger than 3864 bytes on GFS
Message-ID: <48BF3788.9090800@feds.uwaterloo.ca>

Hi,

I'm currently trying to set up GFS on a couple of Debian Testing boxes. 
  I've got the GFS setup and mounted, however I'm having trouble writing 
to files.  Using vi, cp, whatever.

If the file is up to 3864 bytes (one inode) it's fine.  But if I try to 
edit a file to make it bigger than that size, or copy in a file larger 
than that, the result is a zero byte file.

I've also tried doing GFS with the lock_nolock mechanism, and I get the 
same result.  The kernel is 2.6.26, and the Debian version of the tools 
is based off of cluster-2.03.06.  My filesystem is GFS, not GFS2.

Has anyone seen anything like this before?  I'm a newbie to GFS, so I'm 
not even sure where to start looking for an answer.  Even some help in 
that direction would be helpful.

Thanks in advance.

Cheers,

-- 
Eric Gerlach, Network Administrator
Federation of Students
University of Waterloo
p: (519) 888-4567 x36329
e: egerlach at feds.uwaterloo.ca



From jeff.sturm at eprize.com  Thu Sep  4 01:57:58 2008
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 3 Sep 2008 21:57:58 -0400
Subject: [Linux-cluster] practial dlm usage inquiry
In-Reply-To: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com>
References: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665FE92B73@hugo.eprize.local>

Nathan,
 
I believe it's possible to use DLM as a general-purpose lock manager.
You may find the following helpful:
 
http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf
 
Jeff


________________________________

	From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Nathan Nobbe
	Sent: Wednesday, September 03, 2008 5:56 PM
	To: linux-cluster at redhat.com
	Subject: [Linux-cluster] practial dlm usage inquiry
	
	
	hi all,
	
	my first post on this list.  say, ive been checking out dlm a
little bit lately.  we are trying to build (or leverage) a distributed
lock to manage access to files mounted via nfs.  some of the folks here
are telling me that dlm is not suitable for this, however, my suspicion
is that the dlm is independent of a filesystem.  is this accurate?  for
example i imagine the code could ask for a lock on a given resource, and
then if it acquires a write lock, modify the file.  am i way off base
here?  just trying to get my footing before investing too much time in
the wrong approach...
	
	tia,
	
	-nathan
	

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080903/520ee584/attachment.htm>

From quickshiftin at gmail.com  Thu Sep  4 03:59:45 2008
From: quickshiftin at gmail.com (Nathan Nobbe)
Date: Wed, 3 Sep 2008 21:59:45 -0600
Subject: [Linux-cluster] practial dlm usage inquiry
In-Reply-To: <64D0546C5EBBD147B75DE133D798665FE92B73@hugo.eprize.local>
References: <7dd2dc0b0809031456i7fd421d6qb2e59ee621f751e7@mail.gmail.com>
	<64D0546C5EBBD147B75DE133D798665FE92B73@hugo.eprize.local>
Message-ID: <7dd2dc0b0809032059j6b248491i3200bef3045fd428@mail.gmail.com>

2008/9/3 Jeff Sturm <jeff.sturm at eprize.com>

>  Nathan,
>
> I believe it's possible to use DLM as a general-purpose lock manager.
>

thanks for corroborating :)


> You may find the following helpful:
>
> http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf
>


i was reading the book earlier today, and im going to start experimenting
tonight.
my intention is to develop a small php extension, to wrap some of the
methods
in the dlm api.

-nathan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080903/5aaf3a43/attachment.htm>

From chris-m-lists at joelly.net  Thu Sep  4 22:08:05 2008
From: chris-m-lists at joelly.net (Chris Joelly)
Date: Fri, 5 Sep 2008 00:08:05 +0200
Subject: [Linux-cluster] Handling of shutdown and mounting of GFS
Message-ID: <20080904220805.GB12807@joysn.joelly.net>

Hello,

i try to get RHCS up and running and have some success so far.

The cluster with 2 nodes is running, but i don't know how to remove one
node the correct way. I can move the active service (an IP address by
now) to the second node and then want to remove the other node from the
running cluster. cman_tool leave remove should be used for this which is
recommended on the RH documentation. But if i try that i get the error
message:

root at store02:/etc/cluster# cman_tool leave remove
cman_tool: Error leaving cluster: Device or resource busy

I cannot figure out which device is busy so that the node is not able
to leave the cluster. The service (IP address) moved to the other node
correctly as i can see using clustat ...

The only way to get out of this problem is to restart the whole cluster
which brings down the service(s) and results in unnecessary fencing...
Is there a known way to remove one node from the cluster without
bringing down the whole cluster?

Another strange thing comes up when i try to use GFS:

i have configured DRBD on a backing HW Raid10 device, use LVM2 to build
a clusteraware VG, and on top of that use LVs and GFS across the two
cluster nodes.

Using the GFS filesystems without noauto in fstab doesn't mount the
filesystems on boot using /etc/init.d/gfs-tools. I think this is due to
the ordering the sysv init scripts are started. All RHCS stuff is
started from within rcS, and drbd is startet from within rc2. I read the
section of the debian-policy to figure out if rcS is meant to run before
rc2, but this isn't mentioned in the policy. So i assume that drbd is
started in rc2 after rcS, which would mean that every filesystem on top
of drbd is not able to mount on boot time...
Can anybody prove this?

The reason why i try to mount a GFS filesystem at boottime is that i
want to build cluster services on top of it, and that services (more
than one) are relying on one fs. A better solution would be to define
a shared GFS filesystem resource which could be used across more than
one cluster services, but the cluster take care that the filesystem is
only mounted once...
Can this be achieved with RHCS?

thanks for any advice ...

-- 
Ubuntu 8.04 LTS 64bit
RHCS 2.0
cluster.conf attached!
-------------- next part --------------
<?xml version="1.0"?>
<cluster alias="store" config_version="47" name="store">
	<fence_daemon post_fail_delay="0" post_join_delay="60"/>
	<clusternodes>
		<clusternode name="store01" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="ap7922" port="1"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="store02" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="ap7922" port="9"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice agent="fence_apc" ipaddr="192.168.2.10" login="***" name="ap7922" passwd="***"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="ip-fail2node2" ordered="1" restricted="0">
				<failoverdomainnode name="store01" priority="1"/>
			</failoverdomain>
			<failoverdomain name="ip-fail2node1" ordered="1">
				<failoverdomainnode name="store02" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="192.168.2.20" monitor_link="1"/>
			<ip address="192.168.2.23" monitor_link="1"/>
		</resources>
		<service autostart="1" domain="ip-fail2node1" name="store02-ip" recovery="restart">
			<ip ref="192.168.2.23"/>
		</service>
		<service autostart="1" domain="ip-fail2node2" name="store01-ip" recovery="restart">
			<ip ref="192.168.2.20"/>
		</service>
	</rm>
</cluster>

From anujhere at gmail.com  Fri Sep  5 11:32:28 2008
From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)_?=)
Date: Fri, 5 Sep 2008 17:02:28 +0530
Subject: [Linux-cluster] how to get my cluster working if my /dev/sda
	becomes /dev/sdb (CLVM, iscsi, ERROR: Module iscsi_sfnet in use)
Message-ID: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com>

Hi,
I configured a cluster using gfs1 on rhel-4 kernel version  2.6.9-55.16.EL.
Using iscsi-target and initiator.
gfs1 mount is exported via nfs service.

I can manually stop all services in following sequence:
nfs, portmap, rgmanager, gfs, clvmd, fenced, cman, ccsd.
to stop my iscsi service first I give 'vgchange -aln' then I stop iscsi
service, otherwise i get an error of module in use, as I have an clusterd
lvm over iscsi device (/dev/sda1)

Everything works fine, but when i am trying to simulate a possible problem,
f.e. iscsi service is stopped I get following error.

Test1:
When cluster is working I stop iscsi service with
 /etc/init.d/iscsi stop
Searching for iscsi-based multipath maps
Found 0 maps
Stopping iscsid:                                           [  OK  ]
Removing iscsi driver: ERROR: Module iscsi_sfnet is in use
                                                           [FAILED]
To stop my iscsi service without a failure,  I stop all cluster services as
follows.
/etc/init.d/nfs stop
/etc/init.d/portmap stop
/etc/init.d/rgmanager stop
/etc/init.d/gfs stop
/etc/init.d/clvmd stop
/etc/init.d/fenced stop
/etc/init.d/cman stop
/etc/init.d/ccsd stop
Every service stops with a ok message. now again when i stop my iscsi
service I get same error
 /etc/init.d/iscsi stop
Removing iscsi driver: ERROR: Module iscsi_sfnet is in
use                             [FAILED]

On my iscsi device (which is /dev/sd1), i have a LVM with gfs1 file-system,
as all the cluster services are stopped, I try to deactivate the lvm with:

 vgchange -aln
  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error
  No volume groups found

At the moment if I start my iscsi service, my /dev/sda becomes /dev/sdb as
well as iscsi service gives me following error:

[root at pr0031 new]# /sbin/service iscsi start
Checking iscsi config:                                     [  OK  ]
Loading iscsi driver:                                      [  OK  ]
mknod: `/dev/iscsictl': File exists
Starting iscsid:                                           [  OK  ]

Sep  5 16:42:37 pr0031 iscsi: iscsi config check succeeded
Sep  5 16:42:37 pr0031 iscsi: Loading iscsi driver:  succeeded
Sep  5 16:42:42 pr0031 iscsid[20732]: version 4:0.1.11-7 variant
(14-Apr-2008)
Sep  5 16:42:42 pr0031 iscsi: iscsid startup succeeded
Sep  5 16:42:42 pr0031 iscsid[20736]: Connected to Discovery Address
192.168.10.199
Sep  5 16:42:42 pr0031 kernel: iscsi-sfnet:host16: Session established
Sep  5 16:42:42 pr0031 kernel: scsi16 : SFNet iSCSI driver
Sep  5 16:42:42 pr0031 kernel:   Vendor: IET       Model: VIRTUAL-DISK
Rev: 0
Sep  5 16:42:42 pr0031 kernel:   Type:   Direct-Access
ANSI SCSI revision: 04
Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr
sectors (1012 MB)
Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through
Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr
sectors (1012 MB)
Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through
Sep  5 16:42:42 pr0031 kernel:  sdb: sdb1
Sep  5 16:42:42 pr0031 kernel: Attached scsi disk sdb at scsi16, channel 0,
id 0, lun 0
Sep  5 16:42:43 pr0031 scsi.agent[20764]: disk at
/devices/platform/host16/target16:0:0/16:0:0:0

As my /dev/sda1 became /dev/sdb1, if i start cluster services, I have no gfs
mount.

clurgmgrd[21062]: <notice> Starting stopped service flx
Sep  5 16:47:16 pr0031 kernel: scsi15 (0:0): rejecting I/O to dead device
Sep  5 16:47:16 pr0031 clurgmgrd: [21062]: <err> 'mount -t gfs
/dev/mapper/VG01-LV01 /u01' failed, error=32
Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <notice> start on
clusterfs:gfsmount_u01 returned 2 (invalid argument(s))
Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <warning> #68: Failed to start flx;
return value: 1
Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <notice> Stopping service flx


After the above situation I need to restart the nodes, which I don't want
to, I created a script to handle all this, in which if i restart all the
services first, first I get the same /dev/sdb ( which should be /dev/sda so
that my cluster can have a gfs mount). When I restart all the services
second time, I get no error (this time iscsi disk is attached with /dev/sda
device name and I don't see any /dev/iscsctl exist error at the iscsi
startup time) and cluster starts working.
my script : http://www.grex.org/~anuj/cluster.txt

So, how to get my cluster working if my /dev/sda becomes /dev/sdb?

Thanks and Regards
Anuj Singh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080905/4539eb5c/attachment.htm>

From theophanis_kontogiannis at yahoo.gr  Fri Sep  5 12:34:09 2008
From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis)
Date: Fri, 5 Sep 2008 15:34:09 +0300
Subject: [Linux-cluster] Problem with disk geometry
Message-ID: <00b101c90f53$b39d9140$1ad8b3c0$@gr>

Hello All,



Though I have posted this to centos hardware forum, I also post it here
since it affects my cluster and drbd setup (the drbd complains that "drbd0:
The peer's disk size is too small!", so the LV cannot become available, so
the cluster services on node-1 do not start).


I have 2.6.18-92.1.10.el5.centos.plus and fdisk (util-linux 2.13-pre7)

My old 80GB disk failed and I had replaced.
However the vendor replaced it with a different model.
My old disk was WD800JB-00JJC0 but the new one is a WD800JB-22JJC0.
Since I must have the new disk with the same partitions as the old one
(apart from RAID-1 software, I also have installed DRBD mirrored with a
second node), I use fdisk to partition it.

But there start the problems:

1. BIOS reports C38308, H16, S255, Landing Zone 38307, Precomp 0

2. Kernel reports hda: 156301488 sectors (80026 MB) w/8192KiB Cache,
CHS=65535/16/63, UDMA(100)

3. fdisk reports 

Disk /dev/hda: 80.0 GB, 80026361856 bytes
16 heads, 63 sectors/track, 155061 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

and the disk turns up like:

Device Boot Start End Blocks Id System
/dev/hda1 * 1 208 104800+ fd Linux raid autodetect
/dev/hda2 209 20554 10254384 fd Linux raid autodetect
/dev/hda3 20555 24500 1988784 83 Linux
/dev/hda4 24501 155061 65802744 83 Linux


The result is that I cannot create the same layout on the disk as the old
disk.
The old disk was like that:

Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 fd Linux raid autodetect
/dev/hda2 14 1288 10241437+ fd Linux raid autodetect
/dev/hda3 1289 1543 2048287+ 82 Linux swap / Solaris
/dev/hda4 1544 9729 65754045 83 Linux

Any help?

Why the is the new disk known in different ways by BIOS, kernel and fdisk?
And how can I fix this?


Thank you All for your Time,
Theophanis Kontogiannis

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080905/15868019/attachment.htm>

From anujhere at gmail.com  Fri Sep  5 13:10:53 2008
From: anujhere at gmail.com (=?UTF-8?Q?Anuj_Singh_(=E0=A4=85=E0=A4=A8=E0=A5=81=E0=A4=9C)_?=)
Date: Fri, 5 Sep 2008 18:40:53 +0530
Subject: [Linux-cluster] Re: how to get my cluster working if my /dev/sda
	becomes /dev/sdb (CLVM, iscsi, ERROR: Module iscsi_sfnet in use)
In-Reply-To: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com>
References: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com>
Message-ID: <3120c9e30809050610o20c46493v3cd6a669eb409990@mail.gmail.com>

Thanks,
changed script a bit, things working now. resetting iscsi service.
But device name order independent will be better.

Thanks and regards
Anuj Singh



On Fri, Sep 5, 2008 at 5:02 PM, Anuj Singh (????) <anujhere at gmail.com>wrote:

> Hi,
> I configured a cluster using gfs1 on rhel-4 kernel version  2.6.9-55.16.EL.
> Using iscsi-target and initiator.
> gfs1 mount is exported via nfs service.
>
> I can manually stop all services in following sequence:
> nfs, portmap, rgmanager, gfs, clvmd, fenced, cman, ccsd.
> to stop my iscsi service first I give 'vgchange -aln' then I stop iscsi
> service, otherwise i get an error of module in use, as I have an clusterd
> lvm over iscsi device (/dev/sda1)
>
> Everything works fine, but when i am trying to simulate a possible problem,
> f.e. iscsi service is stopped I get following error.
>
> Test1:
> When cluster is working I stop iscsi service with
>  /etc/init.d/iscsi stop
> Searching for iscsi-based multipath maps
> Found 0 maps
> Stopping iscsid:                                           [  OK  ]
> Removing iscsi driver: ERROR: Module iscsi_sfnet is in use
>                                                            [FAILED]
> To stop my iscsi service without a failure,  I stop all cluster services as
> follows.
> /etc/init.d/nfs stop
> /etc/init.d/portmap stop
> /etc/init.d/rgmanager stop
> /etc/init.d/gfs stop
> /etc/init.d/clvmd stop
> /etc/init.d/fenced stop
> /etc/init.d/cman stop
> /etc/init.d/ccsd stop
> Every service stops with a ok message. now again when i stop my iscsi
> service I get same error
>  /etc/init.d/iscsi stop
> Removing iscsi driver: ERROR: Module iscsi_sfnet is in
> use                             [FAILED]
>
> On my iscsi device (which is /dev/sd1), i have a LVM with gfs1 file-system,
> as all the cluster services are stopped, I try to deactivate the lvm with:
>
>  vgchange -aln
>   /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error
>   No volume groups found
>
> At the moment if I start my iscsi service, my /dev/sda becomes /dev/sdb as
> well as iscsi service gives me following error:
>
> [root at pr0031 new]# /sbin/service iscsi start
> Checking iscsi config:                                     [  OK  ]
> Loading iscsi driver:                                      [  OK  ]
> mknod: `/dev/iscsictl': File exists
> Starting iscsid:                                           [  OK  ]
>
> Sep  5 16:42:37 pr0031 iscsi: iscsi config check succeeded
> Sep  5 16:42:37 pr0031 iscsi: Loading iscsi driver:  succeeded
> Sep  5 16:42:42 pr0031 iscsid[20732]: version 4:0.1.11-7 variant
> (14-Apr-2008)
> Sep  5 16:42:42 pr0031 iscsi: iscsid startup succeeded
> Sep  5 16:42:42 pr0031 iscsid[20736]: Connected to Discovery Address
> 192.168.10.199
> Sep  5 16:42:42 pr0031 kernel: iscsi-sfnet:host16: Session established
> Sep  5 16:42:42 pr0031 kernel: scsi16 : SFNet iSCSI driver
> Sep  5 16:42:42 pr0031 kernel:   Vendor: IET       Model: VIRTUAL-DISK
> Rev: 0
> Sep  5 16:42:42 pr0031 kernel:   Type:   Direct-Access
> ANSI SCSI revision: 04
> Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr
> sectors (1012 MB)
> Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through
> Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte hdwr
> sectors (1012 MB)
> Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write through
> Sep  5 16:42:42 pr0031 kernel:  sdb: sdb1
> Sep  5 16:42:42 pr0031 kernel: Attached scsi disk sdb at scsi16, channel 0,
> id 0, lun 0
> Sep  5 16:42:43 pr0031 scsi.agent[20764]: disk at
> /devices/platform/host16/target16:0:0/16:0:0:0
>
> As my /dev/sda1 became /dev/sdb1, if i start cluster services, I have no
> gfs mount.
>
> clurgmgrd[21062]: <notice> Starting stopped service flx
> Sep  5 16:47:16 pr0031 kernel: scsi15 (0:0): rejecting I/O to dead device
> Sep  5 16:47:16 pr0031 clurgmgrd: [21062]: <err> 'mount -t gfs
> /dev/mapper/VG01-LV01 /u01' failed, error=32
> Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <notice> start on
> clusterfs:gfsmount_u01 returned 2 (invalid argument(s))
> Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <warning> #68: Failed to start
> flx; return value: 1
> Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <notice> Stopping service flx
>
>
> After the above situation I need to restart the nodes, which I don't want
> to, I created a script to handle all this, in which if i restart all the
> services first, first I get the same /dev/sdb ( which should be /dev/sda so
> that my cluster can have a gfs mount). When I restart all the services
> second time, I get no error (this time iscsi disk is attached with /dev/sda
> device name and I don't see any /dev/iscsctl exist error at the iscsi
> startup time) and cluster starts working.
> my script : http://www.grex.org/~anuj/cluster.txt<http://www.grex.org/%7Eanuj/cluster.txt>
>
> So, how to get my cluster working if my /dev/sda becomes /dev/sdb?
>
> Thanks and Regards
> Anuj Singh
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080905/01ca2242/attachment.htm>

From mharrington at eons.com  Fri Sep  5 13:30:15 2008
From: mharrington at eons.com (Matt Harrington)
Date: Fri, 05 Sep 2008 09:30:15 -0400
Subject: [Linux-cluster] Re: how to get my cluster working if my /dev/sda
	becomes /dev/sdb (CLVM, iscsi, ERROR: Module iscsi_sfnet in use)
In-Reply-To: <3120c9e30809050610o20c46493v3cd6a669eb409990@mail.gmail.com>
References: <3120c9e30809050432r66da26b2rfc5df10cbf241145@mail.gmail.com>
	<3120c9e30809050610o20c46493v3cd6a669eb409990@mail.gmail.com>
Message-ID: <48C13467.7050803@eons.com>

If the problem is device naming, use multipath to create a 
/dev/mapper/something static device name which will always map to a 
particular disk independent of load order.

Anuj Singh (????) wrote:
> Thanks,
> changed script a bit, things working now. resetting iscsi service.
> But device name order independent will be better.
>
> Thanks and regards
> Anuj Singh
>
>
>
> On Fri, Sep 5, 2008 at 5:02 PM, Anuj Singh (????) <anujhere at gmail.com 
> <mailto:anujhere at gmail.com>> wrote:
>
>     Hi,
>     I configured a cluster using gfs1 on rhel-4 kernel version 
>     2.6.9-55.16.EL.
>     Using iscsi-target and initiator.
>     gfs1 mount is exported via nfs service.
>
>     I can manually stop all services in following sequence:
>     nfs, portmap, rgmanager, gfs, clvmd, fenced, cman, ccsd.
>     to stop my iscsi service first I give 'vgchange -aln' then I stop
>     iscsi service, otherwise i get an error of module in use, as I
>     have an clusterd lvm over iscsi device (/dev/sda1)
>
>     Everything works fine, but when i am trying to simulate a possible
>     problem, f.e. iscsi service is stopped I get following error.
>
>     Test1:
>     When cluster is working I stop iscsi service with
>      /etc/init.d/iscsi stop
>     Searching for iscsi-based multipath maps
>     Found 0 maps
>     Stopping iscsid:                                           [  OK  ]
>     Removing iscsi driver: ERROR: Module iscsi_sfnet is in use
>                                                                [FAILED]
>     To stop my iscsi service without a failure,  I stop all cluster
>     services as follows.
>     /etc/init.d/nfs stop
>     /etc/init.d/portmap stop
>     /etc/init.d/rgmanager stop
>     /etc/init.d/gfs stop
>     /etc/init.d/clvmd stop
>     /etc/init.d/fenced stop
>     /etc/init.d/cman stop
>     /etc/init.d/ccsd stop
>     Every service stops with a ok message. now again when i stop my
>     iscsi service I get same error
>      /etc/init.d/iscsi stop
>     Removing iscsi driver: ERROR: Module iscsi_sfnet is in
>     use                             [FAILED]
>
>     On my iscsi device (which is /dev/sd1), i have a LVM with gfs1
>     file-system,
>     as all the cluster services are stopped, I try to deactivate the
>     lvm with:
>
>      vgchange -aln
>       /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error
>       No volume groups found
>
>     At the moment if I start my iscsi service, my /dev/sda becomes
>     /dev/sdb as well as iscsi service gives me following error:
>
>     [root at pr0031 new]# /sbin/service iscsi start
>     Checking iscsi config:                                     [  OK  ]
>     Loading iscsi driver:                                      [  OK  ]
>     mknod: `/dev/iscsictl': File exists
>     Starting iscsid:                                           [  OK  ]
>
>     Sep  5 16:42:37 pr0031 iscsi: iscsi config check succeeded
>     Sep  5 16:42:37 pr0031 iscsi: Loading iscsi driver:  succeeded
>     Sep  5 16:42:42 pr0031 iscsid[20732]: version 4:0.1.11-7 variant
>     (14-Apr-2008)
>     Sep  5 16:42:42 pr0031 iscsi: iscsid startup succeeded
>     Sep  5 16:42:42 pr0031 iscsid[20736]: Connected to Discovery
>     Address 192.168.10.199 <http://192.168.10.199>
>     Sep  5 16:42:42 pr0031 kernel: iscsi-sfnet:host16: Session established
>     Sep  5 16:42:42 pr0031 kernel: scsi16 : SFNet iSCSI driver
>     Sep  5 16:42:42 pr0031 kernel:   Vendor: IET       Model:
>     VIRTUAL-DISK      Rev: 0  
>     Sep  5 16:42:42 pr0031 kernel:   Type:  
>     Direct-Access                      ANSI SCSI revision: 04
>     Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte
>     hdwr sectors (1012 MB)
>     Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write
>     through
>     Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: 1975932 512-byte
>     hdwr sectors (1012 MB)
>     Sep  5 16:42:42 pr0031 kernel: SCSI device sdb: drive cache: write
>     through
>     Sep  5 16:42:42 pr0031 kernel:  sdb: sdb1
>     Sep  5 16:42:42 pr0031 kernel: Attached scsi disk sdb at scsi16,
>     channel 0, id 0, lun 0
>     Sep  5 16:42:43 pr0031 scsi.agent[20764]: disk at
>     /devices/platform/host16/target16:0:0/16:0:0:0
>
>     As my /dev/sda1 became /dev/sdb1, if i start cluster services, I
>     have no gfs mount.
>
>     clurgmgrd[21062]: <notice> Starting stopped service flx
>     Sep  5 16:47:16 pr0031 kernel: scsi15 (0:0): rejecting I/O to dead
>     device
>     Sep  5 16:47:16 pr0031 clurgmgrd: [21062]: <err> 'mount -t gfs 
>     /dev/mapper/VG01-LV01 /u01' failed, error=32
>     Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <notice> start on
>     clusterfs:gfsmount_u01 returned 2 (invalid argument(s))
>     Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <warning> #68: Failed to
>     start flx; return value: 1
>     Sep  5 16:47:16 pr0031 clurgmgrd[21062]: <notice> Stopping service
>     flx
>
>
>     After the above situation I need to restart the nodes, which I
>     don't want to, I created a script to handle all this, in which if
>     i restart all the services first, first I get the same /dev/sdb (
>     which should be /dev/sda so that my cluster can have a gfs mount).
>     When I restart all the services second time, I get no error (this
>     time iscsi disk is attached with /dev/sda device name and I don't
>     see any /dev/iscsctl exist error at the iscsi startup time) and
>     cluster starts working.
>     my script : http://www.grex.org/~anuj/cluster.txt
>     <http://www.grex.org/%7Eanuj/cluster.txt>
>
>     So, how to get my cluster working if my /dev/sda becomes /dev/sdb?
>
>     Thanks and Regards
>     Anuj Singh
>
>
>
>
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080905/c541dea9/attachment.htm>

From ajeet.singh.raina at logica.com  Mon Sep  8 14:25:46 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Mon, 8 Sep 2008 19:55:46 +0530
Subject: [Linux-cluster] RPC Issue?
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17B00@in-ex004.groupinfra.com>



_____________________________________________
From: Singh Raina, Ajeet 
Sent: Monday, September 08, 2008 6:45 PM
To: 'linux clustering'
Subject: RPC: failed to contact portmap (errno -5)

I have setup two Node Cluster Setup and while trying to failover
manually,its throwing the following error:


[code]

Sep  8 15:00:57 10.14.26.133 rgmanager: [31864]: <notice> Shutting down
Cluster Service Manager...
Sep  8 15:00:57 10.14.26.133 clurgmgrd[13694]: <notice> Shutting down
Sep  8 15:00:57 10.14.26.133 clurgmgrd[13694]: <notice> Stopping service
Mplus
Sep  8 15:00:57 10.14.26.133 clurgmgrd: [13694]: <info> Executing
/home/fsadmin/featureserver/scripts/mplus_control_script.sh stop
Sep  8 15:01:00 10.14.26.133  nfsd: last server has exited
Sep  8 15:01:00 10.14.26.133  nfsd: unexporting all filesystems
Sep  8 15:01:00 10.14.26.133  rpciod: active tasks at shutdown?!
Sep  8 15:01:00 10.14.26.133  RPC: failed to contact portmap (errno -5).
Sep  8 15:01:00 10.14.26.133 clurgmgrd: [13694]: <info> Removing IPv4
address 10.14.26.139 from bond0
Sep  8 15:01:00 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/data/Xml
Sep  8 15:01:00 10.14.26.133 clurgmgrd: [13694]: <warning> Device
/dev/sda2 is mounted on /usr/users/fsadmin/archive instead of
/home/fsadmin/archive
Sep  8 15:01:00 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/home/fsadmin/archive
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <warning> Device
/dev/sda8 is mounted on /usr/users/fsadmin/featureserver/config instead
of /home/fsadmin/featureserver/config
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/home/fsadmin/featureserver/config
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/var/lib/mysql
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <warning> Device
/dev/sda6 is mounted on /usr/users/fsadmin/mysql/logs instead of
/home/fsadmin/mysql/logs
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/home/fsadmin/mysql/logs
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <warning> Device
/dev/sda5 is mounted on /usr/users/fsadmin/mysql/data instead of
/home/fsadmin/mysql/data
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/home/fsadmin/mysql/data
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <warning> Device
/dev/sda3 is mounted on /usr/users/fsadmin/cdrs instead of
/home/fsadmin/cdrs
Sep  8 15:01:01 10.14.26.133 clurgmgrd: [13694]: <info> unmounting
/home/fsadmin/cdrs
Sep  8 15:01:01 10.14.26.133 clurgmgrd[13694]: <notice> Service Mplus is
stopped
Sep  8 15:01:01 10.14.26.133 clurgmgrd[13694]: <notice> Shutdown
complete, exiting
Sep  8 15:01:02 10.14.26.133 rgmanager: [31864]: <notice> Cluster
Service Manager is stopped.
Sep  8 15:01:02 10.14.26.133  WARNING: dlm_emergency_shutdown
Sep  8 15:01:02 10.14.26.133  WARNING: dlm_emergency_shutdown
Sep  8 15:01:07 10.14.26.133  CMAN 2.6.9-53.16 (built Jul 15 2008
14:07:56) installed
Sep  8 15:01:07 10.14.26.133  DLM 2.6.9-52.12 (built Jul 15 2008
14:34:18) installed

[/code]

The Above Bold Letters indicates the Error Message I am not able to
trace out.
I have only one service called Mplus which through the script is
starting up.

Please Help me with the Issue.


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080908/f7a9193f/attachment.htm>

From federico.simoncelli at gmail.com  Mon Sep  8 15:23:27 2008
From: federico.simoncelli at gmail.com (Federico Simoncelli)
Date: Mon, 8 Sep 2008 17:23:27 +0200
Subject: [Linux-cluster] Cluster Logwatch
Message-ID: <a01fe36d0809080823u399c60e1se179859c820ae3e1@mail.gmail.com>

Hi all. Do you know if anyone has ever written a logwatch
configuration/script for the cluster services?
If not I'm going to start writing one on my own (is anyone interested
in helping?).

Thanks.
-- 
Federico.



From lhh at redhat.com  Mon Sep  8 15:36:24 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 08 Sep 2008 11:36:24 -0400
Subject: [Linux-cluster] [PATCH][RESEND] Add network interface select
	option for fence_xvmd
In-Reply-To: <20080822081013.GA15358@localhost.localdomain>
References: <20080822081013.GA15358@localhost.localdomain>
Message-ID: <1220888184.4540.0.camel@ayanami>

On Fri, 2008-08-22 at 17:10 +0900, Satoru SATOH wrote:
> Hello,
> 
> 
> I updated my patch for fence_xvmd to add network interface select option
> posted before.
> 
> This patch fixes the following issues ATST:
> 
>  1. fence_xvmd selects wrong network interface to listen on if host has
>     multiple interfaces and target interface is not for default route.
>     As a result, fence_xvmd does not repond to fence_xvm's request.
>  2. fence_xvmd cannot start if default route is not set.
> 
> The following patch is for cluster-3 HEAD.
> 
> The same problem exists in cluster-2 (rhel5's cluster) and I opened
> bugzilla bug for that version: rhbz#459720.
> 

Merged.  Sorry for the delay.

-- Lon




From mockey.chen at nsn.com  Tue Sep  9 06:58:40 2008
From: mockey.chen at nsn.com (Chen, Mockey (NSN - CN/Cheng Du))
Date: Tue, 9 Sep 2008 14:58:40 +0800
Subject: [Linux-cluster] How to config tie-breaker IP in RHEL 5.2
Message-ID: <174CED94DD8DC54AB888B56E103B118725D1A2@CNBEEXC007.nsn-intra.net>

OS: RHEL 5.2.

I have configured a two node cluster without shared disk, actually I
have no shared disk in deployment.

I want to even one of the node down, the other node can still provide
service. 

I know there is trick called quorum disk and tie-breaker IP, since I
have no shared disk resource, I want to use tie-breaker IP, but I can
not find any information about tie-breaker IP in RHCS document.

Any idea ?

Thanks.

Chen Ming
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080909/38827e6d/attachment.htm>

From cryptogrid at gmail.com  Tue Sep  9 12:55:56 2008
From: cryptogrid at gmail.com (crypto grid)
Date: Tue, 9 Sep 2008 09:55:56 -0300
Subject: [Linux-cluster] Two node cluster, quorum disk?
Message-ID: <a9f464b80809090555m24f84de1i705a554409674f5d@mail.gmail.com>

Hi all, can anyone tell me if it's mandatory the use of a quorum disk on a
two node cluster. Both nodes are connected to a SAN via two HBA's (I'm using
multipath).
The cluster will be configured in active/passive mode, supporting a database
service.

>From red hat documentation: "Configuring qdiskd is not required unless you
have special requirements for node health."

Any recommendations or suggestions will be appreciated.

Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080909/51567751/attachment.htm>

From stpierre at NebrWesleyan.edu  Tue Sep  9 14:56:42 2008
From: stpierre at NebrWesleyan.edu (Chris St. Pierre)
Date: Tue, 9 Sep 2008 09:56:42 -0500 (CDT)
Subject: [Linux-cluster] Two node cluster, quorum disk?
In-Reply-To: <a9f464b80809090555m24f84de1i705a554409674f5d@mail.gmail.com>
References: <a9f464b80809090555m24f84de1i705a554409674f5d@mail.gmail.com>
Message-ID: <alpine.LFD.1.10.0809090956270.23269@grunthos.NebrWesleyan.edu>

On Tue, 9 Sep 2008, crypto grid wrote:

> Hi all, can anyone tell me if it's mandatory the use of a quorum disk on a
> two node cluster.

http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdiskneeded

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University



From gspiegl at gmx.at  Tue Sep  9 15:51:15 2008
From: gspiegl at gmx.at (Gerhard Spiegl)
Date: Tue, 09 Sep 2008 17:51:15 +0200
Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition
Message-ID: <48C69B73.4000805@gmx.at>

Hello all!

We are trying to set up a 20 node cluster and want to use a 
"ping-heuristic" and a heuristic that checks the state of
the fiberchannel ports.

Is it possible to use qdisk heuristics without a dedicated
quorum partition, as this setup would only support 16 nodes?

qdiskd fails starting:
Sep  9 14:37:15 ols017c qdiskd[25339]: <info> Quorum Daemon Initializing
Sep  9 14:37:15 ols017c qdiskd[25339]: <crit> Initialization failed

Although "service qdiskd  start" seems to be successfull (OK).

# qdiskd -fd
[4248] debug: Loading configuration information
[4248] debug: Heuristic: 'ping -c1 -t3 172.27.111.254' score=1 interval=2 tko=1
[4248] debug: 1 heuristics loaded
[4248] debug: Quorum Daemon: 1 heuristics, 1 interval, 10 tko, 1 votes
[4248] debug: Run Flags: 00000035
[4248] info: Quorum Daemon Initializing
stat: Bad address
[4248] crit: Initialization failed

[[snip cluster.conf]]
<quorumd log_level="7" votes="1">
        <heuristic program="ping -c1 -t3 172.27.111.254"/>
	<heuristic program="check_fc.sh" score="1"/>
</quorumd>
[[/snip]]

Already tried experimenting with interval and tko but without success.

Any help would be appreciated.

cheers
Gerhard







From kanderso at redhat.com  Tue Sep  9 18:08:59 2008
From: kanderso at redhat.com (Kevin Anderson)
Date: Tue, 09 Sep 2008 13:08:59 -0500
Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition
In-Reply-To: <48C69B73.4000805@gmx.at>
References: <48C69B73.4000805@gmx.at>
Message-ID: <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com>

On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote:
> Hello all!
> 
> We are trying to set up a 20 node cluster and want to use a 
> "ping-heuristic" and a heuristic that checks the state of
> the fiberchannel ports.

What actions do you want to take place based on these heuristics?

> 
> Is it possible to use qdisk heuristics without a dedicated
> quorum partition, as this setup would only support 16 nodes?
> 
There is a 16 node limitation to qdisk primarily because we think
performance hitting the same small number of blocks on the disk by that
many nodes will be abysmal.  Lon would know, but probably a value you
could change and play with in the code.  

Am more interested in what problem you are trying to solve with the
heuristics?  It doesn't seem to be quorum related as the normal
cman/openais capabilities will work fine with that number of nodes.  If
you are worried about split sites, just add an additional node to the
cluster that is some other location.  The node would only be used for
quorum votes.

Kevin




From gspiegl at gmx.at  Tue Sep  9 19:19:15 2008
From: gspiegl at gmx.at (Gerhard Spiegl)
Date: Tue, 09 Sep 2008 21:19:15 +0200
Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition
In-Reply-To: <1220983739.4295.10.camel@dhcp80-204.msp.redhat.com>
References: <48C69B73.4000805@gmx.at>
	<1220983739.4295.10.camel@dhcp80-204.msp.redhat.com>
Message-ID: <48C6CC33.10604@gmx.at>

Hello Kevin,

thanks for your reply.

Kevin Anderson wrote:
> On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote:
>> Hello all!
>>
>> We are trying to set up a 20 node cluster and want to use a 
>> "ping-heuristic" and a heuristic that checks the state of
>> the fiberchannel ports.
> 
> What actions do you want to take place based on these heuristics?

The node should get fenced (or fence/reboot itself) if the public
interface (bond0) looses connection - or both paths (dm-multipath)
to the storage get lost.

Without quorum device:
We faced the problem of complete loss of storage connectivity resulting
the GFS to withdraw (only when IO is issued on it (we only use GFS for
xen vm definition files)), causing GFS and CLVM to lockup and never
released. Only the manual reboot/halt solves the situation (in addition
the specific node gets fenced after poweroff - trifle to late ;)).

With quorum device:
The node loosing the storage gets fenced because it looses the qdisk.

Obviousley, but >16 nodes qdisk is not an option, so I wrote a small shell
script to check the fiberchannel paths. The idea is when FC is lost, the node
fences/reboots itself. But the heuristics only work when a "device=/dev/dm-8"
is specified in cluster.conf <quorumd ..> tag. Without it the qdiskd refuses to
start.

>> Is it possible to use qdisk heuristics without a dedicated
>> quorum partition, as this setup would only support 16 nodes?
>>
> There is a 16 node limitation to qdisk primarily because we think
> performance hitting the same small number of blocks on the disk by that
> many nodes will be abysmal.  Lon would know, but probably a value you
> could change and play with in the code.  

I read about this in the cluster wiki/FAQ and it sounds comprehensible,
also we dont want to play around in the source as our goal is a fully
supported configuration by RedHat.com

> Am more interested in what problem you are trying to solve with the
> heuristics?  It doesn't seem to be quorum related as the normal
> cman/openais capabilities will work fine with that number of nodes.

It seems not to, as stated the loss of storage connectivity causes the
whole cluster to disfunction, wich is not expected.
If it helps I will send cluster.conf tomorrow as I m off the office today
( CET ).

Maybe there is another way of detecting the storage failure but I couldnt
find any docs. Also I would be glad if you could point me to a more
comprehensive documentation anywhere on the net.

> you are worried about split sites, just add an additional node to the
> cluster that is some other location.  The node would only be used for
> quorum votes.

I am not sure what you mean with split sites (split brain?), but thats not the
issue. Do you mean an additional node without any service or failoverdomain
configured?

regards
Gerhard




From markwag at u.washington.edu  Tue Sep  9 19:24:30 2008
From: markwag at u.washington.edu (Mark Wagner)
Date: Tue, 9 Sep 2008 12:24:30 -0700
Subject: [Linux-cluster] Re: Two node cluster, quorum disk?
In-Reply-To: <20080909160016.01AF68E07C6@hormel.redhat.com>
References: <20080909160016.01AF68E07C6@hormel.redhat.com>
Message-ID: <20080909192430.GM3470@n-its-markwag2.mcis.washington.edu>

On Tue, Sep 09, 2008 at 12:00:16PM -0400, linux-cluster-request at redhat.com wrote:
> On Tue, 9 Sep 2008, crypto grid wrote:
> 
> > Hi all, can anyone tell me if it's mandatory the use of a quorum disk on a
> > two node cluster.
> 
> http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdiskneeded

That answer may need a little editing since it seems to be saying two
different things.

	Note that if you configure a quorum disk/partition, you want
	two_node="1" or expected_votes="2" since the quorum disk solves
	the voting imbalance. You want two_node="0" and expected_votes="3"
	(or nodes + 1 if it's not a two-node cluster).

-- 
Mark Wagner <markwag at u.washington.edu>
System Administrator, UW Medicine IT Services
206-616-6119



From kanderso at redhat.com  Tue Sep  9 19:35:01 2008
From: kanderso at redhat.com (Kevin Anderson)
Date: Tue, 09 Sep 2008 14:35:01 -0500
Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition
In-Reply-To: <48C6CC33.10604@gmx.at>
References: <48C69B73.4000805@gmx.at>
	<1220983739.4295.10.camel@dhcp80-204.msp.redhat.com>
	<48C6CC33.10604@gmx.at>
Message-ID: <1220988901.4295.31.camel@dhcp80-204.msp.redhat.com>

On Tue, 2008-09-09 at 21:19 +0200, Gerhard Spiegl wrote:
> Hello Kevin,
> 
> thanks for your reply.
> 
> Kevin Anderson wrote:
> > On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote:
> >> Hello all!
> >>
> >> We are trying to set up a 20 node cluster and want to use a 
> >> "ping-heuristic" and a heuristic that checks the state of
> >> the fiberchannel ports.
> > 
> > What actions do you want to take place based on these heuristics?
> 
> The node should get fenced (or fence/reboot itself) if the public
> interface (bond0) looses connection - or both paths (dm-multipath)
> to the storage get lost.
> 
> Without quorum device:
> We faced the problem of complete loss of storage connectivity resulting
> the GFS to withdraw (only when IO is issued on it (we only use GFS for
> xen vm definition files)), causing GFS and CLVM to lockup and never
> released. Only the manual reboot/halt solves the situation (in addition
> the specific node gets fenced after poweroff - trifle to late ;)).

You can avoid the withdraw and force a panic by using the debug mount
option for your GFS filesystems.  With debug set, GFS will when getting
an I/O error, panic the system effectively self fencing the node.  The
reason behind withdraw was to give the operator a chance to gracefully
remove the node from the cluster after filesystem failure.  This is
useful when multiple filesystems are mounted with multiple storage
devices.  A withdraw always requires rebooting the node to recover.
However, in your case, panic action is probably what you want.  We
recently opened a new bugzilla for a new feature to give you better
control of the options in this case.  
https://bugzilla.redhat.com/show_bug.cgi?id=461065

Anyway, the debug mount option should avoid the situation you are
describing.

> 
> > you are worried about split sites, just add an additional node to the
> > cluster that is some other location.  The node would only be used for
> > quorum votes.
> 
> I am not sure what you mean with split sites (split brain?), but thats not the
> issue. Do you mean an additional node without any service or failoverdomain
> configured?

With split sites and an even number of nodes, you could end up in the
situation that if an entire site goes down, you no longer have cluster
quorum.  Having an extra and therefor odd number of nodes in the cluster
would enable the cluster to continue to operate at the remaining site.  

But, not the problem you were trying to solve in this case.  Try -o
debug on your GFS mount options.

Thanks
Kevin



From gspiegl at gmx.at  Tue Sep  9 19:59:38 2008
From: gspiegl at gmx.at (Gerhard Spiegl)
Date: Tue, 09 Sep 2008 21:59:38 +0200
Subject: [Linux-cluster] Use qdisk heuristics w/o a quorum device/partition
In-Reply-To: <1220988901.4295.31.camel@dhcp80-204.msp.redhat.com>
References: <48C69B73.4000805@gmx.at>	<1220983739.4295.10.camel@dhcp80-204.msp.redhat.com>	<48C6CC33.10604@gmx.at>
	<1220988901.4295.31.camel@dhcp80-204.msp.redhat.com>
Message-ID: <48C6D5AA.8080706@gmx.at>

Kevin Anderson wrote:
> You can avoid the withdraw and force a panic by using the debug mount
> option for your GFS filesystems.  With debug set, GFS will when getting
> an I/O error, panic the system effectively self fencing the node.  The
> reason behind withdraw was to give the operator a chance to gracefully
> remove the node from the cluster after filesystem failure.  This is
> useful when multiple filesystems are mounted with multiple storage
> devices.  A withdraw always requires rebooting the node to recover.
> However, in your case, panic action is probably what you want.  We
> recently opened a new bugzilla for a new feature to give you better
> control of the options in this case.  
> https://bugzilla.redhat.com/show_bug.cgi?id=461065
> 
> Anyway, the debug mount option should avoid the situation you are
> describing.

If it does, it is exactly what we were looking for. In fact GFS reported an IO
error in syslog (and on "ls" "df" ...), but only the "nice" withdraw happened.
The only thing we found out was that passing the -w option to gfs_controld
(init.d/cman) would avoid withdrawing GFS. We expected the kernel to panic but
the only result was a puny syslog message :)
Tomorrow I will try adding "debug" to the mount opts in fstab.

> With split sites and an even number of nodes, you could end up in the
> situation that if an entire site goes down, you no longer have cluster
> quorum.  Having an extra and therefor odd number of nodes in the cluster
> would enable the cluster to continue to operate at the remaining site.
Will keep this in mind, may become handy someday.

> 
> Thanks
> Kevin

Thank You!
Gerhard




From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 09:34:12 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 10:34:12 +0100
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 status from....
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl>

Hi
I have justed started using cluster tools for the first time so my
question might be a bit obvious to the expierenced eye.

I am getting the following error when trying to do config changes using
luci.
Unable to retrieve batch 423452352 status from
longapa02alt.gta.travel:11111: ccs_tool failed to propagate conf: unable
to connect to the CCS daemon: connection refused Failed to update config
file.

I get the following error in /var/log/messages
Sep 10 10:29:56 LONGAPA02ALT ccsd[11151]: Unable to connect to cluster
infrastructure after 3570 seconds

I am unsure what is causing this. 

I was able to initially create the cluster config and then it stopped
working. Any suggestions as to what is causing this would be very much
appreciated. 

I do get results from the ccs_tool so some parts are at least working...

 /sbin/ccs_tool lsnode

Cluster name: apache-ha, config_version: 1

Nodename                        Votes Nodeid Fencetype
longapa02alt.gta.travel.lcl        1    1
longapa02blt.gta.travel.lcl        1    2

Regards

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From mockey.chen at nsn.com  Wed Sep 10 09:56:51 2008
From: mockey.chen at nsn.com (Chen, Mockey (NSN - CN/Cheng Du))
Date: Wed, 10 Sep 2008 17:56:51 +0800
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 status
	from....
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl>
Message-ID: <174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net>

 

>-----Original Message-----
>From: linux-cluster-bounces at redhat.com 
>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext 
>Gerhardus.Geldenhuis at gta-travel.com
>Sent: Wednesday, September 10, 2008 5:34 PM
>To: linux-cluster at redhat.com
>Subject: [Linux-cluster] Unable to retrieve batch 1776334432 
>status from....
>
>Hi
>I have justed started using cluster tools for the first time 
>so my question might be a bit obvious to the expierenced eye.
>
>I am getting the following error when trying to do config 
>changes using luci.
>Unable to retrieve batch 423452352 status from
>longapa02alt.gta.travel:11111: ccs_tool failed to propagate 
>conf: unable to connect to the CCS daemon: connection refused 
>Failed to update config file.
>

I maybe due to you not start ccsd daemon. try to use 
ccs_test connect 
to test whether you can connect your ccsd daemon.

In RHCS 5, you can start ccsd by following command:
service cman start




From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 10:05:34 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 11:05:34 +0100
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom....
In-Reply-To: <174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl>
	<174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl>

Thanks,
I get the following error:

service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
cman not started: Can't find local node name in cluster.conf
/usr/sbin/cman_tool: aisexec daemon didn't start
                                                           [FAILED]

I am looking into this atm, but any suggestions welcomed.

Regards
 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chen, 
> Mockey (NSN - CN/Cheng Du)
> Sent: 10 September 2008 10:57
> To: linux clustering
> Subject: RE: [Linux-cluster] Unable to retrieve batch 
> 1776334432 statusfrom....
> 
> 
>  
> 
> >-----Original Message-----
> >From: linux-cluster-bounces at redhat.com 
> >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext 
> >Gerhardus.Geldenhuis at gta-travel.com
> >Sent: Wednesday, September 10, 2008 5:34 PM
> >To: linux-cluster at redhat.com
> >Subject: [Linux-cluster] Unable to retrieve batch 1776334432 status 
> >from....
> >
> >Hi
> >I have justed started using cluster tools for the first time so my 
> >question might be a bit obvious to the expierenced eye.
> >
> >I am getting the following error when trying to do config 
> changes using 
> >luci.
> >Unable to retrieve batch 423452352 status from
> >longapa02alt.gta.travel:11111: ccs_tool failed to propagate
> >conf: unable to connect to the CCS daemon: connection 
> refused Failed to 
> >update config file.
> >
> 
> I maybe due to you not start ccsd daemon. try to use 
> ccs_test connect 
> to test whether you can connect your ccsd daemon.
> 
> In RHCS 5, you can start ccsd by following command:
> service cman start
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 11:14:12 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 12:14:12 +0100
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom....
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl>

I am not sure what needs to be added to the config file for it to work.

My config file looks like this atm:
<?xml version="1.0"?>
<cluster alias="apache-ha" config_version="1" name="apache-ha">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="longapa02alt.gta.travel.lcl"
nodeid="1" votes="1"/>
                <clusternode name="longapa02blt.gta.travel.lcl"
nodeid="2" votes="1"/>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm/>
</cluster>

The hostname has been set in uppercase on the system, which I thought
might be a problem but changing the config file to uppercase hostnames
has not made a difference.

Regards

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Gerhardus.Geldenhuis at gta-travel.com
> Sent: 10 September 2008 11:06
> To: linux-cluster at redhat.com
> Subject: RE: [Linux-cluster] Unable to retrieve batch 
> 1776334432 statusfrom....
> 
> Thanks,
> I get the following error:
> 
> service cman start
> Starting cluster:
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... failed
> cman not started: Can't find local node name in cluster.conf
> /usr/sbin/cman_tool: aisexec daemon didn't start
>                                                            [FAILED]
> 
> I am looking into this atm, but any suggestions welcomed.
> 
> Regards
>  
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chen, Mockey 
> > (NSN - CN/Cheng Du)
> > Sent: 10 September 2008 10:57
> > To: linux clustering
> > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > 1776334432 statusfrom....
> > 
> > 
> >  
> > 
> > >-----Original Message-----
> > >From: linux-cluster-bounces at redhat.com 
> > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext 
> > >Gerhardus.Geldenhuis at gta-travel.com
> > >Sent: Wednesday, September 10, 2008 5:34 PM
> > >To: linux-cluster at redhat.com
> > >Subject: [Linux-cluster] Unable to retrieve batch 
> 1776334432 status 
> > >from....
> > >
> > >Hi
> > >I have justed started using cluster tools for the first time so my 
> > >question might be a bit obvious to the expierenced eye.
> > >
> > >I am getting the following error when trying to do config
> > changes using
> > >luci.
> > >Unable to retrieve batch 423452352 status from
> > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate
> > >conf: unable to connect to the CCS daemon: connection
> > refused Failed to
> > >update config file.
> > >
> > 
> > I maybe due to you not start ccsd daemon. try to use 
> ccs_test connect 
> > to test whether you can connect your ccsd daemon.
> > 
> > In RHCS 5, you can start ccsd by following command:
> > service cman start
> > 
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit 
> http://www.messagelabs.com/email 
> ______________________________________________________________________
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 12:45:16 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 13:45:16 +0100
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom....
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl>

Progress report,
It looks like for some or other reason the name of the cluster is not
set properly in the cman startup script. I have set it manually.

Next problem is:
cman not started: Can't connect to CCSD /usr/sbin/cman_tool: aisexec
daemon didn't start

Regards 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Gerhardus.Geldenhuis at gta-travel.com
> Sent: 10 September 2008 12:14
> To: linux-cluster at redhat.com
> Subject: RE: [Linux-cluster] Unable to retrieve batch 
> 1776334432 statusfrom....
> 
> I am not sure what needs to be added to the config file for 
> it to work.
> 
> My config file looks like this atm:
> <?xml version="1.0"?>
> <cluster alias="apache-ha" config_version="1" name="apache-ha">
>         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="longapa02alt.gta.travel.lcl"
> nodeid="1" votes="1"/>
>                 <clusternode name="longapa02blt.gta.travel.lcl"
> nodeid="2" votes="1"/>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices/>
>         <rm/>
> </cluster>
> 
> The hostname has been set in uppercase on the system, which I 
> thought might be a problem but changing the config file to 
> uppercase hostnames has not made a difference.
> 
> Regards
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> > Gerhardus.Geldenhuis at gta-travel.com
> > Sent: 10 September 2008 11:06
> > To: linux-cluster at redhat.com
> > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > 1776334432 statusfrom....
> > 
> > Thanks,
> > I get the following error:
> > 
> > service cman start
> > Starting cluster:
> >    Loading modules... done
> >    Mounting configfs... done
> >    Starting ccsd... done
> >    Starting cman... failed
> > cman not started: Can't find local node name in cluster.conf
> > /usr/sbin/cman_tool: aisexec daemon didn't start
> >                                                            [FAILED]
> > 
> > I am looking into this atm, but any suggestions welcomed.
> > 
> > Regards
> >  
> > 
> > > -----Original Message-----
> > > From: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Chen, Mockey 
> > > (NSN - CN/Cheng Du)
> > > Sent: 10 September 2008 10:57
> > > To: linux clustering
> > > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > > 1776334432 statusfrom....
> > > 
> > > 
> > >  
> > > 
> > > >-----Original Message-----
> > > >From: linux-cluster-bounces at redhat.com 
> > > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext 
> > > >Gerhardus.Geldenhuis at gta-travel.com
> > > >Sent: Wednesday, September 10, 2008 5:34 PM
> > > >To: linux-cluster at redhat.com
> > > >Subject: [Linux-cluster] Unable to retrieve batch
> > 1776334432 status
> > > >from....
> > > >
> > > >Hi
> > > >I have justed started using cluster tools for the first 
> time so my 
> > > >question might be a bit obvious to the expierenced eye.
> > > >
> > > >I am getting the following error when trying to do config
> > > changes using
> > > >luci.
> > > >Unable to retrieve batch 423452352 status from
> > > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate
> > > >conf: unable to connect to the CCS daemon: connection
> > > refused Failed to
> > > >update config file.
> > > >
> > > 
> > > I maybe due to you not start ccsd daemon. try to use
> > ccs_test connect
> > > to test whether you can connect your ccsd daemon.
> > > 
> > > In RHCS 5, you can start ccsd by following command:
> > > service cman start
> > > 
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > 
> > 
> ______________________________________________________________________
> > This email has been scanned by the MessageLabs Email 
> Security System.
> > For more information please visit
> > http://www.messagelabs.com/email
> > 
> ______________________________________________________________________
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit 
> http://www.messagelabs.com/email 
> ______________________________________________________________________
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 13:23:03 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 14:23:03 +0100
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom....
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2414@LONSEXC01.gta.travel.lcl>

For the benefit of future googlers.

I managed to get the cman service started without any error messages. I
also removed my customizations from the init script.

It looks like the cman service is not able to detect whether services it
starts up has been started and then fails.

I ended up stopping all services having only the following running:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address
State       PID/Program name
tcp        0      0 0.0.0.0:11111               0.0.0.0:*
LISTEN      3918/ricci
tcp        0      0 127.0.0.1:6010              0.0.0.0:*
LISTEN      5721/0
tcp        0      0 :::22                       :::*
LISTEN      3601/sshd
tcp        0      0 ::1:6010                    :::*
LISTEN      5721/0
tcp        0      0 ::ffff:10.x.x.x:22      ::ffff:10.x.x.x:40884
ESTABLISHED 5718/sshd: lcp13o [ 

I then ran
service cman start

Which started up without a problem

Regards

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Gerhardus.Geldenhuis at gta-travel.com
> Sent: 10 September 2008 13:45
> To: linux-cluster at redhat.com
> Subject: RE: [Linux-cluster] Unable to retrieve batch 
> 1776334432 statusfrom....
> 
> Progress report,
> It looks like for some or other reason the name of the 
> cluster is not set properly in the cman startup script. I 
> have set it manually.
> 
> Next problem is:
> cman not started: Can't connect to CCSD /usr/sbin/cman_tool: 
> aisexec daemon didn't start
> 
> Regards 
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> > Gerhardus.Geldenhuis at gta-travel.com
> > Sent: 10 September 2008 12:14
> > To: linux-cluster at redhat.com
> > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > 1776334432 statusfrom....
> > 
> > I am not sure what needs to be added to the config file for it to 
> > work.
> > 
> > My config file looks like this atm:
> > <?xml version="1.0"?>
> > <cluster alias="apache-ha" config_version="1" name="apache-ha">
> >         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >         <clusternodes>
> >                 <clusternode name="longapa02alt.gta.travel.lcl"
> > nodeid="1" votes="1"/>
> >                 <clusternode name="longapa02blt.gta.travel.lcl"
> > nodeid="2" votes="1"/>
> >         </clusternodes>
> >         <cman expected_votes="1" two_node="1"/>
> >         <fencedevices/>
> >         <rm/>
> > </cluster>
> > 
> > The hostname has been set in uppercase on the system, which 
> I thought 
> > might be a problem but changing the config file to 
> uppercase hostnames 
> > has not made a difference.
> > 
> > Regards
> > 
> > > -----Original Message-----
> > > From: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> > > Gerhardus.Geldenhuis at gta-travel.com
> > > Sent: 10 September 2008 11:06
> > > To: linux-cluster at redhat.com
> > > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > > 1776334432 statusfrom....
> > > 
> > > Thanks,
> > > I get the following error:
> > > 
> > > service cman start
> > > Starting cluster:
> > >    Loading modules... done
> > >    Mounting configfs... done
> > >    Starting ccsd... done
> > >    Starting cman... failed
> > > cman not started: Can't find local node name in cluster.conf
> > > /usr/sbin/cman_tool: aisexec daemon didn't start
> > >                                                           
>  [FAILED]
> > > 
> > > I am looking into this atm, but any suggestions welcomed.
> > > 
> > > Regards
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: linux-cluster-bounces at redhat.com 
> > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of
> > Chen, Mockey
> > > > (NSN - CN/Cheng Du)
> > > > Sent: 10 September 2008 10:57
> > > > To: linux clustering
> > > > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > > > 1776334432 statusfrom....
> > > > 
> > > > 
> > > >  
> > > > 
> > > > >-----Original Message-----
> > > > >From: linux-cluster-bounces at redhat.com 
> > > > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext 
> > > > >Gerhardus.Geldenhuis at gta-travel.com
> > > > >Sent: Wednesday, September 10, 2008 5:34 PM
> > > > >To: linux-cluster at redhat.com
> > > > >Subject: [Linux-cluster] Unable to retrieve batch
> > > 1776334432 status
> > > > >from....
> > > > >
> > > > >Hi
> > > > >I have justed started using cluster tools for the first
> > time so my
> > > > >question might be a bit obvious to the expierenced eye.
> > > > >
> > > > >I am getting the following error when trying to do config
> > > > changes using
> > > > >luci.
> > > > >Unable to retrieve batch 423452352 status from
> > > > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate
> > > > >conf: unable to connect to the CCS daemon: connection
> > > > refused Failed to
> > > > >update config file.
> > > > >
> > > > 
> > > > I maybe due to you not start ccsd daemon. try to use
> > > ccs_test connect
> > > > to test whether you can connect your ccsd daemon.
> > > > 
> > > > In RHCS 5, you can start ccsd by following command:
> > > > service cman start
> > > > 
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > 
> > > 
> > 
> ______________________________________________________________________
> > > This email has been scanned by the MessageLabs Email
> > Security System.
> > > For more information please visit
> > > http://www.messagelabs.com/email
> > > 
> > 
> ______________________________________________________________________
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > 
> > 
> ______________________________________________________________________
> > This email has been scanned by the MessageLabs Email 
> Security System.
> > For more information please visit
> > http://www.messagelabs.com/email
> > 
> ______________________________________________________________________
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit 
> http://www.messagelabs.com/email 
> ______________________________________________________________________
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From nick at javacat.f2s.com  Wed Sep 10 14:42:29 2008
From: nick at javacat.f2s.com (Nick Lunt)
Date: Wed, 10 Sep 2008 15:42:29 +0100
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 statusfrom....
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl>	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl>
Message-ID: <000301c91353$733bc500$59b34f00$@f2s.com>

Hi Gerhardus

> cman not started: Can't connect to CCSD /usr/sbin/cman_tool: aisexec

I had the same issue, caused by openais being chkconfiged on. Try to
chkconfig openais off and see how it goes.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
Gerhardus.Geldenhuis at gta-travel.com
Sent: 10 September 2008 13:45
To: linux-cluster at redhat.com
Subject: RE: [Linux-cluster] Unable to retrieve batch 1776334432
statusfrom....

Progress report,
It looks like for some or other reason the name of the cluster is not
set properly in the cman startup script. I have set it manually.

Next problem is:
cman not started: Can't connect to CCSD /usr/sbin/cman_tool: aisexec
daemon didn't start

Regards 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Gerhardus.Geldenhuis at gta-travel.com
> Sent: 10 September 2008 12:14
> To: linux-cluster at redhat.com
> Subject: RE: [Linux-cluster] Unable to retrieve batch 
> 1776334432 statusfrom....
> 
> I am not sure what needs to be added to the config file for 
> it to work.
> 
> My config file looks like this atm:
> <?xml version="1.0"?>
> <cluster alias="apache-ha" config_version="1" name="apache-ha">
>         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="longapa02alt.gta.travel.lcl"
> nodeid="1" votes="1"/>
>                 <clusternode name="longapa02blt.gta.travel.lcl"
> nodeid="2" votes="1"/>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices/>
>         <rm/>
> </cluster>
> 
> The hostname has been set in uppercase on the system, which I 
> thought might be a problem but changing the config file to 
> uppercase hostnames has not made a difference.
> 
> Regards
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> > Gerhardus.Geldenhuis at gta-travel.com
> > Sent: 10 September 2008 11:06
> > To: linux-cluster at redhat.com
> > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > 1776334432 statusfrom....
> > 
> > Thanks,
> > I get the following error:
> > 
> > service cman start
> > Starting cluster:
> >    Loading modules... done
> >    Mounting configfs... done
> >    Starting ccsd... done
> >    Starting cman... failed
> > cman not started: Can't find local node name in cluster.conf
> > /usr/sbin/cman_tool: aisexec daemon didn't start
> >                                                            [FAILED]
> > 
> > I am looking into this atm, but any suggestions welcomed.
> > 
> > Regards
> >  
> > 
> > > -----Original Message-----
> > > From: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Chen, Mockey 
> > > (NSN - CN/Cheng Du)
> > > Sent: 10 September 2008 10:57
> > > To: linux clustering
> > > Subject: RE: [Linux-cluster] Unable to retrieve batch
> > > 1776334432 statusfrom....
> > > 
> > > 
> > >  
> > > 
> > > >-----Original Message-----
> > > >From: linux-cluster-bounces at redhat.com 
> > > >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ext 
> > > >Gerhardus.Geldenhuis at gta-travel.com
> > > >Sent: Wednesday, September 10, 2008 5:34 PM
> > > >To: linux-cluster at redhat.com
> > > >Subject: [Linux-cluster] Unable to retrieve batch
> > 1776334432 status
> > > >from....
> > > >
> > > >Hi
> > > >I have justed started using cluster tools for the first 
> time so my 
> > > >question might be a bit obvious to the expierenced eye.
> > > >
> > > >I am getting the following error when trying to do config
> > > changes using
> > > >luci.
> > > >Unable to retrieve batch 423452352 status from
> > > >longapa02alt.gta.travel:11111: ccs_tool failed to propagate
> > > >conf: unable to connect to the CCS daemon: connection
> > > refused Failed to
> > > >update config file.
> > > >
> > > 
> > > I maybe due to you not start ccsd daemon. try to use
> > ccs_test connect
> > > to test whether you can connect your ccsd daemon.
> > > 
> > > In RHCS 5, you can start ccsd by following command:
> > > service cman start
> > > 
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > 
> > 
> ______________________________________________________________________
> > This email has been scanned by the MessageLabs Email 
> Security System.
> > For more information please visit
> > http://www.messagelabs.com/email
> > 
> ______________________________________________________________________
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit 
> http://www.messagelabs.com/email 
> ______________________________________________________________________
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Gerhardus.Geldenhuis at gta-travel.com  Wed Sep 10 16:09:50 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Wed, 10 Sep 2008 17:09:50 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>

Hi
I have managed to sucesfully install and configure a basic failover
cluster that consists of two apache boxes and a vip.

I am still a bit unclear about how/where the cluster software monitors a
resource I have added. I have added a script resource that points to the
init script of apache.

It looks like the gui is clever enough to add a start command to the
apache service, however I am not sure if it would do this for all
scripts, what if I use a custom script that does not require a
start/stop parameter.

How can I customize "failure criteria" for a cluster resource?

Regards


______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From jstoner at opsource.net  Wed Sep 10 17:58:19 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Wed, 10 Sep 2008 18:58:19 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>

> I am still a bit unclear about how/where the cluster software 
> monitors a
> resource I have added. 

/usr/share/cluster has the shell scripts used to manage resources,
however...

> I have added a script resource that 
> points to the
> init script of apache.

The Script resource is a generic resource (as opposed to the more
specific mysql, nfs, ip, file system, etc. resources.) The script
identified in the <script> section is responsible for properly starting,
stopping and verifying the resource is functioning properly. The script
needs to conform to LSB + OCF RA API draft specifications. This means a
script *must* accept as parameters, at a minimum for use with cluster
services, start, stop and status. These script must exit with 0 for
success and a non-zero value for any non-success condition (error,
warning, got up on the wrong side of the bed, whatever.)

> however I am not sure if it would do this for all scripts

Yes, it will.

> what if I use a custom script that does not require a
> start/stop parameter.

Only you can answer that since it's your script. I can't vouch for the
integrity of your cluster.

> How can I customize "failure criteria" for a cluster resource?

If you are using a Script resource, the script will get called with a
'status' parameter. It is totally up to you how you validate the
resource. Using Apache as an example, you can check for a pid file,
check that at least one httpd process shows up in the process list, open
a socket connection to the proper port to make sure it's listening, send
a complete HTTP request to httpd and parse the output - the
possibilities are many. It's your decision how you want to do the
validation.

Happy hacking!

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  



From mockey.chen at nsn.com  Thu Sep 11 01:50:03 2008
From: mockey.chen at nsn.com (Chen, Mockey (NSN - CN/Cheng Du))
Date: Thu, 11 Sep 2008 09:50:03 +0800
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
	<38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
Message-ID: <174CED94DD8DC54AB888B56E103B118728C58B@CNBEEXC007.nsn-intra.net>


>> How can I customize "failure criteria" for a cluster resource?
>
>If you are using a Script resource, the script will get called 
>with a 'status' parameter. It is totally up to you how you 
>validate the resource. Using Apache as an example, you can 
>check for a pid file, check that at least one httpd process 
>shows up in the process list, open a socket connection to the 
>proper port to make sure it's listening, send a complete HTTP 
>request to httpd and parse the output - the possibilities are 
>many. It's your decision how you want to do the validation.

In default configuration, cluster check the resource health every 30
seconds, 
I want to config this interval in /etc/cluster/cluster.conf, but I can
not find any related parameter,
or I missing something ?

The only way I can change it is mannually change
/usr/share/cluster/script.sh

Any suggestions ?


Regards.




From jakub.suchy at enlogit.cz  Thu Sep 11 07:30:01 2008
From: jakub.suchy at enlogit.cz (Jakub Suchy)
Date: Thu, 11 Sep 2008 09:30:01 +0200 (CEST)
Subject: [Linux-cluster] Unable to retrieve batch 1776334432 
	statusfrom....
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2414@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD240E@LONSEXC01.gta.travel.lcl><174CED94DD8DC54AB888B56E103B118728C48B@CNBEEXC007.nsn-intra.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2411@LONSEXC01.gta.travel.lcl><1BCA52CF5845E543B4B81AAFEF2AFD7904AD2412@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2413@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2414@LONSEXC01.gta.travel.lcl>
Message-ID: <10464.217.77.161.17.1221118201.squirrel@www.hal9000.cz>

Hi,
are you sure that you have all hostnames set up properly?

/etc/hosts should say:
A.B.C.D node1.somewhere
A.B.C.E node2.somewhere

You should use node1.somewhere as node name in your cluster.conf

Jakub

> For the benefit of future googlers.
>
> I managed to get the cman service started without any error messages. I
> also removed my customizations from the init script.
>
> It looks like the cman service is not able to detect whether services it
> starts up has been started and then fails.
>
> I ended up stopping all services having only the following running:
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address               Foreign Address
> State       PID/Program name
> tcp        0      0 0.0.0.0:11111               0.0.0.0:*
> LISTEN      3918/ricci
> tcp        0      0 127.0.0.1:6010              0.0.0.0:*
> LISTEN      5721/0
> tcp        0      0 :::22                       :::*
> LISTEN      3601/sshd
> tcp        0      0 ::1:6010                    :::*
> LISTEN      5721/0
> tcp        0      0 ::ffff:10.x.x.x:22      ::ffff:10.x.x.x:40884
> ESTABLISHED 5718/sshd: lcp13o [
>
> I then ran
> service cman start




From Gerhardus.Geldenhuis at gta-travel.com  Thu Sep 11 08:43:51 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Thu, 11 Sep 2008 09:43:51 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
	<38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>

Thanks for the excellent response, I have just a few follow up
questions.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
> Sent: 10 September 2008 18:58
> To: linux clustering
> Subject: RE: [Linux-cluster] Monitoring services/customize 
> failure criteria
> 
> 
> The Script resource is a generic resource (as opposed to the 
> more specific mysql, nfs, ip, file system, etc. resources.) 
> The script identified in the <script> section is responsible 
> for properly starting, stopping and verifying the resource is 
> functioning properly. The script needs to conform to LSB + 
> OCF RA API draft specifications. This means a script *must* 
> accept as parameters, at a minimum for use with cluster 
> services, start, stop and status. These script must exit with 
> 0 for success and a non-zero value for any non-success 
> condition (error, warning, got up on the wrong side of the 
> bed, whatever.)
> 
> > however I am not sure if it would do this for all scripts
> 
> Yes, it will.
> 
> 
> > How can I customize "failure criteria" for a cluster resource?
> 
> If you are using a Script resource, the script will get 
> called with a 'status' parameter. It is totally up to you how 
> you validate the resource. Using Apache as an example, you 
> can check for a pid file, check that at least one httpd 
> process shows up in the process list, open a socket 
> connection to the proper port to make sure it's listening, 
> send a complete HTTP request to httpd and parse the output - 
> the possibilities are many. It's your decision how you want 
> to do the validation.

Just for clarification, if I use a script resource will the same script
be used with a status parameter to check the status of the resource?

Where is the frequency specified for health checking of the resource
whether it be a custom script of apache? If we want to check health
every second where can I set this frequency? I have used luci web
interface up until now to do my configs with and have not seen anywhere
yet where I can set frequency of health checks.

I kind of expected a more complex frame work for health checking but
from my understanding of your reply it seems very simple by just
requiring the starting script to respond to status requests.

There is a script in /usr/share/cluster/script.sh which looks like it
contains some of my answers but I am not sure where this would be used.

Regards

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From ajeet.singh.raina at logica.com  Thu Sep 11 14:38:50 2008
From: ajeet.singh.raina at logica.com (Singh Raina, Ajeet)
Date: Thu, 11 Sep 2008 20:08:50 +0530
Subject: [Linux-cluster] NFS Late Failover Process..
Message-ID: <0139539A634FD04A99C9B8880AB70CB209B17B16@in-ex004.groupinfra.com>

Hi,

I have Two Nodes- Master1 And Master2 in cluster.I have one script
called scripts.sh acts as a control script for failover.
When I kill any process from one node the failover is taking place
successful but I had also created mount point in /etc/fstab which should
also get umount and then failover to the other node.
I have 10 mount points in /etc/fstab..and during the failover its trying
to do successful too late..approx 10 mins for complete failover.

Any Idea how can I reduce the time?



Pls help


This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080911/c05f42d9/attachment.htm>

From Gerhardus.Geldenhuis at gta-travel.com  Thu Sep 11 15:41:38 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Thu, 11 Sep 2008 16:41:38 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl><38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2424@LONSEXC01.gta.travel.lcl>

> 
> Just for clarification, if I use a script resource will the 
> same script be used with a status parameter to check the 
> status of the resource?
> 
> Where is the frequency specified for health checking of the 
> resource whether it be a custom script of apache? If we want 
> to check health every second where can I set this frequency? 
> I have used luci web interface up until now to do my configs 
> with and have not seen anywhere yet where I can set frequency 
> of health checks.
> 
> I kind of expected a more complex frame work for health 
> checking but from my understanding of your reply it seems 
> very simple by just requiring the starting script to respond 
> to status requests.
> 
> There is a script in /usr/share/cluster/script.sh which looks 
> like it contains some of my answers but I am not sure where 
> this would be used.
> 
> Regards
> 

I have been reading:
/usr/share/doc/rgmanager-2.0.38/README

<quote>
The resource manager is a daemon which provides failover of user-defined
resource collected into groups.
</quote>

This at first glance looked to be my answer, however after reading the
whole document I am still not clear on where I can actually customize
frequency of resource checking.

Regards

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From sghosh at redhat.com  Thu Sep 11 16:25:04 2008
From: sghosh at redhat.com (Subhendu Ghosh)
Date: Thu, 11 Sep 2008 12:25:04 -0400
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2424@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl><38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD2424@LONSEXC01.gta.travel.lcl>
Message-ID: <48C94660.7070600@redhat.com>

Gerhardus.Geldenhuis at gta-travel.com wrote:
>> Just for clarification, if I use a script resource will the 
>> same script be used with a status parameter to check the 
>> status of the resource?
>>
>> Where is the frequency specified for health checking of the 
>> resource whether it be a custom script of apache? If we want 
>> to check health every second where can I set this frequency? 
>> I have used luci web interface up until now to do my configs 
>> with and have not seen anywhere yet where I can set frequency 
>> of health checks.
>>
>> I kind of expected a more complex frame work for health 
>> checking but from my understanding of your reply it seems 
>> very simple by just requiring the starting script to respond 
>> to status requests.
>>
>> There is a script in /usr/share/cluster/script.sh which looks 
>> like it contains some of my answers but I am not sure where 
>> this would be used.
>>
>> Regards
>>
> 
> I have been reading:
> /usr/share/doc/rgmanager-2.0.38/README
> 
> <quote>
> The resource manager is a daemon which provides failover of user-defined
> resource collected into groups.
> </quote>
> 
> This at first glance looked to be my answer, however after reading the
> whole document I am still not clear on where I can actually customize
> frequency of resource checking.
> 
> Regards
> 
>

http://sources.redhat.com/cluster/wiki/FAQ/RGManager#rgm_interval


-- 
Subhendu Ghosh
Solutions Architect
Red Hat




From jstoner at opsource.net  Thu Sep 11 16:59:42 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Thu, 11 Sep 2008 17:59:42 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl><38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A30802B156@mailxchg01.corp.opsource.net>

> -----Original Message-----
> Just for clarification, if I use a script resource will the 
> same script
> be used with a status parameter to check the status of the resource?

Yes.

> Where is the frequency specified for health checking of the resource
> whether it be a custom script of apache? If we want to check health
> every second where can I set this frequency? I have used luci web
> interface up until now to do my configs with and have not 
> seen anywhere
> yet where I can set frequency of health checks.

http://sources.redhat.com/cluster/faq.html#rgm_interval

> I kind of expected a more complex frame work for health checking but
> from my understanding of your reply it seems very simple by just
> requiring the starting script to respond to status requests.

Correct. The script resource is generic - used for when a more specific
resource type is not available. For the developers to build a framework
to do health checking is adding (potentially) a lot of complexity where
it may not be warranted. Simply throw it back on the implementer to deal
with - the KISS principle.  :-)


--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  




From jamshed.jimmy at gmail.com  Thu Sep 11 17:31:20 2008
From: jamshed.jimmy at gmail.com (Jamshed Zaidi)
Date: Thu, 11 Sep 2008 22:31:20 +0500
Subject: [Linux-cluster] Is there any software available for HPC monitoring
Message-ID: <7ed8d870809111031s21a3bc62s48dc56a7c0c5af7c@mail.gmail.com>

Hello to everyone, this is my 2nd post on this group i want to know that any
Open source tool which is available for monitoring and managing HPC (high
performance clustering) ???. i got Ganglia. Its amazing effort by *
University** of California , Berkeley. your response will be appreciated.*

http://www.technorabit.com/2008/09/04/monitoring-hpc/

-- 
Syed Jamshed Zaidi (Jamy-Virus)
Linux Admin/Programmer @ Naseeb Networks
0321-4087492
www.technorabit.com
"Shoot for the moon. Even if you miss, you'll land among the stars"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080911/f494f27e/attachment.htm>

From bruno at redix.com.br  Fri Sep 12 11:39:31 2008
From: bruno at redix.com.br (Bruno Frensch Deschamps)
Date: Fri, 12 Sep 2008 08:39:31 -0300
Subject: [Linux-cluster] Cluster failed detection
Message-ID: <48CA54F3.4000005@redix.com.br>

Hi,

How RHCS detect that a cluster is failed?
What method of heartbeat is used to do this?


-- 
Bruno F. Deschamps - Consultor
Profissional Certificado LPIC-1
--------------------------------------------------------------------
Redix - Gest?o em T.I. com Software Livre
http://www.redix.com.br - redix at redix.com.br
Tel. Coml.: +55 (47) 3323-7313
--------------------------------------------------------------------




From Gerhardus.Geldenhuis at gta-travel.com  Fri Sep 12 13:03:49 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Fri, 12 Sep 2008 14:03:49 +0100
Subject: [Linux-cluster] Cluster failed detection
In-Reply-To: <48CA54F3.4000005@redix.com.br>
References: <48CA54F3.4000005@redix.com.br>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242A@LONSEXC01.gta.travel.lcl>

Hi Bruno,
I am quite new to the redhat cluster suite myself but I will try to
explain.

Detecting whether the cluster has failed will depend on how many nodes
you have setup and how you have configured them. I would suggest reading
up more on quoroms and failover domains in the documentation which will
explain better than I would.

With regards to individual service on a cluster, the script that is
called to start the service if it is a software service will be called
with a status parameter which will tell the cluster whether it is in
good condition or not. There is more customizable fields in these
scripts in embedded xml files( it would probably have been more logical
to put them in a conf directory...) that can be customized. The scripts
I am talking about is the standard scripts that come with the cluster
software and can be found in:
/usr/share/cluster

I hope this helps.

Regards

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bruno 
> Frensch Deschamps
> Sent: 12 September 2008 12:40
> To: linux clustering
> Subject: [Linux-cluster] Cluster failed detection
> 
> Hi,
> 
> How RHCS detect that a cluster is failed?
> What method of heartbeat is used to do this?
> 
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From sghosh at redhat.com  Fri Sep 12 15:40:42 2008
From: sghosh at redhat.com (Subhendu Ghosh)
Date: Fri, 12 Sep 2008 11:40:42 -0400
Subject: [Linux-cluster] Is there any software available for HPC monitoring
In-Reply-To: <7ed8d870809111031s21a3bc62s48dc56a7c0c5af7c@mail.gmail.com>
References: <7ed8d870809111031s21a3bc62s48dc56a7c0c5af7c@mail.gmail.com>
Message-ID: <48CA8D7A.5010400@redhat.com>

Jamshed Zaidi wrote:
> Hello to everyone, this is my 2nd post on this group i want to know that 
> any Open source tool which is available for monitoring and managing HPC 
> (high performance clustering) ???. i got Ganglia. Its amazing effort by 
> *University** of California , Berkeley. your response will be appreciated.*
> 
> http://www.technorabit.com/2008/09/04/monitoring-hpc/


Both Ganglia and Collectd will allow you to capture system performance 
history. Collectd has a Nagios plugin that can be used to escalate based on 
performance thresholds. Others have integrated Ganglia with Nagios as well.

The real answer depends on what you mean by monitoring- fault, performance, other?

-regards
Subhendu

-- 
Subhendu Ghosh
Solutions Architect
Red Hat




From Gerhardus.Geldenhuis at gta-travel.com  Fri Sep 12 15:39:08 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Fri, 12 Sep 2008 16:39:08 +0100
Subject: [Linux-cluster] rgmanager dies when starting cman service
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242B@LONSEXC01.gta.travel.lcl>

Hi
When I start my rgmanager service it dies the moment cman is started up.
If I then start it again it works fine. I am not sure if rgmanager logs
anywhere, there is certainly nothing in my messages log or in
cluster.log which is created by 
 <rm log_facility="local4" log_level="7"> in my cluster.

The machine is up to date and have the latest pathces installed.

Could it be a DLM problem? I saw some messages relating to dlm problems
between rgmanager and cman but not sure whether it is my problem too.

I straced rgmanager and got the following which I thought relevent:
[pid 31533] write(2, "failed acquiring lockspace: Tran"..., 64) = 64

Any suggestions would be appreciated.

Regards

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From Gerhardus.Geldenhuis at gta-travel.com  Fri Sep 12 16:00:39 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Fri, 12 Sep 2008 17:00:39 +0100
Subject: [Linux-cluster] rgmanager dies when starting cman service
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242B@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242B@LONSEXC01.gta.travel.lcl>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242C@LONSEXC01.gta.travel.lcl>

The following ties up with my rgmanager service dying:

Sep 12 14:51:19 LONGAPA02ALT kernel: dlm: no local IP address has been
set
Sep 12 14:51:19 LONGAPA02ALT kernel: dlm: cannot start dlm lowcomms -107

I am searching for a doc now that will tell me where to set a local ip
to see if that is the cause of the problem...

Regards

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Gerhardus.Geldenhuis at gta-travel.com
> Sent: 12 September 2008 16:39
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] rgmanager dies when starting cman service
> 
> Hi
> When I start my rgmanager service it dies the moment cman is 
> started up.
> If I then start it again it works fine. I am not sure if 
> rgmanager logs anywhere, there is certainly nothing in my 
> messages log or in cluster.log which is created by  <rm 
> log_facility="local4" log_level="7"> in my cluster.
> 
> The machine is up to date and have the latest pathces installed.
> 
> Could it be a DLM problem? I saw some messages relating to 
> dlm problems between rgmanager and cman but not sure whether 
> it is my problem too.
> 
> I straced rgmanager and got the following which I thought relevent:
> [pid 31533] write(2, "failed acquiring lockspace: Tran"..., 64) = 64
> 
> Any suggestions would be appreciated.
> 
> Regards
> 
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit 
> http://www.messagelabs.com/email 
> ______________________________________________________________________
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From ccaulfie at redhat.com  Fri Sep 12 16:04:31 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 12 Sep 2008 17:04:31 +0100
Subject: [Linux-cluster] rgmanager dies when starting cman service
In-Reply-To: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242C@LONSEXC01.gta.travel.lcl>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242B@LONSEXC01.gta.travel.lcl>
	<1BCA52CF5845E543B4B81AAFEF2AFD7904AD242C@LONSEXC01.gta.travel.lcl>
Message-ID: <48CA930F.5040507@redhat.com>

Gerhardus.Geldenhuis at gta-travel.com wrote:
> The following ties up with my rgmanager service dying:
> 
> Sep 12 14:51:19 LONGAPA02ALT kernel: dlm: no local IP address has been
> set
> Sep 12 14:51:19 LONGAPA02ALT kernel: dlm: cannot start dlm lowcomms -107
> 
> I am searching for a doc now that will tell me where to set a local ip
> to see if that is the cause of the problem...


That message usually means that dlm_controld isn't running or that it
started up after configfs was mounted. So you might want to look into those.

-- 

Chrissie



From Gerhardus.Geldenhuis at gta-travel.com  Fri Sep 12 16:24:59 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Fri, 12 Sep 2008 17:24:59 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802B156@mailxchg01.corp.opsource.net>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl><38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net><1BCA52CF5845E543B4B81AAFEF2AFD7904AD241A@LONSEXC01.gta.travel.lcl>
	<38A48FA2F0103444906AD22E14F1B5A30802B156@mailxchg01.corp.opsource.net>
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD242D@LONSEXC01.gta.travel.lcl>

> 
> > -----Original Message-----
> > Just for clarification, if I use a script resource will the same 
> > script be used with a status parameter to check the status of the 
> > resource?
> 
> Yes.
> 
> > Where is the frequency specified for health checking of the 
> resource 
> > whether it be a custom script of apache? If we want to check health 
> > every second where can I set this frequency? I have used luci web 
> > interface up until now to do my configs with and have not seen 
> > anywhere yet where I can set frequency of health checks.
> 
> http://sources.redhat.com/cluster/faq.html#rgm_interval
> 
Thanks! I have made changes to the script.sh and ip.sh scripts. Setting
values to as low as 2seconds. We have a very simple apache setup and we
want quick failover. The checks easily complete within a seconds so I
don't believe I am creating endless loops of continuos checking. The log
file however still shows that checking is done every 10 seconds. My
changes has increased the frequency but there seems to be an lower limit
of 10sec and I can't find a place to override the value.

Sep 12 15:24:11 LONGAPA02ALT clurgmgrd: [14666]: <info> Executing
/etc/rc.d/init.d/httpd status
Sep 12 15:24:21 LONGAPA02ALT clurgmgrd: [14666]: <info> Executing
/etc/rc.d/init.d/httpd status
Sep 12 15:24:31 LONGAPA02ALT clurgmgrd: [14666]: <info> Executing
/etc/rc.d/init.d/httpd status

Currently if I stop the httpd server manually on a node the failover
takes about 15sec and if I time it roughly to coincide with the next
check the best I could get failover down to is 11sec. It seems that I am
still bound by this 10sec setting somewhere. It definitely does not take
10 sec to shutdown httpd, tear down the ip and then do the reverse on
the other node.



Failover simply does not happen if I physically disconnect a node. I
believe it is a config error somewhere on my part but not sure exactly
what. I have a two node cluster so there would be a quorom for the other
node to continue. I am looking into this currently but some
hints/suggestions would be appreciated. If I bring the box back online,
ie start the network services again failover occurs to the other node
which I find bizarre.

Regards

> 
> Correct. The script resource is generic - used for when a 
> more specific resource type is not available. For the 
> developers to build a framework to do health checking is 
> adding (potentially) a lot of complexity where it may not be 
> warranted. Simply throw it back on the implementer to deal 
> with - the KISS principle.  :-)
> 
> 
> --Jeff
> Performance Engineer
> 
> OpSource, Inc.
> http://www.opsource.net
> "Your Success is Our Success"
>   
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From tsi at ualberta.ca  Fri Sep 12 16:44:32 2008
From: tsi at ualberta.ca (Marc Aurele La France)
Date: Fri, 12 Sep 2008 10:44:32 -0600 (Mountain Daylight Time)
Subject: [Linux-cluster] 32 nodes limit?
Message-ID: <alpine.WNT.1.10.0809121013350.660@tsi>

Hi.

Just what I hope will be a quick question regarding the cluster suite.

The current lock manager FAQ states ...

"CMAN in RHEL4 has known problems when you have more than 32 nodes in the 
cluster.  We're working to resolve those issues, but until then use GULM if 
you have more than 32 nodes."

... while the pre-Wiki version of this document refered to DLM instead of 
CMAN in RHEL4.  Which one is it?  DLM makes more sense to me.

In any case, I gather that this issue has been resolved.  If so, can you tell 
me the minimum version of the cluster suite and/or upstream kernel that would 
allow for more than 32 nodes (with DLM)?  A pointer to a patch or patches 
that I could use would be ideal.

More details ...

I'm trying to move a 5TB filespace from NFS to GFS2.  I have a P4 (the 
current NFS server) and 33 Opteron nodes, all running a stock 2.6.22 kernel, 
OpenAIS 0.80.3, and a 2.00.00 cluster suite.  For now, I've dummied out 
fencing and set expected_votes to 1.  I can start/stop cman on all nodes no 
problem.  With all cman's running, I've formatted, mounted and populated the 
filesystem using the P4.  Proceeding through the Opterons to mount the 
filesystem succeeds until the 32nd node, at which point mount.gfs2 hangs (in 
"D" according to `ps ax`).  Going back, the first 16 systems that have 
mounted the filesystem can still `ls` the top level directory, but attempts 
to do so on the remaining systems also get stuck in "D".  Any attempt to 
unmount the filesystem throws the entire setup in "D".

Due to various considerations, moving to more recent versions is not the 
preferred option at this point.  Hence my question.

Any ideas?

Thanks.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi at ualberta.ca         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+



From eritchie at interactivebrokers.com  Fri Sep 12 21:42:38 2008
From: eritchie at interactivebrokers.com (Eric Ritchie)
Date: Fri, 12 Sep 2008 17:42:38 -0400
Subject: [Linux-cluster] fence start-up issue
Message-ID: <48CAE24E.4000002@interactivebrokers.com>

    I sometimes run into an issue when a node in my 2-node cluster is 
rebooting and hangs on fenced. It seems it can't communicate with the 
other node and after the post_join_delay, it fences the other node. This 
happened again today, and when the second node rebooted after the fence, 
they were in a split-brain configuration.
    I saw in the cluster faq, in the cman section, question 6 that the 
cluster communication network should be the same network as the fencing 
device. I think this may be my problem but I don't understand why. I'm 
using HP iLo for fencing and I setup cross-connect cables for the 
cluster communication between the 2 nodes. Why would having cluster 
communication and fencing on different networks be an issue?

Thanks for your time

-- 
Eric Ritchie
Interactive Brokers LLC
203-618-5868



From celso at webbertek.com.br  Fri Sep 12 22:21:13 2008
From: celso at webbertek.com.br (Celso K. Webber)
Date: Fri, 12 Sep 2008 19:21:13 -0300
Subject: [Linux-cluster] One question about IPMI fencing with Cluster Suite
	v5.1
Message-ID: <48CAEB59.6040708@webbertek.com.br>

Hello all,

Sorry if this question has been answered before, but I didn't find anything 
in the archives.

We deployed a Red Hat Cluster Suite on a customer, and apparently everything 
goes fine until there's a need for one node to fence the other (for 
instance, we turn it off to test failover).

As usual for us, we configured the fencing using IPMI, which is available on 
every modern branded server.

It seems that sometimes, one machine can't fence the other. Although we can 
see the Cluster starting "ipmitool -I lanplus -H xxx -U xxx -P xxx chassis 
power off", it times out while trying to power off the other machine.

The more incredible thing is that if, at this exact moment, we issue an 
"ipmitool ... chassis power status" at the command line, it works ok with 
the same node failing.

So I have a few questions:
* can a problem like this (fencing agent not being able to fence) cause 
instability on the cluster? In our case, the clusters gets crazy even if we 
reboot the failed node, it does join the cluster, but rgmanager never gets 
started;

* has anyone faced this problem with IPMI? We have used IPMI as a fence 
agent on tenths of implementations with Red Hat Cluster Suite, since version 
3, and we have never had this kind of problem. The servers in question are 
Dell PowerEdges 2900, and there is a crossover cable beetween both onboard 
#1 NICs of the server, so that we have a dedicated network path for one 
machine turning off the other.


Thank you all for your support.

Regards,

Celso.

-- 
Esta mensagem foi verificada pelo sistema de antiv?rus e
 acredita-se estar livre de perigo.



From sghosh at redhat.com  Fri Sep 12 23:39:33 2008
From: sghosh at redhat.com (Subhendu Ghosh)
Date: Fri, 12 Sep 2008 19:39:33 -0400
Subject: [Linux-cluster] fence start-up issue
In-Reply-To: <48CAE24E.4000002@interactivebrokers.com>
References: <48CAE24E.4000002@interactivebrokers.com>
Message-ID: <48CAFDB5.4010103@redhat.com>

Eric Ritchie wrote:
>    I sometimes run into an issue when a node in my 2-node cluster is 
> rebooting and hangs on fenced. It seems it can't communicate with the 
> other node and after the post_join_delay, it fences the other node. This 
> happened again today, and when the second node rebooted after the fence, 
> they were in a split-brain configuration.
>    I saw in the cluster faq, in the cman section, question 6 that the 
> cluster communication network should be the same network as the fencing 
> device. I think this may be my problem but I don't understand why. I'm 
> using HP iLo for fencing and I setup cross-connect cables for the 
> cluster communication between the 2 nodes. Why would having cluster 
> communication and fencing on different networks be an issue?
> 
> Thanks for your time
> 

Having distinct heartbeat and fencing networks creates the possibility of race 
condition, which you seem to be running into.

The cluster communication may not have stabilized in the post_join_delay time 
frame due to any number of issues including network outage.  In this case 
fencing would fail from the node starting up as it is the same path to fence 
device as to cluster member.

By separating the two - fence can succeed while cluster communication fails.

Recommendation would be for cluster communication and iLO reachability to be 
through the same NIC on the host.

-regards
Subhendu

-- 
Subhendu Ghosh
Solutions Architect
Red Hat



From sghosh at redhat.com  Fri Sep 12 23:41:46 2008
From: sghosh at redhat.com (Subhendu Ghosh)
Date: Fri, 12 Sep 2008 19:41:46 -0400
Subject: [Linux-cluster] One question about IPMI fencing with Cluster
	Suite	v5.1
In-Reply-To: <48CAEB59.6040708@webbertek.com.br>
References: <48CAEB59.6040708@webbertek.com.br>
Message-ID: <48CAFE3A.6010707@redhat.com>

Celso K. Webber wrote:
> Hello all,
> 
> Sorry if this question has been answered before, but I didn't find 
> anything in the archives.
> 
> We deployed a Red Hat Cluster Suite on a customer, and apparently 
> everything goes fine until there's a need for one node to fence the 
> other (for instance, we turn it off to test failover).
> 
> As usual for us, we configured the fencing using IPMI, which is 
> available on every modern branded server.
> 
> It seems that sometimes, one machine can't fence the other. Although we 
> can see the Cluster starting "ipmitool -I lanplus -H xxx -U xxx -P xxx 
> chassis power off", it times out while trying to power off the other 
> machine.

Have you tried the above command by itself to see if IPMI on the systems 
responds correctly by shutting down?

> 
> The more incredible thing is that if, at this exact moment, we issue an 
> "ipmitool ... chassis power status" at the command line, it works ok 
> with the same node failing.
> 
> So I have a few questions:
> * can a problem like this (fencing agent not being able to fence) cause 
> instability on the cluster? In our case, the clusters gets crazy even if 
> we reboot the failed node, it does join the cluster, but rgmanager never 
> gets started;
> 
> * has anyone faced this problem with IPMI? We have used IPMI as a fence 
> agent on tenths of implementations with Red Hat Cluster Suite, since 
> version 3, and we have never had this kind of problem. The servers in 
> question are Dell PowerEdges 2900, and there is a crossover cable 
> beetween both onboard #1 NICs of the server, so that we have a dedicated 
> network path for one machine turning off the other.
> 
> 
> Thank you all for your support.
> 
> Regards,
> 
> Celso.
> 


-- 
Subhendu Ghosh
Solutions Architect
Red Hat



From mehulc87 at gmail.com  Sat Sep 13 04:58:55 2008
From: mehulc87 at gmail.com (Mehul Chadha)
Date: Sat, 13 Sep 2008 10:28:55 +0530
Subject: [Linux-cluster] Enhancing the linux clustering capabilty
Message-ID: <251d650c0809122158q1d00d5aakc2f0b3dda8588b64@mail.gmail.com>

Hi,
      We are a group of 3 students doing their graduation in computer
science and are interested in contributing to the clustering capability in
linux. We have been reading about dragonfly BSD as a strong contender for
clustering OS's. The 2 notable implementation in kernel space of dragonfly
BSD are the syslink protocol basically which looks after the resource
management in the cluster and hence keeps the cluster glued together and the
cache coherence management which allows a number of processes running in
different systems to have read and write to different portions of the same
file at the same time. We were thinking of implementing the same feature set
in linux kernel . We would be thankful if we can have some suggestions and
advice from your side.

Thanks and Regards,
Mehul
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080913/1d9c5ac2/attachment.htm>

From celso at webbertek.com.br  Sat Sep 13 13:50:33 2008
From: celso at webbertek.com.br (Celso K. Webber)
Date: Sat, 13 Sep 2008 10:50:33 -0300
Subject: [Linux-cluster] One question about IPMI fencing with Cluster
	Suite	v5.1
In-Reply-To: <48CAFE3A.6010707@redhat.com>
References: <48CAEB59.6040708@webbertek.com.br> <48CAFE3A.6010707@redhat.com>
Message-ID: <48CBC529.7050203@webbertek.com.br>

Hi Subhendu,

Subhendu Ghosh escreveu:
> Celso K. Webber wrote:
>> Hello all,
>>
>> It seems that sometimes, one machine can't fence the other. Although 
>> we can see the Cluster starting "ipmitool -I lanplus -H xxx -U xxx -P 
>> xxx chassis power off", it times out while trying to power off the 
>> other machine.
> 
> Have you tried the above command by itself to see if IPMI on the systems 
> responds correctly by shutting down?
> 
Yes I did. This is indeed the way I use to "power off" the machine for 
failover tests without needing to go in front of the machine to press the 
power button.

I didn't make any further analysis in the "ipmilan" fence agent, but it 
seems that in this environment the agent is not being able to power off / 
power on the machine. Manually with ipmitool works ok, as far as I could 
test it.

On question that intrigates me: if one node fails to fence the other, do 
this situation prevents the rebooted node to have rgmanager started? Or, 
could be the case that the rebooted node tries to contact the fence agent to 
issue something like a "power status" before starting up rgmanager?

Thank you.

-- 
*Celso Kopp Webber*

celso at webbertek.com.br <mailto:celso at webbertek.com.br>

-- 
Esta mensagem foi verificada pelo sistema de antiv?rus e
 acredita-se estar livre de perigo.



From celso at webbertek.com.br  Sat Sep 13 13:54:23 2008
From: celso at webbertek.com.br (Celso K. Webber)
Date: Sat, 13 Sep 2008 10:54:23 -0300
Subject: [Linux-cluster] fence start-up issue
In-Reply-To: <48CAFDB5.4010103@redhat.com>
References: <48CAE24E.4000002@interactivebrokers.com>
	<48CAFDB5.4010103@redhat.com>
Message-ID: <48CBC60F.7050709@webbertek.com.br>

Subhendu,

I remember to previously see a recommendation which is exactly the opposite 
in the Cluster FAQ, which was: do not use the same network for integrated 
fencing (iLO, DRAC, IPMI) and heartbeat.

Did this change recently or in Cluster Suite v5? I'm sure in v4 I had to 
make them in separate networks.

Thank you.

Celso.

Subhendu Ghosh escreveu:
> Eric Ritchie wrote:
>>    I sometimes run into an issue when a node in my 2-node cluster is 
>> rebooting and hangs on fenced. It seems it can't communicate with the 
>> other node and after the post_join_delay, it fences the other node. 
>> This happened again today, and when the second node rebooted after the 
>> fence, they were in a split-brain configuration.
>>    I saw in the cluster faq, in the cman section, question 6 that the 
>> cluster communication network should be the same network as the 
>> fencing device. I think this may be my problem but I don't understand 
>> why. I'm using HP iLo for fencing and I setup cross-connect cables for 
>> the cluster communication between the 2 nodes. Why would having 
>> cluster communication and fencing on different networks be an issue?
>>
>> Thanks for your time
>>
> 
> Having distinct heartbeat and fencing networks creates the possibility 
> of race condition, which you seem to be running into.
> 
> The cluster communication may not have stabilized in the post_join_delay 
> time frame due to any number of issues including network outage.  In 
> this case fencing would fail from the node starting up as it is the same 
> path to fence device as to cluster member.
> 
> By separating the two - fence can succeed while cluster communication 
> fails.
> 
> Recommendation would be for cluster communication and iLO reachability 
> to be through the same NIC on the host.
> 
> -regards
> Subhendu
> 

-- 
*Celso Kopp Webber*

celso at webbertek.com.br <mailto:celso at webbertek.com.br>

*Webbertek - Opensource Knowledge*
(41) 8813-1919 - celular
(41) 4063-8448, ramal 102 - fixo


-- 
Esta mensagem foi verificada pelo sistema de antiv?rus e
 acredita-se estar livre de perigo.



From ccaulfie at redhat.com  Mon Sep 15 08:10:19 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 15 Sep 2008 09:10:19 +0100
Subject: [Linux-cluster] 32 nodes limit?
In-Reply-To: <alpine.WNT.1.10.0809121013350.660@tsi>
References: <alpine.WNT.1.10.0809121013350.660@tsi>
Message-ID: <48CE186B.1030409@redhat.com>

Marc Aurele La France wrote:
> Hi.
> 
> Just what I hope will be a quick question regarding the cluster suite.
> 
> The current lock manager FAQ states ...
> 
> "CMAN in RHEL4 has known problems when you have more than 32 nodes in
> the cluster.  We're working to resolve those issues, but until then use
> GULM if you have more than 32 nodes."
> 
> ... while the pre-Wiki version of this document refered to DLM instead
> of CMAN in RHEL4.  Which one is it?  DLM makes more sense to me.

CMAN & DLM are two different parts of the same cluster infrastructure.
CMAN is the cluster manager, and DLM is the Distributed Lock Manager. if
you're using GFS then you will need both.

> In any case, I gather that this issue has been resolved.  If so, can you
> tell me the minimum version of the cluster suite and/or upstream kernel
> that would allow for more than 32 nodes (with DLM)?  A pointer to a
> patch or patches that I could use would be ideal.

We have seen users with 32 or more nodes have trouble with CMAN that do
seem to be related to the size of the cluster. But it it's not a hard
limit and after a certain amount of (non-QE) internal testing we had no
problems with 36 nodes in a CMAN cluster.

There have been no patches committed to RHEL4 that specifically address
the problems people have seen with 32+ nodes.


> More details ...
> 
> I'm trying to move a 5TB filespace from NFS to GFS2.  I have a P4 (the
> current NFS server) and 33 Opteron nodes, all running a stock 2.6.22
> kernel, OpenAIS 0.80.3, and a 2.00.00 cluster suite.  For now, I've
> dummied out fencing and set expected_votes to 1.  I can start/stop cman
> on all nodes no problem.  With all cman's running, I've formatted,
> mounted and populated the filesystem using the P4.  Proceeding through
> the Opterons to mount the filesystem succeeds until the 32nd node, at
> which point mount.gfs2 hangs (in "D" according to `ps ax`).  Going back,
> the first 16 systems that have mounted the filesystem can still `ls` the
> top level directory, but attempts to do so on the remaining systems also
> get stuck in "D".  Any attempt to unmount the filesystem throws the
> entire setup in "D".
> 
> Due to various considerations, moving to more recent versions is not the
> preferred option at this point.  Hence my question.

CMAN/openais in RHEL5 seems to be happy up to around 48 nodes (again
this is not a QE figure, it's something we have tested in development
only) with appropriate tuning. If you are seeing problems then it might
be helpful to adjust some of the times use in the openais totem
protocol. man openais.conf will tell you something about them. Before
doing this though it's worth checking the output of "group_tool" command
and syslog to see if there are any openais or other daemon errors that
might be causing your problems. If necessary post them to this list.

It's also worth mentioning that 2.00.00 has had a considerable number of
bugfixes applied since it was released and the current version is
2.03.07. I do strongly recommend you upgrade to this version even though
you say it is not "the preferred option at this point".

I hope this helps,

Chrissie



From lmb at suse.de  Mon Sep 15 08:22:54 2008
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Mon, 15 Sep 2008 10:22:54 +0200
Subject: [Linux-cluster] Enhancing the linux clustering capabilty
In-Reply-To: <251d650c0809122158q1d00d5aakc2f0b3dda8588b64@mail.gmail.com>
References: <251d650c0809122158q1d00d5aakc2f0b3dda8588b64@mail.gmail.com>
Message-ID: <20080915082254.GA845@marowsky-bree.de>

On 2008-09-13T10:28:55, Mehul Chadha <mehulc87 at gmail.com> wrote:

> Hi,
>       We are a group of 3 students doing their graduation in computer
> science and are interested in contributing to the clustering capability in
> linux. We have been reading about dragonfly BSD as a strong contender for
> clustering OS's. The 2 notable implementation in kernel space of dragonfly
> BSD are the syslink protocol basically which looks after the resource
> management in the cluster and hence keeps the cluster glued together and the
> cache coherence management which allows a number of processes running in
> different systems to have read and write to different portions of the same
> file at the same time. We were thinking of implementing the same feature set
> in linux kernel . We would be thankful if we can have some suggestions and
> advice from your side.

In Linux, we have the protocol in user-space (provided by openAIS), and
the concurrent & cache-coherent file access is managed by GFS2 or
OCFS2.

I'd be interested to learn if there is anything on Dragonfly which Linux
misses.


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG N?rnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



From lmb at suse.de  Mon Sep 15 08:25:26 2008
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Mon, 15 Sep 2008 10:25:26 +0200
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl>
	<38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
Message-ID: <20080915082526.GB845@marowsky-bree.de>

On 2008-09-10T18:58:19, Jeff Stoner <jstoner at opsource.net> wrote:

> The Script resource is a generic resource (as opposed to the more
> specific mysql, nfs, ip, file system, etc. resources.) The script
> identified in the <script> section is responsible for properly starting,
> stopping and verifying the resource is functioning properly. The script
> needs to conform to LSB + OCF RA API draft specifications. This means a
> script *must* accept as parameters, at a minimum for use with cluster
> services, start, stop and status. These script must exit with 0 for
> success and a non-zero value for any non-success condition (error,
> warning, got up on the wrong side of the bed, whatever.)

Is it really only 0 versus non-zero for status? How does the system
distinguish between running, failed, and cleanly stopped then? I ask
because both LSB & OCF specify a slightly more differentiated status
exit code.


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG N?rnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



From Gerhardus.Geldenhuis at gta-travel.com  Mon Sep 15 13:21:09 2008
From: Gerhardus.Geldenhuis at gta-travel.com (Gerhardus.Geldenhuis at gta-travel.com)
Date: Mon, 15 Sep 2008 14:21:09 +0100
Subject: [Linux-cluster] Monitor frequency setting being ignored...
Message-ID: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2431@LONSEXC01.gta.travel.lcl>

Hi 
I have set short intervals for monitoring my cluster resources:

output from rg_test

Resource Rules for "script"
...
  status
    Check Interval: 5 seconds
  monitor
    Check Interval: 5 seconds

Resource Rules for "ip"
  status
    Timeout (hint): 5 seconds
    Check Interval: 1 seconds
  monitor
    Timeout (hint): 5 seconds
    Check Interval: 1 seconds

However these values are not being used and instead defaults to checking
every 10 seconds. I have read every man page and faq and wiki I could
find and nowhere is made mention that it will default to 10 sec. If
anyone can shed light on why this is happening I would really appreciate
it. 

Entries from my log file (<rm log_facility="local4" log_level="7">)

Sep 15 14:15:00 LONGAPA02BLT clurgmgrd: [1289]: <info> Executing
/etc/rc.d/init.d/httpd status
Sep 15 14:15:10 LONGAPA02BLT clurgmgrd: [1289]: <info> Executing
/etc/rc.d/init.d/httpd status
Sep 15 14:15:20 LONGAPA02BLT clurgmgrd: [1289]: <info> Executing
/etc/rc.d/init.d/httpd status

Sep 15 14:20:09 LONGAPA02BLT clurgmgrd: [1289]: <debug> 10.200.31.52
present on eth0
Sep 15 14:20:19 LONGAPA02BLT clurgmgrd: [1289]: <debug> 10.200.31.52
present on eth0
Sep 15 14:20:29 LONGAPA02BLT clurgmgrd: [1289]: <debug> 10.200.31.52
present on eth0

Currently httpd service is started using the script script instead of
the apache script.

Regards

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________



From c.ferrigno at tno.it  Mon Sep 15 13:22:43 2008
From: c.ferrigno at tno.it (Cinzia Ferrigno)
Date: Mon, 15 Sep 2008 15:22:43 +0200
Subject: [Linux-cluster] Configuration Fence Device
Message-ID: <6E1F6EB26C1E4A7E977881512D3F5AEA@domtecno.tno.it>


Hi,
I search documentation to configure fence device, particularly ILO and Fabric Switch devices. 
Unfortunately, I have not found documentation about network and cables.
The network ILO and the network cluster node are the same?
I can use a Cisco Fabric Switch instead of Brocade Fabric Switch?
Thanks for your help,

Cinzia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080915/d2a8506a/attachment.htm>

From bogdan at rotariu.ro  Mon Sep 15 14:05:07 2008
From: bogdan at rotariu.ro (Bogdan-Stefan Rotariu)
Date: Mon, 15 Sep 2008 17:05:07 +0300
Subject: [Linux-cluster] gfs2 + nas storage
Message-ID: <48CE6B93.9010303@rotariu.ro>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear Sirs,

I'm trying to setup a configuration for some webservers to share same
storage device via HBA's.
I am using CentOS 5.2, cman-2.0.84-2, openais-0.80.3-15.
The mkfs command looks like :
gfs2_mkfs -t clusterv:gfs2 -p lock_dlm -j 10 -J 64 /dev/sdb1 .

The mounting and using the same storage is ok till made some files on
server1 and wanted to see them on server2. The files where visible from
server1 but not from server2, all was ok after umount/mount. Same
applies when I delete a file from any of the clustered servers, the
files remain on the others. What am I doing wrong ?

If needed I will paste the cluster.conf file.

Thank you.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iEYEARECAAYFAkjOa5MACgkQGQ4QA8SBkpVyUwCeMenN5uWvFZNII3EWvZAIjW0p
hGYAnitHLU9vcKv/qwNs36uwgqpXDezD
=YU3s
-----END PGP SIGNATURE-----



From alan.zg at gmail.com  Mon Sep 15 19:20:58 2008
From: alan.zg at gmail.com (Alan A)
Date: Mon, 15 Sep 2008 14:20:58 -0500
Subject: [Linux-cluster] GFS deployment questions
Message-ID: <fac531740809151220j592c20ddo55d9a42dd0f099f4@mail.gmail.com>

I am trying to setup cluster test environment mainly for the GFS. The
application that will be supported likes to take control of CPU thread and
is not clusterable in terms of process sharing, however it does talk to
SAN where GFS comes in. This is not something we did before so reading
manuals and this email list is our primary source of information. I have few
simple/ or not so simple questions, and would be grateful for any pointers
in the right direction.

Going through documentation, it appears that after fencing is setup, one
would need to allocate resources to support cluster. Is this only for High
Availability Cluster or any cluster, meaning - if I am only to test GFS do I
need to allocate resources in order to launch GFS.

I know that GFS service must be launched on a cluster. Once fencing is
configured, what are the steps required to deploy GFS, and any potential
'bumps' to watch out for?

Some questions regarding fencing:
We are using HP DL385s as the test model. After researching possible fencing
devices we decided to try HP ILO and SCSI fencing before testing HBA switch
fencing. For HP ILO fencing - each of the nodes has HP ILO name, but I am
not sure what goes into hostname - is it the name of the node, or the IP
address of HP ILO?
There isn't a lot of information about SCSI fencing, I would appreciate any
pointers for/against SCSI fencing.



Thanks
Alan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080915/8f638b52/attachment.htm>

From jstoner at opsource.net  Mon Sep 15 20:37:24 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Mon, 15 Sep 2008 21:37:24 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <20080915082526.GB845@marowsky-bree.de>
References: <1BCA52CF5845E543B4B81AAFEF2AFD7904AD2415@LONSEXC01.gta.travel.lcl><38A48FA2F0103444906AD22E14F1B5A30802AF6D@mailxchg01.corp.opsource.net>
	<20080915082526.GB845@marowsky-bree.de>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A30802B5E9@mailxchg01.corp.opsource.net>

> -----Original Message-----
> Is it really only 0 versus non-zero for status? How does the system
> distinguish between running, failed, and cleanly stopped then? I ask
> because both LSB & OCF specify a slightly more differentiated status
> exit code.

Yes, 0 is success and anything else is "not success." If the resource is
not 100% operational (as detemermined by the script) then it should not
be returning a 0 exit status. If it is, it's broken.

The LSB is fairly clear on this matter. It gives a table of exit status
codes to use with the 'status' action, of which 0 is the only success
code. All the other codes are failure codes. For all other non-status
actions, it states "the init script shall return an exit status of zero
if the action was successful. Otherwise, the exit status shall be
non-zero." This goes all the way back to LSB 1.0 (published June 29,
2001.)

http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core
-generic/iniscrptact.html

It is more important to realize that Cluster Services is taking
advantage of the defined LSB actions of init scripts. Another way to
look at it is:

- the LSB *only* applies to init scripts, not a script resource for
Cluster Services.
- you can use *any* script you like for the script resource. It does not
have to be an init script (it simply makes sense to do so.)

Here's an example of an script that can be used as a script resource:

===========================
#!/bin/sh

case $1
in
   start)
      exit 0
   ;;
   stop)
      exit 0
   ;;
   status)
      vmstat 1 1 >> /var/log/stats
   ;;
esac
===========================

It implements the basics required by Cluster Services but it does not
conform to the LSB specification.

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  




From lmb at suse.de  Mon Sep 15 20:58:28 2008
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Mon, 15 Sep 2008 22:58:28 +0200
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802B5E9@mailxchg01.corp.opsource.net>
References: <20080915082526.GB845@marowsky-bree.de>
	<38A48FA2F0103444906AD22E14F1B5A30802B5E9@mailxchg01.corp.opsource.net>
Message-ID: <20080915205828.GD17332@marowsky-bree.de>

On 2008-09-15T21:37:24, Jeff Stoner <jstoner at opsource.net> wrote:

> Yes, 0 is success and anything else is "not success." If the resource is
> not 100% operational (as detemermined by the script) then it should not
> be returning a 0 exit status. If it is, it's broken.

But "status" needs more than just a binary distinction.

> The LSB is fairly clear on this matter. It gives a table of exit status
> codes to use with the 'status' action, of which 0 is the only success
> code. All the other codes are failure codes. For all other non-status
> actions, it states "the init script shall return an exit status of zero
> if the action was successful. Otherwise, the exit status shall be
> non-zero." This goes all the way back to LSB 1.0 (published June 29,
> 2001.)

Right, but it also is more elaborate on "status". ;-)

I was trying to understand how this is handled by rgmanager.

Pacemaker needs to distinguish between "running fine", "active but
somehow failed", and "cleanly stopped". (The numeric values are not
relevant for the discussion, but map to 0, anything else, 7.)

It is also the detail of status/monitor which implementers get most
frequently wrong. "But it's either running or not!" ... Which is clearly
not true, or at least such a case couldn't protect against certain
failure modes. (Such as multiple-active on several nodes, which is
likely to be _also_ failed.)

Hence my interest in your statement that just two states suffice; I am
just trying to understand.

(BTW, this goes back further than LSB 1.0, as that only documented
behaviour which was practice long before ;-)


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG N?rnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



From jstoner at opsource.net  Mon Sep 15 23:29:49 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Tue, 16 Sep 2008 00:29:49 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <20080915205828.GD17332@marowsky-bree.de>
References: <20080915082526.GB845@marowsky-bree.de><38A48FA2F0103444906AD22E14F1B5A30802B5E9@mailxchg01.corp.opsource.net>
	<20080915205828.GD17332@marowsky-bree.de>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>

> -----Original Message-----
> It is also the detail of status/monitor which implementers get most
> frequently wrong. "But it's either running or not!" ... Which 
> is clearly
> not true, or at least such a case couldn't protect against certain
> failure modes. (Such as multiple-active on several nodes, which is
> likely to be _also_ failed.)

Ok. I think I understand where the confusion lies.

LSB is strictly for init scripts.
OCF is strictly for a cluster-managed resource.

They are similar but have significant differences. For example, LSB
scripts are required to implement a 'status' action while OCF scripts
are required to implement a 'monitor' action. This difference alone
means, technically, you can't interchange LSB and OCF scripts unless
they implement both (in some fashion.)

I think this is the missing link in our conversation: the script
resource type in Cluster Services is an attempt to make a LSB-compliant
script into a OCF-compliant script. So, the /usr/share/cluster/script.sh
expects the script you specify to behave like an LSB script, not an OCF
script. As such, the script resource type falls back to LSB conventions
and uses a binary approach to a resource's start/stop/status actions:
zero for success and non-zero for any failure. Other resource types
(file system, nfs, ip, mysql, samba, etc.) may implement full OCF RA API
exit codes.

Does this help?

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  




From sghosh at redhat.com  Tue Sep 16 03:47:07 2008
From: sghosh at redhat.com (Subhendu Ghosh)
Date: Mon, 15 Sep 2008 23:47:07 -0400
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
References: <20080915082526.GB845@marowsky-bree.de><38A48FA2F0103444906AD22E14F1B5A30802B5E9@mailxchg01.corp.opsource.net>	<20080915205828.GD17332@marowsky-bree.de>
	<38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
Message-ID: <48CF2C3B.90109@redhat.com>

Jeff Stoner wrote:
>> -----Original Message-----
>> It is also the detail of status/monitor which implementers get most
>> frequently wrong. "But it's either running or not!" ... Which 
>> is clearly
>> not true, or at least such a case couldn't protect against certain
>> failure modes. (Such as multiple-active on several nodes, which is
>> likely to be _also_ failed.)
> 
> Ok. I think I understand where the confusion lies.
> 
> LSB is strictly for init scripts.
> OCF is strictly for a cluster-managed resource.
> 
> They are similar but have significant differences. For example, LSB
> scripts are required to implement a 'status' action while OCF scripts
> are required to implement a 'monitor' action. This difference alone
> means, technically, you can't interchange LSB and OCF scripts unless
> they implement both (in some fashion.)
> 
> I think this is the missing link in our conversation: the script
> resource type in Cluster Services is an attempt to make a LSB-compliant
> script into a OCF-compliant script. So, the /usr/share/cluster/script.sh
> expects the script you specify to behave like an LSB script, not an OCF
> script. As such, the script resource type falls back to LSB conventions
> and uses a binary approach to a resource's start/stop/status actions:
> zero for success and non-zero for any failure. Other resource types
> (file system, nfs, ip, mysql, samba, etc.) may implement full OCF RA API
> exit codes.
> 
> Does this help?
>

Also internally, rgmanager can recognize other non-zero OCF return codes:

http://git.fedorahosted.org/git/cluster.git?p=cluster.git&a=search&h=HEAD&st=grep&s=OCF_RA

-subhendu



From balajisundar at midascomm.com  Tue Sep 16 05:50:59 2008
From: balajisundar at midascomm.com (Balaji)
Date: Tue, 16 Sep 2008 11:20:59 +0530
Subject: [Linux-cluster] Ethernet Channel Bonding Configuration
	Clarification is Needed
Message-ID: <48CF4943.5020005@midascomm.com>

Dear All,

I have using RHEL4 Update 4 Linux and Kernel Version is 2.6.9-42.EL

I have configured Cluster Suite with 2 servers
Server 1 : 192.168.13.110 IP Address and hostname is primary
Server 2 : 192.168.13.179 IP Address and hostname is secondary
Floating : 192.168.13.83 IP Address (Assumed by currently active server)

I have configured Ethernet Channel Bonding in Each Cluster Nodes as per 
the below link

http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-hardware-connect.html#S2-HARDWARE-ETHBOND

Channel Bonding Configuration Details are

1) Created bonding devices in "/etc/modprobe.conf" file alias bond0 bonding
  options bonding miimon=100 mode=1
2) Edit the "/etc/sysconfig/network-scripts/ifcfg-eth0 and ifcfg-eth1" 
configuration
  DEVICE=eth0
  USERCTL= no
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none

  DEVICE=eth1
  USERCTL= no
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
3) Created a network script for the bonding device is 
"/etc/sysconfig/network-scripts/ifcfg-bond0"
  DEVICE=bond0
  USERCTL=no
  ONBOOT=yes     NETMASK=255.255.255.0
  GATEWAY=192.168.13.1
  IPADDR=192.168.13.110
4) Reboot the system for the changes to take effect.

After i am rebooted both the server then cluster node becomes simplex 
and Services are started on both the nodes
The cluster output in primary node

Member Status: Quorate

Member Name                              Status
-----------                             ---------
primary                                 Online, Local, rgmanager
secondary                               Offline

Service Name         Owner (Last)                   State
------------         ------------                  --------
Service              primary                        started

The cluster output in secondary node

Member Status: Quorate

Member Name                              Status
-----------                             ---------
primary                                 Offline
secondary                               Online, Local, rgmanager

Service Name         Owner (Last)                   State
------------         --------------                --------
Service              secondary                     started

Before Ethernet Channel Bonding cluster services are active in one node 
and other nodes acts as passive node.
But after Ethernet Channel Bonding cluster services are active on both 
the nodes

I don't know what is the problem and is their any configuration is 
required in cluster configuration file and
cman is working only with eth0 interface

Can some one throw light on this peculiar problem

Regards
-S.Balaji



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080916/ac41cb32/attachment.htm>

From nkhare.lists at gmail.com  Tue Sep 16 06:07:44 2008
From: nkhare.lists at gmail.com (Neependra Khare)
Date: Tue, 16 Sep 2008 11:37:44 +0530
Subject: [Linux-cluster] Ethernet Channel Bonding
	Configuration	Clarification is Needed
In-Reply-To: <48CF4943.5020005@midascomm.com>
References: <48CF4943.5020005@midascomm.com>
Message-ID: <48CF4D30.6080003@gmail.com>

Hello Balaji,
>
>
> After i am rebooted both the server then cluster node becomes simplex 
> and Services are started on both the nodes
> The cluster output in primary node
>
> Member Status: Quorate
>
> Member Name                              Status
> -----------                             ---------
> primary                                 Online, Local, rgmanager
> secondary                               Offline
>
> Service Name         Owner (Last)                   State
> ------------         ------------                  --------
> Service              primary                        started
>
> The cluster output in secondary node
>
> Member Status: Quorate
>
> Member Name                              Status
> -----------                             ---------
> primary                                 Offline
> secondary                               Online, Local, rgmanager
>
> Service Name         Owner (Last)                   State
> ------------         --------------                --------
> Service              secondary                     started
This looks like a typical split brain condition.
http://sources.redhat.com/cluster/faq.html#split_brain

Is this only happening when you use bonding?

Make sure that both nodes are able to communicate with each other.
Check out the logs and configuration.If you can't figure it out then 
send "/etc/hosts", cluster config file and related logs.

Neependra



From stevan.colaco at gmail.com  Tue Sep 16 07:34:26 2008
From: stevan.colaco at gmail.com (Stevan Colaco)
Date: Tue, 16 Sep 2008 10:34:26 +0300
Subject: [Linux-cluster] Ethernet Channel Bonding Configuration
	Clarification is Needed
In-Reply-To: <48CF4D30.6080003@gmail.com>
References: <48CF4943.5020005@midascomm.com> <48CF4D30.6080003@gmail.com>
Message-ID: <56bb44d0809160034r28eca9b4kb1d8af65f7aa2e94@mail.gmail.com>

Hello Balaji,

Before looking into Cluter setup, is th Channel ethernet bonding working OK?

1) have you set the alias entry for bond0 interface in /etc/modprobe.conf ??
[root at web2 ~]# cat /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=1 miimon=100 use_carrier=0
#

2) is the bonding module loaded?
[root at web2 ~]# lsmod | grep -i bonding
bonding                72252  0
[root at web2 ~]#

3) if not then load the module
 #modprobe bonding
#lsmod | grep -i bonding
#make sure the channel bonding is module is loaded.

4) ifdown bond0 OR #ifdown eth0

5) below are my server Ifcfg config files, compare your config files with them.

[root at web2 network-scripts]# cat ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
SLAVE=yes
MASTER=bond0
HWADDR=00:1C:C4:BE:8C:70
[root at web2 network-scripts]#

[root at web2 network-scripts]# cat ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
SLAVE=yes
MASTER=bond0
HWADDR=00:1C:C4:BE:8C:7E
[root at web2 network-scripts]#

[root at web2 network-scripts]# cat ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
IPADDR=10.10.1.91
NETMASK=255.255.255.128
ONBOOT=yes
[root at web2 network-scripts]#

6) restart network service
#service network restart

7) verify bond0 interface works fine...

#ip addr list  ----> look for bond0, eth0 and eth1 interfaces

[root at web2 network-scripts]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3-rh (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1c:c4:be:8c:70

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1c:c4:be:8c:7e
[root at web2 network-scripts]#

Regards,
-Stevan Colaco




On Tue, Sep 16, 2008 at 9:07 AM, Neependra Khare <nkhare.lists at gmail.com> wrote:
> Hello Balaji,
>>
>>
>> After i am rebooted both the server then cluster node becomes simplex and
>> Services are started on both the nodes
>> The cluster output in primary node
>>
>> Member Status: Quorate
>>
>> Member Name                              Status
>> -----------                             ---------
>> primary                                 Online, Local, rgmanager
>> secondary                               Offline
>>
>> Service Name         Owner (Last)                   State
>> ------------         ------------                  --------
>> Service              primary                        started
>>
>> The cluster output in secondary node
>>
>> Member Status: Quorate
>>
>> Member Name                              Status
>> -----------                             ---------
>> primary                                 Offline
>> secondary                               Online, Local, rgmanager
>>
>> Service Name         Owner (Last)                   State
>> ------------         --------------                --------
>> Service              secondary                     started
>
> This looks like a typical split brain condition.
> http://sources.redhat.com/cluster/faq.html#split_brain
>
> Is this only happening when you use bonding?
>
> Make sure that both nodes are able to communicate with each other.
> Check out the logs and configuration.If you can't figure it out then send
> "/etc/hosts", cluster config file and related logs.
>
> Neependra
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From beekhof at gmail.com  Tue Sep 16 10:11:09 2008
From: beekhof at gmail.com (Andrew Beekhof)
Date: Tue, 16 Sep 2008 12:11:09 +0200
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
References: <20080915082526.GB845@marowsky-bree.de>
	<38A48FA2F0103444906AD22E14F1B5A30802B5E9@mailxchg01.corp.opsource.net>
	<20080915205828.GD17332@marowsky-bree.de>
	<38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
Message-ID: <26ef5e70809160311j50cf062en6a33123f24a41e59@mail.gmail.com>

On Tue, Sep 16, 2008 at 01:29, Jeff Stoner <jstoner at opsource.net> wrote:
>> -----Original Message-----
>> It is also the detail of status/monitor which implementers get most
>> frequently wrong. "But it's either running or not!" ... Which
>> is clearly
>> not true, or at least such a case couldn't protect against certain
>> failure modes. (Such as multiple-active on several nodes, which is
>> likely to be _also_ failed.)
>
> Ok. I think I understand where the confusion lies.
>
> LSB is strictly for init scripts.
> OCF is strictly for a cluster-managed resource.

This is an unnecessary distinction.

A true LSB script, by which I mean one that follows _all_ the LSB
guidelines for status^, is perfectly adequate for clustering.
And an OCF script that implements sane parameter defaults can, for the
most part, be easily used as an init script for stopping and starting
services.

^ http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

[quote]
If the status action is requested, the init script will return the
following exit status codes.

0	program is running or service is OK
1	program is dead and /var/run pid file exists
2	program is dead and /var/lock lock file exists
3	program is not running
4	program or service status is unknown
5-99	reserved for future LSB use
100-149	reserved for distribution use
150-199	reserved for application use
200-254	reserved
[end quote]

>
> They are similar but have significant differences. For example, LSB
> scripts are required to implement a 'status' action while OCF scripts
> are required to implement a 'monitor' action.

Btw. Lars was one of the primary authors of the OCF spec... he's
pretty familiar with it and how it differs from LSB ;-)

> This difference alone
> means, technically, you can't interchange LSB and OCF scripts unless
> they implement both (in some fashion.)
>
> I think this is the missing link in our conversation: the script
> resource type in Cluster Services is an attempt to make a LSB-compliant
> script into a OCF-compliant script. So, the /usr/share/cluster/script.sh
> expects the script you specify to behave like an LSB script, not an OCF
> script. As such, the script resource type falls back to LSB conventions
> and uses a binary approach to a resource's start/stop/status actions:
> zero for success and non-zero for any failure. Other resource types
> (file system, nfs, ip, mysql, samba, etc.) may implement full OCF RA API
> exit codes.
>
> Does this help?

I'm guessing this was what Lars was asking about.

In case of interest, our equivalent is the LRMd which understands both
standards (as well as the old Heartbeat style ones) and hides any
differences from OCF.  So any cluster manager using it can treat
everything as an OCF resource.

Part of our resource definition is a "class" field which the admin
uses to tell the LRMd what standard to use for a given resource and it
will automagically map everything from and to OCF as required (eg. by
calling status instead of monitor for LSB and remapping the return
codes).



From lmb at suse.de  Tue Sep 16 13:12:26 2008
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Tue, 16 Sep 2008 15:12:26 +0200
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
References: <20080915205828.GD17332@marowsky-bree.de>
	<38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
Message-ID: <20080916131226.GO17332@marowsky-bree.de>

On 2008-09-16T00:29:49, Jeff Stoner <jstoner at opsource.net> wrote:

> LSB is strictly for init scripts.
> OCF is strictly for a cluster-managed resource.

That's not quite true; OCF is a backwards-compatible extension to LSB.
ie, a script can be a fully compliant OCF _and_ LSB script at the same
time, and behave differently depending on how it is called, and share
code.

(An LSB script basically is _identical_ to an OCF one, except that it
defaults the instance being managed.)

> They are similar but have significant differences. For example, LSB
> scripts are required to implement a 'status' action while OCF scripts
> are required to implement a 'monitor' action. This difference alone
> means, technically, you can't interchange LSB and OCF scripts unless
> they implement both (in some fashion.)

Exactly; the point is that they _can_ implement both. "monitor" really
is simply a cleaner version of "status" (because status is the only op
in the LSB list which has deviating exit codes).

(Detecting whether they are being called as LSB or as OCF is easily
possible by evaluating the environment.)

If I was to go back, I probably would not redefine "monitor". It seemed
like a good idea at the time - and it perhaps still emphasizes that we
care about correctness a bit more here.

(A _truly_ correct LSB script would be sufficient, but more often than
not, people get the LSB exit codes wrong because normal system start-up
doesn't care. That doesn't mean their scripts aren't buggy or
non-compliant, though - Debian got this wrong very often in the past.)

> I think this is the missing link in our conversation: the script
> resource type in Cluster Services is an attempt to make a LSB-compliant
> script into a OCF-compliant script.

An LSB-compliant script doesn't need a wrapper to be made OCF-compliant.

Or, well, yes, of course the calling conventions are different, but LSB
would already provide everything the cluster needs, and merely doing a
wrapper around it doesn't provide the LSB script with more features.

> So, the /usr/share/cluster/script.sh expects the script you specify to
> behave like an LSB script, not an OCF script. As such, the script
> resource type falls back to LSB conventions and uses a binary approach
> to a resource's start/stop/status actions: zero for success and
> non-zero for any failure.

I think the disconnect here is mostly a misunderstanding of the exit
codes for LSB "status"; it is not merely a 0 or !0 distinction. It also
has "3", which means "not running (unused)". A rather important
distinction from a failure.



Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG N?rnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



From alan.zg at gmail.com  Tue Sep 16 15:48:19 2008
From: alan.zg at gmail.com (Alan A)
Date: Tue, 16 Sep 2008 10:48:19 -0500
Subject: [Linux-cluster] Ethernet Channel Bonding Configuration
	Clarification is Needed
In-Reply-To: <48CF4943.5020005@midascomm.com>
References: <48CF4943.5020005@midascomm.com>
Message-ID: <fac531740809160848l57a4ba14q2fa5ce369a8ced78@mail.gmail.com>

You have to look for the bonding option in /etc/modprobe.conf .. Mode 1 is
going to use only one NIC at the time and engage the other NIC only if the
first one fails. Try mode 0 which is round robin or mode 5 or 6.


2008/9/16 Balaji <balajisundar at midascomm.com>

>   Dear All,
>
> I have using RHEL4 Update 4 Linux and Kernel Version is 2.6.9-42.EL
>
> I have configured Cluster Suite with 2 servers
> Server 1 : 192.168.13.110 IP Address and hostname is primary
> Server 2 : 192.168.13.179 IP Address and hostname is secondary
> Floating : 192.168.13.83 IP Address (Assumed by currently active server)
>
> I have configured Ethernet Channel Bonding in Each Cluster Nodes as per the
> below link
>
>
> http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-hardware-connect.html#S2-HARDWARE-ETHBOND
>
> Channel Bonding Configuration Details are
>
> 1) Created bonding devices in "/etc/modprobe.conf" file alias bond0 bonding
>
>   options bonding miimon=100 mode=1
> 2) Edit the "/etc/sysconfig/network-scripts/ifcfg-eth0 and ifcfg-eth1"
> configuration
>   DEVICE=eth0
>   USERCTL= no
>   ONBOOT=yes
>   MASTER=bond0
>   SLAVE=yes
>   BOOTPROTO=none
>
>   DEVICE=eth1
>   USERCTL= no
>   ONBOOT=yes
>   MASTER=bond0
>   SLAVE=yes
>   BOOTPROTO=none
> 3) Created a network script for the bonding device is
> "/etc/sysconfig/network-scripts/ifcfg-bond0"
>   DEVICE=bond0
>   USERCTL=no
>   ONBOOT=yes     NETMASK=255.255.255.0
>   GATEWAY=192.168.13.1
>   IPADDR=192.168.13.110
> 4) Reboot the system for the changes to take effect.
>
> After i am rebooted both the server then cluster node becomes simplex and
> Services are started on both the nodes
> The cluster output in primary node
>
> Member Status: Quorate
>
> Member Name                              Status
> -----------                             ---------
> primary                                 Online, Local, rgmanager
> secondary                               Offline
>
> Service Name         Owner (Last)                   State
> ------------         ------------                  --------
> Service              primary                        started
>
> The cluster output in secondary node
>
> Member Status: Quorate
>
> Member Name                              Status
> -----------                             ---------
> primary                                 Offline
> secondary                               Online, Local, rgmanager
>
> Service Name         Owner (Last)                   State
> ------------         --------------                --------
> Service              secondary                     started
>
> Before Ethernet Channel Bonding cluster services are active in one node and
> other nodes acts as passive node.
> But after Ethernet Channel Bonding cluster services are active on both the
> nodes
>
> I don't know what is the problem and is their any configuration is required
> in cluster configuration file and
> cman is working only with eth0 interface
>
> Can some one throw light on this peculiar problem
>
> Regards
> -S.Balaji
>
>    <http://lists.centos.org/mailman/listinfo/centos>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080916/467e7450/attachment.htm>

From michael.osullivan at auckland.ac.nz  Tue Sep 16 20:51:28 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Wed, 17 Sep 2008 08:51:28 +1200 (NZST)
Subject: [Linux-cluster] Problem with shared storage and volume groups
Message-ID: <1325.128.187.48.244.1221598288.squirrel@mail.esc.auckland.ac.nz>

Hi everyone,

I have a shared storage device that my 2-node cluster can see
(/dev/storage on both nodes). I wanted to set up a GFS system on the
shared storage, but I am having problems getting both nodes to recognise a
clustered volume group. To set up the clustered volume group I have used
one of the cluster nodes (let's call it node 1).

First, I created a physical volume on my storage device

pvcreate /dev/storage

Then I created a clustered volume group

vgcreate -c y storage_vg /dev/storage

However, when I run

vgchange -aly

on the other cluster node (node 2) I get an error stating that node 2
can't find a volume group with the correct UUID. If I run vgscan and then
vgchange -aly the error goes away, but node 2 still can't see storage_vg.

I have also tried using conga without any success.

Can anyone help? Please let me know if you need more info.

Thanks, Mike





From jstoner at opsource.net  Tue Sep 16 21:55:25 2008
From: jstoner at opsource.net (Jeff Stoner)
Date: Tue, 16 Sep 2008 22:55:25 +0100
Subject: [Linux-cluster] Monitoring services/customize failure criteria
In-Reply-To: <20080916131226.GO17332@marowsky-bree.de>
References: <20080915205828.GD17332@marowsky-bree.de><38A48FA2F0103444906AD22E14F1B5A30802B640@mailxchg01.corp.opsource.net>
	<20080916131226.GO17332@marowsky-bree.de>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A30802B798@mailxchg01.corp.opsource.net>

> -----Original Message-----
> > LSB is strictly for init scripts.
> > OCF is strictly for a cluster-managed resource.
> 
> That's not quite true; OCF is a backwards-compatible extension to LSB.
> ie, a script can be a fully compliant OCF _and_ LSB script at the same
> time, and behave differently depending on how it is called, and share
> code.

Yes, a script can be compliant with both standards but that can't be
assumed.

I'm using section 4 of OCF RA API (at
http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.tx
t?rev=HEAD - if this is not the correct document, then please point me
to the correct one) as the basis of my comparison of LSB and OCF:

"The API tries to make it possible to have RA function both as a normal
LSB
init script and a cluster-aware RA, but this is not required
functionality."

I would interpret that to mean OCF RA scripts are not required to be
backwards-compatible with LSB. The statement "OCF is a
backwards-compatible extension to LSB" could be misleading. I think a
better statement would be "OCF allows a script to be
backwards-compatible with LSB." An OCF script would have to be
individually identified as being backwards-compatible with LSB, not OCF
RA scripts as a whole. I'm taking the safe stance and saying they're
different because of the wording in OCF RA API. If a OCF script is
compliant with LSB, then it can be used as an init script and would be
better, all around, in my opinion.

The problem, which prompted this whole discussion in the first place,
are the exit codes for 'status' and 'monitor'. As you said, LSB uses 2
different tables of codes while OCF has only one. Your original question
from Monday:

> Is it really only 0 versus non-zero for status? How does the system
> distinguish between running, failed, and cleanly stopped then? I ask
> because both LSB & OCF specify a slightly more differentiated status
> exit code.

My answer is: the wrapper script only distinguishes between zero and
non-zero exit codes from scripts. In essence, zero is success and
non-zero is failure. Since the LSB exit codes for 'status' do not map to
the exit codes for 'monitor', the /usr/share/cluster/script.sh "mashes"
it into one. The code (from Red Hat 5 Update 1) would seem to confirm
this:

=====================================
# Don't need to catch return codes; this one will work.
ocf_log info "Executing ${OCF_RESKEY_file} $1"
${OCF_RESKEY_file} $1

declare -i rv=$?
if [ $rv -ne 0 ]; then
        ocf_log err "script:$OCF_RESKEY_name: $1 of $OCF_RESKEY_file
failed (returned $rv)"
        return $OCF_ERR_GENERIC
fi
=====================================

All non-zero exit codes are mapped to the OCF generic error code of 1.

> An LSB-compliant script doesn't need a wrapper to be made 
> OCF-compliant.
> 
> Or, well, yes, of course the calling conventions are 
> different,

That, alone, means you do need a wrapper *if* the calling program
(rgmanager, pacemaker, whatever) does not implement LSB conventions. You
can't call an LSB script with "monitor" and expect a result. If the
resource manager knows to call a script with "status" instead of
"monitor" and interpret the exit code properly, then you have a more
versatile product and don't need a wrapper script. But then the onus is
on you to track the LSB spec along with the OCF spec and handle
differences between them in your code. Rgmanager doesn't have to do that
because it expects only OCF behavior and uses a wrapper shell script to
handle LSB scripts.

> I think the disconnect here is mostly a misunderstanding of the exit
> codes for LSB "status"; it is not merely a 0 or !0 
> distinction. It also
> has "3", which means "not running (unused)". A rather important
> distinction from a failure.

Agreed. Interpreting the exit code needs to be done with prior knowledge
about the state of the service. An exit code of 3 can be either "good"
or "bad" - it depends on what state the service *should* be in. If a
service should be running, a 3 is bad. If it should not be running, a 3
is good. Unfortunately, the wrapper script for rgmanager doesn't make
such distinctions.

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  




From balajisundar at midascomm.com  Wed Sep 17 05:21:44 2008
From: balajisundar at midascomm.com (Balaji)
Date: Wed, 17 Sep 2008 10:51:44 +0530
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 53, Issue 14
In-Reply-To: <20080916154843.2C4A66189C8@hormel.redhat.com>
References: <20080916154843.2C4A66189C8@hormel.redhat.com>
Message-ID: <48D093E8.5000605@midascomm.com>

Dear All, 
  Currently I have successfully configured IPMILAN fencing to both the 
servers and Ethernet Channel Bonding as per the below link
  
http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-hardware-connect.html#S2-HARDWARE-ETHBOND
  After the Fencing and Ethernet Channel Bonding both the servers are 
getting rebooted alternatively (because of fencing by peer node)

I've attached the following files for your reference
 From server-1, (corviewprimary)
1. /var/log/messages ( Refer messages from Sep  10 15:20:22)
2./etc/cluster/cluster.conf
3./etc/sysconfig/network-scripts/ifcfg-eth0
4./etc/sysconfig/network-scripts/ifcfg-eth1
5./etc/sysconfig/network-scripts/ifcfg-bond0
as primary.zip

 From server-2, (corviewsecondary)
1. /var/log/messages ( Refer messages from Sep  10 15:24:07)
2./etc/cluster/cluster.conf
3./etc/sysconfig/network-scripts/ifcfg-eth0
4./etc/sysconfig/network-scripts/ifcfg-eth1
5./etc/sysconfig/network-scripts/ifcfg-bond0
as secondary.zip

Is this happening only when i use bonding.

Can some one throw light on this peculiar problem


Regards
-S.Balaji




From stevan.colaco at gmail.com  Wed Sep 17 07:39:33 2008
From: stevan.colaco at gmail.com (Stevan Colaco)
Date: Wed, 17 Sep 2008 10:39:33 +0300
Subject: [Linux-cluster] GFS SAN Configuration on RHEL5.1 (iscsi storage)
Message-ID: <56bb44d0809170039t49552024vf1b6b347b6fd9354@mail.gmail.com>

Hello All,

I have setup 3 node cluster on RHEL5.1 which works fine.
now i'm trying to setup GFS to Conga Luci. I have setup iscsi targer
and all 3 cluster nodes as iscsi initiators. all 3 cluster nodes can
see the shared storage device as /dev/sda
I've logged into the Luci interface, selected the Storage Tab and
clicked on the First node (Ex: station1.example.com)
Error: The  page seems to be hang with Probing Storage message.

Has anyone faced this problem before? any clues...

Thanks in Advance,
-Stevan Colaco



From ccaulfie at redhat.com  Wed Sep 17 07:39:53 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 17 Sep 2008 08:39:53 +0100
Subject: [Linux-cluster] Problem with shared storage and volume groups
In-Reply-To: <1325.128.187.48.244.1221598288.squirrel@mail.esc.auckland.ac.nz>
References: <1325.128.187.48.244.1221598288.squirrel@mail.esc.auckland.ac.nz>
Message-ID: <48D0B449.2050208@redhat.com>

michael.osullivan at auckland.ac.nz wrote:
> Hi everyone,
> 
> I have a shared storage device that my 2-node cluster can see
> (/dev/storage on both nodes). I wanted to set up a GFS system on the
> shared storage, but I am having problems getting both nodes to recognise a
> clustered volume group. To set up the clustered volume group I have used
> one of the cluster nodes (let's call it node 1).
> 
> First, I created a physical volume on my storage device
> 
> pvcreate /dev/storage
> 
> Then I created a clustered volume group
> 
> vgcreate -c y storage_vg /dev/storage
> 
> However, when I run
> 
> vgchange -aly
> 
> on the other cluster node (node 2) I get an error stating that node 2
> can't find a volume group with the correct UUID. If I run vgscan and then
> vgchange -aly the error goes away, but node 2 still can't see storage_vg.
> 
> I have also tried using conga without any success.
> 
> Can anyone help? Please let me know if you need more info.
> 

Are you running clvmd? If not you should be :)
-- 

Chrissie



From nimish.bhargava at gmail.com  Wed Sep 17 10:47:01 2008
From: nimish.bhargava at gmail.com (Nimish Bhargava)
Date: Wed, 17 Sep 2008 16:17:01 +0530
Subject: [Linux-cluster] Failover
Message-ID: <14e6c76f0809170347x3d421055sa6498fb925a0c8aa@mail.gmail.com>

Hi

I am having a 2 node cluster.
I want to send a signal or run a script on the active server when failover
is taking place i.e. I am manually migrating from the active to standby
server even though active has not gone down.

Is it possible to send this signal ?

Regards
Nimish
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080917/1dd071a5/attachment.htm>

From grigorygor at gmail.com  Wed Sep 17 12:18:23 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Wed, 17 Sep 2008 14:18:23 +0200
Subject: [Linux-cluster] Failover
In-Reply-To: <14e6c76f0809170347x3d421055sa6498fb925a0c8aa@mail.gmail.com>
References: <14e6c76f0809170347x3d421055sa6498fb925a0c8aa@mail.gmail.com>
Message-ID: <c3c0440e0809170518r44a10402n140d716297e3b919@mail.gmail.com>

Try running the script you need as resource  in the service.

2008/9/17 Nimish Bhargava <nimish.bhargava at gmail.com>

> Hi
>
> I am having a 2 node cluster.
> I want to send a signal or run a script on the active server when failover
> is taking place i.e. I am manually migrating from the active to standby
> server even though active has not gone down.
>
> Is it possible to send this signal ?
>
> Regards
> Nimish
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080917/6cf19059/attachment.htm>

From grigorygor at gmail.com  Wed Sep 17 12:22:39 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Wed, 17 Sep 2008 14:22:39 +0200
Subject: [Linux-cluster] GFS SAN Configuration on RHEL5.1 (iscsi storage)
In-Reply-To: <56bb44d0809170039t49552024vf1b6b347b6fd9354@mail.gmail.com>
References: <56bb44d0809170039t49552024vf1b6b347b6fd9354@mail.gmail.com>
Message-ID: <c3c0440e0809170522n6a63c9bfxed03f04cc1acd11c@mail.gmail.com>

I had a problem with creating GFS with luci, just create the GFS in the
command line from one of the nodes run partprobe on all nodes, this way you
don't need reboot.


On Wed, Sep 17, 2008 at 9:39 AM, Stevan Colaco <stevan.colaco at gmail.com>wrote:

> Hello All,
>
> I have setup 3 node cluster on RHEL5.1 which works fine.
> now i'm trying to setup GFS to Conga Luci. I have setup iscsi targer
> and all 3 cluster nodes as iscsi initiators. all 3 cluster nodes can
> see the shared storage device as /dev/sda
> I've logged into the Luci interface, selected the Storage Tab and
> clicked on the First node (Ex: station1.example.com)
> Error: The  page seems to be hang with Probing Storage message.
>
> Has anyone faced this problem before? any clues...
>
> Thanks in Advance,
> -Stevan Colaco
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080917/f6598c72/attachment.htm>

From michael.osullivan at auckland.ac.nz  Wed Sep 17 15:42:50 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Thu, 18 Sep 2008 03:42:50 +1200 (NZST)
Subject: [Linux-cluster] Problem with shared storage and volume groups
Message-ID: <23266.128.187.151.66.1221666170.squirrel@mail.esc.auckland.ac.nz>

Hi Chrissie,

Yes, I am running clvmd. What is the best way to find out where the
clustered volume group is visible, i.e., is there any way to list the
cluster or machines where the volume group thinks it should be visible? Is
there a way to resolve the locking problems?

Thanks, Mike




From michael.osullivan at auckland.ac.nz  Wed Sep 17 15:43:45 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Thu, 18 Sep 2008 03:43:45 +1200 (NZST)
Subject: [Linux-cluster] Problem with shared storage and volume     
	groups
Message-ID: <23275.128.187.151.66.1221666225.squirrel@mail.esc.auckland.ac.nz>

Hi Chrissie,

Yes, I am running clvmd. What is the best way to find out where the
clustered volume group is visible, i.e., is there any way to list the
cluster or machines where the volume group thinks it should be visible? Is
there a way to resolve the locking problems?

Thanks, Mike






From ccaulfie at redhat.com  Wed Sep 17 16:09:46 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 17 Sep 2008 17:09:46 +0100
Subject: [Linux-cluster] Problem with shared storage and volume groups
In-Reply-To: <1325.128.187.48.244.1221598288.squirrel@mail.esc.auckland.ac.nz>
References: <1325.128.187.48.244.1221598288.squirrel@mail.esc.auckland.ac.nz>
Message-ID: <48D12BCA.40007@redhat.com>

michael.osullivan at auckland.ac.nz wrote:
> Hi everyone,
> 
> I have a shared storage device that my 2-node cluster can see
> (/dev/storage on both nodes). I wanted to set up a GFS system on the
> shared storage, but I am having problems getting both nodes to recognise a
> clustered volume group. To set up the clustered volume group I have used
> one of the cluster nodes (let's call it node 1).
> 
> First, I created a physical volume on my storage device
> 
> pvcreate /dev/storage
> 
> Then I created a clustered volume group
> 
> vgcreate -c y storage_vg /dev/storage
> 
> However, when I run
> 
> vgchange -aly
> 
> on the other cluster node (node 2) I get an error stating that node 2
> can't find a volume group with the correct UUID. If I run vgscan and then
> vgchange -aly the error goes away, but node 2 still can't see storage_vg.
> 
> I have also tried using conga without any success.
> 
> Can anyone help? Please let me know if you need more info.
> 
> Thanks, Mike

A full set of error messages from all the affected node would be
helpful. It's also worth doing a 'pvs' on each node to see if the PVs
are visible. Check which version of clvmd (and LVM) you have, some older
versions of clvmd need a restart if a new device is added.



Chrissie



From nimish.bhargava at gmail.com  Thu Sep 18 07:19:10 2008
From: nimish.bhargava at gmail.com (Nimish Bhargava)
Date: Thu, 18 Sep 2008 12:49:10 +0530
Subject: [Linux-cluster] Failover
In-Reply-To: <c3c0440e0809170518r44a10402n140d716297e3b919@mail.gmail.com>
References: <14e6c76f0809170347x3d421055sa6498fb925a0c8aa@mail.gmail.com>
	<c3c0440e0809170518r44a10402n140d716297e3b919@mail.gmail.com>
Message-ID: <14e6c76f0809180019q1008634dkf9711826a684d8d3@mail.gmail.com>

Thanks Grisha

But could you please elaborate on this suggestion..
Even if I add the script as a resource, how can I ensure that it runs only
during failover/migration.

Regards
Nimish

2008/9/17 Grisha G. <grigorygor at gmail.com>

> Try running the script you need as resource  in the service.
>
> 2008/9/17 Nimish Bhargava <nimish.bhargava at gmail.com>
>
>> Hi
>>
>> I am having a 2 node cluster.
>> I want to send a signal or run a script on the active server when failover
>> is taking place i.e. I am manually migrating from the active to standby
>> server even though active has not gone down.
>>
>> Is it possible to send this signal ?
>>
>> Regards
>> Nimish
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080918/2960bded/attachment.htm>

From raju.rajsand at gmail.com  Thu Sep 18 11:55:10 2008
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Thu, 18 Sep 2008 17:25:10 +0530
Subject: [Linux-cluster] Ethernet Channel Bonding Configuration
	Clarification is Needed
In-Reply-To: <fac531740809160848l57a4ba14q2fa5ce369a8ced78@mail.gmail.com>
References: <48CF4943.5020005@midascomm.com>
	<fac531740809160848l57a4ba14q2fa5ce369a8ced78@mail.gmail.com>
Message-ID: <8786b91c0809180455x1348226w9423db824e7831ad@mail.gmail.com>

2008/9/16 Alan A <alan.zg at gmail.com>

> You have to look for the bonding option in /etc/modprobe.conf .. Mode 1 is
> going to use only one NIC at the time and engage the other NIC only if the
> first one fails. Try mode 0 which is round robin or mode 5 or 6.
>
>
>  have configured Cluster Suite with 2 servers
>> Server 1 : 192.168.13.110 IP Address and hostname is primary
>> Server 2 : 192.168.13.179 IP Address and hostname is secondary
>> Floating : 192.168.13.83 IP Address (Assumed by currently active server)
>>
>> I have configured Ethernet Channel Bonding in Each Cluster Nodes as per
>> the below link
>>
>> <http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-hardware-connect.html#S2-HARDWARE-ETHBOND>
>>
>
another thing to check is that multicasting is on in the switch (if it is a
managable one) handling the cluster heartbeat .... That is known to have
caused quite a few anxious moments for the clusterites in the past...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080918/f1119aa9/attachment.htm>

From michael.osullivan at auckland.ac.nz  Thu Sep 18 20:53:43 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Fri, 19 Sep 2008 08:53:43 +1200 (NZST)
Subject: [Linux-cluster] Problem with shared storage and volume groups 
Message-ID: <48194.128.187.125.5.1221771223.squirrel@mail.esc.auckland.ac.nz>

Hi again,

The output of pvs and vgs on the node where I created the volume group are:

[root at croi-01 ~]# pvs
  PV              VG            Fmt  Attr PSize  PFree
  /dev/iscsi_raid iscsi_raid_vg lvm2 a-   19.07G 19.07G
  /dev/sda2       VolGroup00    lvm2 a-   37.16G     0
[root at croi-01 ~]# vgs
  VG            #PV #LV #SN Attr   VSize  VFree
  VolGroup00      1   2   0 wz--n- 37.16G     0
  iscsi_raid_vg   1   0   0 wz--nc 19.07G 19.07G
[root at croi-01 ~]#

On the other node they are

[root at croi-02 ~]# pvs
  PV              VG            Fmt  Attr PSize  PFree
  /dev/sda2       VolGroup00    lvm2 a-   37.16G     0
[root at croi-02 ~]# vgs
  VG            #PV #LV #SN Attr   VSize  VFree
  VolGroup00      1   2   0 wz--n- 37.16G     0
[root at croi-02 ~]#

No errors are apparent in /var/log/messages...

The clvmd version is

Cluster LVM daemon version: 2.02.32-RHEL5 (2008-03-04)
Protocol version:           0.2.1

Any ideas?

Thanks, Mike



From ccaulfie at redhat.com  Fri Sep 19 07:44:11 2008
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 19 Sep 2008 08:44:11 +0100
Subject: [Linux-cluster] Problem with shared storage and volume groups
In-Reply-To: <48194.128.187.125.5.1221771223.squirrel@mail.esc.auckland.ac.nz>
References: <48194.128.187.125.5.1221771223.squirrel@mail.esc.auckland.ac.nz>
Message-ID: <48D3584B.5020301@redhat.com>

michael.osullivan at auckland.ac.nz wrote:
> Hi again,
> 
> The output of pvs and vgs on the node where I created the volume group are:
> 
> [root at croi-01 ~]# pvs
>   PV              VG            Fmt  Attr PSize  PFree
>   /dev/iscsi_raid iscsi_raid_vg lvm2 a-   19.07G 19.07G
>   /dev/sda2       VolGroup00    lvm2 a-   37.16G     0
> [root at croi-01 ~]# vgs
>   VG            #PV #LV #SN Attr   VSize  VFree
>   VolGroup00      1   2   0 wz--n- 37.16G     0
>   iscsi_raid_vg   1   0   0 wz--nc 19.07G 19.07G
> [root at croi-01 ~]#
>
> On the other node they are
> 
> [root at croi-02 ~]# pvs
>   PV              VG            Fmt  Attr PSize  PFree
>   /dev/sda2       VolGroup00    lvm2 a-   37.16G     0
> [root at croi-02 ~]# vgs
>   VG            #PV #LV #SN Attr   VSize  VFree
>   VolGroup00      1   2   0 wz--n- 37.16G     0
> [root at croi-02 ~]#
> 
> No errors are apparent in /var/log/messages...

So the important thing there is that /dev/iscsi_raid is not visible (to
LVM) on croi-02. Without that, the VG that resides on it is not going to
be available to it.

Make sure that iscsi_raid is actually available on that node (not just
visible, but accessible) and also that it is not excluded by some filter
in /etc/lvm/lvm.conf. Running vgscan will reload the filters and
re-create the cache file /etc/lvm/.cache. That latter file shows which
devices LVM will scan.

If you're sure that the device is available and not filtered out then

pvscan --config 'global {locking_type=0}' -vvvvvv

and see if you can see in that (long and involved) log file why the
device is not being used.


Chrissie



From chris-m-lists at joelly.net  Fri Sep 19 20:05:54 2008
From: chris-m-lists at joelly.net (Chris Joelly)
Date: Fri, 19 Sep 2008 22:05:54 +0200
Subject: [Linux-cluster] Locking in GFS
Message-ID: <20080919200554.GA11369@joysn.joelly.net>

Hello,

i have a question on locking issues on GFS:

how do GFS lock files on the filesystem. I have found one posting to
this list which states that locking occurs "more or less" on file 
level. Is this true? or does some kind of locking occur on directory
level too?

The question arrose because i was thinking on whats would happen when
using GFS on an file server with home directories for lets say 1000
users. how do i setup this directory structure to avoid locking issues
as best as possible.

is a directory layout like the following ok when /home is a GFS file 
system:

/home/a/b/c/abc
/home/d/e/f/def
/home/g/h/i/ghi
...
/home/x/y/z/xyz

thanks Chris



From gregory at steulet.org  Fri Sep 19 22:26:36 2008
From: gregory at steulet.org (gregory steulet)
Date: Sat, 20 Sep 2008 00:26:36 +0200
Subject: [Linux-cluster] MySQL service
Message-ID: <1221863196-c7da3004953daae50758b0d17b8993d8@steulet.org>

Hi folks,

I've a problem with the mysql resource. I cannot start/stop or even monitor a service with mysql. You should find everything regarding my configuration and the error messages below. For Information I manage other resources such as VIP that are working fine.

1.Check mysql status 
----------------------

[root at emperor01 ~]# /etc/init.d/mysql.server status
MySQL is not running                                       [FAILED]


2.Enabling the service mysql_service via luci
---------------------------------------------

Sep 20 00:14:27 emperor01 clurgmgrd[5035]: <notice> Starting disabled service service:mysql_service 
Sep 20 00:14:30 emperor01 luci[18284]: Unable to retrieve batch 525737614 status from emperor01.high-availability.eu:11111: module scheduled for execution
Sep 20 00:14:49 emperor01 last message repeated 2 times
Sep 20 00:14:57 emperor01 clurgmgrd: [5035]: <err> Starting Service mysql:mysql > Failed - Timeout Error 
Sep 20 00:14:57 emperor01 clurgmgrd[5035]: <notice> start on mysql "mysql" returned 1 (generic error) 
Sep 20 00:14:57 emperor01 clurgmgrd[5035]: <warning> #68: Failed to start service:mysql_service; return value: 1 
Sep 20 00:14:57 emperor01 clurgmgrd[5035]: <notice> Stopping service service:mysql_service 
Sep 20 00:14:57 emperor01 clurgmgrd: [5035]: <err> Checking Existence Of File /u00/app/mysql/admin/mysqld1/socket/mysqld1.pid [mysql:mysql] > Failed - File D
ist 
Sep 20 00:14:57 emperor01 clurgmgrd: [5035]: <err> Stopping Service mysql:mysql > Failed 
Sep 20 00:14:57 emperor01 clurgmgrd[5035]: <notice> stop on mysql "mysql" returned 1 (generic error) 
Sep 20 00:14:58 emperor01 clurgmgrd[5035]: <crit> #12: RG service:mysql_service failed to stop; intervention required 
Sep 20 00:14:58 emperor01 clurgmgrd[5035]: <notice> Service service:mysql_service is failed 
Sep 20 00:14:58 emperor01 clurgmgrd[5035]: <crit> #13: Service service:mysql_service failed to stop cleanly 
Sep 20 00:14:59 emperor01 luci[18284]: Unable to retrieve batch 525737614 status from emperor01.high-availability.eu:11111: clusvcadm start failed to start m
ice: 

3.Check mysql status 
-----------------------
[root at emperor01 ~]# /etc/init.d/mysql.server status
MySQL is not running                                       [FAILED]

4.Try to start the service manually
------------------------------------
[root at emperor01 ~]# /etc/init.d/mysql.server start
Starting MySQL.                                            [  OK  ]
[root at emperor01 ~]# /etc/init.d/mysql.server status
MySQL running (11467)                                      [  OK  ]


5. /etc/my.cnf
-------------------

[mysqld1]

server-id        = 1
port             = 33001

log-bin          = mysqld1-bin

datadir          = /u01/mysqldata/mysqld1
basedir          = /u00/app/mysql/product/5.1.26

mysqld           = /u00/app/mysql/product/5.1.26/bin/mysqld
mysqladmin       = /u00/app/mysql/product/5.1.26/bin/mysqladmin

# Log/Socket files
socket           = /u00/app/mysql/admin/mysqld1/socket/mysqld1.sock
pid-file         = /u00/app/mysql/admin/mysqld1/socket/mysqld1.pid
log-error        = /u00/app/mysql/admin/mysqld1/log/mysqld1.err
log              = /u00/app/mysql/admin/mysqld1/log/mysqld1.log
log-slow-queries = /u00/app/mysql/admin/mysqld1/log/mysqld1_slow_queries.log

language         = /u00/app/mysql/product/5.1.26/share/english

6. cluster.conf
--------------------

<?xml version="1.0"?>
<cluster alias="MySQLFailover" config_version="71" name="new_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="emperor01.high-availability.eu" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="AP7920" port="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="emperor02.high-availability.eu" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="AP7920" port="4"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.10" login="apc" name="AP7920" passwd="apc"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="new_cluster_failover" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="emperor01.high-availability.eu" priority="1"/>
                                <failoverdomainnode name="emperor02.high-availability.eu" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.1.120" monitor_link="1"/>
                        <mysql config_file="/etc/my.cnf" listen_address="192.168.1.120" name="mysql" shutdown_wait="0"/>
                </resources>
                <service autostart="1" exclusive="0" name="mysql" recovery="relocate">
                        <ip ref="192.168.1.120"/>
                </service>
                <service autostart="1" name="mysql_service">
                        <mysql ref="mysql"/>
                </service>
        </rm>
        <quorumd interval="3" label="quorum_disk" min_score="1" tko="10" votes="1">
                <heuristic interval="2" program="ping -c3 -t2 192.168.1.1" score="1"/>
        </quorumd>
        <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
</cluster>

7. /usr/share/cluster/mysql.sh
--------------------------------

...export LC_ALL=C
export LANG=C
export PATH=/bin:/sbin:/usr/bin:/usr/sbin

. $(dirname $0)/ocf-shellfuncs
. $(dirname $0)/utils/config-utils.sh
. $(dirname $0)/utils/messages.sh
. $(dirname $0)/utils/ra-skelet.sh

declare MYSQL_MYSQLD=/u00/app/mysql/product/5.1.26/bin/mysqld_safe
declare MYSQL_ipAddress
declare MYSQL_pid_file=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid
declare MYSQL_timeout=30
...

8. /etc/init.d/mysql.server
-----------------------------

...
basedir=/u00/app/mysql/product/5.1.26
datadir=/u01/mysqldata/mysqld1
...
pid_file=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid
server_pid_file=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid
use_mysqld_safe=1
user=mysql
....


Any help of any kind is welcome ;-)

Greg



From michael.osullivan at auckland.ac.nz  Sat Sep 20 00:11:09 2008
From: michael.osullivan at auckland.ac.nz (michael.osullivan at auckland.ac.nz)
Date: Sat, 20 Sep 2008 12:11:09 +1200 (NZST)
Subject: [Linux-cluster] Problem with shared storage and volume 	groups
Message-ID: <1307.128.187.173.205.1221869469.squirrel@mail.esc.auckland.ac.nz>

Hi Chrissie,

Thanks for your help. I have managed to get my cluster up and running and
the volume group is now visible on both cluster nodes. I suspect that I
clvmd was not running properly, although I am really not sure. However,
after starting up the cluster this afternoon and starting rgmanager and
clvmd on both nodes the volume group (iscsi_raid_vg) is there. However, no
I try to create a logical volume to use for my GFS

lvcreate -L 19.07G iscsi_raid_vg

and I get the error

Error locking on node <other_node>: Error backing up metadata, can't find
VG for group #global
Aborting. Failed to activate new LV to wipe the start of it.

Any suggestions?

Also, I have been trying to configure my 2-node cluster using conga, but
have resorted to command line control as conga seems to hang quite
frequently. I have read the Red Hat Manuals, but there is a lot of
material and nothing that seems to cover the problems I keep encountering.
I think the main issue is that I have shared storage presented via iSCSI
to the cluster. I have to power down every night and power back up every
morning. I then have to use mdadm to reassemble the iSCSI storage (due to
multipathing and RAID). Then I have to get the cluster back up and
running. I am getting better at the process, but I don't think I am
powering down the cluster in a "nice" way, so I get issues when I try to
start it up again. Do you have any suggestions for a shutdown, startup
sequence that will get the cluster up and going nicely? Currently, the
sequence is something like:

# Shutdown
service clvmd stop
service rgmanager stop
service cman stop
# Shut the machine down

# Startup
iptables -F # Isolated experimental network, so I am pretty free here
mdadm --assemble --scan # Sometimes I will have to rebuild the md devices
individually
service cman start
service rgmanager start
service clvmd start
# Do I need to cman_tool join here? Or fence_ack_manual -n <node>?

Sorry for all the questions, but I am only building a prototype at this
stage and would like to get it "down and up" as easily as possible so I
can do some testing.

Kind regards, Mike



From grigorygor at gmail.com  Sat Sep 20 17:31:21 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Sat, 20 Sep 2008 19:31:21 +0200
Subject: [Linux-cluster] Failover
In-Reply-To: <14e6c76f0809180019q1008634dkf9711826a684d8d3@mail.gmail.com>
References: <14e6c76f0809170347x3d421055sa6498fb925a0c8aa@mail.gmail.com>
	<c3c0440e0809170518r44a10402n140d716297e3b919@mail.gmail.com>
	<14e6c76f0809180019q1008634dkf9711826a684d8d3@mail.gmail.com>
Message-ID: <c3c0440e0809201031g53220e6dv911a727aed9ee3b0@mail.gmail.com>

Hi Nimish

You can add it to one of the cluster services that run on your cluster, this
way then this service will migrate/relocate your script will run on new
node.

Grisha

2008/9/18 Nimish Bhargava <nimish.bhargava at gmail.com>

> Thanks Grisha
>
> But could you please elaborate on this suggestion..
> Even if I add the script as a resource, how can I ensure that it runs only
> during failover/migration.
>
> Regards
> Nimish
>
> 2008/9/17 Grisha G. <grigorygor at gmail.com>
>
> Try running the script you need as resource  in the service.
>>
>> 2008/9/17 Nimish Bhargava <nimish.bhargava at gmail.com>
>>
>>> Hi
>>>
>>> I am having a 2 node cluster.
>>> I want to send a signal or run a script on the active server when
>>> failover is taking place i.e. I am manually migrating from the active to
>>> standby server even though active has not gone down.
>>>
>>> Is it possible to send this signal ?
>>>
>>> Regards
>>> Nimish
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080920/649f92bc/attachment.htm>

From grigorygor at gmail.com  Sun Sep 21 14:12:55 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Sun, 21 Sep 2008 16:12:55 +0200
Subject: [Linux-cluster] MySQL service
In-Reply-To: <1221863196-c7da3004953daae50758b0d17b8993d8@steulet.org>
References: <1221863196-c7da3004953daae50758b0d17b8993d8@steulet.org>
Message-ID: <c3c0440e0809210712u49bfdb95ma34f2bdf11ca6580@mail.gmail.com>

Hi Greg

There seems to be a problem with RHCS mysql init script, I'll try to get to
it when I'll have more time.
For now you can solve this problem by not using the MySQL resource in luci,
instead use script resource to start/stop MySQL with /etc/init.d/mysql

If you use GFS and nead a floating IP your cluster.conf should look
something like this.

<resources>
                        <script file="/etc/init.d/mysqld"
name="mysql-scritp"/>
                        <ip address="192.168.122.100" monitor_link="1"/>
                        <clusterfs device="/dev/sda3" force_unmount="0"
fsid="59408" fstype="gfs" mountpoint="/data" name="virt-gfs"
self_fence="0"/>
                </resources>

Put all this resources in one service GFS should be first.

Hope this will help you

Grisha
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080921/e3de3a1b/attachment.htm>

From mulach at libero.it  Mon Sep 22 06:56:44 2008
From: mulach at libero.it (mulach at libero.it)
Date: Mon, 22 Sep 2008 08:56:44 +0200
Subject: [Linux-cluster] Cluster Centos - Don't switch resource
Message-ID: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>

Hi,

in Centos 5.2 i had create a cluster with Conga(two node has been in vmware server). The problem is that when a node fail, don't switch the service. Below the cluster.conf

--------------------------------
<?xml version="1.0"?>
<cluster alias="SecondCluster" config_version="23" name="SecondCluster">
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="clu1.localdomain" nodeid="1" votes="1">
			<fence>
				<method name="1"/>
			</fence>
		</clusternode>
		<clusternode name="clu2.localdomain" nodeid="2" votes="1">
			<fence>
				<method name="1"/>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices/>
	<rm>
		<failoverdomains>
			<failoverdomain name="fail" ordered="1" restricted="1">
				<failoverdomainnode name="clu1.localdomain" priority="1"/>
				<failoverdomainnode name="clu2.localdomain" priority="2"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="192.168.80.201" monitor_link="1"/>
			<fs device="/dev/sdb1" force_fsck="0" force_unmount="0" fsid="45662" fstype="ext3" mountpoint="/mnt/sdc" name="Share" options="" self_fence="0"/>
		</resources>
		<service autostart="1" domain="fail" exclusive="1" name="IPS" recovery="restart">
			<ip ref="192.168.80.201"/>
			<fs ref="Share"/>
		</service>
	</rm>
	<fence_xvmd/>
</cluster>

--------------------------------




From linux at vfemail.net  Mon Sep 22 07:07:00 2008
From: linux at vfemail.net (Alex)
Date: Mon, 22 Sep 2008 10:07:00 +0300
Subject: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
Message-ID: <200809221007.00927.linux@vfemail.net>

Take a look here, for a working example. It worked for me:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Cluster_Administration/ap-httpd-service-CA.html

Regards,
Alx

On Monday 22 September 2008 09:56, mulach at libero.it wrote:
> Hi,
>
> in Centos 5.2 i had create a cluster with Conga(two node has been in vmware
> server). The problem is that when a node fail, don't switch the service.
> Below the cluster.conf
>
> --------------------------------
> <?xml version="1.0"?>
> <cluster alias="SecondCluster" config_version="23" name="SecondCluster">
>  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>  <clusternodes>
>   <clusternode name="clu1.localdomain" nodeid="1" votes="1">
>    <fence>
>     <method name="1"/>
>    </fence>
>   </clusternode>
>   <clusternode name="clu2.localdomain" nodeid="2" votes="1">
>    <fence>
>     <method name="1"/>
>    </fence>
>   </clusternode>
>  </clusternodes>
>  <cman expected_votes="1" two_node="1"/>
>  <fencedevices/>
>  <rm>
>   <failoverdomains>
>    <failoverdomain name="fail" ordered="1" restricted="1">
>     <failoverdomainnode name="clu1.localdomain" priority="1"/>
>     <failoverdomainnode name="clu2.localdomain" priority="2"/>
>    </failoverdomain>
>   </failoverdomains>
>   <resources>
>    <ip address="192.168.80.201" monitor_link="1"/>
>    <fs device="/dev/sdb1" force_fsck="0" force_unmount="0" fsid="45662"
> fstype="ext3" mountpoint="/mnt/sdc" name="Share" options=""
> self_fence="0"/> </resources>
>   <service autostart="1" domain="fail" exclusive="1" name="IPS"
> recovery="restart"> <ip ref="192.168.80.201"/>
>    <fs ref="Share"/>
>   </service>
>  </rm>
>  <fence_xvmd/>
> </cluster>
>
> --------------------------------
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From grigorygor at gmail.com  Mon Sep 22 09:43:10 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Mon, 22 Sep 2008 12:43:10 +0300
Subject: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <200809221007.00927.linux@vfemail.net>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
	<200809221007.00927.linux@vfemail.net>
Message-ID: <c3c0440e0809220243m7326f18dj7262e00b04db521d@mail.gmail.com>

Try changing   recovery="restart" to  relocate if that wont work, check if
your fencing works properly
If your node don't get fenced the other node will not take the service from
the failed node.
Try using quorum partition to solve this problem

Grisha

On Mon, Sep 22, 2008 at 10:07 AM, Alex <linux at vfemail.net> wrote:

> Take a look here, for a working example. It worked for me:
>
> http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Cluster_Administration/ap-httpd-service-CA.html
>
> Regards,
> Alx
>
> On Monday 22 September 2008 09:56, mulach at libero.it wrote:
> > Hi,
> >
> > in Centos 5.2 i had create a cluster with Conga(two node has been in
> vmware
> > server). The problem is that when a node fail, don't switch the service.
> > Below the cluster.conf
> >
> > --------------------------------
> > <?xml version="1.0"?>
> > <cluster alias="SecondCluster" config_version="23" name="SecondCluster">
> >  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
> >  <clusternodes>
> >   <clusternode name="clu1.localdomain" nodeid="1" votes="1">
> >    <fence>
> >     <method name="1"/>
> >    </fence>
> >   </clusternode>
> >   <clusternode name="clu2.localdomain" nodeid="2" votes="1">
> >    <fence>
> >     <method name="1"/>
> >    </fence>
> >   </clusternode>
> >  </clusternodes>
> >  <cman expected_votes="1" two_node="1"/>
> >  <fencedevices/>
> >  <rm>
> >   <failoverdomains>
> >    <failoverdomain name="fail" ordered="1" restricted="1">
> >     <failoverdomainnode name="clu1.localdomain" priority="1"/>
> >     <failoverdomainnode name="clu2.localdomain" priority="2"/>
> >    </failoverdomain>
> >   </failoverdomains>
> >   <resources>
> >    <ip address="192.168.80.201" monitor_link="1"/>
> >    <fs device="/dev/sdb1" force_fsck="0" force_unmount="0" fsid="45662"
> > fstype="ext3" mountpoint="/mnt/sdc" name="Share" options=""
> > self_fence="0"/> </resources>
> >   <service autostart="1" domain="fail" exclusive="1" name="IPS"
> > recovery="restart"> <ip ref="192.168.80.201"/>
> >    <fs ref="Share"/>
> >   </service>
> >  </rm>
> >  <fence_xvmd/>
> > </cluster>
> >
> > --------------------------------
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080922/f76dca7a/attachment.htm>

From mgrac at redhat.com  Mon Sep 22 11:58:20 2008
From: mgrac at redhat.com (Marek 'marx' Grac)
Date: Mon, 22 Sep 2008 13:58:20 +0200
Subject: [Linux-cluster] MySQL service
In-Reply-To: <1221863196-c7da3004953daae50758b0d17b8993d8@steulet.org>
References: <1221863196-c7da3004953daae50758b0d17b8993d8@steulet.org>
Message-ID: <48D7885C.3070109@redhat.com>

Hi,

gregory steulet wrote:
> Hi folks,
>
> I've a problem with the mysql resource. I cannot start/stop or even monitor a service with mysql. You should find everything regarding my configuration and the error messages below. For Information I manage other resources such as VIP that are working fine.
>
> 1.Check mysql status 
> ----------------------
>
> [root at emperor01 ~]# /etc/init.d/mysql.server status
> MySQL is not running                                       [FAILED]
>   
> 3.Check mysql status 
> -----------------------
> [root at emperor01 ~]# /etc/init.d/mysql.server status
> MySQL is not running                                       [FAILED]
>   
When you use a RH Cluster Suite and our resource agent then you should 
NOT use init scripts in /etc/init.d because they are not cluster ready. 
Main advantage of resource agent is that it is possible to have several 
Mysql/Apache/... running on same machine. Standard init scripts checks 
pid, lock files etc in standard paths. We have them in different paths.

> 6. cluster.conf
> --------------------
>
>                 <resources>
>                         <ip address="192.168.1.120" monitor_link="1"/>
>                         <mysql config_file="/etc/my.cnf" listen_address="192.168.1.120" name="mysql" shutdown_wait="0"/>
>                 </resources>
>                 <service autostart="1" exclusive="0" name="mysql" recovery="relocate">
>                         <ip ref="192.168.1.120"/>
>                 </service>
>                 <service autostart="1" name="mysql_service">
>                         <mysql ref="mysql"/>
>                 </service>
>         </rm>
>   

> 7. /usr/share/cluster/mysql.sh
> --------------------------------
>
> ..export LC_ALL=C
> export LANG=C
> export PATH=/bin:/sbin:/usr/bin:/usr/sbin
>
>  $(dirname $0)/ocf-shellfuncs
>  $(dirname $0)/utils/config-utils.sh
>  $(dirname $0)/utils/messages.sh
>  $(dirname $0)/utils/ra-skelet.sh
>
> declare MYSQL_MYSQLD=/u00/app/mysql/product/5.1.26/bin/mysqld_safe
> declare MYSQL_ipAddress
> declare MYSQL_pid_file=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid
> declare MYSQL_timeout=30
> ..
>
>   

In normal use there is no need to change resource agent itself. You need 
to change MYSQL_MYSQLD and this should not broke anything. From what I 
can see from your output problem is that we are waiting for pid file 
(line 126 of mysql.sh) and it appears elsewhere or didn't appear in 30 
seconds.

marx,

-- 
Marek Grac
Red Hat Czech s.r.o.



From rpeterso at redhat.com  Mon Sep 22 14:19:51 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 22 Sep 2008 10:19:51 -0400 (EDT)
Subject: [Linux-cluster] Locking in GFS
In-Reply-To: <355077132.2380201222092947850.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <582872992.2380951222093190284.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Chris Joelly" <chris-m-lists at joelly.net> wrote:
| Hello,
| 
| i have a question on locking issues on GFS:
| 
| how do GFS lock files on the filesystem. I have found one posting to
| this list which states that locking occurs "more or less" on file 
| level. Is this true? or does some kind of locking occur on directory
| level too?
| 
| The question arrose because i was thinking on whats would happen when
| using GFS on an file server with home directories for lets say 1000
| users. how do i setup this directory structure to avoid locking
| issues
| as best as possible.
| 
| is a directory layout like the following ok when /home is a GFS file 
| system:
| 
| /home/a/b/c/abc
| /home/d/e/f/def
| /home/g/h/i/ghi
| ..
| /home/x/y/z/xyz
| 
| thanks Chris

Hi Chris,

I'm not sure what your concern is.  There shouldn't be any issues
with GFS locking, for the most part.  GFS has cluster-wide locking
so that locking a file on one node in the cluster will known
automatically by the entire cluster.  GFS uses an internal lock
caching mechanism known as "glocks".  These glocks are cached locally
and don't require lots of network interaction unless there are
multiple nodes requesting the same lock.

With home directories, I wouldn't expect a lot of lock contention
anyway, so again, I'm not sure what your concern is.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From mulach at libero.it  Mon Sep 22 15:24:31 2008
From: mulach at libero.it (mulach)
Date: Mon, 22 Sep 2008 17:24:31 +0200
Subject: R: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <c3c0440e0809220243m7326f18dj7262e00b04db521d@mail.gmail.com>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it><200809221007.00927.linux@vfemail.net>
	<c3c0440e0809220243m7326f18dj7262e00b04db521d@mail.gmail.com>
Message-ID: <2FD650A952A84E53B7445845793E74BA@invitto>

I changed recovery="restart" to  relocate, but don?t work. I understood what
means Try using quorum partition to solve this problem

 

 

  _____  

Da: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] Per conto di Grisha G.
Inviato: luned? 22 settembre 2008 11.43
A: linux clustering
Oggetto: Re: [Linux-cluster] Cluster Centos - Don't switch resource

 


Try changing   recovery="restart" to  relocate if that wont work, check if
your fencing works properly
If your node don't get fenced the other node will not take the service from
the failed node.
Try using quorum partition to solve this problem

Grisha

On Mon, Sep 22, 2008 at 10:07 AM, Alex <linux at vfemail.net> wrote:

Take a look here, for a working example. It worked for me:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Cl
uster_Administration/ap-httpd-service-CA.html

Regards,
Alx


On Monday 22 September 2008 09:56, mulach at libero.it wrote:
> Hi,
>
> in Centos 5.2 i had create a cluster with Conga(two node has been in
vmware
> server). The problem is that when a node fail, don't switch the service.
> Below the cluster.conf
>
> --------------------------------
> <?xml version="1.0"?>
> <cluster alias="SecondCluster" config_version="23" name="SecondCluster">
>  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>  <clusternodes>
>   <clusternode name="clu1.localdomain" nodeid="1" votes="1">
>    <fence>
>     <method name="1"/>
>    </fence>
>   </clusternode>
>   <clusternode name="clu2.localdomain" nodeid="2" votes="1">
>    <fence>
>     <method name="1"/>
>    </fence>
>   </clusternode>
>  </clusternodes>
>  <cman expected_votes="1" two_node="1"/>
>  <fencedevices/>
>  <rm>
>   <failoverdomains>
>    <failoverdomain name="fail" ordered="1" restricted="1">
>     <failoverdomainnode name="clu1.localdomain" priority="1"/>
>     <failoverdomainnode name="clu2.localdomain" priority="2"/>
>    </failoverdomain>
>   </failoverdomains>
>   <resources>
>    <ip address="192.168.80.201" monitor_link="1"/>
>    <fs device="/dev/sdb1" force_fsck="0" force_unmount="0" fsid="45662"
> fstype="ext3" mountpoint="/mnt/sdc" name="Share" options=""
> self_fence="0"/> </resources>
>   <service autostart="1" domain="fail" exclusive="1" name="IPS"
> recovery="restart"> <ip ref="192.168.80.201"/>
>    <fs ref="Share"/>
>   </service>
>  </rm>
>  <fence_xvmd/>
> </cluster>
>
> --------------------------------
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

 

No virus found in this incoming message.
Checked by AVG - http://www.avg.com
Version: 8.0.169 / Virus Database: 270.7.0/1683 - Release Date: 21/09/2008
10.10

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080922/ade7b6f9/attachment.htm>

From xavier.montagutelli at unilim.fr  Mon Sep 22 15:30:44 2008
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Mon, 22 Sep 2008 17:30:44 +0200
Subject: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
Message-ID: <200809221730.45084.xavier.montagutelli@unilim.fr>

On Monday 22 September 2008 08:56, mulach at libero.it wrote:
> Hi,
>
> in Centos 5.2 i had create a cluster with Conga(two node has been in vmware
> server). The problem is that when a node fail, don't switch the service.
> Below the cluster.conf
>
> --------------------------------
> <?xml version="1.0"?>
> <cluster alias="SecondCluster" config_version="23" name="SecondCluster">
> 	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
> 	<clusternodes>
> 		<clusternode name="clu1.localdomain" nodeid="1" votes="1">
> 			<fence>
> 				<method name="1"/>
> 			</fence>
> 		</clusternode>
> 		<clusternode name="clu2.localdomain" nodeid="2" votes="1">
> 			<fence>
> 				<method name="1"/>
> 			</fence>
> 		</clusternode>
> 	</clusternodes>

If I understand correctly your cluster.conf file, you don't have fencing 
devices for your nodes. In my tests, I **had** to define fencing methods for 
the service to switch when one node fails. Otherwise, it doesn't work !

You can try configuring a "fence_manual" method (just for testing). After clu1 
failure, clu2 should "fence" it. You then have to use 
the "fence_ack_manual -n clu1.localdomain" command to confirm manually that 
the fencing is done.

http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_manual2


In my cluster.conf, it looks like :


 	<clusternode name="clu1.localdomain" nodeid="1" votes="1">
 		<fence>
 		  <method name="1"/>
                     <device name="manual_1" nodename="clu1.localdomain"/>
 		</fence>
	</clusternode>
	(same for clu2)

        <fencedevices>
                <fencedevice agent="fence_manual" name="manual_2"/>
                <fencedevice agent="fence_manual" name="manual_1"/>
        </fencedevices>

Does it solve your pb ?


> 	<cman expected_votes="1" two_node="1"/>
> 	<fencedevices/>
> 	<rm>
> 		<failoverdomains>
> 			<failoverdomain name="fail" ordered="1" restricted="1">
> 				<failoverdomainnode name="clu1.localdomain" priority="1"/>
> 				<failoverdomainnode name="clu2.localdomain" priority="2"/>
> 			</failoverdomain>
> 		</failoverdomains>
> 		<resources>
> 			<ip address="192.168.80.201" monitor_link="1"/>
> 			<fs device="/dev/sdb1" force_fsck="0" force_unmount="0" fsid="45662"
> fstype="ext3" mountpoint="/mnt/sdc" name="Share" options=""
> self_fence="0"/> </resources>
> 		<service autostart="1" domain="fail" exclusive="1" name="IPS"
> recovery="restart"> <ip ref="192.168.80.201"/>
> 			<fs ref="Share"/>
> 		</service>
> 	</rm>
> 	<fence_xvmd/>
> </cluster>
>
> --------------------------------
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex



From xavier.montagutelli at unilim.fr  Mon Sep 22 15:33:34 2008
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Mon, 22 Sep 2008 17:33:34 +0200
Subject: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <200809221730.45084.xavier.montagutelli@unilim.fr>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
	<200809221730.45084.xavier.montagutelli@unilim.fr>
Message-ID: <200809221733.35106.xavier.montagutelli@unilim.fr>

On Monday 22 September 2008 17:30, Xavier Montagutelli wrote:
[..]
>
> In my cluster.conf, it looks like :
>
>
>  	<clusternode name="clu1.localdomain" nodeid="1" votes="1">
>  		<fence>
>  		  <method name="1"/>
>                      <device name="manual_1" nodename="clu1.localdomain"/>
>  		</fence>
> 	</clusternode>

Oups, please correct :

  		<fence>
  		  <method name="1">
                      <device name="manual_1" nodename="clu1.localdomain"/>
	          </method>
  		</fence>


-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex



From mulach at libero.it  Mon Sep 22 15:36:43 2008
From: mulach at libero.it (mulach)
Date: Mon, 22 Sep 2008 17:36:43 +0200
Subject: R: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <200809221733.35106.xavier.montagutelli@unilim.fr>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it><200809221730.45084.xavier.montagutelli@unilim.fr>
	<200809221733.35106.xavier.montagutelli@unilim.fr>
Message-ID: <02B888F9239945748BFF6DA835608191@invitto>

OK tnks 1000!

-----Messaggio originale-----
Da: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] Per conto di Xavier Montagutelli
Inviato: luned? 22 settembre 2008 17.34
A: linux clustering
Oggetto: Re: [Linux-cluster] Cluster Centos - Don't switch resource

On Monday 22 September 2008 17:30, Xavier Montagutelli wrote:
[..]
>
> In my cluster.conf, it looks like :
>
>
>  	<clusternode name="clu1.localdomain" nodeid="1" votes="1">
>  		<fence>
>  		  <method name="1"/>
>                      <device name="manual_1" nodename="clu1.localdomain"/>
>  		</fence>
> 	</clusternode>

Oups, please correct :

  		<fence>
  		  <method name="1">
                      <device name="manual_1" nodename="clu1.localdomain"/>
	          </method>
  		</fence>


-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

No virus found in this incoming message.
Checked by AVG - http://www.avg.com 
Version: 8.0.169 / Virus Database: 270.7.0/1683 - Release Date: 21/09/2008
10.10




From grigorygor at gmail.com  Mon Sep 22 16:54:25 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Mon, 22 Sep 2008 18:54:25 +0200
Subject: R: [Linux-cluster] Cluster Centos - Don't switch resource
In-Reply-To: <2FD650A952A84E53B7445845793E74BA@invitto>
References: <K7L4MK$1975E3C7C3A907E3606A6C891AD3106D@libero.it>
	<200809221007.00927.linux@vfemail.net>
	<c3c0440e0809220243m7326f18dj7262e00b04db521d@mail.gmail.com>
	<2FD650A952A84E53B7445845793E74BA@invitto>
Message-ID: <c3c0440e0809220954g127c25b0g688cece75308ef21@mail.gmail.com>

with a 2 node cluster you may have a "split brain" problem (it then there is
no communication between the nodes so each node try's to fence the other) in
your case the problem may be different and there is no fencing at all.
The quorum disk/partition solves that problem.
After you set the fencing for your nodes (make sure the fencing works)
create a quorum partiotion its easy  (Google it ) in your cluster.conf
you'll need to change

<cman expected_votes="1" two_node="1"/> to
<cman expected_votes="3" two_node="0"/>

See if this helps you.

2008/9/22 mulach <mulach at libero.it>

>  I changed recovery="restart" to  relocate, but don't work. I understood
> what means Try using quorum partition to solve this problem
>
>
>
>
>  ------------------------------
>
> *Da:* linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat.com] *Per conto di *Grisha G.
> *Inviato:* luned? 22 settembre 2008 11.43
> *A:* linux clustering
> *Oggetto:* Re: [Linux-cluster] Cluster Centos - Don't switch resource
>
>
>
>
> Try changing   recovery="restart" to  relocate if that wont work, check if
> your fencing works properly
> If your node don't get fenced the other node will not take the service from
> the failed node.
> Try using quorum partition to solve this problem
>
> Grisha
>
> On Mon, Sep 22, 2008 at 10:07 AM, Alex <linux at vfemail.net> wrote:
>
> Take a look here, for a working example. It worked for me:
>
> http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Cluster_Administration/ap-httpd-service-CA.html
>
> Regards,
> Alx
>
>
> On Monday 22 September 2008 09:56, mulach at libero.it wrote:
> > Hi,
> >
> > in Centos 5.2 i had create a cluster with Conga(two node has been in
> vmware
> > server). The problem is that when a node fail, don't switch the service.
> > Below the cluster.conf
> >
> > --------------------------------
> > <?xml version="1.0"?>
> > <cluster alias="SecondCluster" config_version="23" name="SecondCluster">
> >  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
> >  <clusternodes>
> >   <clusternode name="clu1.localdomain" nodeid="1" votes="1">
> >    <fence>
> >     <method name="1"/>
> >    </fence>
> >   </clusternode>
> >   <clusternode name="clu2.localdomain" nodeid="2" votes="1">
> >    <fence>
> >     <method name="1"/>
> >    </fence>
> >   </clusternode>
> >  </clusternodes>
> >  <cman expected_votes="1" two_node="1"/>
> >  <fencedevices/>
> >  <rm>
> >   <failoverdomains>
> >    <failoverdomain name="fail" ordered="1" restricted="1">
> >     <failoverdomainnode name="clu1.localdomain" priority="1"/>
> >     <failoverdomainnode name="clu2.localdomain" priority="2"/>
> >    </failoverdomain>
> >   </failoverdomains>
> >   <resources>
> >    <ip address="192.168.80.201" monitor_link="1"/>
> >    <fs device="/dev/sdb1" force_fsck="0" force_unmount="0" fsid="45662"
> > fstype="ext3" mountpoint="/mnt/sdc" name="Share" options=""
> > self_fence="0"/> </resources>
> >   <service autostart="1" domain="fail" exclusive="1" name="IPS"
> > recovery="restart"> <ip ref="192.168.80.201"/>
> >    <fs ref="Share"/>
> >   </service>
> >  </rm>
> >  <fence_xvmd/>
> > </cluster>
> >
> > --------------------------------
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> No virus found in this incoming message.
> Checked by AVG - http://www.avg.com
> Version: 8.0.169 / Virus Database: 270.7.0/1683 - Release Date: 21/09/2008
> 10.10
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080922/bf369d3e/attachment.htm>

From jerlyon at gmail.com  Mon Sep 22 18:28:06 2008
From: jerlyon at gmail.com (Jeremy Lyon)
Date: Mon, 22 Sep 2008 12:28:06 -0600
Subject: [Linux-cluster] fast_stafs
Message-ID: <779919740809221128td5e8ceat506a4b4bb9b447f2@mail.gmail.com>

Hi,

I was reading through a write up on fast_statsf,
http://people.redhat.com/rpeterso/Patches/GFS/readme.gfs_fast_statfs.R4.  In
the explanation of how the code works in 2.6 states:

2.6 Local change (delta) is synced to disk whenever quota daemon is
waked up and the (a tunable, default to 5 seconds). It is then
subsequently zeroed out.

Does this mean that I can't mount a GFS file system with the noquota option
and use fast_statfs?

Also, am I correct in understanding that if there is an unclean shutdown
that it is best to turn off fast_statfs  and then back on to get the statfs
data in sync on existing nodes?


Thanks
Jeremy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080922/37c024b8/attachment.htm>

From s.wendy.cheng at gmail.com  Mon Sep 22 19:01:43 2008
From: s.wendy.cheng at gmail.com (Wendy Cheng)
Date: Mon, 22 Sep 2008 15:01:43 -0400
Subject: [Fwd: Re: [Fwd: [Linux-cluster] fast_stafs]]
Message-ID: <48D7EB97.8080809@gmail.com>


-------------- next part --------------
An embedded message was scrubbed...
From: Wendy Cheng <wcheng at netapp.com>
Subject: Re: [Fwd: [Linux-cluster] fast_stafs]
Date: Mon, 22 Sep 2008 15:00:43 -0400
Size: 1364
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080922/4e07b101/attachment.eml>

From s.wendy.cheng at gmail.com  Tue Sep 23 02:45:14 2008
From: s.wendy.cheng at gmail.com (Wendy Cheng)
Date: Mon, 22 Sep 2008 21:45:14 -0500
Subject: [Linux-cluster] Locking in GFS
In-Reply-To: <20080919200554.GA11369@joysn.joelly.net>
References: <20080919200554.GA11369@joysn.joelly.net>
Message-ID: <48D8583A.3000703@gmail.com>

Chris Joelly wrote:
> Hello,
>
> i have a question on locking issues on GFS:
>
> how do GFS lock files on the filesystem. I have found one posting to
> this list which states that locking occurs "more or less" on file 
> level. Is this true? or does some kind of locking occur on directory
> level too?
>   

You may view GFS(1) internal lock granularity is on system call level - 
that is, when either a write or read (say pwrite() or pread()) is 
issued, the associated file is locked until the system call returns. 
There are few simple things that will be helpful if you keep them in mind:

1. a write requires an exclusive lock (i.e., when there is a write going 
on, every access to that file has to wait).
2. a read needs a shared lock (i.e.  many reads to the same file will 
not be stalled).
3. a write may involve directory lock (e.g. a "create" would need a 
write lock of the parent directory).
4. local locking (two writes compete the same lock on the same node) is 
always much better than inter-node (different nodes) locking (ping-pong 
the same write lock between different nodes is very expensive).
5. standard APIs (such as fcntl() and flock()) precedes GFS(1) internal 
locking if used correctly (e.g. upon obtaining an exclusive flock, other 
access to that file will be stalled, assuming every instance of the 
executables running on different nodes has the correct flock implemented 
and honored).

One exception to (5) is posix byte-range locks. Say there are two 
processes running on different nodes, each obtaining its own byte range 
locks. Process A locks byte offset 0 to 10K; process B locks byte 10K+1 
to 40K. When both have writes issued, one of them has to wait until 
other's write system call completes before it can continue - a result of 
its posix locking implementation that is done by a separate module 
outside its internal filesystem locking.

> The question arrose because i was thinking on whats would happen when
> using GFS on an file server with home directories for lets say 1000
> users. how do i setup this directory structure to avoid locking issues
> as best as possible.
>
> is a directory layout like the following ok when /home is a GFS file 
> system:
>
> /home/a/b/c/abc
> /home/d/e/f/def
> /home/g/h/i/ghi
> ...
> /home/x/y/z/xyz
>
>   

Hope above statements have helped you understanding that on GFS(1),

1. A short-and-fat directory structure will work (much) better than 
tall-and-skinny ones.
2. If possible, the directory setup should avoid ping-pong directory 
and/or write locks between different nodes

For GFS2, after browsing thru its newest source code few minutes ago - 
it reminds me of a "sea horse" shape with linux page locks on its curly 
"belly" :). It is difficult (for me) to describe so I'll skip.

-- Wendy



From s.wendy.cheng at gmail.com  Tue Sep 23 04:06:10 2008
From: s.wendy.cheng at gmail.com (Wendy Cheng)
Date: Mon, 22 Sep 2008 23:06:10 -0500
Subject: [Linux-cluster] Locking in GFS
In-Reply-To: <48D8583A.3000703@gmail.com>
References: <20080919200554.GA11369@joysn.joelly.net>
	<48D8583A.3000703@gmail.com>
Message-ID: <48D86B32.6050504@gmail.com>

Wendy Cheng wrote:
> Chris Joelly wrote:
>> Hello,
>>
>> i have a question on locking issues on GFS:
>>
>> how do GFS lock files on the filesystem. I have found one posting to
>> this list which states that locking occurs "more or less" on file 
>> level. Is this true? or does some kind of locking occur on directory
>> level too?
>>   
>
> You may view GFS(1) internal lock granularity is on system call level 
> - that is, when either a write or read (say pwrite() or pread()) is 
> issued, the associated file is locked until the system call returns. 
> There are few simple things that will be helpful if you keep them in 
> mind:
>
> 1. a write requires an exclusive lock (i.e., when there is a write 
> going on, every access to that file has to wait).
> 2. a read needs a shared lock (i.e.  many reads to the same file will 
> not be stalled).
> 3. a write may involve directory lock (e.g. a "create" would need a 
> write lock of the parent directory).
> 4. local locking (two writes compete the same lock on the same node) 
> is always much better than inter-node (different nodes) locking 
> (ping-pong the same write lock between different nodes is very 
> expensive).
> 5. standard APIs (such as fcntl() and flock()) precedes GFS(1) 
> internal locking if used correctly (e.g. upon obtaining an exclusive 
> flock, other access to that file will be stalled, assuming every 
> instance of the executables running on different nodes has the correct 
> flock implemented and honored).
>
> One exception to (5) is posix byte-range locks. Say there are two 
> processes running on different nodes, each obtaining its own byte 
> range locks. Process A locks byte offset 0 to 10K; process B locks 
> byte 10K+1 to 40K. When both have writes issued, one of them has to 
> wait until other's write system call completes before it can continue 
> - a result of its posix locking implementation that is done by a 
> separate module outside its internal filesystem locking.
>
>> The question arrose because i was thinking on whats would happen when
>> using GFS on an file server with home directories for lets say 1000
>> users. how do i setup this directory structure to avoid locking issues
>> as best as possible.
>>
>> is a directory layout like the following ok when /home is a GFS file 
>> system:
>>
>> /home/a/b/c/abc
>> /home/d/e/f/def
>> /home/g/h/i/ghi
>> ...
>> /home/x/y/z/xyz
>>
>>   
>
> Hope above statements have helped you understanding that on GFS(1),
>
> 1. A short-and-fat directory structure will work (much) better than 
> tall-and-skinny ones.
> 2. If possible, the directory setup should avoid ping-pong directory 
> and/or write locks between different nodes

Really think about it, (1) is *not* a right description at all. What I 
meant to say is that, if possible, try not to put everyone in the very 
same directory if there are lots of write activities that will cause 
directory lock(s) contention, particularly if it (they) has (have) to 
get passed around different nodes.

Sorry !

-- Wendy




From tsi at ualberta.ca  Tue Sep 23 16:01:07 2008
From: tsi at ualberta.ca (Marc Aurele La France)
Date: Tue, 23 Sep 2008 10:01:07 -0600 (Mountain Daylight Time)
Subject: [Linux-cluster] 32 nodes limit?
In-Reply-To: <48CE186B.1030409@redhat.com>
References: <alpine.WNT.1.10.0809121013350.660@tsi>
	<48CE186B.1030409@redhat.com>
Message-ID: <alpine.WNT.1.10.0809230928440.756@cluij.ucs.ualberta.ca>

On Mon, 15 Sep 2008, Christine Caulfield wrote:
> Marc Aurele La France wrote:
>> I'm trying to move a 5TB filespace from NFS to GFS2.  I have a P4 (the
>> current NFS server) and 33 Opteron nodes, all running a stock 2.6.22
>> kernel, OpenAIS 0.80.3, and a 2.00.00 cluster suite.  For now, I've
>> dummied out fencing and set expected_votes to 1.  I can start/stop cman
>> on all nodes no problem.  With all cman's running, I've formatted,
>> mounted and populated the filesystem using the P4.  Proceeding through
>> the Opterons to mount the filesystem succeeds until the 32nd node, at
>> which point mount.gfs2 hangs (in "D" according to `ps ax`).  Going back,
>> the first 16 systems that have mounted the filesystem can still `ls` the
>> top level directory, but attempts to do so on the remaining systems also
>> get stuck in "D".  Any attempt to unmount the filesystem throws the
>> entire setup in "D".

>> Due to various considerations, moving to more recent versions is not the
>> preferred option at this point.  Hence my question.

> CMAN/openais in RHEL5 seems to be happy up to around 48 nodes (again
> this is not a QE figure, it's something we have tested in development
> only) with appropriate tuning. If you are seeing problems then it might
> be helpful to adjust some of the times use in the openais totem
> protocol. man openais.conf will tell you something about them. Before
> doing this though it's worth checking the output of "group_tool" command
> and syslog to see if there are any openais or other daemon errors that
> might be causing your problems. If necessary post them to this list.

> It's also worth mentioning that 2.00.00 has had a considerable number of
> bugfixes applied since it was released and the current version is
> 2.03.07. I do strongly recommend you upgrade to this version even though
> you say it is not "the preferred option at this point".

> I hope this helps,

It most certainly does.  Thanks for the hint.  It turns out I had 
neglected to copy over my openais configuration from a test cluster. 
Everything seems to work now.

FWIW, upgrading to L&G versions is not the preferred option at this point 
primarily due to the PITFA the kernel invariably creates with its 
incompatible changes to internal APIs.  I have a number of external 
additions to deal with, and not all of them are likely to have been ported 
to the latest kernels.

Anyway, sorry for the noise, but thanks for your time.  Much appreciated.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi at ualberta.ca         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+



From stevan.colaco at gmail.com  Tue Sep 23 17:27:21 2008
From: stevan.colaco at gmail.com (Stevan Colaco)
Date: Tue, 23 Sep 2008 20:27:21 +0300
Subject: [Linux-cluster] Fencing issue using IPMI (nodes fencing each other
	ending in a loop)
Message-ID: <56bb44d0809231027k74a7bcd6nf9cd43b104ecec62@mail.gmail.com>

Hello

issue: Fencing using fence_ipmilan, each node keeps fencing the other
node ending in a fence loop.....

We have implemented RH Cluster on RHEL5.2 64bit.
Server Hardware: SUN X4150
Storage: SUN 6140
Fencing Machnism: fence_ipmilan

 We have downloaded the IPMI fence_ipmilan and configured two node
cluster with ipmi fencing. But..

when we ifdown the NIC interface, the node gets fenced but the service
does not relocate to the other node. at the same time when the
initially fenced node joins back the cluster it fences the other
node......
this keeps on ending in a loop.

We downloaded and followed the intructions from the ipmi site
mentioned below
http://docs.sun.com/source/819-6588-13/ipmi_com.html#0_74891

we tested with following  Cmd line method which works fine.
#fence_ipmilan -a "ip addr" -l root -p <Passkey> -o <on|off|reboot>

here is my cluster.conf

<?xml version="1.0"?>
<cluster alias="tibcouat" config_version="12" name="tibcouat">
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="tibco-node1-uat.kmefic.com.kw" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="tibco-node1"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="tibco-node2-uat.kmefic.com.kw" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="tibco-node2"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="172.16.71.41"
login="root" name="tibco-node1" passwd="changeme"/>
		<fencedevice agent="fence_ipmilan" ipaddr="172.16.71.42"
login="root" name="tibco-node2" passwd="changeme"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="prefer_node1" nofailback="0" ordered="1"
restricted="1">
				<failoverdomainnode name="tibco-node1-uat.kmefic.com.kw" priority="1"/>
				<failoverdomainnode name="tibco-node2-uat.kmefic.com.kw" priority="2"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="172.16.71.55" monitor_link="1"/>
			<clusterfs device="/dev/vg0/gfsdata" force_unmount="0" fsid="63282"
fstype="gfs" mountpoint="/var/www/html" name="gfsdata"
self_fence="0"/>
			<apache config_file="conf/httpd.conf" name="docroot"
server_root="/etc/httpd" shutdown_wait="0"/>
		</resources>
		<service autostart="1" domain="prefer_node1" exclusive="0"
name="webby" recovery="relocate">
			<ip ref="172.16.71.55"/>
		</service>
	</rm>
</cluster>


Kindly investigate and provide us the solution at the earliest.

Thanks & Best Regards,



From lars at oddbit.com  Wed Sep 24 02:19:10 2008
From: lars at oddbit.com (Lars Kellogg-Stedman)
Date: Tue, 23 Sep 2008 22:19:10 -0400
Subject: [Linux-cluster] Startup ordering of cman, fenced, qdiskd, etc.
Message-ID: <c63ea11b0809231919x897f350i5c19f5c2efb825c9@mail.gmail.com>

Hello all,

I've recently started evaluating the RedHat cluster suite.  I've run
into some behavior I don't fully understand, and I'm hoping someone
here can help me out.

First, the essentials: I'm running cman-2.0.84-2 under CentOS 5.2.
The cluster is a two-node cluster with a quorum disk; the full
cluster.conf is available here: <http://tinyurl.com/3zdh2k>.  There
are two nodes, cluster0 (nodeid 1) and cluster1 (nodeid 2). The quorum
disk configuration has a single 'ping' heuristic.

While investigating failure scenarios, I introduced a network
interruption that would (a) isolate the two cluster nodes from each
other and (b) isolate node 2 ("cluster1") from the endpoint used in
qdiskd's "ping" heuristic:

  iptables -A INPUT -s cluster0 -j DROP; iptables -A OUTPUT -d
cluster0 -j DROP; iptables -A OUTPUT -p icmp -j DROP

Note that this introduces a temporary network interruption that will
not persist across a system reboot.  At the time I ran this command,
cluster1 was the quorum master ("Master Node ID: 2").

This initiated the following sequence of events:

(1) cluster1 fenced cluster0.

    --> the cluster now has one node and a quorum disk
    --> cluster is (still) quorate

(2) qdiskd noted the failure of the ping heuristic

    --> cluster becomes inquorate
    --> cman reboots cluster1 ("[CMAN ] quorum lost, blocking activity")

(3) cluster1 boots up and attempts to start cman.  qdiskd is not yet
running.  cluster0 is powered off.

    --> cluster never achieves quorum
    --> cman start never gets beyond "Starting fencing..."
    --> bootup never completes, so no manual intervention is possible

I've swapped the startup order of cman and qdiskd, and now things seem
to work (read, "the system actually boots and starts up cluster
services").  Is there any reason *not* to do this?  Would it make more
sense to start qdiskd out of the cman init script (like fenced,
groupd, et al)?  It seems to make a lot more sense this way, but
presumably people much smarter than I configured the default behavior,
so I'm nervous.



From pushmailiam at gmail.com  Wed Sep 24 11:50:23 2008
From: pushmailiam at gmail.com (push mail)
Date: Wed, 24 Sep 2008 11:50:23 +0000
Subject: [Linux-cluster] rgmanager fails to start!
Message-ID: <707c81d50809240450v1f0d7277nf9015cb58992c2c@mail.gmail.com>

Hi,

I have four nodes in a clustering.

Cman is working correctly in all nodes, but rgmanager works only in one node
(server4) ! trying to start rgmanager on the others nodes fails!

clurgmgrd concludes that locks are not working and exists.

following cluster.conf file and output

[root at server1 ~]# uname -a
Linux server1 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 2008 i686
i686 i386 GNU/Linux

[root at server1 ~]# rpm -q cman
cman-2.0.60-1.el5

[root at server1 ~]# rpm -q rgmanager
rgmanager-2.0.23-1.el5.centos

#less /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="m2p">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="server1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="server2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="server3" nodeid="3" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="server4" nodeid="4" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="0"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="192.168.174.170" monitor_link="1"/>
                        <script file="/etc/init.d/m2pollerdirector"
name="m2pollerdirector"/>
                </resources>
                <service autostart="1" name="serverdirector">
                        <ip ref="192.168.174.170"/>
                        <script ref="m2pollerdirector"/>
                </service>
        </rm>
</cluster>


 [root at server1 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                             1 Online, Local
  server2                             2 Online
  server3                             3 Online
  server4                             4 Online

    [root at server2 ~]# service rgmanager status
clurgmgrd dead but pid file exists

  [root at server2 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                             1 Online
  server2                             2 Online, Local
  server3                             3 Online
  server4                             4 Online

  [root at server2 ~]# service rgmanager status
clurgmgrd dead but pid file exists

[root at server3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                             1 Online
  server2                             2 Online
  server3                             3 Online, Local
  server4                             4 Online

[root at server3 ~]# service rgmanager status
clurgmgrd dead but pid file exists



[root at server4 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                             1 Online
  server2                             2 Online
  server3                             3 Online
  server4                             4 Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  service:m2pollerdire server4                      started


thank you in advance for your help.

--
Lhadj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080924/4e732e9f/attachment.htm>

From Alain.Moulle at bull.net  Wed Sep 24 12:40:43 2008
From: Alain.Moulle at bull.net (Alain.Moulle)
Date: Wed, 24 Sep 2008 14:40:43 +0200
Subject: [Linux-cluster] CS5 / quorum disk configuration
Message-ID: <48DA354B.4020200@bull.net>

Hi

Something strange about qdisk configuration with CS5 :
it seems that it is necessary to reboot the system after the configuration
of quorum disk (via mkqdisk and in cluster.conf) so that it works fine,
otherwise, we encounter the problem of "Node x is undead" at first
failover try.
Whereas after reboot of both nodes, this problem definitely dissapears.

Is this already known ?
Is this normal ?
What are the reasons ?

Thanks
Regards,
Alain Moull?



From grigorygor at gmail.com  Wed Sep 24 13:53:46 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Wed, 24 Sep 2008 15:53:46 +0200
Subject: [Linux-cluster] Fencing issue using IPMI (nodes fencing each
	other ending in a loop)
In-Reply-To: <56bb44d0809231027k74a7bcd6nf9cd43b104ecec62@mail.gmail.com>
References: <56bb44d0809231027k74a7bcd6nf9cd43b104ecec62@mail.gmail.com>
Message-ID: <c3c0440e0809240653l6fca5876uc9511c2475dae8e9@mail.gmail.com>

In 2 node cluster you should use a quorum disk to solve the split brain
problem.
after you create a quorum disk change this line in you cluster.conf
from <cman expected_votes="1" two_node="1"/>
to  <cman expected_votes="3" two_node="0"/>

Grisha


On Tue, Sep 23, 2008 at 7:27 PM, Stevan Colaco <stevan.colaco at gmail.com>wrote:

> Hello
>
> issue: Fencing using fence_ipmilan, each node keeps fencing the other
> node ending in a fence loop.....
>
> We have implemented RH Cluster on RHEL5.2 64bit.
> Server Hardware: SUN X4150
> Storage: SUN 6140
> Fencing Machnism: fence_ipmilan
>
>  We have downloaded the IPMI fence_ipmilan and configured two node
> cluster with ipmi fencing. But..
>
> when we ifdown the NIC interface, the node gets fenced but the service
> does not relocate to the other node. at the same time when the
> initially fenced node joins back the cluster it fences the other
> node......
> this keeps on ending in a loop.
>
> We downloaded and followed the intructions from the ipmi site
> mentioned below
> http://docs.sun.com/source/819-6588-13/ipmi_com.html#0_74891
>
> we tested with following  Cmd line method which works fine.
> #fence_ipmilan -a "ip addr" -l root -p <Passkey> -o <on|off|reboot>
>
> here is my cluster.conf
>
> <?xml version="1.0"?>
> <cluster alias="tibcouat" config_version="12" name="tibcouat">
>        <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="tibco-node1-uat.kmefic.com.kw"
> nodeid="1" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="tibco-node1"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="tibco-node2-uat.kmefic.com.kw"
> nodeid="2" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="tibco-node2"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <cman expected_votes="1" two_node="1"/>
>        <fencedevices>
>                <fencedevice agent="fence_ipmilan" ipaddr="172.16.71.41"
> login="root" name="tibco-node1" passwd="changeme"/>
>                <fencedevice agent="fence_ipmilan" ipaddr="172.16.71.42"
> login="root" name="tibco-node2" passwd="changeme"/>
>        </fencedevices>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="prefer_node1" nofailback="0"
> ordered="1"
> restricted="1">
>                                <failoverdomainnode name="
> tibco-node1-uat.kmefic.com.kw" priority="1"/>
>                                <failoverdomainnode name="
> tibco-node2-uat.kmefic.com.kw" priority="2"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources>
>                        <ip address="172.16.71.55" monitor_link="1"/>
>                        <clusterfs device="/dev/vg0/gfsdata"
> force_unmount="0" fsid="63282"
> fstype="gfs" mountpoint="/var/www/html" name="gfsdata"
> self_fence="0"/>
>                        <apache config_file="conf/httpd.conf" name="docroot"
> server_root="/etc/httpd" shutdown_wait="0"/>
>                </resources>
>                <service autostart="1" domain="prefer_node1" exclusive="0"
> name="webby" recovery="relocate">
>                        <ip ref="172.16.71.55"/>
>                </service>
>        </rm>
> </cluster>
>
>
> Kindly investigate and provide us the solution at the earliest.
>
> Thanks & Best Regards,
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080924/87888b25/attachment.htm>

From jakub.suchy at enlogit.cz  Wed Sep 24 14:34:59 2008
From: jakub.suchy at enlogit.cz (Jakub Suchy)
Date: Wed, 24 Sep 2008 16:34:59 +0200
Subject: [Linux-cluster] Node won't rejoin after reboot
Message-ID: <20080924143459.GC27554@aaron>

Hello,
we are currently trying to determine a problem in our cluster setup. We
are having two problems, both related together:
1) When doing failover, living node reports "waiting for node to be
fenced" and no failover is done...
2) When the failing node rejoins the cluster, it is killed with a
message: "Killing node node2 because it has rejoined the cluster with existing state"

Both seems to be network related, Cisco infrastructure (65xx and 35xx). And both of them
disappear when moving to non-Cisco infrastructure.

Please let me emphasize, that I AM aware of this document:
http://www.openais.org/doku.php?id=faq:cisco_switches
And we have configured the Cisco according to this (and nevertheless, I
believe this is valid only for multi-switch infrastructure, our nodes
are both connected to a single switch).

We are also aware of:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml
But this is not our problem, again, single-switch scenario. We have
tried to turn IGMP snooping off and our engineers reported that it
didn't work.

Also:
http://www.mail-archive.com/linux-cluster at redhat.com/msg03889.html
didn't help.

I have intercepted the traffic on the living node using tcpdump,
including all layer headers and it seems that there is no IGMP Join
message from the second node. I suspect it may be the problem. Do
anybody know any details I can check or a fix for this bug?

Thanks,
Jakub



From lhh at redhat.com  Wed Sep 24 21:00:24 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 24 Sep 2008 17:00:24 -0400
Subject: [Linux-cluster] CS5 / quorum disk configuration
In-Reply-To: <48DA354B.4020200@bull.net>
References: <48DA354B.4020200@bull.net>
Message-ID: <1222290024.18628.3.camel@ayanami>

On Wed, 2008-09-24 at 14:40 +0200, Alain.Moulle wrote:
> Hi
> 
> Something strange about qdisk configuration with CS5 :
> it seems that it is necessary to reboot the system after the configuration
> of quorum disk (via mkqdisk and in cluster.conf) so that it works fine,
> otherwise, we encounter the problem of "Node x is undead" at first
> failover try.
> Whereas after reboot of both nodes, this problem definitely dissapears.
> 
> Is this already known ?
> Is this normal ?
> What are the reasons ?

There's a bug which should have been fixed in 5.2, but was reverted,
then later fixed again.

A checkout from the RHEL5 git branch should work fine.

-- Lon



From lhh at redhat.com  Wed Sep 24 21:01:53 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 24 Sep 2008 17:01:53 -0400
Subject: [Linux-cluster] Node won't rejoin after reboot
In-Reply-To: <20080924143459.GC27554@aaron>
References: <20080924143459.GC27554@aaron>
Message-ID: <1222290113.18628.5.camel@ayanami>

On Wed, 2008-09-24 at 16:34 +0200, Jakub Suchy wrote:
> Hello,
> we are currently trying to determine a problem in our cluster setup. We
> are having two problems, both related together:
> 1) When doing failover, living node reports "waiting for node to be
> fenced" and no failover is done...

Any reason logs about why fencing didn't complete?

-- Lon




From lhh at redhat.com  Wed Sep 24 21:02:42 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 24 Sep 2008 17:02:42 -0400
Subject: [Linux-cluster] rgmanager fails to start!
In-Reply-To: <707c81d50809240450v1f0d7277nf9015cb58992c2c@mail.gmail.com>
References: <707c81d50809240450v1f0d7277nf9015cb58992c2c@mail.gmail.com>
Message-ID: <1222290162.18628.7.camel@ayanami>

On Wed, 2008-09-24 at 11:50 +0000, push mail wrote:

> Cman is working correctly in all nodes, but rgmanager works only in
> one node (server4) ! trying to start rgmanager on the others nodes
> fails!
> 
> 
> clurgmgrd concludes that locks are not working and exists.

configfs not mounted?
dlm_controld not running?

-- Lon

> 



From lhh at redhat.com  Wed Sep 24 21:05:46 2008
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 24 Sep 2008 17:05:46 -0400
Subject: [Linux-cluster] Startup ordering of cman, fenced, qdiskd, etc.
In-Reply-To: <c63ea11b0809231919x897f350i5c19f5c2efb825c9@mail.gmail.com>
References: <c63ea11b0809231919x897f350i5c19f5c2efb825c9@mail.gmail.com>
Message-ID: <1222290346.18628.11.camel@ayanami>

On Tue, 2008-09-23 at 22:19 -0400, Lars Kellogg-Stedman wrote:

> (2) qdiskd noted the failure of the ping heuristic
> 
>     --> cluster becomes inquorate
>     --> cman reboots cluster1 ("[CMAN ] quorum lost, blocking activity")

What are your qdiskd timings set to?


> I've swapped the startup order of cman and qdiskd, and now things seem
> to work (read, "the system actually boots and starts up cluster
> services").  Is there any reason *not* to do this?

No reason to not do this.  This is fixed in the current RHEL5 branch.

> Would it make more
> sense to start qdiskd out of the cman init script (like fenced,
> groupd, et al)?

This is what we do in the current rhel5 branch:

  - if qdiskd is chkconfig'd on start qdiskd before fenced during
startup
  - qdiskd init script becomes a no-op if qdiskd is already running

-- Lon



From caronc at navcanada.ca  Wed Sep 24 21:51:42 2008
From: caronc at navcanada.ca (Caron, Chris)
Date: Wed, 24 Sep 2008 17:51:42 -0400
Subject: [Linux-cluster] qdiskd Startup Questions
Message-ID: <474534909BE4064E853161350C47578E0BFFD0FC@ncrmail1.corp.navcan.ca>

Lon H. posted a question that sort of sparked a few questions from me
about Quorum Disks...

 

1.	Even if there is absolutely no configuration for a quorum disk
(yet), nor is one even created (through mkqdisk)... can the Quorum Disk
Daemon still run (at startup)? You might ask why I would even want to do
this. Simply because our company preloads a working cluster through a
kickstart and a series of scripts... it would just be one less command
to have to call on every single node (at the start) after creating the
partition and updating the cluster.conf tool.  Just having the daemon
run at the start would mean you could configure the Quorum Disk from
just one node. 

	a.	(Assuming you can run the quorum disk... it would be
running on all cluster nodes)
	b.	Create the quorum disk at one of the nodes
	c.	Update the cluster.conf (+version) to support it, and
then ccs_tool it.
	d.	(am I missing a step?)

2.	When you type mkqdisk -L after creating a Quorum disk.  Does the
node that created the quorum disk have any impact on the cluster at all?
I guess I get confused with creating a 10MB iscsi mount on all cluster
nodes, and only having to run the mkqdisk once on one of the nodes...
Does this mean if that node fails, the quorum disk is gone?  Or do all
nodes access the block device that has been made available to them too
(ie, all nodes see /dev/sdc, but mkqdisk -L says it was created on
node04)... Maybe my question should be: how is knowing who created the
Qdisk useful to me?
3.	Since there is all this talk about the Quorum Disk Daemon being
the last thing to start (after fencing).... Would it be feasible to just
make a failoverdomain per node and only consisting of 1 node to
exclusively start the Quorum Disk Daemon as a cluster service (per node)
(this site explains it better:
http://sources.redhat.com/cluster/wiki/MultipleInstanceServices)... That
way leaving or joining a node into the cluster starts and stops it's
daemon as needed.

 

 

Chris 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080924/a1e01dbf/attachment.htm>

From lars at oddbit.com  Thu Sep 25 01:12:40 2008
From: lars at oddbit.com (Lars Kellogg-Stedman)
Date: Wed, 24 Sep 2008 21:12:40 -0400
Subject: [Linux-cluster] qdiskd Startup Questions
In-Reply-To: <474534909BE4064E853161350C47578E0BFFD0FC@ncrmail1.corp.navcan.ca>
References: <474534909BE4064E853161350C47578E0BFFD0FC@ncrmail1.corp.navcan.ca>
Message-ID: <c63ea11b0809241812h59234424gaf3a1f5554ae3242@mail.gmail.com>

Caveat: I'm new to this, so even if I sound authoritative look for
someone to second any opinions expressed here.

> When you type mkqdisk ?L after creating a Quorum disk.  Does the node that
> created the quorum disk have any impact on the cluster at all? I guess I get
> confused with creating a 10MB iscsi mount on all cluster nodes, and only
> having to run the mkqdisk once on one of the nodes? Does this mean if that
> node fails, the quorum disk is gone?

The quorum disk is, by definition, shared by all the nodes.  When you
run mkqdisk you're writing some data to the disk, but this data is
seen by all nodes in the cluster.  Nobody "owns" the quorum disk
(although there is a quorum master).  If the node on which you created
it dies, it doesn't matter -- the other nodes in the cluster will
select a new quorum master, and as long as you still have quorum on
disk it will still contribute votes to the cluster.

> but mkqdisk ?L says it was created on node04)? Maybe my question should be:
> how is knowing who created the Qdisk useful to me?

It's not.

> Since there is all this talk about the Quorum Disk Daemon being the last
> thing to start (after fencing)?. Would it be feasible to just make a
> failoverdomain per node and only consisting of 1 node to exclusively start
> the Quorum Disk Daemon as a cluster service (per node) (this site explains
> it better:

qdiskd needs to start up before the cluster is joined, because in some
failure scenarios you need the quorum disk in order to establish
quorum.  Without it, the cluster will never start (see my recent post
for how this works in a two-node cluster).



From lars at oddbit.com  Thu Sep 25 01:17:01 2008
From: lars at oddbit.com (Lars Kellogg-Stedman)
Date: Wed, 24 Sep 2008 21:17:01 -0400
Subject: [Linux-cluster] Startup ordering of cman, fenced, qdiskd, etc.
In-Reply-To: <1222290346.18628.11.camel@ayanami>
References: <c63ea11b0809231919x897f350i5c19f5c2efb825c9@mail.gmail.com>
	<1222290346.18628.11.camel@ayanami>
Message-ID: <c63ea11b0809241817g68362b84ua7444b64da3dd4e2@mail.gmail.com>

> What are your qdiskd timings set to?

Mostly the defaults, except for tko on the ping heuristic:

  <quorumd status_file="/tmp/quorumd.status" label="quorum0" votes="1">
    <heuristic program="ping -c1 -t1 10.2.0.1" tko="3"/>
  </quorumd>

> No reason to not do this.  This is fixed in the current RHEL5 branch.

Excellent.  Thanks for confirming my instinct that this was a sane(r?)
configuration.

Cheers,

  -- Lars



From fdinitto at redhat.com  Thu Sep 25 05:03:44 2008
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 25 Sep 2008 07:03:44 +0200 (CEST)
Subject: [Linux-cluster] Cluster 2.99.10 (development snapshot) released
Message-ID: <Pine.LNX.4.64.0809250655400.21806@trider-g7>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The cluster team and its community are proud to announce the 2.99.10
release from the master branch.

This release features changes all over the place with tons of bug fixes 
and most notably a new VMWare fence agent that supports both server and 
ESX products.
For SUN T2000 users out there, an experimental fence_alom has been added.

The 2.99.XX releases are _NOT_ meant to be used for production
environments.. yet.

The master branch is the main development tree that receives all new
features, code, clean up and a whole brand new set of bugs,

At some point in time this code will become the 3.0 stable release.

Everybody with test equipment and time to spare, is highly encouraged to
download, install and test the 2.99 releases and more important report
problems.

In order to build the 2.99.10 release you will need:

- - corosync svn r1667.
- - openais svn r1651.
- - linux kernel (2.6.27)

The new source tarball can be downloaded here:

   ftp://sources.redhat.com/pub/cluster/releases/cluster-2.99.10.tar.gz
   https://fedorahosted.org/releases/c/l/cluster/cluster-2.99.10.tar.gz

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 2.99.09):

Abhijith Das (3):
       libgfs2:  Bug 459630 -  GFS2: changes needed to gfs2-utils due to gfs2meta fs changes in bz 457798
       gfs-kernel: bz298931 - GFS unlinked inode metadata leak
       Revert "gfs-kernel: bz298931 - GFS unlinked inode metadata leak"

Bob Peterson (5):
       GFS2: Make gfs2_fsck accept UNLINKED metadata blocks
       Changes needed to stay compatible with libvolume_id.
       Changes needed to stay current with libvolume_id.
       GFS2: sync buffers to disk when rewriting superblock
       GFS2: gfs2_fsck: fix segfault while running special block lists.

Chris Feist (2):
       fence: fixed a fence storm with fence_egenera
       cman: fixed makefiles to actually install the vmware manpage

Christine Caulfield (12):
       cman: Allow a recently left node to join cleanly.
       cman & config: Move special cases out of config modules
       cman: cope better with malformed config files
       config: fix ldap load bug caused by new objdb ordering in corosync
       config: Remove stray fprintf
       cman: Initialise variable
       cman: honour the dirty flag on a node we haven't seen before
       config: Allow multiple top-level keys
       cman: Copy "service" keys down to corosync
       config: Get rid of files I committed accidentally.
       cman: rename 'move' functions to 'copy'
       cman: Clean shutdown_con if the controlling process is killed.

David Teigland (15):
       groupd: fix daemon quit on SIGTERM
       fence_tool: new option to delay before join
       init.d/cman: use fence_tool -m for two node clusters
       fenced: joining daemon cpg to bypass fencing
       fenced: handle merge of cpg partition
       dlm_controld/gfs_controld: handle merge of cpg partition
       libdlm: /dev/misc/dlm-control created by udev
       fence_tool/dlm_tool/gfs_control: improve ls output format
       mount.gfs: fix mount error handling
       gfs_controld: ignore second leave
       gfs_controld: fix and implement remount
       dlm_controld: ignore old plock dev when using new one
       gfs_controld: withdraw and recovery fixes
       gfs_controld: ignore uevents after first_done
       gfs_controld: add protocol negotiation

Fabio M. Di Nitto (12):
       cman: init script best to require $time
       build: fix clean target of contrib section
       config: make more functions static
       ccs: libccsconfdb header cleanup
       misc: init scripts clean up
       libdlm: major cleanup
       rgmanger: fix handling of VIP v6
       ccs: deal with xml file format special case
       fence: update alom description
       fence: install fence_alom man page
       misc: cleanup ifdefs around RELEASE_VERSION
       dlm/fence/gfs: fix daemon spinning 100% due to memory corruption

Jan Friesse (5):
       fence: Fence agent for VMware ESX
       cman: Removed old Perl version of VMware fence agent, so new version is built.
       fence: Fix fence agent for VMware ESX.
       fence: Fix fence agent for VMware ESX.
       Fence: Added fence agent for Sun Advanced Lights Out Manager (ALOM)

Lon Hohberger (14):
       cman: Fix qdiskd file descriptor leak
       cman: show '-d' option in mkqdisk -h and mkqdisk.8
       rgmanager: Make freeze/unfreeze work with central_processing
       rgmanager: Detect restricted failover domain crash
       rgmanager: Permit careful restart w/o disturbing services
       rgmanager: Wait for fence domain join to complete
       rgmanager: Fix up clusvcadm.8 manual page to show -M option
       rgmanager: make status poll interval configurable
       rgmanager: Clean up build
       rgmanager: Implement enforcement of timeouts on a per-resource basis
       rgmanager: Make clustat and clusvcadm work faster
       rgmanager: Resolve hostnames->IPs and back when checking NFS clients
       cman: Fix broken qdisk main.c patch reverted with scandisk merge
       cman: Don't let qdiskd update cman if the disk is unavailable

Marek 'marx' Grac (3):
       [FENCE] Fix #448043 - Update man pages for fence agents
       [FENCE] Fix #237266 - LPAR/HMC fence agent
       [FENCE] Fix #460054 - fence_apc fails with pexpect exception

Satoru SATOH (1):
       fence: Add network interface select option for fence_xvmd

  cman/daemon/ais.c                               |    4 +-
  cman/daemon/cman-preconfig.c                    |   89 ++-
  cman/daemon/commands.c                          |   13 +-
  cman/daemon/nodelist.h                          |    3 +-
  cman/init.d/cman                                |   27 +-
  cman/init.d/qdiskd                              |    1 +
  cman/man/mkqdisk.8                              |    5 +-
  cman/man/qdisk.5                                |   16 +
  cman/qdisk/disk.c                               |    3 +
  cman/qdisk/disk.h                               |    2 +
  cman/qdisk/main.c                               |   83 ++-
  cman/qdisk/mkqdisk.c                            |    2 +-
  config/daemons/ccsd/cnx_mgr.c                   |   20 +-
  config/libs/libccsconfdb/libccs.c               |   11 +-
  config/plugins/ccsais/config.c                  |   10 +-
  config/plugins/ldap/configldap.c                |   10 +-
  config/plugins/xml/config.c                     |   39 +-
  contrib/Makefile                                |    2 +-
  contrib/askant/Makefile                         |    8 +-
  contrib/libaislock/Makefile                     |   13 +
  contrib/libaislock/libaislock.c                 |  466 ++++++++++
  contrib/libaislock/libaislock.h                 |  190 ++++
  dlm/libdlm/Makefile                             |   24 +-
  dlm/libdlm/libaislock.c                         |  468 ----------
  dlm/libdlm/libaislock.h                         |  190 ----
  dlm/libdlm/libdlm.c                             |  180 ++---
  dlm/tool/main.c                                 |  145 ++--
  fence/agents/alom/Makefile                      |    5 +
  fence/agents/alom/fence_alom.py                 |   69 ++
  fence/agents/apc/fence_apc.py                   |   13 +-
  fence/agents/egenera/fence_egenera.pl           |    9 +-
  fence/agents/lib/fencing.py.py                  |   54 +-
  fence/agents/lpar/fence_lpar.py                 |    3 +-
  fence/agents/rps10/rps10.c                      |    5 -
  fence/agents/scsi/scsi_reserve                  |    7 +-
  fence/agents/vmware/fence_vmware.pl             |  322 -------
  fence/agents/vmware/fence_vmware.py             |  111 +++
  fence/agents/xvm/Makefile                       |    2 +-
  fence/agents/xvm/fence_xvm.c                    |    2 -
  fence/agents/xvm/fence_xvmd.c                   |    8 +-
  fence/agents/xvm/mcast.c                        |    9 +-
  fence/agents/xvm/mcast.h                        |    4 +-
  fence/agents/xvm/options.c                      |   13 +
  fence/agents/xvm/options.h                      |    1 +
  fence/fence_tool/fence_tool.c                   |  190 ++++-
  fence/fenced/cpg.c                              |  207 ++++-
  fence/fenced/fd.h                               |    6 +
  fence/fenced/main.c                             |  167 +++-
  fence/fenced/member_cman.c                      |    9 +
  fence/fenced/recover.c                          |   22 +-
  fence/man/Makefile                              |    2 +
  fence/man/fence_alom.8                          |   90 ++
  fence/man/fence_apc.8                           |   16 +-
  fence/man/fence_bladecenter.8                   |   31 +-
  fence/man/fence_ilo.8                           |   21 +-
  fence/man/fence_tool.8                          |    7 +-
  fence/man/fence_vmware.8                        |  137 +++
  fence/man/fence_wti.8                           |   21 +-
  fence/man/fence_xvmd.8                          |    3 +
  gfs/gfs_mkfs/main.c                             |   28 +-
  gfs/init.d/gfs                                  |    1 +
  gfs2/fsck/pass1b.c                              |    4 +-
  gfs2/fsck/pass1c.c                              |    4 +-
  gfs2/fsck/pass5.c                               |   14 +-
  gfs2/init.d/gfs2                                |    1 +
  gfs2/libgfs2/buf.c                              |    1 +
  gfs2/libgfs2/misc.c                             |    2 +-
  gfs2/mkfs/main_mkfs.c                           |   30 +-
  gfs2/mount/util.c                               |   15 +-
  group/daemon/cman.c                             |   10 -
  group/daemon/gd_internal.h                      |    1 -
  group/daemon/joinleave.c                        |    4 -
  group/daemon/main.c                             |    3 +-
  group/dlm_controld/action.c                     |    7 +-
  group/dlm_controld/cpg.c                        |   53 +-
  group/dlm_controld/main.c                       |   21 +-
  group/gfs_control/main.c                        |  150 ++--
  group/gfs_controld/cpg-new.c                    | 1079 ++++++++++++++++++++---
  group/gfs_controld/cpg-old.c                    |  181 ++---
  group/gfs_controld/gfs_daemon.h                 |   23 +-
  group/gfs_controld/main.c                       |  159 +++-
  group/gfs_controld/util.c                       |    6 +-
  make/cobj.mk                                    |    8 +-
  make/libs.mk                                    |    2 +-
  rgmanager/include/members.h                     |    3 +
  rgmanager/include/resgroup.h                    |    9 +-
  rgmanager/include/reslist.h                     |    3 +-
  rgmanager/init.d/rgmanager                      |    1 +
  rgmanager/man/clurgmgrd.8                       |   13 +-
  rgmanager/man/clusvcadm.8                       |   66 ++-
  rgmanager/src/clulib/members.c                  |   29 +
  rgmanager/src/clulib/rg_strings.c               |   23 +-
  rgmanager/src/daemons/clurmtabd.c               |    4 +-
  rgmanager/src/daemons/event_config.c            |    8 +
  rgmanager/src/daemons/fo_domain.c               |   23 +-
  rgmanager/src/daemons/groups.c                  |  123 ++-
  rgmanager/src/daemons/main.c                    |   55 +-
  rgmanager/src/daemons/reslist.c                 |    7 +-
  rgmanager/src/daemons/restree.c                 |  101 ++-
  rgmanager/src/daemons/rg_event.c                |   58 +-
  rgmanager/src/daemons/rg_forward.c              |    6 +-
  rgmanager/src/daemons/rg_locks.c                |    3 +-
  rgmanager/src/daemons/rg_state.c                |   51 +-
  rgmanager/src/daemons/rg_thread.c               |    3 +-
  rgmanager/src/daemons/service_op.c              |   13 +-
  rgmanager/src/daemons/slang_event.c             |   52 +-
  rgmanager/src/daemons/test.c                    |    3 +-
  rgmanager/src/resources/default_event_script.sl |   16 +-
  rgmanager/src/resources/ip.sh                   |    4 +
  rgmanager/src/resources/nfsclient.sh            |   94 ++-
  rgmanager/src/resources/service.sh              |   21 +
  rgmanager/src/utils/clustat.c                   |    4 -
  rgmanager/src/utils/clusvcadm.c                 |    4 -
  113 files changed, 4156 insertions(+), 2041 deletions(-)

- --
I'm going to make him an offer he can't refuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQIVAwUBSNsbtggUGcMLQ3qJAQIFUhAAtD9OmfL5QnAxfnoqg/1YyV2bHGDwlF1x
u2Wmc+EJybVyGJbGg7jKWXllwjjGY1/yhrvStzkBnVTFRNa2n/aqu6wi+LFZ09Ov
hJ+wJGOGyGTrfv1DWFMtkig1+2w5WgAHmFN90ll68Tzc0yVg5U/mhmyUwaoNYQFJ
8z2w4PZ8YVOpBMp3MdTzblBdMvV7sBCAqKqzhN17UzHJs4RDtDbpuPPmMUgXBXzE
gVqWaKCyPVqWd8xvTv1EhR7ill2SwfzSViLbltsGA2JclBffT/jgWICZT7hciDIQ
syHNlDNJNIE+IzJNZ0s2xZz1pq8TN46waQ3jVDcgtQ+ukQFb40KhSR2/1JfLk0Si
FENzrNQlfF4k+RACIPqxexbhLvZx2tfN3sRavQG2xzBuFAXIGuAUKdg15Eo8loqG
kGYsh4E0Im/3QIWalpp9Wn+Cc0l8Ix6jQG4CDnS1gWkprK4UYj0UG8FL0wowy/DP
bXgru1z7bkjeEqmDYTGoDZMawE8sV1noCvFO8BEt0Mz9aS9sU7n12hkURheqwxnY
6kXqyIkrMc20HSBeqD625mN9FTbhVDwKTYSrsrD5KTAGI3k6ZaAa7tO8mjYn5uXE
u9RWeETf+R3+eI5pe/GpBRg84dQc75GclC3EECG+qlT8McbuvTDqG1HHcyu2kz03
254XzYFRTZ4=
=jtig
-----END PGP SIGNATURE-----



From Alain.Moulle at bull.net  Thu Sep 25 07:32:51 2008
From: Alain.Moulle at bull.net (Alain.Moulle)
Date: Thu, 25 Sep 2008 09:32:51 +0200
Subject: [Linux-cluster] CS5 evolutions with RHEL 5.2
Message-ID: <48DB3EA3.2020509@bull.net>

Hi

I just wonder if there are a lot of ?volutions in CS5 delivered with RHEL5.2
with regard to the one delivered with RHEL5.1 ?

Thanks a lot
Regards,
Alain



From peter.tiggerdine at uq.edu.au  Thu Sep 25 08:17:47 2008
From: peter.tiggerdine at uq.edu.au (Peter Tiggerdine)
Date: Thu, 25 Sep 2008 18:17:47 +1000
Subject: [Linux-cluster] CS5 evolutions with RHEL 5.2
In-Reply-To: <48DB3EA3.2020509@bull.net>
References: <48DB3EA3.2020509@bull.net>
Message-ID: <DB208538359CE54C920E89A5038023EA05D4D0@UQEXMB5.soe.uq.edu.au>

Allan,

Host of GFS bugs where fixed from what I have noticed (like growing a partition). 

Regards,

Peter Tiggerdine


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alain.Moulle
Sent: Thursday, 25 September 2008 5:33 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] CS5 evolutions with RHEL 5.2

Hi

I just wonder if there are a lot of ?volutions in CS5 delivered with RHEL5.2 with regard to the one delivered with RHEL5.1 ?

Thanks a lot
Regards,
Alain

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From pushmailiam at gmail.com  Thu Sep 25 10:07:34 2008
From: pushmailiam at gmail.com (push mail)
Date: Thu, 25 Sep 2008 10:07:34 +0000
Subject: [Linux-cluster] rgmanager fails to start!
In-Reply-To: <1222290162.18628.7.camel@ayanami>
References: <707c81d50809240450v1f0d7277nf9015cb58992c2c@mail.gmail.com>
	<1222290162.18628.7.camel@ayanami>
Message-ID: <707c81d50809250307i56b7dde9nea8db9f1b4add217@mail.gmail.com>

Hi,

dlm_controld is running.

there is no /config directory. i guess configfs is not mounted.

Any ideas how can i check if configfs is mounted? mount configfs?

Thank you in advance,

On 24/09/2008, Lon Hohberger <lhh at redhat.com> wrote:
> On Wed, 2008-09-24 at 11:50 +0000, push mail wrote:
>
>  > Cman is working correctly in all nodes, but rgmanager works only in
>  > one node (server4) ! trying to start rgmanager on the others nodes
>  > fails!
>  >
>  >
>  > clurgmgrd concludes that locks are not working and exists.
>
>
> configfs not mounted?
>  dlm_controld not running?
>
>  -- Lon
>
>  >
>
>
>  --
>  Linux-cluster mailing list
>  Linux-cluster at redhat.com
>  https://www.redhat.com/mailman/listinfo/linux-cluster
>



From linux-cluster at merctech.com  Thu Sep 25 15:26:27 2008
From: linux-cluster at merctech.com (linux-cluster at merctech.com)
Date: Thu, 25 Sep 2008 11:26:27 -0400
Subject: [Linux-cluster] clustering NIS, incorrect return value of ypserv,
	ypxfrd init scripts
Message-ID: <26148.1222356387@mirchi>


I'm in the process of setting up a 2-node CentOS 5.2 cluster. Among other 
things, this cluster will act as an HA NIS slave.

The setup of the NIS service was fairly painless, except that some NIS init
scripts (ypbind, ypxfrd) incorrectly set a return value of "0" in response to
the "status" option, regardless of the actual daemon status.

This appears to be a bug in the init scripts--they don't use the return value 
from the /etc/init.d/rc.d/functions "status" check as their exit value.

Here's an example from /etc/init.d/ypserv:

---------------------------------------------------------
# See how we were called.
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        status ypserv
        ;;
  restart|reload)
        stop
        start
        ;;
  condrestart)
        if [ -f /var/lock/subsys/ypserv ]; then
            stop
            start
        fi
        ;;
  *)
        echo $"Usage: $0 {start|stop|status|restart|reload|condrestart}"
        exit 1
esac

exit $RETVAL
---------------------------------------------------------


Note that /etc/init.d/ypserv exits with the value of $RETVAL, which the "status"
block leaves unset (hence, 0). This means that if the ypserv daemon is dead,
then clurgmgrd will never detect the error and will not restart or relocate the
service.


The simple fix (in my case) was to add:
	RETVAL=$?
after the "status" check.

Perhaps this should be submitted as a bug report to Redhat.



-----
Mark Bergman    Biker, Rock Climber, Unix mechanic, IATSE #1 Stagehand

http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40merctech.com




From jerlyon at gmail.com  Thu Sep 25 17:19:29 2008
From: jerlyon at gmail.com (Jeremy Lyon)
Date: Thu, 25 Sep 2008 11:19:29 -0600
Subject: [Linux-cluster] pvmove with a clustered VG
Message-ID: <779919740809251019t82ef425w6863c4b05044519a@mail.gmail.com>

I wasn't sure which list to, so I chose both cluster and lvm.

My current configuration:
2 Node RHEL 5.2 cluster with multiple GFS on top of logical volumes in one
volume group.

# rpm -q cman lvm2 lvm2-cluster kmod-gfs
cman-2.0.84-2.el5
lvm2-2.02.32-4.el5
lvm2-cluster-2.02.32-4.el5
kmod-gfs-0.1.23-5.el5

I need to move a PV out of this volume group so I attempted to run pvmove
/dev/sdk1 but errors about locking on the other node.  I assumed this was
because of the multiple GFS file systems being used on both nodes (services
were spread across the nodes).  So I relocated all services to one node and
even stopped rgmanager, gfs, clvmd and cman on the idle node to make sure
that no locks would remain open.

I still had issues with running the pvmove.  I saw these messages:

Sep 24 17:56:48 nodea kernel: device-mapper: mirror log: Module for logging
t
ype "clustered-core" not found.
Sep 24 17:56:48 nodea kernel: device-mapper: table: 253:31: mirror: Error
cre
ating mirror dirty log
Sep 24 17:56:48 nodea kernel: device-mapper: ioctl: error adding target to
ta
ble

And after about 14% of the move was completed, there was another locking
message and many processes went into an uninteruptable sleep state.  Load on
the server shot up to around 80.

I finally had to reboot the node and run pvmove --abort to get everything
back to working condition.

Is it not possible to run pvmove on a clustered VG?  Any help would be
appreciated.

-Jeremy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080925/477213ef/attachment.htm>

From alan.zg at gmail.com  Thu Sep 25 19:04:29 2008
From: alan.zg at gmail.com (Alan A)
Date: Thu, 25 Sep 2008 14:04:29 -0500
Subject: [Linux-cluster] GFS volume hangs on 3 nodes after gfs_grow
Message-ID: <fac531740809251204r29ca357eo8503bb011a1219f@mail.gmail.com>

Hi all!

I have 3 node test cluster utilizing SCSI fencing and GFS. I have made 2 GFS
Logical Volumes - lvm1 and lvm2, both utilizing 5GB on 10GB disks. Testing
the command line tools I did lvextend -L +1G /devicename to bring lvm2 to
6GB. This went fine without any problems. Then I issued command gfs_grow
/mountpoint and the volume became inaccessible. Any command trying to access
the volume hangs, and umount returns: /sbin/umount.gfs: /lvm2: device is
busy.

Few questions - Since I have two volumes on this cluster and the lvm1 works
just fine, would there be any suggestions to unmounting lvm2 in order to try
and fix it?
Is gfs_grow - bug free or not (use/do not use)?
Is there any other way besides restarting the cluster/ nodes to get lvm2
back in operational state?
-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080925/2e3c02e2/attachment.htm>

From mbroz at redhat.com  Thu Sep 25 19:08:15 2008
From: mbroz at redhat.com (Milan Broz)
Date: Thu, 25 Sep 2008 21:08:15 +0200
Subject: [Linux-cluster] pvmove with a clustered VG
In-Reply-To: <779919740809251019t82ef425w6863c4b05044519a@mail.gmail.com>
References: <779919740809251019t82ef425w6863c4b05044519a@mail.gmail.com>
Message-ID: <48DBE19F.4050109@redhat.com>

Jeremy Lyon wrote:
> Is it not possible to run pvmove on a clustered VG?  Any help would be
> appreciated.

Hi,
basic pvmove operation on clustered VG is supported since RHEL4.7
and should be available for RHEL5.3 too.

But not yet for mirrored LV, only for linear mapped LVs.

If you have clustered mirror module available, it can run online,
otherwise you have to deactivate VG (and pvmove run exclusively on one node).

Milan
--
mbroz at redhat.com



From rpeterso at redhat.com  Thu Sep 25 19:46:15 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 25 Sep 2008 15:46:15 -0400 (EDT)
Subject: [Linux-cluster] GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809251204r29ca357eo8503bb011a1219f@mail.gmail.com>
Message-ID: <687896478.3947131222371975717.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| Hi all!
| 
| I have 3 node test cluster utilizing SCSI fencing and GFS. I have made
| 2 GFS
| Logical Volumes - lvm1 and lvm2, both utilizing 5GB on 10GB disks.
| Testing
| the command line tools I did lvextend -L +1G /devicename to bring lvm2
| to
| 6GB. This went fine without any problems. Then I issued command
| gfs_grow
| /mountpoint and the volume became inaccessible. Any command trying to
| access
| the volume hangs, and umount returns: /sbin/umount.gfs: /lvm2: device
| is
| busy.
| 
| Few questions - Since I have two volumes on this cluster and the lvm1
| works
| just fine, would there be any suggestions to unmounting lvm2 in order
| to try
| and fix it?
| Is gfs_grow - bug free or not (use/do not use)?
| Is there any other way besides restarting the cluster/ nodes to get
| lvm2
| back in operational state?
| -- 
| Alan A.

Hi Alan,

Did you check in dmesg for kernel messages relating to the hang?

I have seen some bugs in gfs_grow, and there are some fixes that
haven't made it out to all users yet, but you did not tell us which
version of the software you're using.  You didn't even say whether
this is RHEL4/CentOS4 or RHEL5/Centos5 or another distro. 

I'm not aware of any bugs in the most recent gfs_grow that appears
in the cluster git repository.  These gfs_grow fixes will trickle
out to various releases if you're not compiling from the source code,
so you may or may not have the fixed code.

If your software is not recent, it's likely that an interrupted or
hung gfs_grow will end up corrupting the GFS file system.  There is
a new, improved version of gfs_fsck that can repair the damage, but
again, you need a recent version of the software.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Thu Sep 25 20:11:54 2008
From: alan.zg at gmail.com (Alan A)
Date: Thu, 25 Sep 2008 15:11:54 -0500
Subject: [Linux-cluster] GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <687896478.3947131222371975717.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <fac531740809251204r29ca357eo8503bb011a1219f@mail.gmail.com>
	<687896478.3947131222371975717.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809251311u198019d4l1c7023833ff0597@mail.gmail.com>

Hello Bob, and thanks for the reply.

I am using RHEL5, 3 node cluster - node2, node3 and node4. Node2 is also a
lucy box.
Now that I look at the versions of the GFS, I do see some differences. We
are not running Xen kernel.

The command gfs_grow was launched from the Cluster node4 and here is what I
have found issuing rpm -qa | grep gfs:
2.6.18-92.1.10.el5 #1 SMP Wed Jul 23 03:55:54 EDT 2008 i686 i686 i386
GNU/Linux

gfs2-utils-0.1.44-1.el5
gfs-utils-0.1.17-1.el5
kmod-gfs-0.1.23-5.el5
kmod-gfs2-1.92-1.1.el5
kmod-gfs2-xen-1.92-1.1.el5
kmod-gfs-xen-0.1.23-5.e

Node 3:
2.6.18-92.1.10.el5 #1 SMP Wed Jul 23 03:55:54 EDT 2008 i686 i686 i386
GNU/Linux

gfs2-utils-0.1.44-1.el5_2.1
gfs-utils-0.1.17-1.el5
kmod-gfs-0.1.23-5.el5
*kmod-gfs-0.1.23-5.el5_2.2*
kmod-gfs2-1.92-1.1.el5
*kmod-gfs2-1.92-1.1.el5_2.2*
kmod-gfs2-xen-1.92-1.1.el5
kmod-gfs2-xen-1.92-1.1.el5_2.2
kmod-gfs-xen-0.1.23-5.el5
kmod-gfs-xen-0.1.23-5.el5_2.2

Node2 - lucy node:
2.6.18-92.1.10.el5 #1 SMP Wed Jul 23 03:55:54 EDT 2008 i686 i686 i386
GNU/Linux

gfs2-utils-0.1.44-1.el5_2.1
kmod-gfs-0.1.23-5.el5_2.2
gfs-utils-0.1.17-1.el5
kmod-gfs-0.1.23-5.el5

Let me know if you need any additional information. What would be suggested
path to recovery. I tried gfs_fsck but I get:
Initializing fsck
Unable to open device: /lvm_test2


On Thu, Sep 25, 2008 at 2:46 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | Hi all!
> |
> | I have 3 node test cluster utilizing SCSI fencing and GFS. I have made
> | 2 GFS
> | Logical Volumes - lvm1 and lvm2, both utilizing 5GB on 10GB disks.
> | Testing
> | the command line tools I did lvextend -L +1G /devicename to bring lvm2
> | to
> | 6GB. This went fine without any problems. Then I issued command
> | gfs_grow
> | /mountpoint and the volume became inaccessible. Any command trying to
> | access
> | the volume hangs, and umount returns: /sbin/umount.gfs: /lvm2: device
> | is
> | busy.
> |
> | Few questions - Since I have two volumes on this cluster and the lvm1
> | works
> | just fine, would there be any suggestions to unmounting lvm2 in order
> | to try
> | and fix it?
> | Is gfs_grow - bug free or not (use/do not use)?
> | Is there any other way besides restarting the cluster/ nodes to get
> | lvm2
> | back in operational state?
> | --
> | Alan A.
>
> Hi Alan,
>
> Did you check in dmesg for kernel messages relating to the hang?
>
> I have seen some bugs in gfs_grow, and there are some fixes that
> haven't made it out to all users yet, but you did not tell us which
> version of the software you're using.  You didn't even say whether
> this is RHEL4/CentOS4 or RHEL5/Centos5 or another distro.
>
> I'm not aware of any bugs in the most recent gfs_grow that appears
> in the cluster git repository.  These gfs_grow fixes will trickle
> out to various releases if you're not compiling from the source code,
> so you may or may not have the fixed code.
>
> If your software is not recent, it's likely that an interrupted or
> hung gfs_grow will end up corrupting the GFS file system.  There is
> a new, improved version of gfs_fsck that can repair the damage, but
> again, you need a recent version of the software.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080925/2791f338/attachment.htm>

From rpeterso at redhat.com  Thu Sep 25 22:49:04 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Thu, 25 Sep 2008 18:49:04 -0400 (EDT)
Subject: [Linux-cluster] GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <1276708008.3994041222382932546.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <888837761.3994061222382944795.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| Hello Bob, and thanks for the reply.
| 
| gfs-utils-0.1.17-1.el5
(snip)
| Let me know if you need any additional information. What would be
| suggested
| path to recovery. I tried gfs_fsck but I get:
| Initializing fsck
| Unable to open device: /lvm_test2

Hi Alan,

The good news is that gfs-utils-0.1.17-1.el5 has the latest fixes for
the gfs_grow program (as of the time of this post), so in theory
it should not have caused any corruption.

The fix to gfs_fsck for repairing that kind of corruption does not appear
until gfs-utils-0.1.18-1.el5, but in theory, the gfs_grow should not
corrupt the file system, so you shouldn't need the repair code.

Are you sure you're specifying the right device?  I would expect
something more like /dev/test2_vg/lvm_test2 rather than just /lvm_test2.
So that doesn't look like you specified a valid device.

If you did specify the correct device, and I'm just not understanding,
it looks to me like gfs_fsck can't open the SCSI device.  That likely
means the device is locked up, perhaps by SCSI fencing or something,
which is why I suggest looking in "dmesg" for kernel problems.

If you got a message in dmesg that looks something like this:

lock_dlm: yourfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80

then you might be a victim of this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=438268

(I apologise in advance if you can't view this record; that's out
of my control).  There is a fix for that bug in the kernel's dlm code
that's scheduled to go in to 5.3, but it has only been released in
the source code so far.

I don't know much about SCSI fencing or SCSI reservations, but
maybe booting the whole cluster will free things up.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Fri Sep 26 14:24:36 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 09:24:36 -0500
Subject: [Linux-cluster] GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <888837761.3994061222382944795.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <1276708008.3994041222382932546.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
	<888837761.3994061222382944795.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809260724h673f5bd0h9f2b3457bc6b9cc5@mail.gmail.com>

Hi Bob. Thanks for the reply once again.

Yes lvm_test2 is the mount point - but in my explanation of the problem I
used /lvm2 for clarification. I haven't rebooted the node yet, and now that
I try, it hangs. I will have to bring crash cart to the data center and
check it out sometimes today. That is why I haven't been able to check dmesg
for potential DLM problems.

I am rebooting the cluster (not desired outcome but have to bring it to a
stable state), and will post the dmesg afterwards.

On Thu, Sep 25, 2008 at 5:49 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | Hello Bob, and thanks for the reply.
> |
> | gfs-utils-0.1.17-1.el5
> (snip)
> | Let me know if you need any additional information. What would be
> | suggested
> | path to recovery. I tried gfs_fsck but I get:
> | Initializing fsck
> | Unable to open device: /lvm_test2
>
> Hi Alan,
>
> The good news is that gfs-utils-0.1.17-1.el5 has the latest fixes for
> the gfs_grow program (as of the time of this post), so in theory
> it should not have caused any corruption.
>
> The fix to gfs_fsck for repairing that kind of corruption does not appear
> until gfs-utils-0.1.18-1.el5, but in theory, the gfs_grow should not
> corrupt the file system, so you shouldn't need the repair code.
>
> Are you sure you're specifying the right device?  I would expect
> something more like /dev/test2_vg/lvm_test2 rather than just /lvm_test2.
> So that doesn't look like you specified a valid device.
>
> If you did specify the correct device, and I'm just not understanding,
> it looks to me like gfs_fsck can't open the SCSI device.  That likely
> means the device is locked up, perhaps by SCSI fencing or something,
> which is why I suggest looking in "dmesg" for kernel problems.
>
> If you got a message in dmesg that looks something like this:
>
> lock_dlm: yourfs: gdlm_lock 2,17 err=-16 cur=3 req=5 lkf=3044 flags=80
>
> then you might be a victim of this bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=438268
>
> (I apologise in advance if you can't view this record; that's out
> of my control).  There is a fix for that bug in the kernel's dlm code
> that's scheduled to go in to 5.3, but it has only been released in
> the source code so far.
>
> I don't know much about SCSI fencing or SCSI reservations, but
> maybe booting the whole cluster will free things up.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/350b84ea/attachment.htm>

From alan.zg at gmail.com  Fri Sep 26 14:43:39 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 09:43:39 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809251204r29ca357eo8503bb011a1219f@mail.gmail.com>
References: <fac531740809251204r29ca357eo8503bb011a1219f@mail.gmail.com>
Message-ID: <fac531740809260743l25955417we33dad69aaa00f5b@mail.gmail.com>

This is worse than I tought. The entire cluster is hanging upon restart
command issued from the Conga - lucy box. I tried bringing gfs service down
on node2 (lucy) with the: service gfs stop (we are not running rgmanager),
and I got:
FATAL: Module gfs is in use.

On node3:
service gfs status:
Configured GFS mountpoints:
/lvm_test1
/lvm_test2
Active GFS mountpoints:
/lvm_test1
/lvm_test2

service gfs stop:
Unmounting GFS filesystems:  (hangs)

node2 - .175
node3 - .78
node4 - .79
All nodes are configured on the same segment.

These are the messages from the node3 once from the point I tried to restart
the cluster:
Sep 26 09:00:38 dev03 openais[8692]: [TOTEM] entering GATHER state from 12.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] entering GATHER state from 0.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] Creating commit token because I
am the rep.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] Saving state aru 1e1 high seq
received 1e1
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] Storing new sequence id for
ring 454
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] entering COMMIT state.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] entering RECOVERY state.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] position [0] member
xxx.xxx.xxx.78:
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] previous ring seq 1104 rep
xxx.xxx.xxx.78
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] aru 1e1 high delivered 1e1
received flag 1
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] position [1] member
xxx.xxx.xxx.175:
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] previous ring seq 1104 rep
xxx.xxx.xxx.78
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] aru 1e1 high delivered 1e1
received flag 1
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] Did not need to originate any
messages in recovery.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] Sending initial ORF token
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 26 09:00:43 dev03 kernel: dlm: closing connection to node 2
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] New Configuration:
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.78)
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.175)

Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] Members Left:
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.79)
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] Members Joined:
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] New Configuration:
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.78)
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.175)

Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] Members Left:
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] Members Joined:
Sep 26 09:00:43 dev03 openais[8692]: [SYNC ] This node is within the primary
component and will provide service.
Sep 26 09:00:43 dev03 openais[8692]: [TOTEM] entering OPERATIONAL state.
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] got nodejoin message
xxx.xxx.xxx.78
Sep 26 09:00:43 dev03 openais[8692]: [CLM  ] got nodejoin message
xxx.xxx.xxx.175
Sep 26 09:00:43 dev03 openais[8692]: [CPG  ] got joinlist message from node
3
Sep 26 09:00:43 dev03 fenced[8710]: fencing deferred to
fenmrdev02.maritz.com
Sep 26 09:00:43 dev03 openais[8692]: [CPG  ] got joinlist message from node
1
Sep 26 09:00:45 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=1:
Trying to acquire journal lock...
Sep 26 09:00:45 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=1: Busy
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] entering GATHER state from 11.
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] Creating commit token because I
am the rep.
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] Saving state aru 31 high seq
received 31
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] Storing new sequence id for
ring 458
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] entering COMMIT state.
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] entering RECOVERY state.
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] position [0] member
xxx.xxx.xxx.78:
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] previous ring seq 1108 rep
xxx.xxx.xxx.78
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] aru 31 high delivered 31
received flag 1
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] position [1] member
xxx.xxx.xxx.79:
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] previous ring seq 1108 rep
xxx.xxx.xxx.79
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] aru 9 high delivered 9 received
flag 1
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] position [2] member
xxx.xxx.xxx.175:
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] previous ring seq 1108 rep
xxx.xxx.xxx.78
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] aru 31 high delivered 31
received flag 1
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] Did not need to originate any
messages in recovery.
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] Sending initial ORF token
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] New Configuration:
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.78)
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.175)

Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] Members Left:
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] Members Joined:
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] New Configuration:
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.78)
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.79)
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.175)

Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] Members Left:
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] Members Joined:
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ]       r(0) ip(xxx.xxx.xxx.79)
Sep 26 09:02:37 dev03 openais[8692]: [SYNC ] This node is within the primary
component and will provide service.
Sep 26 09:02:37 dev03 openais[8692]: [TOTEM] entering OPERATIONAL state.
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] got nodejoin message
xxx.xxx.xxx.78
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] got nodejoin message
xxx.xxx.xxx.79
Sep 26 09:02:37 dev03 openais[8692]: [CLM  ] got nodejoin message
xxx.xxx.xxx.175
Sep 26 09:02:37 dev03 openais[8692]: [CPG  ] got joinlist message from node
3
Sep 26 09:02:37 dev03 openais[8692]: [CPG  ] got joinlist message from node
1
Sep 26 09:02:43 dev03 kernel: dlm: connecting to 2


---------- Forwarded message ----------
From: Alan A <alan.zg at gmail.com>
Date: Thu, Sep 25, 2008 at 2:04 PM
Subject: GFS volume hangs on 3 nodes after gfs_grow
To: linux clustering <linux-cluster at redhat.com>


Hi all!

I have 3 node test cluster utilizing SCSI fencing and GFS. I have made 2 GFS
Logical Volumes - lvm1 and lvm2, both utilizing 5GB on 10GB disks. Testing
the command line tools I did lvextend -L +1G /devicename to bring lvm2 to
6GB. This went fine without any problems. Then I issued command gfs_grow
/mountpoint and the volume became inaccessible. Any command trying to access
the volume hangs, and umount returns: /sbin/umount.gfs: /lvm2: device is
busy.

Few questions - Since I have two volumes on this cluster and the lvm1 works
just fine, would there be any suggestions to unmounting lvm2 in order to try
and fix it?
Is gfs_grow - bug free or not (use/do not use)?
Is there any other way besides restarting the cluster/ nodes to get lvm2
back in operational state?
-- 
Alan A.



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/f48cbb06/attachment.htm>

From linux-cluster at merctech.com  Fri Sep 26 14:45:22 2008
From: linux-cluster at merctech.com (linux-cluster at merctech.com)
Date: Fri, 26 Sep 2008 10:45:22 -0400
Subject: [Linux-cluster] support for hierarchical services?
Message-ID: <23248.1222440322@mirchi>


Does RHCS5 (rgmanager-2.0.38-2.el5_2.1, cman-2.0.84-2.el5,
openais-0.80.3-15.el5) support hierarchical services, not just
resources? This doesn't seem possible via the "system-config-cluster"
tool or Luci.

For example, I've got a clustered web server. It provides several
applications, each of which has it's own resources that Apache doesn't
need to know about directly. Each application relies on Apache as a
presentation layer. I don't want to make the individual applications'
dependencies separate resources within a single "Apache" service,
because that makes management difficult.

For example, is it possible to create a structure like:

	Service:	Apache
		Private Resource:	IP address
		Private Resource:	GFS Vol1

		Shared Service:	Wiki
		Shared Service:	SVN
		Shared Service:	Calendar


	Service:	Wiki
		Private Resource:	GFS Vol2
		Private Resource:	MySQL "wiki" instance
	
	Service:	SVN
		Private Resource:	MySQL "svn" instance
		Private Resource:	GFS Vol3
	
	Service:	Calendar
			(no resources beyond apache)
		

Note that resources like "GFS Vol2" and "MySQL wiki instance" are assigned
to the "Wiki" service, not directly to the "Apache" service. The Apache
service sees "Wiki" as a resource.

With this kind of structure, I could administratively disable the "Wiki" service
without causing the other web applications to restart or relocate. However, if a
resource that the Wiki requires fails, the the standard failover policy would
apply...the Wiki service would restart and if that is unsuccessful, then the
Apache service (and it's dependencies--Wiki, SVN, Calendar, IP address, GFS
Vol1) would relocate.

In other words, it doesn't make sense to run the Wiki service if Apache isn't
running, but the Apache service itself doesn't depend on the MySQL "wiki"
instance resource. If the Wiki service cannot be restarted (or is
administratively disabled), the other web applications should not be affected.

-----
Mark Bergman
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40merctech.com



From rpeterso at redhat.com  Fri Sep 26 15:33:59 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 26 Sep 2008 11:33:59 -0400 (EDT)
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809260743l25955417we33dad69aaa00f5b@mail.gmail.com>
Message-ID: <2015936812.39581222443239412.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| This is worse than I tought. The entire cluster is hanging upon
| restart
| command issued from the Conga - lucy box. I tried bringing gfs service
| down
| on node2 (lucy) with the: service gfs stop (we are not running
| rgmanager),
| and I got:
| FATAL: Module gfs is in use.

Hi Alan,

It sounds like conga can't reboot the cluster because the GFS file
system is still mounted, or is in use.  I don't know much about conga,
so forgive my ignorance there.  You may need to unmount the gfs file
system before you reboot.  The dmesg you sent looked perfectly normal
to me.  Those are normal openais messages.  I'm more interested to
see if there were any "File system withdrawn" messages, general protection
faults, or kernel panic messages or other serious kernel errors on any
of the nodes in the cluster just around the time of the first failure.

This is just a wild guess, but I'm guessing that there was some kind
of error, like a kernel panic that occurred a while back.  That caused
the node to be fenced.  Perhaps the SCSI fencing locked up the device
somehow so none of the nodes can use it.  If that's the case, you
should be able to log in to each of the nodes, unmount the gfs file
systems that are mounted, manually, and then reboot them.
If it doesn't let you unmount them, it might be because some process
is still using the GFS file system.  For example, if you're using
NFS to export the GFS file system, you probably need to do
service nfs stop before it will let you unmount the gfs, then reboot.

So I would comb through the /var/log/messages of each node looking
for an error message regarding the node being fenced, withdrawn, panic,
SCSI errors, or any kind of serious errors that occurred around the
time where you first had the problem.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Fri Sep 26 15:53:09 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 10:53:09 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <2015936812.39581222443239412.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <fac531740809260743l25955417we33dad69aaa00f5b@mail.gmail.com>
	<2015936812.39581222443239412.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809260853o45deee4eg53e8e43130ef33ee@mail.gmail.com>

Thanks again, Bob.

No kernel-panic on any of the nodes. I had to cold boot all 3 nodes in order
to get the cluster going (might have been a fence issue but am not 100%
sure, since we use only SCSI fencing until we agree on secondary fencing
method). What is 'scary' is that gfs_grow command paralized that volume on
all 3 nodes, and I coldn't access, nor unmount, nor run gfs_fsck, from any
of the nodes. We will do more testing on this, btw do you have suggested
"safe" method of growing and shrinking the volume other than what is noted
in 5.2 documentation (since we followed the RHEL manual). If the GFS volume
hangs - what is the best way to try and unmount it from the node,  would
'gfs_freeze' helped)?

I checked all nodes for service gfs status, service clvmd status and service
cman status. On node4 service clvmd status hangs on displaying me active
volumes:

>From node3 - service clvmd status:
clvmd (pid 8892) is running...
active volumes: LogVol00 LogVol01 gfs_sda1 gfs_sdb1

Here is the node4 response to service clvmd status:
clvmd (pid 4829) is running... (and it hangs)

On node3 - I didn't get any messages in the dmesg - but I got this in
/var/log/messages:

Sep 26 10:40:17 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Unmount
seems to be stalled. Dumping lock state...

Sep 26 10:40:17 dev03 kernel: Glock (5, 22)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (1, 2)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1 2

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 24)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 5

Sep 26 10:40:17 dev03 kernel:   gl_state = 1

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = yes

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Inode: busy

Sep 26 10:40:17 dev03 kernel: Glock (2, 22)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1 2

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (5, 24)

Sep 26 10:40:17 dev03 kernel:   gl_flags =

Sep 26 10:40:17 dev03 kernel:   gl_count = 2

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = no

Sep 26 10:40:17 dev03 kernel:   req_bh = no

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = yes

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Holder

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 3

Sep 26 10:40:17 dev03 kernel:     gh_flags = 5 7

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 1 6 7

Sep 26 10:40:17 dev03 kernel: Glock (5, 21)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (5, 23)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (1, 1)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 21)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1 2

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 25)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 23)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (5, 25)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5


----------------------------------------------------------------------------------------------------------------

Here are /dmesg lines - starting with the node4, gfs_sdb1 is the "bad
volume":

Sep 26 09:11:46 dev04 kernel: Joined cluster. Now mounting FS...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0:
Trying to acquire journal lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0:
Busy

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=1:
Trying to acquire journal lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=1:
Looking at journal...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=1:
Done

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Trying to acquire journal lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Looking at journal...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Acquiring the transaction lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Replaying journal...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Replayed 0 of 9 blocks

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
replays = 0, skips = 2, sames = 7

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Journal replayed in 1s

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2:
Done

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Scanning
for log elements...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Found 0
unlinked inodes

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Found
quota changes for 0 IDs

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Done

Sep 26 09:11:46 dev04 kernel: ACPI: PCI Interrupt 0000:01:04.2[B] -> GSI 22
(level, low) -> IRQ 233

Sep 26 09:11:47 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0:
Trying to acquire journal lock...

Sep 26 09:11:47 dev04 gfs_controld[6351]: gfs_sdb1 finish: needs recovery
jid 0 nodeid 1 status 1

Sep 26 09:11:47 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0:
Busy


-------------------









On Fri, Sep 26, 2008 at 10:33 AM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | This is worse than I tought. The entire cluster is hanging upon
> | restart
> | command issued from the Conga - lucy box. I tried bringing gfs service
> | down
> | on node2 (lucy) with the: service gfs stop (we are not running
> | rgmanager),
> | and I got:
> | FATAL: Module gfs is in use.
>
> Hi Alan,
>
> It sounds like conga can't reboot the cluster because the GFS file
> system is still mounted, or is in use.  I don't know much about conga,
> so forgive my ignorance there.  You may need to unmount the gfs file
> system before you reboot.  The dmesg you sent looked perfectly normal
> to me.  Those are normal openais messages.  I'm more interested to
> see if there were any "File system withdrawn" messages, general protection
> faults, or kernel panic messages or other serious kernel errors on any
> of the nodes in the cluster just around the time of the first failure.
>
> This is just a wild guess, but I'm guessing that there was some kind
> of error, like a kernel panic that occurred a while back.  That caused
> the node to be fenced.  Perhaps the SCSI fencing locked up the device
> somehow so none of the nodes can use it.  If that's the case, you
> should be able to log in to each of the nodes, unmount the gfs file
> systems that are mounted, manually, and then reboot them.
> If it doesn't let you unmount them, it might be because some process
> is still using the GFS file system.  For example, if you're using
> NFS to export the GFS file system, you probably need to do
> service nfs stop before it will let you unmount the gfs, then reboot.
>
> So I would comb through the /var/log/messages of each node looking
> for an error message regarding the node being fenced, withdrawn, panic,
> SCSI errors, or any kind of serious errors that occurred around the
> time where you first had the problem.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/a3550730/attachment.htm>

From jparsons at redhat.com  Fri Sep 26 16:02:03 2008
From: jparsons at redhat.com (jim parsons)
Date: Fri, 26 Sep 2008 12:02:03 -0400
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <2015936812.39581222443239412.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <2015936812.39581222443239412.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <1222444923.3260.5.camel@localhost.localdomain>

On Fri, 2008-09-26 at 11:33 -0400, Bob Peterson wrote:
> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | This is worse than I tought. The entire cluster is hanging upon
> | restart
> | command issued from the Conga - lucy box. I tried bringing gfs service
> | down
> | on node2 (lucy) with the: service gfs stop (we are not running
> | rgmanager),
> | and I got:
> | FATAL: Module gfs is in use.
> 
> Hi Alan,
> 
> It sounds like conga can't reboot the cluster because the GFS file
> system is still mounted, or is in use.  I don't know much about conga,
> so forgive my ignorance there.  You may need to unmount the gfs file
> system before you reboot.  The dmesg you sent looked perfectly normal
> to me.  Those are normal openais messages.  I'm more interested to
> see if there were any "File system withdrawn" messages, general protection
> faults, or kernel panic messages or other serious kernel errors on any
> of the nodes in the cluster just around the time of the first failure.
> 
> This is just a wild guess, but I'm guessing that there was some kind
> of error, like a kernel panic that occurred a while back.  That caused
> the node to be fenced.  Perhaps the SCSI fencing locked up the device
> somehow so none of the nodes can use it.  If that's the case, you
> should be able to log in to each of the nodes, unmount the gfs file
> systems that are mounted, manually, and then reboot them.
> If it doesn't let you unmount them, it might be because some process
> is still using the GFS file system.  For example, if you're using
> NFS to export the GFS file system, you probably need to do
> service nfs stop before it will let you unmount the gfs, then reboot.

The luci storage tab should allow you to do this as well.
Perhaps Conga can be coerced into investigating if this is the problem,
and just doing the right thing. Feel free to file a ticket for this
capability. It would be nice to have this in 2.0

-jim
> 




From rpeterso at redhat.com  Fri Sep 26 17:44:39 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 26 Sep 2008 13:44:39 -0400 (EDT)
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809260853o45deee4eg53e8e43130ef33ee@mail.gmail.com>
Message-ID: <1204434903.49871222451079909.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| Thanks again, Bob.
| 
| No kernel-panic on any of the nodes. I had to cold boot all 3 nodes in
| order
| to get the cluster going (might have been a fence issue but am not
| 100%
| sure, since we use only SCSI fencing until we agree on secondary
| fencing
| method). What is 'scary' is that gfs_grow command paralized that
| volume on
| all 3 nodes, and I coldn't access, nor unmount, nor run gfs_fsck, from
| any
| of the nodes. We will do more testing on this, btw do you have
| suggested
| "safe" method of growing and shrinking the volume other than what is
| noted
| in 5.2 documentation (since we followed the RHEL manual). If the GFS
| volume
| hangs - what is the best way to try and unmount it from the node, 
| would
| 'gfs_freeze' helped)?

Hi Alan,

No, gfs_freeze won't help.  In these cases, it's probably best to
reboot the node that caused the problem, by /sbin/reboot -fin or
throwing the power switch I think.  I suspect that clvmd status
hung because of the earlier problem.

I'm not aware of any problems in your version of gfs_grow that can
cause this kind of lockup.  It's designed to be run seamlessly while
other processes are using the file system, and that's the kind of
thing we test regularly.

If you figure out how to recreate the lockup, let me know so I
can try it out.  Of course, if this is a production cluster, I
would not take it out of production a long time to try this.
But if I can recreate the problem here, I'll file a bugzilla
record and get it fixed.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Fri Sep 26 18:34:07 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 13:34:07 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <1204434903.49871222451079909.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <fac531740809260853o45deee4eg53e8e43130ef33ee@mail.gmail.com>
	<1204434903.49871222451079909.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809261134l3eb60caei24e4fc71054d7a61@mail.gmail.com>

Again thanks for the fast and prompt response Bob.

I restored nodes to the healthy state and they can access GFS volumes.
node3:
service gfs status
Configured GFS mountpoints:
/lvm_test1
/lvm_test2
Active GFS mountpoints:
/lvm_test1
/lvm_test2

node4:
service gfs status
Configured GFS mountpoints:
/lvm_test1
/lvm_test2
Active GFS mountpoints:
/lvm_test1
/lvm_test2

node2 - lucy node:
service gfs status
Configured GFS mountpoints:
/lvm_test1
/lvm_test2
Active GFS mountpoints:
/lvm_test1
/lvm_test2


I will try to reproduce the problem with gfs_grow.

One more question regarding GFS - what steps would you recommend (if any)
for growing and shrinking active GFS volume?

On Fri, Sep 26, 2008 at 12:44 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | Thanks again, Bob.
> |
> | No kernel-panic on any of the nodes. I had to cold boot all 3 nodes in
> | order
> | to get the cluster going (might have been a fence issue but am not
> | 100%
> | sure, since we use only SCSI fencing until we agree on secondary
> | fencing
> | method). What is 'scary' is that gfs_grow command paralized that
> | volume on
> | all 3 nodes, and I coldn't access, nor unmount, nor run gfs_fsck, from
> | any
> | of the nodes. We will do more testing on this, btw do you have
> | suggested
> | "safe" method of growing and shrinking the volume other than what is
> | noted
> | in 5.2 documentation (since we followed the RHEL manual). If the GFS
> | volume
> | hangs - what is the best way to try and unmount it from the node,
> | would
> | 'gfs_freeze' helped)?
>
> Hi Alan,
>
> No, gfs_freeze won't help.  In these cases, it's probably best to
> reboot the node that caused the problem, by /sbin/reboot -fin or
> throwing the power switch I think.  I suspect that clvmd status
> hung because of the earlier problem.
>
> I'm not aware of any problems in your version of gfs_grow that can
> cause this kind of lockup.  It's designed to be run seamlessly while
> other processes are using the file system, and that's the kind of
> thing we test regularly.
>
> If you figure out how to recreate the lockup, let me know so I
> can try it out.  Of course, if this is a production cluster, I
> would not take it out of production a long time to try this.
> But if I can recreate the problem here, I'll file a bugzilla
> record and get it fixed.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/75285fe0/attachment.htm>

From andrew at ntsg.umt.edu  Fri Sep 26 18:37:40 2008
From: andrew at ntsg.umt.edu (Andrew A. Neuschwander)
Date: Fri, 26 Sep 2008 12:37:40 -0600
Subject: [Linux-cluster] cman openais totem default for:
	retransmits_before_loss
Message-ID: <48DD2BF4.5030906@ntsg.umt.edu>

Does cman set totem's retransmits_before_loss option? My cluster 
frequently loses the token in 'OPERATIONAL state'. I don't see anything 
in my network stats that indicate a problem. Sometimes when totem 
reforms the cluster, it fences a node that seems to be working fine. 
Sometimes, nodes end up in a disallowed state and quorum is lost. In 
this case, I usually restart the entire cluster.

Would setting totem's token retransmits_before_loss be of any help?

Thanks,
-A
--
Andrew A. Neuschwander, RHCE
Linux Systems/Software Engineer
College of Forestry and Conservation
The University of Montana
http://www.ntsg.umt.edu
andrew at ntsg.umt.edu - 406.243.6310



From rpeterso at redhat.com  Fri Sep 26 18:59:11 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 26 Sep 2008 14:59:11 -0400 (EDT)
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809261134l3eb60caei24e4fc71054d7a61@mail.gmail.com>
Message-ID: <454119621.10161222455551443.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| Again thanks for the fast and prompt response Bob.
| 
| I will try to reproduce the problem with gfs_grow.
| 
| One more question regarding GFS - what steps would you recommend (if
| any)
| for growing and shrinking active GFS volume?

Hi Alan,

Neither GFS or GFS2 volumes cannot be shrunk.  Eventually I
need to start working on a gfs2_shrink tool for gfs2, but I
don't think GFS will ever be able to shrink.

As for growing, it sounds like you're already familiar with
that.  You just do something like:

lvresize or lvextend the logical volume
mount the gfs volume to a mount point
gfs_grow /your/mount/point

It's probably safest to do gfs_grow when there is not a lot of
system activity.  For example, at night when the system is not
being beat up by lots of I/O.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Fri Sep 26 19:06:59 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 14:06:59 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <454119621.10161222455551443.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <fac531740809261134l3eb60caei24e4fc71054d7a61@mail.gmail.com>
	<454119621.10161222455551443.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809261206n6a6896fbk446ddc20d86e5800@mail.gmail.com>

I have been able to recreate the problem with gfs_grow. Here is the output
of the test command, and the actual command, with the /var/log/messages -
all from the node3. I am opening ticket with RH, and will give you the
ticket number afterwards.


[root at dev03 /]# gfs_grow -v -T /lvm_test2

FS: Mount Point: /lvm_test2

FS: Device: /dev/mapper/gfs_sdb1-gfs_sdb1

FS: Options: rw,hostdata=jid=2:id=262146:first=0

FS: Size: 1572864

RGRP: Current Resource Group List:

RI: Addr 1328945, RgLen 15, Start 1328960, DataLen 243904, BmapLen 60976

RI: Addr 1310720, RgLen 2, Start 1310722, DataLen 18220, BmapLen 4555

RI: Addr 1250100, RgLen 4, Start 1250104, DataLen 60616, BmapLen 15154

RI: Addr 1189480, RgLen 4, Start 1189484, DataLen 60616, BmapLen 15154

RI: Addr 1128860, RgLen 4, Start 1128864, DataLen 60616, BmapLen 15154

RI: Addr 1068240, RgLen 4, Start 1068244, DataLen 60616, BmapLen 15154

RI: Addr 1007620, RgLen 4, Start 1007624, DataLen 60616, BmapLen 15154

RI: Addr 947000, RgLen 4, Start 947004, DataLen 60616, BmapLen 15154

RI: Addr 886380, RgLen 4, Start 886384, DataLen 60616, BmapLen 15154

RI: Addr 825760, RgLen 4, Start 825764, DataLen 60616, BmapLen 15154

RI: Addr 765140, RgLen 4, Start 765144, DataLen 60616, BmapLen 15154

RI: Addr 704512, RgLen 4, Start 704516, DataLen 60624, BmapLen 15156

RI: Addr 545589, RgLen 4, Start 545593, DataLen 60612, BmapLen 15153

RI: Addr 484970, RgLen 4, Start 484974, DataLen 60612, BmapLen 15153

RI: Addr 424351, RgLen 4, Start 424355, DataLen 60612, BmapLen 15153

RI: Addr 363732, RgLen 4, Start 363736, DataLen 60612, BmapLen 15153

RI: Addr 303113, RgLen 4, Start 303117, DataLen 60612, BmapLen 15153

RI: Addr 242494, RgLen 4, Start 242498, DataLen 60612, BmapLen 15153

RI: Addr 181875, RgLen 4, Start 181879, DataLen 60612, BmapLen 15153

RI: Addr 121256, RgLen 4, Start 121260, DataLen 60612, BmapLen 15153

RI: Addr 60637, RgLen 4, Start 60641, DataLen 60612, BmapLen 15153

RI: Addr 17, RgLen 4, Start 21, DataLen 60616, BmapLen 15154

RGRP: 22 Resource groups in total

JRNL: Current Journal List:

JI: Addr 671744 NumSeg 2048 SegSize 16

JI: Addr 638976 NumSeg 2048 SegSize 16

JI: Addr 606208 NumSeg 2048 SegSize 16

JRNL: 3 Journals in total

DEV: Size: 1703936

RGRP: New Resource Group List:

RI: Addr 1572864, RgLen 9, Start 1572873, DataLen 131060, BmapLen 32765

RGRP: 1 Resource groups in total

[root at dev03 /]# gfs_grow -v  /lvm_test2

FS: Mount Point: /lvm_test2

FS: Device: /dev/mapper/gfs_sdb1-gfs_sdb1

FS: Options: rw,hostdata=jid=2:id=262146:first=0

FS: Size: 1572864

RGRP: Current Resource Group List:

RI: Addr 1328945, RgLen 15, Start 1328960, DataLen 243904, BmapLen 60976

RI: Addr 1310720, RgLen 2, Start 1310722, DataLen 18220, BmapLen 4555

RI: Addr 1250100, RgLen 4, Start 1250104, DataLen 60616, BmapLen 15154

RI: Addr 1189480, RgLen 4, Start 1189484, DataLen 60616, BmapLen 15154

RI: Addr 1128860, RgLen 4, Start 1128864, DataLen 60616, BmapLen 15154

RI: Addr 1068240, RgLen 4, Start 1068244, DataLen 60616, BmapLen 15154

RI: Addr 1007620, RgLen 4, Start 1007624, DataLen 60616, BmapLen 15154

RI: Addr 947000, RgLen 4, Start 947004, DataLen 60616, BmapLen 15154

RI: Addr 886380, RgLen 4, Start 886384, DataLen 60616, BmapLen 15154

RI: Addr 825760, RgLen 4, Start 825764, DataLen 60616, BmapLen 15154

RI: Addr 765140, RgLen 4, Start 765144, DataLen 60616, BmapLen 15154

RI: Addr 704512, RgLen 4, Start 704516, DataLen 60624, BmapLen 15156

RI: Addr 545589, RgLen 4, Start 545593, DataLen 60612, BmapLen 15153

RI: Addr 484970, RgLen 4, Start 484974, DataLen 60612, BmapLen 15153

RI: Addr 424351, RgLen 4, Start 424355, DataLen 60612, BmapLen 15153

RI: Addr 363732, RgLen 4, Start 363736, DataLen 60612, BmapLen 15153

RI: Addr 303113, RgLen 4, Start 303117, DataLen 60612, BmapLen 15153

RI: Addr 242494, RgLen 4, Start 242498, DataLen 60612, BmapLen 15153

RI: Addr 181875, RgLen 4, Start 181879, DataLen 60612, BmapLen 15153

RI: Addr 121256, RgLen 4, Start 121260, DataLen 60612, BmapLen 15153

RI: Addr 60637, RgLen 4, Start 60641, DataLen 60612, BmapLen 15153

RI: Addr 17, RgLen 4, Start 21, DataLen 60616, BmapLen 15154

RGRP: 22 Resource groups in total

JRNL: Current Journal List:

JI: Addr 671744 NumSeg 2048 SegSize 16

JI: Addr 638976 NumSeg 2048 SegSize 16

JI: Addr 606208 NumSeg 2048 SegSize 16

JRNL: 3 Journals in total

DEV: Size: 1703936

RGRP: New Resource Group List:

RI: Addr 1572864, RgLen 9, Start 1572873, DataLen 131060, BmapLen 32765

RGRP: 1 Resource groups in total

Preparing to write new FS information...

Done.





Node3

/var/log/messages



Sep 26 13:28:13 dev03 clvmd: Cluster LVM daemon started - connected to CMAN

Sep 26 13:28:17 dev03 kernel: GFS 0.1.23-5.el5_2.2 (built Aug 14 2008
17:08:35) installed

Sep 26 13:28:17 dev03 kernel: Trying to join cluster "lock_dlm",
"test1_cluster:gfs_fs1"

Sep 26 13:28:17 dev03 kernel: Joined cluster. Now mounting FS...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
Trying to acquire journal lock...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
Looking at journal...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2: Done

Sep 26 13:28:18 dev03 kernel: Trying to join cluster "lock_dlm",
"test1_cluster:gfs_sdb1"

Sep 26 13:28:18 dev03 kernel: Joined cluster. Now mounting FS...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
Trying to acquire journal lock...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
Looking at journal...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
Done


On Fri, Sep 26, 2008 at 1:59 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | Again thanks for the fast and prompt response Bob.
> |
> | I will try to reproduce the problem with gfs_grow.
> |
> | One more question regarding GFS - what steps would you recommend (if
> | any)
> | for growing and shrinking active GFS volume?
>
> Hi Alan,
>
> Neither GFS or GFS2 volumes cannot be shrunk.  Eventually I
> need to start working on a gfs2_shrink tool for gfs2, but I
> don't think GFS will ever be able to shrink.
>
> As for growing, it sounds like you're already familiar with
> that.  You just do something like:
>
> lvresize or lvextend the logical volume
> mount the gfs volume to a mount point
> gfs_grow /your/mount/point
>
> It's probably safest to do gfs_grow when there is not a lot of
> system activity.  For example, at night when the system is not
> being beat up by lots of I/O.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/820da0f4/attachment.htm>

From alan.zg at gmail.com  Fri Sep 26 19:38:45 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 14:38:45 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809261206n6a6896fbk446ddc20d86e5800@mail.gmail.com>
References: <fac531740809261134l3eb60caei24e4fc71054d7a61@mail.gmail.com>
	<454119621.10161222455551443.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
	<fac531740809261206n6a6896fbk446ddc20d86e5800@mail.gmail.com>
Message-ID: <fac531740809261238v2d8ecfbfg7def960c0133efa7@mail.gmail.com>

Node3

/var/log/messages



Sep 26 13:28:13 dev03 clvmd: Cluster LVM daemon started - connected to CMAN

Sep 26 13:28:17 dev03 kernel: GFS 0.1.23-5.el5_2.2 (built Aug 14 2008
17:08:35) installed

Sep 26 13:28:17 dev03 kernel: Trying to join cluster "lock_dlm",
"test1_cluster:gfs_fs1"

Sep 26 13:28:17 dev03 kernel: Joined cluster. Now mounting FS...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
Trying to acquire journal lock...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
Looking at journal...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2: Done

Sep 26 13:28:18 dev03 kernel: Trying to join cluster "lock_dlm",
"test1_cluster:gfs_sdb1"

Sep 26 13:28:18 dev03 kernel: Joined cluster. Now mounting FS...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
Trying to acquire journal lock...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
Looking at journal...

Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
Done


On Fri, Sep 26, 2008 at 2:06 PM, Alan A <alan.zg at gmail.com> wrote:

>
> I have been able to recreate the problem with gfs_grow. Here is the output
> of the test command, and the actual command, with the /var/log/messages -
> all from the node3. I am opening ticket with RH, and will give you the
> ticket number afterwards.
>
>
> [root at dev03 /]# gfs_grow -v -T /lvm_test2
>
> FS: Mount Point: /lvm_test2
>
> FS: Device: /dev/mapper/gfs_sdb1-gfs_sdb1
>
> FS: Options: rw,hostdata=jid=2:id=262146:first=0
>
> FS: Size: 1572864
>
> RGRP: Current Resource Group List:
>
> RI: Addr 1328945, RgLen 15, Start 1328960, DataLen 243904, BmapLen 60976
>
> RI: Addr 1310720, RgLen 2, Start 1310722, DataLen 18220, BmapLen 4555
>
> RI: Addr 1250100, RgLen 4, Start 1250104, DataLen 60616, BmapLen 15154
>
> RI: Addr 1189480, RgLen 4, Start 1189484, DataLen 60616, BmapLen 15154
>
> RI: Addr 1128860, RgLen 4, Start 1128864, DataLen 60616, BmapLen 15154
>
> RI: Addr 1068240, RgLen 4, Start 1068244, DataLen 60616, BmapLen 15154
>
> RI: Addr 1007620, RgLen 4, Start 1007624, DataLen 60616, BmapLen 15154
>
> RI: Addr 947000, RgLen 4, Start 947004, DataLen 60616, BmapLen 15154
>
> RI: Addr 886380, RgLen 4, Start 886384, DataLen 60616, BmapLen 15154
>
> RI: Addr 825760, RgLen 4, Start 825764, DataLen 60616, BmapLen 15154
>
> RI: Addr 765140, RgLen 4, Start 765144, DataLen 60616, BmapLen 15154
>
> RI: Addr 704512, RgLen 4, Start 704516, DataLen 60624, BmapLen 15156
>
> RI: Addr 545589, RgLen 4, Start 545593, DataLen 60612, BmapLen 15153
>
> RI: Addr 484970, RgLen 4, Start 484974, DataLen 60612, BmapLen 15153
>
> RI: Addr 424351, RgLen 4, Start 424355, DataLen 60612, BmapLen 15153
>
> RI: Addr 363732, RgLen 4, Start 363736, DataLen 60612, BmapLen 15153
>
> RI: Addr 303113, RgLen 4, Start 303117, DataLen 60612, BmapLen 15153
>
> RI: Addr 242494, RgLen 4, Start 242498, DataLen 60612, BmapLen 15153
>
> RI: Addr 181875, RgLen 4, Start 181879, DataLen 60612, BmapLen 15153
>
> RI: Addr 121256, RgLen 4, Start 121260, DataLen 60612, BmapLen 15153
>
> RI: Addr 60637, RgLen 4, Start 60641, DataLen 60612, BmapLen 15153
>
> RI: Addr 17, RgLen 4, Start 21, DataLen 60616, BmapLen 15154
>
> RGRP: 22 Resource groups in total
>
> JRNL: Current Journal List:
>
> JI: Addr 671744 NumSeg 2048 SegSize 16
>
> JI: Addr 638976 NumSeg 2048 SegSize 16
>
> JI: Addr 606208 NumSeg 2048 SegSize 16
>
> JRNL: 3 Journals in total
>
> DEV: Size: 1703936
>
> RGRP: New Resource Group List:
>
> RI: Addr 1572864, RgLen 9, Start 1572873, DataLen 131060, BmapLen 32765
>
> RGRP: 1 Resource groups in total
>
> [root at dev03 /]# gfs_grow -v  /lvm_test2
>
> FS: Mount Point: /lvm_test2
>
> FS: Device: /dev/mapper/gfs_sdb1-gfs_sdb1
>
> FS: Options: rw,hostdata=jid=2:id=262146:first=0
>
> FS: Size: 1572864
>
> RGRP: Current Resource Group List:
>
> RI: Addr 1328945, RgLen 15, Start 1328960, DataLen 243904, BmapLen 60976
>
> RI: Addr 1310720, RgLen 2, Start 1310722, DataLen 18220, BmapLen 4555
>
> RI: Addr 1250100, RgLen 4, Start 1250104, DataLen 60616, BmapLen 15154
>
> RI: Addr 1189480, RgLen 4, Start 1189484, DataLen 60616, BmapLen 15154
>
> RI: Addr 1128860, RgLen 4, Start 1128864, DataLen 60616, BmapLen 15154
>
> RI: Addr 1068240, RgLen 4, Start 1068244, DataLen 60616, BmapLen 15154
>
> RI: Addr 1007620, RgLen 4, Start 1007624, DataLen 60616, BmapLen 15154
>
> RI: Addr 947000, RgLen 4, Start 947004, DataLen 60616, BmapLen 15154
>
> RI: Addr 886380, RgLen 4, Start 886384, DataLen 60616, BmapLen 15154
>
> RI: Addr 825760, RgLen 4, Start 825764, DataLen 60616, BmapLen 15154
>
> RI: Addr 765140, RgLen 4, Start 765144, DataLen 60616, BmapLen 15154
>
> RI: Addr 704512, RgLen 4, Start 704516, DataLen 60624, BmapLen 15156
>
> RI: Addr 545589, RgLen 4, Start 545593, DataLen 60612, BmapLen 15153
>
> RI: Addr 484970, RgLen 4, Start 484974, DataLen 60612, BmapLen 15153
>
> RI: Addr 424351, RgLen 4, Start 424355, DataLen 60612, BmapLen 15153
>
> RI: Addr 363732, RgLen 4, Start 363736, DataLen 60612, BmapLen 15153
>
> RI: Addr 303113, RgLen 4, Start 303117, DataLen 60612, BmapLen 15153
>
> RI: Addr 242494, RgLen 4, Start 242498, DataLen 60612, BmapLen 15153
>
> RI: Addr 181875, RgLen 4, Start 181879, DataLen 60612, BmapLen 15153
>
> RI: Addr 121256, RgLen 4, Start 121260, DataLen 60612, BmapLen 15153
>
> RI: Addr 60637, RgLen 4, Start 60641, DataLen 60612, BmapLen 15153
>
> RI: Addr 17, RgLen 4, Start 21, DataLen 60616, BmapLen 15154
>
> RGRP: 22 Resource groups in total
>
> JRNL: Current Journal List:
>
> JI: Addr 671744 NumSeg 2048 SegSize 16
>
> JI: Addr 638976 NumSeg 2048 SegSize 16
>
> JI: Addr 606208 NumSeg 2048 SegSize 16
>
> JRNL: 3 Journals in total
>
> DEV: Size: 1703936
>
> RGRP: New Resource Group List:
>
> RI: Addr 1572864, RgLen 9, Start 1572873, DataLen 131060, BmapLen 32765
>
> RGRP: 1 Resource groups in total
>
> Preparing to write new FS information...
>
> Done.
>
>
>
>
>
> Node3
>
> /var/log/messages
>
>
>
> Sep 26 13:28:13 dev03 clvmd: Cluster LVM daemon started - connected to CMAN
>
> Sep 26 13:28:17 dev03 kernel: GFS 0.1.23-5.el5_2.2 (built Aug 14 2008
> 17:08:35) installed
>
> Sep 26 13:28:17 dev03 kernel: Trying to join cluster "lock_dlm",
> "test1_cluster:gfs_fs1"
>
> Sep 26 13:28:17 dev03 kernel: Joined cluster. Now mounting FS...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
> Trying to acquire journal lock...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
> Looking at journal...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
> Done
>
> Sep 26 13:28:18 dev03 kernel: Trying to join cluster "lock_dlm",
> "test1_cluster:gfs_sdb1"
>
> Sep 26 13:28:18 dev03 kernel: Joined cluster. Now mounting FS...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
> Trying to acquire journal lock...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
> Looking at journal...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
> Done
>
>
> On Fri, Sep 26, 2008 at 1:59 PM, Bob Peterson <rpeterso at redhat.com> wrote:
>
>> ----- "Alan A" <alan.zg at gmail.com> wrote:
>> | Again thanks for the fast and prompt response Bob.
>> |
>> | I will try to reproduce the problem with gfs_grow.
>> |
>> | One more question regarding GFS - what steps would you recommend (if
>> | any)
>> | for growing and shrinking active GFS volume?
>>
>> Hi Alan,
>>
>> Neither GFS or GFS2 volumes cannot be shrunk.  Eventually I
>> need to start working on a gfs2_shrink tool for gfs2, but I
>> don't think GFS will ever be able to shrink.
>>
>> As for growing, it sounds like you're already familiar with
>> that.  You just do something like:
>>
>> lvresize or lvextend the logical volume
>> mount the gfs volume to a mount point
>> gfs_grow /your/mount/point
>>
>> It's probably safest to do gfs_grow when there is not a lot of
>> system activity.  For example, at night when the system is not
>> being beat up by lots of I/O.
>>
>> Regards,
>>
>> Bob Peterson
>> Red Hat Clustering & GFS
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> Alan A.
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/f22704fc/attachment.htm>

From alan.zg at gmail.com  Fri Sep 26 19:43:57 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 14:43:57 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809261238v2d8ecfbfg7def960c0133efa7@mail.gmail.com>
References: <fac531740809261134l3eb60caei24e4fc71054d7a61@mail.gmail.com>
	<454119621.10161222455551443.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
	<fac531740809261206n6a6896fbk446ddc20d86e5800@mail.gmail.com>
	<fac531740809261238v2d8ecfbfg7def960c0133efa7@mail.gmail.com>
Message-ID: <fac531740809261243r23b1ad30lc4ccf003537c93af@mail.gmail.com>

Sorry somehow my copy past keeps excluding this part:

Sep 26 13:28:38 dev03 ntpd[8930]: synchronized to 10.99.1.5, stratum 3

Sep 26 13:52:25 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=0:
Trying to acquire journal lock...

Sep 26 13:52:25 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=0:
Busy

Sep 26 13:52:37 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=1:
Trying to acquire journal lock...

Sep 26 13:52:37 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=1:
Busy

*Sep 26 13:52:37 dev03 gfs_controld[14620]: gfs_sdb1 finish: needs recovery
jid 1 nodeid 1 status 1*


On Fri, Sep 26, 2008 at 2:38 PM, Alan A <alan.zg at gmail.com> wrote:

> Node3
>
> /var/log/messages
>
>
>
> Sep 26 13:28:13 dev03 clvmd: Cluster LVM daemon started - connected to CMAN
>
> Sep 26 13:28:17 dev03 kernel: GFS 0.1.23-5.el5_2.2 (built Aug 14 2008
> 17:08:35) installed
>
> Sep 26 13:28:17 dev03 kernel: Trying to join cluster "lock_dlm",
> "test1_cluster:gfs_fs1"
>
> Sep 26 13:28:17 dev03 kernel: Joined cluster. Now mounting FS...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
> Trying to acquire journal lock...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
> Looking at journal...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
> Done
>
> Sep 26 13:28:18 dev03 kernel: Trying to join cluster "lock_dlm",
> "test1_cluster:gfs_sdb1"
>
> Sep 26 13:28:18 dev03 kernel: Joined cluster. Now mounting FS...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
> Trying to acquire journal lock...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
> Looking at journal...
>
> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
> Done
>
>
> On Fri, Sep 26, 2008 at 2:06 PM, Alan A <alan.zg at gmail.com> wrote:
>
>>
>> I have been able to recreate the problem with gfs_grow. Here is the output
>> of the test command, and the actual command, with the /var/log/messages -
>> all from the node3. I am opening ticket with RH, and will give you the
>> ticket number afterwards.
>>
>>
>> [root at dev03 /]# gfs_grow -v -T /lvm_test2
>>
>> FS: Mount Point: /lvm_test2
>>
>> FS: Device: /dev/mapper/gfs_sdb1-gfs_sdb1
>>
>> FS: Options: rw,hostdata=jid=2:id=262146:first=0
>>
>> FS: Size: 1572864
>>
>> RGRP: Current Resource Group List:
>>
>> RI: Addr 1328945, RgLen 15, Start 1328960, DataLen 243904, BmapLen 60976
>>
>> RI: Addr 1310720, RgLen 2, Start 1310722, DataLen 18220, BmapLen 4555
>>
>> RI: Addr 1250100, RgLen 4, Start 1250104, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1189480, RgLen 4, Start 1189484, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1128860, RgLen 4, Start 1128864, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1068240, RgLen 4, Start 1068244, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1007620, RgLen 4, Start 1007624, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 947000, RgLen 4, Start 947004, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 886380, RgLen 4, Start 886384, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 825760, RgLen 4, Start 825764, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 765140, RgLen 4, Start 765144, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 704512, RgLen 4, Start 704516, DataLen 60624, BmapLen 15156
>>
>> RI: Addr 545589, RgLen 4, Start 545593, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 484970, RgLen 4, Start 484974, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 424351, RgLen 4, Start 424355, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 363732, RgLen 4, Start 363736, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 303113, RgLen 4, Start 303117, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 242494, RgLen 4, Start 242498, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 181875, RgLen 4, Start 181879, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 121256, RgLen 4, Start 121260, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 60637, RgLen 4, Start 60641, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 17, RgLen 4, Start 21, DataLen 60616, BmapLen 15154
>>
>> RGRP: 22 Resource groups in total
>>
>> JRNL: Current Journal List:
>>
>> JI: Addr 671744 NumSeg 2048 SegSize 16
>>
>> JI: Addr 638976 NumSeg 2048 SegSize 16
>>
>> JI: Addr 606208 NumSeg 2048 SegSize 16
>>
>> JRNL: 3 Journals in total
>>
>> DEV: Size: 1703936
>>
>> RGRP: New Resource Group List:
>>
>> RI: Addr 1572864, RgLen 9, Start 1572873, DataLen 131060, BmapLen 32765
>>
>> RGRP: 1 Resource groups in total
>>
>> [root at dev03 /]# gfs_grow -v  /lvm_test2
>>
>> FS: Mount Point: /lvm_test2
>>
>> FS: Device: /dev/mapper/gfs_sdb1-gfs_sdb1
>>
>> FS: Options: rw,hostdata=jid=2:id=262146:first=0
>>
>> FS: Size: 1572864
>>
>> RGRP: Current Resource Group List:
>>
>> RI: Addr 1328945, RgLen 15, Start 1328960, DataLen 243904, BmapLen 60976
>>
>> RI: Addr 1310720, RgLen 2, Start 1310722, DataLen 18220, BmapLen 4555
>>
>> RI: Addr 1250100, RgLen 4, Start 1250104, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1189480, RgLen 4, Start 1189484, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1128860, RgLen 4, Start 1128864, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1068240, RgLen 4, Start 1068244, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 1007620, RgLen 4, Start 1007624, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 947000, RgLen 4, Start 947004, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 886380, RgLen 4, Start 886384, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 825760, RgLen 4, Start 825764, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 765140, RgLen 4, Start 765144, DataLen 60616, BmapLen 15154
>>
>> RI: Addr 704512, RgLen 4, Start 704516, DataLen 60624, BmapLen 15156
>>
>> RI: Addr 545589, RgLen 4, Start 545593, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 484970, RgLen 4, Start 484974, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 424351, RgLen 4, Start 424355, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 363732, RgLen 4, Start 363736, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 303113, RgLen 4, Start 303117, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 242494, RgLen 4, Start 242498, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 181875, RgLen 4, Start 181879, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 121256, RgLen 4, Start 121260, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 60637, RgLen 4, Start 60641, DataLen 60612, BmapLen 15153
>>
>> RI: Addr 17, RgLen 4, Start 21, DataLen 60616, BmapLen 15154
>>
>> RGRP: 22 Resource groups in total
>>
>> JRNL: Current Journal List:
>>
>> JI: Addr 671744 NumSeg 2048 SegSize 16
>>
>> JI: Addr 638976 NumSeg 2048 SegSize 16
>>
>> JI: Addr 606208 NumSeg 2048 SegSize 16
>>
>> JRNL: 3 Journals in total
>>
>> DEV: Size: 1703936
>>
>> RGRP: New Resource Group List:
>>
>> RI: Addr 1572864, RgLen 9, Start 1572873, DataLen 131060, BmapLen 32765
>>
>> RGRP: 1 Resource groups in total
>>
>> Preparing to write new FS information...
>>
>> Done.
>>
>>
>>
>>
>>
>> Node3
>>
>> /var/log/messages
>>
>>
>>
>> Sep 26 13:28:13 dev03 clvmd: Cluster LVM daemon started - connected to
>> CMAN
>>
>> Sep 26 13:28:17 dev03 kernel: GFS 0.1.23-5.el5_2.2 (built Aug 14 2008
>> 17:08:35) installed
>>
>> Sep 26 13:28:17 dev03 kernel: Trying to join cluster "lock_dlm",
>> "test1_cluster:gfs_fs1"
>>
>> Sep 26 13:28:17 dev03 kernel: Joined cluster. Now mounting FS...
>>
>> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
>> Trying to acquire journal lock...
>>
>> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
>> Looking at journal...
>>
>> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_fs1.2: jid=2:
>> Done
>>
>> Sep 26 13:28:18 dev03 kernel: Trying to join cluster "lock_dlm",
>> "test1_cluster:gfs_sdb1"
>>
>> Sep 26 13:28:18 dev03 kernel: Joined cluster. Now mounting FS...
>>
>> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
>> Trying to acquire journal lock...
>>
>> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
>> Looking at journal...
>>
>> Sep 26 13:28:18 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.2: jid=2:
>> Done
>>
>>
>> On Fri, Sep 26, 2008 at 1:59 PM, Bob Peterson <rpeterso at redhat.com>wrote:
>>
>>> ----- "Alan A" <alan.zg at gmail.com> wrote:
>>> | Again thanks for the fast and prompt response Bob.
>>> |
>>> | I will try to reproduce the problem with gfs_grow.
>>> |
>>> | One more question regarding GFS - what steps would you recommend (if
>>> | any)
>>> | for growing and shrinking active GFS volume?
>>>
>>> Hi Alan,
>>>
>>> Neither GFS or GFS2 volumes cannot be shrunk.  Eventually I
>>> need to start working on a gfs2_shrink tool for gfs2, but I
>>> don't think GFS will ever be able to shrink.
>>>
>>> As for growing, it sounds like you're already familiar with
>>> that.  You just do something like:
>>>
>>> lvresize or lvextend the logical volume
>>> mount the gfs volume to a mount point
>>> gfs_grow /your/mount/point
>>>
>>> It's probably safest to do gfs_grow when there is not a lot of
>>> system activity.  For example, at night when the system is not
>>> being beat up by lots of I/O.
>>>
>>> Regards,
>>>
>>> Bob Peterson
>>> Red Hat Clustering & GFS
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>>
>> --
>> Alan A.
>>
>
>
>
> --
> Alan A.
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/d85d4396/attachment.htm>

From rpeterso at redhat.com  Fri Sep 26 19:51:50 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 26 Sep 2008 15:51:50 -0400 (EDT)
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <1576481480.18951222457991958.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <1362790665.25391222458710364.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| I have been able to recreate the problem with gfs_grow. Here is the
| output
| of the test command, and the actual command, with the
| /var/log/messages -
| all from the node3. I am opening ticket with RH, and will give you
| the
| ticket number afterwards.
| 
| 
| [root at dev03 /]# gfs_grow -v -T /lvm_test2

Hi Alan,

This is helpful, thanks.  Please check that the clustered
bit is "on" for your volume group.  If you do the "vgs"
command, it should show "wz--nc" under "Attr" for the volume
group.  If not, (if it shows "wz--n-") it would explain your
problem, because it needs to be on.  So please, double-check
for me.  If the clustered bit is not on, the other nodes are not
informed by lvm of the resize that took place, so when gfs_grow
informs them of the file system size change, things don't go
well.  :7)

If the volume group is indeed clustered, it would help me a lot
to get a complete history of these commands pertaining to your GFS
volume:

lvcreate
lvresize or lvextend
mount
umount
gfs_mkfs or mkfs.gfs
gfs_fsck
gfs_grow

You might be able to grep these from the "history" command.
I just now created a logical volume of a similar size, and I was
able to do a gfs_grow on it without any problems, hangs or
consequences.  I must be doing something different, so I want
to make sure I hit the same conditions you hit by using the
exact same commands, if possible.  Otherwise, I'll have to
try a bunch of combinations and it may take be days to hit
the right combination.

Also, let me know what kind of activity was happening on the
other nodes while you were doing gfs_grow.  Not detailed; I
just want to know if the other nodes were likely to have been
writing to the file system.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Fri Sep 26 20:08:48 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 15:08:48 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <1362790665.25391222458710364.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <1576481480.18951222457991958.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
	<1362790665.25391222458710364.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809261308y5766dde8me828d3161c105925@mail.gmail.com>

  gfs_sda1     1   1   0 wz--n- 10.00G 5.00G
  gfs_sdb1     1   1   0 wz--n- 10.00G 3.50G

The vg's are not enabled with the cluster bit. I am sorry for not being more
informative on LVM. If I enable cluster bit with vgchange -c command would
that convert the volumes to cluster volume group, or should I start by
blowing them away?

On Fri, Sep 26, 2008 at 2:51 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | I have been able to recreate the problem with gfs_grow. Here is the
> | output
> | of the test command, and the actual command, with the
> | /var/log/messages -
> | all from the node3. I am opening ticket with RH, and will give you
> | the
> | ticket number afterwards.
> |
> |
> | [root at dev03 /]# gfs_grow -v -T /lvm_test2
>
> Hi Alan,
>
> This is helpful, thanks.  Please check that the clustered
> bit is "on" for your volume group.  If you do the "vgs"
> command, it should show "wz--nc" under "Attr" for the volume
> group.  If not, (if it shows "wz--n-") it would explain your
> problem, because it needs to be on.  So please, double-check
> for me.  If the clustered bit is not on, the other nodes are not
> informed by lvm of the resize that took place, so when gfs_grow
> informs them of the file system size change, things don't go
> well.  :7)
>
> If the volume group is indeed clustered, it would help me a lot
> to get a complete history of these commands pertaining to your GFS
> volume:
>
> lvcreate
> lvresize or lvextend
> mount
> umount
> gfs_mkfs or mkfs.gfs
> gfs_fsck
> gfs_grow
>
> You might be able to grep these from the "history" command.
> I just now created a logical volume of a similar size, and I was
> able to do a gfs_grow on it without any problems, hangs or
> consequences.  I must be doing something different, so I want
> to make sure I hit the same conditions you hit by using the
> exact same commands, if possible.  Otherwise, I'll have to
> try a bunch of combinations and it may take be days to hit
> the right combination.
>
> Also, let me know what kind of activity was happening on the
> other nodes while you were doing gfs_grow.  Not detailed; I
> just want to know if the other nodes were likely to have been
> writing to the file system.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/e226bdae/attachment.htm>

From sdake at redhat.com  Fri Sep 26 20:36:48 2008
From: sdake at redhat.com (Steven Dake)
Date: Fri, 26 Sep 2008 13:36:48 -0700
Subject: [Linux-cluster] cman openais totem default for:
	retransmits_before_loss
In-Reply-To: <48DD2BF4.5030906@ntsg.umt.edu>
References: <48DD2BF4.5030906@ntsg.umt.edu>
Message-ID: <1222461408.3742.10.camel@balance>


On Fri, 2008-09-26 at 12:37 -0600, Andrew A. Neuschwander wrote:
> Does cman set totem's retransmits_before_loss option? My cluster 
> frequently loses the token in 'OPERATIONAL state'. I don't see anything 
> in my network stats that indicate a problem. Sometimes when totem 
> reforms the cluster, it fences a node that seems to be working fine. 
> Sometimes, nodes end up in a disallowed state and quorum is lost. In 
> this case, I usually restart the entire cluster.
> 
> Would setting totem's token retransmits_before_loss be of any help?
> 

Yes it is set to 20 token retransmits.  It could be beneficial if your
network is overloaded to raise this value.  You could "oversend" the
token to allow more likely delivery of the ORF token to the next
processor.  I recommend keeping the number of tokens sent to a value
greater then 50 msec (so use 200 (tokens retransmitted every 50 msec) as
a ceiling for your settings).

Keep in mind this will increase the overload on your network to even
higher values.

Regards
-steve

> Thanks,
> -A
> --
> Andrew A. Neuschwander, RHCE
> Linux Systems/Software Engineer
> College of Forestry and Conservation
> The University of Montana
> http://www.ntsg.umt.edu
> andrew at ntsg.umt.edu - 406.243.6310
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From rpeterso at redhat.com  Fri Sep 26 20:46:50 2008
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 26 Sep 2008 16:46:50 -0400 (EDT)
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <fac531740809261308y5766dde8me828d3161c105925@mail.gmail.com>
Message-ID: <1851992010.44101222462010870.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

----- "Alan A" <alan.zg at gmail.com> wrote:
| gfs_sda1     1   1   0 wz--n- 10.00G 5.00G
|   gfs_sdb1     1   1   0 wz--n- 10.00G 3.50G
| 
| The vg's are not enabled with the cluster bit. I am sorry for not
| being more
| informative on LVM. If I enable cluster bit with vgchange -c command
| would
| that convert the volumes to cluster volume group, or should I start
| by
| blowing them away?

Hi Alan,

It's perfectly fine to enable the clustered bit with vgchange -cy,
and that flags the entire vg and all the lvs inside it.
You should probably unmount-then-mount the gfs file system on all
the nodes after that, just to be safe.  I don't think you need to
remount on the node that did the gfs_grow though.

The missing clustered bit is probably the cause of all your grief,
so I wouldn't bother opening a ticket, unless you can recreate
the failure with the clustered bit on, after umount-then-mount.

Regards,

Bob Peterson
Red Hat Clustering & GFS



From alan.zg at gmail.com  Fri Sep 26 20:54:48 2008
From: alan.zg at gmail.com (Alan A)
Date: Fri, 26 Sep 2008 15:54:48 -0500
Subject: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow
In-Reply-To: <1851992010.44101222462010870.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
References: <fac531740809261308y5766dde8me828d3161c105925@mail.gmail.com>
	<1851992010.44101222462010870.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Message-ID: <fac531740809261354p4cb4901yc8acc7cf44f58c4a@mail.gmail.com>

Thank you bob for all the hard work. I am sorry for not being specific from
the get go.

Just a little info to some future readers - if you get:
"Skipping cluster volume group"  - all you have to do is change
/etc/lvm/lvm.conf locking to type 3.

   # Type of locking to use. Defaults to local file-based locking (1).
    # Turn locking off by setting to 0 (dangerous: risks metadata corruption
    # if LVM2 commands get run concurrently).
    # Type 2 uses the external shared library locking_library.
    # Type 3 uses built-in clustered locking.
    locking_type = 3

Thanks all for the help.





On Fri, Sep 26, 2008 at 3:46 PM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Alan A" <alan.zg at gmail.com> wrote:
> | gfs_sda1     1   1   0 wz--n- 10.00G 5.00G
> |   gfs_sdb1     1   1   0 wz--n- 10.00G 3.50G
> |
> | The vg's are not enabled with the cluster bit. I am sorry for not
> | being more
> | informative on LVM. If I enable cluster bit with vgchange -c command
> | would
> | that convert the volumes to cluster volume group, or should I start
> | by
> | blowing them away?
>
> Hi Alan,
>
> It's perfectly fine to enable the clustered bit with vgchange -cy,
> and that flags the entire vg and all the lvs inside it.
> You should probably unmount-then-mount the gfs file system on all
> the nodes after that, just to be safe.  I don't think you need to
> remount on the node that did the gfs_grow though.
>
> The missing clustered bit is probably the cause of all your grief,
> so I wouldn't bother opening a ticket, unless you can recreate
> the failure with the clustered bit on, after umount-then-mount.
>
> Regards,
>
> Bob Peterson
> Red Hat Clustering & GFS
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080926/610974ad/attachment.htm>

From Dave.Jones at maritz.com  Fri Sep 26 21:52:16 2008
From: Dave.Jones at maritz.com (Jones, Dave)
Date: Fri, 26 Sep 2008 16:52:16 -0500
Subject: [Linux-cluster] HP DL380 G4 & cluster fencing options
In-Reply-To: <DDF0CD44B9EED841B556D44B7929813A05A878D3@FENEXCH1602C.us.maritz.net>
References: <DDF0CD44B9EED841B556D44B7929813A05A878D3@FENEXCH1602C.us.maritz.net>
Message-ID: <DDF0CD44B9EED841B556D44B7929813A05A878D4@FENEXCH1602C.us.maritz.net>



(I apologize if this double-posts, long story) 

Hello all.

I have what I hope are a few simple questions on RHEL 5.2 clustering
using HP DL380 hardware.

Long story short, we have several HP DL380-G4s that we are trying to set
up in a test cluster but we do not have any of the obvious fencing
options available to us.  ILO fencing does not seem to be an option
because our standards dictate ILO be on an out-of-band network.
We have SCSI fencing in place but I'm not convinced it is working
properly nor that it is reliable.
I hope to find funding to get an APC Masterswitch in place but that is
not our standard so I need approval.

In the meantime, can anyone offer any answers on the following
questions?

1) Are TripLite or other power controllers supported aside from what is
on the hardware list?
	(http://www.redhat.com/cluster_suite/hardware/)

2) Is there a reliable way to change the SAN switch fencing controls to
use SNMP (preferred) or perhaps SSH?
	According to 'man fence_brocade' the system uses telnet.
	An interactive login to a 256-port switch is bad enough, but
doing so over an unencrypted telnet session just isn't going to fly!
	I would think this would be a relatively trivial change to the
script.
	Has anyone done something like it?

3) Does anyone have any experience clustering DL380s and have
recommended Primary & Secondary fencing methods or advice they would be
willing to share?

4) Is anyone out there using only 1 fencing method with GFS filesystems
in Production environment?

Thanks,
Dave

Confidentiality Warning:  This e-mail contains information intended only for the use of the individual or entity named above.  If the reader of this e-mail is not the intended recipient or the employee or agent responsible for delivering it to the intended recipient, any dissemination, publication or copying of this e-mail is strictly prohibited.  The sender does not accept any responsibility for any loss, disruption or damage to your data or computer system that may occur while using data contained in, or transmitted with, this e-mail.  
If you have received this e-mail in error, please immediately notify us by return e-mail.  Thank you.






From james.hofmeister at hp.com  Sat Sep 27 04:41:57 2008
From: james.hofmeister at hp.com (Hofmeister, James (WTEC Linux))
Date: Sat, 27 Sep 2008 04:41:57 +0000
Subject: [Linux-cluster] RE: HP DL380 G4 & cluster fencing options
In-Reply-To: <DDF0CD44B9EED841B556D44B7929813A05A878D4@FENEXCH1602C.us.maritz.net>
References: <DDF0CD44B9EED841B556D44B7929813A05A878D3@FENEXCH1602C.us.maritz.net>
	<DDF0CD44B9EED841B556D44B7929813A05A878D4@FENEXCH1602C.us.maritz.net>
Message-ID: <EC61DD7B6048464AB0E1B713AF7521BC1655C73E58@GVW0676EXC.americas.hpqcorp.net>

Hello Dave,

RE: HP DL380 G4 & cluster fencing options

Using ILO fencing should work fine as long as the ILO for 'all' of the cluster node are connected to the same physical network and IP subnet.  The ILO IP network does not need to be the same as the heartbeat/admin traffic network.

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces at redhat.com
|[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jones, Dave
|Sent: Friday, September 26, 2008 2:52 PM
|To: Linux-cluster at redhat.com
|Subject: [Linux-cluster] HP DL380 G4 & cluster fencing options
|
|
|
|(I apologize if this double-posts, long story)
|
|Hello all.
|
|I have what I hope are a few simple questions on RHEL 5.2
|clustering using HP DL380 hardware.
|
|Long story short, we have several HP DL380-G4s that we are
|trying to set up in a test cluster but we do not have any of
|the obvious fencing options available to us.  ILO fencing does
|not seem to be an option because our standards dictate ILO be
|on an out-of-band network.
|We have SCSI fencing in place but I'm not convinced it is
|working properly nor that it is reliable.
|I hope to find funding to get an APC Masterswitch in place but
|that is not our standard so I need approval.
|
|In the meantime, can anyone offer any answers on the following
|questions?
|
|1) Are TripLite or other power controllers supported aside
|from what is on the hardware list?
|        (http://www.redhat.com/cluster_suite/hardware/)
|
|2) Is there a reliable way to change the SAN switch fencing
|controls to use SNMP (preferred) or perhaps SSH?
|        According to 'man fence_brocade' the system uses telnet.
|        An interactive login to a 256-port switch is bad
|enough, but doing so over an unencrypted telnet session just
|isn't going to fly!
|        I would think this would be a relatively trivial
|change to the script.
|        Has anyone done something like it?
|
|3) Does anyone have any experience clustering DL380s and have
|recommended Primary & Secondary fencing methods or advice they
|would be willing to share?
|
|4) Is anyone out there using only 1 fencing method with GFS
|filesystems in Production environment?
|
|Thanks,
|Dave
|
|Confidentiality Warning:  This e-mail contains information
|intended only for the use of the individual or entity named
|above.  If the reader of this e-mail is not the intended
|recipient or the employee or agent responsible for delivering
|it to the intended recipient, any dissemination, publication
|or copying of this e-mail is strictly prohibited.  The sender
|does not accept any responsibility for any loss, disruption or
|damage to your data or computer system that may occur while
|using data contained in, or transmitted with, this e-mail.
|If you have received this e-mail in error, please immediately
|notify us by return e-mail.  Thank you.
|
|
|
|
|--
|Linux-cluster mailing list
|Linux-cluster at redhat.com
|https://www.redhat.com/mailman/listinfo/linux-cluster
|



From stevan.colaco at gmail.com  Sat Sep 27 08:44:16 2008
From: stevan.colaco at gmail.com (Stevan Colaco)
Date: Sat, 27 Sep 2008 11:44:16 +0300
Subject: [Linux-cluster] Fencing issue using IPMI (nodes fencing each
	other ending in a loop)
In-Reply-To: <c3c0440e0809240653l6fca5876uc9511c2475dae8e9@mail.gmail.com>
References: <56bb44d0809231027k74a7bcd6nf9cd43b104ecec62@mail.gmail.com>
	<c3c0440e0809240653l6fca5876uc9511c2475dae8e9@mail.gmail.com>
Message-ID: <56bb44d0809270144p5102714bha6d388146f12d588@mail.gmail.com>

Hello All,

To test i have moved network connectivity from cisco swictes to 5 port
DLINK swicth.
Cluster is workig with fencing properly.

Best Regards,
-Stevan Colaco


2008/9/24 Grisha G. <grigorygor at gmail.com>:
> In 2 node cluster you should use a quorum disk to solve the split brain
> problem.
> after you create a quorum disk change this line in you cluster.conf
> from <cman expected_votes="1" two_node="1"/>
> to  <cman expected_votes="3" two_node="0"/>
>
> Grisha
>
>
> On Tue, Sep 23, 2008 at 7:27 PM, Stevan Colaco <stevan.colaco at gmail.com>
> wrote:
>>
>> Hello
>>
>> issue: Fencing using fence_ipmilan, each node keeps fencing the other
>> node ending in a fence loop.....
>>
>> We have implemented RH Cluster on RHEL5.2 64bit.
>> Server Hardware: SUN X4150
>> Storage: SUN 6140
>> Fencing Machnism: fence_ipmilan
>>
>>  We have downloaded the IPMI fence_ipmilan and configured two node
>> cluster with ipmi fencing. But..
>>
>> when we ifdown the NIC interface, the node gets fenced but the service
>> does not relocate to the other node. at the same time when the
>> initially fenced node joins back the cluster it fences the other
>> node......
>> this keeps on ending in a loop.
>>
>> We downloaded and followed the intructions from the ipmi site
>> mentioned below
>> http://docs.sun.com/source/819-6588-13/ipmi_com.html#0_74891
>>
>> we tested with following  Cmd line method which works fine.
>> #fence_ipmilan -a "ip addr" -l root -p <Passkey> -o <on|off|reboot>
>>
>> here is my cluster.conf
>>
>> <?xml version="1.0"?>
>> <cluster alias="tibcouat" config_version="12" name="tibcouat">
>>        <fence_daemon clean_start="0" post_fail_delay="0"
>> post_join_delay="3"/>
>>        <clusternodes>
>>                <clusternode name="tibco-node1-uat.kmefic.com.kw"
>> nodeid="1" votes="1">
>>                        <fence>
>>                                <method name="1">
>>                                        <device name="tibco-node1"/>
>>                                </method>
>>                        </fence>
>>                </clusternode>
>>                <clusternode name="tibco-node2-uat.kmefic.com.kw"
>> nodeid="2" votes="1">
>>                        <fence>
>>                                <method name="1">
>>                                        <device name="tibco-node2"/>
>>                                </method>
>>                        </fence>
>>                </clusternode>
>>        </clusternodes>
>>        <cman expected_votes="1" two_node="1"/>
>>        <fencedevices>
>>                <fencedevice agent="fence_ipmilan" ipaddr="172.16.71.41"
>> login="root" name="tibco-node1" passwd="changeme"/>
>>                <fencedevice agent="fence_ipmilan" ipaddr="172.16.71.42"
>> login="root" name="tibco-node2" passwd="changeme"/>
>>        </fencedevices>
>>        <rm>
>>                <failoverdomains>
>>                        <failoverdomain name="prefer_node1" nofailback="0"
>> ordered="1"
>> restricted="1">
>>                                <failoverdomainnode
>> name="tibco-node1-uat.kmefic.com.kw" priority="1"/>
>>                                <failoverdomainnode
>> name="tibco-node2-uat.kmefic.com.kw" priority="2"/>
>>                        </failoverdomain>
>>                </failoverdomains>
>>                <resources>
>>                        <ip address="172.16.71.55" monitor_link="1"/>
>>                        <clusterfs device="/dev/vg0/gfsdata"
>> force_unmount="0" fsid="63282"
>> fstype="gfs" mountpoint="/var/www/html" name="gfsdata"
>> self_fence="0"/>
>>                        <apache config_file="conf/httpd.conf"
>> name="docroot"
>> server_root="/etc/httpd" shutdown_wait="0"/>
>>                </resources>
>>                <service autostart="1" domain="prefer_node1" exclusive="0"
>> name="webby" recovery="relocate">
>>                        <ip ref="172.16.71.55"/>
>>                </service>
>>        </rm>
>> </cluster>
>>
>>
>> Kindly investigate and provide us the solution at the earliest.
>>
>> Thanks & Best Regards,
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From jakub.suchy at enlogit.cz  Sat Sep 27 15:25:01 2008
From: jakub.suchy at enlogit.cz (Jakub Suchy)
Date: Sat, 27 Sep 2008 17:25:01 +0200
Subject: [Linux-cluster] Fencing issue using IPMI (nodes fencing each
	other ending in a loop)
Message-ID: <20080927152501.GB21762@aaron>

You may be having the same issue as I have:
[Linux-cluster] Node won't rejoin after reboot

Do you have Red Hat support case open for it? If you do, can you send me
the issue number privately (!) ? I would use it in our support case,
because the issues may be related.

Also:
- Is IGMP Snooping enabled on Cisco switches? Try disabling it, if
  possible.
- RHEL5.2 is using IGMPv3 by default. Try moving to IGMPv2

Jakub

Stevan Colaco wrote:
> Hello All,
> 
> To test i have moved network connectivity from cisco swictes to 5 port
> DLINK swicth.
> Cluster is workig with fencing properly.
> 
> Best Regards,
> -Stevan Colaco
> 
> 
> 2008/9/24 Grisha G. <grigorygor at gmail.com>:
> > In 2 node cluster you should use a quorum disk to solve the split brain
> > problem.
> > after you create a quorum disk change this line in you cluster.conf
> > from <cman expected_votes="1" two_node="1"/>
> > to  <cman expected_votes="3" two_node="0"/>
> >
> > Grisha
> >
> >
> > On Tue, Sep 23, 2008 at 7:27 PM, Stevan Colaco <stevan.colaco at gmail.com>
> > wrote:
> >>
> >> Hello
> >>
> >> issue: Fencing using fence_ipmilan, each node keeps fencing the other
> >> node ending in a fence loop.....
> >>
> >> We have implemented RH Cluster on RHEL5.2 64bit.
> >> Server Hardware: SUN X4150
> >> Storage: SUN 6140
> >> Fencing Machnism: fence_ipmilan
> >>
> >>  We have downloaded the IPMI fence_ipmilan and configured two node
> >> cluster with ipmi fencing. But..
> >>
> >> when we ifdown the NIC interface, the node gets fenced but the service
> >> does not relocate to the other node. at the same time when the
> >> initially fenced node joins back the cluster it fences the other
> >> node......
> >> this keeps on ending in a loop.
> >>
> >> We downloaded and followed the intructions from the ipmi site
> >> mentioned below
> >> http://docs.sun.com/source/819-6588-13/ipmi_com.html#0_74891
> >>
> >> we tested with following  Cmd line method which works fine.
> >> #fence_ipmilan -a "ip addr" -l root -p <Passkey> -o <on|off|reboot>
> >>
> >> here is my cluster.conf
> >>
> >> <?xml version="1.0"?>
> >> <cluster alias="tibcouat" config_version="12" name="tibcouat">
> >>        <fence_daemon clean_start="0" post_fail_delay="0"
> >> post_join_delay="3"/>
> >>        <clusternodes>
> >>                <clusternode name="tibco-node1-uat.kmefic.com.kw"
> >> nodeid="1" votes="1">
> >>                        <fence>
> >>                                <method name="1">
> >>                                        <device name="tibco-node1"/>
> >>                                </method>
> >>                        </fence>
> >>                </clusternode>
> >>                <clusternode name="tibco-node2-uat.kmefic.com.kw"
> >> nodeid="2" votes="1">
> >>                        <fence>
> >>                                <method name="1">
> >>                                        <device name="tibco-node2"/>
> >>                                </method>
> >>                        </fence>
> >>                </clusternode>
> >>        </clusternodes>
> >>        <cman expected_votes="1" two_node="1"/>
> >>        <fencedevices>
> >>                <fencedevice agent="fence_ipmilan" ipaddr="172.16.71.41"
> >> login="root" name="tibco-node1" passwd="changeme"/>
> >>                <fencedevice agent="fence_ipmilan" ipaddr="172.16.71.42"
> >> login="root" name="tibco-node2" passwd="changeme"/>
> >>        </fencedevices>
> >>        <rm>
> >>                <failoverdomains>
> >>                        <failoverdomain name="prefer_node1" nofailback="0"
> >> ordered="1"
> >> restricted="1">
> >>                                <failoverdomainnode
> >> name="tibco-node1-uat.kmefic.com.kw" priority="1"/>
> >>                                <failoverdomainnode
> >> name="tibco-node2-uat.kmefic.com.kw" priority="2"/>
> >>                        </failoverdomain>
> >>                </failoverdomains>
> >>                <resources>
> >>                        <ip address="172.16.71.55" monitor_link="1"/>
> >>                        <clusterfs device="/dev/vg0/gfsdata"
> >> force_unmount="0" fsid="63282"
> >> fstype="gfs" mountpoint="/var/www/html" name="gfsdata"
> >> self_fence="0"/>
> >>                        <apache config_file="conf/httpd.conf"
> >> name="docroot"
> >> server_root="/etc/httpd" shutdown_wait="0"/>
> >>                </resources>
> >>                <service autostart="1" domain="prefer_node1" exclusive="0"
> >> name="webby" recovery="relocate">
> >>                        <ip ref="172.16.71.55"/>
> >>                </service>
> >>        </rm>
> >> </cluster>
> >>
> >>
> >> Kindly investigate and provide us the solution at the earliest.
> >>



From alain.richard at equation.fr  Sun Sep 28 07:39:33 2008
From: alain.richard at equation.fr (Alain Richard)
Date: Sun, 28 Sep 2008 09:39:33 +0200
Subject: [Linux-cluster] pvmove with a clustered VG
In-Reply-To: <779919740809251019t82ef425w6863c4b05044519a@mail.gmail.com>
References: <779919740809251019t82ef425w6863c4b05044519a@mail.gmail.com>
Message-ID: <4F8051BD-00FB-40ED-9B50-346647CFA49D@equation.fr>


Le 25 sept. 08 ? 19:19, Jeremy Lyon a ?crit :

> I wasn't sure which list to, so I chose both cluster and lvm.
>
> My current configuration:
> 2 Node RHEL 5.2 cluster with multiple GFS on top of logical volumes  
> in one volume group.
>
> # rpm -q cman lvm2 lvm2-cluster kmod-gfs
> cman-2.0.84-2.el5
> lvm2-2.02.32-4.el5
> lvm2-cluster-2.02.32-4.el5
> kmod-gfs-0.1.23-5.el5
>
> I need to move a PV out of this volume group so I attempted to run  
> pvmove /dev/sdk1 but errors about locking on the other node.  I  
> assumed this was because of the multiple GFS file systems being used  
> on both nodes (services were spread across the nodes).  So I  
> relocated all services to one node and even stopped rgmanager, gfs,  
> clvmd and cman on the idle node to make sure that no locks would  
> remain open.
>
> I still had issues with running the pvmove.  I saw these messages:
>
> Sep 24 17:56:48 nodea kernel: device-mapper: mirror log: Module for  
> logging t
> ype "clustered-core" not found.
> Sep 24 17:56:48 nodea kernel: device-mapper: table: 253:31: mirror:  
> Error cre
> ating mirror dirty log
> Sep 24 17:56:48 nodea kernel: device-mapper: ioctl: error adding  
> target to ta
> ble
>
> And after about 14% of the move was completed, there was another  
> locking message and many processes went into an uninteruptable sleep  
> state.  Load on the server shot up to around 80.
>
> I finally had to reboot the node and run pvmove --abort to get  
> everything back to working condition.
>
> Is it not possible to run pvmove on a clustered VG?  Any help would  
> be appreciated.
>
> -Jeremy


I have also encountered a similar problem under the current RHEL 5.2  
with clvmd and pvmoving several lv from one pv to an other :

- on small volumes (8 G), pvmove do its job without problem
- on bigger volumes, pvmove seams to hangs at one point, the only  
solution being to reboot the node.
- on volumes having several segments, the node hangs at the end of the  
each segment
- you alway get the warning about creating mirror dirty log, but I  
have found out a bugzilla entry for this one (you may just ignore it)

Also I have "feature" that is also a solution  :

- start the pvmode, then issue a ctrl-c after getting some output (x %  
done)
- the pvmode is supposed to stop at this time, but in fact it  
continues in background !
- the pvmove continue up to the point of the next segment boundary
- if the lv you are moving is multi-segmented, the first segment was  
successfully moved, but you have to issue a pvmove again (and a ctrl- 
c) to continue the next segment

I have succeeded in moving about 1To of data from one SAN to an other  
with that feature !

I think there is both problems here :
- a problem with pvmove output and pvmove background operation. Note  
that I have to do a pvmove + ctrl-c and that a pvmove -b do not work.  
This problem is probably a lvm problem as I have found some other user  
having the same issue under ubuntu (and without cman/clvmd)
- a problem with clvmd locking that is done at the end of each  
segments : at that time the new pvmove mirror segment is made the  
current allocated one and the old one is made free, and the other  
members of the cluster are informed (cvlvmd locking + segment marking  
+ clvmd unlocking). This operation seams to fail with an error about  
locking error.

Regards,

-- 
Alain RICHARD <mailto:alain.richard at equation.fr>
EQUATION SA <http://www.equation.fr/>
Tel : +33 477 79 48 00     Fax : +33 477 79 48 01
Applications client/serveur, ing?nierie r?seau et Linux

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080928/08a9f9d8/attachment.htm>

From macscr at macscr.com  Sun Sep 28 19:04:12 2008
From: macscr at macscr.com (Mark Chaney)
Date: Sun, 28 Sep 2008 14:04:12 -0500
Subject: [Linux-cluster] fencing loop
Message-ID: <001b01c9219d$00495dc0$00dc1940$@com>

My cluster nodes are getting fenced when I simply just restart the network
service. What can I do to prevent that happening so quick? Can I even
prevent it since im using GFS?

I have a 3 node cluster running CentOS 5.2. They all are connected to my
iscsi SAN, so CLVMD is being used and GFS as well.

To make things worse, it appears that 2 servers were fenced when I simply
restarted the network service on one server. One server came back online,
the second one did, but then was immediately fenced again. Seems to loop
until I basically stop the cluster and have everyone rejoin.

Any help/ideas would sincerely be appreciated.

- Mark





From grigorygor at gmail.com  Mon Sep 29 02:52:37 2008
From: grigorygor at gmail.com (Grisha G.)
Date: Mon, 29 Sep 2008 05:52:37 +0300
Subject: [Linux-cluster] fencing loop
In-Reply-To: <001b01c9219d$00495dc0$00dc1940$@com>
References: <001b01c9219d$00495dc0$00dc1940$@com>
Message-ID: <c3c0440e0809281952s5f0150cekffb3eb21a77a987d@mail.gmail.com>

Hi

Show us your conf and logs

On Sun, Sep 28, 2008 at 10:04 PM, Mark Chaney <macscr at macscr.com> wrote:

> My cluster nodes are getting fenced when I simply just restart the network
> service. What can I do to prevent that happening so quick? Can I even
> prevent it since im using GFS?
>
> I have a 3 node cluster running CentOS 5.2. They all are connected to my
> iscsi SAN, so CLVMD is being used and GFS as well.
>
> To make things worse, it appears that 2 servers were fenced when I simply
> restarted the network service on one server. One server came back online,
> the second one did, but then was immediately fenced again. Seems to loop
> until I basically stop the cluster and have everyone rejoin.
>
> Any help/ideas would sincerely be appreciated.
>
> - Mark
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/8ceec5a0/attachment.htm>

From macscr at macscr.com  Mon Sep 29 08:06:51 2008
From: macscr at macscr.com (Mark Chaney)
Date: Mon, 29 Sep 2008 03:06:51 -0500
Subject: [Linux-cluster] proper cluster crash procedures?
Message-ID: <003f01c9220a$55aa6ee0$00ff4ca0$@com>

I have a 3 node cluster that has shared storage using iscsi san, hence I am
using GFS. Anyway, I had it crash for whatever reason, not sure if something
was rebooted incorrectly or what, but now I have been spending the past 2
hours trying to get the cluster back up. I would think that sampling
rebooting all the nodes would work, but heck, that hasn't. What should I be
doing? Should I just start up one at a time? BTW, I am using ipmi for
fencing if that makes a difference. I can post my cluster.conf if that's
helpful, but I would think there would be general techniques available.

Thanks,
Mark






From macscr at macscr.com  Mon Sep 29 08:16:08 2008
From: macscr at macscr.com (Mark Chaney)
Date: Mon, 29 Sep 2008 03:16:08 -0500
Subject: [Linux-cluster] proper cluster crash procedures?
In-Reply-To: <003f01c9220a$55aa6ee0$00ff4ca0$@com>
References: <003f01c9220a$55aa6ee0$00ff4ca0$@com>
Message-ID: <004001c9220b$a1686d90$e43948b0$@com>

Here is my cluster.conf

#########################################

<?xml version="1.0"?>
<cluster alias="myiacon" config_version="16" name="myiacon">
	<fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="60"/>
	<clusternodes>
		<clusternode name="ratchet.local" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="ratchet_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="skydive.local" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="skydive_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="wheeljack.local" nodeid="3" votes="1">
			<fence>
				<method name="1">
					<device name="wheeljack_ipmi"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.100"
login="root" name="ratchet_ipmi" passwd="xxxxx"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.102"
login="root" name="skydive_ipmi" passwd="xxxxx"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.101"
login="root" name="wheeljack_ipmi" passwd="xxxxxx"/>
	</fencedevices>
	<rm>
		<failoverdomains/>
		<resources/>
	</rm>
</cluster>

#############################################

And here is one of the errors I just started getting:

Sep 29 08:10:06 wheeljack openais[5453]: [MAIN ] Killing node ratchet.local
beca    use it has rejoined the cluster with existing state

But half the time, servers just complain that they cant reconnect to the
cluster.


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Mark Chaney
Sent: Monday, September 29, 2008 3:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] proper cluster crash procedures?

I have a 3 node cluster that has shared storage using iscsi san, hence I am
using GFS. Anyway, I had it crash for whatever reason, not sure if something
was rebooted incorrectly or what, but now I have been spending the past 2
hours trying to get the cluster back up. I would think that sampling
rebooting all the nodes would work, but heck, that hasn't. What should I be
doing? Should I just start up one at a time? BTW, I am using ipmi for
fencing if that makes a difference. I can post my cluster.conf if that's
helpful, but I would think there would be general techniques available.

Thanks,
Mark




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From macscr at macscr.com  Mon Sep 29 08:22:05 2008
From: macscr at macscr.com (Mark Chaney)
Date: Mon, 29 Sep 2008 03:22:05 -0500
Subject: [Linux-cluster] proper cluster crash procedures?
In-Reply-To: <004001c9220b$a1686d90$e43948b0$@com>
References: <003f01c9220a$55aa6ee0$00ff4ca0$@com>
	<004001c9220b$a1686d90$e43948b0$@com>
Message-ID: <004101c9220c$76a9c800$63fd5800$@com>

Another common reason why it wont connect is:

Sep 29 03:20:53 skydive ccsd[3192]: Attempt to close an unopened CCS
descriptor (13590).
Sep 29 03:20:53 skydive ccsd[3192]: Error while processing disconnect:
Invalid request descriptor

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Mark Chaney
Sent: Monday, September 29, 2008 3:16 AM
To: 'linux clustering'
Subject: RE: [Linux-cluster] proper cluster crash procedures?

Here is my cluster.conf

#########################################

<?xml version="1.0"?>
<cluster alias="myiacon" config_version="16" name="myiacon">
	<fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="60"/>
	<clusternodes>
		<clusternode name="ratchet.local" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="ratchet_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="skydive.local" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="skydive_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="wheeljack.local" nodeid="3" votes="1">
			<fence>
				<method name="1">
					<device name="wheeljack_ipmi"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.100"
login="root" name="ratchet_ipmi" passwd="xxxxx"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.102"
login="root" name="skydive_ipmi" passwd="xxxxx"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.101"
login="root" name="wheeljack_ipmi" passwd="xxxxxx"/>
	</fencedevices>
	<rm>
		<failoverdomains/>
		<resources/>
	</rm>
</cluster>

#############################################

And here is one of the errors I just started getting:

Sep 29 08:10:06 wheeljack openais[5453]: [MAIN ] Killing node ratchet.local
beca    use it has rejoined the cluster with existing state

But half the time, servers just complain that they cant reconnect to the
cluster.


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Mark Chaney
Sent: Monday, September 29, 2008 3:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] proper cluster crash procedures?

I have a 3 node cluster that has shared storage using iscsi san, hence I am
using GFS. Anyway, I had it crash for whatever reason, not sure if something
was rebooted incorrectly or what, but now I have been spending the past 2
hours trying to get the cluster back up. I would think that sampling
rebooting all the nodes would work, but heck, that hasn't. What should I be
doing? Should I just start up one at a time? BTW, I am using ipmi for
fencing if that makes a difference. I can post my cluster.conf if that's
helpful, but I would think there would be general techniques available.

Thanks,
Mark




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Edoardo.Causarano at laitspa.it  Mon Sep 29 09:50:32 2008
From: Edoardo.Causarano at laitspa.it (Edoardo Causarano)
Date: Mon, 29 Sep 2008 11:50:32 +0200
Subject: R: [Linux-cluster] fencing loop
Message-ID: <B9F5EE1EE80E12479854A81337FB102C014F4326@RLBEMAIL02.interno.regione.lazio.it>

Well, you're using iscsi so restarting network will impact access to the fs (a quorum disk wouldn't help either). Maybe timeouts are too short and the machines are queuing fence commands too soon. How is your fencing config'd? (net, same iface or private, serial) describe scenario more accurately: name machines A B C, it's not clear if two nodes are fencing each other or if the third one is.

-----Original Message-----
From: linux-cluster-bounces at redhat.com <linux-cluster-bounces at redhat.com>
To: linux-cluster at redhat.com <linux-cluster at redhat.com>
Sent: Sun Sep 28 21:04:12 2008
Subject: [Linux-cluster] fencing loop

My cluster nodes are getting fenced when I simply just restart the network
service. What can I do to prevent that happening so quick? Can I even
prevent it since im using GFS?

I have a 3 node cluster running CentOS 5.2. They all are connected to my
iscsi SAN, so CLVMD is being used and GFS as well.

To make things worse, it appears that 2 servers were fenced when I simply
restarted the network service on one server. One server came back online,
the second one did, but then was immediately fenced again. Seems to loop
until I basically stop the cluster and have everyone rejoin.

Any help/ideas would sincerely be appreciated.

- Mark



--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/7ee3f9f3/attachment.htm>

From macscr at macscr.com  Mon Sep 29 15:31:39 2008
From: macscr at macscr.com (Mark Chaney)
Date: Mon, 29 Sep 2008 10:31:39 -0500
Subject: [Linux-cluster] fencing loop
In-Reply-To: <B9F5EE1EE80E12479854A81337FB102C014F4326@RLBEMAIL02.interno.regione.lazio.it>
References: <B9F5EE1EE80E12479854A81337FB102C014F4326@RLBEMAIL02.interno.regione.lazio.it>
Message-ID: <006f01c92248$79168b00$6b43a100$@com>

Here is my cluster.conf

 

#########################################

 

<?xml version="1.0"?>

<cluster alias="myiacon" config_version="16" name="myiacon">

      <fence_daemon clean_start="0" post_fail_delay="0"

post_join_delay="60"/>

      <clusternodes>

            <clusternode name="ratchet.local" nodeid="1" votes="1">

                  <fence>

                        <method name="1">

                              <device name="ratchet_ipmi"/>

                        </method>

                  </fence>

            </clusternode>

            <clusternode name="skydive.local" nodeid="2" votes="1">

                  <fence>

                        <method name="1">

                              <device name="skydive_ipmi"/>

                        </method>

                  </fence>

            </clusternode>

            <clusternode name="wheeljack.local" nodeid="3" votes="1">

                  <fence>

                        <method name="1">

                              <device name="wheeljack_ipmi"/>

                        </method>

                  </fence>

            </clusternode>

      </clusternodes>

      <cman/>

      <fencedevices>

            <fencedevice agent="fence_ipmilan" ipaddr="192.168.1.100"

login="root" name="ratchet_ipmi" passwd="xxxxx"/>

            <fencedevice agent="fence_ipmilan" ipaddr="192.168.1.102"

login="root" name="skydive_ipmi" passwd="xxxxx"/>

            <fencedevice agent="fence_ipmilan" ipaddr="192.168.1.101"

login="root" name="wheeljack_ipmi" passwd="xxxxxx"/>

      </fencedevices>

      <rm>

            <failoverdomains/>

            <resources/>

      </rm>

</cluster>

 

#############################################

 

 

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Edoardo Causarano
Sent: Monday, September 29, 2008 4:51 AM
To: linux-cluster at redhat.com
Subject: R: [Linux-cluster] fencing loop

 

Well, you're using iscsi so restarting network will impact access to the fs
(a quorum disk wouldn't help either). Maybe timeouts are too short and the
machines are queuing fence commands too soon. How is your fencing config'd?
(net, same iface or private, serial) describe scenario more accurately: name
machines A B C, it's not clear if two nodes are fencing each other or if the
third one is.

-----Original Message-----
From: linux-cluster-bounces at redhat.com <linux-cluster-bounces at redhat.com>
To: linux-cluster at redhat.com <linux-cluster at redhat.com>
Sent: Sun Sep 28 21:04:12 2008
Subject: [Linux-cluster] fencing loop

My cluster nodes are getting fenced when I simply just restart the network
service. What can I do to prevent that happening so quick? Can I even
prevent it since im using GFS?

I have a 3 node cluster running CentOS 5.2. They all are connected to my
iscsi SAN, so CLVMD is being used and GFS as well.

To make things worse, it appears that 2 servers were fenced when I simply
restarted the network service on one server. One server came back online,
the second one did, but then was immediately fenced again. Seems to loop
until I basically stop the cluster and have everyone rejoin.

Any help/ideas would sincerely be appreciated.

- Mark



--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/3fb3df1f/attachment.htm>

From Dave.Jones at maritz.com  Mon Sep 29 14:09:13 2008
From: Dave.Jones at maritz.com (Jones, Dave)
Date: Mon, 29 Sep 2008 09:09:13 -0500
Subject: [Linux-cluster] RE: HP DL380 G4 & cluster fencing options
In-Reply-To: <EC61DD7B6048464AB0E1B713AF7521BC1655C73E58@GVW0676EXC.americas.hpqcorp.net>
References: <DDF0CD44B9EED841B556D44B7929813A05A878D3@FENEXCH1602C.us.maritz.net><DDF0CD44B9EED841B556D44B7929813A05A878D4@FENEXCH1602C.us.maritz.net>
	<EC61DD7B6048464AB0E1B713AF7521BC1655C73E58@GVW0676EXC.americas.hpqcorp.net>
Message-ID: <DDF0CD44B9EED841B556D44B7929813A05A878D5@FENEXCH1602C.us.maritz.net>

Hello James.

Thank you for the reply.
As I was trying to explain, the ILO network is out of band, therefore
there is no way for the cluster to connect to, login or otherwise
interact with the ILO controller.  ILO is on an different subnet and
different cabling infrastructure with no routing to the rest of the LAN.
So unless there is something I'm not understanding, ILO is not an
option.

I'm looking for other ways to fence these systems.

Thanks,
Dave


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Hofmeister, James
(WTEC Linux)
Sent: Friday, September 26, 2008 11:42 PM
To: linux clustering
Subject: [Linux-cluster] RE: HP DL380 G4 & cluster fencing options

Hello Dave,

RE: HP DL380 G4 & cluster fencing options

Using ILO fencing should work fine as long as the ILO for 'all' of the
cluster node are connected to the same physical network and IP subnet.
The ILO IP network does not need to be the same as the heartbeat/admin
traffic network.

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces at redhat.com
|[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jones, Dave
|Sent: Friday, September 26, 2008 2:52 PM
|To: Linux-cluster at redhat.com
|Subject: [Linux-cluster] HP DL380 G4 & cluster fencing options
|
|
|
|(I apologize if this double-posts, long story)
|
|Hello all.
|
|I have what I hope are a few simple questions on RHEL 5.2
|clustering using HP DL380 hardware.
|
|Long story short, we have several HP DL380-G4s that we are
|trying to set up in a test cluster but we do not have any of
|the obvious fencing options available to us.  ILO fencing does
|not seem to be an option because our standards dictate ILO be
|on an out-of-band network.
|We have SCSI fencing in place but I'm not convinced it is
|working properly nor that it is reliable.
|I hope to find funding to get an APC Masterswitch in place but
|that is not our standard so I need approval.
|
|In the meantime, can anyone offer any answers on the following
|questions?
|
|1) Are TripLite or other power controllers supported aside
|from what is on the hardware list?
|        (http://www.redhat.com/cluster_suite/hardware/)
|
|2) Is there a reliable way to change the SAN switch fencing
|controls to use SNMP (preferred) or perhaps SSH?
|        According to 'man fence_brocade' the system uses telnet.
|        An interactive login to a 256-port switch is bad
|enough, but doing so over an unencrypted telnet session just
|isn't going to fly!
|        I would think this would be a relatively trivial
|change to the script.
|        Has anyone done something like it?
|
|3) Does anyone have any experience clustering DL380s and have
|recommended Primary & Secondary fencing methods or advice they
|would be willing to share?
|
|4) Is anyone out there using only 1 fencing method with GFS
|filesystems in Production environment?
|
|Thanks,
|Dave
|
|Confidentiality Warning:  This e-mail contains information
|intended only for the use of the individual or entity named
|above.  If the reader of this e-mail is not the intended
|recipient or the employee or agent responsible for delivering
|it to the intended recipient, any dissemination, publication
|or copying of this e-mail is strictly prohibited.  The sender
|does not accept any responsibility for any loss, disruption or
|damage to your data or computer system that may occur while
|using data contained in, or transmitted with, this e-mail.
|If you have received this e-mail in error, please immediately
|notify us by return e-mail.  Thank you.
|
|
|
|
|--
|Linux-cluster mailing list
|Linux-cluster at redhat.com
|https://www.redhat.com/mailman/listinfo/linux-cluster
|

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Confidentiality Warning:  This e-mail contains information intended only for the use of the individual or entity named above.  If the reader of this e-mail is not the intended recipient or the employee or agent responsible for delivering it to the intended recipient, any dissemination, publication or copying of this e-mail is strictly prohibited.  The sender does not accept any responsibility for any loss, disruption or damage to your data or computer system that may occur while using data contained in, or transmitted with, this e-mail.  
If you have received this e-mail in error, please immediately notify us by return e-mail.  Thank you.






From alan.zg at gmail.com  Mon Sep 29 18:25:43 2008
From: alan.zg at gmail.com (Alan A)
Date: Mon, 29 Sep 2008 13:25:43 -0500
Subject: [Linux-cluster] proper cluster crash procedures?
In-Reply-To: <003f01c9220a$55aa6ee0$00ff4ca0$@com>
References: <003f01c9220a$55aa6ee0$00ff4ca0$@com>
Message-ID: <fac531740809291125u500684e4sb83bb7f9fb8e9e09@mail.gmail.com>

I am not sure that this will help but try disabling following services
before reboot and then have them join the cluster afterwards:
chkconfig rgmanager off
chkconfig gfs off
chkconfig clvmd off
chkconfig cman off

Then when node is up - start the services in this order
service cman start
service clvmd start
service gfs start
service rgmanager start

Stopping services should go in the reverse order. Don't forget to chkconfig
<servicename> on after you have the cluster working again.

On Mon, Sep 29, 2008 at 3:06 AM, Mark Chaney <macscr at macscr.com> wrote:

> I have a 3 node cluster that has shared storage using iscsi san, hence I am
> using GFS. Anyway, I had it crash for whatever reason, not sure if
> something
> was rebooted incorrectly or what, but now I have been spending the past 2
> hours trying to get the cluster back up. I would think that sampling
> rebooting all the nodes would work, but heck, that hasn't. What should I be
> doing? Should I just start up one at a time? BTW, I am using ipmi for
> fencing if that makes a difference. I can post my cluster.conf if that's
> helpful, but I would think there would be general techniques available.
>
> Thanks,
> Mark
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Alan A.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/d0954ac3/attachment.htm>

From macscr at macscr.com  Mon Sep 29 19:44:25 2008
From: macscr at macscr.com (Mark Chaney)
Date: Mon, 29 Sep 2008 14:44:25 -0500
Subject: [Linux-cluster] proper cluster crash procedures?
In-Reply-To: <fac531740809291125u500684e4sb83bb7f9fb8e9e09@mail.gmail.com>
References: <003f01c9220a$55aa6ee0$00ff4ca0$@com>
	<fac531740809291125u500684e4sb83bb7f9fb8e9e09@mail.gmail.com>
Message-ID: <00a801c9226b$c8bc04a0$5a340de0$@com>

Thanks for the tips, but Ive found that one of my biggest problems is
getting the services to actually stop. The Cluster Service Manager usually
hangs when I try to stop it. 

 

Also, did you take a look at my editions to this thread? I listed my
cluster.conf and some of the error messages I have been getting.

 

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alan A
Sent: Monday, September 29, 2008 1:26 PM
To: linux clustering
Subject: Re: [Linux-cluster] proper cluster crash procedures?

 

I am not sure that this will help but try disabling following services
before reboot and then have them join the cluster afterwards:
chkconfig rgmanager off
chkconfig gfs off
chkconfig clvmd off
chkconfig cman off

Then when node is up - start the services in this order
service cman start
service clvmd start
service gfs start
service rgmanager start

Stopping services should go in the reverse order. Don't forget to chkconfig
<servicename> on after you have the cluster working again.

On Mon, Sep 29, 2008 at 3:06 AM, Mark Chaney <macscr at macscr.com> wrote:

I have a 3 node cluster that has shared storage using iscsi san, hence I am
using GFS. Anyway, I had it crash for whatever reason, not sure if something
was rebooted incorrectly or what, but now I have been spending the past 2
hours trying to get the cluster back up. I would think that sampling
rebooting all the nodes would work, but heck, that hasn't. What should I be
doing? Should I just start up one at a time? BTW, I am using ipmi for
fencing if that makes a difference. I can post my cluster.conf if that's
helpful, but I would think there would be general techniques available.

Thanks,
Mark




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




-- 
Alan A.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/a77b3ae5/attachment.htm>

From james.hofmeister at hp.com  Mon Sep 29 20:36:15 2008
From: james.hofmeister at hp.com (Hofmeister, James (WTEC Linux))
Date: Mon, 29 Sep 2008 20:36:15 +0000
Subject: [Linux-cluster] RE: HP DL380 G4 & cluster fencing options
In-Reply-To: <DDF0CD44B9EED841B556D44B7929813A05A878D5@FENEXCH1602C.us.maritz.net>
References: <DDF0CD44B9EED841B556D44B7929813A05A878D3@FENEXCH1602C.us.maritz.net><DDF0CD44B9EED841B556D44B7929813A05A878D4@FENEXCH1602C.us.maritz.net>
	<EC61DD7B6048464AB0E1B713AF7521BC1655C73E58@GVW0676EXC.americas.hpqcorp.net>
	<DDF0CD44B9EED841B556D44B7929813A05A878D5@FENEXCH1602C.us.maritz.net>
Message-ID: <EC61DD7B6048464AB0E1B713AF7521BC1655C74433@GVW0676EXC.americas.hpqcorp.net>

Hello Dave,

Sorry, I get what you are saying now...  You have *no* system NIC on the same physical network as the ILO.

|Regards,
|James Hofmeister
|Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces at redhat.com
|[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jones, Dave
|Sent: Monday, September 29, 2008 7:09 AM
|To: linux clustering
|Subject: RE: [Linux-cluster] RE: HP DL380 G4 & cluster fencing options
|
|Hello James.
|
|Thank you for the reply.
|As I was trying to explain, the ILO network is out of band,
|therefore there is no way for the cluster to connect to, login
|or otherwise interact with the ILO controller.  ILO is on an
|different subnet and different cabling infrastructure with no
|routing to the rest of the LAN.
|So unless there is something I'm not understanding, ILO is not
|an option.
|
|I'm looking for other ways to fence these systems.
|
|Thanks,
|Dave
|
|
|-----Original Message-----
|From: linux-cluster-bounces at redhat.com
|[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
|Hofmeister, James (WTEC Linux)
|Sent: Friday, September 26, 2008 11:42 PM
|To: linux clustering
|Subject: [Linux-cluster] RE: HP DL380 G4 & cluster fencing options
|
|Hello Dave,
|
|RE: HP DL380 G4 & cluster fencing options
|
|Using ILO fencing should work fine as long as the ILO for
|'all' of the cluster node are connected to the same physical
|network and IP subnet.
|The ILO IP network does not need to be the same as the
|heartbeat/admin traffic network.
|
|Regards,
|James Hofmeister
|Hewlett Packard Linux Solutions Engineer
|
||-----Original Message-----
||From: linux-cluster-bounces at redhat.com
||[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jones, Dave
||Sent: Friday, September 26, 2008 2:52 PM
||To: Linux-cluster at redhat.com
||Subject: [Linux-cluster] HP DL380 G4 & cluster fencing options
||
||
||
||(I apologize if this double-posts, long story)
||
||Hello all.
||
||I have what I hope are a few simple questions on RHEL 5.2 clustering
||using HP DL380 hardware.
||
||Long story short, we have several HP DL380-G4s that we are trying to
||set up in a test cluster but we do not have any of the
|obvious fencing
||options available to us.  ILO fencing does not seem to be an option
||because our standards dictate ILO be on an out-of-band network.
||We have SCSI fencing in place but I'm not convinced it is working
||properly nor that it is reliable.
||I hope to find funding to get an APC Masterswitch in place
|but that is
||not our standard so I need approval.
||
||In the meantime, can anyone offer any answers on the following
||questions?
||
||1) Are TripLite or other power controllers supported aside
|from what is
||on the hardware list?
||        (http://www.redhat.com/cluster_suite/hardware/)
||
||2) Is there a reliable way to change the SAN switch fencing
|controls to
||use SNMP (preferred) or perhaps SSH?
||        According to 'man fence_brocade' the system uses telnet.
||        An interactive login to a 256-port switch is bad enough, but
||doing so over an unencrypted telnet session just isn't going to fly!
||        I would think this would be a relatively trivial
|change to the
||script.
||        Has anyone done something like it?
||
||3) Does anyone have any experience clustering DL380s and have
||recommended Primary & Secondary fencing methods or advice
|they would be
||willing to share?
||
||4) Is anyone out there using only 1 fencing method with GFS
|filesystems
||in Production environment?
||
||Thanks,
||Dave
||
||Confidentiality Warning:  This e-mail contains information intended
||only for the use of the individual or entity named above.  If the
||reader of this e-mail is not the intended recipient or the
|employee or
||agent responsible for delivering it to the intended recipient, any
||dissemination, publication or copying of this e-mail is strictly
||prohibited.  The sender does not accept any responsibility for any
||loss, disruption or damage to your data or computer system that may
||occur while using data contained in, or transmitted with, this e-mail.
||If you have received this e-mail in error, please immediately
|notify us
||by return e-mail.  Thank you.
||
||
||
||
||--
||Linux-cluster mailing list
||Linux-cluster at redhat.com
||https://www.redhat.com/mailman/listinfo/linux-cluster
||
|
|--
|Linux-cluster mailing list
|Linux-cluster at redhat.com
|https://www.redhat.com/mailman/listinfo/linux-cluster
|
|Confidentiality Warning:  This e-mail contains information
|intended only for the use of the individual or entity named
|above.  If the reader of this e-mail is not the intended
|recipient or the employee or agent responsible for delivering
|it to the intended recipient, any dissemination, publication
|or copying of this e-mail is strictly prohibited.  The sender
|does not accept any responsibility for any loss, disruption or
|damage to your data or computer system that may occur while
|using data contained in, or transmitted with, this e-mail.
|If you have received this e-mail in error, please immediately
|notify us by return e-mail.  Thank you.
|
|
|
|
|--
|Linux-cluster mailing list
|Linux-cluster at redhat.com
|https://www.redhat.com/mailman/listinfo/linux-cluster
|



From terrybdavis at gmail.com  Mon Sep 29 21:59:03 2008
From: terrybdavis at gmail.com (Terry Davis)
Date: Mon, 29 Sep 2008 16:59:03 -0500
Subject: [Linux-cluster] adding volume to cluster
Message-ID: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>

Hello,

I am having a heck of a time getting a volume to show up in my cluster.  I
have a feeling I am doing something wrong but this isn't the first one I've
added so I'm not sure where I got lucky before.  Here is what I've done thus
far in my 2 node RHEL5 cluster:

1) Created my volume in my SAN and gave both nodes access to it
2) on node A: created 4TB partition with parted and made a gpt label
3) on node A: pvcreate /dev/sdc1
4) on node B: vgcreate vg_data01e /dev/sdc1
5) on both nodes: vgchange -a y
6) on node A: lvcreate -n lv_data01e vg_data01e

I get the error:
  Error locking on node omadvnfs01b: Volume group for uuid not found:
p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
  Error locking on node omadvnfs01a: Volume group for uuid not found:
p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
  Aborting. Failed to activate new LV to wipe the start of it.

I tried restarting clvmd for good measure.  Still no luck.  What am I doing
wrong?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/c8451db6/attachment.htm>

From terrybdavis at gmail.com  Mon Sep 29 22:33:09 2008
From: terrybdavis at gmail.com (Terry Davis)
Date: Mon, 29 Sep 2008 17:33:09 -0500
Subject: [Linux-cluster] Re: adding volume to cluster
In-Reply-To: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
References: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
Message-ID: <14139e3a0809291533v682b9d64s20e4ae0385a7b38f@mail.gmail.com>

On Mon, Sep 29, 2008 at 4:59 PM, Terry Davis <terrybdavis at gmail.com> wrote:

> Hello,
>
> I am having a heck of a time getting a volume to show up in my cluster.  I
> have a feeling I am doing something wrong but this isn't the first one I've
> added so I'm not sure where I got lucky before.  Here is what I've done thus
> far in my 2 node RHEL5 cluster:
>
> 1) Created my volume in my SAN and gave both nodes access to it
> 2) on node A: created 4TB partition with parted and made a gpt label
> 3) on node A: pvcreate /dev/sdc1
> 4) on node B: vgcreate vg_data01e /dev/sdc1
> 5) on both nodes: vgchange -a y
> 6) on node A: lvcreate -n lv_data01e vg_data01e
>
> I get the error:
>   Error locking on node omadvnfs01b: Volume group for uuid not found:
> p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
>   Error locking on node omadvnfs01a: Volume group for uuid not found:
> p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
>   Aborting. Failed to activate new LV to wipe the start of it.
>
> I tried restarting clvmd for good measure.  Still no luck.  What am I doing
> wrong?
>
>
>
I found something else peculiar:

on node A:
[root at omadvnfs01a ~]# lvmdiskscan
  /dev/sdc1                  [        4.00 TB] LVM physical volume

on node B:
[root at omadvnfs01b ~]# lvmdiskscan
  /dev/sdh1                  [        3.81 TB] LVM physical volume

Not sure how they could be different sizes.  I don't see any references in
/dev/mapper either.  Shouldn't those be created with the VG?

Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080929/ea3ee9c0/attachment.htm>

From jeff.sturm at eprize.com  Tue Sep 30 12:20:55 2008
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 30 Sep 2008 08:20:55 -0400
Subject: [Linux-cluster] adding volume to cluster
In-Reply-To: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
References: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F018063CD@hugo.eprize.local>

If this volume group is to be shared by multiple hosts you need a
clustered volume group.
 
Make sure clvmd is running, that you've set the locking_type in lvm.conf
and and pass "-c y" arguments to vgcreate.


________________________________

	From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Terry Davis
	Sent: Monday, September 29, 2008 5:59 PM
	To: linux-cluster at redhat.com
	Subject: [Linux-cluster] adding volume to cluster
	
	
	Hello,
	
	I am having a heck of a time getting a volume to show up in my
cluster.  I have a feeling I am doing something wrong but this isn't the
first one I've added so I'm not sure where I got lucky before.  Here is
what I've done thus far in my 2 node RHEL5 cluster:
	
	1) Created my volume in my SAN and gave both nodes access to it
	2) on node A: created 4TB partition with parted and made a gpt
label
	3) on node A: pvcreate /dev/sdc1
	4) on node B: vgcreate vg_data01e /dev/sdc1
	5) on both nodes: vgchange -a y
	6) on node A: lvcreate -n lv_data01e vg_data01e
	
	I get the error:
	  Error locking on node omadvnfs01b: Volume group for uuid not
found: p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
	  Error locking on node omadvnfs01a: Volume group for uuid not
found: p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
	  Aborting. Failed to activate new LV to wipe the start of it.
	
	I tried restarting clvmd for good measure.  Still no luck.  What
am I doing wrong?
	
	
	

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080930/c832b941/attachment.htm>

From terrybdavis at gmail.com  Tue Sep 30 13:10:31 2008
From: terrybdavis at gmail.com (Terry Davis)
Date: Tue, 30 Sep 2008 08:10:31 -0500
Subject: [Linux-cluster] adding volume to cluster
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F018063CD@hugo.eprize.local>
References: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
	<64D0546C5EBBD147B75DE133D798665F018063CD@hugo.eprize.local>
Message-ID: <14139e3a0809300610q1318d16agf2df7a92459d87d6@mail.gmail.com>

2008/9/30 Jeff Sturm <jeff.sturm at eprize.com>

>  If this volume group is to be shared by multiple hosts you need a
> clustered volume group.
>
> Make sure clvmd is running, that you've set the locking_type in lvm.conf
> and and pass "-c y" arguments to vgcreate.
>
>  ------------------------------
> **
>
>
I removed my volume group and recreated it with -c y, but I still can't
create a volume:

[root at omadvnfs01a ~]# vgcreate -c y vg_data01e /dev/sdc1
  Volume group "vg_data01e" successfully created
[root at omadvnfs01a ~]# lvcreate --size 4T -n lv_data01e vg_data01e
  Error locking on node omadvnfs01b: device-mapper: create ioctl failed:
Device or resource busy
  Aborting. Failed to activate new LV to wipe the start of it.

my locking type is 3.


It looks like the VG got created on both nodes:
on A node:
  --- Volume group ---
  VG Name               vg_data01e
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               4.00 TB
  PE Size               4.00 MB
  Total PE              1048578
  Alloc PE / Size       0 / 0
  Free  PE / Size       1048578 / 4.00 TB
  VG UUID               rT2b6W-OAsO-fBm7-je9M-dH7W-emvR-ehkU4z

on node B:
  --- Volume group ---
  VG Name               vg_data01e
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               4.00 TB
  PE Size               4.00 MB
  Total PE              1048578
  Alloc PE / Size       0 / 0
  Free  PE / Size       1048578 / 4.00 TB
  VG UUID               rT2b6W-OAsO-fBm7-je9M-dH7W-emvR-ehkU4z


Any ideas?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080930/b475912e/attachment.htm>

From julianorodrigues at mac.com  Tue Sep 30 15:28:01 2008
From: julianorodrigues at mac.com (Juliano Rodrigues)
Date: Tue, 30 Sep 2008 12:28:01 -0300
Subject: [Linux-cluster] Distributed Replicated GFS shared storage
Message-ID: <C5CC877B-7125-4F1F-A362-AD4877EDA3D4@mac.com>

Hello,

In order to design an HA project I need a solution to replicate one  
GFS shared storage to another "hot" (standby) GFS "mirror", in case of  
my primary shared storage permanently fail.

Inside RHEL Advanced Platform there is any supported way to accomplish  
that?

Can anyone point a path or tell any experience in this scenario?? What  
can I do about that?

Thanks, Juliano Rodrigues.



From sakect at gmail.com  Tue Sep 30 15:46:50 2008
From: sakect at gmail.com (POWERBALL ONLINE)
Date: Tue, 30 Sep 2008 22:46:50 +0700
Subject: [Linux-cluster] adding volume to cluster
In-Reply-To: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
References: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
Message-ID: <c99654a0809300846j5e803b9clccd562f3ffd12284@mail.gmail.com>

Hello,

Are you already update patch?
Because it is teh LVM bug you can find it in redhat bugzila.

Best Regards,

Somsak

2008/9/30 Terry Davis <terrybdavis at gmail.com>

> Hello,
>
> I am having a heck of a time getting a volume to show up in my cluster.  I
> have a feeling I am doing something wrong but this isn't the first one I've
> added so I'm not sure where I got lucky before.  Here is what I've done thus
> far in my 2 node RHEL5 cluster:
>
> 1) Created my volume in my SAN and gave both nodes access to it
> 2) on node A: created 4TB partition with parted and made a gpt label
> 3) on node A: pvcreate /dev/sdc1
> 4) on node B: vgcreate vg_data01e /dev/sdc1
> 5) on both nodes: vgchange -a y
> 6) on node A: lvcreate -n lv_data01e vg_data01e
>
> I get the error:
>   Error locking on node omadvnfs01b: Volume group for uuid not found:
> p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
>   Error locking on node omadvnfs01a: Volume group for uuid not found:
> p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
>   Aborting. Failed to activate new LV to wipe the start of it.
>
> I tried restarting clvmd for good measure.  Still no luck.  What am I doing
> wrong?
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080930/6e28df89/attachment.htm>

From terrybdavis at gmail.com  Tue Sep 30 16:33:06 2008
From: terrybdavis at gmail.com (Terry Davis)
Date: Tue, 30 Sep 2008 11:33:06 -0500
Subject: [Linux-cluster] adding volume to cluster
In-Reply-To: <c99654a0809300846j5e803b9clccd562f3ffd12284@mail.gmail.com>
References: <14139e3a0809291459y6d4beb36sf69feb14c18f4b72@mail.gmail.com>
	<c99654a0809300846j5e803b9clccd562f3ffd12284@mail.gmail.com>
Message-ID: <14139e3a0809300933s5014e4fp2465554c0bd1255a@mail.gmail.com>

lvm2-2.02.32-4.el5
lvm2-cluster-2.02.32-4.el5

I don't have any patches available for LVM.

2008/9/30 POWERBALL ONLINE <sakect at gmail.com>

> Hello,
>
> Are you already update patch?
> Because it is teh LVM bug you can find it in redhat bugzila.
>
> Best Regards,
>
> Somsak
>
> 2008/9/30 Terry Davis <terrybdavis at gmail.com>
>
>> Hello,
>>
>> I am having a heck of a time getting a volume to show up in my cluster.  I
>> have a feeling I am doing something wrong but this isn't the first one I've
>> added so I'm not sure where I got lucky before.  Here is what I've done thus
>> far in my 2 node RHEL5 cluster:
>>
>> 1) Created my volume in my SAN and gave both nodes access to it
>> 2) on node A: created 4TB partition with parted and made a gpt label
>> 3) on node A: pvcreate /dev/sdc1
>> 4) on node B: vgcreate vg_data01e /dev/sdc1
>> 5) on both nodes: vgchange -a y
>> 6) on node A: lvcreate -n lv_data01e vg_data01e
>>
>> I get the error:
>>   Error locking on node omadvnfs01b: Volume group for uuid not found:
>> p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
>>   Error locking on node omadvnfs01a: Volume group for uuid not found:
>> p9SfIjriPtXY33G1Yi3YdojvAAAzmAuwlOLqhVzX8mqL6goiVmUAgQZLGcDnX324
>>   Aborting. Failed to activate new LV to wipe the start of it.
>>
>> I tried restarting clvmd for good measure.  Still no luck.  What am I
>> doing wrong?
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20080930/a8726e52/attachment.htm>

From billpp at gmail.com  Tue Sep 30 18:50:30 2008
From: billpp at gmail.com (Flavio Junior)
Date: Tue, 30 Sep 2008 15:50:30 -0300
Subject: [Linux-cluster] Distributed Replicated GFS shared storage
In-Reply-To: <C5CC877B-7125-4F1F-A362-AD4877EDA3D4@mac.com>
References: <C5CC877B-7125-4F1F-A362-AD4877EDA3D4@mac.com>
Message-ID: <58aa8d780809301150v5ff58ad3tb48461e650099243@mail.gmail.com>

Maybe DRBD is what you are looking for. www.drbd.org

But, I thought the best way to do it would be using a storage mirror
technology as "Enhanced Remote Mirror" by IBM, Data Replicator by
Veritas ...

--

Fl?vio do Carmo J?nior aka waKKu

On Tue, Sep 30, 2008 at 12:28 PM, Juliano Rodrigues
<julianorodrigues at mac.com> wrote:
> Hello,
>
> In order to design an HA project I need a solution to replicate one GFS
> shared storage to another "hot" (standby) GFS "mirror", in case of my
> primary shared storage permanently fail.
>
> Inside RHEL Advanced Platform there is any supported way to accomplish that?
>
> Can anyone point a path or tell any experience in this scenario?? What can I
> do about that?
>
> Thanks, Juliano Rodrigues.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From joseparrella at gmail.com  Tue Sep 30 21:16:33 2008
From: joseparrella at gmail.com (=?UTF-8?B?Sm9zw6kgTWlndWVsIFBhcnJlbGxhIFJvbWVybw==?=)
Date: Tue, 30 Sep 2008 16:46:33 -0430
Subject: [Linux-cluster] Distributed Replicated GFS shared storage
In-Reply-To: <C5CC877B-7125-4F1F-A362-AD4877EDA3D4@mac.com>
References: <C5CC877B-7125-4F1F-A362-AD4877EDA3D4@mac.com>
Message-ID: <48E29731.4010604@gmail.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Juliano Rodrigues escribi?, en fecha 30/09/08 10:58:
> Hello,
> 
> In order to design an HA project I need a solution to replicate one GFS
> shared storage to another "hot" (standby) GFS "mirror", in case of my
> primary shared storage permanently fail.
> 
> Inside RHEL Advanced Platform there is any supported way to accomplish
> that?

I believe the whole point of GFS is avoiding you to spend twice your
storage capacity just for the sake of storage distribution. It already
enables you to have a standby server which can go live through a
resource manager whenever you need it.

However, if you need to have two separate storage facilities which sync
in one way, DRBD is probably the easiest way to do so. Heartbeat can
manage DRBD resources at block- and filesystem-level easily, and other
resource managers can probably do so (though I haven't used them)

HTH,
Jose
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjilzEACgkQUWAsjQBcO4KhwQCeM0lxhXfCwxiAigfi+39pHGog
alwAn3UilZcaPU009vaoxVhXFV6J5KqY
=IVLO
-----END PGP SIGNATURE-----



From Edoardo.Causarano at laitspa.it  Tue Sep 30 23:02:49 2008
From: Edoardo.Causarano at laitspa.it (Edoardo Causarano)
Date: Wed, 1 Oct 2008 01:02:49 +0200
Subject: R: Re: [Linux-cluster] Distributed Replicated GFS shared storage
Message-ID: <B9F5EE1EE80E12479854A81337FB102C014F4327@RLBEMAIL02.interno.regione.lazio.it>

You want to protect against phisycal loss of your disk chassis. Every vendor has it's own proprietary mirroring system; see what netapp, IBM, HP, mcsq have to say. Else see if you can publish local and remote LUNs on the same SAN (*WDM on a dark fiber) and make an LVM mirror with them. It would require some serious maintenance ;)

E



-----Original Message-----
From: linux-cluster-bounces at redhat.com <linux-cluster-bounces at redhat.com>
To: linux clustering <linux-cluster at redhat.com>
Sent: Tue Sep 30 20:50:30 2008
Subject: Re: [Linux-cluster] Distributed Replicated GFS shared storage

Maybe DRBD is what you are looking for. www.drbd.org

But, I thought the best way to do it would be using a storage mirror
technology as "Enhanced Remote Mirror" by IBM, Data Replicator by
Veritas ...

--

Fl?vio do Carmo J?nior aka waKKu

On Tue, Sep 30, 2008 at 12:28 PM, Juliano Rodrigues
<julianorodrigues at mac.com> wrote:
> Hello,
>
> In order to design an HA project I need a solution to replicate one GFS
> shared storage to another "hot" (standby) GFS "mirror", in case of my
> primary shared storage permanently fail.
>
> Inside RHEL Advanced Platform there is any supported way to accomplish that?
>
> Can anyone point a path or tell any experience in this scenario?? What can I
> do about that?
>
> Thanks, Juliano Rodrigues.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20081001/4625c8d0/attachment.htm>