From swhiteho at redhat.com Thu Nov 1 12:13:51 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 01 Nov 2012 12:13:51 +0000 Subject: [Linux-cluster] gfs2_tool unfreeze hang In-Reply-To: References: Message-ID: <1351772031.2708.24.camel@menhir> Hi, On Wed, 2012-10-31 at 14:07 -0500, james pedia wrote: > Noticed this thread for the same issue at: > > > https://www.redhat.com/archives/linux-cluster/2012-September/msg00084.html: > > > I think I hit the same issue: > > > (CentOS6.3) > # uname -r > 2.6.32-279.el6.x86_64 > > > gfs2-utils-3.0.12.1-32.el6_3.1.x86_64 is in use here. > > > > > # gfs2_tool freeze /var/www/html > # ls -l /var/www/html/ > total 8 > -rw-r--r-- 1 root root 10 Oct 30 23:47 a > -rw-r--r-- 1 root root 41 Oct 30 20:44 index.html > # cp /var/www/html/a /var/www/html/b > (HANG HERE) > > > Then try this: > # gfs2_tool unfreeze /var/www/html > (HANG AS WELL) > > > The whole cluster has to be reset to recover from this. > > > 'dmsetup suspend' and 'dmsetup resume' are working fine. > > > Are these commands basically doing the same thing ('dmsetup suspend' > vs 'gfs2_tool freeze')? > > > Is there a way to see if GFS2 file system is currently being suspended > or frozen? > Yes they do the same thing. I'd always recommend dmsetup suspend over the gfs2_tool method though, since the latter is going away in due course. There is, unfortunately, no way to check the suspend status of a GFS2 filesystem currently, Steve. > > Thanks, > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From zheka at uvt.cz Fri Nov 2 16:25:08 2012 From: zheka at uvt.cz (Yevheniy Demchenko) Date: Fri, 02 Nov 2012 17:25:08 +0100 Subject: [Linux-cluster] Monitoring Frequency - can it be changed? In-Reply-To: References: Message-ID: <5093F3E4.7090107@uvt.cz> Monitoring frequencies may be defined per resource in cluster.conf, i.e.: Detailed info here: https://fedorahosted.org/cluster/wiki/ResourceActions Also, one can change default action times per resource type in resource-agent meta-data in section. Ing. Yevheniy Demchenko Senior Linux Administrator UVT s.r.o. On 10/30/2012 11:34 AM, Parvez Shaikh wrote: > Hi experts, > > Can we change frequency at which resources are monitored by Cluster? > > I observed 30 seconds as monitoring frequency. > > Thanks, > Parvez > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Sat Nov 3 11:49:40 2012 From: carlopmart at gmail.com (C. L. Martinez) Date: Sat, 3 Nov 2012 11:49:40 +0000 Subject: [Linux-cluster] Some problems with fence_virt.conf Message-ID: Hi all, I am trying to setup a virtual kvm guest cluster under a centos 6.3 x86_64 (guests are CentOS 6.3, too). When I have setup fence_virt.conf (I will use fence_xvm/fence_virt to fence guests), these errors appears: [root at kvmhost etc]# fence_virtd -F -d99 Background mode disabled Debugging threshold is now 99 fence_virtd { debug = "99"; listener = "multicast"; backend = "libvirt"; module_path = "/usr/lib64/fence-virt"; } listeners { multicast { key_file = "/etc/cluster/fence_xvm.key"; address = "255.0.0.15"; family = "ipv4"; port = "1229"; interface = "prodif"; } } backends { libvirt { uri = "qemu:///system"; } } Backend plugin: libvirt Listener plugin: multicast Searching /usr/lib64/fence-virt for plugins... 
Searching for plugins in /usr/lib64/fence-virt Loading plugin from /usr/lib64/fence-virt/libvirt.so Registered backend plugin libvirt 0.1 Loading plugin from /usr/lib64/fence-virt/multicast.so Failed to map backend_plugin_version Registered listener plugin multicast 1.1 2 plugins found Available backends: libvirt 0.1 Available listeners: multicast 1.1 Debugging threshold is now 99 Using qemu:///system Debugging threshold is now 99 Got /etc/cluster/fence_xvm.key for key_file Got ipv4 for family Got 255.0.0.15 for address Got 1229 for port Got prodif for interface Reading in key file /etc/cluster/fence_xvm.key into 0x13bb070 (4096 max size) Actual key length = 4096 bytes Setting up ipv4 multicast receive (255.0.0.15:1229) Joining multicast group Failed to bind multicast receive socket to 255.0.0.15: Invalid argument Check network configuration. Could not set up multicast listen socket Why is not possible to bind multicast socket?? In kvm host I have installed these packages: [root at kvmhost etc]# rpm -qa | grep fence | sort fence-virt-0.2.3-9.el6.x86_64 fence-virtd-0.2.3-9.el6.x86_64 fence-virtd-libvirt-0.2.3-9.el6.x86_64 fence-virtd-multicast-0.2.3-9.el6.x86_64 From andrew at beekhof.net Mon Nov 5 05:12:40 2012 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 5 Nov 2012 16:12:40 +1100 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: Is "prodif" really the interface name? I'd have expected something like "virbr0" On Sat, Nov 3, 2012 at 10:49 PM, C. L. Martinez wrote: > Hi all, > > I am trying to setup a virtual kvm guest cluster under a centos 6.3 > x86_64 (guests are CentOS 6.3, too). When I have setup > fence_virt.conf (I will use fence_xvm/fence_virt to fence guests), > these errors appears: > > [root at kvmhost etc]# fence_virtd -F -d99 > Background mode disabled > Debugging threshold is now 99 > fence_virtd { > debug = "99"; > listener = "multicast"; > backend = "libvirt"; > module_path = "/usr/lib64/fence-virt"; > } > > listeners { > multicast { > key_file = "/etc/cluster/fence_xvm.key"; > address = "255.0.0.15"; > family = "ipv4"; > port = "1229"; > interface = "prodif"; > } > > } > > backends { > libvirt { > uri = "qemu:///system"; > } > > } > > Backend plugin: libvirt > Listener plugin: multicast > Searching /usr/lib64/fence-virt for plugins... > Searching for plugins in /usr/lib64/fence-virt > Loading plugin from /usr/lib64/fence-virt/libvirt.so > Registered backend plugin libvirt 0.1 > Loading plugin from /usr/lib64/fence-virt/multicast.so > Failed to map backend_plugin_version > Registered listener plugin multicast 1.1 > 2 plugins found > Available backends: > libvirt 0.1 > Available listeners: > multicast 1.1 > Debugging threshold is now 99 > Using qemu:///system > Debugging threshold is now 99 > Got /etc/cluster/fence_xvm.key for key_file > Got ipv4 for family > Got 255.0.0.15 for address > Got 1229 for port > Got prodif for interface > Reading in key file /etc/cluster/fence_xvm.key into 0x13bb070 (4096 max size) > Actual key length = 4096 bytes > Setting up ipv4 multicast receive (255.0.0.15:1229) > Joining multicast group > Failed to bind multicast receive socket to 255.0.0.15: Invalid argument > Check network configuration. > Could not set up multicast listen socket > > Why is not possible to bind multicast socket?? 
In kvm host I have > installed these packages: > > [root at kvmhost etc]# rpm -qa | grep fence | sort > fence-virt-0.2.3-9.el6.x86_64 > fence-virtd-0.2.3-9.el6.x86_64 > fence-virtd-libvirt-0.2.3-9.el6.x86_64 > fence-virtd-multicast-0.2.3-9.el6.x86_64 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From carlopmart at gmail.com Mon Nov 5 07:03:20 2012 From: carlopmart at gmail.com (C. L. Martinez) Date: Mon, 5 Nov 2012 07:03:20 +0000 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: > Is "prodif" really the interface name? > I'd have expected something like "virbr0" > Yes, it is correct. I don't use default bridge names provided by libvirtd ... From mgrac at redhat.com Mon Nov 5 11:00:27 2012 From: mgrac at redhat.com (Marek Grac) Date: Mon, 05 Nov 2012 12:00:27 +0100 Subject: [Linux-cluster] fence-agents 3.1.11 stable release Message-ID: <50979C4B.8020605@redhat.com> Welcome to the fence-agents 3.1.11 release. This release includes these updates: * support new API used in RHEV-M 3.1 * fence_cisco_ucs incorrect timeout value was used during login operation * support on/off also for fabric fence agents (which do not have 'reboot'). Support for enable/disable was not removed. * fence_na support XML metadata output * manual page for ipmilan was fixed to contain correct information about usage for HP iLO3, iLO4 The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.11.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this milestone. m, From andrew at beekhof.net Wed Nov 7 05:05:24 2012 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 7 Nov 2012 16:05:24 +1100 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Mon, Nov 5, 2012 at 6:03 PM, C. L. Martinez wrote: > On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: >> Is "prodif" really the interface name? >> I'd have expected something like "virbr0" >> > > Yes, it is correct. I don't use default bridge names provided by libvirtd ... Are you sure that multicast address is valid? Perhaps try: 225.0.0.12 (not 255.0.0....) From carlopmart at gmail.com Wed Nov 7 06:44:20 2012 From: carlopmart at gmail.com (C. L. Martinez) Date: Wed, 7 Nov 2012 06:44:20 +0000 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Wed, Nov 7, 2012 at 5:05 AM, Andrew Beekhof wrote: > On Mon, Nov 5, 2012 at 6:03 PM, C. L. Martinez wrote: >> On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: >>> Is "prodif" really the interface name? >>> I'd have expected something like "virbr0" >>> >> >> Yes, it is correct. I don't use default bridge names provided by libvirtd ... > > Are you sure that multicast address is valid? > Perhaps try: 225.0.0.12 (not 255.0.0....) > I have tried 225.0.0.12 too, and result is the same ... From andrew at beekhof.net Wed Nov 7 08:09:28 2012 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 7 Nov 2012 19:09:28 +1100 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Wed, Nov 7, 2012 at 5:44 PM, C. L. 
Martinez wrote: > On Wed, Nov 7, 2012 at 5:05 AM, Andrew Beekhof wrote: >> On Mon, Nov 5, 2012 at 6:03 PM, C. L. Martinez wrote: >>> On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: >>>> Is "prodif" really the interface name? >>>> I'd have expected something like "virbr0" >>>> >>> >>> Yes, it is correct. I don't use default bridge names provided by libvirtd ... >> >> Are you sure that multicast address is valid? >> Perhaps try: 225.0.0.12 (not 255.0.0....) >> > > I have tried 225.0.0.12 too, and result is the same ... Have you tried with -d99 (i think thats how you get more debug info) From bubble at hoster-ok.com Thu Nov 8 06:09:36 2012 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Thu, 08 Nov 2012 09:09:36 +0300 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: <509B4CA0.8020606@hoster-ok.com> 03.11.2012 14:49, C. L. Martinez wrote: > Hi all, > > I am trying to setup a virtual kvm guest cluster under a centos 6.3 > x86_64 (guests are CentOS 6.3, too). When I have setup > fence_virt.conf (I will use fence_xvm/fence_virt to fence guests), > these errors appears: > > [root at kvmhost etc]# fence_virtd -F -d99 > Background mode disabled > Debugging threshold is now 99 > fence_virtd { > debug = "99"; > listener = "multicast"; > backend = "libvirt"; > module_path = "/usr/lib64/fence-virt"; > } > > listeners { > multicast { > key_file = "/etc/cluster/fence_xvm.key"; > address = "255.0.0.15"; > family = "ipv4"; > port = "1229"; > interface = "prodif"; > } > > } > > backends { > libvirt { > uri = "qemu:///system"; > } > > } > > Backend plugin: libvirt > Listener plugin: multicast > Searching /usr/lib64/fence-virt for plugins... > Searching for plugins in /usr/lib64/fence-virt > Loading plugin from /usr/lib64/fence-virt/libvirt.so > Registered backend plugin libvirt 0.1 > Loading plugin from /usr/lib64/fence-virt/multicast.so > Failed to map backend_plugin_version > Registered listener plugin multicast 1.1 > 2 plugins found > Available backends: > libvirt 0.1 > Available listeners: > multicast 1.1 > Debugging threshold is now 99 > Using qemu:///system > Debugging threshold is now 99 > Got /etc/cluster/fence_xvm.key for key_file > Got ipv4 for family > Got 255.0.0.15 for address > Got 1229 for port > Got prodif for interface > Reading in key file /etc/cluster/fence_xvm.key into 0x13bb070 (4096 max size) > Actual key length = 4096 bytes > Setting up ipv4 multicast receive (255.0.0.15:1229) > Joining multicast group > Failed to bind multicast receive socket to 255.0.0.15: Invalid argument > Check network configuration. > Could not set up multicast listen socket > > Why is not possible to bind multicast socket?? In kvm host I have > installed these packages: selinux? > > [root at kvmhost etc]# rpm -qa | grep fence | sort > fence-virt-0.2.3-9.el6.x86_64 > fence-virtd-0.2.3-9.el6.x86_64 > fence-virtd-libvirt-0.2.3-9.el6.x86_64 > fence-virtd-multicast-0.2.3-9.el6.x86_64 > From lists at verwilst.be Thu Nov 8 18:43:00 2012 From: lists at verwilst.be (Bart Verwilst) Date: Thu, 08 Nov 2012 19:43:00 +0100 Subject: [Linux-cluster] Failover network device with rgmanager In-Reply-To: <506DB1C4.2080609@redhat.com> References: <2c3f847bbba16467723fe057dbded285@verwilst.be> <506DB1C4.2080609@redhat.com> Message-ID: <7107b25ea4aa7c871a40eca860514b5a@verwilst.be> Thanks a lot for the tip! 
Kind regards, Bart Lon Hohberger schreef op 04.10.2012 17:56: > On 10/04/2012 09:47 AM, Bart Verwilst wrote: >> Hi, >> >> I would like to make rgmanager manage a network interface i >> configured >> under sysconfig ( ifcfg-ethX ). It should be brought up by the >> active >> node as a resource, and ifdown'ed by the standby node. ( It's >> actually a >> GRE tunnel interface ). Is there a straightforward way on how to do >> this >> with CentOS 6.2 cman/rgmanager? >> > > 'script' resource, like: > > #!/bin/sh > > case $1 in > start) > ifup ethX > exit $? > ;; > stop) > ifdown ethX > exit $? > ;; > status) > ... > ;; > esac > > exit 1 > > -- Lon From sumodirjo at gmail.com Fri Nov 9 00:47:55 2012 From: sumodirjo at gmail.com (Muhammad Panji) Date: Fri, 9 Nov 2012 07:47:55 +0700 Subject: [Linux-cluster] Failover root cause Message-ID: Dear All, I have an oracle cluster on RHEL 6.2 with 2 servers. Several days ago the service was failover from node1 to node2. From /var/log/messages on node2 I only see this message : ... Oct 23 12:54:19 db2svr corosync[4142]: [TOTEM ] A processor failed, forming new configuration. Oct 23 12:54:21 db2svr corosync[4142]: [QUORUM] Members[1]: 2 Oct 23 12:54:21 db2svr corosync[4142]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1 Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1 ... Googling this message " [TOTEM ] A processor failed, forming new configuration." I learned that it means node2 couldn't see node1 and then fence node1. on node1 I get this message : Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing /etc/init.d/httpd status Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started. Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] (re)start Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 on 12:50 rgmanager still checking the service and then it's rebooted. Thing that make it worse is that the date / time of both servers are different so that I can't compare the logs directly. Current time difference between both servers is around 5 minutes. I would like to ask where to look for the cause of this failover? I plan to graph sar data today to see if there were bottleneck on CPU etc so that node1 could not send status to node2, but if no bottleneck on CPU or RAM etc where should I find the root cause of failover? thank you. Regards, -- Muhammad Panji http://www.panji.web.id http://www.kurungsiku.com From songyu555 at gmail.com Fri Nov 9 03:40:51 2012 From: songyu555 at gmail.com (Yu) Date: Fri, 9 Nov 2012 14:40:51 +1100 Subject: [Linux-cluster] Failover root cause In-Reply-To: References: Message-ID: Regardless what was the root cause you find. Cluster requires Ntp service to ensure all nodes have time synchronized. So you have to fix this 5 mins difference now. Regards Yu On 09/11/2012, at 11:47, Muhammad Panji wrote: > Dear All, > I have an oracle cluster on RHEL 6.2 with 2 servers. Several days ago > the service was failover from node1 to node2. From /var/log/messages > on node2 I only see this message : > > ... 
> Oct 23 12:54:19 db2svr corosync[4142]: [TOTEM ] A processor failed, > forming new configuration. > Oct 23 12:54:21 db2svr corosync[4142]: [QUORUM] Members[1]: 2 > Oct 23 12:54:21 db2svr corosync[4142]: [TOTEM ] A processor joined > or left the membership and a new membership was formed. > Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1 > Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN > Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1 > ... > > Googling this message " [TOTEM ] A processor failed, forming new > configuration." I learned that it means node2 couldn't see node1 and > then fence node1. on node1 I get this message : > > Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing > /etc/init.d/httpd status > Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started. > Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" > swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] > (re)start > Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset > Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu > Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 > (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 > (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 > > on 12:50 rgmanager still checking the service and then it's rebooted. > Thing that make it worse is that the date / time of both servers are > different so that I can't compare the logs directly. Current time > difference between both servers is around 5 minutes. > > I would like to ask where to look for the cause of this failover? I > plan to graph sar data today to see if there were bottleneck on CPU > etc so that node1 could not send status to node2, but if no bottleneck > on CPU or RAM etc where should I find the root cause of failover? > thank you. > Regards, > > > > > > -- > Muhammad Panji > http://www.panji.web.id > http://www.kurungsiku.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From parvez.h.shaikh at gmail.com Fri Nov 9 10:40:22 2012 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 9 Nov 2012 16:10:22 +0530 Subject: [Linux-cluster] fence_bladecenter - changing default fence action Message-ID: Hi experts, Is there any way to override default fence action (reboot?) for fence_bladecenter through cluster.conf? Can we specify what is fencing action (reboot/off/on) for fence_bladecenter per blade? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From binanalhalabi at yahoo.com Fri Nov 9 12:02:43 2012 From: binanalhalabi at yahoo.com (Binan AL Halabi) Date: Fri, 9 Nov 2012 04:02:43 -0800 (PST) Subject: [Linux-cluster] fence_bladecenter - changing default fence action In-Reply-To: References: Message-ID: <1352462563.57880.YahooMailNeo@web122604.mail.ne1.yahoo.com> Hi, You can specify the fencing action per blade depending on the fencing agent. 
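For illustration only -- the device name, blade number, management module address and
credentials below are placeholders, and the exact attribute names should be checked
against the fence_bladecenter man page / agent metadata for your version -- a per-blade
override in cluster.conf might look roughly like this:

    <clusternode name="node1" nodeid="1">
        <fence>
            <method name="1">
                <device name="bladecenter1" port="3" action="off"/>
            </method>
        </fence>
    </clusternode>
    ...
    <fencedevices>
        <fencedevice agent="fence_bladecenter" name="bladecenter1"
                     ipaddr="amm.example.com" login="USERID" passwd="PASSWORD"/>
    </fencedevices>

The action attribute on the <device> line is what overrides the default reboot for that
blade; on RHEL 6 you can run ccs_config_validate afterwards to confirm the file is still
a valid configuration.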
http://www.sourceware.org/cluster/doc/cluster_schema_rhel5.html

use action attribute per node in configuration file: see Example 7.4 and 7.5 here:
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-config-fencing-cli-CA.html#ex-clusterconf-fencing-fencemethods-cli-CA

// Binan

________________________________
Från: Parvez Shaikh
Till: linux clustering
Skickat: fredag, 9 november 2012 11:40
Ämne: [Linux-cluster] fence_bladecenter - changing default fence action

Hi experts,

Is there any way to override default fence action (reboot?) for
fence_bladecenter through cluster.conf?

Can we specify what is fencing action (reboot/off/on) for
fence_bladecenter per blade?

Thanks

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From queszama at yahoo.in  Sat Nov 10 02:26:15 2012
From: queszama at yahoo.in (Zama Ques)
Date: Sat, 10 Nov 2012 10:26:15 +0800 (SGT)
Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding
Message-ID: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com>

Hi All,

Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here will have answer to the issues I am facing .

I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches .
My configuration is as follows: ======== # cat /proc/net/bonding/bond0 Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) Bonding Mode: adaptive load balancing Primary Slave: None Currently Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay (ms): 0 Down Delay (ms): 0 Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 ------------ # cat /sys/class/net/bond0/bonding/mode ? balance-alb 6 # cat /sys/class/net/bond0/bonding/miimon ?? 0 ============ The issue for me is that I am seeing packet loss after configuring bonding .? Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss.? What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . Thanks Zaman From lists at alteeve.ca Sat Nov 10 02:54:33 2012 From: lists at alteeve.ca (Digimer) Date: Fri, 09 Nov 2012 21:54:33 -0500 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> Message-ID: <509DC1E9.9090704@alteeve.ca> On 11/09/2012 09:26 PM, Zama Ques wrote: > Hi All, > > Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here will have answer to the issues I am facing . > > I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches . > > My configuration is as follows: > > ======== > # cat /proc/net/bonding/bond0 > > Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) > > Bonding Mode: adaptive load balancing Primary Slave: None Currently > Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay > (ms): 0 Down Delay (ms): 0 > > Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 > > Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 > ------------ > # cat /sys/class/net/bond0/bonding/mode > > balance-alb 6 > > > # cat /sys/class/net/bond0/bonding/miimon > 0 > > ============ > > > The issue for me is that I am seeing packet loss after configuring bonding . Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss. > > What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . > > > > Thanks > Zaman You didn't share any details on your configuration, but I will assume you are using corosync. The only supported bonding mode is Active/Passive (mode=1). I've personally tried all modes, out of curiosity, and all had problems. The short of it is that if you need more that 1 gbit of performance, buy faster cards. 
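For illustration only (the NIC names, IP address and miimon value here are assumptions,
not taken from this thread), a minimal active-backup (mode=1) setup on RHEL 6 would look
roughly like this:

/etc/sysconfig/network-scripts/ifcfg-bond0:

    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=10.20.0.1
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=1 miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1 with DEVICE=eth1):

    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes

After a network restart, /proc/net/bonding/bond0 should report
"Bonding Mode: fault-tolerance (active-backup)", and you can test failover by pulling
the cable on the active slave while running a continuous ping.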
If you are interested in what I use, it's documented here: https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network I've used this setup in several production clusters and have tested failure are recovery extensively. It's proven very stable. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From queszama at yahoo.in Sat Nov 10 04:12:19 2012 From: queszama at yahoo.in (Zama Ques) Date: Sat, 10 Nov 2012 12:12:19 +0800 (SGT) Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <509DC1E9.9090704@alteeve.ca> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> Message-ID: <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> ----- Original Message ----- From: Digimer To: Zama Ques ; linux clustering Cc: Sent: Saturday, 10 November 2012 8:24 AM Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding On 11/09/2012 09:26 PM, Zama Ques wrote: > Hi All, > > Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here? will have answer to the issues I am facing . > > I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches . > > My configuration is as follows: > > ======== > # cat /proc/net/bonding/bond0 > > Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) > > Bonding Mode: adaptive load balancing Primary Slave: None Currently > Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay > (ms): 0 Down Delay (ms): 0 > > Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 > > Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 > ------------ > # cat /sys/class/net/bond0/bonding/mode > >? balance-alb 6 > > > # cat /sys/class/net/bond0/bonding/miimon >? ? 0 > > ============ > > > The issue for me is that I am seeing packet loss after configuring bonding .? Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss. > > What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . > > > > Thanks > Zaman ?> You didn't share any details on your configuration, but I will assume > you are using corosync. > The only supported bonding mode is Active/Passive (mode=1). I've > personally tried all modes, out of curiosity, and all had problems. The > short of it is that if you need more that 1 gbit of performance, buy > faster cards. > If you are interested in what I use, it's documented here: >? https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network >? I've used this setup in several production clusters and have tested >? failure are recovery extensively. It's proven very stable. :) ? Thanks Digimer for the quick response and pointing me to the link . I am yet to reach cluster configuration , initially trying to? understand ethernet bonding before going into cluster configuration. So , option for me is only to use Active/Passive bonding mode in case of clustered environment. 
Few more clarifications needed , Can we use other bonding modes in non clustered environment .? I am seeing packet loss in other modes . Also , the support of? using only mode=1 in cluster environment is it a restriction of RHEL Cluster suite or it is by design . Will be great if you clarify these queries . Thanks in Advance Zaman From lists at alteeve.ca Sat Nov 10 04:22:44 2012 From: lists at alteeve.ca (Digimer) Date: Fri, 09 Nov 2012 23:22:44 -0500 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> Message-ID: <509DD694.1000900@alteeve.ca> On 11/09/2012 11:12 PM, Zama Ques wrote: > ----- Original Message ----- > From: Digimer > To: Zama Ques ; linux clustering > Cc: > Sent: Saturday, 10 November 2012 8:24 AM > Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding > > On 11/09/2012 09:26 PM, Zama Ques wrote: >> Hi All, >> >> Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here will have answer to the issues I am facing . >> >> I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches . >> >> My configuration is as follows: >> >> ======== >> # cat /proc/net/bonding/bond0 >> >> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) >> >> Bonding Mode: adaptive load balancing Primary Slave: None Currently >> Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay >> (ms): 0 Down Delay (ms): 0 >> >> Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link >> Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 >> >> Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link >> Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 >> ------------ >> # cat /sys/class/net/bond0/bonding/mode >> >> balance-alb 6 >> >> >> # cat /sys/class/net/bond0/bonding/miimon >> 0 >> >> ============ >> >> >> The issue for me is that I am seeing packet loss after configuring bonding . Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss. >> >> What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . >> >> >> >> Thanks >> Zaman > > > You didn't share any details on your configuration, but I will assume >> you are using corosync. > >> The only supported bonding mode is Active/Passive (mode=1). I've >> personally tried all modes, out of curiosity, and all had problems. The >> short of it is that if you need more that 1 gbit of performance, buy >> faster cards. > >> If you are interested in what I use, it's documented here: > >> https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network > >> I've used this setup in several production clusters and have tested >> failure are recovery extensively. It's proven very stable. :) > > > Thanks Digimer for the quick response and pointing me to the link . I am yet to reach cluster configuration , initially trying to understand ethernet bonding before going into cluster configuration. 
So , option for me is only to use Active/Passive bonding mode in case of clustered environment. > Few more clarifications needed , Can we use other bonding modes in non clustered environment . I am seeing packet loss in other modes . Also , the support of using only mode=1 in cluster environment is it a restriction of RHEL Cluster suite or it is by design . > > Will be great if you clarify these queries . > > Thanks in Advance > Zaman Corosync is the only actively developed/supported (HA) cluster communications and membership tool. It's used on all modern distros for clustering and the requirement for mode=1 is with it. As such, it doesn't matter which OS you are on, it's the only mode that will work (reliably). The problem is that corosync needs to detect state changes quickly. It does this using the totem protocol (which serves other purposes), which passes a token around the nodes in the cluster. If a node is sent a token and the token is not returned within a time-out period, it is declared lost and a new token is dispatched. Once too many failures occur in a row, the node is declared lost and it is ejected from the cluster. This process is detailed in the link above under the "Concept; Fencing" section. With all modes other than mode=1, the failure recovery and/or the restoration of a link in the bond causes a sufficient disruption to cause a node to be declared lost. As I mentioned, this matches my experience in testing the other modes. It isn't an arbitrary rule. As for non-clustered traffic; the usefulness of other bond modes depends entirely on the traffic you are pushing over it. Personally, I am focused on HA in clusters, so I only use mode=1, regardless of the traffic designed for it. digimer ps - You will see reference to "heartbeat" as a comms layer in clustering. It's been deprecated and should not be used. Likewise, pacemaker is the future of clustering, so it should be to resource manager you learn/use. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From tc3driver at gmail.com Sat Nov 10 04:35:34 2012 From: tc3driver at gmail.com (Bill G.) Date: Fri, 9 Nov 2012 20:35:34 -0800 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> Message-ID: Hi Zaman, There are some configurations that need to be made to the switch to allow both nics to come up with the same mac. I am by no means a network expert, so I cannot think of the name of the protocol off the top of my head. I am willing to wager that the lack of that configuration is the cause of your packet loss. On Nov 9, 2012 8:22 PM, "Zama Ques" wrote: > > > > > ----- Original Message ----- > From: Digimer > To: Zama Ques ; linux clustering < > linux-cluster at redhat.com> > Cc: > Sent: Saturday, 10 November 2012 8:24 AM > Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding > > On 11/09/2012 09:26 PM, Zama Ques wrote: > > Hi All, > > > > Need help on resolving a issue related to implementing High Availability > at network level . I understand that this is not the right forum to ask > this question , but since it is related to HA and Linux , I am asking here > and I feel somebody here will have answer to the issues I am facing . 
> > > > I am trying to implement Ethernet Bonding , Both the interface in my > server are connected to two different network switches . > > > > My configuration is as follows: > > > > ======== > > # cat /proc/net/bonding/bond0 > > > > Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) > > > > Bonding Mode: adaptive load balancing Primary Slave: None Currently > > Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay > > (ms): 0 Down Delay (ms): 0 > > > > Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link > > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 > > > > Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link > > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 > > ------------ > > # cat /sys/class/net/bond0/bonding/mode > > > > balance-alb 6 > > > > > > # cat /sys/class/net/bond0/bonding/miimon > > 0 > > > > ============ > > > > > > The issue for me is that I am seeing packet loss after configuring > bonding . Tried connecting both the interface to the same switch , but > still seeing the packet loss . Also , tried changing miimon value to 100 , > but still seeing the packet loss. > > > > What I am missing in the configuration ? Any help will be highly > appreciated in resolving the problem . > > > > > > > > Thanks > > Zaman > > > You didn't share any details on your configuration, but I will assume > > you are using corosync. > > > The only supported bonding mode is Active/Passive (mode=1). I've > > personally tried all modes, out of curiosity, and all had problems. The > > short of it is that if you need more that 1 gbit of performance, buy > > faster cards. > > > If you are interested in what I use, it's documented here: > > > https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network > > > I've used this setup in several production clusters and have tested > > failure are recovery extensively. It's proven very stable. :) > > > Thanks Digimer for the quick response and pointing me to the link . I am > yet to reach cluster configuration , initially trying to understand > ethernet bonding before going into cluster configuration. So , option for > me is only to use Active/Passive bonding mode in case of clustered > environment. > Few more clarifications needed , Can we use other bonding modes in non > clustered environment . I am seeing packet loss in other modes . Also , > the support of using only mode=1 in cluster environment is it a > restriction of RHEL Cluster suite or it is by design . > > Will be great if you clarify these queries . > > Thanks in Advance > Zaman > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Imran.Kalam at auspost.com.au Sun Nov 11 22:32:02 2012 From: Imran.Kalam at auspost.com.au (Kalam, Imran) Date: Sun, 11 Nov 2012 22:32:02 +0000 Subject: [Linux-cluster] Cluster node1 rebooted itself Message-ID: Hi All. I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. On Sunday morning the node1 (master) has rebooted itself and I could only see the following in the message log file. Has anyone experienced the same problem? Please let me know if you need more information. Thanks Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. 
Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to reconnect... Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection refused Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request descriptor Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request descriptor Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid request descriptor Nov 11 00:12:48 clurgmgrd: [6872]: unmounting /dev/mapper/vg_shared-lv00 (/opt/xxshare) Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection refused Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection refused Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request descriptor Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Regards Imran Kalam Technical Specialist Post IT Corporate Services Australia Post Level 2, 185 Rosslyn St. West Melbourne Phone: (03) 9322 0382 Fax: 9204 7303 Mob: 0439 559 461 A Australia Post is committed to providing our customers with excellent service. If we can assist you in any way please telephone 13 13 18 or visit our website. The information contained in this email communication may be proprietary, confidential or legally professionally privileged. It is intended exclusively for the individual or entity to which it is addressed. You should only read, disclose, re-transmit, copy, distribute, act in reliance on or commercialise the information if you are authorised to do so. Australia Post does not represent, warrant or guarantee that the integrity of this email communication has been maintained nor that the communication is free of errors, virus or interference. If you are not the addressee or intended recipient please notify us by replying direct to the sender and then destroy any electronic or paper copy of this message. Any views expressed in this email communication are taken to be those of the individual sender, except where the sender specifically attributes those views to Australia Post and is authorised to do so. Please consider the environment before printing this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lists at alteeve.ca Sun Nov 11 22:35:40 2012 From: lists at alteeve.ca (Digimer) Date: Sun, 11 Nov 2012 17:35:40 -0500 Subject: [Linux-cluster] Cluster node1 rebooted itself In-Reply-To: References: Message-ID: <50A0283C.5020808@alteeve.ca> It's hard to make much of a guess given that your cluster configuration is unknown. That said, it would seem that something interrupted comms. What is in the syslog of node 2 at the same time period? can you share you cluster.conf please (obfuscating only passwords)? On 11/11/2012 05:32 PM, Kalam, Imran wrote: > Hi All. > > I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. > On Sunday morning the node1 (master) has rebooted itself and I could > only see the following in the message log file. Has anyone experienced > the same problem? Please let me know if you need more information. Thanks > > Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 > Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined > Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly > Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to > reconnect... > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid > request descriptor > Nov 11 00:12:48 clurgmgrd: [6872]: unmounting > /dev/mapper/vg_shared-lv00 (/opt/xxshare) > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > > > *Regards* > Imran Kalam > Technical Specialist > Post IT > Corporate Services > Australia Post > Level 2, 185 Rosslyn St. West Melbourne > Phone: (03) 9322 0382 > Fax: 9204 7303 > Mob: 0439 559 461 > > A > > > > > Australia Post is committed to providing our customers with excellent > service. If we can assist you in any way please telephone 13 13 18 or > visit our website. 
> > The information contained in this email communication may be > proprietary, confidential or legally professionally privileged. It is > intended exclusively for the individual or entity to which it is > addressed. You should only read, disclose, re-transmit, copy, > distribute, act in reliance on or commercialise the information if you > are authorised to do so. Australia Post does not represent, warrant or > guarantee that the integrity of this email communication has been > maintained nor that the communication is free of errors, virus or > interference. > > If you are not the addressee or intended recipient please notify us by > replying direct to the sender and then destroy any electronic or paper > copy of this message. Any views expressed in this email communication > are taken to be those of the individual sender, except where the sender > specifically attributes those views to Australia Post and is authorised > to do so. > > Please consider the environment before printing this email. > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From sam at dotsec.com Sun Nov 11 22:17:06 2012 From: sam at dotsec.com (Sam Wilson) Date: Mon, 12 Nov 2012 08:17:06 +1000 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> Message-ID: <50A023E2.4070301@dotsec.com> With regards to what switch support is required for GNU\linux bonding its worth having a read through the docs http://www.kernel.org/doc/Documentation/networking/bonding.txt to understand the available modes in the bonding driver. As far as I understand it only mode=4 requires switch side participation in the bonding. All other modes are implemented on the host side. Cheers, Sam From Imran.Kalam at auspost.com.au Sun Nov 11 22:48:30 2012 From: Imran.Kalam at auspost.com.au (Kalam, Imran) Date: Sun, 11 Nov 2012 22:48:30 +0000 Subject: [Linux-cluster] Cluster node1 rebooted itself In-Reply-To: <50A0283C.5020808@alteeve.ca> References: <50A0283C.5020808@alteeve.ca> Message-ID: Hi Digimer. Below are the information from the second node log file and configuration is on its way. Thanks Nov 11 00:12:47 qdiskd[6704]: Writing eviction notice for node 1 Nov 11 00:12:47 kernel: CMAN: removing node node1hb from the cluster : Killed by another node Nov 11 00:12:49 qdiskd[6704]: Node 1 evicted Nov 11 00:12:55 fenced[6771]: node1hb not a cluster member after 8 sec post_fail_delay Nov 11 00:12:55 fenced[6771]: fencing node "node1hb" Nov 11 00:14:00 ccsd[6603]: Attempt to close an unopened CCS descriptor (5462880). Nov 11 00:14:00 ccsd[6603]: Error while processing disconnect: Invalid request descriptor Nov 11 00:14:00 fenced[6771]: fence "node1hb" success Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Trying to acquire journal lock... Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Looking at journal... Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Acquiring the transaction lock... Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replaying journal... 
Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replayed 4 of 4 blocks Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: replays = 4, skips = 0, sames = 0 Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Journal replayed in 1s Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Done Nov 11 00:14:07 clurgmgrd[6833]: Magma Event: Membership Change Nov 11 00:14:07 clurgmgrd[6833]: State change: node1hb DOWN Nov 11 00:16:59 kernel: CMAN: node node1hb rejoining Nov 11 00:17:08 clurgmgrd[6833]: Magma Event: Membership Change Nov 11 00:17:08 clurgmgrd[6833]: State change: node1hb UP -----Original Message----- From: Digimer [mailto:lists at alteeve.ca] Sent: Monday, 12 November, 2012 9:36 AM To: linux clustering Cc: Kalam, Imran Subject: Re: [Linux-cluster] Cluster node1 rebooted itself It's hard to make much of a guess given that your cluster configuration is unknown. That said, it would seem that something interrupted comms. What is in the syslog of node 2 at the same time period? can you share you cluster.conf please (obfuscating only passwords)? On 11/11/2012 05:32 PM, Kalam, Imran wrote: > Hi All. > > I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. > On Sunday morning the node1 (master) has rebooted itself and I could > only see the following in the message log file. Has anyone experienced > the same problem? Please let me know if you need more information. Thanks > > Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 > Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined > Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly > Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to > reconnect... > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid > request descriptor > Nov 11 00:12:48 clurgmgrd: [6872]: unmounting > /dev/mapper/vg_shared-lv00 (/opt/xxshare) > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. 
> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > > > *Regards* > Imran Kalam > Technical Specialist > Post IT > Corporate Services > Australia Post > Level 2, 185 Rosslyn St. West Melbourne > Phone: (03) 9322 0382 > Fax: 9204 7303 > Mob: 0439 559 461 > > A > > > > > Australia Post is committed to providing our customers with excellent > service. If we can assist you in any way please telephone 13 13 18 or > visit our website. > > The information contained in this email communication may be > proprietary, confidential or legally professionally privileged. It is > intended exclusively for the individual or entity to which it is > addressed. You should only read, disclose, re-transmit, copy, > distribute, act in reliance on or commercialise the information if you > are authorised to do so. Australia Post does not represent, warrant or > guarantee that the integrity of this email communication has been > maintained nor that the communication is free of errors, virus or > interference. > > If you are not the addressee or intended recipient please notify us by > replying direct to the sender and then destroy any electronic or paper > copy of this message. Any views expressed in this email communication > are taken to be those of the individual sender, except where the sender > specifically attributes those views to Australia Post and is authorised > to do so. > > Please consider the environment before printing this email. > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From sumodirjo at gmail.com Sun Nov 11 22:49:43 2012 From: sumodirjo at gmail.com (Muhammad Panji) Date: Mon, 12 Nov 2012 05:49:43 +0700 Subject: [Linux-cluster] Failover root cause In-Reply-To: References: Message-ID: Hi, I plan to implement NTP so that both servers time synchronized. How can I look for the failover cause? I already graph sar data and no peak usage on the time when db1svr was fenced by db2svr. What file (and what specific message) that I should look to know the root cause of this failover. Thank you. Regards, Panji On Fri, Nov 9, 2012 at 10:40 AM, Yu wrote: > Regardless what was the root cause you find. Cluster requires Ntp service to ensure all nodes have time synchronized. So you have to fix this 5 mins difference now. > > Regards > Yu > > On 09/11/2012, at 11:47, Muhammad Panji wrote: > >> Dear All, >> I have an oracle cluster on RHEL 6.2 with 2 servers. Several days ago >> the service was failover from node1 to node2. From /var/log/messages >> on node2 I only see this message : >> >> ... >> Oct 23 12:54:19 db2svr corosync[4142]: [TOTEM ] A processor failed, >> forming new configuration. >> Oct 23 12:54:21 db2svr corosync[4142]: [QUORUM] Members[1]: 2 >> Oct 23 12:54:21 db2svr corosync[4142]: [TOTEM ] A processor joined >> or left the membership and a new membership was formed. >> Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1 >> Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN >> Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1 >> ... >> >> Googling this message " [TOTEM ] A processor failed, forming new >> configuration." I learned that it means node2 couldn't see node1 and >> then fence node1. 
on node1 I get this message : >> >> Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing >> /etc/init.d/httpd status >> Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started. >> Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" >> swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] >> (re)start >> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset >> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu >> Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 >> (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 >> (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 >> >> on 12:50 rgmanager still checking the service and then it's rebooted. >> Thing that make it worse is that the date / time of both servers are >> different so that I can't compare the logs directly. Current time >> difference between both servers is around 5 minutes. >> >> I would like to ask where to look for the cause of this failover? I >> plan to graph sar data today to see if there were bottleneck on CPU >> etc so that node1 could not send status to node2, but if no bottleneck >> on CPU or RAM etc where should I find the root cause of failover? >> thank you. >> Regards, >> >> >> >> >> >> -- >> Muhammad Panji >> http://www.panji.web.id >> http://www.kurungsiku.com >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Muhammad Panji http://www.panji.web.id http://www.kurungsiku.com From Imran.Kalam at auspost.com.au Sun Nov 11 22:49:29 2012 From: Imran.Kalam at auspost.com.au (Kalam, Imran) Date: Sun, 11 Nov 2012 22:49:29 +0000 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <50A023E2.4070301@dotsec.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> <50A023E2.4070301@dotsec.com> Message-ID: Thanks, I will read over the document. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sam Wilson Sent: Monday, 12 November, 2012 9:17 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding With regards to what switch support is required for GNU\linux bonding its worth having a read through the docs http://www.kernel.org/doc/Documentation/networking/bonding.txt to understand the available modes in the bonding driver. As far as I understand it only mode=4 requires switch side participation in the bonding. All other modes are implemented on the host side. Cheers, Sam -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster Australia Post is committed to providing our customers with excellent service. If we can assist you in any way please telephone 13 13 18 or visit our website. The information contained in this email communication may be proprietary, confidential or legally professionally privileged. It is intended exclusively for the individual or entity to which it is addressed. You should only read, disclose, re-transmit, copy, distribute, act in reliance on or commercialise the information if you are authorised to do so. 
Australia Post does not represent, warrant or guarantee that the integrity of this email communication has been maintained nor that the communication is free of errors, virus or interference. If you are not the addressee or intended recipient please notify us by replying direct to the sender and then destroy any electronic or paper copy of this message. Any views expressed in this email communication are taken to be those of the individual sender, except where the sender specifically attributes those views to Australia Post and is authorised to do so. Please consider the environment before printing this email. From lists at alteeve.ca Sun Nov 11 22:54:32 2012 From: lists at alteeve.ca (Digimer) Date: Sun, 11 Nov 2012 17:54:32 -0500 Subject: [Linux-cluster] Cluster node1 rebooted itself In-Reply-To: References: <50A0283C.5020808@alteeve.ca> Message-ID: <50A02CA8.5070800@alteeve.ca> Ya, certainly looks like a network problem. If you have a support contract with Red Hat, you may want to bring them in to have a more detailed review though. I am only guessing based on what you've listed here. Cheers On 11/11/2012 05:48 PM, Kalam, Imran wrote: > Hi Digimer. > > Below are the information from the second node log file and configuration is on its way. Thanks > > Nov 11 00:12:47 qdiskd[6704]: Writing eviction notice for node 1 > Nov 11 00:12:47 kernel: CMAN: removing node node1hb from the cluster : Killed by another node > Nov 11 00:12:49 qdiskd[6704]: Node 1 evicted > Nov 11 00:12:55 fenced[6771]: node1hb not a cluster member after 8 sec post_fail_delay > Nov 11 00:12:55 fenced[6771]: fencing node "node1hb" > Nov 11 00:14:00 ccsd[6603]: Attempt to close an unopened CCS descriptor (5462880). > Nov 11 00:14:00 ccsd[6603]: Error while processing disconnect: Invalid request descriptor > Nov 11 00:14:00 fenced[6771]: fence "node1hb" success > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Trying to acquire journal lock... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Looking at journal... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Acquiring the transaction lock... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replaying journal... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replayed 4 of 4 blocks > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: replays = 4, skips = 0, sames = 0 > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Journal replayed in 1s > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Done > Nov 11 00:14:07 clurgmgrd[6833]: Magma Event: Membership Change > Nov 11 00:14:07 clurgmgrd[6833]: State change: node1hb DOWN > Nov 11 00:16:59 kernel: CMAN: node node1hb rejoining > Nov 11 00:17:08 clurgmgrd[6833]: Magma Event: Membership Change > Nov 11 00:17:08 clurgmgrd[6833]: State change: node1hb UP > > -----Original Message----- > From: Digimer [mailto:lists at alteeve.ca] > Sent: Monday, 12 November, 2012 9:36 AM > To: linux clustering > Cc: Kalam, Imran > Subject: Re: [Linux-cluster] Cluster node1 rebooted itself > > It's hard to make much of a guess given that your cluster configuration > is unknown. That said, it would seem that something interrupted comms. > What is in the syslog of node 2 at the same time period? can you share > you cluster.conf please (obfuscating only passwords)? > > On 11/11/2012 05:32 PM, Kalam, Imran wrote: >> Hi All. 
>> >> I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. >> On Sunday morning the node1 (master) has rebooted itself and I could >> only see the following in the message log file. Has anyone experienced >> the same problem? Please let me know if you need more information. Thanks >> >> Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 >> Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. >> Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown >> Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown >> Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined >> Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined >> Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined >> Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined >> Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly >> Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to >> reconnect... >> Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. >> Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection >> refused >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request >> descriptor >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request >> descriptor >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid >> request descriptor >> Nov 11 00:12:48 clurgmgrd: [6872]: unmounting >> /dev/mapper/vg_shared-lv00 (/opt/xxshare) >> Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. >> Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection >> refused >> Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. >> Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection >> refused >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request >> descriptor >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> >> >> *Regards* >> Imran Kalam >> Technical Specialist >> Post IT >> Corporate Services >> Australia Post >> Level 2, 185 Rosslyn St. West Melbourne >> Phone: (03) 9322 0382 >> Fax: 9204 7303 >> Mob: 0439 559 461 >> >> A >> >> >> >> >> Australia Post is committed to providing our customers with excellent >> service. If we can assist you in any way please telephone 13 13 18 or >> visit our website. >> >> The information contained in this email communication may be >> proprietary, confidential or legally professionally privileged. It is >> intended exclusively for the individual or entity to which it is >> addressed. You should only read, disclose, re-transmit, copy, >> distribute, act in reliance on or commercialise the information if you >> are authorised to do so. Australia Post does not represent, warrant or >> guarantee that the integrity of this email communication has been >> maintained nor that the communication is free of errors, virus or >> interference. 
>> >> If you are not the addressee or intended recipient please notify us by >> replying direct to the sender and then destroy any electronic or paper >> copy of this message. Any views expressed in this email communication >> are taken to be those of the individual sender, except where the sender >> specifically attributes those views to Australia Post and is authorised >> to do so. >> >> Please consider the environment before printing this email. >> >> >> > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From dev at sdd.jp Mon Nov 12 06:24:56 2012 From: dev at sdd.jp (Antonio Castellano) Date: Mon, 12 Nov 2012 15:24:56 +0900 Subject: [Linux-cluster] Bug inquiry (#831330) Message-ID: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Hi, I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. This is the link related to the text reported in our log: https://access.redhat.com/knowledge/ja/node/141203 And this is the bugzilla link: https://bugzilla.redhat.com/show_bug.cgi?id=831330 Is there anybody out there that can help me? The help will be greatly appreciated. Thank you very much! -- Antonio Castellano [DEV at SDD.jp] Seventh Dimension Design, Inc. http://www.SDD.jp VOICE: +81-78-252-8855, FAX: +81-78-252-8856 From lists at alteeve.ca Mon Nov 12 06:33:11 2012 From: lists at alteeve.ca (Digimer) Date: Mon, 12 Nov 2012 01:33:11 -0500 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> References: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <50A09827.6000204@alteeve.ca> On 11/12/2012 01:24 AM, Antonio Castellano wrote: > Hi, > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. > > This is the link related to the text reported in our log: > https://access.redhat.com/knowledge/ja/node/141203 > > And this is the bugzilla link: > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > Is there anybody out there that can help me? The help will be greatly appreciated. > > Thank you very much! Closed bugs generally have customer-specific information in them. They are closed to reduce the risk of leaking private information. The only way for you to see the status of that bug is to speak with your support person, assuming that the bug is yours. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From dev at sdd.jp Mon Nov 12 07:40:46 2012 From: dev at sdd.jp (Antonio Castellano) Date: Mon, 12 Nov 2012 16:40:46 +0900 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <50A09827.6000204@alteeve.ca> References: <50A09827.6000204@alteeve.ca> <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <1352706048367910000775f@sv0.inside.kobe.sdd.jp> > On 11/12/2012 01:24 AM, Antonio Castellano wrote: > > Hi, > > > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. 
> > > > This is the link related to the text reported in our log: > > https://access.redhat.com/knowledge/ja/node/141203 > > > > And this is the bugzilla link: > > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > > > Is there anybody out there that can help me? The help will be greatly appreciated. > > > > Thank you very much! > > Closed bugs generally have customer-specific information in them. They > are closed to reduce the risk of leaking private information. The only > way for you to see the status of that bug is to speak with your support > person, assuming that the bug is yours. > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? I see. Not what I was hoping for, but thank you very much anyway for the quick reply! Best regards, -- Antonio Castellano [DEV at SDD.jp] Seventh Dimension Design, Inc. http://www.SDD.jp VOICE: +81-78-252-8855, FAX: +81-78-252-8856 From swhiteho at redhat.com Mon Nov 12 10:19:19 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 12 Nov 2012 10:19:19 +0000 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> References: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <1352715560.2721.9.camel@menhir> Hi, On Mon, 2012-11-12 at 15:24 +0900, Antonio Castellano wrote: > Hi, > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. > > This is the link related to the text reported in our log: > https://access.redhat.com/knowledge/ja/node/141203 > > And this is the bugzilla link: > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > Is there anybody out there that can help me? The help will be greatly appreciated. > > Thank you very much! > Assuming that you are a Red Hat customer, please open a ticket. The bug mostly contains customer's private data, so that I don't think opening this one up would help much as there would be little that we could share. This is though, our highest priority bug at the moment (when I say our, I mean the GFS2 team). There is a simple workaround (just use a slightly older kernel) which is one reason why we've had trouble in tracing this, because people are (understandably) using that rather than running the kernel we've built to debug this issue. We've been unable to reproduce this internally, despite trying many different workloads. If you are in a position to help us debug the issue, then any assistance is very gratefully received, Steve. From anprice at redhat.com Mon Nov 12 20:37:36 2012 From: anprice at redhat.com (Andrew Price) Date: Mon, 12 Nov 2012 20:37:36 +0000 Subject: [Linux-cluster] gfs2-utils 3.1.5 Released Message-ID: <50A15E10.8030009@redhat.com> Hi, gfs2-utils 3.1.5 has been released. This version features bug fixes and performance enhancements for fsck.gfs2 in particular, better handling of symlinks in mkfs.gfs2, a small block manipulation language to aid future testing, a gfs2_lockcapture script which replaces gfs2_lockgather, and various other minor enhancements and bug fixes. The mount.gfs2 helper utility has been removed as it is no longer required to mount gfs2 file systems. gfs2_tool and gfs2_quota have also been removed. 
Users of gfs2_quota should now use the generic quota utilities and users of gfs2_tool should now use tunegfs2, gfs2 mount options and the generic dmsetup and chattr/lsattr tools. See below for a full list of changes. The source tarball is available from: https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.5.tar.gz To report bugs or issues, please file them against the gfs2-utils component of Fedora (rawhide) at: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide Regards, Andy Price Red Hat File Systems Changes since 3.1.4: Andrew Price (29): gfs2_utils: Improve error messages fsck.gfs2: Fix handling of eattr indirect blocks libgfs2: Remove gfs_get_leaf_nr libgfs2: Clean up some warnings gfs2-utils: Remove references to unlinked file tag gfs2_edit: Fix find_mtype and support gfs1 structures gfs2_edit: Clean up some magic offsets libgfs2: Use flags for versions in metadata description mkfs.gfs2: Check for symlinks before reporting device contents gfs2-utils: Remove obsolete tools gfs2-utils: Make building gfs_controld optional gfs2-utils: Only build group/ when gfs_controld is enabled gfs2-utils: Remove unused exported functions mkfs.gfs2: Avoid a rename race when checking file contents fsck.gfs2: Fix buffer overflow in get_lockproto_table libgfs2: Remove exit calls from inode_read and inode_get libgfs2: Remove exit call from __gfs_inode_get gfs2_edit: Some comment cleanups mkfs.gfs2: Check locktable more strictly for valid chars libgfs2: Add a gfs2 block query language libgfs2: Move valid_block into fsck.gfs2 libgfs2: gfs2_get_bitmap performance enhancements fsck.gfs2: Fix build failure gfs2-utils: build: Avoid using the kernel versions of kernel headers libgfs2: Add a small testing language UI gfs2-utils: Update .gitignore gfs2-utils: Remove gfs2_lockgather gfs2-utils: Rename lockgather directory to lockcapture gfs2-utils: Remove remaining references to gfs2_lockgather Bob Peterson (8): gfs2_edit savemeta: Get rid of "slow" mode gfs2_edit savemeta: report save statistics more often gfs2_edit savemeta: fix block range checking gfs2_edit restoremeta: sync changes on a regular basis RHEL6 gfs_controld: fix ignore_nolock for mounted nolock fs fsck.gfs2: soften the messages when reclaiming freemeta blocks fsck.gfs2: Check for formal inode number mismatch GFS2: Fix a compiler warning in pass2's check_dentry Shane Bradley (1): gfs2-utils: Added a new script called gfs2_lockcapture that will capture lockdump data. Steven Whitehouse (3): libgfs2: libgfs2.h: Add gfs_block_tag structure, and some more flag symbols mount.gfs2: Remove obsolete tool libgfs2: Add pointer restriction flags From fdinitto at redhat.com Tue Nov 13 09:17:47 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 13 Nov 2012 10:17:47 +0100 Subject: [Linux-cluster] gfs2-utils 3.1.5 Released In-Reply-To: <50A15E10.8030009@redhat.com> References: <50A15E10.8030009@redhat.com> Message-ID: <50A2103B.9000009@redhat.com> On 11/12/2012 9:37 PM, Andrew Price wrote: > Hi, > > gfs2-utils 3.1.5 has been released. This version features bug fixes and > performance enhancements for fsck.gfs2 in particular, better handling of > symlinks in mkfs.gfs2, a small block manipulation language to aid future > testing, a gfs2_lockcapture script which replaces gfs2_lockgather, and > various other minor enhancements and bug fixes. > > The mount.gfs2 helper utility has been removed as it is no longer > required to mount gfs2 file systems. gfs2_tool and gfs2_quota have also > been removed. 
Users of gfs2_quota should now use the generic quota > utilities and users of gfs2_tool should now use tunegfs2, gfs2 mount > options and the generic dmsetup and chattr/lsattr tools. IIRC there is a specific minimum kernel version for mount.gfs2 to be obsoleted and quota tool version to obsolete gfs2_quota. Might be a good idea to document it, so that users won't attempt random back-ports. Fabio From anprice at redhat.com Tue Nov 13 11:57:34 2012 From: anprice at redhat.com (Andrew Price) Date: Tue, 13 Nov 2012 11:57:34 +0000 Subject: [Linux-cluster] gfs2-utils 3.1.5 Released In-Reply-To: <50A2103B.9000009@redhat.com> References: <50A15E10.8030009@redhat.com> <50A2103B.9000009@redhat.com> Message-ID: <50A235AE.7030807@redhat.com> On 13/11/12 09:17, Fabio M. Di Nitto wrote: > On 11/12/2012 9:37 PM, Andrew Price wrote: >> The mount.gfs2 helper utility has been removed as it is no longer >> required to mount gfs2 file systems. gfs2_tool and gfs2_quota have also >> been removed. Users of gfs2_quota should now use the generic quota >> utilities and users of gfs2_tool should now use tunegfs2, gfs2 mount >> options and the generic dmsetup and chattr/lsattr tools. > > IIRC there is a specific minimum kernel version for mount.gfs2 to be > obsoleted and quota tool version to obsolete gfs2_quota. Might be a good > idea to document it, so that users won't attempt random back-ports. Yes, it looks like mount.gfs2 hasn't been required since kernel 2.6.36 Andy From dev at sdd.jp Wed Nov 14 07:09:38 2012 From: dev at sdd.jp (Antonio Castellano) Date: Wed, 14 Nov 2012 16:09:38 +0900 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <1352715560.2721.9.camel@menhir> References: <1352715560.2721.9.camel@menhir> <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <1352876980602019000002ec@sv0.inside.kobe.sdd.jp> Hi, Steven. Thank you for the reply. I'm sending you here the syslog portion where the problem appears. Maybe it will be of some help. The kernel version is 2.6.18-308.11.1.el5PAE.
Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: fatal: invalid metadata block Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2:   bh = 151918444 (magic number) Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2:   function = get_leaf, file = fs/gfs2/dir.c, line = 763 Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: about to withdraw this file system Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: telling LM to withdraw Nov 12 15:50:17 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: withdrawn Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lm_withdraw+0x8d/0xb0 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_meta_check_ii+0x28/0x33 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] get_leaf+0x5e/0x9d [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] get_first_leaf+0x24/0x2a [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_dirent_search+0x81/0x180 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_dirent_find+0x0/0x4c [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] run_queue+0xbd/0x18a [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_dir_search+0x1d/0x7f [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] permission+0xa2/0xb5 Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lookupi+0x116/0x14f [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lookupi+0xd0/0x14f [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lookup+0x1b/0x8e [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_glock_put+0xcf/0xe7 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] d_alloc+0x151/0x17f Nov 12 15:50:17 blahblah6 kernel:  [] do_lookup+0x102/0x1b6 Nov 12 15:50:17 blahblah6 kernel:  [] __link_path_walk+0x318/0xd1d Nov 12 15:50:17 blahblah6 kernel:  [] link_path_walk+0x3a/0x99 Nov 12 15:50:17 blahblah6 kernel:  [] do_path_lookup+0x231/0x297 Nov 12 15:50:17 blahblah6 kernel:  [] __user_walk_fd+0x29/0x3a Nov 12 15:50:17 blahblah6 kernel:  [] vfs_stat_fd+0x15/0x3c Nov 12 15:50:17 blahblah6 kernel:  [] sys_stat64+0xf/0x23 Nov 12 15:50:17 blahblah6 kernel:  [] do_page_fault+0x356/0x653 Nov 12 15:50:17 blahblah6 kernel:  [] __fput+0x15c/0x184 Nov 12 15:50:17 blahblah6 kernel:  [] do_page_fault+0x0/0x653 Nov 12 15:50:17 blahblah6 kernel:  [] sysenter_past_esp+0x56/0x79 We have 5 servers accessing a shared filesystem that consists of 24 virtual disks on top of multiple HDDs using GSF2. Once this problem happens in a virtual disk, we can't write into it (but the rest of the virtual disks keep on working without any problem). Also, it seems that running fsck fixes the virtual disk temporarily, but after a while it breaks again. Is there any way to fix this problem, or at least reduce how often it happens (it's happening almost every day in our system), without having to inst all an older kernel version? Best regards, > Hi, > > On Mon, 2012-11-12 at 15:24 +0900, Antonio Castellano wrote: > > Hi, > > > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. > > > > This is the link related to the text reported in our log: > > https://access.redhat.com/knowledge/ja/node/141203 > > > > And this is the bugzilla link: > > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > > > Is there anybody out there that can help me? The help will be greatly appreciated. > > > > Thank you very much! > > > Assuming that you are a Red Hat customer, please open a ticket. 
The bug > mostly contains customer's private data, so that I don't think opening > this one up would help much as there would be little that we could > share. > > This is though, our highest priority bug at the moment (when I say our, > I mean the GFS2 team). There is a simple workaround (just use a slightly > older kernel) which is one reason why we've had trouble in tracing this, > because people are (understandably) using that rather than running the > kernel we've built to debug this issue. > > We've been unable to reproduce this internally, despite trying many > different workloads. If you are in a position to help us debug the > issue, then any assistance is very gratefully received, > > Steve. > > > -- Antonio Castellano [DEV at SDD.jp] Seventh Dimension Design, Inc. http://www.SDD.jp VOICE: +81-78-252-8855, FAX: +81-78-252-8856 From getridofthespam at yahoo.com Wed Nov 14 14:54:07 2012 From: getridofthespam at yahoo.com (getridofthespam) Date: Wed, 14 Nov 2012 06:54:07 -0800 (PST) Subject: [Linux-cluster] gfs for exsisting disks Message-ID: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> Hi all, I have a Centos 6.3 with a SAN storage attached. mount extract: /dev/mapper/mpathcp1 on /3parslice1 type ext4 (rw) /dev/mapper/mpathbp1 on /3parslice2 type ext4 (rw) The slices are ext4 formatted. Now I want to add a second server that needs to access the same disk slices. Is gfs the solution? Can I keep the data on the disks? Any procedure to follow available? Tnx for all answers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmr at redhat.com Wed Nov 14 15:05:54 2012 From: bmr at redhat.com (Bryn M. Reeves) Date: Wed, 14 Nov 2012 15:05:54 +0000 Subject: [Linux-cluster] gfs for exsisting disks In-Reply-To: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> References: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> Message-ID: <50A3B352.3000003@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/14/2012 02:54 PM, getridofthespam wrote: > /dev/mapper/mpathcp1 on /3parslice1 type ext4 (rw) > /dev/mapper/mpathbp1 on /3parslice2 type ext4 (rw) > > The slices are ext4 formatted. > > Now I want to add a second server that needs to access the same > disk slices. > > Is gfs the solution? Can I keep the data on the disks? Any > procedure to follow available? GFS (or better GFS2..) would be one solution but you cannot "convert" from an ext type file system; you would need to backup and restore to a newly-created GFS2 volume. You could also consider using a network file system exported from one host or an external filer as an alternative to sharing the data between two hosts. It's difficult to tell whether a cluster file system like GFS2 is "the solution" without knowing what application will use it and how the app is structured; this is key to determining how a clustered file system will perform and is an important factor in deciding which is the best option for a given case. Regards, Bryn. 
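As a rough sketch of that backup-and-restore path (the cluster name "mycluster", the file system name "slice1" and the /backup destination below are placeholders rather than anything from the original post, and a working cman/fencing setup on both nodes is assumed), the steps for one of the slices would look something like:

# copy the existing ext4 data somewhere safe, then unmount it
rsync -aHAX /3parslice1/ /backup/3parslice1/
umount /3parslice1

# re-create the LUN as GFS2 with one journal per node; the -t value
# must be <clustername>:<fsname>, matching the name in cluster.conf
mkfs.gfs2 -p lock_dlm -t mycluster:slice1 -j 2 /dev/mapper/mpathcp1

# mount it on both nodes and restore the data
mount -t gfs2 /dev/mapper/mpathcp1 /3parslice1
rsync -aHAX /backup/3parslice1/ /3parslice1/

The same sequence would then be repeated for /dev/mapper/mpathbp1, and the restored data should of course be verified before the old ext4 copies are discarded.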
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlCjs1IACgkQ6YSQoMYUY95k0gCeIm0buQfFVJocBOxoYaWKexjK 7BwAn2FacRUL0Ba8veE2G7rz20ijTjXl =1/4Y -----END PGP SIGNATURE----- From lists at alteeve.ca Wed Nov 14 15:07:12 2012 From: lists at alteeve.ca (Digimer) Date: Wed, 14 Nov 2012 10:07:12 -0500 Subject: [Linux-cluster] gfs for exsisting disks In-Reply-To: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> References: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> Message-ID: <50A3B3A0.7030903@alteeve.ca> On 11/14/2012 09:54 AM, getridofthespam wrote: > Hi all, > > I have a Centos 6.3 with a SAN storage attached. mount extract: > > /dev/mapper/mpathcp1 on /3parslice1 type ext4 (rw) > /dev/mapper/mpathbp1 on /3parslice2 type ext4 (rw) > > The slices are ext4 formatted. > > Now I want to add a second server that needs to access the > same disk slices. > > Is gfs the solution? Can I keep the data on the disks? > Any procedure to follow available? > > Tnx for all answers. Unless there is some voodoo I don't know about, no, you will need to backup, reformat gfs2 and restore the files. Yes, gfs2 will allow 2+ nodes to access the same data on the SAN, but there are considerations to be aware of. First is that the distributed locking (dlm) comes at an overhead cost. before each write can happen, a lock must be requested from the cluster. If you have disk intensive apps, this might cause unacceptable delays. Also, you *must must must* have testing, working fencing for gfs2 to be safe. So it might be worth putting together a test case before you commit to converting production boxes, if you have the resources. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From rossnick-lists at cybercat.ca Wed Nov 14 19:06:26 2012 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 14 Nov 2012 14:06:26 -0500 Subject: [Linux-cluster] Can't get apache resource agent working Message-ID: <50A3EBB2.3010009@cybercat.ca> Hi ! I am trying to add a apache resource to a service, and I can't get it to work. Here's my service : The apache config file is basicly a copy of /etc/httpd/conf/httpd.conf, tailored to my needs, with PidFIle "/var/run/cluster/apache/apache:SandBoxHttpd.pid" in it. If I do : /usr/sbin/httpd -f /CyberCat/SandBox/etc/httpd.conf It works perfectly fine, and it creates the pid ar the proper location. So I used rg_test : rg_test test ./cluster.conf start service SandBox Starting SandBox... 
/dev/dm-11 already mounted [clusterfs] /dev/dm-11 already mounted 192.168.110.29 already configured [ip] 192.168.110.29 already configured 192.168.112.29 already configured [ip] 192.168.112.29 already configured Verifying Configuration Of apache:SandBoxHttpd [apache] Verifying Configuration Of apache:SandBoxHttpd Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > Succeed [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > Succeed Monitoring Service apache:SandBoxHttpd [apache] Monitoring Service apache:SandBoxHttpd Checking Existence Of File /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > Failed [apache] Checking Existence Of File /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > Failed Monitoring Service apache:SandBoxHttpd > Service Is Not Running [apache] Monitoring Service apache:SandBoxHttpd > Service Is Not Running Starting Service apache:SandBoxHttpd [apache] Starting Service apache:SandBoxHttpd Looking For IP Addresses [apache] Looking For IP Addresses 0 IP addresses found for SandBox/SandBoxHttpd [apache] 0 IP addresses found for SandBox/SandBoxHttpd Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP Addresses Found [apache] Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP Addresses Found Failed to start SandBox So it seems rgmanager can't find IP addresses for this service, and I can't figure why. I have other services that uses mysql resource agent, and the work perfectly with the exact same hiearchy of service/fs/ip,etc. I've also tried this config : With the same outcome. Thanks for any insights. From rossnick-lists at cybercat.ca Thu Nov 15 15:34:34 2012 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 15 Nov 2012 10:34:34 -0500 Subject: [Linux-cluster] Can't get apache resource agent working In-Reply-To: <50A3EBB2.3010009@cybercat.ca> References: <50A3EBB2.3010009@cybercat.ca> Message-ID: <50A50B8A.2040006@cybercat.ca> > I am trying to add a apache resource to a service, and I can't get it to > work. > > Here's my service : > > > > > > name="SandBoxHttpd"/> > > > > The apache config file is basicly a copy of /etc/httpd/conf/httpd.conf, > tailored to my needs, with PidFIle > "/var/run/cluster/apache/apache:SandBoxHttpd.pid" in it. > > If I do : > > /usr/sbin/httpd -f /CyberCat/SandBox/etc/httpd.conf > > It works perfectly fine, and it creates the pid ar the proper location. > > So I used rg_test : > > rg_test test ./cluster.conf start service SandBox > > Starting SandBox... 
> /dev/dm-11 already mounted > [clusterfs] /dev/dm-11 already mounted > 192.168.110.29 already configured > [ip] 192.168.110.29 already configured > 192.168.112.29 already configured > [ip] 192.168.112.29 already configured > Verifying Configuration Of apache:SandBoxHttpd > [apache] Verifying Configuration Of apache:SandBoxHttpd > Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > > Succeed > [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > > Succeed > Monitoring Service apache:SandBoxHttpd > [apache] Monitoring Service apache:SandBoxHttpd > Checking Existence Of File > /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > > Failed > [apache] Checking Existence Of File > /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > > Failed > Monitoring Service apache:SandBoxHttpd > Service Is Not Running > [apache] Monitoring Service apache:SandBoxHttpd > Service Is Not Running > Starting Service apache:SandBoxHttpd > [apache] Starting Service apache:SandBoxHttpd > Looking For IP Addresses > [apache] Looking For IP Addresses > 0 IP addresses found for SandBox/SandBoxHttpd > [apache] 0 IP addresses found for SandBox/SandBoxHttpd > Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP > Addresses Found > [apache] Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP > Addresses Found > Failed to start SandBox > > So it seems rgmanager can't find IP addresses for this service, and I > can't figure why. I have other services that uses mysql resource agent, > and the work perfectly with the exact same hiearchy of service/fs/ip,etc. > > I've also tried this config : > > > > > name="SandBoxHttpd"/> > > > > With the same outcome. > > Thanks for any insights. > Even if I add a new service with ccs like this : ccs -f cluster.conf --addservice SandBox3 ccs -f cluster.conf --addsubservice SandBox3 ip address="192.168.112.29" ccs -f cluster.conf --addsubservice SandBox3 apache name="testapache" It fails to find any IP addresses. From jajcus at jajcus.net Mon Nov 19 09:16:48 2012 From: jajcus at jajcus.net (Jacek Konieczny) Date: Mon, 19 Nov 2012 10:16:48 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown Message-ID: <20121119091647.GA20419@jajo.eggsoft> Hi, I am setting up a cluster using: Linux kernel 3.6.6 Corosync 2.1.0 DLM 4.0.0 CLVMD 2.02.98 Pacemaker 1.1.8 DRBD 8.3.13 Now I have stuck on the 'clean shutdown of a node' scenario. It goes like that: - resources using the shared storage are properly stopped by Pacemaker. - DRBD is cleanly demoted and unconfigured by Pacemaker - Pacemaker cleanly exits - CLVMD is stopped. ? dlm_controld is stopped ? corosync is being stopped and at this point the node is fenced (rebooted) by the dlm_controld on the other node. I would expect it continue with a clean shutdown. Any idea how to debug/fix it? Is this '541 cpg_dispatch error 9' the problem? Logs from the node being shut down (log file system mounted with the 'sync' option, syslog shutdown delayed as much as possible): Kernel: Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker terminated Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating drbd0_worker Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the lockspace group... 
Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event done 0 0 Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: release_lockspace final free Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to node 2 Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to node 1 User space: Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping cib: Sent -15 to process 1279 Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] cib:1279:0x7fc4240008d0 is now disconnected from corosync Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: Shutdown complete Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync service engines. Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration map access Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration service Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. 
Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally Logs from the surviving node: Kernel: Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( Unconnected -> WFConnection ) Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss 1 done Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: dlm_recover_members 1 nodes Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 slots 1 1:1 Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: dlm_recover_directory Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: dlm_recover_directory 0 in 0 new Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: dlm_recover_directory 0 out 0 messages Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: dlm_recover_masters Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: dlm_recover_masters 0 of 1 Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: dlm_recover_locks 0 out Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: dlm_recover_locks 0 in Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: dlm_recover_rsbs 1 done Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 generation 15 done: 0 ms Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to node 2 Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down User space: Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node dev1n2 for shutdown Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop stonith-dev1n1 (dev1n2) Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: do_shutdown of dev1n2 (op 63) is complete Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 Nov 19 09:49:43 dev1n1 crmd[1080]: notice: 
corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous transition Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or left the membership and a new membership (10.28.45.27:30736) was formed. Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service synchronization, ready to provide service. Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 nodedown time 1353314983 fence_all dlm_stonith Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] ip:192.168.1.2 left Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: 71447261-0e53-4b20-b628-d3f026a4ae24 (0) Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool output: Chassis Power Control: Reset Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: OK Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 Greets, Jacek From jajcus at jajcus.net Mon Nov 19 09:39:20 2012 From: jajcus at jajcus.net (Jacek Konieczny) Date: Mon, 19 Nov 2012 10:39:20 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119091647.GA20419@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> Message-ID: <20121119093920.GB20419@jajo.eggsoft> On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: > It goes like that: > - resources using the shared storage are properly stopped by Pacemaker. > - DRBD is cleanly demoted and unconfigured by Pacemaker > - Pacemaker cleanly exits > - CLVMD is stopped. > ? dlm_controld is stopped > ? corosync is being stopped > > and at this point the node is fenced (rebooted) by the dlm_controld on > the other node. I would expect it continue with a clean shutdown. > > Any idea how to debug/fix it? > Is this '541 cpg_dispatch error 9' the problem? I found a workaround: I have added a 10 seconds pause between dlm_controld and corosync shutdown. The node shuts down cleanly now (is not fenced). '541 cpg_dispatch error 9' is still there in the logs, though. Greets, Jacek From teigland at redhat.com Mon Nov 19 15:23:19 2012 From: teigland at redhat.com (David Teigland) Date: Mon, 19 Nov 2012 10:23:19 -0500 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119093920.GB20419@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> Message-ID: <20121119152319.GA19052@redhat.com> On Mon, Nov 19, 2012 at 10:39:20AM +0100, Jacek Konieczny wrote: > On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: > > It goes like that: > > - resources using the shared storage are properly stopped by Pacemaker. > > - DRBD is cleanly demoted and unconfigured by Pacemaker > > - Pacemaker cleanly exits > > - CLVMD is stopped. > > ??? 
dlm_controld is stopped > > ??? corosync is being stopped > > > > and at this point the node is fenced (rebooted) by the dlm_controld on > > the other node. I would expect it continue with a clean shutdown. > > > > Any idea how to debug/fix it? > > Is this '541 cpg_dispatch error 9' the problem? > > I found a workaround: I have added a 10 seconds pause between > dlm_controld and corosync shutdown. The node shuts down cleanly now (is > not fenced). '541 cpg_dispatch error 9' is still there in the logs, > though. corosync-cfgtool -H is supposed to shut down corosync cleanly using the cfg_shutdown_callback. It looks like it may not be doing that. From jfriesse at redhat.com Mon Nov 19 16:11:45 2012 From: jfriesse at redhat.com (Jan Friesse) Date: Mon, 19 Nov 2012 17:11:45 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119152319.GA19052@redhat.com> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> <20121119152319.GA19052@redhat.com> Message-ID: <50AA5A41.2030402@redhat.com> David Teigland napsal(a): > On Mon, Nov 19, 2012 at 10:39:20AM +0100, Jacek Konieczny wrote: >> On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: >>> It goes like that: >>> - resources using the shared storage are properly stopped by Pacemaker. >>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>> - Pacemaker cleanly exits >>> - CLVMD is stopped. >>> ??? dlm_controld is stopped >>> ??? corosync is being stopped >>> >>> and at this point the node is fenced (rebooted) by the dlm_controld on >>> the other node. I would expect it continue with a clean shutdown. >>> >>> Any idea how to debug/fix it? >>> Is this '541 cpg_dispatch error 9' the problem? >> >> I found a workaround: I have added a 10 seconds pause between >> dlm_controld and corosync shutdown. The node shuts down cleanly now (is >> not fenced). '541 cpg_dispatch error 9' is still there in the logs, >> though. > > corosync-cfgtool -H is supposed to shut down corosync cleanly using the > cfg_shutdown_callback. It looks like it may not be doing that. > I don't think it's about corosync not shut down cleanly. As can be seen in logs: ... Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally From anprice at redhat.com Mon Nov 19 17:56:36 2012 From: anprice at redhat.com (Andrew Price) Date: Mon, 19 Nov 2012 17:56:36 +0000 Subject: [Linux-cluster] Please help translating gfs2-utils Message-ID: <50AA72D4.5080305@redhat.com> Hi all, gfs2-utils is an open source project containing tools necessary for creating, checking, tuning and manipulating gfs2 file systems. The gfs2-utils package is required wherever gfs2 file systems are used, particularly in Linux clusters. I'm currently trying to improve the strings in upstream gfs2-utils for localisation. In the meantime, I'd like to drum up interest in progressing the translation effort. We have a Transifex project set up and open for translations: https://www.transifex.com/projects/p/gfs2-utils/ i18n support is a fairly recent addition to the project so the strings likely require some work to make life easy for translators. 
If there are any issues please contact me or file a bug report at http://bugzilla.redhat.com/ under the gfs2-utils package in Fedora / Rawhide, and I'll try to get the strings updated as soon as I can. Whether bug reports or translations, any help you can provide in translating gfs2-utils into different languages, and making it easier to do so, would be greatly appreciated. Regards, Andy Price From uxbod at splatnix.net Mon Nov 19 21:58:54 2012 From: uxbod at splatnix.net (Phil Daws) Date: Mon, 19 Nov 2012 21:58:54 +0000 (GMT) Subject: [Linux-cluster] Thin (sparse) provisioning In-Reply-To: <1454666582.455812.1353361979619.JavaMail.root@innovot.com> Message-ID: <324998424.456303.1353362334446.JavaMail.root@innovot.com> Hello: am learning about clustering with DRBD and GFS2 and have a question about thin provisioning. I would like to set up a number of individual vservers that reside on their own LVs which can then be shared between two nodes and flipped backwards and forwards using Pacemaker. When setting up the block/lvm device for DRBD I have used: lvcreate --virtualsize 1T --size 10G --name vserver01 vg1 once that has been added as a resource would I perform a standard mkfs.gfs2 or do I need to specify any further options; I was thinking something like: mkfs.gfs2 -t vservercluster:vservers -p lock_dlm -j 2 /dev/vservermirror/vserver01 Is that the way I should be doing it ? Thanks. From jamescyriac76 at gmail.com Tue Nov 20 10:58:57 2012 From: jamescyriac76 at gmail.com (james cyriac) Date: Tue, 20 Nov 2012 14:58:57 +0400 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? Message-ID: Hi all, i am installing redhat cluster 6 two node cluser.the issue is i am not able to mount my GFS file sytem in both the node at same time.. please find my clustat output .. [root at saperpprod01 ~]# clustat Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ saperpprod01 1 Online, Local, rgmanager saperpprod02 2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:oracle saperpprod01 started service:profile-gfs saperpprod01 started service:sap saperpprod01 started [root at saperpprod01 ~]# oralce and sap is fine and it is flaying in both nodes.i want mount my GFS vols same time at both the nodes. Thanks in advacne james but profile-gfs is GFS file system and i want present the GFS mount point same time both the node.please help me this On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: > Hi, > > I am setting up a cluster using: > > Linux kernel 3.6.6 > Corosync 2.1.0 > DLM 4.0.0 > CLVMD 2.02.98 > Pacemaker 1.1.8 > DRBD 8.3.13 > > Now I have stuck on the 'clean shutdown of a node' scenario. > > It goes like that: > - resources using the shared storage are properly stopped by Pacemaker. > - DRBD is cleanly demoted and unconfigured by Pacemaker > - Pacemaker cleanly exits > - CLVMD is stopped. > ? dlm_controld is stopped > ? corosync is being stopped > > and at this point the node is fenced (rebooted) by the dlm_controld on > the other node. I would expect it continue with a clean shutdown. > > Any idea how to debug/fix it? > Is this '541 cpg_dispatch error 9' the problem? 
> > Logs from the node being shut down (log file system mounted with the 'sync' > option, syslog shutdown delayed as much as possible): > > Kernel: > Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker > terminated > Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating > drbd0_worker > Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the > lockspace group... > Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event > done 0 0 > Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: > release_lockspace final free > Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to > node 2 > Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to > node 1 > > User space: > Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping > cib: Sent -15 to process 1279 > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: > Disconnecting from Corosync > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > cib:1279:0x7fc4240008d0 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: > Disconnecting from Corosync > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd > Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: > Shutdown complete > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 > Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 > Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 > Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync > service engines. > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync vote quorum service v1.0 > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync configuration map access > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync configuration service > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync cluster closed process group service v1.01 > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync cluster quorum service v0.1 > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync profile loading service > Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the > watchdog. 
> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync watchdog service > Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine > exiting normally > > > Logs from the surviving node: > > Kernel: > Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( > Unconnected -> WFConnection ) > Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss > 1 done > Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: > dlm_recover_members 1 nodes > Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 > slots 1 1:1 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: > dlm_recover_directory > Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: > dlm_recover_directory 0 in 0 new > Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: > dlm_recover_directory 0 out 0 messages > Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: > dlm_recover_masters > Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: > dlm_recover_masters 0 of 1 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: > dlm_recover_locks 0 out > Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: > dlm_recover_locks 0 in > Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: > dlm_recover_rsbs 1 done > Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 > generation 15 done: 0 ms > Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to > node 2 > Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down > > User space: > Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node > dev1n2 for shutdown > Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: > Couldn't expand vpbx_vg_cl_demote_0 > Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: > Couldn't expand vpbx_vg_cl_demote_0 > Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop > stonith-dev1n1 (dev1n2) > Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: > Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 > Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete > Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 > Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: > do_shutdown of dev1n2 (op 63) is complete > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 > Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 > Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM 
] Retransmit List: 1e9 > Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 > Nov 19 09:49:43 dev1n1 crmd[1080]: notice: > corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous > transition > Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: > corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost > Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or > left the membership and a new membership (10.28.45.27:30736) was formed. > Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service > synchronization, ready to provide service. > Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 > nodedown time 1353314983 fence_all dlm_stonith > Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] > ip:192.168.1.2 left > Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client > stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' > Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: > initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: > 71447261-0e53-4b20-b628-d3f026a4ae24 (0) > Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool > output: Chassis Power Control: Reset > Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: > Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host > 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) > Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: > Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: > OK > Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer > dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK > (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 > > Greets, > Jacek > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Tue Nov 20 11:07:20 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 20 Nov 2012 12:07:20 +0100 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: You have to use /etc/fstab with _netdev option, redhat cluster doesn't support active/active service 2012/11/20 james cyriac > Hi all, > > i am installing redhat cluster 6 two node cluser.the issue is i am not > able to mount my GFS file sytem in both the node at same time.. > > please find my clustat output .. > > > [root at saperpprod01 ~]# clustat > Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 > Member Status: Quorate > Member Name ID > Status > ------ ---- ---- > ------ > saperpprod01 1 > Online, Local, rgmanager > saperpprod02 2 > Online, rgmanager > Service Name Owner > (Last) State > ------- ---- ----- > ------ ----- > service:oracle > saperpprod01 started > service:profile-gfs > saperpprod01 started > service:sap > saperpprod01 started > [root at saperpprod01 ~]# > oralce and sap is fine and it is flaying in both nodes.i want mount my GFS > vols same time at both the nodes. 
> > Thanks in advacne > james > > > but profile-gfs is GFS file system and i want present the GFS mount point > same time both the node.please help me this > On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: > >> Hi, >> >> I am setting up a cluster using: >> >> Linux kernel 3.6.6 >> Corosync 2.1.0 >> DLM 4.0.0 >> CLVMD 2.02.98 >> Pacemaker 1.1.8 >> DRBD 8.3.13 >> >> Now I have stuck on the 'clean shutdown of a node' scenario. >> >> It goes like that: >> - resources using the shared storage are properly stopped by Pacemaker. >> - DRBD is cleanly demoted and unconfigured by Pacemaker >> - Pacemaker cleanly exits >> - CLVMD is stopped. >> ? dlm_controld is stopped >> ? corosync is being stopped >> >> and at this point the node is fenced (rebooted) by the dlm_controld on >> the other node. I would expect it continue with a clean shutdown. >> >> Any idea how to debug/fix it? >> Is this '541 cpg_dispatch error 9' the problem? >> >> Logs from the node being shut down (log file system mounted with the >> 'sync' >> option, syslog shutdown delayed as much as possible): >> >> Kernel: >> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >> terminated >> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating >> drbd0_worker >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the >> lockspace group... >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event >> done 0 0 >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >> release_lockspace final free >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection >> to node 2 >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection >> to node 1 >> >> User space: >> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping >> cib: Sent -15 to process 1279 >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >> Disconnecting from Corosync >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> cib:1279:0x7fc4240008d0 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >> Disconnecting from Corosync >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: >> Shutdown complete >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync >> service engines. 
>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync vote quorum service v1.0 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync configuration map access >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync configuration service >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync cluster closed process group service v1.01 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync cluster quorum service v0.1 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync profile loading service >> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >> watchdog. >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync watchdog service >> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine >> exiting normally >> >> >> Logs from the surviving node: >> >> Kernel: >> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >> Unconnected -> WFConnection ) >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >> dlm_clear_toss 1 done >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member >> 2 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >> dlm_recover_members 1 nodes >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 >> slots 1 1:1 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >> dlm_recover_directory >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >> dlm_recover_directory 0 in 0 new >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >> dlm_recover_directory 0 out 0 messages >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >> dlm_recover_masters >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >> dlm_recover_masters 0 of 1 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >> dlm_recover_locks 0 out >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >> dlm_recover_locks 0 in >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >> dlm_recover_rsbs 1 done >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover >> 11 generation 15 done: 0 ms >> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection >> to node 2 >> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down >> >> User space: >> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node >> dev1n2 for shutdown >> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >> Couldn't expand vpbx_vg_cl_demote_0 >> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >> Couldn't expand vpbx_vg_cl_demote_0 >> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >> stonith-dev1n1 (dev1n2) >> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1d1 >> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State >> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >> cause=C_FSA_INTERNAL origin=notify_crmd ] >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >> do_shutdown of dev1n2 (op 63) is complete >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >> transition >> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >> left the membership and a new membership (10.28.45.27:30736) was formed. >> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >> synchronization, ready to provide service. >> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >> 27225 nodedown time 1353314983 fence_all dlm_stonith >> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >> ip:192.168.1.2 left >> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >> '(any)' >> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >> output: Chassis Power Control: Reset >> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >> Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: >> OK >> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer >> dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >> >> Greets, >> Jacek >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
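To make the _netdev suggestion above concrete, here is a minimal sketch for RHEL/CentOS 6; the device path and mount point are only illustrative, and the gfs2 init script is assumed to be present (it ships with the gfs2-utils/cman stack):

  # identical line in /etc/fstab on BOTH nodes
  /dev/vg03/lvol0  /usr/sap/trans  gfs2  defaults,_netdev  0 0

  # let the gfs2 init script mount it once cman and clvmd are up
  chkconfig gfs2 on
  service gfs2 start

The _netdev flag keeps the mount out of the early boot sequence; the gfs2 init script then mounts every gfs2 entry from /etc/fstab after the cluster stack has started, so the same filesystem ends up mounted on both nodes at the same time without going through rgmanager.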
URL: From jamescyriac76 at gmail.com Tue Nov 20 11:31:23 2012 From: jamescyriac76 at gmail.com (james cyriac) Date: Tue, 20 Nov 2012 15:31:23 +0400 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: Hi, can you send the detials,i have to put entry in both servers?now i created map disk 150G both servers and created in node 1 vg03 then mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 now i able to mount in first server. /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: > You have to use /etc/fstab with _netdev option, redhat cluster doesn't > support active/active service > > > 2012/11/20 james cyriac > >> Hi all, >> >> i am installing redhat cluster 6 two node cluser.the issue is i am not >> able to mount my GFS file sytem in both the node at same time.. >> >> please find my clustat output .. >> >> >> [root at saperpprod01 ~]# clustat >> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >> Member Status: Quorate >> Member Name ID >> Status >> ------ ---- ---- >> ------ >> saperpprod01 1 >> Online, Local, rgmanager >> saperpprod02 2 >> Online, rgmanager >> Service Name Owner >> (Last) State >> ------- ---- ----- >> ------ ----- >> service:oracle >> saperpprod01 started >> service:profile-gfs >> saperpprod01 started >> service:sap >> saperpprod01 started >> [root at saperpprod01 ~]# >> oralce and sap is fine and it is flaying in both nodes.i want mount my >> GFS vols same time at both the nodes. >> >> Thanks in advacne >> james >> >> >> but profile-gfs is GFS file system and i want present the GFS mount point >> same time both the node.please help me this >> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >> >>> Hi, >>> >>> I am setting up a cluster using: >>> >>> Linux kernel 3.6.6 >>> Corosync 2.1.0 >>> DLM 4.0.0 >>> CLVMD 2.02.98 >>> Pacemaker 1.1.8 >>> DRBD 8.3.13 >>> >>> Now I have stuck on the 'clean shutdown of a node' scenario. >>> >>> It goes like that: >>> - resources using the shared storage are properly stopped by Pacemaker. >>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>> - Pacemaker cleanly exits >>> - CLVMD is stopped. >>> ? dlm_controld is stopped >>> ? corosync is being stopped >>> >>> and at this point the node is fenced (rebooted) by the dlm_controld on >>> the other node. I would expect it continue with a clean shutdown. >>> >>> Any idea how to debug/fix it? >>> Is this '541 cpg_dispatch error 9' the problem? >>> >>> Logs from the node being shut down (log file system mounted with the >>> 'sync' >>> option, syslog shutdown delayed as much as possible): >>> >>> Kernel: >>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>> terminated >>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating >>> drbd0_worker >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the >>> lockspace group... 
>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event >>> done 0 0 >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>> release_lockspace final free >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection >>> to node 2 >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection >>> to node 1 >>> >>> User space: >>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping >>> cib: Sent -15 to process 1279 >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>> Disconnecting from Corosync >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>> Disconnecting from Corosync >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: >>> Shutdown complete >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync >>> service engines. >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync vote quorum service v1.0 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync configuration map access >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync configuration service >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync cluster closed process group service v1.01 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync cluster quorum service v0.1 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync profile loading service >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>> watchdog. 
>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync watchdog service >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine >>> exiting normally >>> >>> >>> Logs from the surviving node: >>> >>> Kernel: >>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>> Unconnected -> WFConnection ) >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover >>> 11 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>> dlm_clear_toss 1 done >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>> member 2 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>> dlm_recover_members 1 nodes >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>> 15 slots 1 1:1 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>> dlm_recover_directory >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>> dlm_recover_directory 0 in 0 new >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>> dlm_recover_directory 0 out 0 messages >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>> dlm_recover_masters >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>> dlm_recover_masters 0 of 1 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>> dlm_recover_locks 0 out >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>> dlm_recover_locks 0 in >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>> dlm_recover_rsbs 1 done >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover >>> 11 generation 15 done: 0 ms >>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection >>> to node 2 >>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>> Down >>> >>> User space: >>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node >>> dev1n2 for shutdown >>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>> Couldn't expand vpbx_vg_cl_demote_0 >>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>> Couldn't expand vpbx_vg_cl_demote_0 >>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >>> stonith-dev1n1 (dev1n2) >>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State >>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>> do_shutdown of dev1n2 (op 63) is complete >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>> transition >>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>> left the membership and a new membership (10.28.45.27:30736) was formed. >>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>> synchronization, ready to provide service. >>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>> ip:192.168.1.2 left >>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>> '(any)' >>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>> output: Chassis Power Control: Reset >>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>> Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: >>> OK >>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>> >>> Greets, >>> Jacek >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean at rentul.net Tue Nov 20 11:41:25 2012 From: sean at rentul.net (Sean Lutner) Date: Tue, 20 Nov 2012 06:41:25 -0500 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: Did you run lvmconf --enable-cluster? Sent from my iPhone On Nov 20, 2012, at 6:31 AM, james cyriac wrote: > Hi, > > can you send the detials,i have to put entry in both servers?now i created > > map disk 150G both servers > and created in node 1 vg03 > then > mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 > > now i able to mount in first server. 
> > > /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 > > On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >> You have to use /etc/fstab with _netdev option, redhat cluster doesn't support active/active service >> >> >> 2012/11/20 james cyriac >>> Hi all, >>> >>> i am installing redhat cluster 6 two node cluser.the issue is i am not able to mount my GFS file sytem in both the node at same time.. >>> >>> please find my clustat output .. >>> >>> >>> [root at saperpprod01 ~]# clustat >>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>> Member Status: Quorate >>> Member Name ID Status >>> ------ ---- ---- ------ >>> saperpprod01 1 Online, Local, rgmanager >>> saperpprod02 2 Online, rgmanager >>> Service Name Owner (Last) State >>> ------- ---- ----- ------ ----- >>> service:oracle saperpprod01 started >>> service:profile-gfs saperpprod01 started >>> service:sap saperpprod01 started >>> [root at saperpprod01 ~]# >>> oralce and sap is fine and it is flaying in both nodes.i want mount my GFS vols same time at both the nodes. >>> >>> Thanks in advacne >>> james >>> >>> >>> but profile-gfs is GFS file system and i want present the GFS mount point same time both the node.please help me this >>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>> Hi, >>>> >>>> I am setting up a cluster using: >>>> >>>> Linux kernel 3.6.6 >>>> Corosync 2.1.0 >>>> DLM 4.0.0 >>>> CLVMD 2.02.98 >>>> Pacemaker 1.1.8 >>>> DRBD 8.3.13 >>>> >>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>> >>>> It goes like that: >>>> - resources using the shared storage are properly stopped by Pacemaker. >>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>> - Pacemaker cleanly exits >>>> - CLVMD is stopped. >>>> ? dlm_controld is stopped >>>> ? corosync is being stopped >>>> >>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>> the other node. I would expect it continue with a clean shutdown. >>>> >>>> Any idea how to debug/fix it? >>>> Is this '541 cpg_dispatch error 9' the problem? >>>> >>>> Logs from the node being shut down (log file system mounted with the 'sync' >>>> option, syslog shutdown delayed as much as possible): >>>> >>>> Kernel: >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker terminated >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating drbd0_worker >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the lockspace group... 
>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event done 0 0 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: release_lockspace final free >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to node 2 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to node 1 >>>> >>>> User space: >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping cib: Sent -15 to process 1279 >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: Shutdown complete >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync service engines. >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration map access >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. 
>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally >>>> >>>> >>>> Logs from the surviving node: >>>> >>>> Kernel: >>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( Unconnected -> WFConnection ) >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: dlm_recover_members 1 nodes >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 slots 1 1:1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: dlm_recover_directory >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: dlm_recover_directory 0 in 0 new >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: dlm_recover_directory 0 out 0 messages >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: dlm_recover_masters >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: dlm_recover_masters 0 of 1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: dlm_recover_locks 0 out >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: dlm_recover_locks 0 in >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: dlm_recover_rsbs 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 generation 15 done: 0 ms >>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to node 2 >>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down >>>> >>>> User space: >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node dev1n2 for shutdown >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop stonith-dev1n1 (dev1n2) >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: do_shutdown of dev1n2 (op 63) is complete >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:42 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous transition >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or left the membership and a new membership (10.28.45.27:30736) was formed. >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service synchronization, ready to provide service. >>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 nodedown time 1353314983 fence_all dlm_stonith >>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] ip:192.168.1.2 left >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool output: Chassis Power Control: Reset >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: OK >>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>> >>>> Greets, >>>> Jacek >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Tue Nov 20 12:02:37 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 20 Nov 2012 13:02:37 +0100 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: Do it the same step on second server 2012/11/20 james cyriac > Hi, > > can you send the detials,i have to put entry in both servers?now i created > > map disk 150G both servers > and created in node 1 vg03 > then > mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 > > now i able to mount in first server. 
> > > /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 > > On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: > >> You have to use /etc/fstab with _netdev option, redhat cluster doesn't >> support active/active service >> >> >> 2012/11/20 james cyriac >> >>> Hi all, >>> >>> i am installing redhat cluster 6 two node cluser.the issue is i am not >>> able to mount my GFS file sytem in both the node at same time.. >>> >>> please find my clustat output .. >>> >>> >>> [root at saperpprod01 ~]# clustat >>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>> Member Status: Quorate >>> Member Name ID >>> Status >>> ------ ---- ---- >>> ------ >>> saperpprod01 1 >>> Online, Local, rgmanager >>> saperpprod02 2 >>> Online, rgmanager >>> Service Name Owner >>> (Last) State >>> ------- ---- ----- >>> ------ ----- >>> service:oracle >>> saperpprod01 started >>> service:profile-gfs >>> saperpprod01 started >>> service:sap >>> saperpprod01 started >>> [root at saperpprod01 ~]# >>> oralce and sap is fine and it is flaying in both nodes.i want mount my >>> GFS vols same time at both the nodes. >>> >>> Thanks in advacne >>> james >>> >>> >>> but profile-gfs is GFS file system and i want present the GFS mount >>> point same time both the node.please help me this >>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>> >>>> Hi, >>>> >>>> I am setting up a cluster using: >>>> >>>> Linux kernel 3.6.6 >>>> Corosync 2.1.0 >>>> DLM 4.0.0 >>>> CLVMD 2.02.98 >>>> Pacemaker 1.1.8 >>>> DRBD 8.3.13 >>>> >>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>> >>>> It goes like that: >>>> - resources using the shared storage are properly stopped by Pacemaker. >>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>> - Pacemaker cleanly exits >>>> - CLVMD is stopped. >>>> ? dlm_controld is stopped >>>> ? corosync is being stopped >>>> >>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>> the other node. I would expect it continue with a clean shutdown. >>>> >>>> Any idea how to debug/fix it? >>>> Is this '541 cpg_dispatch error 9' the problem? >>>> >>>> Logs from the node being shut down (log file system mounted with the >>>> 'sync' >>>> option, syslog shutdown delayed as much as possible): >>>> >>>> Kernel: >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>>> terminated >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: >>>> Terminating drbd0_worker >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the >>>> lockspace group... 
>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event >>>> done 0 0 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>>> release_lockspace final free >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection >>>> to node 2 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection >>>> to node 1 >>>> >>>> User space: >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping >>>> cib: Sent -15 to process 1279 >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>> Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>> Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: >>>> pcmk_shutdown_worker: Shutdown complete >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync >>>> service engines. >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync vote quorum service v1.0 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync configuration map access >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync configuration service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync cluster closed process group service v1.01 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync cluster quorum service v0.1 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync profile loading service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>>> watchdog. 
>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync watchdog service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster >>>> Engine exiting normally >>>> >>>> >>>> Logs from the surviving node: >>>> >>>> Kernel: >>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>>> Unconnected -> WFConnection ) >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover >>>> 11 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>>> dlm_clear_toss 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>>> member 2 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>>> dlm_recover_members 1 nodes >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>>> 15 slots 1 1:1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>>> dlm_recover_directory >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>>> dlm_recover_directory 0 in 0 new >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>>> dlm_recover_directory 0 out 0 messages >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>>> dlm_recover_masters >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>>> dlm_recover_masters 0 of 1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>>> dlm_recover_locks 0 out >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>>> dlm_recover_locks 0 in >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>>> dlm_recover_rsbs 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover >>>> 11 generation 15 done: 0 ms >>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection >>>> to node 2 >>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>>> Down >>>> >>>> User space: >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node >>>> dev1n2 for shutdown >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>> Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>> Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >>>> stonith-dev1n1 (dev1n2) >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State >>>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> 1d8 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>>> do_shutdown of dev1n2 (op 63) is complete >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>>> transition >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>>> left the membership and a new membership (10.28.45.27:30736) was >>>> formed. >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>>> synchronization, ready to provide service. >>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>>> ip:192.168.1.2 left >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>>> '(any)' >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>>> output: Chassis Power Control: Reset >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>>> Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: >>>> OK >>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>> >>>> Greets, >>>> Jacek >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean at rentul.net Tue Nov 20 12:30:00 2012 From: sean at rentul.net (Sean Lutner) Date: Tue, 20 Nov 2012 07:30:00 -0500 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> You don't need to do that. Running the LVM commands in one node is all you need to do assuming that its the same storage presented to both hosts. 
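To make that concrete, a rough sketch of the whole sequence on RHEL/CentOS 6 (assuming the same shared LUN is visible on both nodes and cman is already running; the volume group, LV, cluster and filesystem names below are only illustrative):

  # on BOTH nodes: switch LVM to cluster-wide locking and start clvmd
  lvmconf --enable-cluster
  chkconfig clvmd on
  service clvmd start

  # on ONE node only: create the clustered VG/LV and make the filesystem
  vgcreate -cy vg03 /dev/mapper/mpathX      # mpathX = the shared LUN, adjust to your storage
  lvcreate -l 100%FREE -n lvol0 vg03
  mkfs.gfs2 -p lock_dlm -t sap-cluster1:saptrans -j 2 /dev/vg03/lvol0

  # on BOTH nodes: the clustered LV should now be visible, so mount it
  lvs
  mount -t gfs2 /dev/vg03/lvol0 /usr/sap/trans

The -t value has to be <clustername>:<fsname>, with the cluster name exactly as it appears in cluster.conf, and -j needs at least one journal per node that will mount the filesystem. Once the manual mount works on both nodes, an /etc/fstab entry with _netdev (or a clusterfs resource in rgmanager) simply automates it at boot.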
Sent from my iPhone On Nov 20, 2012, at 7:02 AM, emmanuel segura wrote: > Do it the same step on second server > > 2012/11/20 james cyriac >> Hi, >> >> can you send the detials,i have to put entry in both servers?now i created >> >> map disk 150G both servers >> and created in node 1 vg03 >> then >> mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 >> >> now i able to mount in first server. >> >> >> /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 >> >> On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >>> You have to use /etc/fstab with _netdev option, redhat cluster doesn't support active/active service >>> >>> >>> 2012/11/20 james cyriac >>>> Hi all, >>>> >>>> i am installing redhat cluster 6 two node cluser.the issue is i am not able to mount my GFS file sytem in both the node at same time.. >>>> >>>> please find my clustat output .. >>>> >>>> >>>> [root at saperpprod01 ~]# clustat >>>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>>> Member Status: Quorate >>>> Member Name ID Status >>>> ------ ---- ---- ------ >>>> saperpprod01 1 Online, Local, rgmanager >>>> saperpprod02 2 Online, rgmanager >>>> Service Name Owner (Last) State >>>> ------- ---- ----- ------ ----- >>>> service:oracle saperpprod01 started >>>> service:profile-gfs saperpprod01 started >>>> service:sap saperpprod01 started >>>> [root at saperpprod01 ~]# >>>> oralce and sap is fine and it is flaying in both nodes.i want mount my GFS vols same time at both the nodes. >>>> >>>> Thanks in advacne >>>> james >>>> >>>> >>>> but profile-gfs is GFS file system and i want present the GFS mount point same time both the node.please help me this >>>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>>> Hi, >>>>> >>>>> I am setting up a cluster using: >>>>> >>>>> Linux kernel 3.6.6 >>>>> Corosync 2.1.0 >>>>> DLM 4.0.0 >>>>> CLVMD 2.02.98 >>>>> Pacemaker 1.1.8 >>>>> DRBD 8.3.13 >>>>> >>>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>>> >>>>> It goes like that: >>>>> - resources using the shared storage are properly stopped by Pacemaker. >>>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>>> - Pacemaker cleanly exits >>>>> - CLVMD is stopped. >>>>> ? dlm_controld is stopped >>>>> ? corosync is being stopped >>>>> >>>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>>> the other node. I would expect it continue with a clean shutdown. >>>>> >>>>> Any idea how to debug/fix it? >>>>> Is this '541 cpg_dispatch error 9' the problem? >>>>> >>>>> Logs from the node being shut down (log file system mounted with the 'sync' >>>>> option, syslog shutdown delayed as much as possible): >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker terminated >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating drbd0_worker >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the lockspace group... 
>>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event done 0 0 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: release_lockspace final free >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to node 2 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to node 1 >>>>> >>>>> User space: >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping cib: Sent -15 to process 1279 >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: Shutdown complete >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync service engines. >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration map access >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. 
>>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally >>>>> >>>>> >>>>> Logs from the surviving node: >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( Unconnected -> WFConnection ) >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: dlm_recover_members 1 nodes >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 slots 1 1:1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: dlm_recover_directory >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: dlm_recover_directory 0 in 0 new >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: dlm_recover_directory 0 out 0 messages >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: dlm_recover_masters >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: dlm_recover_masters 0 of 1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: dlm_recover_locks 0 out >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: dlm_recover_locks 0 in >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: dlm_recover_rsbs 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 generation 15 done: 0 ms >>>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to node 2 >>>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down >>>>> >>>>> User space: >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node dev1n2 for shutdown >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop stonith-dev1n1 (dev1n2) >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: do_shutdown of dev1n2 (op 63) is complete >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 
1e3 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous transition >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or left the membership and a new membership (10.28.45.27:30736) was formed. >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service synchronization, ready to provide service. >>>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 nodedown time 1353314983 fence_all dlm_stonith >>>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] ip:192.168.1.2 left >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool output: Chassis Power Control: Reset >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: OK >>>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>>> >>>>> Greets, >>>>> Jacek >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> -- >>> esta es mi vida e me la vivo hasta que dios quiera >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamescyriac76 at gmail.com Tue Nov 20 13:05:55 2012 From: jamescyriac76 at gmail.com (james cyriac) Date: Tue, 20 Nov 2012 17:05:55 +0400 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> References: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> Message-ID: Thanks to all i rebooted the node2 now i am bale to mount both servers. now how i can add this service in Cluster,becase i have to assgin a IP for this service. Thanks james On Tue, Nov 20, 2012 at 4:30 PM, Sean Lutner wrote: > You don't need to do that. 
Running the LVM commands in one node is all you > need to do assuming that its the same storage presented to both hosts. > > Sent from my iPhone > > On Nov 20, 2012, at 7:02 AM, emmanuel segura wrote: > > Do it the same step on second server > > 2012/11/20 james cyriac > >> Hi, >> >> can you send the detials,i have to put entry in both servers?now i >> created >> >> map disk 150G both servers >> and created in node 1 vg03 >> then >> mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 >> >> now i able to mount in first server. >> >> >> /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 >> >> On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >> >>> You have to use /etc/fstab with _netdev option, redhat cluster doesn't >>> support active/active service >>> >>> >>> 2012/11/20 james cyriac >>> >>>> Hi all, >>>> >>>> i am installing redhat cluster 6 two node cluser.the issue is i am not >>>> able to mount my GFS file sytem in both the node at same time.. >>>> >>>> please find my clustat output .. >>>> >>>> >>>> [root at saperpprod01 ~]# clustat >>>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>>> Member Status: Quorate >>>> Member Name ID >>>> Status >>>> ------ ---- ---- >>>> ------ >>>> saperpprod01 1 >>>> Online, Local, rgmanager >>>> saperpprod02 2 >>>> Online, rgmanager >>>> Service Name Owner >>>> (Last) State >>>> ------- ---- ----- >>>> ------ ----- >>>> service:oracle >>>> saperpprod01 started >>>> service:profile-gfs >>>> saperpprod01 started >>>> service:sap >>>> saperpprod01 started >>>> [root at saperpprod01 ~]# >>>> oralce and sap is fine and it is flaying in both nodes.i want mount my >>>> GFS vols same time at both the nodes. >>>> >>>> Thanks in advacne >>>> james >>>> >>>> >>>> but profile-gfs is GFS file system and i want present the GFS mount >>>> point same time both the node.please help me this >>>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>> >>>>> Hi, >>>>> >>>>> I am setting up a cluster using: >>>>> >>>>> Linux kernel 3.6.6 >>>>> Corosync 2.1.0 >>>>> DLM 4.0.0 >>>>> CLVMD 2.02.98 >>>>> Pacemaker 1.1.8 >>>>> DRBD 8.3.13 >>>>> >>>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>>> >>>>> It goes like that: >>>>> - resources using the shared storage are properly stopped by Pacemaker. >>>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>>> - Pacemaker cleanly exits >>>>> - CLVMD is stopped. >>>>> ? dlm_controld is stopped >>>>> ? corosync is being stopped >>>>> >>>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>>> the other node. I would expect it continue with a clean shutdown. >>>>> >>>>> Any idea how to debug/fix it? >>>>> Is this '541 cpg_dispatch error 9' the problem? >>>>> >>>>> Logs from the node being shut down (log file system mounted with the >>>>> 'sync' >>>>> option, syslog shutdown delayed as much as possible): >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>>>> terminated >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: >>>>> Terminating drbd0_worker >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving >>>>> the lockspace group... 
>>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group >>>>> event done 0 0 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>>>> release_lockspace final free >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing >>>>> connection to node 1 >>>>> >>>>> User space: >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: >>>>> Stopping cib: Sent -15 to process 1279 >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: >>>>> pcmk_shutdown_worker: Shutdown complete >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all >>>>> Corosync service engines. >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync vote quorum service v1.0 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration map access >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster closed process group service v1.01 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster quorum service v0.1 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync profile loading service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>>>> watchdog. 
>>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync watchdog service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster >>>>> Engine exiting normally >>>>> >>>>> >>>>> Logs from the surviving node: >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>>>> Unconnected -> WFConnection ) >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: >>>>> dlm_recover 11 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>>>> dlm_clear_toss 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>>>> member 2 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>>>> dlm_recover_members 1 nodes >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>>>> 15 slots 1 1:1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>>>> dlm_recover_directory >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>>>> dlm_recover_directory 0 in 0 new >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>>>> dlm_recover_directory 0 out 0 messages >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>>>> dlm_recover_masters >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>>>> dlm_recover_masters 0 of 1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>>>> dlm_recover_locks 0 out >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>>>> dlm_recover_locks 0 in >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>>>> dlm_recover_rsbs 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: >>>>> dlm_recover 11 generation 15 done: 0 ms >>>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>>>> Down >>>>> >>>>> User space: >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling >>>>> Node dev1n2 for shutdown >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >>>>> stonith-dev1n1 (dev1n2) >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>>>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>>>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>>>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: >>>>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>>>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> 1d8 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>>>> do_shutdown of dev1n2 (op 63) is complete >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>>>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>>>> transition >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>>>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>>>> left the membership and a new membership (10.28.45.27:30736) was >>>>> formed. >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>>>> synchronization, ready to provide service. >>>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>>>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>>>> ip:192.168.1.2 left >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>>>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>>>> '(any)' >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>>>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>>>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>>>> output: Chassis Power Control: Reset >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>>>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>>>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>>>> Operation reboot of dev1n2 by dev1n1 for >>>>> stonith-api.27225 at dev1n1.71447261: OK >>>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>>>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>>>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>>> >>>>> Greets, >>>>> Jacek >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> >>> -- >>> esta es mi vida e me la vivo hasta que dios quiera >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From emi2fast at gmail.com Tue Nov 20 13:15:52 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 20 Nov 2012 14:15:52 +0100 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> References: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> Message-ID: Sorry but i am talking about fstab 2012/11/20 Sean Lutner > You don't need to do that. Running the LVM commands in one node is all you > need to do assuming that its the same storage presented to both hosts. > > Sent from my iPhone > > On Nov 20, 2012, at 7:02 AM, emmanuel segura wrote: > > Do it the same step on second server > > 2012/11/20 james cyriac > >> Hi, >> >> can you send the detials,i have to put entry in both servers?now i >> created >> >> map disk 150G both servers >> and created in node 1 vg03 >> then >> mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 >> >> now i able to mount in first server. >> >> >> /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 >> >> On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >> >>> You have to use /etc/fstab with _netdev option, redhat cluster doesn't >>> support active/active service >>> >>> >>> 2012/11/20 james cyriac >>> >>>> Hi all, >>>> >>>> i am installing redhat cluster 6 two node cluser.the issue is i am not >>>> able to mount my GFS file sytem in both the node at same time.. >>>> >>>> please find my clustat output .. >>>> >>>> >>>> [root at saperpprod01 ~]# clustat >>>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>>> Member Status: Quorate >>>> Member Name ID >>>> Status >>>> ------ ---- ---- >>>> ------ >>>> saperpprod01 1 >>>> Online, Local, rgmanager >>>> saperpprod02 2 >>>> Online, rgmanager >>>> Service Name Owner >>>> (Last) State >>>> ------- ---- ----- >>>> ------ ----- >>>> service:oracle >>>> saperpprod01 started >>>> service:profile-gfs >>>> saperpprod01 started >>>> service:sap >>>> saperpprod01 started >>>> [root at saperpprod01 ~]# >>>> oralce and sap is fine and it is flaying in both nodes.i want mount my >>>> GFS vols same time at both the nodes. >>>> >>>> Thanks in advacne >>>> james >>>> >>>> >>>> but profile-gfs is GFS file system and i want present the GFS mount >>>> point same time both the node.please help me this >>>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>> >>>>> Hi, >>>>> >>>>> I am setting up a cluster using: >>>>> >>>>> Linux kernel 3.6.6 >>>>> Corosync 2.1.0 >>>>> DLM 4.0.0 >>>>> CLVMD 2.02.98 >>>>> Pacemaker 1.1.8 >>>>> DRBD 8.3.13 >>>>> >>>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>>> >>>>> It goes like that: >>>>> - resources using the shared storage are properly stopped by Pacemaker. >>>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>>> - Pacemaker cleanly exits >>>>> - CLVMD is stopped. >>>>> ? dlm_controld is stopped >>>>> ? corosync is being stopped >>>>> >>>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>>> the other node. I would expect it continue with a clean shutdown. >>>>> >>>>> Any idea how to debug/fix it? >>>>> Is this '541 cpg_dispatch error 9' the problem? 
>>>>> >>>>> Logs from the node being shut down (log file system mounted with the >>>>> 'sync' >>>>> option, syslog shutdown delayed as much as possible): >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>>>> terminated >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: >>>>> Terminating drbd0_worker >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving >>>>> the lockspace group... >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group >>>>> event done 0 0 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>>>> release_lockspace final free >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing >>>>> connection to node 1 >>>>> >>>>> User space: >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: >>>>> Stopping cib: Sent -15 to process 1279 >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: >>>>> pcmk_shutdown_worker: Shutdown complete >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all >>>>> Corosync service engines. 
>>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync vote quorum service v1.0 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration map access >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster closed process group service v1.01 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster quorum service v0.1 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync profile loading service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>>>> watchdog. >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync watchdog service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster >>>>> Engine exiting normally >>>>> >>>>> >>>>> Logs from the surviving node: >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>>>> Unconnected -> WFConnection ) >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: >>>>> dlm_recover 11 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>>>> dlm_clear_toss 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>>>> member 2 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>>>> dlm_recover_members 1 nodes >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>>>> 15 slots 1 1:1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>>>> dlm_recover_directory >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>>>> dlm_recover_directory 0 in 0 new >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>>>> dlm_recover_directory 0 out 0 messages >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>>>> dlm_recover_masters >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>>>> dlm_recover_masters 0 of 1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>>>> dlm_recover_locks 0 out >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>>>> dlm_recover_locks 0 in >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>>>> dlm_recover_rsbs 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: >>>>> dlm_recover 11 generation 15 done: 0 ms >>>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>>>> Down >>>>> >>>>> User space: >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling >>>>> Node dev1n2 for shutdown >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 
pengine[1078]: notice: LogActions: Stop >>>>> stonith-dev1n1 (dev1n2) >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>>>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>>>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>>>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: >>>>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>>>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> 1d8 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>>>> do_shutdown of dev1n2 (op 63) is complete >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>>>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>>>> transition >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>>>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>>>> left the membership and a new membership (10.28.45.27:30736) was >>>>> formed. >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>>>> synchronization, ready to provide service. 
>>>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>>>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>>>> ip:192.168.1.2 left >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>>>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>>>> '(any)' >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>>>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>>>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>>>> output: Chassis Power Control: Reset >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>>>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>>>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>>>> Operation reboot of dev1n2 by dev1n1 for >>>>> stonith-api.27225 at dev1n1.71447261: OK >>>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>>>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>>>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>>> >>>>> Greets, >>>>> Jacek >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> >>> -- >>> esta es mi vida e me la vivo hasta que dios quiera >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From felipe.o.gutierrez at gmail.com Tue Nov 20 13:54:46 2012 From: felipe.o.gutierrez at gmail.com (Felipe Gutierrez) Date: Tue, 20 Nov 2012 10:54:46 -0300 Subject: [Linux-cluster] new Resources on heartbeat can't start Message-ID: Hi everyone, I am trying to setup a new resource on my heartbeat, but for some reason the resour doesn't come on. Does anyone have some hint, please? root at cloud9:/etc/heartbeat# crm_mon -1 ============ Last updated: Tue Nov 20 10:45:38 2012 Last change: Tue Nov 20 10:41:57 2012 via crm_shadow on cloud9 Stack: Heartbeat Current DC: cloud9 (55e3a080-6988-4bb4-814c-f63b20137601) - partition with quorum Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 2 Nodes configured, unknown expected votes 2 Resources configured. 
============ Online: [ cloud10 cloud9 ] FAILOVER-IP (ocf::heartbeat:IPaddr): Started cloud9 FAILED Failed actions: FAILOVER-IP_start_0 (node=cloud9, call=92, rc=1, status=complete): unknown error failover-ip_start_0 (node=cloud9, call=4, rc=1, status=complete): unknown error FAILOVER-IP_start_0 (node=cloud10, call=4, rc=1, status=complete): unknown error failover-ip_start_0 (node=cloud10, call=48, rc=1, status=complete): unknown error root at cloud9:/etc/heartbeat# My ha.cf file is like this: # enable pacemaker, without stonith crm yes # log where ? logfacility local0 # warning of soon be dead warntime 10 # declare a host (the other node) dead after: deadtime 20 # dead time on boot (could take some time until net is up) initdead 120 # time between heartbeats keepalive 2 # What UDP port to use for udp or ppp-udp communication? # udpport 694 # bcast eth0 # mcast eth0 225.0.0.1 694 1 0 # ucast eth0 192.168.188.9 # What interfaces to heartbeat over? # udp eth0 # the nodes node cloud9 node cloud10 # heartbeats, over dedicated replication interface! ucast eth0 192.168.188.9 # ignored by node1 (owner of ip) ucast eth0 192.168.188.10 # ignored by node2 (owner of ip) # ping the switch to assure we are online ping 192.168.178.1 Best Regards, Felipe -- *-- -- Felipe Oliveira Gutierrez -- Felipe.o.Gutierrez at gmail.com -- https://sites.google.com/site/lipe82/Home/diaadia* -------------- next part -------------- An HTML attachment was scrubbed... URL: From jfriesse at redhat.com Wed Nov 21 10:19:02 2012 From: jfriesse at redhat.com (Jan Friesse) Date: Wed, 21 Nov 2012 11:19:02 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119093920.GB20419@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> Message-ID: <50ACAA96.60401@redhat.com> Jacek Konieczny napsal(a): > On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: >> It goes like that: >> - resources using the shared storage are properly stopped by Pacemaker. >> - DRBD is cleanly demoted and unconfigured by Pacemaker >> - Pacemaker cleanly exits >> - CLVMD is stopped. >> ? dlm_controld is stopped >> ? corosync is being stopped >> >> and at this point the node is fenced (rebooted) by the dlm_controld on >> the other node. I would expect it continue with a clean shutdown. >> >> Any idea how to debug/fix it? >> Is this '541 cpg_dispatch error 9' the problem? > > I found a workaround: I have added a 10 seconds pause between > dlm_controld and corosync shutdown. The node shuts down cleanly now (is > not fenced). '541 cpg_dispatch error 9' is still there in the logs, > though. > > Greets, > Jacek > Hi, we've discussed this problem with dave, but I would like to get some information: - What distro are you using? - Packages are compiled or disro? - what you mean by "clean shutdown"? This is something like service dlm_control stop, or your own script? Thanks, Honza From jajcus at jajcus.net Wed Nov 21 14:48:59 2012 From: jajcus at jajcus.net (Jacek Konieczny) Date: Wed, 21 Nov 2012 15:48:59 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <50ACAA96.60401@redhat.com> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> <50ACAA96.60401@redhat.com> Message-ID: <20121121144858.GF2125@jajo.eggsoft> On Wed, Nov 21, 2012 at 11:19:02AM +0100, Jan Friesse wrote: > Hi, > we've discussed this problem with dave, but I would like to get some > information: > - What distro are you using? 
PLD Linux > - Packages are compiled or disro? I am making packages for the distro as a part of my job. > - what you mean by "clean shutdown"? This is something like service > dlm_control stop, or your own script? systemd, using the corosync.service unit file provided with corosync sources (it is far from being 'systemd' native) and the dlm.service as comes with dlm sources (includes my patches). Shutdown is started by '/sbin/halt' or '/sbin/reboot' using standard systemd procedure. I have added some rules to make sure Pacemaker is stopped before the rest, but dlm and corosync order is not affected. Systemd kills dlm_controld first and as soon as it exits its initiates stop of corosync. Adding an artificial delay between those two fixes my problem. When calling shutdown scripts by hand or the old SysVinit way (through other shell scripts), the delay between the two jobs could be 'naturally' longer. Unfortunately, I have been distracted recently by some other, higher priority, job, so I could not do more investigation in this matter (still on my TODO, though). Greets, Jacek From jfriesse at redhat.com Wed Nov 21 16:19:02 2012 From: jfriesse at redhat.com (Jan Friesse) Date: Wed, 21 Nov 2012 17:19:02 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121121144858.GF2125@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> <50ACAA96.60401@redhat.com> <20121121144858.GF2125@jajo.eggsoft> Message-ID: <50ACFEF6.3040907@redhat.com> Jacek Konieczny napsal(a): > On Wed, Nov 21, 2012 at 11:19:02AM +0100, Jan Friesse wrote: >> Hi, >> we've discussed this problem with dave, but I would like to get some >> information: >> - What distro are you using? > > PLD Linux > >> - Packages are compiled or disro? > > I am making packages for the distro as a part of my job. > >> - what you mean by "clean shutdown"? This is something like service >> dlm_control stop, or your own script? > > systemd, using the corosync.service unit file provided with corosync > sources (it is far from being 'systemd' native) and the dlm.service Ya, far far away. But it has good reasons... > as comes with dlm sources (includes my patches). > > Shutdown is started by '/sbin/halt' or '/sbin/reboot' using standard > systemd procedure. I have added some rules to make sure Pacemaker is > stopped before the rest, but dlm and corosync order is not affected. > Ok, cool. This is information I was seeking. > Systemd kills dlm_controld first and as soon as it exits its initiates > stop of corosync. Adding an artificial delay between those two fixes my > problem. > Problem may be, that if dlm_controld refuses to exit, maybe (= this is theory) it will kill it anyway. > When calling shutdown scripts by hand or the old SysVinit way (through > other shell scripts), the delay between the two jobs could be > 'naturally' longer. > > Unfortunately, I have been distracted recently by some other, higher > priority, job, so I could not do more investigation in this matter > (still on my TODO, though). > Understand. You gave me enough information anyway, so thanks. > Greets, > Jacek > Regards, Honza From parvez.h.shaikh at gmail.com Fri Nov 23 05:25:11 2012 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 23 Nov 2012 10:55:11 +0530 Subject: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished? Message-ID: Hi experts, I am using Red Hat Cluster available on RHEL 5.5. 
And it doesn't have any inbuilt mechanism to generate SNMP traps in failures of resources or failover of services from one node to another. I have a script agent, which starts, stops and checks status of my application. Is it possible that in a script resource - to distinguish between normal startup of service / resource vs startup of service/resource in response to failover / failure handling? Doing so would help me write code to generate alarms if startup of service / resource (in my case a process) is due to failover (not normal startup). Further is it possible to get information such as cause of failure(leading to failover), and previous cluster node on which service / resource was running(prior to failover)? This would help to provide as much information as possible in traps Thanks, Parvez -------------- next part -------------- An HTML attachment was scrubbed... URL: From kolapallisatya531 at gmail.com Fri Nov 23 09:24:59 2012 From: kolapallisatya531 at gmail.com (satya suresh kolapalli) Date: Fri, 23 Nov 2012 14:54:59 +0530 Subject: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished? In-Reply-To: References: Message-ID: Hi, send the script which you have On 23 November 2012 10:55, Parvez Shaikh wrote: > Hi experts, > > I am using Red Hat Cluster available on RHEL 5.5. And it doesn't have any > inbuilt mechanism to generate SNMP traps in failures of resources or > failover of services from one node to another. > > I have a script agent, which starts, stops and checks status of my > application. Is it possible that in a script resource - to distinguish > between normal startup of service / resource vs startup of service/resource > in response to failover / failure handling? Doing so would help me write > code to generate alarms if startup of service / resource (in my case a > process) is due to failover (not normal startup). > > Further is it possible to get information such as cause of failure(leading > to failover), and previous cluster node on which service / resource was > running(prior to failover)? > > This would help to provide as much information as possible in traps > > Thanks, > Parvez > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Regards, SatyaSuresh Kolapalli Mob: 7702430892 From mgrac at redhat.com Sun Nov 25 13:41:11 2012 From: mgrac at redhat.com (Marek Grac) Date: Sun, 25 Nov 2012 14:41:11 +0100 Subject: [Linux-cluster] Fence agents - breaking compatibility Message-ID: <50B21FF7.5000404@redhat.com> Hi, In last few weeks there were a lot of internal changed in source code of fence agents to make code more readable, clean and adaptable. In order to clean up code a bit more and remove historical burden, we would like to remove/replace some options (details are in 7 patches at cluster-devel). These changes will be part of next major version of fence agents. There will be at least one upstream release (3.1.12) without these changes. 
Brief overview: * most of the fence agents: * removed option -T / test (command line / STDIN) --> you can use -o monitor / action to test if fence device is working * removed option -q / quiet --> fence agents are quiet enough by default * replaced --udpport / udpport --> use --ipport / ipport * on STDIN we also supported these options (this transition was done automatically in code) which are now replaced blade -> port option -> action fm -> port hostname -> ippaddr * (fence_drac5) removed -m / modulename / module_name --> replaced by standard -n / --plug / port this affect only Dell Drac CMC as other Drac devices do not use machine specification at all * (fence_lpar) removed -n / partition --> replaced by standard -n / --plug / port * (fence_rsb) removed -n / telnet_port --> replaced by --ipport / ipport m, From Elliott.Barrere at mywedding.com Mon Nov 26 19:18:54 2012 From: Elliott.Barrere at mywedding.com (Elliott Barrere) Date: Mon, 26 Nov 2012 19:18:54 +0000 Subject: [Linux-cluster] Set packet src address to a cluster-managed IP Message-ID: Hi everyone, I have a RHEL 5.8 cluster that manages several IP addresses (among other services). While this works fine for "serving" content (i.e. when a client hits one of the managed IP addresses the content is delivered), I also need the server to _send_ new packets from the managed address (this is an Asterisk cluster so it sends SIP invites to clients, which are rejected unless they come from the correct IP). I can successfully set the source address for packets by running something like this: ip route change 10.X.X.0/24 dev eth0 src 10.X.X.10 and this solves my problem. However, this solution is not "cluster aware", nor is it permanent across reboots. I could write a script to update the src address after the cluster IPs are applied, but that seems like a bit of a hack. Has anyone else had this problem? Any advice for how to deal with it? I can't imagine I'm the only one wanting to do this. Cheers, -elliott- From lists at alteeve.ca Tue Nov 27 05:15:16 2012 From: lists at alteeve.ca (Digimer) Date: Tue, 27 Nov 2012 00:15:16 -0500 Subject: [Linux-cluster] cluster 3.2.0 released Message-ID: <50B44C64.6000800@alteeve.ca> Welcome to the cluster 3.2.0 release. This new major release features improvements in the fencing area and several bug fixes across the stack. * New cluster recovery mechanism has been added based on hardware watchdog: ** fence_sanlock (req: wdmd 2,6+, fence_sanlock 2.6+) ** checkquorum.wdmd (req: cman 3.2.0+, wdmd 2.6+) ** Details and usage; https://alteeve.ca/w/Watchdog_Recovery * fence_check tool for verifying fence device configuration. The tool can be used in cron scripts. Please refer to the man page for operational details and caveats. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.2.0.tar.xz Change Log: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.2.0 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
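Going back to Elliott's question above: one way to make the "ip route ... src" change cluster-aware, instead of scripting it at boot, is to wrap it in a small script resource that lives in the same rgmanager service as the managed IP, so the route is only adjusted on the node that currently owns the address. A rough sketch, with placeholder subnet, source address and interface, and no real error handling:

  #!/bin/sh
  # /usr/local/sbin/src-route.sh -- set the preferred source address for the
  # local subnet to the cluster-managed IP. Called by rgmanager as a <script>
  # resource with start/stop/status.
  SUBNET=10.1.2.0/24      # placeholder
  SRC=10.1.2.10           # placeholder: the cluster-managed address
  DEV=eth0                # placeholder

  case "$1" in
    start)  ip route change $SUBNET dev $DEV src $SRC ;;
    stop)   ip route change $SUBNET dev $DEV ;;   # drop the src hint again
    status) ip route show $SUBNET | grep -q "src $SRC" ;;
    *)      echo "usage: $0 {start|stop|status}"; exit 2 ;;
  esac

In cluster.conf the script would be nested under the <ip> resource of the service, e.g. <script name="src-route" file="/usr/local/sbin/src-route.sh"/>, so it is started after the address comes up and stopped when the service moves. This is only a sketch of the idea; the iptables-based approach in the next message solves the same problem at the NAT level.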
From christian.masopust at siemens.com Tue Nov 27 07:44:19 2012 From: christian.masopust at siemens.com (Masopust, Christian) Date: Tue, 27 Nov 2012 08:44:19 +0100 Subject: [Linux-cluster] Set packet src address to a cluster-managed IP In-Reply-To: References: Message-ID: Hi Elliott, I had a similar problem with my license-server cluster (for IBM Rational ClearCase). As we found out that IBM's license daemon for ClearCase behaves very badly (sends response packets with the IP address of the NIC instead of the cluster-ip) and IBM was not able to provide a fix for that, we decided to use iptables to rewrite the addresses. For that I've added iptables servcie to my cluster configuration (only starts on that node that has the license daemon active) and configured SNAT and DNAT: iptables -A PREROUTING -d /32 -j DNAT --to-destination iptables -a POSTROUTING -s /32 -j SNAT --to-source This configuration of iptables on both nodes and (as said) iptables active only where license daemon is active and everything works fine for us :) cheers, christian > -----Urspr?ngliche Nachricht----- > Von: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von > Elliott Barrere > Gesendet: Montag, 26. November 2012 20:19 > An: > Betreff: [Linux-cluster] Set packet src address to a > cluster-managed IP > > Hi everyone, > > I have a RHEL 5.8 cluster that manages several IP addresses > (among other services). While this works fine for "serving" > content (i.e. when a client hits one of the managed IP > addresses the content is delivered), I also need the server > to _send_ new packets from the managed address (this is an > Asterisk cluster so it sends SIP invites to clients, which > are rejected unless they come from the correct IP). > > I can successfully set the source address for packets by > running something like this: > > ip route change 10.X.X.0/24 dev eth0 src 10.X.X.10 > > and this solves my problem. > > However, this solution is not "cluster aware", nor is it > permanent across reboots. I could write a script to update > the src address after the cluster IPs are applied, but that > seems like a bit of a hack. > > Has anyone else had this problem? Any advice for how to deal > with it? I can't imagine I'm the only one wanting to do this. > > Cheers, > -elliott- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From parvez.h.shaikh at gmail.com Tue Nov 27 10:23:19 2012 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Tue, 27 Nov 2012 15:53:19 +0530 Subject: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished? In-Reply-To: References: Message-ID: Kind reminder on this. Any inputs would be of great help. Basically I intend to have SNMP traps generated to notify failures and failover while using RHCS. Thanks, Parvez On Fri, Nov 23, 2012 at 2:54 PM, satya suresh kolapalli < kolapallisatya531 at gmail.com> wrote: > Hi, > > send the script which you have > > > > On 23 November 2012 10:55, Parvez Shaikh > wrote: > > Hi experts, > > > > I am using Red Hat Cluster available on RHEL 5.5. And it doesn't have any > > inbuilt mechanism to generate SNMP traps in failures of resources or > > failover of services from one node to another. > > > > I have a script agent, which starts, stops and checks status of my > > application. 
Is it possible that in a script resource - to distinguish > > between normal startup of service / resource vs startup of > service/resource > > in response to failover / failure handling? Doing so would help me write > > code to generate alarms if startup of service / resource (in my case a > > process) is due to failover (not normal startup). > > > > Further is it possible to get information such as cause of > failure(leading > > to failover), and previous cluster node on which service / resource was > > running(prior to failover)? > > > > This would help to provide as much information as possible in traps > > > > Thanks, > > Parvez > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Regards, > SatyaSuresh Kolapalli > Mob: 7702430892 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From uxbod at splatnix.net Tue Nov 27 14:58:47 2012 From: uxbod at splatnix.net (Phil Daws) Date: Tue, 27 Nov 2012 14:58:47 +0000 (GMT) Subject: [Linux-cluster] Thin (sparse) provisioning In-Reply-To: <324998424.456303.1353362334446.JavaMail.root@innovot.com> References: <324998424.456303.1353362334446.JavaMail.root@innovot.com> Message-ID: <213798016.1048691.1354028327212.JavaMail.root@innovot.com> any help of this would be gratefully appreciated. Thanks. ----- Original Message ----- From: "Phil Daws" To: Linux-cluster at redhat.com Sent: Monday, 19 November, 2012 9:58:54 PM Subject: [Linux-cluster] Thin (sparse) provisioning Hello: am learning about clustering with DRBD and GFS2 and have a question about thin provisioning. I would like to set up a number of individual vservers that reside on their own LVs which can then be shared between two nodes and flipped backwards and forwards using Pacemaker. When setting up the block/lvm device for DRBD I have used: lvcreate --virtualsize 1T --size 10G --name vserver01 vg1 once that has been added as a resource would I perform a standard mkfs.gfs2 or do I need to specify any further options; I was thinking something like: mkfs.gfs2 -t vservercluster:vservers -p lock_dlm -j 2 /dev/vservermirror/vserver01 Is that the way I should be doing it ? Thanks. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Tue Nov 27 15:14:54 2012 From: lists at alteeve.ca (Digimer) Date: Tue, 27 Nov 2012 10:14:54 -0500 Subject: [Linux-cluster] Thin (sparse) provisioning In-Reply-To: <324998424.456303.1353362334446.JavaMail.root@innovot.com> References: <324998424.456303.1353362334446.JavaMail.root@innovot.com> Message-ID: <50B4D8EE.1080606@alteeve.ca> On 11/19/2012 04:58 PM, Phil Daws wrote: > Hello: > > am learning about clustering with DRBD and GFS2 and have a question about thin provisioning. I would like to set up a number of individual vservers that reside on their own LVs which can then be shared between two nodes and flipped backwards and forwards using Pacemaker. 
When setting up the block/lvm device for DRBD I have used: > > lvcreate --virtualsize 1T --size 10G --name vserver01 vg1 > > once that has been added as a resource would I perform a standard mkfs.gfs2 or do I need to specify any further options; I was thinking something like: > > mkfs.gfs2 -t vservercluster:vservers -p lock_dlm -j 2 /dev/vservermirror/vserver01 > > Is that the way I should be doing it ? > > Thanks. I'm not entirely sure what you are trying to do here. If you want to put VMs on LVs, use clustered LVM (clvmd) and use the LVs as backing devices for the VMs. GFS2 is a great clustered FS, but no clustered FS is good for backing VMs, in my opinion. Here is how I do it: https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Provisioning_Virtual_Machines I use GFS2 for storing the install images and XML definition files only. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Elliott.Barrere at mywedding.com Wed Nov 28 20:37:33 2012 From: Elliott.Barrere at mywedding.com (Elliott Barrere) Date: Wed, 28 Nov 2012 20:37:33 +0000 Subject: [Linux-cluster] Set packet src address to a cluster-managed IP In-Reply-To: References: Message-ID: <1B1E96E7-C75B-46B9-88F5-91C1FFEE3F61@mywedding.com> That is great info, thanks! I need to run iptables all the time on my servers, but I'm sure I can work a way to add and remove the entries as needed.
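Following Christian's pattern, the add/remove part can itself be handed to the cluster so the NAT rules are only loaded on the node that owns the service. A rough sketch of a start/stop/status wrapper that could be added to the same service as the floating IP (the addresses are placeholders, and the always-on filter-table rules stay untouched because these rules live in the nat table):

  #!/bin/sh
  # /usr/local/sbin/sip-nat.sh -- add/remove source rewriting for the
  # cluster-managed address; run by rgmanager as a <script> resource.
  NIC_IP=192.168.10.5        # placeholder: address of the local NIC
  CLUSTER_IP=192.168.10.50   # placeholder: cluster-managed address

  case "$1" in
    start)
      iptables -t nat -A PREROUTING  -d $CLUSTER_IP/32 -j DNAT --to-destination $NIC_IP
      iptables -t nat -A POSTROUTING -s $NIC_IP/32     -j SNAT --to-source $CLUSTER_IP
      ;;
    stop)
      iptables -t nat -D PREROUTING  -d $CLUSTER_IP/32 -j DNAT --to-destination $NIC_IP
      iptables -t nat -D POSTROUTING -s $NIC_IP/32     -j SNAT --to-source $CLUSTER_IP
      ;;
    status)
      iptables -t nat -L POSTROUTING -n | grep -q "SNAT.*$CLUSTER_IP"
      ;;
    *) echo "usage: $0 {start|stop|status}"; exit 2 ;;
  esac

This is only an illustration of the idea, not a tested configuration; rule ordering relative to any other nat-table rules on the box would still need checking.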