From swhiteho at redhat.com Thu Nov 1 12:13:51 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 01 Nov 2012 12:13:51 +0000 Subject: [Linux-cluster] gfs2_tool unfreeze hang In-Reply-To: References: Message-ID: <1351772031.2708.24.camel@menhir> Hi, On Wed, 2012-10-31 at 14:07 -0500, james pedia wrote: > Noticed this thread for the same issue at: > > > https://www.redhat.com/archives/linux-cluster/2012-September/msg00084.html: > > > I think I hit the same issue: > > > (CentOS6.3) > # uname -r > 2.6.32-279.el6.x86_64 > > > gfs2-utils-3.0.12.1-32.el6_3.1.x86_64 is in use here. > > > > > # gfs2_tool freeze /var/www/html > # ls -l /var/www/html/ > total 8 > -rw-r--r-- 1 root root 10 Oct 30 23:47 a > -rw-r--r-- 1 root root 41 Oct 30 20:44 index.html > # cp /var/www/html/a /var/www/html/b > (HANG HERE) > > > Then try this: > # gfs2_tool unfreeze /var/www/html > (HANG AS WELL) > > > The whole cluster has to be reset to recover from this. > > > 'dmsetup suspend' and 'dmsetup resume' are working fine. > > > Are these commands basically doing the same thing ('dmsetup suspend' > vs 'gfs2_tool freeze')? > > > Is there a way to see if GFS2 file system is currently being suspended > or frozen? > Yes they do the same thing. I'd always recommend dmsetup suspend over the gfs2_tool method though, since the latter is going away in due course. There is, unfortunately, no way to check the suspend status of a GFS2 filesystem currently, Steve. > > Thanks, > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From zheka at uvt.cz Fri Nov 2 16:25:08 2012 From: zheka at uvt.cz (Yevheniy Demchenko) Date: Fri, 02 Nov 2012 17:25:08 +0100 Subject: [Linux-cluster] Monitoring Frequency - can it be changed? In-Reply-To: References: Message-ID: <5093F3E4.7090107@uvt.cz> Monitoring frequencies may be defined per resource in cluster.conf, i.e.: Detailed info here: https://fedorahosted.org/cluster/wiki/ResourceActions Also, one can change default action times per resource type in resource-agent meta-data in section. Ing. Yevheniy Demchenko Senior Linux Administrator UVT s.r.o. On 10/30/2012 11:34 AM, Parvez Shaikh wrote: > Hi experts, > > Can we change frequency at which resources are monitored by Cluster? > > I observed 30 seconds as monitoring frequency. > > Thanks, > Parvez > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Sat Nov 3 11:49:40 2012 From: carlopmart at gmail.com (C. L. Martinez) Date: Sat, 3 Nov 2012 11:49:40 +0000 Subject: [Linux-cluster] Some problems with fence_virt.conf Message-ID: Hi all, I am trying to setup a virtual kvm guest cluster under a centos 6.3 x86_64 (guests are CentOS 6.3, too). When I have setup fence_virt.conf (I will use fence_xvm/fence_virt to fence guests), these errors appears: [root at kvmhost etc]# fence_virtd -F -d99 Background mode disabled Debugging threshold is now 99 fence_virtd { debug = "99"; listener = "multicast"; backend = "libvirt"; module_path = "/usr/lib64/fence-virt"; } listeners { multicast { key_file = "/etc/cluster/fence_xvm.key"; address = "255.0.0.15"; family = "ipv4"; port = "1229"; interface = "prodif"; } } backends { libvirt { uri = "qemu:///system"; } } Backend plugin: libvirt Listener plugin: multicast Searching /usr/lib64/fence-virt for plugins... 
Searching for plugins in /usr/lib64/fence-virt Loading plugin from /usr/lib64/fence-virt/libvirt.so Registered backend plugin libvirt 0.1 Loading plugin from /usr/lib64/fence-virt/multicast.so Failed to map backend_plugin_version Registered listener plugin multicast 1.1 2 plugins found Available backends: libvirt 0.1 Available listeners: multicast 1.1 Debugging threshold is now 99 Using qemu:///system Debugging threshold is now 99 Got /etc/cluster/fence_xvm.key for key_file Got ipv4 for family Got 255.0.0.15 for address Got 1229 for port Got prodif for interface Reading in key file /etc/cluster/fence_xvm.key into 0x13bb070 (4096 max size) Actual key length = 4096 bytes Setting up ipv4 multicast receive (255.0.0.15:1229) Joining multicast group Failed to bind multicast receive socket to 255.0.0.15: Invalid argument Check network configuration. Could not set up multicast listen socket Why is not possible to bind multicast socket?? In kvm host I have installed these packages: [root at kvmhost etc]# rpm -qa | grep fence | sort fence-virt-0.2.3-9.el6.x86_64 fence-virtd-0.2.3-9.el6.x86_64 fence-virtd-libvirt-0.2.3-9.el6.x86_64 fence-virtd-multicast-0.2.3-9.el6.x86_64 From andrew at beekhof.net Mon Nov 5 05:12:40 2012 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 5 Nov 2012 16:12:40 +1100 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: Is "prodif" really the interface name? I'd have expected something like "virbr0" On Sat, Nov 3, 2012 at 10:49 PM, C. L. Martinez wrote: > Hi all, > > I am trying to setup a virtual kvm guest cluster under a centos 6.3 > x86_64 (guests are CentOS 6.3, too). When I have setup > fence_virt.conf (I will use fence_xvm/fence_virt to fence guests), > these errors appears: > > [root at kvmhost etc]# fence_virtd -F -d99 > Background mode disabled > Debugging threshold is now 99 > fence_virtd { > debug = "99"; > listener = "multicast"; > backend = "libvirt"; > module_path = "/usr/lib64/fence-virt"; > } > > listeners { > multicast { > key_file = "/etc/cluster/fence_xvm.key"; > address = "255.0.0.15"; > family = "ipv4"; > port = "1229"; > interface = "prodif"; > } > > } > > backends { > libvirt { > uri = "qemu:///system"; > } > > } > > Backend plugin: libvirt > Listener plugin: multicast > Searching /usr/lib64/fence-virt for plugins... > Searching for plugins in /usr/lib64/fence-virt > Loading plugin from /usr/lib64/fence-virt/libvirt.so > Registered backend plugin libvirt 0.1 > Loading plugin from /usr/lib64/fence-virt/multicast.so > Failed to map backend_plugin_version > Registered listener plugin multicast 1.1 > 2 plugins found > Available backends: > libvirt 0.1 > Available listeners: > multicast 1.1 > Debugging threshold is now 99 > Using qemu:///system > Debugging threshold is now 99 > Got /etc/cluster/fence_xvm.key for key_file > Got ipv4 for family > Got 255.0.0.15 for address > Got 1229 for port > Got prodif for interface > Reading in key file /etc/cluster/fence_xvm.key into 0x13bb070 (4096 max size) > Actual key length = 4096 bytes > Setting up ipv4 multicast receive (255.0.0.15:1229) > Joining multicast group > Failed to bind multicast receive socket to 255.0.0.15: Invalid argument > Check network configuration. > Could not set up multicast listen socket > > Why is not possible to bind multicast socket?? 
In kvm host I have > installed these packages: > > [root at kvmhost etc]# rpm -qa | grep fence | sort > fence-virt-0.2.3-9.el6.x86_64 > fence-virtd-0.2.3-9.el6.x86_64 > fence-virtd-libvirt-0.2.3-9.el6.x86_64 > fence-virtd-multicast-0.2.3-9.el6.x86_64 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From carlopmart at gmail.com Mon Nov 5 07:03:20 2012 From: carlopmart at gmail.com (C. L. Martinez) Date: Mon, 5 Nov 2012 07:03:20 +0000 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: > Is "prodif" really the interface name? > I'd have expected something like "virbr0" > Yes, it is correct. I don't use default bridge names provided by libvirtd ... From mgrac at redhat.com Mon Nov 5 11:00:27 2012 From: mgrac at redhat.com (Marek Grac) Date: Mon, 05 Nov 2012 12:00:27 +0100 Subject: [Linux-cluster] fence-agents 3.1.11 stable release Message-ID: <50979C4B.8020605@redhat.com> Welcome to the fence-agents 3.1.11 release. This release includes these updates: * support new API used in RHEV-M 3.1 * fence_cisco_ucs incorrect timeout value was used during login operation * support on/off also for fabric fence agents (which do not have 'reboot'). Support for enable/disable was not removed. * fence_na support XML metadata output * manual page for ipmilan was fixed to contain correct information about usage for HP iLO3, iLO4 The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.11.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this milestone. m, From andrew at beekhof.net Wed Nov 7 05:05:24 2012 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 7 Nov 2012 16:05:24 +1100 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Mon, Nov 5, 2012 at 6:03 PM, C. L. Martinez wrote: > On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: >> Is "prodif" really the interface name? >> I'd have expected something like "virbr0" >> > > Yes, it is correct. I don't use default bridge names provided by libvirtd ... Are you sure that multicast address is valid? Perhaps try: 225.0.0.12 (not 255.0.0....) From carlopmart at gmail.com Wed Nov 7 06:44:20 2012 From: carlopmart at gmail.com (C. L. Martinez) Date: Wed, 7 Nov 2012 06:44:20 +0000 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Wed, Nov 7, 2012 at 5:05 AM, Andrew Beekhof wrote: > On Mon, Nov 5, 2012 at 6:03 PM, C. L. Martinez wrote: >> On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: >>> Is "prodif" really the interface name? >>> I'd have expected something like "virbr0" >>> >> >> Yes, it is correct. I don't use default bridge names provided by libvirtd ... > > Are you sure that multicast address is valid? > Perhaps try: 225.0.0.12 (not 255.0.0....) > I have tried 225.0.0.12 too, and result is the same ... From andrew at beekhof.net Wed Nov 7 08:09:28 2012 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 7 Nov 2012 19:09:28 +1100 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: On Wed, Nov 7, 2012 at 5:44 PM, C. L. 
Martinez wrote: > On Wed, Nov 7, 2012 at 5:05 AM, Andrew Beekhof wrote: >> On Mon, Nov 5, 2012 at 6:03 PM, C. L. Martinez wrote: >>> On Mon, Nov 5, 2012 at 5:12 AM, Andrew Beekhof wrote: >>>> Is "prodif" really the interface name? >>>> I'd have expected something like "virbr0" >>>> >>> >>> Yes, it is correct. I don't use default bridge names provided by libvirtd ... >> >> Are you sure that multicast address is valid? >> Perhaps try: 225.0.0.12 (not 255.0.0....) >> > > I have tried 225.0.0.12 too, and result is the same ... Have you tried with -d99 (i think thats how you get more debug info) From bubble at hoster-ok.com Thu Nov 8 06:09:36 2012 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Thu, 08 Nov 2012 09:09:36 +0300 Subject: [Linux-cluster] Some problems with fence_virt.conf In-Reply-To: References: Message-ID: <509B4CA0.8020606@hoster-ok.com> 03.11.2012 14:49, C. L. Martinez wrote: > Hi all, > > I am trying to setup a virtual kvm guest cluster under a centos 6.3 > x86_64 (guests are CentOS 6.3, too). When I have setup > fence_virt.conf (I will use fence_xvm/fence_virt to fence guests), > these errors appears: > > [root at kvmhost etc]# fence_virtd -F -d99 > Background mode disabled > Debugging threshold is now 99 > fence_virtd { > debug = "99"; > listener = "multicast"; > backend = "libvirt"; > module_path = "/usr/lib64/fence-virt"; > } > > listeners { > multicast { > key_file = "/etc/cluster/fence_xvm.key"; > address = "255.0.0.15"; > family = "ipv4"; > port = "1229"; > interface = "prodif"; > } > > } > > backends { > libvirt { > uri = "qemu:///system"; > } > > } > > Backend plugin: libvirt > Listener plugin: multicast > Searching /usr/lib64/fence-virt for plugins... > Searching for plugins in /usr/lib64/fence-virt > Loading plugin from /usr/lib64/fence-virt/libvirt.so > Registered backend plugin libvirt 0.1 > Loading plugin from /usr/lib64/fence-virt/multicast.so > Failed to map backend_plugin_version > Registered listener plugin multicast 1.1 > 2 plugins found > Available backends: > libvirt 0.1 > Available listeners: > multicast 1.1 > Debugging threshold is now 99 > Using qemu:///system > Debugging threshold is now 99 > Got /etc/cluster/fence_xvm.key for key_file > Got ipv4 for family > Got 255.0.0.15 for address > Got 1229 for port > Got prodif for interface > Reading in key file /etc/cluster/fence_xvm.key into 0x13bb070 (4096 max size) > Actual key length = 4096 bytes > Setting up ipv4 multicast receive (255.0.0.15:1229) > Joining multicast group > Failed to bind multicast receive socket to 255.0.0.15: Invalid argument > Check network configuration. > Could not set up multicast listen socket > > Why is not possible to bind multicast socket?? In kvm host I have > installed these packages: selinux? > > [root at kvmhost etc]# rpm -qa | grep fence | sort > fence-virt-0.2.3-9.el6.x86_64 > fence-virtd-0.2.3-9.el6.x86_64 > fence-virtd-libvirt-0.2.3-9.el6.x86_64 > fence-virtd-multicast-0.2.3-9.el6.x86_64 > From lists at verwilst.be Thu Nov 8 18:43:00 2012 From: lists at verwilst.be (Bart Verwilst) Date: Thu, 08 Nov 2012 19:43:00 +0100 Subject: [Linux-cluster] Failover network device with rgmanager In-Reply-To: <506DB1C4.2080609@redhat.com> References: <2c3f847bbba16467723fe057dbded285@verwilst.be> <506DB1C4.2080609@redhat.com> Message-ID: <7107b25ea4aa7c871a40eca860514b5a@verwilst.be> Thanks a lot for the tip! 
Kind regards, Bart Lon Hohberger schreef op 04.10.2012 17:56: > On 10/04/2012 09:47 AM, Bart Verwilst wrote: >> Hi, >> >> I would like to make rgmanager manage a network interface i >> configured >> under sysconfig ( ifcfg-ethX ). It should be brought up by the >> active >> node as a resource, and ifdown'ed by the standby node. ( It's >> actually a >> GRE tunnel interface ). Is there a straightforward way on how to do >> this >> with CentOS 6.2 cman/rgmanager? >> > > 'script' resource, like: > > #!/bin/sh > > case $1 in > start) > ifup ethX > exit $? > ;; > stop) > ifdown ethX > exit $? > ;; > status) > ... > ;; > esac > > exit 1 > > -- Lon From sumodirjo at gmail.com Fri Nov 9 00:47:55 2012 From: sumodirjo at gmail.com (Muhammad Panji) Date: Fri, 9 Nov 2012 07:47:55 +0700 Subject: [Linux-cluster] Failover root cause Message-ID: Dear All, I have an oracle cluster on RHEL 6.2 with 2 servers. Several days ago the service was failover from node1 to node2. From /var/log/messages on node2 I only see this message : ... Oct 23 12:54:19 db2svr corosync[4142]: [TOTEM ] A processor failed, forming new configuration. Oct 23 12:54:21 db2svr corosync[4142]: [QUORUM] Members[1]: 2 Oct 23 12:54:21 db2svr corosync[4142]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1 Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1 ... Googling this message " [TOTEM ] A processor failed, forming new configuration." I learned that it means node2 couldn't see node1 and then fence node1. on node1 I get this message : Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing /etc/init.d/httpd status Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started. Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] (re)start Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 on 12:50 rgmanager still checking the service and then it's rebooted. Thing that make it worse is that the date / time of both servers are different so that I can't compare the logs directly. Current time difference between both servers is around 5 minutes. I would like to ask where to look for the cause of this failover? I plan to graph sar data today to see if there were bottleneck on CPU etc so that node1 could not send status to node2, but if no bottleneck on CPU or RAM etc where should I find the root cause of failover? thank you. Regards, -- Muhammad Panji http://www.panji.web.id http://www.kurungsiku.com From songyu555 at gmail.com Fri Nov 9 03:40:51 2012 From: songyu555 at gmail.com (Yu) Date: Fri, 9 Nov 2012 14:40:51 +1100 Subject: [Linux-cluster] Failover root cause In-Reply-To: References: Message-ID: Regardless what was the root cause you find. Cluster requires Ntp service to ensure all nodes have time synchronized. So you have to fix this 5 mins difference now. Regards Yu On 09/11/2012, at 11:47, Muhammad Panji wrote: > Dear All, > I have an oracle cluster on RHEL 6.2 with 2 servers. Several days ago > the service was failover from node1 to node2. From /var/log/messages > on node2 I only see this message : > > ... 
> Oct 23 12:54:19 db2svr corosync[4142]: [TOTEM ] A processor failed, > forming new configuration. > Oct 23 12:54:21 db2svr corosync[4142]: [QUORUM] Members[1]: 2 > Oct 23 12:54:21 db2svr corosync[4142]: [TOTEM ] A processor joined > or left the membership and a new membership was formed. > Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1 > Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN > Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1 > ... > > Googling this message " [TOTEM ] A processor failed, forming new > configuration." I learned that it means node2 couldn't see node1 and > then fence node1. on node1 I get this message : > > Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing > /etc/init.d/httpd status > Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started. > Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" > swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] > (re)start > Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset > Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu > Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 > (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 > (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 > > on 12:50 rgmanager still checking the service and then it's rebooted. > Thing that make it worse is that the date / time of both servers are > different so that I can't compare the logs directly. Current time > difference between both servers is around 5 minutes. > > I would like to ask where to look for the cause of this failover? I > plan to graph sar data today to see if there were bottleneck on CPU > etc so that node1 could not send status to node2, but if no bottleneck > on CPU or RAM etc where should I find the root cause of failover? > thank you. > Regards, > > > > > > -- > Muhammad Panji > http://www.panji.web.id > http://www.kurungsiku.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From parvez.h.shaikh at gmail.com Fri Nov 9 10:40:22 2012 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 9 Nov 2012 16:10:22 +0530 Subject: [Linux-cluster] fence_bladecenter - changing default fence action Message-ID: Hi experts, Is there any way to override default fence action (reboot?) for fence_bladecenter through cluster.conf? Can we specify what is fencing action (reboot/off/on) for fence_bladecenter per blade? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From binanalhalabi at yahoo.com Fri Nov 9 12:02:43 2012 From: binanalhalabi at yahoo.com (Binan AL Halabi) Date: Fri, 9 Nov 2012 04:02:43 -0800 (PST) Subject: [Linux-cluster] fence_bladecenter - changing default fence action In-Reply-To: References: Message-ID: <1352462563.57880.YahooMailNeo@web122604.mail.ne1.yahoo.com> Hi, You can specify the fencing action per blade depending on the fencing agent. 
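For illustration only -- the device name, blade number, management module address and
credentials below are placeholders, and the exact attribute names should be checked
against the fence_bladecenter man page / agent metadata for your version -- a per-blade
override in cluster.conf might look roughly like this:

    <clusternode name="node1" nodeid="1">
        <fence>
            <method name="1">
                <device name="bladecenter1" port="3" action="off"/>
            </method>
        </fence>
    </clusternode>
    ...
    <fencedevices>
        <fencedevice agent="fence_bladecenter" name="bladecenter1"
                     ipaddr="amm.example.com" login="USERID" passwd="PASSWORD"/>
    </fencedevices>

The action attribute on the <device> line is what overrides the default reboot for that
blade; on RHEL 6 you can run ccs_config_validate afterwards to confirm the file is still
a valid configuration.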
http://www.sourceware.org/cluster/doc/cluster_schema_rhel5.html

use action attribute per node in configuration file: see Example 7.4 and 7.5 here:
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-config-fencing-cli-CA.html#ex-clusterconf-fencing-fencemethods-cli-CA

// Binan

________________________________
Från: Parvez Shaikh
Till: linux clustering
Skickat: fredag, 9 november 2012 11:40
Ämne: [Linux-cluster] fence_bladecenter - changing default fence action

Hi experts,

Is there any way to override default fence action (reboot?) for
fence_bladecenter through cluster.conf?

Can we specify what is fencing action (reboot/off/on) for
fence_bladecenter per blade?

Thanks

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From queszama at yahoo.in  Sat Nov 10 02:26:15 2012
From: queszama at yahoo.in (Zama Ques)
Date: Sat, 10 Nov 2012 10:26:15 +0800 (SGT)
Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding
Message-ID: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com>

Hi All,

Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here will have answer to the issues I am facing .

I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches .
My configuration is as follows: ======== # cat /proc/net/bonding/bond0 Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) Bonding Mode: adaptive load balancing Primary Slave: None Currently Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay (ms): 0 Down Delay (ms): 0 Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 ------------ # cat /sys/class/net/bond0/bonding/mode ? balance-alb 6 # cat /sys/class/net/bond0/bonding/miimon ?? 0 ============ The issue for me is that I am seeing packet loss after configuring bonding .? Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss.? What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . Thanks Zaman From lists at alteeve.ca Sat Nov 10 02:54:33 2012 From: lists at alteeve.ca (Digimer) Date: Fri, 09 Nov 2012 21:54:33 -0500 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> Message-ID: <509DC1E9.9090704@alteeve.ca> On 11/09/2012 09:26 PM, Zama Ques wrote: > Hi All, > > Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here will have answer to the issues I am facing . > > I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches . > > My configuration is as follows: > > ======== > # cat /proc/net/bonding/bond0 > > Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) > > Bonding Mode: adaptive load balancing Primary Slave: None Currently > Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay > (ms): 0 Down Delay (ms): 0 > > Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 > > Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 > ------------ > # cat /sys/class/net/bond0/bonding/mode > > balance-alb 6 > > > # cat /sys/class/net/bond0/bonding/miimon > 0 > > ============ > > > The issue for me is that I am seeing packet loss after configuring bonding . Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss. > > What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . > > > > Thanks > Zaman You didn't share any details on your configuration, but I will assume you are using corosync. The only supported bonding mode is Active/Passive (mode=1). I've personally tried all modes, out of curiosity, and all had problems. The short of it is that if you need more that 1 gbit of performance, buy faster cards. 
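For illustration only (the NIC names, IP address and miimon value here are assumptions,
not taken from this thread), a minimal active-backup (mode=1) setup on RHEL 6 would look
roughly like this:

/etc/sysconfig/network-scripts/ifcfg-bond0:

    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=10.20.0.1
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=1 miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1 with DEVICE=eth1):

    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes

After a network restart, /proc/net/bonding/bond0 should report
"Bonding Mode: fault-tolerance (active-backup)", and you can test failover by pulling
the cable on the active slave while running a continuous ping.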
If you are interested in what I use, it's documented here: https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network I've used this setup in several production clusters and have tested failure are recovery extensively. It's proven very stable. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From queszama at yahoo.in Sat Nov 10 04:12:19 2012 From: queszama at yahoo.in (Zama Ques) Date: Sat, 10 Nov 2012 12:12:19 +0800 (SGT) Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <509DC1E9.9090704@alteeve.ca> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> Message-ID: <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> ----- Original Message ----- From: Digimer To: Zama Ques ; linux clustering Cc: Sent: Saturday, 10 November 2012 8:24 AM Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding On 11/09/2012 09:26 PM, Zama Ques wrote: > Hi All, > > Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here? will have answer to the issues I am facing . > > I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches . > > My configuration is as follows: > > ======== > # cat /proc/net/bonding/bond0 > > Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) > > Bonding Mode: adaptive load balancing Primary Slave: None Currently > Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay > (ms): 0 Down Delay (ms): 0 > > Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 > > Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 > ------------ > # cat /sys/class/net/bond0/bonding/mode > >? balance-alb 6 > > > # cat /sys/class/net/bond0/bonding/miimon >? ? 0 > > ============ > > > The issue for me is that I am seeing packet loss after configuring bonding .? Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss. > > What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . > > > > Thanks > Zaman ?> You didn't share any details on your configuration, but I will assume > you are using corosync. > The only supported bonding mode is Active/Passive (mode=1). I've > personally tried all modes, out of curiosity, and all had problems. The > short of it is that if you need more that 1 gbit of performance, buy > faster cards. > If you are interested in what I use, it's documented here: >? https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network >? I've used this setup in several production clusters and have tested >? failure are recovery extensively. It's proven very stable. :) ? Thanks Digimer for the quick response and pointing me to the link . I am yet to reach cluster configuration , initially trying to? understand ethernet bonding before going into cluster configuration. So , option for me is only to use Active/Passive bonding mode in case of clustered environment. 
Few more clarifications needed , Can we use other bonding modes in non clustered environment .? I am seeing packet loss in other modes . Also , the support of? using only mode=1 in cluster environment is it a restriction of RHEL Cluster suite or it is by design . Will be great if you clarify these queries . Thanks in Advance Zaman From lists at alteeve.ca Sat Nov 10 04:22:44 2012 From: lists at alteeve.ca (Digimer) Date: Fri, 09 Nov 2012 23:22:44 -0500 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> Message-ID: <509DD694.1000900@alteeve.ca> On 11/09/2012 11:12 PM, Zama Ques wrote: > ----- Original Message ----- > From: Digimer > To: Zama Ques ; linux clustering > Cc: > Sent: Saturday, 10 November 2012 8:24 AM > Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding > > On 11/09/2012 09:26 PM, Zama Ques wrote: >> Hi All, >> >> Need help on resolving a issue related to implementing High Availability at network level . I understand that this is not the right forum to ask this question , but since it is related to HA and Linux , I am asking here and I feel somebody here will have answer to the issues I am facing . >> >> I am trying to implement Ethernet Bonding , Both the interface in my server are connected to two different network switches . >> >> My configuration is as follows: >> >> ======== >> # cat /proc/net/bonding/bond0 >> >> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) >> >> Bonding Mode: adaptive load balancing Primary Slave: None Currently >> Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay >> (ms): 0 Down Delay (ms): 0 >> >> Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link >> Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 >> >> Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link >> Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 >> ------------ >> # cat /sys/class/net/bond0/bonding/mode >> >> balance-alb 6 >> >> >> # cat /sys/class/net/bond0/bonding/miimon >> 0 >> >> ============ >> >> >> The issue for me is that I am seeing packet loss after configuring bonding . Tried connecting both the interface to the same switch , but still seeing the packet loss . Also , tried changing miimon value to 100 , but still seeing the packet loss. >> >> What I am missing in the configuration ? Any help will be highly appreciated in resolving the problem . >> >> >> >> Thanks >> Zaman > > > You didn't share any details on your configuration, but I will assume >> you are using corosync. > >> The only supported bonding mode is Active/Passive (mode=1). I've >> personally tried all modes, out of curiosity, and all had problems. The >> short of it is that if you need more that 1 gbit of performance, buy >> faster cards. > >> If you are interested in what I use, it's documented here: > >> https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network > >> I've used this setup in several production clusters and have tested >> failure are recovery extensively. It's proven very stable. :) > > > Thanks Digimer for the quick response and pointing me to the link . I am yet to reach cluster configuration , initially trying to understand ethernet bonding before going into cluster configuration. 
So , option for me is only to use Active/Passive bonding mode in case of clustered environment. > Few more clarifications needed , Can we use other bonding modes in non clustered environment . I am seeing packet loss in other modes . Also , the support of using only mode=1 in cluster environment is it a restriction of RHEL Cluster suite or it is by design . > > Will be great if you clarify these queries . > > Thanks in Advance > Zaman Corosync is the only actively developed/supported (HA) cluster communications and membership tool. It's used on all modern distros for clustering and the requirement for mode=1 is with it. As such, it doesn't matter which OS you are on, it's the only mode that will work (reliably). The problem is that corosync needs to detect state changes quickly. It does this using the totem protocol (which serves other purposes), which passes a token around the nodes in the cluster. If a node is sent a token and the token is not returned within a time-out period, it is declared lost and a new token is dispatched. Once too many failures occur in a row, the node is declared lost and it is ejected from the cluster. This process is detailed in the link above under the "Concept; Fencing" section. With all modes other than mode=1, the failure recovery and/or the restoration of a link in the bond causes a sufficient disruption to cause a node to be declared lost. As I mentioned, this matches my experience in testing the other modes. It isn't an arbitrary rule. As for non-clustered traffic; the usefulness of other bond modes depends entirely on the traffic you are pushing over it. Personally, I am focused on HA in clusters, so I only use mode=1, regardless of the traffic designed for it. digimer ps - You will see reference to "heartbeat" as a comms layer in clustering. It's been deprecated and should not be used. Likewise, pacemaker is the future of clustering, so it should be to resource manager you learn/use. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From tc3driver at gmail.com Sat Nov 10 04:35:34 2012 From: tc3driver at gmail.com (Bill G.) Date: Fri, 9 Nov 2012 20:35:34 -0800 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> Message-ID: Hi Zaman, There are some configurations that need to be made to the switch to allow both nics to come up with the same mac. I am by no means a network expert, so I cannot think of the name of the protocol off the top of my head. I am willing to wager that the lack of that configuration is the cause of your packet loss. On Nov 9, 2012 8:22 PM, "Zama Ques" wrote: > > > > > ----- Original Message ----- > From: Digimer > To: Zama Ques ; linux clustering < > linux-cluster at redhat.com> > Cc: > Sent: Saturday, 10 November 2012 8:24 AM > Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding > > On 11/09/2012 09:26 PM, Zama Ques wrote: > > Hi All, > > > > Need help on resolving a issue related to implementing High Availability > at network level . I understand that this is not the right forum to ask > this question , but since it is related to HA and Linux , I am asking here > and I feel somebody here will have answer to the issues I am facing . 
> > > > I am trying to implement Ethernet Bonding , Both the interface in my > server are connected to two different network switches . > > > > My configuration is as follows: > > > > ======== > > # cat /proc/net/bonding/bond0 > > > > Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009) > > > > Bonding Mode: adaptive load balancing Primary Slave: None Currently > > Active Slave: eth0 MII Status: up MII Polling Interval (ms): 0 Up Delay > > (ms): 0 Down Delay (ms): 0 > > > > Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link > > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:10 Slave queue ID: 0 > > > > Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link > > Failure Count: 0 Permanent HW addr: e4:e1:5b:d0:11:14 Slave queue ID: 0 > > ------------ > > # cat /sys/class/net/bond0/bonding/mode > > > > balance-alb 6 > > > > > > # cat /sys/class/net/bond0/bonding/miimon > > 0 > > > > ============ > > > > > > The issue for me is that I am seeing packet loss after configuring > bonding . Tried connecting both the interface to the same switch , but > still seeing the packet loss . Also , tried changing miimon value to 100 , > but still seeing the packet loss. > > > > What I am missing in the configuration ? Any help will be highly > appreciated in resolving the problem . > > > > > > > > Thanks > > Zaman > > > You didn't share any details on your configuration, but I will assume > > you are using corosync. > > > The only supported bonding mode is Active/Passive (mode=1). I've > > personally tried all modes, out of curiosity, and all had problems. The > > short of it is that if you need more that 1 gbit of performance, buy > > faster cards. > > > If you are interested in what I use, it's documented here: > > > https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network > > > I've used this setup in several production clusters and have tested > > failure are recovery extensively. It's proven very stable. :) > > > Thanks Digimer for the quick response and pointing me to the link . I am > yet to reach cluster configuration , initially trying to understand > ethernet bonding before going into cluster configuration. So , option for > me is only to use Active/Passive bonding mode in case of clustered > environment. > Few more clarifications needed , Can we use other bonding modes in non > clustered environment . I am seeing packet loss in other modes . Also , > the support of using only mode=1 in cluster environment is it a > restriction of RHEL Cluster suite or it is by design . > > Will be great if you clarify these queries . > > Thanks in Advance > Zaman > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Imran.Kalam at auspost.com.au Sun Nov 11 22:32:02 2012 From: Imran.Kalam at auspost.com.au (Kalam, Imran) Date: Sun, 11 Nov 2012 22:32:02 +0000 Subject: [Linux-cluster] Cluster node1 rebooted itself Message-ID: Hi All. I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. On Sunday morning the node1 (master) has rebooted itself and I could only see the following in the message log file. Has anyone experienced the same problem? Please let me know if you need more information. Thanks Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. 
Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to reconnect... Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection refused Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request descriptor Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request descriptor Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid request descriptor Nov 11 00:12:48 clurgmgrd: [6872]: unmounting /dev/mapper/vg_shared-lv00 (/opt/xxshare) Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection refused Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection refused Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request descriptor Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). Regards Imran Kalam Technical Specialist Post IT Corporate Services Australia Post Level 2, 185 Rosslyn St. West Melbourne Phone: (03) 9322 0382 Fax: 9204 7303 Mob: 0439 559 461 A Australia Post is committed to providing our customers with excellent service. If we can assist you in any way please telephone 13 13 18 or visit our website. The information contained in this email communication may be proprietary, confidential or legally professionally privileged. It is intended exclusively for the individual or entity to which it is addressed. You should only read, disclose, re-transmit, copy, distribute, act in reliance on or commercialise the information if you are authorised to do so. Australia Post does not represent, warrant or guarantee that the integrity of this email communication has been maintained nor that the communication is free of errors, virus or interference. If you are not the addressee or intended recipient please notify us by replying direct to the sender and then destroy any electronic or paper copy of this message. Any views expressed in this email communication are taken to be those of the individual sender, except where the sender specifically attributes those views to Australia Post and is authorised to do so. Please consider the environment before printing this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lists at alteeve.ca Sun Nov 11 22:35:40 2012 From: lists at alteeve.ca (Digimer) Date: Sun, 11 Nov 2012 17:35:40 -0500 Subject: [Linux-cluster] Cluster node1 rebooted itself In-Reply-To: References: Message-ID: <50A0283C.5020808@alteeve.ca> It's hard to make much of a guess given that your cluster configuration is unknown. That said, it would seem that something interrupted comms. What is in the syslog of node 2 at the same time period? can you share you cluster.conf please (obfuscating only passwords)? On 11/11/2012 05:32 PM, Kalam, Imran wrote: > Hi All. > > I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. > On Sunday morning the node1 (master) has rebooted itself and I could > only see the following in the message log file. Has anyone experienced > the same problem? Please let me know if you need more information. Thanks > > Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 > Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined > Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly > Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to > reconnect... > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid > request descriptor > Nov 11 00:12:48 clurgmgrd: [6872]: unmounting > /dev/mapper/vg_shared-lv00 (/opt/xxshare) > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > > > *Regards* > Imran Kalam > Technical Specialist > Post IT > Corporate Services > Australia Post > Level 2, 185 Rosslyn St. West Melbourne > Phone: (03) 9322 0382 > Fax: 9204 7303 > Mob: 0439 559 461 > > A > > > > > Australia Post is committed to providing our customers with excellent > service. If we can assist you in any way please telephone 13 13 18 or > visit our website. 
> > The information contained in this email communication may be > proprietary, confidential or legally professionally privileged. It is > intended exclusively for the individual or entity to which it is > addressed. You should only read, disclose, re-transmit, copy, > distribute, act in reliance on or commercialise the information if you > are authorised to do so. Australia Post does not represent, warrant or > guarantee that the integrity of this email communication has been > maintained nor that the communication is free of errors, virus or > interference. > > If you are not the addressee or intended recipient please notify us by > replying direct to the sender and then destroy any electronic or paper > copy of this message. Any views expressed in this email communication > are taken to be those of the individual sender, except where the sender > specifically attributes those views to Australia Post and is authorised > to do so. > > Please consider the environment before printing this email. > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From sam at dotsec.com Sun Nov 11 22:17:06 2012 From: sam at dotsec.com (Sam Wilson) Date: Mon, 12 Nov 2012 08:17:06 +1000 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> Message-ID: <50A023E2.4070301@dotsec.com> With regards to what switch support is required for GNU\linux bonding its worth having a read through the docs http://www.kernel.org/doc/Documentation/networking/bonding.txt to understand the available modes in the bonding driver. As far as I understand it only mode=4 requires switch side participation in the bonding. All other modes are implemented on the host side. Cheers, Sam From Imran.Kalam at auspost.com.au Sun Nov 11 22:48:30 2012 From: Imran.Kalam at auspost.com.au (Kalam, Imran) Date: Sun, 11 Nov 2012 22:48:30 +0000 Subject: [Linux-cluster] Cluster node1 rebooted itself In-Reply-To: <50A0283C.5020808@alteeve.ca> References: <50A0283C.5020808@alteeve.ca> Message-ID: Hi Digimer. Below are the information from the second node log file and configuration is on its way. Thanks Nov 11 00:12:47 qdiskd[6704]: Writing eviction notice for node 1 Nov 11 00:12:47 kernel: CMAN: removing node node1hb from the cluster : Killed by another node Nov 11 00:12:49 qdiskd[6704]: Node 1 evicted Nov 11 00:12:55 fenced[6771]: node1hb not a cluster member after 8 sec post_fail_delay Nov 11 00:12:55 fenced[6771]: fencing node "node1hb" Nov 11 00:14:00 ccsd[6603]: Attempt to close an unopened CCS descriptor (5462880). Nov 11 00:14:00 ccsd[6603]: Error while processing disconnect: Invalid request descriptor Nov 11 00:14:00 fenced[6771]: fence "node1hb" success Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Trying to acquire journal lock... Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Looking at journal... Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Acquiring the transaction lock... Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replaying journal... 
Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replayed 4 of 4 blocks Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: replays = 4, skips = 0, sames = 0 Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Journal replayed in 1s Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Done Nov 11 00:14:07 clurgmgrd[6833]: Magma Event: Membership Change Nov 11 00:14:07 clurgmgrd[6833]: State change: node1hb DOWN Nov 11 00:16:59 kernel: CMAN: node node1hb rejoining Nov 11 00:17:08 clurgmgrd[6833]: Magma Event: Membership Change Nov 11 00:17:08 clurgmgrd[6833]: State change: node1hb UP -----Original Message----- From: Digimer [mailto:lists at alteeve.ca] Sent: Monday, 12 November, 2012 9:36 AM To: linux clustering Cc: Kalam, Imran Subject: Re: [Linux-cluster] Cluster node1 rebooted itself It's hard to make much of a guess given that your cluster configuration is unknown. That said, it would seem that something interrupted comms. What is in the syslog of node 2 at the same time period? can you share you cluster.conf please (obfuscating only passwords)? On 11/11/2012 05:32 PM, Kalam, Imran wrote: > Hi All. > > I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. > On Sunday morning the node1 (master) has rebooted itself and I could > only see the following in the message log file. Has anyone experienced > the same problem? Please let me know if you need more information. Thanks > > Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 > Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown > Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined > Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined > Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly > Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to > reconnect... > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. > Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid > request descriptor > Nov 11 00:12:48 clurgmgrd: [6872]: unmounting > /dev/mapper/vg_shared-lv00 (/opt/xxshare) > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. > Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection > refused > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. 
> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request > descriptor > Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). > > > *Regards* > Imran Kalam > Technical Specialist > Post IT > Corporate Services > Australia Post > Level 2, 185 Rosslyn St. West Melbourne > Phone: (03) 9322 0382 > Fax: 9204 7303 > Mob: 0439 559 461 > > A > > > > > Australia Post is committed to providing our customers with excellent > service. If we can assist you in any way please telephone 13 13 18 or > visit our website. > > The information contained in this email communication may be > proprietary, confidential or legally professionally privileged. It is > intended exclusively for the individual or entity to which it is > addressed. You should only read, disclose, re-transmit, copy, > distribute, act in reliance on or commercialise the information if you > are authorised to do so. Australia Post does not represent, warrant or > guarantee that the integrity of this email communication has been > maintained nor that the communication is free of errors, virus or > interference. > > If you are not the addressee or intended recipient please notify us by > replying direct to the sender and then destroy any electronic or paper > copy of this message. Any views expressed in this email communication > are taken to be those of the individual sender, except where the sender > specifically attributes those views to Australia Post and is authorised > to do so. > > Please consider the environment before printing this email. > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From sumodirjo at gmail.com Sun Nov 11 22:49:43 2012 From: sumodirjo at gmail.com (Muhammad Panji) Date: Mon, 12 Nov 2012 05:49:43 +0700 Subject: [Linux-cluster] Failover root cause In-Reply-To: References: Message-ID: Hi, I plan to implement NTP so that both servers time synchronized. How can I look for the failover cause? I already graph sar data and no peak usage on the time when db1svr was fenced by db2svr. What file (and what specific message) that I should look to know the root cause of this failover. Thank you. Regards, Panji On Fri, Nov 9, 2012 at 10:40 AM, Yu wrote: > Regardless what was the root cause you find. Cluster requires Ntp service to ensure all nodes have time synchronized. So you have to fix this 5 mins difference now. > > Regards > Yu > > On 09/11/2012, at 11:47, Muhammad Panji wrote: > >> Dear All, >> I have an oracle cluster on RHEL 6.2 with 2 servers. Several days ago >> the service was failover from node1 to node2. From /var/log/messages >> on node2 I only see this message : >> >> ... >> Oct 23 12:54:19 db2svr corosync[4142]: [TOTEM ] A processor failed, >> forming new configuration. >> Oct 23 12:54:21 db2svr corosync[4142]: [QUORUM] Members[1]: 2 >> Oct 23 12:54:21 db2svr corosync[4142]: [TOTEM ] A processor joined >> or left the membership and a new membership was formed. >> Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1 >> Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN >> Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1 >> ... >> >> Googling this message " [TOTEM ] A processor failed, forming new >> configuration." I learned that it means node2 couldn't see node1 and >> then fence node1. 
on node1 I get this message : >> >> Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing >> /etc/init.d/httpd status >> Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started. >> Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd" >> swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"] >> (re)start >> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset >> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu >> Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64 >> (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 >> (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 >> >> on 12:50 rgmanager still checking the service and then it's rebooted. >> Thing that make it worse is that the date / time of both servers are >> different so that I can't compare the logs directly. Current time >> difference between both servers is around 5 minutes. >> >> I would like to ask where to look for the cause of this failover? I >> plan to graph sar data today to see if there were bottleneck on CPU >> etc so that node1 could not send status to node2, but if no bottleneck >> on CPU or RAM etc where should I find the root cause of failover? >> thank you. >> Regards, >> >> >> >> >> >> -- >> Muhammad Panji >> http://www.panji.web.id >> http://www.kurungsiku.com >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Muhammad Panji http://www.panji.web.id http://www.kurungsiku.com From Imran.Kalam at auspost.com.au Sun Nov 11 22:49:29 2012 From: Imran.Kalam at auspost.com.au (Kalam, Imran) Date: Sun, 11 Nov 2012 22:49:29 +0000 Subject: [Linux-cluster] Packet loss after configuring Ethernet bonding In-Reply-To: <50A023E2.4070301@dotsec.com> References: <1352514375.40862.YahooMailNeo@web193003.mail.sg3.yahoo.com> <509DC1E9.9090704@alteeve.ca> <1352520739.40244.YahooMailNeo@web193002.mail.sg3.yahoo.com> <50A023E2.4070301@dotsec.com> Message-ID: Thanks, I will read over the document. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sam Wilson Sent: Monday, 12 November, 2012 9:17 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Packet loss after configuring Ethernet bonding With regards to what switch support is required for GNU\linux bonding its worth having a read through the docs http://www.kernel.org/doc/Documentation/networking/bonding.txt to understand the available modes in the bonding driver. As far as I understand it only mode=4 requires switch side participation in the bonding. All other modes are implemented on the host side. Cheers, Sam -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster Australia Post is committed to providing our customers with excellent service. If we can assist you in any way please telephone 13 13 18 or visit our website. The information contained in this email communication may be proprietary, confidential or legally professionally privileged. It is intended exclusively for the individual or entity to which it is addressed. You should only read, disclose, re-transmit, copy, distribute, act in reliance on or commercialise the information if you are authorised to do so. 
Australia Post does not represent, warrant or guarantee that the integrity of this email communication has been maintained nor that the communication is free of errors, virus or interference. If you are not the addressee or intended recipient please notify us by replying direct to the sender and then destroy any electronic or paper copy of this message. Any views expressed in this email communication are taken to be those of the individual sender, except where the sender specifically attributes those views to Australia Post and is authorised to do so. Please consider the environment before printing this email. From lists at alteeve.ca Sun Nov 11 22:54:32 2012 From: lists at alteeve.ca (Digimer) Date: Sun, 11 Nov 2012 17:54:32 -0500 Subject: [Linux-cluster] Cluster node1 rebooted itself In-Reply-To: References: <50A0283C.5020808@alteeve.ca> Message-ID: <50A02CA8.5070800@alteeve.ca> Ya, certainly looks like a network problem. If you have a support contract with Red Hat, you may want to bring them in to have a more detailed review though. I am only guessing based on what you've listed here. Cheers On 11/11/2012 05:48 PM, Kalam, Imran wrote: > Hi Digimer. > > Below are the information from the second node log file and configuration is on its way. Thanks > > Nov 11 00:12:47 qdiskd[6704]: Writing eviction notice for node 1 > Nov 11 00:12:47 kernel: CMAN: removing node node1hb from the cluster : Killed by another node > Nov 11 00:12:49 qdiskd[6704]: Node 1 evicted > Nov 11 00:12:55 fenced[6771]: node1hb not a cluster member after 8 sec post_fail_delay > Nov 11 00:12:55 fenced[6771]: fencing node "node1hb" > Nov 11 00:14:00 ccsd[6603]: Attempt to close an unopened CCS descriptor (5462880). > Nov 11 00:14:00 ccsd[6603]: Error while processing disconnect: Invalid request descriptor > Nov 11 00:14:00 fenced[6771]: fence "node1hb" success > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Trying to acquire journal lock... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Looking at journal... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Acquiring the transaction lock... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replaying journal... > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Replayed 4 of 4 blocks > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: replays = 4, skips = 0, sames = 0 > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Journal replayed in 1s > Nov 11 00:14:07 kernel: GFS: fsid=EMS_cluster1:opt-xxxshare.1: jid=0: Done > Nov 11 00:14:07 clurgmgrd[6833]: Magma Event: Membership Change > Nov 11 00:14:07 clurgmgrd[6833]: State change: node1hb DOWN > Nov 11 00:16:59 kernel: CMAN: node node1hb rejoining > Nov 11 00:17:08 clurgmgrd[6833]: Magma Event: Membership Change > Nov 11 00:17:08 clurgmgrd[6833]: State change: node1hb UP > > -----Original Message----- > From: Digimer [mailto:lists at alteeve.ca] > Sent: Monday, 12 November, 2012 9:36 AM > To: linux clustering > Cc: Kalam, Imran > Subject: Re: [Linux-cluster] Cluster node1 rebooted itself > > It's hard to make much of a guess given that your cluster configuration > is unknown. That said, it would seem that something interrupted comms. > What is in the syslog of node 2 at the same time period? can you share > you cluster.conf please (obfuscating only passwords)? > > On 11/11/2012 05:32 PM, Kalam, Imran wrote: >> Hi All. 
>> >> I have 2 node GFS cluster running RHAS4 update 5 kernel 2.6.9-55.ELsmp. >> On Sunday morning the node1 (master) has rebooted itself and I could >> only see the following in the message log file. Has anyone experienced >> the same problem? Please let me know if you need more information. Thanks >> >> Nov 11 00:12:47 kernel: CMAN: Being told to leave the cluster by node 2 >> Nov 11 00:12:47 kernel: CMAN: we are leaving the cluster. >> Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown >> Nov 11 00:12:47 kernel: WARNING: dlm_emergency_shutdown >> Nov 11 00:12:47 kernel: SM: 00000002 sm_stop: SG still joined >> Nov 11 00:12:47 kernel: SM: 01000003 sm_stop: SG still joined >> Nov 11 00:12:47 kernel: SM: 02000007 sm_stop: SG still joined >> Nov 11 00:12:47 kernel: SM: 03000004 sm_stop: SG still joined >> Nov 11 00:12:47 clurgmgrd[6872]: #67: Shutting down uncleanly >> Nov 11 00:12:47 ccsd[6613]: Cluster manager shutdown. Attemping to >> reconnect... >> Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. >> Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection >> refused >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request >> descriptor >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request >> descriptor >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-21). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing disconnect: Invalid >> request descriptor >> Nov 11 00:12:48 clurgmgrd: [6872]: unmounting >> /dev/mapper/vg_shared-lv00 (/opt/xxshare) >> Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. >> Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection >> refused >> Nov 11 00:12:48 ccsd[6613]: Cluster is not quorate. Refusing connection. >> Nov 11 00:12:48 ccsd[6613]: Error while processing connect: Connection >> refused >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> Nov 11 00:12:48 ccsd[6613]: Someone may be attempting something evil. >> Nov 11 00:12:48 ccsd[6613]: Error while processing get: Invalid request >> descriptor >> Nov 11 00:12:48 ccsd[6613]: Invalid descriptor specified (-111). >> >> >> *Regards* >> Imran Kalam >> Technical Specialist >> Post IT >> Corporate Services >> Australia Post >> Level 2, 185 Rosslyn St. West Melbourne >> Phone: (03) 9322 0382 >> Fax: 9204 7303 >> Mob: 0439 559 461 >> >> A >> >> >> >> >> Australia Post is committed to providing our customers with excellent >> service. If we can assist you in any way please telephone 13 13 18 or >> visit our website. >> >> The information contained in this email communication may be >> proprietary, confidential or legally professionally privileged. It is >> intended exclusively for the individual or entity to which it is >> addressed. You should only read, disclose, re-transmit, copy, >> distribute, act in reliance on or commercialise the information if you >> are authorised to do so. Australia Post does not represent, warrant or >> guarantee that the integrity of this email communication has been >> maintained nor that the communication is free of errors, virus or >> interference. 
>> >> If you are not the addressee or intended recipient please notify us by >> replying direct to the sender and then destroy any electronic or paper >> copy of this message. Any views expressed in this email communication >> are taken to be those of the individual sender, except where the sender >> specifically attributes those views to Australia Post and is authorised >> to do so. >> >> Please consider the environment before printing this email. >> >> >> > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From dev at sdd.jp Mon Nov 12 06:24:56 2012 From: dev at sdd.jp (Antonio Castellano) Date: Mon, 12 Nov 2012 15:24:56 +0900 Subject: [Linux-cluster] Bug inquiry (#831330) Message-ID: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Hi, I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. This is the link related to the text reported in our log: https://access.redhat.com/knowledge/ja/node/141203 And this is the bugzilla link: https://bugzilla.redhat.com/show_bug.cgi?id=831330 Is there anybody out there that can help me? The help will be greatly appreciated. Thank you very much! -- Antonio Castellano [DEV at SDD.jp] Seventh Dimension Design, Inc. http://www.SDD.jp VOICE: +81-78-252-8855, FAX: +81-78-252-8856 From lists at alteeve.ca Mon Nov 12 06:33:11 2012 From: lists at alteeve.ca (Digimer) Date: Mon, 12 Nov 2012 01:33:11 -0500 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> References: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <50A09827.6000204@alteeve.ca> On 11/12/2012 01:24 AM, Antonio Castellano wrote: > Hi, > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. > > This is the link related to the text reported in our log: > https://access.redhat.com/knowledge/ja/node/141203 > > And this is the bugzilla link: > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > Is there anybody out there that can help me? The help will be greatly appreciated. > > Thank you very much! Closed bugs generally have customer-specific information in them. They are closed to reduce the risk of leaking private information. The only way for you to see the status of that bug is to speak with your support person, assuming that the bug is yours. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From dev at sdd.jp Mon Nov 12 07:40:46 2012 From: dev at sdd.jp (Antonio Castellano) Date: Mon, 12 Nov 2012 16:40:46 +0900 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <50A09827.6000204@alteeve.ca> References: <50A09827.6000204@alteeve.ca> <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <1352706048367910000775f@sv0.inside.kobe.sdd.jp> > On 11/12/2012 01:24 AM, Antonio Castellano wrote: > > Hi, > > > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. 
> > > > This is the link related to the text reported in our log: > > https://access.redhat.com/knowledge/ja/node/141203 > > > > And this is the bugzilla link: > > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > > > Is there anybody out there that can help me? The help will be greatly appreciated. > > > > Thank you very much! > > Closed bugs generally have customer-specific information in them. They > are closed to reduce the risk of leaking private information. The only > way for you to see the status of that bug is to speak with your support > person, assuming that the bug is yours. > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? I see. Not what I was hoping for, but thank you very much anyway for the quick reply! Best regards, -- Antonio Castellano [DEV at SDD.jp] Seventh Dimension Design, Inc. http://www.SDD.jp VOICE: +81-78-252-8855, FAX: +81-78-252-8856 From swhiteho at redhat.com Mon Nov 12 10:19:19 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 12 Nov 2012 10:19:19 +0000 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> References: <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <1352715560.2721.9.camel@menhir> Hi, On Mon, 2012-11-12 at 15:24 +0900, Antonio Castellano wrote: > Hi, > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. > > This is the link related to the text reported in our log: > https://access.redhat.com/knowledge/ja/node/141203 > > And this is the bugzilla link: > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > Is there anybody out there that can help me? The help will be greatly appreciated. > > Thank you very much! > Assuming that you are a Red Hat customer, please open a ticket. The bug mostly contains customer's private data, so that I don't think opening this one up would help much as there would be little that we could share. This is though, our highest priority bug at the moment (when I say our, I mean the GFS2 team). There is a simple workaround (just use a slightly older kernel) which is one reason why we've had trouble in tracing this, because people are (understandably) using that rather than running the kernel we've built to debug this issue. We've been unable to reproduce this internally, despite trying many different workloads. If you are in a position to help us debug the issue, then any assistance is very gratefully received, Steve. From anprice at redhat.com Mon Nov 12 20:37:36 2012 From: anprice at redhat.com (Andrew Price) Date: Mon, 12 Nov 2012 20:37:36 +0000 Subject: [Linux-cluster] gfs2-utils 3.1.5 Released Message-ID: <50A15E10.8030009@redhat.com> Hi, gfs2-utils 3.1.5 has been released. This version features bug fixes and performance enhancements for fsck.gfs2 in particular, better handling of symlinks in mkfs.gfs2, a small block manipulation language to aid future testing, a gfs2_lockcapture script which replaces gfs2_lockgather, and various other minor enhancements and bug fixes. The mount.gfs2 helper utility has been removed as it is no longer required to mount gfs2 file systems. gfs2_tool and gfs2_quota have also been removed. 
Users of gfs2_quota should now use the generic quota utilities and users of gfs2_tool should now use tunegfs2, gfs2 mount options and the generic dmsetup and chattr/lsattr tools. See below for a full list of changes. The source tarball is available from: https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.5.tar.gz To report bugs or issues, please file them against the gfs2-utils component of Fedora (rawhide) at: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide Regards, Andy Price Red Hat File Systems Changes since 3.1.4: Andrew Price (29): gfs2_utils: Improve error messages fsck.gfs2: Fix handling of eattr indirect blocks libgfs2: Remove gfs_get_leaf_nr libgfs2: Clean up some warnings gfs2-utils: Remove references to unlinked file tag gfs2_edit: Fix find_mtype and support gfs1 structures gfs2_edit: Clean up some magic offsets libgfs2: Use flags for versions in metadata description mkfs.gfs2: Check for symlinks before reporting device contents gfs2-utils: Remove obsolete tools gfs2-utils: Make building gfs_controld optional gfs2-utils: Only build group/ when gfs_controld is enabled gfs2-utils: Remove unused exported functions mkfs.gfs2: Avoid a rename race when checking file contents fsck.gfs2: Fix buffer overflow in get_lockproto_table libgfs2: Remove exit calls from inode_read and inode_get libgfs2: Remove exit call from __gfs_inode_get gfs2_edit: Some comment cleanups mkfs.gfs2: Check locktable more strictly for valid chars libgfs2: Add a gfs2 block query language libgfs2: Move valid_block into fsck.gfs2 libgfs2: gfs2_get_bitmap performance enhancements fsck.gfs2: Fix build failure gfs2-utils: build: Avoid using the kernel versions of kernel headers libgfs2: Add a small testing language UI gfs2-utils: Update .gitignore gfs2-utils: Remove gfs2_lockgather gfs2-utils: Rename lockgather directory to lockcapture gfs2-utils: Remove remaining references to gfs2_lockgather Bob Peterson (8): gfs2_edit savemeta: Get rid of "slow" mode gfs2_edit savemeta: report save statistics more often gfs2_edit savemeta: fix block range checking gfs2_edit restoremeta: sync changes on a regular basis RHEL6 gfs_controld: fix ignore_nolock for mounted nolock fs fsck.gfs2: soften the messages when reclaiming freemeta blocks fsck.gfs2: Check for formal inode number mismatch GFS2: Fix a compiler warning in pass2's check_dentry Shane Bradley (1): gfs2-utils: Added a new script called gfs2_lockcapture that will capture lockdump data. Steven Whitehouse (3): libgfs2: libgfs2.h: Add gfs_block_tag structure, and some more flag symbols mount.gfs2: Remove obsolete tool libgfs2: Add pointer restriction flags From fdinitto at redhat.com Tue Nov 13 09:17:47 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 13 Nov 2012 10:17:47 +0100 Subject: [Linux-cluster] gfs2-utils 3.1.5 Released In-Reply-To: <50A15E10.8030009@redhat.com> References: <50A15E10.8030009@redhat.com> Message-ID: <50A2103B.9000009@redhat.com> On 11/12/2012 9:37 PM, Andrew Price wrote: > Hi, > > gfs2-utils 3.1.5 has been released. This version features bug fixes and > performance enhancements for fsck.gfs2 in particular, better handling of > symlinks in mkfs.gfs2, a small block manipulation language to aid future > testing, a gfs2_lockcapture script which replaces gfs2_lockgather, and > various other minor enhancements and bug fixes. > > The mount.gfs2 helper utility has been removed as it is no longer > required to mount gfs2 file systems. gfs2_tool and gfs2_quota have also > been removed. 
Users of gfs2_quota should now use the generic quota > utilities and users of gfs2_tool should now use tunegfs2, gfs2 mount > options and the generic dmsetup and chattr/lsattr tools. IIRC there is a specific minimum kernel version for mount.gfs2 to be obsoleted and quota tool version to obsolete gfs2_quota. Might be a good idea to document it, so that users won't attempt random back-ports. Fabio From anprice at redhat.com Tue Nov 13 11:57:34 2012 From: anprice at redhat.com (Andrew Price) Date: Tue, 13 Nov 2012 11:57:34 +0000 Subject: [Linux-cluster] gfs2-utils 3.1.5 Released In-Reply-To: <50A2103B.9000009@redhat.com> References: <50A15E10.8030009@redhat.com> <50A2103B.9000009@redhat.com> Message-ID: <50A235AE.7030807@redhat.com> On 13/11/12 09:17, Fabio M. Di Nitto wrote: > On 11/12/2012 9:37 PM, Andrew Price wrote: >> The mount.gfs2 helper utility has been removed as it is no longer >> required to mount gfs2 file systems. gfs2_tool and gfs2_quota have also >> been removed. Users of gfs2_quota should now use the generic quota >> utilities and users of gfs2_tool should now use tunegfs2, gfs2 mount >> options and the generic dmsetup and chattr/lsattr tools. > > IIRC there is a specific minimum kernel version for mount.gfs2 to be > obsoleted and quota tool version to obsolete gfs2_quota. Might be a good > idea to document it, so that users won't attempt random back-ports. Yes, it looks like mount.gfs2 hasn't been required since kernel 2.6.36 Andy From dev at sdd.jp Wed Nov 14 07:09:38 2012 From: dev at sdd.jp (Antonio Castellano) Date: Wed, 14 Nov 2012 16:09:38 +0900 Subject: [Linux-cluster] Bug inquiry (#831330) In-Reply-To: <1352715560.2721.9.camel@menhir> References: <1352715560.2721.9.camel@menhir> <135270149815343600006c50@sv0.inside.kobe.sdd.jp> Message-ID: <1352876980602019000002ec@sv0.inside.kobe.sdd.jp> Hi, Steven. Thank you for the reply. I'm sending you here the syslog portion where the problem appears. Maybe it will be of some help. The kernel version is 2.6.18-308.11.1.el5PAE.
Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: fatal: invalid metadata block Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2:   bh = 151918444 (magic number) Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2:   function = get_leaf, file = fs/gfs2/dir.c, line = 763 Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: about to withdraw this file system Nov 12 15:50:16 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: telling LM to withdraw Nov 12 15:50:17 blahblah6 kernel: GFS2: fsid=blahblah:data023.2: withdrawn Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lm_withdraw+0x8d/0xb0 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_meta_check_ii+0x28/0x33 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] get_leaf+0x5e/0x9d [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] get_first_leaf+0x24/0x2a [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_dirent_search+0x81/0x180 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_dirent_find+0x0/0x4c [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] run_queue+0xbd/0x18a [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_dir_search+0x1d/0x7f [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] permission+0xa2/0xb5 Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lookupi+0x116/0x14f [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lookupi+0xd0/0x14f [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_lookup+0x1b/0x8e [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] gfs2_glock_put+0xcf/0xe7 [gfs2] Nov 12 15:50:17 blahblah6 kernel:  [] d_alloc+0x151/0x17f Nov 12 15:50:17 blahblah6 kernel:  [] do_lookup+0x102/0x1b6 Nov 12 15:50:17 blahblah6 kernel:  [] __link_path_walk+0x318/0xd1d Nov 12 15:50:17 blahblah6 kernel:  [] link_path_walk+0x3a/0x99 Nov 12 15:50:17 blahblah6 kernel:  [] do_path_lookup+0x231/0x297 Nov 12 15:50:17 blahblah6 kernel:  [] __user_walk_fd+0x29/0x3a Nov 12 15:50:17 blahblah6 kernel:  [] vfs_stat_fd+0x15/0x3c Nov 12 15:50:17 blahblah6 kernel:  [] sys_stat64+0xf/0x23 Nov 12 15:50:17 blahblah6 kernel:  [] do_page_fault+0x356/0x653 Nov 12 15:50:17 blahblah6 kernel:  [] __fput+0x15c/0x184 Nov 12 15:50:17 blahblah6 kernel:  [] do_page_fault+0x0/0x653 Nov 12 15:50:17 blahblah6 kernel:  [] sysenter_past_esp+0x56/0x79 We have 5 servers accessing a shared filesystem that consists of 24 virtual disks on top of multiple HDDs using GSF2. Once this problem happens in a virtual disk, we can't write into it (but the rest of the virtual disks keep on working without any problem). Also, it seems that running fsck fixes the virtual disk temporarily, but after a while it breaks again. Is there any way to fix this problem, or at least reduce how often it happens (it's happening almost every day in our system), without having to inst all an older kernel version? Best regards, > Hi, > > On Mon, 2012-11-12 at 15:24 +0900, Antonio Castellano wrote: > > Hi, > > > > I'd like to know about the status of the bug number 831330 and its schedule. Our system is complaining about it and I don't have enough permissions to access its bugzilla related page. It is urgent. > > > > This is the link related to the text reported in our log: > > https://access.redhat.com/knowledge/ja/node/141203 > > > > And this is the bugzilla link: > > https://bugzilla.redhat.com/show_bug.cgi?id=831330 > > > > Is there anybody out there that can help me? The help will be greatly appreciated. > > > > Thank you very much! > > > Assuming that you are a Red Hat customer, please open a ticket. 
The bug > mostly contains customer's private data, so that I don't think opening > this one up would help much as there would be little that we could > share. > > This is though, our highest priority bug at the moment (when I say our, > I mean the GFS2 team). There is a simple workaround (just use a slightly > older kernel) which is one reason why we've had trouble in tracing this, > because people are (understandably) using that rather than running the > kernel we've built to debug this issue. > > We've been unable to reproduce this internally, despite trying many > different workloads. If you are in a position to help us debug the > issue, then any assistance is very gratefully received, > > Steve. > > > -- Antonio Castellano [DEV at SDD.jp] Seventh Dimension Design, Inc. http://www.SDD.jp VOICE: +81-78-252-8855, FAX: +81-78-252-8856 From getridofthespam at yahoo.com Wed Nov 14 14:54:07 2012 From: getridofthespam at yahoo.com (getridofthespam) Date: Wed, 14 Nov 2012 06:54:07 -0800 (PST) Subject: [Linux-cluster] gfs for exsisting disks Message-ID: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> Hi all, I have a Centos 6.3 with a SAN storage attached. mount extract: /dev/mapper/mpathcp1 on /3parslice1 type ext4 (rw) /dev/mapper/mpathbp1 on /3parslice2 type ext4 (rw) The slices are ext4 formatted. Now I want to add a second server that needs to access the same disk slices. Is gfs the solution? Can I keep the data on the disks? Any procedure to follow available? Tnx for all answers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmr at redhat.com Wed Nov 14 15:05:54 2012 From: bmr at redhat.com (Bryn M. Reeves) Date: Wed, 14 Nov 2012 15:05:54 +0000 Subject: [Linux-cluster] gfs for exsisting disks In-Reply-To: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> References: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> Message-ID: <50A3B352.3000003@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/14/2012 02:54 PM, getridofthespam wrote: > /dev/mapper/mpathcp1 on /3parslice1 type ext4 (rw) > /dev/mapper/mpathbp1 on /3parslice2 type ext4 (rw) > > The slices are ext4 formatted. > > Now I want to add a second server that needs to access the same > disk slices. > > Is gfs the solution? Can I keep the data on the disks? Any > procedure to follow available? GFS (or better GFS2..) would be one solution but you cannot "convert" from an ext type file system; you would need to backup and restore to a newly-created GFS2 volume. You could also consider using a network file system exported from one host or an external filer as an alternative to sharing the data between two hosts. It's difficult to tell whether a cluster file system like GFS2 is "the solution" without knowing what application will use it and how the app is structured; this is key to determining how a clustered file system will perform and is an important factor in deciding which is the best option for a given case. Regards, Bryn. 
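As a rough sketch of that backup-and-restore path (the cluster name "mycluster", the file system name "slice1" and the /backup destination below are placeholders rather than anything from the original post, and a working cman/fencing setup on both nodes is assumed), the steps for one of the slices would look something like:

# copy the existing ext4 data somewhere safe, then unmount it
rsync -aHAX /3parslice1/ /backup/3parslice1/
umount /3parslice1

# re-create the LUN as GFS2 with one journal per node; the -t value
# must be <clustername>:<fsname>, matching the name in cluster.conf
mkfs.gfs2 -p lock_dlm -t mycluster:slice1 -j 2 /dev/mapper/mpathcp1

# mount it on both nodes and restore the data
mount -t gfs2 /dev/mapper/mpathcp1 /3parslice1
rsync -aHAX /backup/3parslice1/ /3parslice1/

The same sequence would then be repeated for /dev/mapper/mpathbp1, and the restored data should of course be verified before the old ext4 copies are discarded.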
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlCjs1IACgkQ6YSQoMYUY95k0gCeIm0buQfFVJocBOxoYaWKexjK 7BwAn2FacRUL0Ba8veE2G7rz20ijTjXl =1/4Y -----END PGP SIGNATURE----- From lists at alteeve.ca Wed Nov 14 15:07:12 2012 From: lists at alteeve.ca (Digimer) Date: Wed, 14 Nov 2012 10:07:12 -0500 Subject: [Linux-cluster] gfs for exsisting disks In-Reply-To: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> References: <1352904847.66273.YahooMailClassic@web125505.mail.ne1.yahoo.com> Message-ID: <50A3B3A0.7030903@alteeve.ca> On 11/14/2012 09:54 AM, getridofthespam wrote: > Hi all, > > I have a Centos 6.3 with a SAN storage attached. mount extract: > > /dev/mapper/mpathcp1 on /3parslice1 type ext4 (rw) > /dev/mapper/mpathbp1 on /3parslice2 type ext4 (rw) > > The slices are ext4 formatted. > > Now I want to add a second server that needs to access the > same disk slices. > > Is gfs the solution? Can I keep the data on the disks? > Any procedure to follow available? > > Tnx for all answers. Unless there is some voodoo I don't know about, no, you will need to backup, reformat gfs2 and restore the files. Yes, gfs2 will allow 2+ nodes to access the same data on the SAN, but there are considerations to be aware of. First is that the distributed locking (dlm) comes at an overhead cost. before each write can happen, a lock must be requested from the cluster. If you have disk intensive apps, this might cause unacceptable delays. Also, you *must must must* have testing, working fencing for gfs2 to be safe. So it might be worth putting together a test case before you commit to converting production boxes, if you have the resources. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From rossnick-lists at cybercat.ca Wed Nov 14 19:06:26 2012 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 14 Nov 2012 14:06:26 -0500 Subject: [Linux-cluster] Can't get apache resource agent working Message-ID: <50A3EBB2.3010009@cybercat.ca> Hi ! I am trying to add a apache resource to a service, and I can't get it to work. Here's my service : The apache config file is basicly a copy of /etc/httpd/conf/httpd.conf, tailored to my needs, with PidFIle "/var/run/cluster/apache/apache:SandBoxHttpd.pid" in it. If I do : /usr/sbin/httpd -f /CyberCat/SandBox/etc/httpd.conf It works perfectly fine, and it creates the pid ar the proper location. So I used rg_test : rg_test test ./cluster.conf start service SandBox Starting SandBox... 
/dev/dm-11 already mounted [clusterfs] /dev/dm-11 already mounted 192.168.110.29 already configured [ip] 192.168.110.29 already configured 192.168.112.29 already configured [ip] 192.168.112.29 already configured Verifying Configuration Of apache:SandBoxHttpd [apache] Verifying Configuration Of apache:SandBoxHttpd Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > Succeed [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > Succeed Monitoring Service apache:SandBoxHttpd [apache] Monitoring Service apache:SandBoxHttpd Checking Existence Of File /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > Failed [apache] Checking Existence Of File /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > Failed Monitoring Service apache:SandBoxHttpd > Service Is Not Running [apache] Monitoring Service apache:SandBoxHttpd > Service Is Not Running Starting Service apache:SandBoxHttpd [apache] Starting Service apache:SandBoxHttpd Looking For IP Addresses [apache] Looking For IP Addresses 0 IP addresses found for SandBox/SandBoxHttpd [apache] 0 IP addresses found for SandBox/SandBoxHttpd Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP Addresses Found [apache] Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP Addresses Found Failed to start SandBox So it seems rgmanager can't find IP addresses for this service, and I can't figure why. I have other services that uses mysql resource agent, and the work perfectly with the exact same hiearchy of service/fs/ip,etc. I've also tried this config : With the same outcome. Thanks for any insights. From rossnick-lists at cybercat.ca Thu Nov 15 15:34:34 2012 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 15 Nov 2012 10:34:34 -0500 Subject: [Linux-cluster] Can't get apache resource agent working In-Reply-To: <50A3EBB2.3010009@cybercat.ca> References: <50A3EBB2.3010009@cybercat.ca> Message-ID: <50A50B8A.2040006@cybercat.ca> > I am trying to add a apache resource to a service, and I can't get it to > work. > > Here's my service : > > > > > > name="SandBoxHttpd"/> > > > > The apache config file is basicly a copy of /etc/httpd/conf/httpd.conf, > tailored to my needs, with PidFIle > "/var/run/cluster/apache/apache:SandBoxHttpd.pid" in it. > > If I do : > > /usr/sbin/httpd -f /CyberCat/SandBox/etc/httpd.conf > > It works perfectly fine, and it creates the pid ar the proper location. > > So I used rg_test : > > rg_test test ./cluster.conf start service SandBox > > Starting SandBox... 
> /dev/dm-11 already mounted > [clusterfs] /dev/dm-11 already mounted > 192.168.110.29 already configured > [ip] 192.168.110.29 already configured > 192.168.112.29 already configured > [ip] 192.168.112.29 already configured > Verifying Configuration Of apache:SandBoxHttpd > [apache] Verifying Configuration Of apache:SandBoxHttpd > Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > > Succeed > [apache] Checking Syntax Of The File /CyberCat/SandBox/etc/httpd.conf > > Succeed > Monitoring Service apache:SandBoxHttpd > [apache] Monitoring Service apache:SandBoxHttpd > Checking Existence Of File > /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > > Failed > [apache] Checking Existence Of File > /var/run/cluster/apache/apache:SandBoxHttpd.pid [apache:SandBoxHttpd] > > Failed > Monitoring Service apache:SandBoxHttpd > Service Is Not Running > [apache] Monitoring Service apache:SandBoxHttpd > Service Is Not Running > Starting Service apache:SandBoxHttpd > [apache] Starting Service apache:SandBoxHttpd > Looking For IP Addresses > [apache] Looking For IP Addresses > 0 IP addresses found for SandBox/SandBoxHttpd > [apache] 0 IP addresses found for SandBox/SandBoxHttpd > Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP > Addresses Found > [apache] Looking For IP Addresses [apache:SandBoxHttpd] > Failed - No IP > Addresses Found > Failed to start SandBox > > So it seems rgmanager can't find IP addresses for this service, and I > can't figure why. I have other services that uses mysql resource agent, > and the work perfectly with the exact same hiearchy of service/fs/ip,etc. > > I've also tried this config : > > > > > name="SandBoxHttpd"/> > > > > With the same outcome. > > Thanks for any insights. > Even if I add a new service with ccs like this : ccs -f cluster.conf --addservice SandBox3 ccs -f cluster.conf --addsubservice SandBox3 ip address="192.168.112.29" ccs -f cluster.conf --addsubservice SandBox3 apache name="testapache" It fails to find any IP addresses. From jajcus at jajcus.net Mon Nov 19 09:16:48 2012 From: jajcus at jajcus.net (Jacek Konieczny) Date: Mon, 19 Nov 2012 10:16:48 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown Message-ID: <20121119091647.GA20419@jajo.eggsoft> Hi, I am setting up a cluster using: Linux kernel 3.6.6 Corosync 2.1.0 DLM 4.0.0 CLVMD 2.02.98 Pacemaker 1.1.8 DRBD 8.3.13 Now I have stuck on the 'clean shutdown of a node' scenario. It goes like that: - resources using the shared storage are properly stopped by Pacemaker. - DRBD is cleanly demoted and unconfigured by Pacemaker - Pacemaker cleanly exits - CLVMD is stopped. ? dlm_controld is stopped ? corosync is being stopped and at this point the node is fenced (rebooted) by the dlm_controld on the other node. I would expect it continue with a clean shutdown. Any idea how to debug/fix it? Is this '541 cpg_dispatch error 9' the problem? Logs from the node being shut down (log file system mounted with the 'sync' option, syslog shutdown delayed as much as possible): Kernel: Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker terminated Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating drbd0_worker Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the lockspace group... 
Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event done 0 0 Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: release_lockspace final free Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to node 2 Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to node 1 User space: Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping cib: Sent -15 to process 1279 Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] cib:1279:0x7fc4240008d0 is now disconnected from corosync Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: Shutdown complete Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync service engines. Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration map access Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration service Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. 
Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally Logs from the surviving node: Kernel: Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( Unconnected -> WFConnection ) Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss 1 done Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: dlm_recover_members 1 nodes Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 slots 1 1:1 Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: dlm_recover_directory Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: dlm_recover_directory 0 in 0 new Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: dlm_recover_directory 0 out 0 messages Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: dlm_recover_masters Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: dlm_recover_masters 0 of 1 Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: dlm_recover_locks 0 out Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: dlm_recover_locks 0 in Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: dlm_recover_rsbs 1 done Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 generation 15 done: 0 ms Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to node 2 Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down User space: Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node dev1n2 for shutdown Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop stonith-dev1n1 (dev1n2) Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: do_shutdown of dev1n2 (op 63) is complete Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 Nov 19 09:49:43 dev1n1 crmd[1080]: notice: 
corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous transition Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or left the membership and a new membership (10.28.45.27:30736) was formed. Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service synchronization, ready to provide service. Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 nodedown time 1353314983 fence_all dlm_stonith Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] ip:192.168.1.2 left Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: 71447261-0e53-4b20-b628-d3f026a4ae24 (0) Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool output: Chassis Power Control: Reset Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: OK Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 Greets, Jacek From jajcus at jajcus.net Mon Nov 19 09:39:20 2012 From: jajcus at jajcus.net (Jacek Konieczny) Date: Mon, 19 Nov 2012 10:39:20 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119091647.GA20419@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> Message-ID: <20121119093920.GB20419@jajo.eggsoft> On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: > It goes like that: > - resources using the shared storage are properly stopped by Pacemaker. > - DRBD is cleanly demoted and unconfigured by Pacemaker > - Pacemaker cleanly exits > - CLVMD is stopped. > ? dlm_controld is stopped > ? corosync is being stopped > > and at this point the node is fenced (rebooted) by the dlm_controld on > the other node. I would expect it continue with a clean shutdown. > > Any idea how to debug/fix it? > Is this '541 cpg_dispatch error 9' the problem? I found a workaround: I have added a 10 seconds pause between dlm_controld and corosync shutdown. The node shuts down cleanly now (is not fenced). '541 cpg_dispatch error 9' is still there in the logs, though. Greets, Jacek From teigland at redhat.com Mon Nov 19 15:23:19 2012 From: teigland at redhat.com (David Teigland) Date: Mon, 19 Nov 2012 10:23:19 -0500 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119093920.GB20419@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> Message-ID: <20121119152319.GA19052@redhat.com> On Mon, Nov 19, 2012 at 10:39:20AM +0100, Jacek Konieczny wrote: > On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: > > It goes like that: > > - resources using the shared storage are properly stopped by Pacemaker. > > - DRBD is cleanly demoted and unconfigured by Pacemaker > > - Pacemaker cleanly exits > > - CLVMD is stopped. > > ??? 
dlm_controld is stopped > > ??? corosync is being stopped > > > > and at this point the node is fenced (rebooted) by the dlm_controld on > > the other node. I would expect it continue with a clean shutdown. > > > > Any idea how to debug/fix it? > > Is this '541 cpg_dispatch error 9' the problem? > > I found a workaround: I have added a 10 seconds pause between > dlm_controld and corosync shutdown. The node shuts down cleanly now (is > not fenced). '541 cpg_dispatch error 9' is still there in the logs, > though. corosync-cfgtool -H is supposed to shut down corosync cleanly using the cfg_shutdown_callback. It looks like it may not be doing that. From jfriesse at redhat.com Mon Nov 19 16:11:45 2012 From: jfriesse at redhat.com (Jan Friesse) Date: Mon, 19 Nov 2012 17:11:45 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119152319.GA19052@redhat.com> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> <20121119152319.GA19052@redhat.com> Message-ID: <50AA5A41.2030402@redhat.com> David Teigland napsal(a): > On Mon, Nov 19, 2012 at 10:39:20AM +0100, Jacek Konieczny wrote: >> On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: >>> It goes like that: >>> - resources using the shared storage are properly stopped by Pacemaker. >>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>> - Pacemaker cleanly exits >>> - CLVMD is stopped. >>> ??? dlm_controld is stopped >>> ??? corosync is being stopped >>> >>> and at this point the node is fenced (rebooted) by the dlm_controld on >>> the other node. I would expect it continue with a clean shutdown. >>> >>> Any idea how to debug/fix it? >>> Is this '541 cpg_dispatch error 9' the problem? >> >> I found a workaround: I have added a 10 seconds pause between >> dlm_controld and corosync shutdown. The node shuts down cleanly now (is >> not fenced). '541 cpg_dispatch error 9' is still there in the logs, >> though. > > corosync-cfgtool -H is supposed to shut down corosync cleanly using the > cfg_shutdown_callback. It looks like it may not be doing that. > I don't think it's about corosync not shut down cleanly. As can be seen in logs: ... Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally From anprice at redhat.com Mon Nov 19 17:56:36 2012 From: anprice at redhat.com (Andrew Price) Date: Mon, 19 Nov 2012 17:56:36 +0000 Subject: [Linux-cluster] Please help translating gfs2-utils Message-ID: <50AA72D4.5080305@redhat.com> Hi all, gfs2-utils is an open source project containing tools necessary for creating, checking, tuning and manipulating gfs2 file systems. The gfs2-utils package is required wherever gfs2 file systems are used, particularly in Linux clusters. I'm currently trying to improve the strings in upstream gfs2-utils for localisation. In the meantime, I'd like to drum up interest in progressing the translation effort. We have a Transifex project set up and open for translations: https://www.transifex.com/projects/p/gfs2-utils/ i18n support is a fairly recent addition to the project so the strings likely require some work to make life easy for translators. 
If there are any issues please contact me or file a bug report at http://bugzilla.redhat.com/ under the gfs2-utils package in Fedora / Rawhide, and I'll try to get the strings updated as soon as I can. Whether bug reports or translations, any help you can provide in translating gfs2-utils into different languages, and making it easier to do so, would be greatly appreciated. Regards, Andy Price From uxbod at splatnix.net Mon Nov 19 21:58:54 2012 From: uxbod at splatnix.net (Phil Daws) Date: Mon, 19 Nov 2012 21:58:54 +0000 (GMT) Subject: [Linux-cluster] Thin (sparse) provisioning In-Reply-To: <1454666582.455812.1353361979619.JavaMail.root@innovot.com> Message-ID: <324998424.456303.1353362334446.JavaMail.root@innovot.com> Hello: am learning about clustering with DRBD and GFS2 and have a question about thin provisioning. I would like to set up a number of individual vservers that reside on their own LVs which can then be shared between two nodes and flipped backwards and forwards using Pacemaker. When setting up the block/lvm device for DRBD I have used: lvcreate --virtualsize 1T --size 10G --name vserver01 vg1 once that has been added as a resource would I perform a standard mkfs.gfs2 or do I need to specify any further options; I was thinking something like: mkfs.gfs2 -t vservercluster:vservers -p lock_dlm -j 2 /dev/vservermirror/vserver01 Is that the way I should be doing it ? Thanks. From jamescyriac76 at gmail.com Tue Nov 20 10:58:57 2012 From: jamescyriac76 at gmail.com (james cyriac) Date: Tue, 20 Nov 2012 14:58:57 +0400 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? Message-ID: Hi all, i am installing redhat cluster 6 two node cluser.the issue is i am not able to mount my GFS file sytem in both the node at same time.. please find my clustat output .. [root at saperpprod01 ~]# clustat Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ saperpprod01 1 Online, Local, rgmanager saperpprod02 2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:oracle saperpprod01 started service:profile-gfs saperpprod01 started service:sap saperpprod01 started [root at saperpprod01 ~]# oralce and sap is fine and it is flaying in both nodes.i want mount my GFS vols same time at both the nodes. Thanks in advacne james but profile-gfs is GFS file system and i want present the GFS mount point same time both the node.please help me this On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: > Hi, > > I am setting up a cluster using: > > Linux kernel 3.6.6 > Corosync 2.1.0 > DLM 4.0.0 > CLVMD 2.02.98 > Pacemaker 1.1.8 > DRBD 8.3.13 > > Now I have stuck on the 'clean shutdown of a node' scenario. > > It goes like that: > - resources using the shared storage are properly stopped by Pacemaker. > - DRBD is cleanly demoted and unconfigured by Pacemaker > - Pacemaker cleanly exits > - CLVMD is stopped. > ? dlm_controld is stopped > ? corosync is being stopped > > and at this point the node is fenced (rebooted) by the dlm_controld on > the other node. I would expect it continue with a clean shutdown. > > Any idea how to debug/fix it? > Is this '541 cpg_dispatch error 9' the problem? 
> > Logs from the node being shut down (log file system mounted with the 'sync' > option, syslog shutdown delayed as much as possible): > > Kernel: > Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker > terminated > Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating > drbd0_worker > Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the > lockspace group... > Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event > done 0 0 > Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: > release_lockspace final free > Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to > node 2 > Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to > node 1 > > User space: > Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping > cib: Sent -15 to process 1279 > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: > Disconnecting from Corosync > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > cib:1279:0x7fc4240008d0 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: > Disconnecting from Corosync > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd > Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: > Shutdown complete > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] > pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 > Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 > Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 > Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 > Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync > service engines. > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync vote quorum service v1.0 > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync configuration map access > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync configuration service > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync cluster closed process group service v1.01 > Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync cluster quorum service v0.1 > Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync profile loading service > Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the > watchdog. 
> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: > corosync watchdog service > Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine > exiting normally > > > Logs from the surviving node: > > Kernel: > Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( > Unconnected -> WFConnection ) > Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss > 1 done > Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: > dlm_recover_members 1 nodes > Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 > slots 1 1:1 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: > dlm_recover_directory > Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: > dlm_recover_directory 0 in 0 new > Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: > dlm_recover_directory 0 out 0 messages > Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: > dlm_recover_masters > Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: > dlm_recover_masters 0 of 1 > Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: > dlm_recover_locks 0 out > Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: > dlm_recover_locks 0 in > Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: > dlm_recover_rsbs 1 done > Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 > generation 15 done: 0 ms > Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to > node 2 > Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down > > User space: > Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node > dev1n2 for shutdown > Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: > Couldn't expand vpbx_vg_cl_demote_0 > Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: > Couldn't expand vpbx_vg_cl_demote_0 > Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop > stonith-dev1n1 (dev1n2) > Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: > Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 > Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete > Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 > Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: > do_shutdown of dev1n2 (op 63) is complete > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 > Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 > Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 > Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM 
] Retransmit List: 1e9 > Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 > Nov 19 09:49:43 dev1n1 crmd[1080]: notice: > corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous > transition > Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: > corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost > Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or > left the membership and a new membership (10.28.45.27:30736) was formed. > Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service > synchronization, ready to provide service. > Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 > nodedown time 1353314983 fence_all dlm_stonith > Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] > ip:192.168.1.2 left > Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client > stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' > Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: > initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: > 71447261-0e53-4b20-b628-d3f026a4ae24 (0) > Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool > output: Chassis Power Control: Reset > Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: > Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host > 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) > Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: > Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: > OK > Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer > dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK > (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 > > Greets, > Jacek > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Tue Nov 20 11:07:20 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 20 Nov 2012 12:07:20 +0100 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: You have to use /etc/fstab with _netdev option, redhat cluster doesn't support active/active service 2012/11/20 james cyriac > Hi all, > > i am installing redhat cluster 6 two node cluser.the issue is i am not > able to mount my GFS file sytem in both the node at same time.. > > please find my clustat output .. > > > [root at saperpprod01 ~]# clustat > Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 > Member Status: Quorate > Member Name ID > Status > ------ ---- ---- > ------ > saperpprod01 1 > Online, Local, rgmanager > saperpprod02 2 > Online, rgmanager > Service Name Owner > (Last) State > ------- ---- ----- > ------ ----- > service:oracle > saperpprod01 started > service:profile-gfs > saperpprod01 started > service:sap > saperpprod01 started > [root at saperpprod01 ~]# > oralce and sap is fine and it is flaying in both nodes.i want mount my GFS > vols same time at both the nodes. 
> > Thanks in advacne > james > > > but profile-gfs is GFS file system and i want present the GFS mount point > same time both the node.please help me this > On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: > >> Hi, >> >> I am setting up a cluster using: >> >> Linux kernel 3.6.6 >> Corosync 2.1.0 >> DLM 4.0.0 >> CLVMD 2.02.98 >> Pacemaker 1.1.8 >> DRBD 8.3.13 >> >> Now I have stuck on the 'clean shutdown of a node' scenario. >> >> It goes like that: >> - resources using the shared storage are properly stopped by Pacemaker. >> - DRBD is cleanly demoted and unconfigured by Pacemaker >> - Pacemaker cleanly exits >> - CLVMD is stopped. >> ? dlm_controld is stopped >> ? corosync is being stopped >> >> and at this point the node is fenced (rebooted) by the dlm_controld on >> the other node. I would expect it continue with a clean shutdown. >> >> Any idea how to debug/fix it? >> Is this '541 cpg_dispatch error 9' the problem? >> >> Logs from the node being shut down (log file system mounted with the >> 'sync' >> option, syslog shutdown delayed as much as possible): >> >> Kernel: >> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >> terminated >> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating >> drbd0_worker >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the >> lockspace group... >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event >> done 0 0 >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >> release_lockspace final free >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection >> to node 2 >> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection >> to node 1 >> >> User space: >> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping >> cib: Sent -15 to process 1279 >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >> Disconnecting from Corosync >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> cib:1279:0x7fc4240008d0 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >> Disconnecting from Corosync >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: >> Shutdown complete >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync >> service engines. 
>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync vote quorum service v1.0 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync configuration map access >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync configuration service >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync cluster closed process group service v1.01 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >> sockets >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync cluster quorum service v0.1 >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync profile loading service >> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >> watchdog. >> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: >> corosync watchdog service >> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine >> exiting normally >> >> >> Logs from the surviving node: >> >> Kernel: >> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >> Unconnected -> WFConnection ) >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >> dlm_clear_toss 1 done >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member >> 2 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >> dlm_recover_members 1 nodes >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 >> slots 1 1:1 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >> dlm_recover_directory >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >> dlm_recover_directory 0 in 0 new >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >> dlm_recover_directory 0 out 0 messages >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >> dlm_recover_masters >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >> dlm_recover_masters 0 of 1 >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >> dlm_recover_locks 0 out >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >> dlm_recover_locks 0 in >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >> dlm_recover_rsbs 1 done >> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover >> 11 generation 15 done: 0 ms >> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection >> to node 2 >> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down >> >> User space: >> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node >> dev1n2 for shutdown >> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >> Couldn't expand vpbx_vg_cl_demote_0 >> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >> Couldn't expand vpbx_vg_cl_demote_0 >> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >> stonith-dev1n1 (dev1n2) >> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1d1 >> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State >> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >> cause=C_FSA_INTERNAL origin=notify_crmd ] >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >> do_shutdown of dev1n2 (op 63) is complete >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >> transition >> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >> left the membership and a new membership (10.28.45.27:30736) was formed. >> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >> synchronization, ready to provide service. >> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >> 27225 nodedown time 1353314983 fence_all dlm_stonith >> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >> ip:192.168.1.2 left >> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >> '(any)' >> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >> output: Chassis Power Control: Reset >> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >> Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: >> OK >> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer >> dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >> >> Greets, >> Jacek >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
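To make the _netdev suggestion above concrete, here is a minimal sketch for RHEL/CentOS 6; the device path and mount point are only illustrative, and the gfs2 init script is assumed to be present (it ships with the gfs2-utils/cman stack):

  # identical line in /etc/fstab on BOTH nodes
  /dev/vg03/lvol0  /usr/sap/trans  gfs2  defaults,_netdev  0 0

  # let the gfs2 init script mount it once cman and clvmd are up
  chkconfig gfs2 on
  service gfs2 start

The _netdev flag keeps the mount out of the early boot sequence; the gfs2 init script then mounts every gfs2 entry from /etc/fstab after the cluster stack has started, so the same filesystem ends up mounted on both nodes at the same time without going through rgmanager.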
URL: From jamescyriac76 at gmail.com Tue Nov 20 11:31:23 2012 From: jamescyriac76 at gmail.com (james cyriac) Date: Tue, 20 Nov 2012 15:31:23 +0400 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: Hi, can you send the detials,i have to put entry in both servers?now i created map disk 150G both servers and created in node 1 vg03 then mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 now i able to mount in first server. /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: > You have to use /etc/fstab with _netdev option, redhat cluster doesn't > support active/active service > > > 2012/11/20 james cyriac > >> Hi all, >> >> i am installing redhat cluster 6 two node cluser.the issue is i am not >> able to mount my GFS file sytem in both the node at same time.. >> >> please find my clustat output .. >> >> >> [root at saperpprod01 ~]# clustat >> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >> Member Status: Quorate >> Member Name ID >> Status >> ------ ---- ---- >> ------ >> saperpprod01 1 >> Online, Local, rgmanager >> saperpprod02 2 >> Online, rgmanager >> Service Name Owner >> (Last) State >> ------- ---- ----- >> ------ ----- >> service:oracle >> saperpprod01 started >> service:profile-gfs >> saperpprod01 started >> service:sap >> saperpprod01 started >> [root at saperpprod01 ~]# >> oralce and sap is fine and it is flaying in both nodes.i want mount my >> GFS vols same time at both the nodes. >> >> Thanks in advacne >> james >> >> >> but profile-gfs is GFS file system and i want present the GFS mount point >> same time both the node.please help me this >> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >> >>> Hi, >>> >>> I am setting up a cluster using: >>> >>> Linux kernel 3.6.6 >>> Corosync 2.1.0 >>> DLM 4.0.0 >>> CLVMD 2.02.98 >>> Pacemaker 1.1.8 >>> DRBD 8.3.13 >>> >>> Now I have stuck on the 'clean shutdown of a node' scenario. >>> >>> It goes like that: >>> - resources using the shared storage are properly stopped by Pacemaker. >>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>> - Pacemaker cleanly exits >>> - CLVMD is stopped. >>> ? dlm_controld is stopped >>> ? corosync is being stopped >>> >>> and at this point the node is fenced (rebooted) by the dlm_controld on >>> the other node. I would expect it continue with a clean shutdown. >>> >>> Any idea how to debug/fix it? >>> Is this '541 cpg_dispatch error 9' the problem? >>> >>> Logs from the node being shut down (log file system mounted with the >>> 'sync' >>> option, syslog shutdown delayed as much as possible): >>> >>> Kernel: >>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>> terminated >>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating >>> drbd0_worker >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the >>> lockspace group... 
>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event >>> done 0 0 >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>> release_lockspace final free >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection >>> to node 2 >>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection >>> to node 1 >>> >>> User space: >>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping >>> cib: Sent -15 to process 1279 >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>> Disconnecting from Corosync >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>> Disconnecting from Corosync >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: >>> Shutdown complete >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync >>> service engines. >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync vote quorum service v1.0 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync configuration map access >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync configuration service >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync cluster closed process group service v1.01 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>> sockets >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync cluster quorum service v0.1 >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync profile loading service >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>> watchdog. 
>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>> unloaded: corosync watchdog service >>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine >>> exiting normally >>> >>> >>> Logs from the surviving node: >>> >>> Kernel: >>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>> Unconnected -> WFConnection ) >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover >>> 11 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>> dlm_clear_toss 1 done >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>> member 2 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>> dlm_recover_members 1 nodes >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>> 15 slots 1 1:1 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>> dlm_recover_directory >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>> dlm_recover_directory 0 in 0 new >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>> dlm_recover_directory 0 out 0 messages >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>> dlm_recover_masters >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>> dlm_recover_masters 0 of 1 >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>> dlm_recover_locks 0 out >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>> dlm_recover_locks 0 in >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>> dlm_recover_rsbs 1 done >>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover >>> 11 generation 15 done: 0 ms >>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection >>> to node 2 >>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>> Down >>> >>> User space: >>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node >>> dev1n2 for shutdown >>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>> Couldn't expand vpbx_vg_cl_demote_0 >>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>> Couldn't expand vpbx_vg_cl_demote_0 >>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >>> stonith-dev1n1 (dev1n2) >>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State >>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>> do_shutdown of dev1n2 (op 63) is complete >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>> transition >>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>> left the membership and a new membership (10.28.45.27:30736) was formed. >>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>> synchronization, ready to provide service. >>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>> ip:192.168.1.2 left >>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>> '(any)' >>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>> output: Chassis Power Control: Reset >>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>> Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: >>> OK >>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>> >>> Greets, >>> Jacek >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean at rentul.net Tue Nov 20 11:41:25 2012 From: sean at rentul.net (Sean Lutner) Date: Tue, 20 Nov 2012 06:41:25 -0500 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: Did you run lvmconf --enable-cluster? Sent from my iPhone On Nov 20, 2012, at 6:31 AM, james cyriac wrote: > Hi, > > can you send the detials,i have to put entry in both servers?now i created > > map disk 150G both servers > and created in node 1 vg03 > then > mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 > > now i able to mount in first server. 
> > > /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 > > On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >> You have to use /etc/fstab with _netdev option, redhat cluster doesn't support active/active service >> >> >> 2012/11/20 james cyriac >>> Hi all, >>> >>> i am installing redhat cluster 6 two node cluser.the issue is i am not able to mount my GFS file sytem in both the node at same time.. >>> >>> please find my clustat output .. >>> >>> >>> [root at saperpprod01 ~]# clustat >>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>> Member Status: Quorate >>> Member Name ID Status >>> ------ ---- ---- ------ >>> saperpprod01 1 Online, Local, rgmanager >>> saperpprod02 2 Online, rgmanager >>> Service Name Owner (Last) State >>> ------- ---- ----- ------ ----- >>> service:oracle saperpprod01 started >>> service:profile-gfs saperpprod01 started >>> service:sap saperpprod01 started >>> [root at saperpprod01 ~]# >>> oralce and sap is fine and it is flaying in both nodes.i want mount my GFS vols same time at both the nodes. >>> >>> Thanks in advacne >>> james >>> >>> >>> but profile-gfs is GFS file system and i want present the GFS mount point same time both the node.please help me this >>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>> Hi, >>>> >>>> I am setting up a cluster using: >>>> >>>> Linux kernel 3.6.6 >>>> Corosync 2.1.0 >>>> DLM 4.0.0 >>>> CLVMD 2.02.98 >>>> Pacemaker 1.1.8 >>>> DRBD 8.3.13 >>>> >>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>> >>>> It goes like that: >>>> - resources using the shared storage are properly stopped by Pacemaker. >>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>> - Pacemaker cleanly exits >>>> - CLVMD is stopped. >>>> ? dlm_controld is stopped >>>> ? corosync is being stopped >>>> >>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>> the other node. I would expect it continue with a clean shutdown. >>>> >>>> Any idea how to debug/fix it? >>>> Is this '541 cpg_dispatch error 9' the problem? >>>> >>>> Logs from the node being shut down (log file system mounted with the 'sync' >>>> option, syslog shutdown delayed as much as possible): >>>> >>>> Kernel: >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker terminated >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating drbd0_worker >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the lockspace group... 
>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event done 0 0 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: release_lockspace final free >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to node 2 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to node 1 >>>> >>>> User space: >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping cib: Sent -15 to process 1279 >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: Shutdown complete >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync service engines. >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration map access >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. 
>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally >>>> >>>> >>>> Logs from the surviving node: >>>> >>>> Kernel: >>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( Unconnected -> WFConnection ) >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: dlm_recover_members 1 nodes >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 slots 1 1:1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: dlm_recover_directory >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: dlm_recover_directory 0 in 0 new >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: dlm_recover_directory 0 out 0 messages >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: dlm_recover_masters >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: dlm_recover_masters 0 of 1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: dlm_recover_locks 0 out >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: dlm_recover_locks 0 in >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: dlm_recover_rsbs 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 generation 15 done: 0 ms >>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to node 2 >>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down >>>> >>>> User space: >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node dev1n2 for shutdown >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop stonith-dev1n1 (dev1n2) >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: do_shutdown of dev1n2 (op 63) is complete >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:42 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous transition >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or left the membership and a new membership (10.28.45.27:30736) was formed. >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service synchronization, ready to provide service. >>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 nodedown time 1353314983 fence_all dlm_stonith >>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] ip:192.168.1.2 left >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool output: Chassis Power Control: Reset >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: OK >>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>> >>>> Greets, >>>> Jacek >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Tue Nov 20 12:02:37 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 20 Nov 2012 13:02:37 +0100 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: Do it the same step on second server 2012/11/20 james cyriac > Hi, > > can you send the detials,i have to put entry in both servers?now i created > > map disk 150G both servers > and created in node 1 vg03 > then > mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 > > now i able to mount in first server. 
> > > /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 > > On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: > >> You have to use /etc/fstab with _netdev option, redhat cluster doesn't >> support active/active service >> >> >> 2012/11/20 james cyriac >> >>> Hi all, >>> >>> i am installing redhat cluster 6 two node cluser.the issue is i am not >>> able to mount my GFS file sytem in both the node at same time.. >>> >>> please find my clustat output .. >>> >>> >>> [root at saperpprod01 ~]# clustat >>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>> Member Status: Quorate >>> Member Name ID >>> Status >>> ------ ---- ---- >>> ------ >>> saperpprod01 1 >>> Online, Local, rgmanager >>> saperpprod02 2 >>> Online, rgmanager >>> Service Name Owner >>> (Last) State >>> ------- ---- ----- >>> ------ ----- >>> service:oracle >>> saperpprod01 started >>> service:profile-gfs >>> saperpprod01 started >>> service:sap >>> saperpprod01 started >>> [root at saperpprod01 ~]# >>> oralce and sap is fine and it is flaying in both nodes.i want mount my >>> GFS vols same time at both the nodes. >>> >>> Thanks in advacne >>> james >>> >>> >>> but profile-gfs is GFS file system and i want present the GFS mount >>> point same time both the node.please help me this >>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>> >>>> Hi, >>>> >>>> I am setting up a cluster using: >>>> >>>> Linux kernel 3.6.6 >>>> Corosync 2.1.0 >>>> DLM 4.0.0 >>>> CLVMD 2.02.98 >>>> Pacemaker 1.1.8 >>>> DRBD 8.3.13 >>>> >>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>> >>>> It goes like that: >>>> - resources using the shared storage are properly stopped by Pacemaker. >>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>> - Pacemaker cleanly exits >>>> - CLVMD is stopped. >>>> ? dlm_controld is stopped >>>> ? corosync is being stopped >>>> >>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>> the other node. I would expect it continue with a clean shutdown. >>>> >>>> Any idea how to debug/fix it? >>>> Is this '541 cpg_dispatch error 9' the problem? >>>> >>>> Logs from the node being shut down (log file system mounted with the >>>> 'sync' >>>> option, syslog shutdown delayed as much as possible): >>>> >>>> Kernel: >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>>> terminated >>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: >>>> Terminating drbd0_worker >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the >>>> lockspace group... 
>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event >>>> done 0 0 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>>> release_lockspace final free >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection >>>> to node 2 >>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection >>>> to node 1 >>>> >>>> User space: >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping >>>> cib: Sent -15 to process 1279 >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>> Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>> Disconnecting from Corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: >>>> pcmk_shutdown_worker: Shutdown complete >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync >>>> service engines. >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync vote quorum service v1.0 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync configuration map access >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync configuration service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync cluster closed process group service v1.01 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>> sockets >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync cluster quorum service v0.1 >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync profile loading service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>>> watchdog. 
>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>> unloaded: corosync watchdog service >>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster >>>> Engine exiting normally >>>> >>>> >>>> Logs from the surviving node: >>>> >>>> Kernel: >>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>>> Unconnected -> WFConnection ) >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover >>>> 11 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>>> dlm_clear_toss 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>>> member 2 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>>> dlm_recover_members 1 nodes >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>>> 15 slots 1 1:1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>>> dlm_recover_directory >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>>> dlm_recover_directory 0 in 0 new >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>>> dlm_recover_directory 0 out 0 messages >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>>> dlm_recover_masters >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>>> dlm_recover_masters 0 of 1 >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>>> dlm_recover_locks 0 out >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>>> dlm_recover_locks 0 in >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>>> dlm_recover_rsbs 1 done >>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover >>>> 11 generation 15 done: 0 ms >>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection >>>> to node 2 >>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>>> Down >>>> >>>> User space: >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node >>>> dev1n2 for shutdown >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>> Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>> Couldn't expand vpbx_vg_cl_demote_0 >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >>>> stonith-dev1n1 (dev1n2) >>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State >>>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> 1d8 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>>> do_shutdown of dev1n2 (op 63) is complete >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>>> transition >>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>>> left the membership and a new membership (10.28.45.27:30736) was >>>> formed. >>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>>> synchronization, ready to provide service. >>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>>> ip:192.168.1.2 left >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>>> '(any)' >>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>>> output: Chassis Power Control: Reset >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>>> Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: >>>> OK >>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>> >>>> Greets, >>>> Jacek >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean at rentul.net Tue Nov 20 12:30:00 2012 From: sean at rentul.net (Sean Lutner) Date: Tue, 20 Nov 2012 07:30:00 -0500 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: References: Message-ID: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> You don't need to do that. Running the LVM commands in one node is all you need to do assuming that its the same storage presented to both hosts. 
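To make that concrete, a rough sketch of the whole sequence on RHEL/CentOS 6 (assuming the same shared LUN is visible on both nodes and cman is already running; the volume group, LV, cluster and filesystem names below are only illustrative):

  # on BOTH nodes: switch LVM to cluster-wide locking and start clvmd
  lvmconf --enable-cluster
  chkconfig clvmd on
  service clvmd start

  # on ONE node only: create the clustered VG/LV and make the filesystem
  vgcreate -cy vg03 /dev/mapper/mpathX      # mpathX = the shared LUN, adjust to your storage
  lvcreate -l 100%FREE -n lvol0 vg03
  mkfs.gfs2 -p lock_dlm -t sap-cluster1:saptrans -j 2 /dev/vg03/lvol0

  # on BOTH nodes: the clustered LV should now be visible, so mount it
  lvs
  mount -t gfs2 /dev/vg03/lvol0 /usr/sap/trans

The -t value has to be <clustername>:<fsname>, with the cluster name exactly as it appears in cluster.conf, and -j needs at least one journal per node that will mount the filesystem. Once the manual mount works on both nodes, an /etc/fstab entry with _netdev (or a clusterfs resource in rgmanager) simply automates it at boot.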
Sent from my iPhone On Nov 20, 2012, at 7:02 AM, emmanuel segura wrote: > Do it the same step on second server > > 2012/11/20 james cyriac >> Hi, >> >> can you send the detials,i have to put entry in both servers?now i created >> >> map disk 150G both servers >> and created in node 1 vg03 >> then >> mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 >> >> now i able to mount in first server. >> >> >> /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 >> >> On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >>> You have to use /etc/fstab with _netdev option, redhat cluster doesn't support active/active service >>> >>> >>> 2012/11/20 james cyriac >>>> Hi all, >>>> >>>> i am installing redhat cluster 6 two node cluser.the issue is i am not able to mount my GFS file sytem in both the node at same time.. >>>> >>>> please find my clustat output .. >>>> >>>> >>>> [root at saperpprod01 ~]# clustat >>>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>>> Member Status: Quorate >>>> Member Name ID Status >>>> ------ ---- ---- ------ >>>> saperpprod01 1 Online, Local, rgmanager >>>> saperpprod02 2 Online, rgmanager >>>> Service Name Owner (Last) State >>>> ------- ---- ----- ------ ----- >>>> service:oracle saperpprod01 started >>>> service:profile-gfs saperpprod01 started >>>> service:sap saperpprod01 started >>>> [root at saperpprod01 ~]# >>>> oralce and sap is fine and it is flaying in both nodes.i want mount my GFS vols same time at both the nodes. >>>> >>>> Thanks in advacne >>>> james >>>> >>>> >>>> but profile-gfs is GFS file system and i want present the GFS mount point same time both the node.please help me this >>>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>>> Hi, >>>>> >>>>> I am setting up a cluster using: >>>>> >>>>> Linux kernel 3.6.6 >>>>> Corosync 2.1.0 >>>>> DLM 4.0.0 >>>>> CLVMD 2.02.98 >>>>> Pacemaker 1.1.8 >>>>> DRBD 8.3.13 >>>>> >>>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>>> >>>>> It goes like that: >>>>> - resources using the shared storage are properly stopped by Pacemaker. >>>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>>> - Pacemaker cleanly exits >>>>> - CLVMD is stopped. >>>>> ? dlm_controld is stopped >>>>> ? corosync is being stopped >>>>> >>>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>>> the other node. I would expect it continue with a clean shutdown. >>>>> >>>>> Any idea how to debug/fix it? >>>>> Is this '541 cpg_dispatch error 9' the problem? >>>>> >>>>> Logs from the node being shut down (log file system mounted with the 'sync' >>>>> option, syslog shutdown delayed as much as possible): >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker terminated >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: Terminating drbd0_worker >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving the lockspace group... 
>>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group event done 0 0 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: release_lockspace final free >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing connection to node 2 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing connection to node 1 >>>>> >>>>> User space: >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: Stopping cib: Sent -15 to process 1279 >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: pcmk_shutdown_worker: Shutdown complete >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all Corosync service engines. >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration map access >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync configuration service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync profile loading service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the watchdog. 
>>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine unloaded: corosync watchdog service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster Engine exiting normally >>>>> >>>>> >>>>> Logs from the surviving node: >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( Unconnected -> WFConnection ) >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: dlm_recover 11 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: dlm_clear_toss 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove member 2 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: dlm_recover_members 1 nodes >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation 15 slots 1 1:1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: dlm_recover_directory >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: dlm_recover_directory 0 in 0 new >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: dlm_recover_directory 0 out 0 messages >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: dlm_recover_masters >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: dlm_recover_masters 0 of 1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: dlm_recover_locks 0 out >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: dlm_recover_locks 0 in >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: dlm_recover_rsbs 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: dlm_recover 11 generation 15 done: 0 ms >>>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing connection to node 2 >>>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is Down >>>>> >>>>> User space: >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling Node dev1n2 for shutdown >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop stonith-dev1n1 (dev1n2) >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 1d8 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: do_shutdown of dev1n2 (op 63) is complete >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 
1e3 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous transition >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or left the membership and a new membership (10.28.45.27:30736) was formed. >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service synchronization, ready to provide service. >>>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid 27225 nodedown time 1353314983 fence_all dlm_stonith >>>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] ip:192.168.1.2 left >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device '(any)' >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool output: Chassis Power Control: Reset >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: Operation reboot of dev1n2 by dev1n1 for stonith-api.27225 at dev1n1.71447261: OK >>>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>>> >>>>> Greets, >>>>> Jacek >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> -- >>> esta es mi vida e me la vivo hasta que dios quiera >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamescyriac76 at gmail.com Tue Nov 20 13:05:55 2012 From: jamescyriac76 at gmail.com (james cyriac) Date: Tue, 20 Nov 2012 17:05:55 +0400 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> References: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> Message-ID: Thanks to all i rebooted the node2 now i am bale to mount both servers. now how i can add this service in Cluster,becase i have to assgin a IP for this service. Thanks james On Tue, Nov 20, 2012 at 4:30 PM, Sean Lutner wrote: > You don't need to do that. 
Running the LVM commands in one node is all you > need to do assuming that its the same storage presented to both hosts. > > Sent from my iPhone > > On Nov 20, 2012, at 7:02 AM, emmanuel segura wrote: > > Do it the same step on second server > > 2012/11/20 james cyriac > >> Hi, >> >> can you send the detials,i have to put entry in both servers?now i >> created >> >> map disk 150G both servers >> and created in node 1 vg03 >> then >> mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 >> >> now i able to mount in first server. >> >> >> /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 >> >> On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >> >>> You have to use /etc/fstab with _netdev option, redhat cluster doesn't >>> support active/active service >>> >>> >>> 2012/11/20 james cyriac >>> >>>> Hi all, >>>> >>>> i am installing redhat cluster 6 two node cluser.the issue is i am not >>>> able to mount my GFS file sytem in both the node at same time.. >>>> >>>> please find my clustat output .. >>>> >>>> >>>> [root at saperpprod01 ~]# clustat >>>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>>> Member Status: Quorate >>>> Member Name ID >>>> Status >>>> ------ ---- ---- >>>> ------ >>>> saperpprod01 1 >>>> Online, Local, rgmanager >>>> saperpprod02 2 >>>> Online, rgmanager >>>> Service Name Owner >>>> (Last) State >>>> ------- ---- ----- >>>> ------ ----- >>>> service:oracle >>>> saperpprod01 started >>>> service:profile-gfs >>>> saperpprod01 started >>>> service:sap >>>> saperpprod01 started >>>> [root at saperpprod01 ~]# >>>> oralce and sap is fine and it is flaying in both nodes.i want mount my >>>> GFS vols same time at both the nodes. >>>> >>>> Thanks in advacne >>>> james >>>> >>>> >>>> but profile-gfs is GFS file system and i want present the GFS mount >>>> point same time both the node.please help me this >>>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>> >>>>> Hi, >>>>> >>>>> I am setting up a cluster using: >>>>> >>>>> Linux kernel 3.6.6 >>>>> Corosync 2.1.0 >>>>> DLM 4.0.0 >>>>> CLVMD 2.02.98 >>>>> Pacemaker 1.1.8 >>>>> DRBD 8.3.13 >>>>> >>>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>>> >>>>> It goes like that: >>>>> - resources using the shared storage are properly stopped by Pacemaker. >>>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>>> - Pacemaker cleanly exits >>>>> - CLVMD is stopped. >>>>> ? dlm_controld is stopped >>>>> ? corosync is being stopped >>>>> >>>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>>> the other node. I would expect it continue with a clean shutdown. >>>>> >>>>> Any idea how to debug/fix it? >>>>> Is this '541 cpg_dispatch error 9' the problem? >>>>> >>>>> Logs from the node being shut down (log file system mounted with the >>>>> 'sync' >>>>> option, syslog shutdown delayed as much as possible): >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>>>> terminated >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: >>>>> Terminating drbd0_worker >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving >>>>> the lockspace group... 
>>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group >>>>> event done 0 0 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>>>> release_lockspace final free >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing >>>>> connection to node 1 >>>>> >>>>> User space: >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: >>>>> Stopping cib: Sent -15 to process 1279 >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: >>>>> pcmk_shutdown_worker: Shutdown complete >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all >>>>> Corosync service engines. >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync vote quorum service v1.0 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration map access >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster closed process group service v1.01 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster quorum service v0.1 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync profile loading service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>>>> watchdog. 
>>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync watchdog service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster >>>>> Engine exiting normally >>>>> >>>>> >>>>> Logs from the surviving node: >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>>>> Unconnected -> WFConnection ) >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: >>>>> dlm_recover 11 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>>>> dlm_clear_toss 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>>>> member 2 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>>>> dlm_recover_members 1 nodes >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>>>> 15 slots 1 1:1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>>>> dlm_recover_directory >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>>>> dlm_recover_directory 0 in 0 new >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>>>> dlm_recover_directory 0 out 0 messages >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>>>> dlm_recover_masters >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>>>> dlm_recover_masters 0 of 1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>>>> dlm_recover_locks 0 out >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>>>> dlm_recover_locks 0 in >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>>>> dlm_recover_rsbs 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: >>>>> dlm_recover 11 generation 15 done: 0 ms >>>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>>>> Down >>>>> >>>>> User space: >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling >>>>> Node dev1n2 for shutdown >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: LogActions: Stop >>>>> stonith-dev1n1 (dev1n2) >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>>>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>>>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>>>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: >>>>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>>>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> 1d8 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>>>> do_shutdown of dev1n2 (op 63) is complete >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 
corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>>>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>>>> transition >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>>>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>>>> left the membership and a new membership (10.28.45.27:30736) was >>>>> formed. >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>>>> synchronization, ready to provide service. >>>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>>>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>>>> ip:192.168.1.2 left >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>>>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>>>> '(any)' >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>>>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>>>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>>>> output: Chassis Power Control: Reset >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>>>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>>>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>>>> Operation reboot of dev1n2 by dev1n1 for >>>>> stonith-api.27225 at dev1n1.71447261: OK >>>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>>>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>>>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>>> >>>>> Greets, >>>>> Jacek >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> >>> -- >>> esta es mi vida e me la vivo hasta que dios quiera >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From emi2fast at gmail.com Tue Nov 20 13:15:52 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 20 Nov 2012 14:15:52 +0100 Subject: [Linux-cluster] how to mount GFS volumes same time both the cluster nodes? In-Reply-To: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> References: <1FD6615A-6518-41C6-ABE5-C4FDCEC94FAF@rentul.net> Message-ID: Sorry but i am talking about fstab 2012/11/20 Sean Lutner > You don't need to do that. Running the LVM commands in one node is all you > need to do assuming that its the same storage presented to both hosts. > > Sent from my iPhone > > On Nov 20, 2012, at 7:02 AM, emmanuel segura wrote: > > Do it the same step on second server > > 2012/11/20 james cyriac > >> Hi, >> >> can you send the detials,i have to put entry in both servers?now i >> created >> >> map disk 150G both servers >> and created in node 1 vg03 >> then >> mkfs.gfs2 -p lock_dlm -t sap-cluster1:gfs2 -j 8 /dev/vg03/lvol0 >> >> now i able to mount in first server. >> >> >> /dev/vg03/lvol0 /usr/sap/trans gfs2 defaults 0 0 >> >> On Tue, Nov 20, 2012 at 3:07 PM, emmanuel segura wrote: >> >>> You have to use /etc/fstab with _netdev option, redhat cluster doesn't >>> support active/active service >>> >>> >>> 2012/11/20 james cyriac >>> >>>> Hi all, >>>> >>>> i am installing redhat cluster 6 two node cluser.the issue is i am not >>>> able to mount my GFS file sytem in both the node at same time.. >>>> >>>> please find my clustat output .. >>>> >>>> >>>> [root at saperpprod01 ~]# clustat >>>> Cluster Status for sap-cluster1 @ Tue Nov 20 14:51:28 2012 >>>> Member Status: Quorate >>>> Member Name ID >>>> Status >>>> ------ ---- ---- >>>> ------ >>>> saperpprod01 1 >>>> Online, Local, rgmanager >>>> saperpprod02 2 >>>> Online, rgmanager >>>> Service Name Owner >>>> (Last) State >>>> ------- ---- ----- >>>> ------ ----- >>>> service:oracle >>>> saperpprod01 started >>>> service:profile-gfs >>>> saperpprod01 started >>>> service:sap >>>> saperpprod01 started >>>> [root at saperpprod01 ~]# >>>> oralce and sap is fine and it is flaying in both nodes.i want mount my >>>> GFS vols same time at both the nodes. >>>> >>>> Thanks in advacne >>>> james >>>> >>>> >>>> but profile-gfs is GFS file system and i want present the GFS mount >>>> point same time both the node.please help me this >>>> On Mon, Nov 19, 2012 at 1:16 PM, Jacek Konieczny wrote: >>>> >>>>> Hi, >>>>> >>>>> I am setting up a cluster using: >>>>> >>>>> Linux kernel 3.6.6 >>>>> Corosync 2.1.0 >>>>> DLM 4.0.0 >>>>> CLVMD 2.02.98 >>>>> Pacemaker 1.1.8 >>>>> DRBD 8.3.13 >>>>> >>>>> Now I have stuck on the 'clean shutdown of a node' scenario. >>>>> >>>>> It goes like that: >>>>> - resources using the shared storage are properly stopped by Pacemaker. >>>>> - DRBD is cleanly demoted and unconfigured by Pacemaker >>>>> - Pacemaker cleanly exits >>>>> - CLVMD is stopped. >>>>> ? dlm_controld is stopped >>>>> ? corosync is being stopped >>>>> >>>>> and at this point the node is fenced (rebooted) by the dlm_controld on >>>>> the other node. I would expect it continue with a clean shutdown. >>>>> >>>>> Any idea how to debug/fix it? >>>>> Is this '541 cpg_dispatch error 9' the problem? 
>>>>> >>>>> Logs from the node being shut down (log file system mounted with the >>>>> 'sync' >>>>> option, syslog shutdown delayed as much as possible): >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049407] block drbd0: worker >>>>> terminated >>>>> Nov 19 09:49:40 dev1n2 kernel: : [ 542.049412] block drbd0: >>>>> Terminating drbd0_worker >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.934390] dlm: clvmd: leaving >>>>> the lockspace group... >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937584] dlm: clvmd: group >>>>> event done 0 0 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.937897] dlm: clvmd: >>>>> release_lockspace final free >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961407] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:43 dev1n2 kernel: : [ 544.961431] dlm: closing >>>>> connection to node 1 >>>>> >>>>> User space: >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: stop_child: >>>>> Stopping cib: Sent -15 to process 1279 >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> stonithd:1281:0x7fc423dfd5e0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1db >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> cib:1279:0x7fc4240008d0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 cib[1279]: notice: terminate_cs_connection: >>>>> Disconnecting from Corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1dd >>>>> Nov 19 09:49:41 dev1n2 pacemakerd[1267]: notice: >>>>> pcmk_shutdown_worker: Shutdown complete >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf8ed0 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 notifyd[1139]: [notice] dev1n2[2] >>>>> pacemakerd:1267:0x7fc423bf7660 is now disconnected from corosync >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1de >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:41 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e1 >>>>> Nov 19 09:49:43 dev1n2 dlm_controld[1142]: 541 cpg_dispatch error 9 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [TOTEM ] Retransmit List: 1e7 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Unloading all >>>>> Corosync service engines. 
>>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync vote quorum service v1.0 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration map access >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync configuration service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster closed process group service v1.01 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [QB ] withdrawing server >>>>> sockets >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync cluster quorum service v0.1 >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync profile loading service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [WD ] magically closing the >>>>> watchdog. >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [SERV ] Service engine >>>>> unloaded: corosync watchdog service >>>>> Nov 19 09:49:43 dev1n2 corosync[1130]: [MAIN ] Corosync Cluster >>>>> Engine exiting normally >>>>> >>>>> >>>>> Logs from the surviving node: >>>>> >>>>> Kernel: >>>>> Nov 19 09:49:39 dev1n1 kernel: : [80664.615988] block drbd0: conn( >>>>> Unconnected -> WFConnection ) >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497187] dlm: clvmd: >>>>> dlm_recover 11 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497241] dlm: clvmd: >>>>> dlm_clear_toss 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497291] dlm: clvmd: remove >>>>> member 2 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497295] dlm: clvmd: >>>>> dlm_recover_members 1 nodes >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497298] dlm: clvmd: generation >>>>> 15 slots 1 1:1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497300] dlm: clvmd: >>>>> dlm_recover_directory >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497302] dlm: clvmd: >>>>> dlm_recover_directory 0 in 0 new >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497304] dlm: clvmd: >>>>> dlm_recover_directory 0 out 0 messages >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497307] dlm: clvmd: >>>>> dlm_recover_masters >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497309] dlm: clvmd: >>>>> dlm_recover_masters 0 of 1 >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497311] dlm: clvmd: >>>>> dlm_recover_locks 0 out >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497313] dlm: clvmd: >>>>> dlm_recover_locks 0 in >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497314] dlm: clvmd: >>>>> dlm_recover_rsbs 1 done >>>>> Nov 19 09:49:42 dev1n1 kernel: : [80667.497366] dlm: clvmd: >>>>> dlm_recover 11 generation 15 done: 0 ms >>>>> Nov 19 09:49:43 dev1n1 kernel: : [80668.211818] dlm: closing >>>>> connection to node 2 >>>>> Nov 19 09:49:46 dev1n1 kernel: : [80670.779015] igb: p1p2 NIC Link is >>>>> Down >>>>> >>>>> User space: >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: stage6: Scheduling >>>>> Node dev1n2 for shutdown >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: error: rsc_expand_action: >>>>> Couldn't expand vpbx_vg_cl_demote_0 >>>>> Nov 19 09:49:40 dev1n1 
pengine[1078]: notice: LogActions: Stop >>>>> stonith-dev1n1 (dev1n2) >>>>> Nov 19 09:49:40 dev1n1 pengine[1078]: notice: process_pe_message: >>>>> Calculated Transition 17: /var/lib/pacemaker/pengine/pe-input-1035.bz2 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d1 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: run_graph: Transition 17 >>>>> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, >>>>> Source=/var/lib/pacemaker/pengine/pe-input-1035.bz2): Complete >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: do_state_transition: >>>>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS >>>>> cause=C_FSA_INTERNAL origin=notify_crmd ] >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d4 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> 1d8 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1d6 >>>>> Nov 19 09:49:40 dev1n1 crmd[1080]: notice: peer_update_callback: >>>>> do_shutdown of dev1n2 (op 63) is complete >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1df >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:40 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e3 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e6 >>>>> Nov 19 09:49:42 dev1n1 corosync[1004]: [TOTEM ] Retransmit List: 1e9 >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [QUORUM] Members[1]: 1 >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: >>>>> corosync_mark_unseen_peer_dead: Node 2/dev1n2 was not seen in the previous >>>>> transition >>>>> Nov 19 09:49:43 dev1n1 crmd[1080]: notice: crm_update_peer_state: >>>>> corosync_mark_unseen_peer_dead: Node dev1n2[2] - state is now lost >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [TOTEM ] A processor joined or >>>>> left the membership and a new membership (10.28.45.27:30736) was >>>>> formed. >>>>> Nov 19 09:49:43 dev1n1 corosync[1004]: [MAIN ] Completed service >>>>> synchronization, ready to provide service. 
>>>>> Nov 19 09:49:43 dev1n1 dlm_controld[1014]: 80664 fence request 2 pid >>>>> 27225 nodedown time 1353314983 fence_all dlm_stonith >>>>> Nov 19 09:49:43 dev1n1 notifyd[1010]: [notice] 192.168.1.2[2] >>>>> ip:192.168.1.2 left >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: stonith_command: >>>>> Client stonith-api.27225.b5ff8f4d wants to fence (reboot) '2' with device >>>>> '(any)' >>>>> Nov 19 09:49:43 dev1n1 stonith-ng[1075]: notice: >>>>> initiate_remote_stonith_op: Initiating remote operation reboot for dev1n2: >>>>> 71447261-0e53-4b20-b628-d3f026a4ae24 (0) >>>>> Nov 19 09:49:44 dev1n1 external/ipmi[27242]: [27254]: debug: ipmitool >>>>> output: Chassis Power Control: Reset >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: log_operation: >>>>> Operation 'reboot' [27234] (call 0 from stonith-api.27225) for host >>>>> 'dev1n2' with device 'stonith-dev1n2' returned: 0 (OK) >>>>> Nov 19 09:49:45 dev1n1 stonith-ng[1075]: notice: remote_op_done: >>>>> Operation reboot of dev1n2 by dev1n1 for >>>>> stonith-api.27225 at dev1n1.71447261: OK >>>>> Nov 19 09:49:45 dev1n1 crmd[1080]: notice: tengine_stonith_notify: >>>>> Peer dev1n2 was terminated (st_notify_fence) by dev1n1 for dev1n1: OK >>>>> (ref=71447261-0e53-4b20-b628-d3f026a4ae24) by client stonith-api.27225 >>>>> >>>>> Greets, >>>>> Jacek >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> >>> -- >>> esta es mi vida e me la vivo hasta que dios quiera >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From felipe.o.gutierrez at gmail.com Tue Nov 20 13:54:46 2012 From: felipe.o.gutierrez at gmail.com (Felipe Gutierrez) Date: Tue, 20 Nov 2012 10:54:46 -0300 Subject: [Linux-cluster] new Resources on heartbeat can't start Message-ID: Hi everyone, I am trying to setup a new resource on my heartbeat, but for some reason the resour doesn't come on. Does anyone have some hint, please? root at cloud9:/etc/heartbeat# crm_mon -1 ============ Last updated: Tue Nov 20 10:45:38 2012 Last change: Tue Nov 20 10:41:57 2012 via crm_shadow on cloud9 Stack: Heartbeat Current DC: cloud9 (55e3a080-6988-4bb4-814c-f63b20137601) - partition with quorum Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 2 Nodes configured, unknown expected votes 2 Resources configured. 
============ Online: [ cloud10 cloud9 ] FAILOVER-IP (ocf::heartbeat:IPaddr): Started cloud9 FAILED Failed actions: FAILOVER-IP_start_0 (node=cloud9, call=92, rc=1, status=complete): unknown error failover-ip_start_0 (node=cloud9, call=4, rc=1, status=complete): unknown error FAILOVER-IP_start_0 (node=cloud10, call=4, rc=1, status=complete): unknown error failover-ip_start_0 (node=cloud10, call=48, rc=1, status=complete): unknown error root at cloud9:/etc/heartbeat# My ha.cf file is like this: # enable pacemaker, without stonith crm yes # log where ? logfacility local0 # warning of soon be dead warntime 10 # declare a host (the other node) dead after: deadtime 20 # dead time on boot (could take some time until net is up) initdead 120 # time between heartbeats keepalive 2 # What UDP port to use for udp or ppp-udp communication? # udpport 694 # bcast eth0 # mcast eth0 225.0.0.1 694 1 0 # ucast eth0 192.168.188.9 # What interfaces to heartbeat over? # udp eth0 # the nodes node cloud9 node cloud10 # heartbeats, over dedicated replication interface! ucast eth0 192.168.188.9 # ignored by node1 (owner of ip) ucast eth0 192.168.188.10 # ignored by node2 (owner of ip) # ping the switch to assure we are online ping 192.168.178.1 Best Regards, Felipe -- *-- -- Felipe Oliveira Gutierrez -- Felipe.o.Gutierrez at gmail.com -- https://sites.google.com/site/lipe82/Home/diaadia* -------------- next part -------------- An HTML attachment was scrubbed... URL: From jfriesse at redhat.com Wed Nov 21 10:19:02 2012 From: jfriesse at redhat.com (Jan Friesse) Date: Wed, 21 Nov 2012 11:19:02 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121119093920.GB20419@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> Message-ID: <50ACAA96.60401@redhat.com> Jacek Konieczny napsal(a): > On Mon, Nov 19, 2012 at 10:16:48AM +0100, Jacek Konieczny wrote: >> It goes like that: >> - resources using the shared storage are properly stopped by Pacemaker. >> - DRBD is cleanly demoted and unconfigured by Pacemaker >> - Pacemaker cleanly exits >> - CLVMD is stopped. >> ? dlm_controld is stopped >> ? corosync is being stopped >> >> and at this point the node is fenced (rebooted) by the dlm_controld on >> the other node. I would expect it continue with a clean shutdown. >> >> Any idea how to debug/fix it? >> Is this '541 cpg_dispatch error 9' the problem? > > I found a workaround: I have added a 10 seconds pause between > dlm_controld and corosync shutdown. The node shuts down cleanly now (is > not fenced). '541 cpg_dispatch error 9' is still there in the logs, > though. > > Greets, > Jacek > Hi, we've discussed this problem with dave, but I would like to get some information: - What distro are you using? - Packages are compiled or disro? - what you mean by "clean shutdown"? This is something like service dlm_control stop, or your own script? Thanks, Honza From jajcus at jajcus.net Wed Nov 21 14:48:59 2012 From: jajcus at jajcus.net (Jacek Konieczny) Date: Wed, 21 Nov 2012 15:48:59 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <50ACAA96.60401@redhat.com> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> <50ACAA96.60401@redhat.com> Message-ID: <20121121144858.GF2125@jajo.eggsoft> On Wed, Nov 21, 2012 at 11:19:02AM +0100, Jan Friesse wrote: > Hi, > we've discussed this problem with dave, but I would like to get some > information: > - What distro are you using? 
PLD Linux > - Packages are compiled or disro? I am making packages for the distro as a part of my job. > - what you mean by "clean shutdown"? This is something like service > dlm_control stop, or your own script? systemd, using the corosync.service unit file provided with corosync sources (it is far from being 'systemd' native) and the dlm.service as comes with dlm sources (includes my patches). Shutdown is started by '/sbin/halt' or '/sbin/reboot' using standard systemd procedure. I have added some rules to make sure Pacemaker is stopped before the rest, but dlm and corosync order is not affected. Systemd kills dlm_controld first and as soon as it exits its initiates stop of corosync. Adding an artificial delay between those two fixes my problem. When calling shutdown scripts by hand or the old SysVinit way (through other shell scripts), the delay between the two jobs could be 'naturally' longer. Unfortunately, I have been distracted recently by some other, higher priority, job, so I could not do more investigation in this matter (still on my TODO, though). Greets, Jacek From jfriesse at redhat.com Wed Nov 21 16:19:02 2012 From: jfriesse at redhat.com (Jan Friesse) Date: Wed, 21 Nov 2012 17:19:02 +0100 Subject: [Linux-cluster] node fenced by dlm_controld on a clean shutdown In-Reply-To: <20121121144858.GF2125@jajo.eggsoft> References: <20121119091647.GA20419@jajo.eggsoft> <20121119093920.GB20419@jajo.eggsoft> <50ACAA96.60401@redhat.com> <20121121144858.GF2125@jajo.eggsoft> Message-ID: <50ACFEF6.3040907@redhat.com> Jacek Konieczny napsal(a): > On Wed, Nov 21, 2012 at 11:19:02AM +0100, Jan Friesse wrote: >> Hi, >> we've discussed this problem with dave, but I would like to get some >> information: >> - What distro are you using? > > PLD Linux > >> - Packages are compiled or disro? > > I am making packages for the distro as a part of my job. > >> - what you mean by "clean shutdown"? This is something like service >> dlm_control stop, or your own script? > > systemd, using the corosync.service unit file provided with corosync > sources (it is far from being 'systemd' native) and the dlm.service Ya, far far away. But it has good reasons... > as comes with dlm sources (includes my patches). > > Shutdown is started by '/sbin/halt' or '/sbin/reboot' using standard > systemd procedure. I have added some rules to make sure Pacemaker is > stopped before the rest, but dlm and corosync order is not affected. > Ok, cool. This is information I was seeking. > Systemd kills dlm_controld first and as soon as it exits its initiates > stop of corosync. Adding an artificial delay between those two fixes my > problem. > Problem may be, that if dlm_controld refuses to exit, maybe (= this is theory) it will kill it anyway. > When calling shutdown scripts by hand or the old SysVinit way (through > other shell scripts), the delay between the two jobs could be > 'naturally' longer. > > Unfortunately, I have been distracted recently by some other, higher > priority, job, so I could not do more investigation in this matter > (still on my TODO, though). > Understand. You gave me enough information anyway, so thanks. > Greets, > Jacek > Regards, Honza From parvez.h.shaikh at gmail.com Fri Nov 23 05:25:11 2012 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 23 Nov 2012 10:55:11 +0530 Subject: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished? Message-ID: Hi experts, I am using Red Hat Cluster available on RHEL 5.5. 
And it doesn't have any inbuilt mechanism to generate SNMP traps in failures of resources or failover of services from one node to another. I have a script agent, which starts, stops and checks status of my application. Is it possible that in a script resource - to distinguish between normal startup of service / resource vs startup of service/resource in response to failover / failure handling? Doing so would help me write code to generate alarms if startup of service / resource (in my case a process) is due to failover (not normal startup). Further is it possible to get information such as cause of failure(leading to failover), and previous cluster node on which service / resource was running(prior to failover)? This would help to provide as much information as possible in traps Thanks, Parvez -------------- next part -------------- An HTML attachment was scrubbed... URL: From kolapallisatya531 at gmail.com Fri Nov 23 09:24:59 2012 From: kolapallisatya531 at gmail.com (satya suresh kolapalli) Date: Fri, 23 Nov 2012 14:54:59 +0530 Subject: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished? In-Reply-To: References: Message-ID: Hi, send the script which you have On 23 November 2012 10:55, Parvez Shaikh wrote: > Hi experts, > > I am using Red Hat Cluster available on RHEL 5.5. And it doesn't have any > inbuilt mechanism to generate SNMP traps in failures of resources or > failover of services from one node to another. > > I have a script agent, which starts, stops and checks status of my > application. Is it possible that in a script resource - to distinguish > between normal startup of service / resource vs startup of service/resource > in response to failover / failure handling? Doing so would help me write > code to generate alarms if startup of service / resource (in my case a > process) is due to failover (not normal startup). > > Further is it possible to get information such as cause of failure(leading > to failover), and previous cluster node on which service / resource was > running(prior to failover)? > > This would help to provide as much information as possible in traps > > Thanks, > Parvez > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Regards, SatyaSuresh Kolapalli Mob: 7702430892 From mgrac at redhat.com Sun Nov 25 13:41:11 2012 From: mgrac at redhat.com (Marek Grac) Date: Sun, 25 Nov 2012 14:41:11 +0100 Subject: [Linux-cluster] Fence agents - breaking compatibility Message-ID: <50B21FF7.5000404@redhat.com> Hi, In last few weeks there were a lot of internal changed in source code of fence agents to make code more readable, clean and adaptable. In order to clean up code a bit more and remove historical burden, we would like to remove/replace some options (details are in 7 patches at cluster-devel). These changes will be part of next major version of fence agents. There will be at least one upstream release (3.1.12) without these changes. 
Brief overview: * most of the fence agents: * removed option -T / test (command line / STDIN) --> you can use -o monitor / action to test if fence device is working * removed option -q / quiet --> fence agents are quiet enough by default * replaced --udpport / udpport --> use --ipport / ipport * on STDIN we also supported these options (this transition was done automatically in code) which are now replaced blade -> port option -> action fm -> port hostname -> ippaddr * (fence_drac5) removed -m / modulename / module_name --> replaced by standard -n / --plug / port this affect only Dell Drac CMC as other Drac devices do not use machine specification at all * (fence_lpar) removed -n / partition --> replaced by standard -n / --plug / port * (fence_rsb) removed -n / telnet_port --> replaced by --ipport / ipport m, From Elliott.Barrere at mywedding.com Mon Nov 26 19:18:54 2012 From: Elliott.Barrere at mywedding.com (Elliott Barrere) Date: Mon, 26 Nov 2012 19:18:54 +0000 Subject: [Linux-cluster] Set packet src address to a cluster-managed IP Message-ID: Hi everyone, I have a RHEL 5.8 cluster that manages several IP addresses (among other services). While this works fine for "serving" content (i.e. when a client hits one of the managed IP addresses the content is delivered), I also need the server to _send_ new packets from the managed address (this is an Asterisk cluster so it sends SIP invites to clients, which are rejected unless they come from the correct IP). I can successfully set the source address for packets by running something like this: ip route change 10.X.X.0/24 dev eth0 src 10.X.X.10 and this solves my problem. However, this solution is not "cluster aware", nor is it permanent across reboots. I could write a script to update the src address after the cluster IPs are applied, but that seems like a bit of a hack. Has anyone else had this problem? Any advice for how to deal with it? I can't imagine I'm the only one wanting to do this. Cheers, -elliott- From lists at alteeve.ca Tue Nov 27 05:15:16 2012 From: lists at alteeve.ca (Digimer) Date: Tue, 27 Nov 2012 00:15:16 -0500 Subject: [Linux-cluster] cluster 3.2.0 released Message-ID: <50B44C64.6000800@alteeve.ca> Welcome to the cluster 3.2.0 release. This new major release features improvements in the fencing area and several bug fixes across the stack. * New cluster recovery mechanism has been added based on hardware watchdog: ** fence_sanlock (req: wdmd 2,6+, fence_sanlock 2.6+) ** checkquorum.wdmd (req: cman 3.2.0+, wdmd 2.6+) ** Details and usage; https://alteeve.ca/w/Watchdog_Recovery * fence_check tool for verifying fence device configuration. The tool can be used in cron scripts. Please refer to the man page for operational details and caveats. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.2.0.tar.xz Change Log: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.2.0 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
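Going back to Elliott's question above: one way to make the "ip route ... src" change cluster-aware, instead of scripting it at boot, is to wrap it in a small script resource that lives in the same rgmanager service as the managed IP, so the route is only adjusted on the node that currently owns the address. A rough sketch, with placeholder subnet, source address and interface, and no real error handling:

  #!/bin/sh
  # /usr/local/sbin/src-route.sh -- set the preferred source address for the
  # local subnet to the cluster-managed IP. Called by rgmanager as a <script>
  # resource with start/stop/status.
  SUBNET=10.1.2.0/24      # placeholder
  SRC=10.1.2.10           # placeholder: the cluster-managed address
  DEV=eth0                # placeholder

  case "$1" in
    start)  ip route change $SUBNET dev $DEV src $SRC ;;
    stop)   ip route change $SUBNET dev $DEV ;;   # drop the src hint again
    status) ip route show $SUBNET | grep -q "src $SRC" ;;
    *)      echo "usage: $0 {start|stop|status}"; exit 2 ;;
  esac

In cluster.conf the script would be nested under the <ip> resource of the service, e.g. <script name="src-route" file="/usr/local/sbin/src-route.sh"/>, so it is started after the address comes up and stopped when the service moves. This is only a sketch of the idea; the iptables-based approach in the next message solves the same problem at the NAT level.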
From christian.masopust at siemens.com Tue Nov 27 07:44:19 2012 From: christian.masopust at siemens.com (Masopust, Christian) Date: Tue, 27 Nov 2012 08:44:19 +0100 Subject: [Linux-cluster] Set packet src address to a cluster-managed IP In-Reply-To: References: Message-ID: Hi Elliott, I had a similar problem with my license-server cluster (for IBM Rational ClearCase). As we found out that IBM's license daemon for ClearCase behaves very badly (sends response packets with the IP address of the NIC instead of the cluster-ip) and IBM was not able to provide a fix for that, we decided to use iptables to rewrite the addresses. For that I've added iptables servcie to my cluster configuration (only starts on that node that has the license daemon active) and configured SNAT and DNAT: iptables -A PREROUTING -d /32 -j DNAT --to-destination iptables -a POSTROUTING -s /32 -j SNAT --to-source This configuration of iptables on both nodes and (as said) iptables active only where license daemon is active and everything works fine for us :) cheers, christian > -----Urspr?ngliche Nachricht----- > Von: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von > Elliott Barrere > Gesendet: Montag, 26. November 2012 20:19 > An: > Betreff: [Linux-cluster] Set packet src address to a > cluster-managed IP > > Hi everyone, > > I have a RHEL 5.8 cluster that manages several IP addresses > (among other services). While this works fine for "serving" > content (i.e. when a client hits one of the managed IP > addresses the content is delivered), I also need the server > to _send_ new packets from the managed address (this is an > Asterisk cluster so it sends SIP invites to clients, which > are rejected unless they come from the correct IP). > > I can successfully set the source address for packets by > running something like this: > > ip route change 10.X.X.0/24 dev eth0 src 10.X.X.10 > > and this solves my problem. > > However, this solution is not "cluster aware", nor is it > permanent across reboots. I could write a script to update > the src address after the cluster IPs are applied, but that > seems like a bit of a hack. > > Has anyone else had this problem? Any advice for how to deal > with it? I can't imagine I'm the only one wanting to do this. > > Cheers, > -elliott- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From parvez.h.shaikh at gmail.com Tue Nov 27 10:23:19 2012 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Tue, 27 Nov 2012 15:53:19 +0530 Subject: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished? In-Reply-To: References: Message-ID: Kind reminder on this. Any inputs would be of great help. Basically I intend to have SNMP traps generated to notify failures and failover while using RHCS. Thanks, Parvez On Fri, Nov 23, 2012 at 2:54 PM, satya suresh kolapalli < kolapallisatya531 at gmail.com> wrote: > Hi, > > send the script which you have > > > > On 23 November 2012 10:55, Parvez Shaikh > wrote: > > Hi experts, > > > > I am using Red Hat Cluster available on RHEL 5.5. And it doesn't have any > > inbuilt mechanism to generate SNMP traps in failures of resources or > > failover of services from one node to another. > > > > I have a script agent, which starts, stops and checks status of my > > application. 
Is it possible that in a script resource - to distinguish > > between normal startup of service / resource vs startup of > service/resource > > in response to failover / failure handling? Doing so would help me write > > code to generate alarms if startup of service / resource (in my case a > > process) is due to failover (not normal startup). > > > > Further is it possible to get information such as cause of > failure(leading > > to failover), and previous cluster node on which service / resource was > > running(prior to failover)? > > > > This would help to provide as much information as possible in traps > > > > Thanks, > > Parvez > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Regards, > SatyaSuresh Kolapalli > Mob: 7702430892 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From uxbod at splatnix.net Tue Nov 27 14:58:47 2012 From: uxbod at splatnix.net (Phil Daws) Date: Tue, 27 Nov 2012 14:58:47 +0000 (GMT) Subject: [Linux-cluster] Thin (sparse) provisioning In-Reply-To: <324998424.456303.1353362334446.JavaMail.root@innovot.com> References: <324998424.456303.1353362334446.JavaMail.root@innovot.com> Message-ID: <213798016.1048691.1354028327212.JavaMail.root@innovot.com> any help of this would be gratefully appreciated. Thanks. ----- Original Message ----- From: "Phil Daws" To: Linux-cluster at redhat.com Sent: Monday, 19 November, 2012 9:58:54 PM Subject: [Linux-cluster] Thin (sparse) provisioning Hello: am learning about clustering with DRBD and GFS2 and have a question about thin provisioning. I would like to set up a number of individual vservers that reside on their own LVs which can then be shared between two nodes and flipped backwards and forwards using Pacemaker. When setting up the block/lvm device for DRBD I have used: lvcreate --virtualsize 1T --size 10G --name vserver01 vg1 once that has been added as a resource would I perform a standard mkfs.gfs2 or do I need to specify any further options; I was thinking something like: mkfs.gfs2 -t vservercluster:vservers -p lock_dlm -j 2 /dev/vservermirror/vserver01 Is that the way I should be doing it ? Thanks. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Tue Nov 27 15:14:54 2012 From: lists at alteeve.ca (Digimer) Date: Tue, 27 Nov 2012 10:14:54 -0500 Subject: [Linux-cluster] Thin (sparse) provisioning In-Reply-To: <324998424.456303.1353362334446.JavaMail.root@innovot.com> References: <324998424.456303.1353362334446.JavaMail.root@innovot.com> Message-ID: <50B4D8EE.1080606@alteeve.ca> On 11/19/2012 04:58 PM, Phil Daws wrote: > Hello: > > am learning about clustering with DRBD and GFS2 and have a question about thin provisioning. I would like to set up a number of individual vservers that reside on their own LVs which can then be shared between two nodes and flipped backwards and forwards using Pacemaker. 
When setting up the block/lvm device for DRBD I have used: > > lvcreate --virtualsize 1T --size 10G --name vserver01 vg1 > > once that has been added as a resource would I perform a standard mkfs.gfs2 or do I need to specify any further options; I was thinking something like: > > mkfs.gfs2 -t vservercluster:vservers -p lock_dlm -j 2 /dev/vservermirror/vserver01 > > Is that the way I should be doing it ? > > Thanks. I'm not entirely sure what you are trying to do here. If you want to put VMs on LVs, use clustered LVM (clvmd) and use the LVs as backing devices for the VMs. GFS2 is a great clustered FS, but no clustered FS is good for backing VMs, in my opinion. Here is how I do it: https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Provisioning_Virtual_Machines I use GFS2 for storing the install images and XML definition files only. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Elliott.Barrere at mywedding.com Wed Nov 28 20:37:33 2012 From: Elliott.Barrere at mywedding.com (Elliott Barrere) Date: Wed, 28 Nov 2012 20:37:33 +0000 Subject: [Linux-cluster] Set packet src address to a cluster-managed IP In-Reply-To: References: Message-ID: <1B1E96E7-C75B-46B9-88F5-91C1FFEE3F61@mywedding.com> That is great info, thanks! I need to run iptables all the time on my servers, but I'm sure I can work a way to add and remove the entries as needed.
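Following Christian's pattern, the add/remove part can itself be handed to the cluster so the NAT rules are only loaded on the node that owns the service. A rough sketch of a start/stop/status wrapper that could be added to the same service as the floating IP (the addresses are placeholders, and the always-on filter-table rules stay untouched because these rules live in the nat table):

  #!/bin/sh
  # /usr/local/sbin/sip-nat.sh -- add/remove source rewriting for the
  # cluster-managed address; run by rgmanager as a <script> resource.
  NIC_IP=192.168.10.5        # placeholder: address of the local NIC
  CLUSTER_IP=192.168.10.50   # placeholder: cluster-managed address

  case "$1" in
    start)
      iptables -t nat -A PREROUTING  -d $CLUSTER_IP/32 -j DNAT --to-destination $NIC_IP
      iptables -t nat -A POSTROUTING -s $NIC_IP/32     -j SNAT --to-source $CLUSTER_IP
      ;;
    stop)
      iptables -t nat -D PREROUTING  -d $CLUSTER_IP/32 -j DNAT --to-destination $NIC_IP
      iptables -t nat -D POSTROUTING -s $NIC_IP/32     -j SNAT --to-source $CLUSTER_IP
      ;;
    status)
      iptables -t nat -L POSTROUTING -n | grep -q "SNAT.*$CLUSTER_IP"
      ;;
    *) echo "usage: $0 {start|stop|status}"; exit 2 ;;
  esac

This is only an illustration of the idea, not a tested configuration; rule ordering relative to any other nat-table rules on the box would still need checking.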