From akaris at gmail.com Sun Jan 1 22:04:54 2012 From: akaris at gmail.com (Michel Nadeau) Date: Sun, 1 Jan 2012 17:04:54 -0500 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <4EFE5821.1020602@hastexo.com> References: <4EFE5821.1020602@hastexo.com> Message-ID: Hi, I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my cluster.conf : I get (when starting cman) : Starting cman... Relax-NG validity error : Extra element cman in interleave tempfile:20: element cman: Relax-NG validity error : Element cluster failed to validate content Configuration fails to validate Any idea why? Thanks, - Mike akaris at gmail.com On Fri, Dec 30, 2011 at 7:32 PM, Andreas Kurz wrote: > Hello, > > On 12/30/2011 10:16 PM, Michel Nadeau wrote: > > Hi, > > > > We're trying to configure a CMAN cluster with 2 nodes located in 2 > > different datacenters. > > only one of various problems of split-site cluster: how do you plan to > implement reliable fencing? > > > > > The 2 nodes are running Debian 6 and they can access each other on the > > private LAN (using the eth0 interface). > > > > The problem is that the 2 nodes don't have the same subnet and the > > multicast doesn't seem to work: is there any way to make this work? > > since 1.3.0 corosync supports unicasts (UDPU) ... it ships with a nice > example configuration. > > Regards, > Andreas > > -- > Need help with Corosync? > http://www.hastexo.com/now > > > > > Thanks, > > > > - Mike > > akaris at gmail.com > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From szekelyi at niif.hu Sun Jan 1 23:24:32 2012 From: szekelyi at niif.hu (=?ISO-8859-1?Q?Sz=E9kelyi?= Szabolcs) Date: Mon, 02 Jan 2012 00:24:32 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: References: <4EFE5821.1020602@hastexo.com> Message-ID: <15595999.ZutN6uBcOj@mranderson> On 2012. January 1. 17:04:54 Michel Nadeau wrote: > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > cluster.conf : > > > > I get (when starting cman) : > > Starting cman... Relax-NG validity error : Extra element cman in > interleave > tempfile:20: element cman: Relax-NG validity error : Element cluster failed > to validate content > Configuration fails to validate > > Any idea why? I've fiddled around quite a lot with this. I wanted to keep multicasting, but change the TTL to more than 1 so that the nodes' packets can reach each other. It turned out that Corosync 1.4.1 that's included in Debian Backports, supports this feature, but I've figured out that this is not enough since you need a new cman to communicate the config to Corosync. Debian has cman 3.0.12 which is unable to do this. The situation was strange, because it looked like I'm using the same version than people on this list, but this feature works for them but not for me. After further research, it turned out that there are two versions of 3.0.12 out there, 3.0.12 and 3.0.12.1. Debian has the older one, which doesn't have this feature. The latter one came out long after the old one, and according to changelogs, has significant enhancements including the one in question. 
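For reference, here is a minimal sketch of the unicast (udpu) variant that was suggested earlier in the thread, assuming a cman new enough to understand the transport attribute (3.0.12.1 or the 3.1.x series, as discussed here). The cluster name, node names and the two-node settings are placeholders, not anyone's actual configuration, and fencing is omitted for brevity:

  <?xml version="1.0"?>
  <cluster name="example" config_version="1">
    <!-- udpu = UDP unicast; avoids multicast entirely (corosync >= 1.3.0) -->
    <cman transport="udpu" two_node="1" expected_votes="1"/>
    <clusternodes>
      <!-- with udpu the member list is built from these names, so each
           name must resolve to a reachable address on every node -->
      <clusternode name="node1.example.com" nodeid="1"/>
      <clusternode name="node2.example.com" nodeid="2"/>
    </clusternodes>
  </cluster>

The alternative is to keep multicast and raise the TTL, which is the feature discussed above, but that too needs a cman new enough to pass the setting through to corosync; otherwise you are left with the iptables workaround shown further down.
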
Looking at https://fedorahosted.org/releases/c/l/cluster/, here are the version numbers ordered by release dates: 3.0.11: 21-Apr-2010 3.0.12: 11-May-2010 3.0.13: 08-Jun-2010 3.0.14: 30-Jul-2010 3.0.15: 02-Sep-2010 3.0.16: 02-Sep-2010 3.0.17: 06-Oct-2010 3.1.0: 02-Dec-2010 3.1.1: 08-Mar-2011 3.0.12.1: 27-May-2011 3.1.2: 16-Jun-2011 Wow, it looks like the cman guys have a strange idea on versioning... The size of changelogs is also very interesting. Anyway, since Debian doesn't have the "new" 3.0.12, I worked around this problem by using multicast and some iptables magic to achieve what you need: iptables -t mangle -A OUTPUT -d -j TTL --ttl-set 8 Cheers, -- cc From akaris at gmail.com Mon Jan 2 00:24:14 2012 From: akaris at gmail.com (Michel Nadeau) Date: Mon, 2 Jan 2012 00:24:14 +0000 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <15595999.ZutN6uBcOj@mranderson> References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> Message-ID: <1358280838-1325463856-cardhu_decombobulator_blackberry.rim.net-1070681867-@b15.c31.bise6.blackberry> I'm not sure to understand how your iptables rule can fix this? I'm trying to get 2 nodes in 2 datacenters using 2 IP subnets to work. -----Original Message----- From: Sz?kelyi Szabolcs Sender: linux-cluster-bounces at redhat.com Date: Mon, 02 Jan 2012 00:24:32 To: Reply-To: linux clustering Subject: Re: [Linux-cluster] CMAN across different datacenters On 2012. January 1. 17:04:54 Michel Nadeau wrote: > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > cluster.conf : > > > > I get (when starting cman) : > > Starting cman... Relax-NG validity error : Extra element cman in > interleave > tempfile:20: element cman: Relax-NG validity error : Element cluster failed > to validate content > Configuration fails to validate > > Any idea why? I've fiddled around quite a lot with this. I wanted to keep multicasting, but change the TTL to more than 1 so that the nodes' packets can reach each other. It turned out that Corosync 1.4.1 that's included in Debian Backports, supports this feature, but I've figured out that this is not enough since you need a new cman to communicate the config to Corosync. Debian has cman 3.0.12 which is unable to do this. The situation was strange, because it looked like I'm using the same version than people on this list, but this feature works for them but not for me. After further research, it turned out that there are two versions of 3.0.12 out there, 3.0.12 and 3.0.12.1. Debian has the older one, which doesn't have this feature. The latter one came out long after the old one, and according to changelogs, has significant enhancements including the one in question. Looking at https://fedorahosted.org/releases/c/l/cluster/, here are the version numbers ordered by release dates: 3.0.11: 21-Apr-2010 3.0.12: 11-May-2010 3.0.13: 08-Jun-2010 3.0.14: 30-Jul-2010 3.0.15: 02-Sep-2010 3.0.16: 02-Sep-2010 3.0.17: 06-Oct-2010 3.1.0: 02-Dec-2010 3.1.1: 08-Mar-2011 3.0.12.1: 27-May-2011 3.1.2: 16-Jun-2011 Wow, it looks like the cman guys have a strange idea on versioning... The size of changelogs is also very interesting. 
Anyway, since Debian doesn't have the "new" 3.0.12, I worked around this problem by using multicast and some iptables magic to achieve what you need: iptables -t mangle -A OUTPUT -d -j TTL --ttl-set 8 Cheers, -- cc -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From fdinitto at redhat.com Mon Jan 2 04:20:36 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 02 Jan 2012 05:20:36 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <15595999.ZutN6uBcOj@mranderson> References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> Message-ID: <4F013094.7050302@redhat.com> On 01/02/2012 12:24 AM, Sz?kelyi Szabolcs wrote: > 3.0.11: 21-Apr-2010 > 3.0.12: 11-May-2010 > 3.0.13: 08-Jun-2010 > 3.0.14: 30-Jul-2010 > 3.0.15: 02-Sep-2010 > 3.0.16: 02-Sep-2010 > 3.0.17: 06-Oct-2010 > 3.1.0: 02-Dec-2010 > 3.1.1: 08-Mar-2011 > 3.0.12.1: 27-May-2011 > 3.1.2: 16-Jun-2011 > > Wow, it looks like the cman guys have a strange idea on versioning... The size > of changelogs is also very interesting. What strange idea? The 3.0.12.x is based on the RHEL6 branch (what you ge tin RHEL6.x) and the 3.1.x serie is from STABLE31 branch from upstream. 3.0.12.x is a more strict bug fix only series. 3.1 gets a bit more. Mind to explain what's strange about it? From fdinitto at redhat.com Mon Jan 2 04:21:40 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 02 Jan 2012 05:21:40 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: References: <4EFE5821.1020602@hastexo.com> Message-ID: <4F0130D4.5040909@redhat.com> On 01/01/2012 11:04 PM, Michel Nadeau wrote: > Hi, > > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > cluster.conf : cman 6.2.0 does not exists anywhere. > > > > I get (when starting cman) : > > Starting cman... Relax-NG validity error : Extra element cman in > interleave > tempfile:20: element cman: Relax-NG validity error : Element cluster > failed to validate content > Configuration fails to validate > > Any idea why? This issue has been fixed a while ago in both upstream (3.1.x) and 3.0.12.x in RHEL. Fabio From list at fajar.net Mon Jan 2 04:33:20 2012 From: list at fajar.net (Fajar A. Nugraha) Date: Mon, 2 Jan 2012 11:33:20 +0700 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <4F013094.7050302@redhat.com> References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> <4F013094.7050302@redhat.com> Message-ID: On Mon, Jan 2, 2012 at 11:20 AM, Fabio M. Di Nitto wrote: > On 01/02/2012 12:24 AM, Sz?kelyi Szabolcs wrote: >> 3.0.12: 11-May-2010 >> 3.0.12.1: 27-May-2011 >> 3.1.2: 16-Jun-2011 >> >> Wow, it looks like the cman guys have a strange idea on versioning... The size >> of changelogs is also very interesting. > > What strange idea? The 3.0.12.x is based on the RHEL6 branch (what you > ge tin RHEL6.x) and the 3.1.x serie is from STABLE31 branch from upstream. > > 3.0.12.x is a more strict bug fix only series. 3.1 gets a bit more. > > Mind to explain what's strange about it? I think Sz?kelyi is refering to the size of 3.0.12.1 changelog and the things that went in it. If it's a "strict bug fix only series", then e.g. 
arguably these changes shouldn't be there, cause they're new features: cman init: add support for "nocluster" kernel cmdline to not start cman at boot Cman: Add support for udpu and rdma transport resource-agents: Add NFSv4 support -- Fajar From fdinitto at redhat.com Mon Jan 2 05:30:58 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 02 Jan 2012 06:30:58 +0100 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: References: <4EFE5821.1020602@hastexo.com> <15595999.ZutN6uBcOj@mranderson> <4F013094.7050302@redhat.com> Message-ID: <4F014112.4000607@redhat.com> On 01/02/2012 05:33 AM, Fajar A. Nugraha wrote: > On Mon, Jan 2, 2012 at 11:20 AM, Fabio M. Di Nitto wrote: >> On 01/02/2012 12:24 AM, Sz?kelyi Szabolcs wrote: > >>> 3.0.12: 11-May-2010 > >>> 3.0.12.1: 27-May-2011 >>> 3.1.2: 16-Jun-2011 >>> >>> Wow, it looks like the cman guys have a strange idea on versioning... The size >>> of changelogs is also very interesting. >> >> What strange idea? The 3.0.12.x is based on the RHEL6 branch (what you >> ge tin RHEL6.x) and the 3.1.x serie is from STABLE31 branch from upstream. >> >> 3.0.12.x is a more strict bug fix only series. 3.1 gets a bit more. >> >> Mind to explain what's strange about it? > > I think Sz?kelyi is refering to the size of 3.0.12.1 changelog and the > things that went in it. If it's a "strict bug fix only series", then > e.g. arguably these changes shouldn't be there, cause they're new > features: > > cman init: add support for "nocluster" kernel cmdline to not > start cman at boot > Cman: Add support for udpu and rdma transport > resource-agents: Add NFSv4 support > Request For Enhancement or integration bits are also bugs. corosync supports nocluster and so cman needs the same. corosync added support for udpu and rdma transports. In order for those to work, cman needs to understand them. Similar, NFSv4 support has been added, and that support needs to be reflected in resource-agents. The lack of those integration bits are bugs, we can argue that there is a thin gray line here. It's not always black and white. >From a cman perspective those are "new features" but when you look at it from a more global integration overview, the lack of those are issues. Fabio From akaris at gmail.com Mon Jan 2 16:52:14 2012 From: akaris at gmail.com (Michel Nadeau) Date: Mon, 2 Jan 2012 11:52:14 -0500 Subject: [Linux-cluster] CMAN across different datacenters In-Reply-To: <4F0130D4.5040909@redhat.com> References: <4EFE5821.1020602@hastexo.com> <4F0130D4.5040909@redhat.com> Message-ID: I just found out that my cman version isn't 6.2.0 (what's that version anyway? The cman_tool version?). My Debian package version is 3.0.12-3.. so I guess I don't have the udpu fix as I don't have 3.0.12.x. - Mike akaris at gmail.com On Sun, Jan 1, 2012 at 11:21 PM, Fabio M. Di Nitto wrote: > On 01/01/2012 11:04 PM, Michel Nadeau wrote: > > Hi, > > > > I upgraded to Corosync 1.4.2 and cman 6.2.0, but when I add this to my > > cluster.conf : > > cman 6.2.0 does not exists anywhere. > > > > > > > > > I get (when starting cman) : > > > > Starting cman... Relax-NG validity error : Extra element cman in > > interleave > > tempfile:20: element cman: Relax-NG validity error : Element cluster > > failed to validate content > > Configuration fails to validate > > > > Any idea why? > > This issue has been fixed a while ago in both upstream (3.1.x) and > 3.0.12.x in RHEL. 
> > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Tue Jan 3 09:52:27 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 03 Jan 2012 09:52:27 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: <4EFE116D.3080505@dbtgroup.com> References: <4EFE071F.8030406@alteeve.com> <4EFE116D.3080505@dbtgroup.com> Message-ID: <1325584347.2685.0.camel@menhir> Hi, On Fri, 2011-12-30 at 19:30 +0000, yvette hirth wrote: > Digimer wrote: > > > For GFS2, one of the easiest performance wins is to set > > 'noatime,nodiratime' in the mount options to avoid requiring locks to > > update the access times on files when you only read them. > > i've found that "noatime" implies "nodiratime", so both are not needed - > unless GFS/GFS2 behaves differently than other fs's wrt this attribute. > if so, that would be good to know for certain. > > see here: http://lwn.net/Articles/245002/ > > the article didn't specify the filesystem... > > yvette > Earlier GFS did have different atime code, but GFS2 uses the same code as all other filesystems, so the behaviour should also be the same, Steve. From swhiteho at redhat.com Tue Jan 3 09:55:28 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 03 Jan 2012 09:55:28 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: References: <4EFE071F.8030406@alteeve.com> <4EFE116D.3080505@dbtgroup.com> <4EFE1DA7.3020506@alteeve.com> Message-ID: <1325584528.2685.2.camel@menhir> Hi, On Fri, 2011-12-30 at 21:37 +0100, Stevo Slavi? wrote: > Pulling the cables between shared storage and foo01, foo01 gets > fenced. Here is some info from foo02 about shared storage and dlm > debug (lock file seems to remain locked) > > root at foo02:-//data/activemq_data#ls -li > total 276 > 66467 -rw-r--r-- 1 root root 33030144 Dec 30 16:32 db-1.log > 66468 -rw-r--r-- 1 root root 73728 Dec 30 16:24 db.data > 66470 -rw-r--r-- 1 root root 53344 Dec 30 16:24 db.redo > 128014 -rw-r--r-- 1 root root 0 Dec 30 19:49 dummy > 66466 -rw-r--r-- 1 root root 0 Dec 30 16:23 lock > root at foo02:-//data/activemq_data#grep -A 7 -i > 103a2 /debug/dlm/activemq > Resource ffff81090faf96c0 Name (len=24) " 2 103a2" > Master Copy > Granted Queue > 03d10002 PR Remote: 1 00c80001 > 00e00001 PR > Conversion Queue > Waiting Queue > -- > Resource ffff81090faf97c0 Name (len=24) " 5 103a2" > Master Copy > Granted Queue > 03c30003 PR Remote: 1 039a0001 > 03550001 PR > Conversion Queue > Waiting Queue > > > Are there some docs for interpreting this dlm debug output? > > Not as such I think. It sounds like the issue is recovery related. Are there any messages which indicate what might be going on? Once the failed node has been fenced, then recovery should proceed fairly soon afterwards, Steve. > Regards, > Stevo. > > On Fri, Dec 30, 2011 at 9:23 PM, Digimer wrote: > On 12/30/2011 03:08 PM, Stevo Slavi? wrote: > > Hi Digimer and Yvette, > > > > Thanks for tips! I don't doubt reliability of the > technology, just want > > to make sure it is configured well. > > > > After fencing a node that held a lock on a file on shared > storage, lock > > remains, and non-fenced node cannot take over the lock on > that file. > > Wondering how can one check which process (from which node > if possible) > > is holding a lock on a file on shared storage. 
> > dlm should have taken care of releasing the lock once node > got fenced, > > right? > > > > Regards, > > Stevo. > > > After a successful fence call, DLM will clean up any locks > held by the > lost node. That's why it's so critical that the fence action > succeeded > (ie: test-test-test). If a node doesn't actually die in a > fence, but the > cluster thinks it did, and somehow the lost node returns, the > lost node > will think it's locks are still valid and modify shared > storage, leading > to near-certain data corruption. > > It's all perfectly safe, provided you've tested your fencing > properly. :) > > Yvette, > > You might be right on the 'noatime' implying 'nodiratime'... > I add > both out of habit. > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swhiteho at redhat.com Tue Jan 3 09:59:16 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 03 Jan 2012 09:59:16 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: References: Message-ID: <1325584756.2685.6.camel@menhir> Hi, On Fri, 2011-12-30 at 14:39 +0100, Stevo Slavi? wrote: > Hello RedHat Linux cluster community, > > I'm in process of configuring shared filesystem storage master/slave > Apache ActiveMQ setup. For it to work, it requires reliable > distributed locking - master is node that holds exclusive lock on a > file on shared filesystem storage. > How does it do this locking? There are several possible ways this might be done, and some will work better than others. > On RHEL (5.4), using CLVM with GFS2 is one of the options that should > work. Why are you using RHEL 5.4 and not something more recent? Note that if you are a Red Hat customer, then you should contact our support team who will be very happy to assist. > Third party configured the CLVM/GFS2. I'd like to make sure that > distributed locking works OK. > What are my options for verifying this? > I think we need to verify which type of locking the application uses before we can answer this, Steve. > Thanks in advance! > > Regards, > Stevo. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Tue Jan 3 14:26:32 2012 From: linux at alteeve.com (Digimer) Date: Tue, 03 Jan 2012 09:26:32 -0500 Subject: [Linux-cluster] New Tutorial - RHCS + DRBD + KVM; 2-Node HA on EL6 Message-ID: <4F031018.4060204@alteeve.com> Hi all, I'm happy to announce a new tutorial! https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial This tutorial walks a user through the entire process of building a 2-Node cluster for making KVM virtual machines highly available. It uses Red Hat Cluster services v3 and DRBD 8.3.12. It is written such that you can use entirely free or fully Red Hat supported environments. Highlights; * Full network and power redundancy; no single-points of failure. * All off-the-shelf hardware; Storage via DRBD. * Starts with base OS install, no clustering experience required. * All software components explained. * Includes all testing steps covered. * Configuration is used in production environments! 
This tutorial is totally free (no ads, no registration) and released under the Creative Common 3.0 Share-Alike Non-Commercial license. Feedback is always appreciated! -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From jeff.sturm at eprize.com Tue Jan 3 16:31:57 2012 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 3 Jan 2012 16:31:57 +0000 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: <1325584347.2685.0.camel@menhir> References: <4EFE071F.8030406@alteeve.com> <4EFE116D.3080505@dbtgroup.com> <1325584347.2685.0.camel@menhir> Message-ID: > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Steven Whitehouse > Sent: Tuesday, January 03, 2012 4:52 AM > > Earlier GFS did have different atime code, but GFS2 uses the same code as all other > filesystems, so the behaviour should also be the same, My testing on GFS (a few years ago) showed that "noatime" definitely did not set "nodiratime" implicitly, so I've always set both. Good to know that's corrected for GFS2. -Jeff From sslavic at gmail.com Wed Jan 4 17:10:46 2012 From: sslavic at gmail.com (=?UTF-8?Q?Stevo_Slavi=C4=87?=) Date: Wed, 4 Jan 2012 18:10:46 +0100 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: <1325584756.2685.6.camel@menhir> References: <1325584756.2685.6.camel@menhir> Message-ID: Hello Steven, I guess license covers only 5.4. Anyway I'm just told it's not an option at the moment to do the upgrade. About locking used, ActiveMQ uses Java 6 standard API for trying to acquire file lock, here is javadoc for the method used: http://docs.oracle.com/javase/6/docs/api/java/nio/channels/FileChannel.html#tryLock%28long,%20long,%20boolean%29 ActiveMQ tries to obtain non-shared (thus exclusive) lock on whole file on shared file system, with range from 0 to 1, since the file being locked is empty. As documentation states, tryLock is non-blocking, it executes immediately. If ActiveMQ fails to obtain a lock it will loop (pause and retry acquiring lock again) until lock is obtained. In initial state first node obtains lock and becomes master, second one fails to obtain lock and gets into this loop, as expected. Problem is that slave ActiveMQ on cannot obtain a lock even after first node gets fenced - it reports that the file on shared storage is still locked. Simple custom java tool that I made reports the same, that the file is locked. OpenJDK 1.6 update 20 is being used as Java runtime. I haven't yet found in openjdk source exact code which tryLock will call on Linux. Is there non-Java tool that could be used to reliably check if a file (on gfs2) is (or can be) exclusively locked (regardless of where the process holding lock is running, on same or different node where the tool is being run)? Regards, Stevo. On Tue, Jan 3, 2012 at 10:59 AM, Steven Whitehouse wrote: > Hi, > > On Fri, 2011-12-30 at 14:39 +0100, Stevo Slavi? wrote: > > Hello RedHat Linux cluster community, > > > > I'm in process of configuring shared filesystem storage master/slave > > Apache ActiveMQ setup. For it to work, it requires reliable > > distributed locking - master is node that holds exclusive lock on a > > file on shared filesystem storage. > > > How does it do this locking? There are several possible ways this might > be done, and some will work better than others. 
> > > On RHEL (5.4), using CLVM with GFS2 is one of the options that should > > work. > Why are you using RHEL 5.4 and not something more recent? Note that if > you are a Red Hat customer, then you should contact our support team who > will be very happy to assist. > > > Third party configured the CLVM/GFS2. I'd like to make sure that > > distributed locking works OK. > > What are my options for verifying this? > > > I think we need to verify which type of locking the application uses > before we can answer this, > > Steve. > > > Thanks in advance! > > > > Regards, > > Stevo. > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sslavic at gmail.com Thu Jan 5 00:00:20 2012 From: sslavic at gmail.com (=?UTF-8?Q?Stevo_Slavi=C4=87?=) Date: Thu, 5 Jan 2012 01:00:20 +0100 Subject: [Linux-cluster] CLVM/GFS2 distributed locking In-Reply-To: References: <1325584756.2685.6.camel@menhir> Message-ID: Here is a simple C utility for locking file - it's combination of two sources: - reading lock info from here: http://uw714doc.sco.com/en/SDK_sysprog/_Getting_Lock_Information.html - acquiring file lock from here: http://siber.cankaya.edu.tr/ozdogan/SystemsProgramming/ceng425/node161.html Now, to make use of it. Regards, Stevo. On Wed, Jan 4, 2012 at 6:10 PM, Stevo Slavi? wrote: > Hello Steven, > > I guess license covers only 5.4. Anyway I'm just told it's not an option > at the moment to do the upgrade. > > > About locking used, ActiveMQ uses Java 6 standard API for trying to > acquire file lock, here is javadoc for the method used: > > > http://docs.oracle.com/javase/6/docs/api/java/nio/channels/FileChannel.html#tryLock%28long,%20long,%20boolean%29 > > ActiveMQ tries to obtain non-shared (thus exclusive) lock on whole file on > shared file system, with range from 0 to 1, since the file being locked is > empty. As documentation states, tryLock is non-blocking, it executes > immediately. If ActiveMQ fails to obtain a lock it will loop (pause and > retry acquiring lock again) until lock is obtained. In initial state first > node obtains lock and becomes master, second one fails to obtain lock and > gets into this loop, as expected. Problem is that slave ActiveMQ on cannot > obtain a lock even after first node gets fenced - it reports that the file > on shared storage is still locked. Simple custom java tool that I made > reports the same, that the file is locked. > > OpenJDK 1.6 update 20 is being used as Java runtime. I haven't yet found > in openjdk source exact code which tryLock will call on Linux. > > > Is there non-Java tool that could be used to reliably check if a file (on > gfs2) is (or can be) exclusively locked (regardless of where the process > holding lock is running, on same or different node where the tool is being > run)? > > > Regards, > Stevo. > > > > > On Tue, Jan 3, 2012 at 10:59 AM, Steven Whitehouse wrote: > >> Hi, >> >> On Fri, 2011-12-30 at 14:39 +0100, Stevo Slavi? wrote: >> > Hello RedHat Linux cluster community, >> > >> > I'm in process of configuring shared filesystem storage master/slave >> > Apache ActiveMQ setup. For it to work, it requires reliable >> > distributed locking - master is node that holds exclusive lock on a >> > file on shared filesystem storage. >> > >> How does it do this locking? 
There are several possible ways this might >> be done, and some will work better than others. >> >> > On RHEL (5.4), using CLVM with GFS2 is one of the options that should >> > work. >> Why are you using RHEL 5.4 and not something more recent? Note that if >> you are a Red Hat customer, then you should contact our support team who >> will be very happy to assist. >> >> > Third party configured the CLVM/GFS2. I'd like to make sure that >> > distributed locking works OK. >> > What are my options for verifying this? >> > >> I think we need to verify which type of locking the application uses >> before we can answer this, >> >> Steve. >> >> > Thanks in advance! >> > >> > Regards, >> > Stevo. >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: lock-file.c Type: text/x-csrc Size: 1433 bytes Desc: not available URL: From linux at alteeve.com Thu Jan 5 05:46:35 2012 From: linux at alteeve.com (Digimer) Date: Thu, 05 Jan 2012 00:46:35 -0500 Subject: [Linux-cluster] cluster 3.1.90 released Message-ID: <4F05393B.2030501@alteeve.com> Welcome to the cluster 3.1.90 release. The release has bug fixes and code clean-up. It is a test release for the coming 3.2 branch. Feedback is always appreciated. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.90.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.90 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Digimer From dkelson at gurulabs.com Thu Jan 5 17:07:42 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Thu, 5 Jan 2012 10:07:42 -0700 Subject: [Linux-cluster] Maximum number of GFS server nodes? Message-ID: Looking in older Red Hat Magazine article by Matthew O'Keefe such as: http://www.redhat.com/magazine/008jun05/features/gfs/ http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ There are references to large GFS clusters. "For example, if 128 GFS server nodes require..." and "scalability 300+ or more" Why is it on RHEL6 only a max of 16 nodes is supported? Thanks, Dax Kelson Guru Labs -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Thu Jan 5 17:20:44 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 05 Jan 2012 17:20:44 +0000 Subject: [Linux-cluster] Maximum number of GFS server nodes? In-Reply-To: References: Message-ID: <1325784044.2690.52.camel@menhir> Hi, On Thu, 2012-01-05 at 10:07 -0700, Dax Kelson wrote: > Looking in older Red Hat Magazine article by Matthew O'Keefe such as: > > http://www.redhat.com/magazine/008jun05/features/gfs/ > http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ > > There are references to large GFS clusters. > > "For example, if 128 GFS server nodes require..." and "scalability 300 > + or more" > > Why is it on RHEL6 only a max of 16 nodes is supported? 
> Those articles are rather out of date. I don't think that GFS was ever used or tested at that scale and that was probably the theoretical limit at the time. The reason for the 16 node limit is that it is what we test (and therefore what we support), which largely reflects what people have requested. There is no reason why larger numbers of nodes couldn't be made to work in theory though, Steve. > Thanks, > Dax Kelson > Guru Labs > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From adrew at redhat.com Thu Jan 5 17:26:41 2012 From: adrew at redhat.com (Adam Drew) Date: Thu, 5 Jan 2012 12:26:41 -0500 Subject: [Linux-cluster] Maximum number of GFS server nodes? In-Reply-To: References: Message-ID: Hello, Red Hat magazine articles aren't official documentation. Additionally RHM is no longer published (hasn't been for years.) The difference between what the article is talking about and what we support in RHEL is a matter of quality assurance and testing - we can only support what we can reasonably test and what we can commit to being able to dedicate to issue reproduction and resolution in the course of a support case. Linux-cluster and GFS/GFS2 will scale well past 16 nodes but Red Hat doesn't test or do engineering and development work on more than 16. The other side of the equation is that linux-cluster + GFS2 on RHEL as marketed by Red Hat is a high availability product - not a distributed computing or "big data" product. It's hard to make a case for HA at large scale. For HA purposes 16 nodes is on the generous side - I rarely see clusters greater than 4 nodes in the course of my work with cluster customers. Cluster and GFS2 could be spun into the back-bone for distributed computing or big data deployments but that's not how Red Hat tests, develops, and thus supports the combination of those products. If you are doing a research, academic, community, or personal project and don't require enterprise support you could likely do some really interesting things with GFS2/cluster at large scale - but for supported deployments with a commitment from Red Hat to test, QA, develop, and resolve issues the limit is 16. Hope this information helps you. -- Adam Drew Software Maintenance Engineer Support Engineering Group Red Hat, Inc. On Jan 5, 2012, at 12:07 PM, Dax Kelson wrote: > Looking in older Red Hat Magazine article by Matthew O'Keefe such as: > > http://www.redhat.com/magazine/008jun05/features/gfs/ > http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ > > There are references to large GFS clusters. > > "For example, if 128 GFS server nodes require..." and "scalability 300+ or more" > > Why is it on RHEL6 only a max of 16 nodes is supported? > > Thanks, > Dax Kelson > Guru Labs > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Thu Jan 5 17:48:23 2012 From: linux at alteeve.com (Digimer) Date: Thu, 05 Jan 2012 12:48:23 -0500 Subject: [Linux-cluster] Maximum number of GFS server nodes? In-Reply-To: References: Message-ID: <4F05E267.6030307@alteeve.com> On 01/05/2012 12:07 PM, Dax Kelson wrote: > Looking in older Red Hat Magazine article by Matthew O'Keefe such as: > > http://www.redhat.com/magazine/008jun05/features/gfs/ > http://www.redhat.com/magazine/008jun05/features/gfs_nfs/ > > There are references to large GFS clusters. 
> > "For example, if 128 GFS server nodes require..." and "scalability 300+ > or more" > > Why is it on RHEL6 only a max of 16 nodes is supported? > > Thanks, > Dax Kelson > Guru Labs Speaking as an independent; I often see people with latency issues when they try to grow past 16 nodes when using corosync, which is the HA cluster communication layer in RHEL 6. More specifically, DLM (the distributed lock manager) can start to suffer from a performance perspective as the size of the cluster grows. You may be able to go higher, but be prepared to do a lot of network tweaking. Also, as Steven and Adam pointed out, >16 is outside the supported size so you will have trouble getting any official support. If you want to try anyway, the freenode IRC channel #linux-cluster is a good place to ask about specific problems you run into. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From wmodes at ucsc.edu Thu Jan 5 21:54:25 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Thu, 05 Jan 2012 13:54:25 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start Message-ID: <4F061C11.5030303@ucsc.edu> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems running on vmWare. The GFS FS is on a Dell Equilogic SAN. I keep running into the same problem despite many differently-flavored attempts to set up GFS. The problem comes when I try to start cman, the cluster management software. [root at test01]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed cman not started: Can't find local node name in cluster.conf /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] [root at test01]# tail /var/log/messages Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to cluster infrastructure after 1193640 seconds. Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to cluster infrastructure after 1193670 seconds. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6' Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive Service: started and ready to provide service. Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name "test01.gdao.ucsc.edu" not found in cluster.conf Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS info, cannot start Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading config from CCS Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file). Here are details of my configuration: [root at test01]# rpm -qa | grep cman cman-2.0.115-85.el5_7.2 [root at test01]# echo $HOSTNAME test01.gdao.ucsc.edu [root at test01]# hostname test01.gdao.ucsc.edu [root at test01]# cat /etc/hosts # Do not remove the following line, or various programs # that require network functionality will fail. 
128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu 127.0.0.1 localhost.localdomain localhost ::1 localhost6.localdomain6 localhost6 [root at test01]# sestatus SELinux status: enabled SELinuxfs mount: /selinux Current mode: permissive Mode from config file: permissive Policy version: 21 Policy from config file: targeted [root at test01]# cat /etc/cluster/cluster.conf I've seen much discussion of this problem, but no definitive solutions. Any help you can provide will be welcome. Wes Modes From dkelson at gurulabs.com Fri Jan 6 00:45:07 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Thu, 5 Jan 2012 17:45:07 -0700 Subject: [Linux-cluster] Getting a SPC-3 compliant PR enabled iSCSI target up and running? Message-ID: Hi, I have a testing lab where I'm attempting to get some more experience with LIO and targetcli. Is there an IRC channel where cluster and/or target folks hang out? I have: - A 3 node RHEL6.2 cluster with clvmd and GFS2 - A Fedora 16 box (kernel 3.1.6) with LIO/targetcli-2.0rc1.fb3-2 for my shared storage Is there special configuration needs to be done on the target to enable PR, because it doesn't seem to be working. I'm not able to get fence_scsi working. I'm seeing registrations, but no reservation. >From one of the RHEL6 cluster nodes, and attempt to do a read-key fails: # sg_persist -n -i -k -d /dev/sdf PR in: aborted command Which generates this message over on my Fedora 16 scsi target: filp_open(/var/target/pr/ aptpl_086b4b49-8736-45e1-a80c-2ddeb8a5a01e) for APTPL metadata failed # cat /sys/kernel/config/target/core/iblock_0/store01/pr/res_* APTPL Bit Status: Disabled Ready to process PR APTPL metadata.. No SPC-3 Reservation holder No SPC-3 Reservation holder 0x00000008 No SPC-3 Reservation holder SPC-3 PR Registrations: iSCSI Node: iqn.1994-05.com.redhat:226e63cf8cf5,i,0x00023d010000 Key: 0x00000000aa230001 PRgen: 0x00000003 iSCSI Node: iqn.1994-05.com.redhat:93471b6582,i,0x00023d010000 Key: 0x00000000aa230002 PRgen: 0x00000007 iSCSI Node: iqn.1994-05.com.redhat:21d24bc1b670,i,0x00023d010000 Key: 0x00000000aa230003 PRgen: 0x00000005 No SPC-3 Reservation holder SPC3_PERSISTENT_RESERVATIONS On the cluster node, this is what my fence_scsi log file looks like: Jan 5 17:38:45 fence_scsi: [debug] main::do_register_ignore (node_key=aa230003, dev=/dev/sdf) Jan 5 17:38:45 fence_scsi: [debug] main::do_reset (dev=/dev/sdf, status=0) (cmd=sg_turs /dev/sdf) Jan 5 17:38:45 fence_scsi: [debug] main::do_register_ignore (err=0) (cmd=sg_persist -n -o -I -S aa230003 -d /dev/sdf) Jan 5 17:38:45 fence_scsi: [error] main::get_reservation_key (err=11) (cmd=sg_persist -n -i -r -d /dev/sdf) Running that failing sg_persist gives "PR in: aborted command" A wireshark packet capture of that command shows: === packet generated by sg_persist === iSCSI (SCSI Command) Opcode: SCSI Command (0x01) .0.. .... = I: Queued delivery Flags: 0xc1 1... .... = F: Final PDU in sequence .1.. .... = R: Data will be read from target ..0. .... = W: No data will be written to target .... .001 = Attr: Simple (0x01) TotalAHSLength: 0x00 DataSegmentLength: 0x00000000 LUN: 0000000000000000 InitiatorTaskTag: 0x20000000 ExpectedDataTransferLength: 0x00002000 CmdSN: 0x000000e2 ExpStatSN: 0x4c6ecb1b SCSI CDB Persistent Reserve In [LUN: 0x0000] [Command Set:Direct Access Device (0x00) (Using default commandset)] Opcode: Persistent Reserve In (0x5e) .... 0001 = Service Action: Read Reservation (0x01) Allocation Length: 8192 Control: 0x00 00.. .... 
= Vendor specific: 0x00 ..00 0... = Reserved: 0x00 .... .0.. = NACA: Normal ACA is not set .... ..0. = Obsolete: 0x00 .... ...0 = Obsolete: 0x00 === response from target === iSCSI (SCSI Response) Opcode: SCSI Response (0x21) Flags: 0x80 ...0 .... = o: No overflow of read part of bi-directional command .... 0... = u: No underflow of read part of bi-directional command .... .0.. = O: No residual overflow occurred .... ..0. = U: No residual underflow occurred Response: Command completed at target (0x00) Status: Check Condition (0x02) TotalAHSLength: 0x00 DataSegmentLength: 0x00000062 InitiatorTaskTag: 0x20000000 StatSN: 0x4c6ecb1b ExpCmdSN: 0x000000e3 MaxCmdSN: 0x000000f2 ExpDataSN: 0x00000000 BidiReadResidualCount: 0x00000000 ResidualCount: 0x00000000 Request in: 1 Time from request: 0.000096000 seconds SenseLength: 0x0060 SCSI: SNS Info [LUN: 0x0000] Valid: 0 .111 0000 = SNS Error Type: Current Error (0x70) Filemark: 0, EOM: 0, ILI: 0 .... 1011 = Sense Key: Command Aborted (0x0b) Sense Info: 0x00000000 Additional Sense Length: 0 Command-Specific Information: 00000000 Additional Sense Code+Qualifier: Invalid Field In Cdb (0x2400) Field Replaceable Unit Code: 0x00 0... .... = SKSV: False Sense Key Specific: 000000 (there are more packets in the capture if needed) My target's saveconfig.json looks like this: { "storage_objects": [ { "attributes": { "block_size": 512, "emulate_dpo": 0, "emulate_fua_read": 0, "emulate_fua_write": 1, "emulate_rest_reord": 0, "emulate_tas": 1, "emulate_tpu": 0, "emulate_tpws": 0, "emulate_ua_intlck_ctrl": 0, "emulate_write_cache": 0, "enforce_pr_isids": 1, "is_nonrot": 0, "max_sectors": 1024, "max_unmap_block_desc_count": 0, "max_unmap_lba_count": 0, "optimal_sectors": 1024, "queue_depth": 128, "task_timeout": 0, "unmap_granularity": 0, "unmap_granularity_alignment": 0 }, "dev": "/dev/vg_station11/iscsi- lun01", "name": "store01", "plugin": "block", "wwn": "086b4b49-8736-45e1-a80c-2ddeb8a5a01e" } ], "targets": [ { "fabric": "iscsi", "tpgs": [ { "attributes": { "authentication": 0, "cache_dynamic_acls": 0, "default_cmdsn_depth": 16, "demo_mode_write_protect": 1, "generate_node_acls": 0, "login_timeout": 15, "netif_timeout": 2, "prod_mode_write_protect": 0 }, "luns": [ { "index": 0, "storage_object": "/backstores/block/store01" } ], "node_acls": [ { "attributes": { "dataout_timeout": 3, "dataout_timeout_retries": 5, "default_erl": 0, "nopin_response_timeout": 5, "nopin_timeout": 5, "random_datain_pdu_offsets": 0, "random_datain_seq_offsets": 0, "random_r2t_offsets": 0 }, "chap_mutual_password": "", "chap_mutual_userid": "", "chap_password": "", "chap_userid": "", "mapped_luns": [ { "index": 0, "write_protect": false } ], "node_wwn": "iqn.1994-05.com.redhat:21d24bc1b670", "tcq_depth": 16 }, { "attributes": { "dataout_timeout": 3, "dataout_timeout_retries": 5, "default_erl": 0, "nopin_response_timeout": 5, "nopin_timeout": 5, "random_datain_pdu_offsets": 0, "random_datain_seq_offsets": 0, "random_r2t_offsets": 0 }, "chap_mutual_password": "", "chap_mutual_userid": "", "chap_password": "", "chap_userid": "", "mapped_luns": [ { "index": 0, "write_protect": false } ], "node_wwn": "iqn.1994-05.com.redhat:93471b6582", "tcq_depth": 16 }, { "attributes": { "dataout_timeout": 3, "dataout_timeout_retries": 5, "default_erl": 0, "nopin_response_timeout": 5, "nopin_timeout": 5, "random_datain_pdu_offsets": 0, "random_datain_seq_offsets": 0, "random_r2t_offsets": 0 }, "chap_mutual_password": "", "chap_mutual_userid": "", "chap_password": "", "chap_userid": "", "mapped_luns": [ { 
"index": 0, "write_protect": false } ], "node_wwn": "iqn.1994-05.com.redhat:226e63cf8cf5", "tcq_depth": 16 } ], "portals": [ { "ip_address": "10.100.0.12", "port": 3260 } ], "tag": 1 } ], "wwn": "iqn.2003-01.org.linux-iscsi.station11.x8664:sn.32668e1cd52d" } ] } -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Fri Jan 6 00:55:57 2012 From: linux at alteeve.com (Digimer) Date: Thu, 05 Jan 2012 19:55:57 -0500 Subject: [Linux-cluster] Getting a SPC-3 compliant PR enabled iSCSI target up and running? In-Reply-To: References: Message-ID: <4F06469D.2090704@alteeve.com> On 01/05/2012 07:45 PM, Dax Kelson wrote: > Hi, > > I have a testing lab where I'm attempting to get some more experience > with LIO and targetcli. Is there an IRC channel where cluster and/or > target folks hang out? Hi Dax, The two main HA clustering channels on freenode are #linux-cluster and #linux-ha. There is also an HPC channel at #hpc. The first is slightly more red hat focused, but certainly not exclusively RH. The second is in the same way slightly more pacemaker focused, but again, not exclusively in the least. Feel free to drop in and say hi. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From swhiteho at redhat.com Fri Jan 6 10:00:29 2012 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 06 Jan 2012 10:00:29 +0000 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F061C11.5030303@ucsc.edu> References: <4F061C11.5030303@ucsc.edu> Message-ID: <1325844029.2703.8.camel@menhir> Hi, On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems > running on vmWare. The GFS FS is on a Dell Equilogic SAN. > > I keep running into the same problem despite many differently-flavored > attempts to set up GFS. The problem comes when I try to start cman, the > cluster management software. > > [root at test01]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... failed > cman not started: Can't find local node name in cluster.conf > /usr/sbin/cman_tool: aisexec daemon didn't start > [FAILED] > This looks like what it says... whatever the node name is in cluster.conf, it doesn't exist when the name is looked up, or possibly it does exist, but is mapped to the loopback address (it needs to map to an address which is valid cluster-wide) Since your config files look correct, the next thing to check is what the resolver is actually returning. Try (for example) a ping to test01 (you need to specify exactly the same form of the name as is used in cluster.conf) from test02 and see whether it uses the correct ip address, just in case the wrong thing is being returned. Steve. > [root at test01]# tail /var/log/messages > Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193640 seconds. > Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193670 seconds. > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service RELEASE 'subrev 1887 version 0.80.6' > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2002-2006 MontaVista Software, Inc and contributors. 
> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2006 Red Hat, Inc. > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service: started and ready to provide service. > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name > "test01.gdao.ucsc.edu" not found in cluster.conf > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS > info, cannot start > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > config from CCS > Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > exiting (reason: could not read the main configuration file). > > Here are details of my configuration: > > [root at test01]# rpm -qa | grep cman > cman-2.0.115-85.el5_7.2 > > [root at test01]# echo $HOSTNAME > test01.gdao.ucsc.edu > > [root at test01]# hostname > test01.gdao.ucsc.edu > > [root at test01]# cat /etc/hosts > # Do not remove the following line, or various programs > # that require network functionality will fail. > 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > 127.0.0.1 localhost.localdomain localhost > ::1 localhost6.localdomain6 localhost6 > > [root at test01]# sestatus > SELinux status: enabled > SELinuxfs mount: /selinux > Current mode: permissive > Mode from config file: permissive > Policy version: 21 > Policy from config file: targeted > > [root at test01]# cat /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" > vmlogin="root" vmpasswd="esxpass" > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > > > > > > > I've seen much discussion of this problem, but no definitive solutions. > Any help you can provide will be welcome. > > Wes Modes > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From wmodes at ucsc.edu Fri Jan 6 19:01:28 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Fri, 06 Jan 2012 11:01:28 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <1325844029.2703.8.camel@menhir> References: <4F061C11.5030303@ucsc.edu> <1325844029.2703.8.camel@menhir> Message-ID: <4F074508.7020701@ucsc.edu> Hi, Steven. I've tried just about every possible combination of hostname and cluster.conf. ping to test01 resolves to 128.114.31.112 ping to test01.gdao.ucsc.edu resolves to 128.114.31.112 It feels like the right thing is being returned. This feels like it might be a quirk (or bug possibly) of cman or openais. There are some old bug reports around this, for example https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds like the way that cman reports this error is anything but straightforward. Is there anyone who has encountered this error and found a solution? Wes On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > Hi, > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> I keep running into the same problem despite many differently-flavored >> attempts to set up GFS. The problem comes when I try to start cman, the >> cluster management software. >> >> [root at test01]# service cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... 
failed >> cman not started: Can't find local node name in cluster.conf >> /usr/sbin/cman_tool: aisexec daemon didn't start >> [FAILED] >> > This looks like what it says... whatever the node name is in > cluster.conf, it doesn't exist when the name is looked up, or possibly > it does exist, but is mapped to the loopback address (it needs to map to > an address which is valid cluster-wide) > > Since your config files look correct, the next thing to check is what > the resolver is actually returning. Try (for example) a ping to test01 > (you need to specify exactly the same form of the name as is used in > cluster.conf) from test02 and see whether it uses the correct ip > address, just in case the wrong thing is being returned. > > Steve. > >> [root at test01]# tail /var/log/messages >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193640 seconds. >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193670 seconds. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service RELEASE 'subrev 1887 version 0.80.6' >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2002-2006 MontaVista Software, Inc and contributors. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2006 Red Hat, Inc. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service: started and ready to provide service. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name >> "test01.gdao.ucsc.edu" not found in cluster.conf >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS >> info, cannot start >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading >> config from CCS >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> exiting (reason: could not read the main configuration file). >> >> Here are details of my configuration: >> >> [root at test01]# rpm -qa | grep cman >> cman-2.0.115-85.el5_7.2 >> >> [root at test01]# echo $HOSTNAME >> test01.gdao.ucsc.edu >> >> [root at test01]# hostname >> test01.gdao.ucsc.edu >> >> [root at test01]# cat /etc/hosts >> # Do not remove the following line, or various programs >> # that require network functionality will fail. >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> 127.0.0.1 localhost.localdomain localhost >> ::1 localhost6.localdomain6 localhost6 >> >> [root at test01]# sestatus >> SELinux status: enabled >> SELinuxfs mount: /selinux >> Current mode: permissive >> Mode from config file: permissive >> Policy version: 21 >> Policy from config file: targeted >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" >> vmlogin="root" vmpasswd="esxpass" >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> >> >> I've seen much discussion of this problem, but no definitive solutions. >> Any help you can provide will be welcome. 
>> >> Wes Modes >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gustavo.tonello at gmail.com Fri Jan 6 20:05:23 2012 From: gustavo.tonello at gmail.com (Luiz Gustavo Tonello) Date: Fri, 6 Jan 2012 18:05:23 -0200 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F074508.7020701@ucsc.edu> References: <4F061C11.5030303@ucsc.edu> <1325844029.2703.8.camel@menhir> <4F074508.7020701@ucsc.edu> Message-ID: Hi, This servers is on VMware? At the same host? SElinux is disable? iptables have something? In my environment I had a problem to start GFS2 with servers in differents hosts. To clustering servers, was need migrate one server to the same host of the other, and restart this. I think, one of the problem was because the virtual switchs. To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf And add a static route in both, to use default gateway. I don't know if it's correct, but this solve my problem. I hope that help you. Regards. On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes wrote: > Hi, Steven. > > I've tried just about every possible combination of hostname and > cluster.conf. > > ping to test01 resolves to 128.114.31.112 > ping to test01.gdao.ucsc.edu resolves to 128.114.31.112 > > It feels like the right thing is being returned. This feels like it > might be a quirk (or bug possibly) of cman or openais. > > There are some old bug reports around this, for example > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds like the > way that cman reports this error is anything but straightforward. > > Is there anyone who has encountered this error and found a solution? > > Wes > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > >> > >> I keep running into the same problem despite many differently-flavored > >> attempts to set up GFS. The problem comes when I try to start cman, the > >> cluster management software. > >> > >> [root at test01]# service cman start > >> Starting cluster: > >> Loading modules... done > >> Mounting configfs... done > >> Starting ccsd... done > >> Starting cman... failed > >> cman not started: Can't find local node name in cluster.conf > >> /usr/sbin/cman_tool: aisexec daemon didn't start > >> [FAILED] > >> > > This looks like what it says... whatever the node name is in > > cluster.conf, it doesn't exist when the name is looked up, or possibly > > it does exist, but is mapped to the loopback address (it needs to map to > > an address which is valid cluster-wide) > > > > Since your config files look correct, the next thing to check is what > > the resolver is actually returning. Try (for example) a ping to test01 > > (you need to specify exactly the same form of the name as is used in > > cluster.conf) from test02 and see whether it uses the correct ip > > address, just in case the wrong thing is being returned. > > > > Steve. > > > >> [root at test01]# tail /var/log/messages > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193640 seconds. > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193670 seconds. 
> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service RELEASE 'subrev 1887 version 0.80.6' > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2002-2006 MontaVista Software, Inc and contributors. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2006 Red Hat, Inc. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service: started and ready to provide service. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name > >> "test01.gdao.ucsc.edu" not found in cluster.conf > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS > >> info, cannot start > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > >> config from CCS > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> exiting (reason: could not read the main configuration file). > >> > >> Here are details of my configuration: > >> > >> [root at test01]# rpm -qa | grep cman > >> cman-2.0.115-85.el5_7.2 > >> > >> [root at test01]# echo $HOSTNAME > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# hostname > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# cat /etc/hosts > >> # Do not remove the following line, or various programs > >> # that require network functionality will fail. > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > >> 127.0.0.1 localhost.localdomain localhost > >> ::1 localhost6.localdomain6 localhost6 > >> > >> [root at test01]# sestatus > >> SELinux status: enabled > >> SELinuxfs mount: /selinux > >> Current mode: permissive > >> Mode from config file: permissive > >> Policy version: 21 > >> Policy from config file: targeted > >> > >> [root at test01]# cat /etc/cluster/cluster.conf > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" > >> vmlogin="root" vmpasswd="esxpass" > >> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > >> > >> > >> > >> > >> > >> > >> I've seen much discussion of this problem, but no definitive solutions. > >> Any help you can provide will be welcome. > >> > >> Wes Modes > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Luiz Gustavo P Tonello. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wmodes at ucsc.edu Fri Jan 6 20:38:43 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Fri, 06 Jan 2012 12:38:43 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: References: <4F061C11.5030303@ucsc.edu> <1325844029.2703.8.camel@menhir> <4F074508.7020701@ucsc.edu> Message-ID: <4F075BD3.3090702@ucsc.edu> These servers are currently on the same host, but may not be in the future. They are in a vm cluster (though honestly, I'm not sure what this means yet). SElinux is on, but disabled. Firewalling through iptables is turned off via system-config-securitylevel There is no line currently in the cluster.conf that deals with multicasting. Any other suggestions? 
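A quick way to see what cman is actually being handed (a sketch of the checks suggested in this thread, not a guaranteed fix) is to run the exact name string used in cluster.conf through the same lookup path on every node:

    NODE=test01.gdao.ucsc.edu      # spelled exactly as in cluster.conf

    uname -n                       # should match, or at least resolve to the same host
    getent hosts "$NODE"           # follows /etc/nsswitch.conf (files, then dns)
    ping -c1 "$NODE"               # should answer from 128.114.31.112, never 127.0.0.1
    ip addr list                   # the address returned above must be bound to a local NIC

If getent returns nothing, or returns a loopback address, openais can report the same "local node name ... not found in cluster.conf" error shown in the logs even when /etc/hosts looks correct, so a stray "127.0.0.1 test01" style entry anywhere in the lookup chain is worth hunting down.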
Wes On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: > Hi, > > This servers is on VMware? At the same host? > SElinux is disable? iptables have something? > > In my environment I had a problem to start GFS2 with servers in > differents hosts. > To clustering servers, was need migrate one server to the same host of > the other, and restart this. > > I think, one of the problem was because the virtual switchs. > To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf > > And add a static route in both, to use default gateway. > > I don't know if it's correct, but this solve my problem. > > I hope that help you. > > Regards. > > On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > wrote: > > Hi, Steven. > > I've tried just about every possible combination of hostname and > cluster.conf. > > ping to test01 resolves to 128.114.31.112 > ping to test01.gdao.ucsc.edu > resolves to 128.114.31.112 > > It feels like the right thing is being returned. This feels like it > might be a quirk (or bug possibly) of cman or openais. > > There are some old bug reports around this, for example > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds > like the > way that cman reports this error is anything but straightforward. > > Is there anyone who has encountered this error and found a solution? > > Wes > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS > systems > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > >> > >> I keep running into the same problem despite many > differently-flavored > >> attempts to set up GFS. The problem comes when I try to start > cman, the > >> cluster management software. > >> > >> [root at test01]# service cman start > >> Starting cluster: > >> Loading modules... done > >> Mounting configfs... done > >> Starting ccsd... done > >> Starting cman... failed > >> cman not started: Can't find local node name in cluster.conf > >> /usr/sbin/cman_tool: aisexec daemon didn't start > >> > [FAILED] > >> > > This looks like what it says... whatever the node name is in > > cluster.conf, it doesn't exist when the name is looked up, or > possibly > > it does exist, but is mapped to the loopback address (it needs > to map to > > an address which is valid cluster-wide) > > > > Since your config files look correct, the next thing to check is > what > > the resolver is actually returning. Try (for example) a ping to > test01 > > (you need to specify exactly the same form of the name as is used in > > cluster.conf) from test02 and see whether it uses the correct ip > > address, just in case the wrong thing is being returned. > > > > Steve. > > > >> [root at test01]# tail /var/log/messages > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193640 seconds. > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193670 seconds. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > >> Service RELEASE 'subrev 1887 version 0.80.6' > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] > Copyright (C) > >> 2002-2006 MontaVista Software, Inc and contributors. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] > Copyright (C) > >> 2006 Red Hat, Inc. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > >> Service: started and ready to provide service. 
> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local > node name > >> "test01.gdao.ucsc.edu " not found > in cluster.conf > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading CCS > >> info, cannot start > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading > >> config from CCS > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > >> exiting (reason: could not read the main configuration file). > >> > >> Here are details of my configuration: > >> > >> [root at test01]# rpm -qa | grep cman > >> cman-2.0.115-85.el5_7.2 > >> > >> [root at test01]# echo $HOSTNAME > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# hostname > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# cat /etc/hosts > >> # Do not remove the following line, or various programs > >> # that require network functionality will fail. > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > > >> 127.0.0.1 localhost.localdomain localhost > >> ::1 localhost6.localdomain6 localhost6 > >> > >> [root at test01]# sestatus > >> SELinux status: enabled > >> SELinuxfs mount: /selinux > >> Current mode: permissive > >> Mode from config file: permissive > >> Policy version: 21 > >> Policy from config file: targeted > >> > >> [root at test01]# cat /etc/cluster/cluster.conf > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> ipaddr="gdvcenter.ucsc.edu " > login="root" passwd="1hateAmazon.com" > >> vmlogin="root" vmpasswd="esxpass" > >> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > >> > >> > >> > >> > >> > >> > >> I've seen much discussion of this problem, but no definitive > solutions. > >> Any help you can provide will be welcome. > >> > >> Wes Modes > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Luiz Gustavo P Tonello. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbruna at it-linux.cl Sat Jan 7 00:30:19 2012 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Fri, 06 Jan 2012 21:30:19 -0300 (CLST) Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F075BD3.3090702@ucsc.edu> Message-ID: <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> Hi, I think CMAN expect that the names of the cluster nodes be the same returned by the command "uname -n". For what you write your nodes hostnames are: test01.gdao.ucsc.edu and test02.gdao.ucsc.edu, but in cluster.conf you have declared only "test01" and "test02". ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > These servers are currently on the same host, but may not be in the > future. They are in a vm cluster (though honestly, I'm not sure what > this means yet). > SElinux is on, but disabled. 
> > > > > > >> > > > > > > >> Wes Modes > > > > > > >> > > > > > > >> -- > > > > > > >> Linux-cluster mailing list > > > > > > >> Linux-cluster at redhat.com > > > > > > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > -- > > > > > > > Linux-cluster mailing list > > > > > > > Linux-cluster at redhat.com > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > > > > > Linux-cluster mailing list > > > > > > Linux-cluster at redhat.com > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > > > Luiz Gustavo P Tonello. > > > -- > > > Linux-cluster mailing list Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From kevin.stanton at eprize.com Sat Jan 7 02:06:03 2012 From: kevin.stanton at eprize.com (Kevin Stanton) Date: Sat, 7 Jan 2012 02:06:03 +0000 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> Message-ID: > Hi, > I think CMAN expect that the names of the cluster nodes be the same returned by the command "uname -n". > For what you write your nodes hostnames are: test01.gdao.ucsc.edu and test02.gdao.ucsc.edu, but in cluster.conf you have declared only "test01" and "test02". I haven't found this to be the case in the past. I actually use a separate short name to reference each node which is different than the hostname of the server itself. All I've ever had to do is make sure it resolves correctly. You can do this either in DNS and/or in /etc/hosts. I have found that it's a good idea to do both in case your DNS server is a virtual machine and is not running for some reason. In that case with /etc/hosts you can still start cman. I would make sure whatever node names you use in the cluster.conf will resolve when you try to ping it from all nodes in the cluster. Also make sure your cluster.conf is in sync between all nodes. -Kevin ________________________________ These servers are currently on the same host, but may not be in the future. They are in a vm cluster (though honestly, I'm not sure what this means yet). SElinux is on, but disabled. Firewalling through iptables is turned off via system-config-securitylevel There is no line currently in the cluster.conf that deals with multicasting. Any other suggestions? Wes On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: Hi, This servers is on VMware? At the same host? SElinux is disable? iptables have something? In my environment I had a problem to start GFS2 with servers in differents hosts. To clustering servers, was need migrate one server to the same host of the other, and restart this. I think, one of the problem was because the virtual switchs. To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf And add a static route in both, to use default gateway. I don't know if it's correct, but this solve my problem. I hope that help you. Regards. On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > wrote: Hi, Steven. 
I've tried just about every possible combination of hostname and cluster.conf. ping to test01 resolves to 128.114.31.112 ping to test01.gdao.ucsc.edu resolves to 128.114.31.112 It feels like the right thing is being returned. This feels like it might be a quirk (or bug possibly) of cman or openais. There are some old bug reports around this, for example https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds like the way that cman reports this error is anything but straightforward. Is there anyone who has encountered this error and found a solution? Wes On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > Hi, > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> I keep running into the same problem despite many differently-flavored >> attempts to set up GFS. The problem comes when I try to start cman, the >> cluster management software. >> >> [root at test01]# service cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... failed >> cman not started: Can't find local node name in cluster.conf >> /usr/sbin/cman_tool: aisexec daemon didn't start >> [FAILED] >> > This looks like what it says... whatever the node name is in > cluster.conf, it doesn't exist when the name is looked up, or possibly > it does exist, but is mapped to the loopback address (it needs to map to > an address which is valid cluster-wide) > > Since your config files look correct, the next thing to check is what > the resolver is actually returning. Try (for example) a ping to test01 > (you need to specify exactly the same form of the name as is used in > cluster.conf) from test02 and see whether it uses the correct ip > address, just in case the wrong thing is being returned. > > Steve. > >> [root at test01]# tail /var/log/messages >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193640 seconds. >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193670 seconds. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service RELEASE 'subrev 1887 version 0.80.6' >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2002-2006 MontaVista Software, Inc and contributors. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2006 Red Hat, Inc. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service: started and ready to provide service. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name >> "test01.gdao.ucsc.edu" not found in cluster.conf >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS >> info, cannot start >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading >> config from CCS >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> exiting (reason: could not read the main configuration file). >> >> Here are details of my configuration: >> >> [root at test01]# rpm -qa | grep cman >> cman-2.0.115-85.el5_7.2 >> >> [root at test01]# echo $HOSTNAME >> test01.gdao.ucsc.edu >> >> [root at test01]# hostname >> test01.gdao.ucsc.edu >> >> [root at test01]# cat /etc/hosts >> # Do not remove the following line, or various programs >> # that require network functionality will fail. 
>> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> 127.0.0.1 localhost.localdomain localhost >> ::1 localhost6.localdomain6 localhost6 >> >> [root at test01]# sestatus >> SELinux status: enabled >> SELinuxfs mount: /selinux >> Current mode: permissive >> Mode from config file: permissive >> Policy version: 21 >> Policy from config file: targeted >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > ipaddr="gdvcenter.ucsc.edu" login="root" passwd="1hateAmazon.com" >> vmlogin="root" vmpasswd="esxpass" >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> >> >> I've seen much discussion of this problem, but no definitive solutions. >> Any help you can provide will be welcome. >> >> Wes Modes >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Luiz Gustavo P Tonello. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From christiankwall-qsa at yahoo.com Sat Jan 7 08:22:12 2012 From: christiankwall-qsa at yahoo.com (Chris Kwall) Date: Sat, 7 Jan 2012 08:22:12 +0000 (GMT) Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F061C11.5030303@ucsc.edu> References: <4F061C11.5030303@ucsc.edu> Message-ID: <1325924532.90505.YahooMailNeo@web29703.mail.ird.yahoo.com> Hi Wes Please excuse my poor english - it's not my mother?language I'm writing in. ----- Urspr?ngliche Message ----- > Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems > running on vmWare. The GFS FS is on a Dell Equilogic SAN. > > I keep running into the same problem despite many differently-flavored > attempts to set up GFS. The problem comes when I try to start cman, the > cluster management software. > > ? ? [root at test01]# service cman start > ? ? Starting cluster: > ? ? ? Loading modules... done > ? ? ? Mounting configfs... done > ? ? ? Starting ccsd... done > ? ? ? Starting cman... failed > ? ? cman not started: Can't find local node name in cluster.conf > /usr/sbin/cman_tool: aisexec daemon didn't start > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? [FAILED] I don't think that the cluster is your main-problem. The nodename must "not" present in DNS, but it must be?resolvable by files, ldap whatever. Please verify?that "files" is present at /etc/nsswitch.conf. e.g:?hosts: ? ? ?files dns Did you've check with "ip addr list" that the ip-address matches the same as in /etc/hosts? >? > ? ? [root at test01]# tail /var/log/messages > ? ? Jan? 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193640 seconds. > ? ? Jan? 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > cluster infrastructure after 1193670 seconds. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service RELEASE 'subrev 1887 version 0.80.6' > ? ? Jan? 
5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2002-2006 MontaVista Software, Inc and contributors. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > 2006 Red Hat, Inc. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > Service: started and ready to provide service. > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name > "test01.gdao.ucsc.edu" not found in cluster.conf > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS > info, cannot start > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > config from CCS > ? ? Jan? 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > exiting (reason: could not read the main configuration file). > > Here are details of my configuration: > > ? ? [root at test01]# rpm -qa | grep cman > ? ? cman-2.0.115-85.el5_7.2 > > ? ? [root at test01]# echo $HOSTNAME > ? ? test01.gdao.ucsc.edu > > ? ? [root at test01]# hostname > ? ? test01.gdao.ucsc.edu > > ? ? [root at test01]# cat /etc/hosts > ? ? # Do not remove the following line, or various programs > ? ? # that require network functionality will fail. > ? ? 128.114.31.112? ? ? test01 test01.gdao test01.gdao.ucsc.edu > ? ? 128.114.31.113? ? ? test02 test02.gdao test02.gdao.ucsc.edu > ? ? 127.0.0.1? ? ? ? ? ? ? localhost.localdomain localhost > ? ? ::1? ? ? ? ? ? localhost6.localdomain6 localhost6 > > ? ? [root at test01]# sestatus > ? ? SELinux status:? ? ? ? ? ? ? ? enabled > ? ? SELinuxfs mount:? ? ? ? ? ? ? ? /selinux > ? ? Current mode:? ? ? ? ? ? ? ? ? permissive > ? ? Mode from config file:? ? ? ? ? permissive > ? ? Policy version:? ? ? ? ? ? ? ? 21 > ? ? Policy from config file:? ? ? ? targeted > > ? ? [root at test01]# cat /etc/cluster/cluster.conf > ? ? > ? ? > ? ? ? ? post_join_delay="120"/> > ? ? ? ? > ? ? ? ? ? ? votes="1"> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? > ? ? ? ? ? ? votes="1"> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? ? ? name="gfs1_ipmi"/> > ? ? ? ? ? ? name="gfs_vmware" > ipaddr="gdvcenter.ucsc.edu" login="root" > passwd="1hateAmazon.com" > vmlogin="root" vmpasswd="esxpass" > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? - Chris From pbruna at it-linux.cl Sat Jan 7 14:11:38 2012 From: pbruna at it-linux.cl (Patricio Bruna) Date: Sat, 7 Jan 2012 11:11:38 -0300 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <1325924532.90505.YahooMailNeo@web29703.mail.ird.yahoo.com> References: <4F061C11.5030303@ucsc.edu> <1325924532.90505.YahooMailNeo@web29703.mail.ird.yahoo.com> Message-ID: Hi, The error log is clear --------- Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name "test01.gdao.ucsc.edu" not found in cluster.conf --------- And is related to the email i sent before. Give a try, you don't lose anything. El 07-01-2012, a las 5:22, Chris Kwall escribi?: > Hi Wes > > Please excuse my poor english - it's not my mother language I'm writing in. > > ----- Urspr?ngliche Message ----- > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS systems >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> I keep running into the same problem despite many differently-flavored >> attempts to set up GFS. 
The problem comes when I try to start cman, the >> cluster management software. >> >> [root at test01]# service cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... failed >> cman not started: Can't find local node name in cluster.conf >> /usr/sbin/cman_tool: aisexec daemon didn't start >> [FAILED] > > > I don't think that the cluster is your main-problem. > The nodename must "not" present in DNS, but it must be resolvable by files, ldap whatever. > > Please verify that "files" is present at /etc/nsswitch.conf. > > e.g: hosts: files dns > > Did you've check with "ip addr list" that the ip-address matches the same as in /etc/hosts? > >> > >> [root at test01]# tail /var/log/messages >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193640 seconds. >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> cluster infrastructure after 1193670 seconds. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service RELEASE 'subrev 1887 version 0.80.6' >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2002-2006 MontaVista Software, Inc and contributors. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) >> 2006 Red Hat, Inc. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> Service: started and ready to provide service. >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local node name >> "test01.gdao.ucsc.edu" not found in cluster.conf >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading CCS >> info, cannot start >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading >> config from CCS >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive >> exiting (reason: could not read the main configuration file). >> >> Here are details of my configuration: >> >> [root at test01]# rpm -qa | grep cman >> cman-2.0.115-85.el5_7.2 >> >> [root at test01]# echo $HOSTNAME >> test01.gdao.ucsc.edu >> >> [root at test01]# hostname >> test01.gdao.ucsc.edu >> >> [root at test01]# cat /etc/hosts >> # Do not remove the following line, or various programs >> # that require network functionality will fail. >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> 127.0.0.1 localhost.localdomain localhost >> ::1 localhost6.localdomain6 localhost6 >> >> [root at test01]# sestatus >> SELinux status: enabled >> SELinuxfs mount: /selinux >> Current mode: permissive >> Mode from config file: permissive >> Policy version: 21 >> Policy from config file: targeted >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> > post_join_delay="120"/> >> >> > votes="1"> >> >> >> >> >> >> >> > votes="1"> >> >> >> >> >> >> >> >> >> >> > name="gfs1_ipmi"/> >> > name="gfs_vmware" >> ipaddr="gdvcenter.ucsc.edu" login="root" >> passwd="1hateAmazon.com" >> vmlogin="root" vmpasswd="esxpass" >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> > > > - Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster ------------------------------------ Patricio Bruna V. IT Linux Ltda. Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 Twitter: http://twitter.com/ITLinux -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From td3201 at gmail.com Sun Jan 8 23:39:37 2012 From: td3201 at gmail.com (Terry) Date: Sun, 8 Jan 2012 17:39:37 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration Message-ID: Hello, I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I have already taken one of the three nodes out and rebuilt it. My thinking is to build a new cluster from the RHEL node but want to run it by everyone here first. The cluster consists of a handful of NFS volumes and a PostgreSQL database. I am not concerned about the database. I am moving to a new version and will simply migrate that. I am more concerned about all of the ext4 clustered LVM volumes. In this process, if I shut down the old cluster, what's the process to force the new node to read those volumes in to the new single-node cluster? A pvscan on the new server shows all of the volumes fine. I am concerned there's something else I'll have to do here to begin mounting these volumes in the new cluster. [root at server ~]# pvdisplay Skipping clustered volume group vg_data01b Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.allen at visi.com Mon Jan 9 01:36:10 2012 From: michael.allen at visi.com (Michael Allen) Date: Sun, 8 Jan 2012 19:36:10 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: Message-ID: <20120108193610.2404b425@godelsrevenge.induswx.com> On Sun, 8 Jan 2012 17:39:37 -0600 Terry wrote: > Hello, > > I am trying to gently migrate a 3-node cluster from centos5 to > RHEL6. I have already taken one of the three nodes out and rebuilt > it. My thinking is to build a new cluster from the RHEL node but want > to run it by everyone here first. The cluster consists of a handful > of NFS volumes and a PostgreSQL database. I am not concerned about > the database. I am moving to a new version and will simply migrate > that. I am more concerned about all of the ext4 clustered LVM > volumes. In this process, if I shut down the old cluster, what's the > process to force the new node to read those volumes in to the new > single-node cluster? A pvscan on the new server shows all of the > volumes fine. I am concerned there's something else I'll have to do > here to begin mounting these volumes in the new cluster. [root at server > ~]# pvdisplay Skipping clustered volume group vg_data01b > > Thanks! This message comes at a good time for me, too, since I am considering the same thing. I have 10 nodes but it appears that a change to CentOS 6.xx is about due. Michael Allen From linux at alteeve.com Mon Jan 9 02:38:34 2012 From: linux at alteeve.com (Digimer) Date: Sun, 08 Jan 2012 21:38:34 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: Message-ID: <4F0A532A.2000202@alteeve.com> On 01/08/2012 06:39 PM, Terry wrote: > Hello, > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > have already taken one of the three nodes out and rebuilt it. My > thinking is to build a new cluster from the RHEL node but want to run it > by everyone here first. The cluster consists of a handful of NFS volumes > and a PostgreSQL database. I am not concerned about the database. I am > moving to a new version and will simply migrate that. I am more > concerned about all of the ext4 clustered LVM volumes. In this process, > if I shut down the old cluster, what's the process to force the new node > to read those volumes in to the new single-node cluster? A pvscan on > the new server shows all of the volumes fine. 
I am concerned there's > something else I'll have to do here to begin mounting these volumes in > the new cluster. > [root at server ~]# pvdisplay > Skipping clustered volume group vg_data01b > > Thanks! Technically yes, practically no. Or rather, not without a lot of testing first. I've never done this, but here are some pointers; upgrading Set this if you are performing a rolling upgrade of the cluster between major releases. disallowed Set this to 1 enable cman's Disallowed mode. This is usually only needed for backwards compatibility. Enable compatibility with cluster2 nodes. groupd(8) There may be some other things you need to do as well. Please be sure to do proper testing and, if you have the budget, hire Red Hat to advise on this process. Also, please report back your results. It would help me help others in the same boat later. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From td3201 at gmail.com Mon Jan 9 03:31:38 2012 From: td3201 at gmail.com (Terry) Date: Sun, 8 Jan 2012 21:31:38 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0A532A.2000202@alteeve.com> References: <4F0A532A.2000202@alteeve.com> Message-ID: If it's not practical, am I left with building a new cluster from scratch? On Sun, Jan 8, 2012 at 8:38 PM, Digimer wrote: > On 01/08/2012 06:39 PM, Terry wrote: > > Hello, > > > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > > have already taken one of the three nodes out and rebuilt it. My > > thinking is to build a new cluster from the RHEL node but want to run it > > by everyone here first. The cluster consists of a handful of NFS volumes > > and a PostgreSQL database. I am not concerned about the database. I am > > moving to a new version and will simply migrate that. I am more > > concerned about all of the ext4 clustered LVM volumes. In this process, > > if I shut down the old cluster, what's the process to force the new node > > to read those volumes in to the new single-node cluster? A pvscan on > > the new server shows all of the volumes fine. I am concerned there's > > something else I'll have to do here to begin mounting these volumes in > > the new cluster. > > [root at server ~]# pvdisplay > > Skipping clustered volume group vg_data01b > > > > Thanks! > > Technically yes, practically no. Or rather, not without a lot of > testing first. > > I've never done this, but here are some pointers; > > > > upgrading > Set this if you are performing a rolling upgrade of the cluster > between major releases. > > disallowed > Set this to 1 enable cman's Disallowed mode. This is usually > only needed for backwards compatibility. > > > > Enable compatibility with cluster2 nodes. groupd(8) > > There may be some other things you need to do as well. Please be sure > to do proper testing and, if you have the budget, hire Red Hat to advise > on this process. Also, please report back your results. It would help me > help others in the same boat later. :) > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > -------------- next part -------------- An HTML attachment was scrubbed... 
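Two sketches for this migration sub-thread, with the same caveat Digimer gives above (untested, so verify against the RHEL 6 documentation before relying on either). The "upgrading", "disallowed" and groupd compatibility pointers are cman/cluster.conf settings and would be written roughly like this; the attribute values shown here are assumptions:

    <cman upgrading="yes" disallowed="1"/>    <!-- rolling-upgrade / Disallowed mode flags quoted above -->
    <group groupd_compat="1"/>                <!-- cluster2-compatible groupd behaviour -->

Fajar's reply a little further down suggests clearing the clustered flag so a single, non-clustered node can activate the volume group. As a sketch only (the logical volume name lv_example is made up, and the locking_type override is the usual workaround while clvmd is not running):

    vgchange -cn vg_data01b --config 'global {locking_type = 0}'   # clear the clustered flag
    vgchange -ay vg_data01b                                        # activate its LVs locally
    mount /dev/vg_data01b/lv_example /mnt/data                     # ext4 mounts as usual; only a GFS/GFS2
                                                                   # volume would need -o lock_nolock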
URL: From list at fajar.net Mon Jan 9 04:01:06 2012 From: list at fajar.net (Fajar A. Nugraha) Date: Mon, 9 Jan 2012 11:01:06 +0700 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> Message-ID: On Mon, Jan 9, 2012 at 10:31 AM, Terry wrote: > If it's not practical, am I left with building a new cluster from scratch? I'm pretty sure if your ONLY problem is "Skipping clustered volume group vg_data01b", you can just turn off cluster flag with "vgchange -cn", then use "-o lock_nolock" to mount it on a SINGLE (i.e. not cluster) node. That was your original question, wasn't it? As for upgrading, I haven't tested it. You should be able to use your old storage, but just create other settings from scratch. Like Digimer said, be sure to do proper testing :) -- Fajar From wmodes at ucsc.edu Mon Jan 9 04:03:18 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Sun, 08 Jan 2012 20:03:18 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> Message-ID: <4F0A6706.6090308@ucsc.edu> The behavior of cman's resolving of cluster node names is less than clear, as per the RHEL bugzilla report. The hostname and cluster.conf match, as does /etc/hosts and uname -n. The short names and FQDN ping. I believe all the node cluster.conf are in sync, and all nodes are accessible to each other using either short or long names. You'll have to trust that I've tried everything obvious, and every possible combination of FQDN and short names in cluster.conf and hostname. That said, it is totally possible I missed something obvious. I suspect, there is something else going on and I don't know how to get at it. Wes On 1/6/2012 6:06 PM, Kevin Stanton wrote: > > > Hi, > > > I think CMAN expect that the names of the cluster nodes be the same > returned by the command "uname -n". > > > For what you write your nodes hostnames are: test01.gdao.ucsc.edu > and test02.gdao.ucsc.edu, but in cluster.conf you have declared only > "test01" and "test02". > > > > I haven't found this to be the case in the past. I actually use a > separate short name to reference each node which is different than the > hostname of the server itself. All I've ever had to do is make sure > it resolves correctly. You can do this either in DNS and/or in > /etc/hosts. I have found that it's a good idea to do both in case > your DNS server is a virtual machine and is not running for some > reason. In that case with /etc/hosts you can still start cman. > > > > I would make sure whatever node names you use in the cluster.conf will > resolve when you try to ping it from all nodes in the cluster. Also > make sure your cluster.conf is in sync between all nodes. > > > > -Kevin > > > > > > ------------------------------------------------------------------------ > > These servers are currently on the same host, but may not be in > the future. They are in a vm cluster (though honestly, I'm not > sure what this means yet). > > SElinux is on, but disabled. > Firewalling through iptables is turned off via > system-config-securitylevel > > There is no line currently in the cluster.conf that deals with > multicasting. > > Any other suggestions? > > Wes > > On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: > > Hi, > > > > This servers is on VMware? At the same host? > > SElinux is disable? iptables have something? > > > > In my environment I had a problem to start GFS2 with servers in > differents hosts. 
> > To clustering servers, was need migrate one server to the same > host of the other, and restart this. > > > > I think, one of the problem was because the virtual switchs. > > To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf > > > > And add a static route in both, to use default gateway. > > > > I don't know if it's correct, but this solve my problem. > > > > I hope that help you. > > > > Regards. > > > > On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > wrote: > > Hi, Steven. > > I've tried just about every possible combination of hostname and > cluster.conf. > > ping to test01 resolves to 128.114.31.112 > ping to test01.gdao.ucsc.edu > resolves to 128.114.31.112 > > It feels like the right thing is being returned. This feels like it > might be a quirk (or bug possibly) of cman or openais. > > There are some old bug reports around this, for example > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds > like the > way that cman reports this error is anything but straightforward. > > Is there anyone who has encountered this error and found a solution? > > Wes > > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS > systems > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > >> > >> I keep running into the same problem despite many > differently-flavored > >> attempts to set up GFS. The problem comes when I try to start > cman, the > >> cluster management software. > >> > >> [root at test01]# service cman start > >> Starting cluster: > >> Loading modules... done > >> Mounting configfs... done > >> Starting ccsd... done > >> Starting cman... failed > >> cman not started: Can't find local node name in cluster.conf > >> /usr/sbin/cman_tool: aisexec daemon didn't start > >> > [FAILED] > >> > > This looks like what it says... whatever the node name is in > > cluster.conf, it doesn't exist when the name is looked up, or > possibly > > it does exist, but is mapped to the loopback address (it needs to > map to > > an address which is valid cluster-wide) > > > > Since your config files look correct, the next thing to check is what > > the resolver is actually returning. Try (for example) a ping to > test01 > > (you need to specify exactly the same form of the name as is used in > > cluster.conf) from test02 and see whether it uses the correct ip > > address, just in case the wrong thing is being returned. > > > > Steve. > > > >> [root at test01]# tail /var/log/messages > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193640 seconds. > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > >> cluster infrastructure after 1193670 seconds. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service RELEASE 'subrev 1887 version 0.80.6' > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2002-2006 MontaVista Software, Inc and contributors. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright (C) > >> 2006 Red Hat, Inc. > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> Service: started and ready to provide service. 
> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local > node name > >> "test01.gdao.ucsc.edu " not found > in cluster.conf > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading CCS > >> info, cannot start > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error reading > >> config from CCS > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS Executive > >> exiting (reason: could not read the main configuration file). > >> > >> Here are details of my configuration: > >> > >> [root at test01]# rpm -qa | grep cman > >> cman-2.0.115-85.el5_7.2 > >> > >> [root at test01]# echo $HOSTNAME > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# hostname > >> test01.gdao.ucsc.edu > >> > >> [root at test01]# cat /etc/hosts > >> # Do not remove the following line, or various programs > >> # that require network functionality will fail. > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > > >> 127.0.0.1 localhost.localdomain localhost > >> ::1 localhost6.localdomain6 localhost6 > >> > >> [root at test01]# sestatus > >> SELinux status: enabled > >> SELinuxfs mount: /selinux > >> Current mode: permissive > >> Mode from config file: permissive > >> Policy version: 21 > >> Policy from config file: targeted > >> > >> [root at test01]# cat /etc/cluster/cluster.conf > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> ipaddr="gdvcenter.ucsc.edu " > login="root" passwd="1hateAmazon.com" > >> vmlogin="root" vmpasswd="esxpass" > >> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > >> > >> > >> > >> > >> > >> > >> I've seen much discussion of this problem, but no definitive > solutions. > >> Any help you can provide will be welcome. > >> > >> Wes Modes > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > Luiz Gustavo P Tonello. > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 04:37:45 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 10:07:45 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <003101ccce88$71ddd7e0$559987a0$@precisionit.co.in> Hi, We had configured RHEL 6.2 - 2 node Cluster with clvmd + gfs2 + cman + smb. We have 4 nic cards in the servers where 2 been configured in bonding for heartbeat (with mode=1) and 2 been configured in bonding for public access (with mode=0). Heartbeat network is connected directly from server to server. Once in 3 - 4 days, the heartbeat goes down and comes up automatically in 2 to 3 seconds. Not sure why this down and up occurs. 
Because of this in cluster, one system is got fenced by other. Is there anyway where we can increase the time to wait for the cluster to wait for heartbeat. Ie if the cluster can wait for 5-6 seconds even the heartbeat fails for 5-6 seconds the node won't get fenced. Kindly advise. Thanks Sathya Narayanan V Solution Architect This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Mon Jan 9 04:51:54 2012 From: linux at alteeve.com (Digimer) Date: Sun, 08 Jan 2012 23:51:54 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> Message-ID: <4F0A726A.6050304@alteeve.com> On 01/08/2012 10:31 PM, Terry wrote: > If it's not practical, am I left with building a new cluster from scratch? I don't know enough to say either way. I'd strongly suggest talking to Red hat, as you have a subscription, and ask them for advice. It might cost a bit, but I am certain it will save you trouble and money in the long wrong. Alternatively, use some spare machines to mock-up the current cluster and then test-upgrade. It might work flawlessly, I genuinely don't know. I do know that a good attempt was made at on-wire compatibility, I just don't know if it's actually been used in production, so I was erring on the side of caution. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From linux at alteeve.com Mon Jan 9 04:56:38 2012 From: linux at alteeve.com (Digimer) Date: Sun, 08 Jan 2012 23:56:38 -0500 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <003101ccce88$71ddd7e0$559987a0$@precisionit.co.in> References: <003101ccce88$71ddd7e0$559987a0$@precisionit.co.in> Message-ID: <4F0A7386.5000303@alteeve.com> On 01/08/2012 11:37 PM, SATHYA - IT wrote: > Hi, > > We had configured RHEL 6.2 - 2 node Cluster with clvmd + gfs2 + cman + > smb. We have 4 nic cards in the servers where 2 been configured in > bonding for heartbeat (with mode=1) and 2 been configured in bonding for > public access (with mode=0). Heartbeat network is connected directly > from server to server. Once in 3 ? 4 days, the heartbeat goes down and > comes up automatically in 2 to 3 seconds. Not sure why this down and up > occurs. Because of this in cluster, one system is got fenced by other. > > Is there anyway where we can increase the time to wait for the cluster > to wait for heartbeat. 
Ie if the cluster can wait for 5-6 seconds even > the heartbeat fails for 5-6 seconds the node won?t get fenced. Kindly > advise. "mode=1" is Active/Passive and I use it extensively with no trouble. I'm not sure where "heartbeat" comes from, but I might be missing the obvious. Can you share your bond and eth configuration files here please (as plain-text attachments)? Secondly, make sure that you are actually using that interface/bond. Run 'gethostip -d ', where "nodename" is what you set in cluster.conf. The returned IP will be the one used by the cluster. Back to the bond; A failed link would nearly instantly transfer to the backup link. So if you are going down for 2~3 seconds on both links, something else is happening. Look at syslog on both nodes around the time the last fence happened and see what logs are written just prior to the fence. That might give you a clue. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 05:12:43 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 10:42:43 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> Hi, Thanks for your mail. I herewith attaching the bonding and eth configuration files. And on the /var/log/messages during the fence operation we can get the logs updated related to network only in the node which fences the other. Server 1 NIC 1: (eth2) /etc/sysconfig/network-scripts/ifcfg-eth2 DEVICE="eth2" HWADDR="3C:D9:2B:04:2D:7A" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond0 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 NIC 4: (eth5) /etc/sysconfig/network-scripts/ifcfg-eth5 DEVICE="eth5" HWADDR="3C:D9:2B:04:2D:80" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond0 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 NIC 2: (eth3) /etc/sysconfig/network-scripts/ifcfg-eth3 DEVICE="eth3" HWADDR="3C:D9:2B:04:2D:7C" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond1 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 NIC 3: /etc/sysconfig/network-scripts/ifcfg-eth4 DEVICE="eth4" HWADDR="3C:D9:2B:04:2D:7E" NM_CONTROLLED="no" ONBOOT="yes" MASTER=bond1 SLAVE=yes USERCTL=no BOOTPROTO=none Server 1 Bond0: (Public Access) /etc/sysconfig/network-scripts/ifcfg-bond0 DEVICE=bond0 BOOTPROTO=static IPADDR=192.168.129.10 NETMASK=255.255.255.0 GATEWAY=192.168.129.1 USERCTL=no ONBOOT=yes BONDING_OPTS="miimon=100 mode=0" Server 1 Bond1: (Heartbeat) /etc/sysconfig/network-scripts/ifcfg-bond1 DEVICE=bond1 BOOTPROTO=static IPADDR=10.0.0.10 NETMASK=255.0.0.0 USERCTL=no ONBOOT=yes BONDING_OPTS="miimon=100 mode=1" On the log messages, Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any active interface ! 
Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Monday, January 09, 2012 10:27 AM To: linux clustering Cc: SATHYA - IT Subject: SPAM - Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment On 01/08/2012 11:37 PM, SATHYA - IT wrote: > Hi, > > We had configured RHEL 6.2 - 2 node Cluster with clvmd + gfs2 + cman + > smb. We have 4 nic cards in the servers where 2 been configured in > bonding for heartbeat (with mode=1) and 2 been configured in bonding > for public access (with mode=0). Heartbeat network is connected > directly from server to server. Once in 3 - 4 days, the heartbeat goes > down and comes up automatically in 2 to 3 seconds. Not sure why this > down and up occurs. Because of this in cluster, one system is got fenced by other. > > Is there anyway where we can increase the time to wait for the cluster > to wait for heartbeat. Ie if the cluster can wait for 5-6 seconds even > the heartbeat fails for 5-6 seconds the node won't get fenced. Kindly > advise. "mode=1" is Active/Passive and I use it extensively with no trouble. I'm not sure where "heartbeat" comes from, but I might be missing the obvious. Can you share your bond and eth configuration files here please (as plain-text attachments)? Secondly, make sure that you are actually using that interface/bond. Run 'gethostip -d ', where "nodename" is what you set in cluster.conf. The returned IP will be the one used by the cluster. Back to the bond; A failed link would nearly instantly transfer to the backup link. So if you are going down for 2~3 seconds on both links, something else is happening. Look at syslog on both nodes around the time the last fence happened and see what logs are written just prior to the fence. That might give you a clue. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. 
All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From linux at alteeve.com Mon Jan 9 05:24:10 2012 From: linux at alteeve.com (Digimer) Date: Mon, 09 Jan 2012 00:24:10 -0500 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> Message-ID: <4F0A79FA.7080408@alteeve.com> On 01/09/2012 12:12 AM, SATHYA - IT wrote: > Hi, > > Thanks for your mail. I herewith attaching the bonding and eth configuration > files. And on the /var/log/messages during the fence operation we can get > the logs updated related to network only in the node which fences the other. What IPs do the node names resolve to? I'm assuming bond1, but I would like you to confirm. > Server 1 Bond1: (Heartbeat) I'm still not sure what you mean by heartbeat. Do you mean the channel corosync is using? > On the log messages, > > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Down > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Down This tells me both links dropped at the same time. These messages are coming from below the cluster though. > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth3, disabling it > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any > active interface ! > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth4, disabling it With both of the bond's NICs down, the bond itself is going to drop. > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth3, 1000 Mbps full duplex. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the > new active one. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth4, 1000 Mbps full duplex. I don't see any messages about the cluster in here, which I assume you cropped out. In this case, it doesn't matter as the problem is well below the cluster, but in general, please provide more data, not less. You never know what might help. :) Anyway, you need to sort out what is happening here. Bad drivers? Bad card (assuming dual-port)? Something is taking the NICs down, as though they were actually unplugged. If you can run them through a switch, if might help isolate which node is causing the problems as then you would only see one node record "NIC Copper Link is Down" and can then focus on just that node. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." 
- epitron From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 05:51:08 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 11:21:08 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <004001ccce92$b0bfd210$123f7630$@precisionit.co.in> Hi, Herewith attaching the /var/log/messages of both the servers. Yesterday (08th Jan) one of the server got fenced by other around 10:48 AM. I am also attaching the cluster.conf file for your reference. On the related note, related to heartbeat - I am referring the channel used by corosync. And the name which has been configured in cluster.conf file resolves with bond1 only. Related to the network card, we are using 2 dual port card where we configured 1 port from each for bond0 and 1 port from the other for bond1. So it doesn't seems be a network card related issue. Moreover, we are not having any errors related to bond0. Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Monday, January 09, 2012 10:54 AM To: SATHYA - IT Cc: 'linux clustering' Subject: SPAM - Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment On 01/09/2012 12:12 AM, SATHYA - IT wrote: > Hi, > > Thanks for your mail. I herewith attaching the bonding and eth > configuration files. And on the /var/log/messages during the fence > operation we can get the logs updated related to network only in the node which fences the other. What IPs do the node names resolve to? I'm assuming bond1, but I would like you to confirm. > Server 1 Bond1: (Heartbeat) I'm still not sure what you mean by heartbeat. Do you mean the channel corosync is using? > On the log messages, > > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper > Link is Down Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: > NIC Copper Link is Down This tells me both links dropped at the same time. These messages are coming from below the cluster though. > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status > definitely down for interface eth3, disabling it Jan 3 14:46:07 > filesrv2 kernel: bonding: bond1: now running without any active > interface ! > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status > definitely down for interface eth4, disabling it With both of the bond's NICs down, the bond itself is going to drop. > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper > Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth3, 1000 Mbps full duplex. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 > the new active one. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper > Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth4, 1000 Mbps full duplex. I don't see any messages about the cluster in here, which I assume you cropped out. In this case, it doesn't matter as the problem is well below the cluster, but in general, please provide more data, not less. You never know what might help. :) Anyway, you need to sort out what is happening here. Bad drivers? Bad card (assuming dual-port)? Something is taking the NICs down, as though they were actually unplugged. 
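A hedged aside on narrowing down "something is taking the NICs down": the bonding and NIC drivers usually leave enough state behind to separate a cabling/peer problem from a driver reset. Interface names below follow the configs quoted above and may differ on other hosts.

cat /proc/net/bonding/bond1                  # active slave, MII status, per-slave link failure counters
ethtool eth3                                 # link, speed and duplex as the bnx2 driver currently sees them
ethtool -S eth3 | grep -iE 'err|drop|crc'    # driver statistics; climbing error counters hint at cabling/hardware
dmesg | grep -iE 'bnx2|bond1' | tail -50     # driver messages around the time of the last flap

Since miimon=100 is already set on both bonds, two slaves that sit on two different physical cards dropping in the same second on a direct server-to-server link points more towards the peer's ports resetting (reboot, power management, driver reload) or towards cabling than towards a single bad NIC, though that is only an inference from the logs shown so far.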
If you can run them through a switch, if might help isolate which node is causing the problems as then you would only see one node record "NIC Copper Link is Down" and can then focus on just that node. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1043 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages_filesrv1 Type: application/octet-stream Size: 117290 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages_filesrv2 Type: application/octet-stream Size: 15302 bytes Desc: not available URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 06:18:22 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 11:48:22 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <004c01ccce96$7f974ca0$7ec5e5e0$@precisionit.co.in> Not sure whether you received the logs and cluster.conf file. Herewith pasting the same... On File Server1: Jan 8 03:15:04 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Jan 8 03:15:04 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8765" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. 
Jan 8 10:52:42 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8751" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpuset Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpu Jan 8 10:52:42 filesrv1 kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 Jan 8 10:52:42 filesrv1 kernel: Command line: ro root=/dev/mapper/vg01-LogVol01 rd_LVM_LV=vg01/LogVol01 rd_LVM_LV=vg01/LogVol00 rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet acpi=off Jan 8 10:52:42 filesrv1 kernel: KERNEL supported cpus: Jan 8 10:52:42 filesrv1 kernel: Intel GenuineIntel Jan 8 10:52:42 filesrv1 kernel: AMD AuthenticAMD Jan 8 10:52:42 filesrv1 kernel: Centaur CentaurHauls Jan 8 10:52:42 filesrv1 kernel: BIOS-provided physical RAM map: Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000100000 - 00000000d762f000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d762f000 - 00000000d763c000 (ACPI data) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763c000 - 00000000d763d000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763d000 - 00000000dc000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000100000000 - 00000008a7fff000 (usable) Jan 8 10:52:42 filesrv1 kernel: DMI 2.7 present. Jan 8 10:52:42 filesrv1 kernel: SMBIOS version 2.7 @ 0xF4F40 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0x8a7fff max_arch_pfn = 0x400000000 Jan 8 10:52:42 filesrv1 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0xd763d max_arch_pfn = 0x400000000 . . On File Server 2: Jan 8 03:09:06 filesrv2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8648" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. 
Jan 8 10:48:15 filesrv2 corosync[8933]: [TOTEM ] A processor failed, forming new configuration. Jan 8 10:48:17 filesrv2 corosync[8933]: [QUORUM] Members[1]: 2 Jan 8 10:48:17 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:48:17 filesrv2 rgmanager[12557]: State change: clustsrv1 DOWN Jan 8 10:48:17 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.20) ; members(old:2 left:1) Jan 8 10:48:17 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:48:17 filesrv2 kernel: dlm: closing connection to node 1 Jan 8 10:48:17 filesrv2 fenced[8989]: fencing node clustsrv1 Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth4, 100 Mbps full duplex. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth3, 100 Mbps full duplex. Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:27 filesrv2 fenced[8989]: fence clustsrv1 success Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Acquiring the transaction lock... 
Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replaying journal... Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replayed 29140 of 29474 blocks Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Found 334 revoke tags Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Journal replayed in 2s Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Done Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:49:03 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:03 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:49:04 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:04 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Looking at journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Acquiring the transaction lock... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replaying journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replayed 0 of 0 blocks Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Found 0 revoke tags Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Journal replayed in 0s Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Done Jan 8 10:52:37 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:52:38 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: now running without any active interface ! 
Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:52:44 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.10) ; members(old:1 left:0) Jan 8 10:52:44 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:52:51 filesrv2 kernel: dlm: got connection from 1 Jan 8 10:55:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:55:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:55:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:55:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:55:57 filesrv2 kernel: Call Trace: Jan 8 10:55:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:55:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:55:57 filesrv2 kernel: [] ? down_read+0x24/0x30 Jan 8 10:55:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:55:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:55:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:55:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:55:57 filesrv2 kernel: [] ? 
kthread+0x0/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:57:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:57:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:57:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:57:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:57:57 filesrv2 kernel: Call Trace: Jan 8 10:57:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:57:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:57:57 filesrv2 kernel: [] ? down_read+0x24/0x30 Jan 8 10:57:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:57:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:57:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:57:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:57:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:57:57 filesrv2 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:57:57 filesrv2 kernel: [] ? kthread+0x0/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:59:22 filesrv2 rgmanager[12557]: State change: clustsrv1 UP Cluster.conf File: Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: SATHYA - IT [mailto:sathyanarayanan.varadharajan at precisionit.co.in] Sent: Monday, January 09, 2012 11:21 AM To: 'Digimer'; 'linux clustering' Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Hi, Herewith attaching the /var/log/messages of both the servers. Yesterday (08th Jan) one of the server got fenced by other around 10:48 AM. I am also attaching the cluster.conf file for your reference. On the related note, related to heartbeat - I am referring the channel used by corosync. And the name which has been configured in cluster.conf file resolves with bond1 only. Related to the network card, we are using 2 dual port card where we configured 1 port from each for bond0 and 1 port from the other for bond1. So it doesn't seems be a network card related issue. Moreover, we are not having any errors related to bond0. 
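On the question that started this thread, letting the cluster ride out a 5-6 second loss of the corosync link: the knob that governs this is the totem token timeout, which decides how long a silent node is tolerated before a new membership is formed and fencing begins. On RHEL 6 it can be overridden from cluster.conf; a minimal sketch, with a purely illustrative value (cman's own default is already around 10 seconds, so check what is actually in effect before raising it):

<totem token="15000"/>

placed as a direct child of <cluster> in /etc/cluster/cluster.conf on both nodes (alongside <cman>, <clusternodes> and so on), with config_version bumped as usual. Something like the following can then sanity-check the file and show the value the running corosync is using:

ccs_config_validate
corosync-objctl | grep -i totem

A related knob is post_fail_delay on <fence_daemon>, which only delays the fence call rather than the membership change. Either way, a longer timeout merely papers over the flaps; whatever keeps taking bond1 down still needs to be found.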
Thanks Sathya Narayanan V Solution Architect This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From Klaus.Steinberger at physik.uni-muenchen.de Mon Jan 9 07:28:38 2012 From: Klaus.Steinberger at physik.uni-muenchen.de (Klaus Steinberger) Date: Mon, 9 Jan 2012 08:28:38 +0100 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: References: Message-ID: <53FBFF3E-A139-43F7-A500-FE69539ECF84@physik.uni-muenchen.de> > > > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Down > Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Down > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth3, disabling it > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any > active interface ! > Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down > for interface eth4, disabling it > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth3, 1000 Mbps full duplex. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the > new active one. > Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up! > Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is > Up, 1000 Mbps full duplex, receive & transmit flow control ON > Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for > interface eth4, 1000 Mbps full duplex. Both links are going down at same time. Did you connect them to the same switch? Is there a switch reboot at that time or something else going on in the switch? Sincerly, Klaus -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajb2 at mssl.ucl.ac.uk Mon Jan 9 08:52:30 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 08:52:30 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0A532A.2000202@alteeve.com> References: <4F0A532A.2000202@alteeve.com> Message-ID: <4F0AAACE.7080602@mssl.ucl.ac.uk> On 09/01/12 02:38, Digimer wrote: > Technically yes, practically no. Or rather, not without a lot of > testing first. This is "rather a shame" I have a similar requirement (EL5 -> EL6 with GFS) > There may be some other things you need to do as well. Please be sure > to do proper testing and, if you have the budget, hire Red Hat to advise > on this process. Also, please report back your results. It would help me > help others in the same boat later. 
:) RH's advice to use is to "Big Bang" it. The last such transition (EL4 to EL5) was an unmitigated disaster even with RH onsite to make the change, so we're _very_ wary this time around. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 08:55:15 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 08:55:15 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0A726A.6050304@alteeve.com> References: <4F0A532A.2000202@alteeve.com> <4F0A726A.6050304@alteeve.com> Message-ID: <4F0AAB73.2040201@mssl.ucl.ac.uk> On 09/01/12 04:51, Digimer wrote: > Alternatively, use some spare machines to mock-up the current cluster > and then test-upgrade. It might work flawlessly, I genuinely don't know. Test setups aren't always a good metric. Everything worked fine on our last changeover until we put real-world load on. From fdinitto at redhat.com Mon Jan 9 09:36:05 2012 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 09 Jan 2012 10:36:05 +0100 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AAACE.7080602@mssl.ucl.ac.uk> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> Message-ID: <4F0AB505.2020402@redhat.com> On 1/9/2012 9:52 AM, Alan Brown wrote: > On 09/01/12 02:38, Digimer wrote: > >> Technically yes, practically no. Or rather, not without a lot of >> testing first. > > This is "rather a shame" > > I have a similar requirement (EL5 -> EL6 with GFS) > Well the cluster stack itself (openais/cman/gfs/rgmanager -> corosync/cman/gfs2/rgmanager) is capable of handling the upgrade in a compatible mode. *BUT* (yes there are tons of those) in time, while performing different upgrade scenarios/tests, we come to the conclusion that it is a lot more complicated for any user (even expert/advanced ones) to perform a safe upgrade than rebuilding the cluster from scratch (*) given that setup/config/etc are known from the old cluster. >> There may be some other things you need to do as well. Please be sure >> to do proper testing and, if you have the budget, hire Red Hat to advise >> on this process. Also, please report back your results. It would help me >> help others in the same boat later. :) > > RH's advice to use is to "Big Bang" it. It?s not much of an advice, as RH does not officially support this upgrade method. > > The last such transition (EL4 to EL5) was an unmitigated disaster even > with RH onsite to make the change, so we're _very_ wary this time around. > The amount of changes in the cluster software between EL5 and EL6 are a lot less intrusive at system level. I can?t really say for sure for the entire OS, since the upgrade doesn?t involve only RHCS. Fabio (*) The major issues, while upgrading from 5 to 6 are: - GFS1 is not support in EL6. Volumes need to be migrated to GFS2 (and there are several ways to do it, but still needs to be done offline) - cluster.conf cannot be updated automatically during an upgrade or nodes running in mixed mode (some nodes at 5 and others at 6). - some config options, while backward compat should be retained, needs to be changed in very specific sequence, making it really hard to perform an easy upgrade. - but the biggest blocker of all are all the resources (driven or not by rgmanager). For example, apache2 config in EL5 cannot be used out-of-the-box on EL6. So assuming rgmanager is driving apache2, then you would need to setup 2 separate apache2 configs, test them individually, perform migration checks between EL5 and 6... etc. 
This kind of testing is more time consuming and complex than what you can possibly gain by redoing the cluster from scratch. There are also other resources that are simply unable to deal with this kind of upgrade. Let?s make the example of a db stored on a gfs2 filesystem. DB created in version 1, after a migration to EL6, the DB format is upgraded to internal version 2. Version 2 being incompatible with 1. IF there is a situation where the service needs to failover back to a node running EL5, the DB will be unable to start. Effectively killing the purpose of HA. What you want to notice is that the service compatibility level has nothing to do with cluster itself. Now, when you multiply the amount of possible services, failover scenarios, config changes etc, you will easily come to the conclusion that an upgrade of this proportion is a path to insanity for the administrator. From rajatjpatel at gmail.com Mon Jan 9 10:20:19 2012 From: rajatjpatel at gmail.com (rajatjpatel) Date: Mon, 9 Jan 2012 15:50:19 +0530 Subject: [Linux-cluster] centos5 to RHEL6 migration Message-ID: 1. Back up anything you care about. 2. Remember - A fresh install is generally *strongly* preferred over an upgrade. Regards, Rajat Patel http://studyhat.blogspot.com FIRST THEY IGNORE YOU... THEN THEY LAUGH AT YOU... THEN THEY FIGHT YOU... THEN YOU WIN... On Mon, Jan 9, 2012 at 9:33 AM, wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. centos5 to RHEL6 migration (Terry) > 2. Re: centos5 to RHEL6 migration (Michael Allen) > 3. Re: centos5 to RHEL6 migration (Digimer) > 4. Re: centos5 to RHEL6 migration (Terry) > 5. Re: centos5 to RHEL6 migration (Fajar A. Nugraha) > 6. Re: GFS on CentOS - cman unable to start (Wes Modes) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sun, 8 Jan 2012 17:39:37 -0600 > From: Terry > To: linux clustering > Subject: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: > > > Content-Type: text/plain; charset="iso-8859-1" > > Hello, > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > have already taken one of the three nodes out and rebuilt it. My thinking > is to build a new cluster from the RHEL node but want to run it by everyone > here first. The cluster consists of a handful of NFS volumes and a > PostgreSQL database. I am not concerned about the database. I am moving > to a new version and will simply migrate that. I am more concerned about > all of the ext4 clustered LVM volumes. In this process, if I shut down the > old cluster, what's the process to force the new node to read those volumes > in to the new single-node cluster? A pvscan on the new server shows all of > the volumes fine. I am concerned there's something else I'll have to do > here to begin mounting these volumes in the new cluster. > [root at server ~]# pvdisplay > Skipping clustered volume group vg_data01b > > Thanks! > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20120108/bc344718/attachment.html > > > > ------------------------------ > > Message: 2 > Date: Sun, 8 Jan 2012 19:36:10 -0600 > From: Michael Allen > To: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: <20120108193610.2404b425 at godelsrevenge.induswx.com> > Content-Type: text/plain; charset=US-ASCII > > On Sun, 8 Jan 2012 17:39:37 -0600 > Terry wrote: > > > Hello, > > > > I am trying to gently migrate a 3-node cluster from centos5 to > > RHEL6. I have already taken one of the three nodes out and rebuilt > > it. My thinking is to build a new cluster from the RHEL node but want > > to run it by everyone here first. The cluster consists of a handful > > of NFS volumes and a PostgreSQL database. I am not concerned about > > the database. I am moving to a new version and will simply migrate > > that. I am more concerned about all of the ext4 clustered LVM > > volumes. In this process, if I shut down the old cluster, what's the > > process to force the new node to read those volumes in to the new > > single-node cluster? A pvscan on the new server shows all of the > > volumes fine. I am concerned there's something else I'll have to do > > here to begin mounting these volumes in the new cluster. [root at server > > ~]# pvdisplay Skipping clustered volume group vg_data01b > > > > Thanks! > This message comes at a good time for me, too, since I am considering > the same thing. I have 10 nodes but it appears that a change to CentOS > 6.xx is about due. > > Michael Allen > > > > ------------------------------ > > Message: 3 > Date: Sun, 08 Jan 2012 21:38:34 -0500 > From: Digimer > To: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: <4F0A532A.2000202 at alteeve.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On 01/08/2012 06:39 PM, Terry wrote: > > Hello, > > > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. I > > have already taken one of the three nodes out and rebuilt it. My > > thinking is to build a new cluster from the RHEL node but want to run it > > by everyone here first. The cluster consists of a handful of NFS volumes > > and a PostgreSQL database. I am not concerned about the database. I am > > moving to a new version and will simply migrate that. I am more > > concerned about all of the ext4 clustered LVM volumes. In this process, > > if I shut down the old cluster, what's the process to force the new node > > to read those volumes in to the new single-node cluster? A pvscan on > > the new server shows all of the volumes fine. I am concerned there's > > something else I'll have to do here to begin mounting these volumes in > > the new cluster. > > [root at server ~]# pvdisplay > > Skipping clustered volume group vg_data01b > > > > Thanks! > > Technically yes, practically no. Or rather, not without a lot of > testing first. > > I've never done this, but here are some pointers; > > > > upgrading > Set this if you are performing a rolling upgrade of the cluster > between major releases. > > disallowed > Set this to 1 enable cman's Disallowed mode. This is usually > only needed for backwards compatibility. > > > > Enable compatibility with cluster2 nodes. groupd(8) > > There may be some other things you need to do as well. Please be sure > to do proper testing and, if you have the budget, hire Red Hat to advise > on this process. Also, please report back your results. 
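(The inline XML in the pointers above was eaten by the list archiver. Going by the descriptions that survived and by cman(5)/groupd(8), the stripped snippets were presumably along these lines; treat this as a hedged reconstruction rather than a copy of the original mail, and confirm the exact attribute names and values against cluster.conf(5) on the target release:

<cman upgrading="yes" disallowed="1"/>
<group groupd_compat="1"/>

i.e. attributes on the <cman> element for the rolling-upgrade and Disallowed modes, and groupd_compat on <group> for cluster2 compatibility.)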
It would help me > help others in the same boat later. :) > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > > > > ------------------------------ > > Message: 4 > Date: Sun, 8 Jan 2012 21:31:38 -0600 > From: Terry > To: Digimer > Cc: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: > > > Content-Type: text/plain; charset="iso-8859-1" > > If it's not practical, am I left with building a new cluster from scratch? > > On Sun, Jan 8, 2012 at 8:38 PM, Digimer wrote: > > > On 01/08/2012 06:39 PM, Terry wrote: > > > Hello, > > > > > > I am trying to gently migrate a 3-node cluster from centos5 to RHEL6. > I > > > have already taken one of the three nodes out and rebuilt it. My > > > thinking is to build a new cluster from the RHEL node but want to run > it > > > by everyone here first. The cluster consists of a handful of NFS > volumes > > > and a PostgreSQL database. I am not concerned about the database. I > am > > > moving to a new version and will simply migrate that. I am more > > > concerned about all of the ext4 clustered LVM volumes. In this > process, > > > if I shut down the old cluster, what's the process to force the new > node > > > to read those volumes in to the new single-node cluster? A pvscan on > > > the new server shows all of the volumes fine. I am concerned there's > > > something else I'll have to do here to begin mounting these volumes in > > > the new cluster. > > > [root at server ~]# pvdisplay > > > Skipping clustered volume group vg_data01b > > > > > > Thanks! > > > > Technically yes, practically no. Or rather, not without a lot of > > testing first. > > > > I've never done this, but here are some pointers; > > > > > > > > upgrading > > Set this if you are performing a rolling upgrade of the cluster > > between major releases. > > > > disallowed > > Set this to 1 enable cman's Disallowed mode. This is usually > > only needed for backwards compatibility. > > > > > > > > Enable compatibility with cluster2 nodes. groupd(8) > > > > There may be some other things you need to do as well. Please be sure > > to do proper testing and, if you have the budget, hire Red Hat to advise > > on this process. Also, please report back your results. It would help me > > help others in the same boat later. :) > > > > -- > > Digimer > > E-Mail: digimer at alteeve.com > > Freenode handle: digimer > > Papers and Projects: http://alteeve.com > > Node Assassin: http://nodeassassin.org > > "omg my singularity battery is dead again. > > stupid hawking radiation." - epitron > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20120108/e8917278/attachment.html > > > > ------------------------------ > > Message: 5 > Date: Mon, 9 Jan 2012 11:01:06 +0700 > From: "Fajar A. Nugraha" > To: linux clustering > Subject: Re: [Linux-cluster] centos5 to RHEL6 migration > Message-ID: > > > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jan 9, 2012 at 10:31 AM, Terry wrote: > > If it's not practical, am I left with building a new cluster from > scratch? > > I'm pretty sure if your ONLY problem is "Skipping clustered volume > group vg_data01b", you can just turn off cluster flag with "vgchange > -cn", then use "-o lock_nolock" to mount it on a SINGLE (i.e. 
not > cluster) node. That was your original question, wasn't it? > > As for upgrading, I haven't tested it. You should be able to use your > old storage, but just create other settings from scratch. Like Digimer > said, be sure to do proper testing :) > > -- > Fajar > > > > ------------------------------ > > Message: 6 > Date: Sun, 08 Jan 2012 20:03:18 -0800 > From: Wes Modes > To: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] GFS on CentOS - cman unable to start > Message-ID: <4F0A6706.6090308 at ucsc.edu> > Content-Type: text/plain; charset="iso-8859-1" > > The behavior of cman's resolving of cluster node names is less than > clear, as per the RHEL bugzilla report. > > The hostname and cluster.conf match, as does /etc/hosts and uname -n. > The short names and FQDN ping. I believe all the node cluster.conf are > in sync, and all nodes are accessible to each other using either short > or long names. > > You'll have to trust that I've tried everything obvious, and every > possible combination of FQDN and short names in cluster.conf and > hostname. That said, it is totally possible I missed something obvious. > > I suspect, there is something else going on and I don't know how to get > at it. > > Wes > > > On 1/6/2012 6:06 PM, Kevin Stanton wrote: > > > > > Hi, > > > > > I think CMAN expect that the names of the cluster nodes be the same > > returned by the command "uname -n". > > > > > For what you write your nodes hostnames are: test01.gdao.ucsc.edu > > and test02.gdao.ucsc.edu, but in cluster.conf you have declared only > > "test01" and "test02". > > > > > > > > I haven't found this to be the case in the past. I actually use a > > separate short name to reference each node which is different than the > > hostname of the server itself. All I've ever had to do is make sure > > it resolves correctly. You can do this either in DNS and/or in > > /etc/hosts. I have found that it's a good idea to do both in case > > your DNS server is a virtual machine and is not running for some > > reason. In that case with /etc/hosts you can still start cman. > > > > > > > > I would make sure whatever node names you use in the cluster.conf will > > resolve when you try to ping it from all nodes in the cluster. Also > > make sure your cluster.conf is in sync between all nodes. > > > > > > > > -Kevin > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > These servers are currently on the same host, but may not be in > > the future. They are in a vm cluster (though honestly, I'm not > > sure what this means yet). > > > > SElinux is on, but disabled. > > Firewalling through iptables is turned off via > > system-config-securitylevel > > > > There is no line currently in the cluster.conf that deals with > > multicasting. > > > > Any other suggestions? > > > > Wes > > > > On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: > > > > Hi, > > > > > > > > This servers is on VMware? At the same host? > > > > SElinux is disable? iptables have something? > > > > > > > > In my environment I had a problem to start GFS2 with servers in > > differents hosts. > > > > To clustering servers, was need migrate one server to the same > > host of the other, and restart this. > > > > > > > > I think, one of the problem was because the virtual switchs. > > > > To solve, I changed a multicast IP, to use 225.0.0.13 at cluster.conf > > > > > > > > And add a static route in both, to use default gateway. > > > > > > > > I don't know if it's correct, but this solve my problem. 
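(For reference, the workaround described just above normally has two pieces: an explicit multicast address in cluster.conf and a route that steers that group where the other nodes can see it. Luiz routed it via the default gateway; pinning it to the cluster-facing interface is the more common variant. Roughly, for the CentOS 5 / cman 2.0 generation being discussed, using the address from the mail and an illustrative interface, and noting that this release may also want a per-node <multicast> element as per the Cluster Administration guide:

<cman>
    <multicast addr="225.0.0.13"/>
</cman>

ip route add 225.0.0.13/32 dev eth0

This is a hedged sketch, not a verified copy of Luiz's setup.)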
> > > > > > > > I hope that help you. > > > > > > > > Regards. > > > > > > > > On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > > wrote: > > > > Hi, Steven. > > > > I've tried just about every possible combination of hostname and > > cluster.conf. > > > > ping to test01 resolves to 128.114.31.112 > > ping to test01.gdao.ucsc.edu > > resolves to 128.114.31.112 > > > > It feels like the right thing is being returned. This feels like it > > might be a quirk (or bug possibly) of cman or openais. > > > > There are some old bug reports around this, for example > > https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds > > like the > > way that cman reports this error is anything but straightforward. > > > > Is there anyone who has encountered this error and found a solution? > > > > Wes > > > > > > > > On 1/6/2012 2:00 AM, Steven Whitehouse wrote: > > > Hi, > > > > > > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: > > >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS > > systems > > >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. > > >> > > >> I keep running into the same problem despite many > > differently-flavored > > >> attempts to set up GFS. The problem comes when I try to start > > cman, the > > >> cluster management software. > > >> > > >> [root at test01]# service cman start > > >> Starting cluster: > > >> Loading modules... done > > >> Mounting configfs... done > > >> Starting ccsd... done > > >> Starting cman... failed > > >> cman not started: Can't find local node name in cluster.conf > > >> /usr/sbin/cman_tool: aisexec daemon didn't start > > >> > > [FAILED] > > >> > > > This looks like what it says... whatever the node name is in > > > cluster.conf, it doesn't exist when the name is looked up, or > > possibly > > > it does exist, but is mapped to the loopback address (it needs to > > map to > > > an address which is valid cluster-wide) > > > > > > Since your config files look correct, the next thing to check is > what > > > the resolver is actually returning. Try (for example) a ping to > > test01 > > > (you need to specify exactly the same form of the name as is used > in > > > cluster.conf) from test02 and see whether it uses the correct ip > > > address, just in case the wrong thing is being returned. > > > > > > Steve. > > > > > >> [root at test01]# tail /var/log/messages > > >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to > > >> cluster infrastructure after 1193640 seconds. > > >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to > > >> cluster infrastructure after 1193670 seconds. > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > > >> Service RELEASE 'subrev 1887 version 0.80.6' > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright > (C) > > >> 2002-2006 MontaVista Software, Inc and contributors. > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright > (C) > > >> 2006 Red Hat, Inc. > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > > >> Service: started and ready to provide service. 
> > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local > > node name > > >> "test01.gdao.ucsc.edu " not found > > in cluster.conf > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > > reading CCS > > >> info, cannot start > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error > reading > > >> config from CCS > > >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS > Executive > > >> exiting (reason: could not read the main configuration file). > > >> > > >> Here are details of my configuration: > > >> > > >> [root at test01]# rpm -qa | grep cman > > >> cman-2.0.115-85.el5_7.2 > > >> > > >> [root at test01]# echo $HOSTNAME > > >> test01.gdao.ucsc.edu > > >> > > >> [root at test01]# hostname > > >> test01.gdao.ucsc.edu > > >> > > >> [root at test01]# cat /etc/hosts > > >> # Do not remove the following line, or various programs > > >> # that require network functionality will fail. > > >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu > > > > >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu > > > > >> 127.0.0.1 localhost.localdomain localhost > > >> ::1 localhost6.localdomain6 localhost6 > > >> > > >> [root at test01]# sestatus > > >> SELinux status: enabled > > >> SELinuxfs mount: /selinux > > >> Current mode: permissive > > >> Mode from config file: permissive > > >> Policy version: 21 > > >> Policy from config file: targeted > > >> > > >> [root at test01]# cat /etc/cluster/cluster.conf > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > >> ipaddr="gdvcenter.ucsc.edu " > > login="root" passwd="1hateAmazon.com" > > >> vmlogin="root" vmpasswd="esxpass" > > >> > > > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> > > >> > > >> > > >> > > >> > > >> > > >> > > >> I've seen much discussion of this problem, but no definitive > > solutions. > > >> Any help you can provide will be welcome. > > >> > > >> Wes Modes > > >> > > >> -- > > >> Linux-cluster mailing list > > >> Linux-cluster at redhat.com > > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > -- > > Luiz Gustavo P Tonello. > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20120108/707d1029/attachment.html > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 93, Issue 7 > ******************************************** > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 10:43:15 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 16:13:15 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Message-ID: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> Klaus, For your point the corosync network is not connected to the switch. They are connected directly to the servers (server to server). Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: SATHYA - IT [mailto:sathyanarayanan.varadharajan at precisionit.co.in] Sent: Monday, January 09, 2012 11:48 AM To: 'Digimer'; 'linux clustering' Subject: RE: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Not sure whether you received the logs and cluster.conf file. Herewith pasting the same... On File Server1: Jan 8 03:15:04 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Jan 8 03:15:04 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8765" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: imklog 4.6.2, log source = /proc/kmsg started. Jan 8 10:52:42 filesrv1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8751" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpuset Jan 8 10:52:42 filesrv1 kernel: Initializing cgroup subsys cpu Jan 8 10:52:42 filesrv1 kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011 Jan 8 10:52:42 filesrv1 kernel: Command line: ro root=/dev/mapper/vg01-LogVol01 rd_LVM_LV=vg01/LogVol01 rd_LVM_LV=vg01/LogVol00 rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet acpi=off Jan 8 10:52:42 filesrv1 kernel: KERNEL supported cpus: Jan 8 10:52:42 filesrv1 kernel: Intel GenuineIntel Jan 8 10:52:42 filesrv1 kernel: AMD AuthenticAMD Jan 8 10:52:42 filesrv1 kernel: Centaur CentaurHauls Jan 8 10:52:42 filesrv1 kernel: BIOS-provided physical RAM map: Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000000100000 - 00000000d762f000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d762f000 - 00000000d763c000 (ACPI data) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763c000 - 00000000d763d000 (usable) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000d763d000 - 00000000dc000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000fec00000 - 00000000fee10000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) Jan 8 10:52:42 filesrv1 kernel: BIOS-e820: 0000000100000000 - 00000008a7fff000 (usable) Jan 8 10:52:42 filesrv1 kernel: DMI 2.7 present. Jan 8 10:52:42 filesrv1 kernel: SMBIOS version 2.7 @ 0xF4F40 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0x8a7fff max_arch_pfn = 0x400000000 Jan 8 10:52:42 filesrv1 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 Jan 8 10:52:42 filesrv1 kernel: last_pfn = 0xd763d max_arch_pfn = 0x400000000 . . 
On File Server 2: Jan 8 03:09:06 filesrv2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8648" x-info="http://www.rsyslog.com"] (re)start Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:09 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:09 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:09 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:48:15 filesrv2 corosync[8933]: [TOTEM ] A processor failed, forming new configuration. Jan 8 10:48:17 filesrv2 corosync[8933]: [QUORUM] Members[1]: 2 Jan 8 10:48:17 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:48:17 filesrv2 rgmanager[12557]: State change: clustsrv1 DOWN Jan 8 10:48:17 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.20) ; members(old:2 left:1) Jan 8 10:48:17 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:48:17 filesrv2 kernel: dlm: closing connection to node 1 Jan 8 10:48:17 filesrv2 fenced[8989]: fencing node clustsrv1 Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:17 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:24 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth4, 100 Mbps full duplex. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:25 filesrv2 kernel: bond1: link status definitely up for interface eth3, 100 Mbps full duplex. 
Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:48:25 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:48:27 filesrv2 fenced[8989]: fence clustsrv1 success Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:ctdb.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Trying to acquire journal lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen02.0: jid=1: Done Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Looking at journal... Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Acquiring the transaction lock... Jan 8 10:48:28 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replaying journal... Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:48:28 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:48:28 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Replayed 29140 of 29474 blocks Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Found 334 revoke tags Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Journal replayed in 2s Jan 8 10:48:30 filesrv2 kernel: GFS2: fsid=samba:gen01.0: jid=1: Done Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:49:01 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:49:01 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:49:03 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:03 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:49:03 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:49:04 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:49:04 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. 
Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Looking at journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Acquiring the transaction lock... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replaying journal... Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Replayed 0 of 0 blocks Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Found 0 revoke tags Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Journal replayed in 0s Jan 8 10:50:13 filesrv2 kernel: GFS2: fsid=samba:hadata02.0: jid=1: Done Jan 8 10:52:37 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: making interface eth4 the new active one. Jan 8 10:52:38 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it Jan 8 10:52:38 filesrv2 kernel: bonding: bond1: now running without any active interface ! Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one. Jan 8 10:52:40 filesrv2 kernel: bonding: bond1: first active interface up! Jan 8 10:52:40 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex. Jan 8 10:52:44 filesrv2 corosync[8933]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [QUORUM] Members[2]: 1 2 Jan 8 10:52:44 filesrv2 corosync[8933]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.10) ; members(old:1 left:0) Jan 8 10:52:44 filesrv2 corosync[8933]: [MAIN ] Completed service synchronization, ready to provide service. Jan 8 10:52:51 filesrv2 kernel: dlm: got connection from 1 Jan 8 10:55:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:55:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:55:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:55:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:55:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:55:57 filesrv2 kernel: Call Trace: Jan 8 10:55:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:55:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:55:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:55:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:55:57 filesrv2 kernel: [] ? 
down_read+0x24/0x30 Jan 8 10:55:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:55:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:55:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:55:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:55:57 filesrv2 kernel: [] ? gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:55:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:55:57 filesrv2 kernel: [] ? kthread+0x0/0xa0 Jan 8 10:55:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:57:57 filesrv2 kernel: INFO: task gfs2_quotad:9389 blocked for more than 120 seconds. Jan 8 10:57:57 filesrv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 8 10:57:57 filesrv2 kernel: gfs2_quotad D ffff8808a7824900 0 9389 2 0x00000080 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da88 0000000000000046 0000000000000000 00000000000001c3 Jan 8 10:57:57 filesrv2 kernel: ffff88087580da18 ffff88087580da50 ffffffff810ea694 ffff88088b184080 Jan 8 10:57:57 filesrv2 kernel: ffff88088e71a5f8 ffff88087580dfd8 000000000000f4e8 ffff88088e71a5f8 Jan 8 10:57:57 filesrv2 kernel: Call Trace: Jan 8 10:57:57 filesrv2 kernel: [] ? rb_reserve_next_event+0xb4/0x370 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_failed_common+0x95/0x1d0 Jan 8 10:57:57 filesrv2 kernel: [] ? native_sched_clock+0x13/0x60 Jan 8 10:57:57 filesrv2 kernel: [] rwsem_down_read_failed+0x26/0x30 Jan 8 10:57:57 filesrv2 kernel: [] call_rwsem_down_read_failed+0x14/0x30 Jan 8 10:57:57 filesrv2 kernel: [] ? down_read+0x24/0x30 Jan 8 10:57:57 filesrv2 kernel: [] dlm_lock+0x62/0x1e0 [dlm] Jan 8 10:57:57 filesrv2 kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160 Jan 8 10:57:57 filesrv2 kernel: [] gdlm_lock+0xf2/0x130 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_ast+0x0/0xe0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? gdlm_bast+0x0/0x50 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] do_xmote+0x17f/0x260 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] run_queue+0xf1/0x1d0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_glock_nq+0x1b7/0x360 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? try_to_del_timer_sync+0x7b/0xe0 Jan 8 10:57:57 filesrv2 kernel: [] gfs2_statfs_sync+0x58/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? schedule_timeout+0x19a/0x2e0 Jan 8 10:57:57 filesrv2 kernel: [] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] quotad_check_timeo+0x57/0xb0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] gfs2_quotad+0x234/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] ? autoremove_wake_function+0x0/0x40 Jan 8 10:57:57 filesrv2 kernel: [] ? 
gfs2_quotad+0x0/0x2b0 [gfs2] Jan 8 10:57:57 filesrv2 kernel: [] kthread+0x96/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] child_rip+0xa/0x20 Jan 8 10:57:57 filesrv2 kernel: [] ? kthread+0x0/0xa0 Jan 8 10:57:57 filesrv2 kernel: [] ? child_rip+0x0/0x20 Jan 8 10:59:22 filesrv2 rgmanager[12557]: State change: clustsrv1 UP Cluster.conf File: Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: SATHYA - IT [mailto:sathyanarayanan.varadharajan at precisionit.co.in] Sent: Monday, January 09, 2012 11:21 AM To: 'Digimer'; 'linux clustering' Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Hi, Herewith attaching the /var/log/messages of both the servers. Yesterday (08th Jan) one of the server got fenced by other around 10:48 AM. I am also attaching the cluster.conf file for your reference. On the related note, related to heartbeat - I am referring the channel used by corosync. And the name which has been configured in cluster.conf file resolves with bond1 only. Related to the network card, we are using 2 dual port card where we configured 1 port from each for bond0 and 1 port from the other for bond1. So it doesn't seems be a network card related issue. Moreover, we are not having any errors related to bond0. Thanks Sathya Narayanan V Solution Architect This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From kkovachev at varna.net Mon Jan 9 11:08:25 2012 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 09 Jan 2012 13:08:25 +0200 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <4F0A6706.6090308@ucsc.edu> References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> <4F0A6706.6090308@ucsc.edu> Message-ID: <7b4965e95aef00d06ba7be68951fb79b@mx.varna.net> Hi, check /etc/sysconfig/cman maybe there is a different name present as NODENAME ... remove the file (if present) or try to create one as: #CMAN_CLUSTER_TIMEOUT=120 #CMAN_QUORUM_TIMEOUT=0 #CMAN_SHUTDOWN_TIMEOUT=60 FENCED_START_TIMEOUT=120 ##FENCE_JOIN=no #LOCK_FILE="/var/lock/subsys/cman" CLUSTERNAME=ClusterName NODENAME=NodeName On Sun, 08 Jan 2012 20:03:18 -0800, Wes Modes wrote: > The behavior of cman's resolving of cluster node names is less than > clear, as per the RHEL bugzilla report. > > The hostname and cluster.conf match, as does /etc/hosts and uname -n. > The short names and FQDN ping. I believe all the node cluster.conf are > in sync, and all nodes are accessible to each other using either short > or long names. > > You'll have to trust that I've tried everything obvious, and every > possible combination of FQDN and short names in cluster.conf and > hostname. 
That said, it is totally possible I missed something obvious. > > I suspect, there is something else going on and I don't know how to get > at it. > > Wes > > > On 1/6/2012 6:06 PM, Kevin Stanton wrote: >> >> > Hi, >> >> > I think CMAN expect that the names of the cluster nodes be the same >> returned by the command "uname -n". >> >> > For what you write your nodes hostnames are: test01.gdao.ucsc.edu >> and test02.gdao.ucsc.edu, but in cluster.conf you have declared only >> "test01" and "test02". >> >> >> >> I haven't found this to be the case in the past. I actually use a >> separate short name to reference each node which is different than the >> hostname of the server itself. All I've ever had to do is make sure >> it resolves correctly. You can do this either in DNS and/or in >> /etc/hosts. I have found that it's a good idea to do both in case >> your DNS server is a virtual machine and is not running for some >> reason. In that case with /etc/hosts you can still start cman. >> >> >> >> I would make sure whatever node names you use in the cluster.conf will >> resolve when you try to ping it from all nodes in the cluster. Also >> make sure your cluster.conf is in sync between all nodes. >> >> >> >> -Kevin >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> These servers are currently on the same host, but may not be in >> the future. They are in a vm cluster (though honestly, I'm not >> sure what this means yet). >> >> SElinux is on, but disabled. >> Firewalling through iptables is turned off via >> system-config-securitylevel >> >> There is no line currently in the cluster.conf that deals with >> multicasting. >> >> Any other suggestions? >> >> Wes >> >> On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: >> >> Hi, >> >> >> >> This servers is on VMware? At the same host? >> >> SElinux is disable? iptables have something? >> >> >> >> In my environment I had a problem to start GFS2 with servers in >> differents hosts. >> >> To clustering servers, was need migrate one server to the same >> host of the other, and restart this. >> >> >> >> I think, one of the problem was because the virtual switchs. >> >> To solve, I changed a multicast IP, to use 225.0.0.13 at >> cluster.conf >> >> >> >> And add a static route in both, to use default gateway. >> >> >> >> I don't know if it's correct, but this solve my problem. >> >> >> >> I hope that help you. >> >> >> >> Regards. >> >> >> >> On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes > > wrote: >> >> Hi, Steven. >> >> I've tried just about every possible combination of hostname and >> cluster.conf. >> >> ping to test01 resolves to 128.114.31.112 >> ping to test01.gdao.ucsc.edu >> resolves to 128.114.31.112 >> >> It feels like the right thing is being returned. This feels like it >> might be a quirk (or bug possibly) of cman or openais. >> >> There are some old bug reports around this, for example >> https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds >> like the >> way that cman reports this error is anything but straightforward. >> >> Is there anyone who has encountered this error and found a solution? >> >> Wes >> >> >> >> On 1/6/2012 2:00 AM, Steven Whitehouse wrote: >> > Hi, >> > >> > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >> >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS >> systems >> >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >> >> >> >> I keep running into the same problem despite many >> differently-flavored >> >> attempts to set up GFS. 
The problem comes when I try to start >> cman, the >> >> cluster management software. >> >> >> >> [root at test01]# service cman start >> >> Starting cluster: >> >> Loading modules... done >> >> Mounting configfs... done >> >> Starting ccsd... done >> >> Starting cman... failed >> >> cman not started: Can't find local node name in cluster.conf >> >> /usr/sbin/cman_tool: aisexec daemon didn't start >> >> >> [FAILED] >> >> >> > This looks like what it says... whatever the node name is in >> > cluster.conf, it doesn't exist when the name is looked up, or >> possibly >> > it does exist, but is mapped to the loopback address (it needs to >> map to >> > an address which is valid cluster-wide) >> > >> > Since your config files look correct, the next thing to check is >> > what >> > the resolver is actually returning. Try (for example) a ping to >> test01 >> > (you need to specify exactly the same form of the name as is used >> > in >> > cluster.conf) from test02 and see whether it uses the correct ip >> > address, just in case the wrong thing is being returned. >> > >> > Steve. >> > >> >> [root at test01]# tail /var/log/messages >> >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect to >> >> cluster infrastructure after 1193640 seconds. >> >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect to >> >> cluster infrastructure after 1193670 seconds. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >> >> Executive >> >> Service RELEASE 'subrev 1887 version 0.80.6' >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >> >> (C) >> >> 2002-2006 MontaVista Software, Inc and contributors. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >> >> (C) >> >> 2006 Red Hat, Inc. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >> >> Executive >> >> Service: started and ready to provide service. >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local >> node name >> >> "test01.gdao.ucsc.edu " not found >> in cluster.conf >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >> reading CCS >> >> info, cannot start >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >> >> reading >> >> config from CCS >> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >> >> Executive >> >> exiting (reason: could not read the main configuration file). >> >> >> >> Here are details of my configuration: >> >> >> >> [root at test01]# rpm -qa | grep cman >> >> cman-2.0.115-85.el5_7.2 >> >> >> >> [root at test01]# echo $HOSTNAME >> >> test01.gdao.ucsc.edu >> >> >> >> [root at test01]# hostname >> >> test01.gdao.ucsc.edu >> >> >> >> [root at test01]# cat /etc/hosts >> >> # Do not remove the following line, or various programs >> >> # that require network functionality will fail. 
>> >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >> >> >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >> >> >> 127.0.0.1 localhost.localdomain localhost >> >> ::1 localhost6.localdomain6 localhost6 >> >> >> >> [root at test01]# sestatus >> >> SELinux status: enabled >> >> SELinuxfs mount: /selinux >> >> Current mode: permissive >> >> Mode from config file: permissive >> >> Policy version: 21 >> >> Policy from config file: targeted >> >> >> >> [root at test01]# cat /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > >> ipaddr="gdvcenter.ucsc.edu " >> login="root" passwd="1hateAmazon.com" >> >> vmlogin="root" vmpasswd="esxpass" >> >> >> port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> I've seen much discussion of this problem, but no definitive >> solutions. >> >> Any help you can provide will be welcome. >> >> >> >> Wes Modes >> >> >> >> -- >> >> Linux-cluster mailing list >> >> Linux-cluster at redhat.com >> >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> >> >> -- >> Luiz Gustavo P Tonello. >> >> >> >> -- >> >> Linux-cluster mailing list >> >> Linux-cluster at redhat.com >> >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From klaus.steinberger at Physik.Uni-Muenchen.DE Mon Jan 9 12:37:44 2012 From: klaus.steinberger at Physik.Uni-Muenchen.DE (Klaus Steinberger) Date: Mon, 09 Jan 2012 13:37:44 +0100 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> References: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> Message-ID: <4F0ADF98.9030809@Physik.Uni-Muenchen.DE> Am 09.01.2012 11:43, schrieb SATHYA - IT: > Klaus, > > For your point the corosync network is not connected to the switch. They are > connected directly to the servers (server to server). Ahh, then the going down of the bond is probably not a sign of a network problem, it probably goes down when the other server is already down (fenced ?) Sincerly, Klaus -- Rechnerbetriebsgruppe / IT, Fakult?t f?r Physik Klaus Steinberger FAX: +49 89 28914280 Tel: +49 89 28914287 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0x7FC1E68A.asc Type: application/pgp-keys Size: 6692 bytes Desc: not available URL: From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 12:46:43 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 18:16:43 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0ADF98.9030809@Physik.Uni-Muenchen.DE> References: <003201cccebb$805852e0$8108f8a0$@precisionit.co.in> <4F0ADF98.9030809@Physik.Uni-Muenchen.DE> Message-ID: <000401cccecc$bfe3f660$3fabe320$@precisionit.co.in> Klaus, That is weird. 
If you refer the logs which I had posted earlier, the server initiate the fence only after this error message. And the network fail error message is only on one server and not sure how it is not reflecting in the other. The server which has the error message fences the other server. Moreover on the error message, the link is getting down and is back on within 2 seconds. Not sure where it leads to... Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Klaus Steinberger [mailto:klaus.steinberger at Physik.Uni-Muenchen.DE] Sent: Monday, January 09, 2012 6:08 PM To: SATHYA - IT Cc: 'Digimer'; 'linux clustering' Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment Am 09.01.2012 11:43, schrieb SATHYA - IT: > Klaus, > > For your point the corosync network is not connected to the switch. > They are connected directly to the servers (server to server). Ahh, then the going down of the bond is probably not a sign of a network problem, it probably goes down when the other server is already down (fenced ?) Sincerly, Klaus -- Rechnerbetriebsgruppe / IT, Fakult?t f?r Physik Klaus Steinberger FAX: +49 89 28914280 Tel: +49 89 28914287 This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 13:16:18 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 13:16:18 +0000 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0A79FA.7080408@alteeve.com> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> Message-ID: <4F0AE8A2.2010601@mssl.ucl.ac.uk> On 09/01/12 05:24, Digimer wrote: > With both of the bond's NICs down, the bond itself is going to drop. Odds are, both NICs are plugged into the same switch. (assuming the OP isn't running things plugged nic-nic - which I have found in the past tends to be flakey when N-way negotiation becomes involved) I'm assuming "heartbeat" - is a dedicated corosync (v)lan. To the OP: Please look at http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driver-howto.php and the descriptions of bonding there. The type of bond you want for this purpose is either LACP (mode 3) (if NICs are plugged into a single switch or switch stack which supports LACP) or Active Failover (mode 1) if separate switches are involved. Any other mode is potentially failure prone if things go wrong. FWIW: My heartbeat setup is as follows. 2 switches with a 4way LACP bond between them. 2 NICs on each cluster member in bonding mode 1, one NIC on each switch. 
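For reference, in the Linux bonding driver active-backup is mode 1 and 802.3ad/LACP is mode 4 (mode 3 is broadcast). A minimal sketch of an active-backup bond on RHEL 6 follows; the device names and address are purely illustrative, not taken from anyone's setup here:

/etc/sysconfig/network-scripts/ifcfg-bond1
  DEVICE=bond1
  ONBOOT=yes
  BOOTPROTO=none
  IPADDR=10.0.0.10
  NETMASK=255.255.255.0
  BONDING_OPTS="mode=1 miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth3   (and the same again for eth4)
  DEVICE=eth3
  MASTER=bond1
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

miimon=100 makes the driver poll link state every 100 ms; without some form of link monitoring the bond never notices a dead slave.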
This setup is resiliant against individual link (NIC, cable or fat fingers) OR switch failures. Switches used for this purpose are best completely isolated from the rest of the network and multicast traffic control should be DISABLED. Corosync can be set to failover to the public lan as a last resort but I've found it's not necessary - if things get bad enough that the private lan is completely out of action then the systems should shut themselves down (bad data is worse than zero data). Switch ports should be set "portfast" or whatever the non-cisco equivalent is, or else ~30 seconds will be wasted in checking that whatever's attached doesn't have a lan segment behind it. This can also lead to fencing. From sathyanarayanan.varadharajan at precisionit.co.in Mon Jan 9 13:23:42 2012 From: sathyanarayanan.varadharajan at precisionit.co.in (SATHYA - IT) Date: Mon, 9 Jan 2012 18:53:42 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0AE8A2.2010601@mssl.ucl.ac.uk> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> Message-ID: <000d01ccced1$eaac2a20$c0047e60$@precisionit.co.in> Alan, Corosync (heartbeat) network is not connected to switch. The network is connected between server to server directly. Thanks Sathya Narayanan V Solution Architect -----Original Message----- From: Alan Brown [mailto:ajb2 at mssl.ucl.ac.uk] Sent: Monday, January 09, 2012 6:46 PM To: linux clustering Cc: Digimer; SATHYA - IT Subject: Re: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment On 09/01/12 05:24, Digimer wrote: > With both of the bond's NICs down, the bond itself is going to drop. Odds are, both NICs are plugged into the same switch. (assuming the OP isn't running things plugged nic-nic - which I have found in the past tends to be flakey when N-way negotiation becomes involved) I'm assuming "heartbeat" - is a dedicated corosync (v)lan. To the OP: Please look at http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driver -howto.php and the descriptions of bonding there. The type of bond you want for this purpose is either LACP (mode 3) (if NICs are plugged into a single switch or switch stack which supports LACP) or Active Failover (mode 1) if separate switches are involved. Any other mode is potentially failure prone if things go wrong. FWIW: My heartbeat setup is as follows. 2 switches with a 4way LACP bond between them. 2 NICs on each cluster member in bonding mode 1, one NIC on each switch. This setup is resiliant against individual link (NIC, cable or fat fingers) OR switch failures. Switches used for this purpose are best completely isolated from the rest of the network and multicast traffic control should be DISABLED. Corosync can be set to failover to the public lan as a last resort but I've found it's not necessary - if things get bad enough that the private lan is completely out of action then the systems should shut themselves down (bad data is worse than zero data). Switch ports should be set "portfast" or whatever the non-cisco equivalent is, or else ~30 seconds will be wasted in checking that whatever's attached doesn't have a lan segment behind it. This can also lead to fencing. This communication may contain confidential information. If you are not the intended recipient it may be unlawful for you to read, copy, distribute, disclose or otherwise use the information contained within this communication.. 
Errors and Omissions may occur in the contents of this Email arising out of or in connection with data transmission, network malfunction or failure, machine or software error, malfunction, or operator errors by the person who is sending the email. Precision Group accepts no responsibility for any such errors or omissions. The information, views and comments within this communication are those of the individual and not necessarily those of Precision Group. All email that is sent from/to Precision Group is scanned for the presence of computer viruses, security issues and inappropriate content. However, it is the recipient's responsibility to check any attachments for viruses before use. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 13:27:10 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 13:27:10 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AB505.2020402@redhat.com> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> Message-ID: <4F0AEB2E.2060203@mssl.ucl.ac.uk> On 09/01/12 09:36, Fabio M. Di Nitto wrote: >> RH's advice to use is to "Big Bang" it. > > It?s not much of an advice, as RH does not officially support this > upgrade method. Indeed, but scheduling downtime in a 24*7*365.254 operation like space science ftp servers is tricky. (1: You can't please everyone all the time and they all believe their priorities are of earth-shattering importance. 2: You can't schedule downtime during nights or vacation periods as the people concerned tend to decide this is the best time to run heavy duty batch processing that's due first thing Monday morning.) > The amount of changes in the cluster software between EL5 and EL6 are a > lot less intrusive at system level. I can?t really say for sure for the > entire OS, since the upgrade doesn?t involve only RHCS. Aye. In this case the boxes are ONLY used as NFS fileservers because running anything else on them which touched the GFS(2) FSes resulted in file corruption (which is a case of "NFS vs everything else", more than clustering itself.) It would be _nice_ to have NFSv4 support working and supported in a GFS2 cluster. It's really a pity Ken Olsen refused to opensource VMS (and OSF1) all those years ago. They had this stuff working along time ago and there's a lot of wheel reinvention going on. :( AB From raju.rajsand at gmail.com Mon Jan 9 13:33:03 2012 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 9 Jan 2012 19:03:03 +0530 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <4F0AE8A2.2010601@mssl.ucl.ac.uk> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> Message-ID: Greetings, On Mon, Jan 9, 2012 at 6:46 PM, Alan Brown wrote: > On 09/01/12 05:24, Digimer wrote: > > > Switches used for this purpose are best completely isolated from the rest of > the network and multicast traffic control should be DISABLED. > I distinctly remember asking the network guys Multicast mode to be on for the Heartbeat network (for the clusters that I have built). This is BIG change I suppose from 5.x That was about couple years ago. -- Regards, Rajagopal From fdinitto at redhat.com Mon Jan 9 13:34:18 2012 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Mon, 09 Jan 2012 14:34:18 +0100 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AEB2E.2060203@mssl.ucl.ac.uk> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> Message-ID: <4F0AECDA.9060402@redhat.com> On 1/9/2012 2:27 PM, Alan Brown wrote: > On 09/01/12 09:36, Fabio M. Di Nitto wrote: > >>> RH's advice to use is to "Big Bang" it. >> >> It?s not much of an advice, as RH does not officially support this >> upgrade method. > > Indeed, but scheduling downtime in a 24*7*365.254 operation like space > science ftp servers is tricky. (1: You can't please everyone all the > time and they all believe their priorities are of earth-shattering > importance. 2: You can't schedule downtime during nights or vacation > periods as the people concerned tend to decide this is the best time to > run heavy duty batch processing that's due first thing Monday morning.) Yeah you are not telling me anything new :) Something i forgot to mention in the other email, is that for example, you can just move the LUNs from your SAN from one cluster to another assuming you are running GFS2 and that will work. So in theory the downtime would be reduced to just stop old cluster -> rewire the SAN -> start new cluster. > >> The amount of changes in the cluster software between EL5 and EL6 are a >> lot less intrusive at system level. I can?t really say for sure for the >> entire OS, since the upgrade doesn?t involve only RHCS. > > Aye. > > In this case the boxes are ONLY used as NFS fileservers because running > anything else on them which touched the GFS(2) FSes resulted in file > corruption (which is a case of "NFS vs everything else", more than > clustering itself.) > Possibly this is one of the use case where upgrading could work. > It would be _nice_ to have NFSv4 support working and supported in a GFS2 > cluster. Steven can answer to this one.. but I think the point is more active/active vs active/passive (IIRC from previous discussions). Fabio From ajb2 at mssl.ucl.ac.uk Mon Jan 9 14:04:25 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 14:04:25 +0000 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AECDA.9060402@redhat.com> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> Message-ID: <4F0AF3E9.9060907@mssl.ucl.ac.uk> On 09/01/12 13:34, Fabio M. Di Nitto wrote: > Something i forgot to mention in the other email, is that for example, > you can just move the LUNs from your SAN from one cluster to another > assuming you are running GFS2 and that will work. And assuming that you have 2 clusters. This might be a possiblity shortly. >> It would be _nice_ to have NFSv4 support working and supported in a GFS2 >> cluster. > > Steven can answer to this one.. but I think the point is more > active/active vs active/passive (IIRC from previous discussions). We break up NFS serving into one service (ip) per FS. Any given FS is only served from one node because NFSv3 doesnt play nicely with anything else, including other instances of itself. Bringing all the NFS services all onto one node is perfectly possible but it's still a bunch of individual services. Running all NFS on one box turns into a choke point several times/day due to the loads involved. The protocol just doesn't scale very well. 
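To make the one-IP-per-filesystem pattern concrete, here is a stripped-down rgmanager service stanza as a sketch only; the service name, address, volume and export paths are all illustrative:

  <service name="nfs-fs1" autostart="1" domain="fsdomain1" recovery="relocate">
    <ip address="10.0.1.21" monitor_link="1"/>
    <clusterfs name="fs1" device="/dev/vg01/fs1" mountpoint="/export/fs1"
               fstype="gfs2" force_unmount="0">
      <nfsexport name="fs1-export">
        <nfsclient name="fs1-clients" target="*" options="rw,sync"/>
      </nfsexport>
    </clusterfs>
  </service>

Each filesystem gets its own copy of this with its own IP, so any one export can be relocated or failed over on its own without touching the others.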
From ajb2 at mssl.ucl.ac.uk Mon Jan 9 14:07:21 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 14:07:21 +0000 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: <000d01ccced1$eaac2a20$c0047e60$@precisionit.co.in> References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> <000d01ccced1$eaac2a20$c0047e60$@precisionit.co.in> Message-ID: <4F0AF499.5020208@mssl.ucl.ac.uk> On 09/01/12 13:23, SATHYA - IT wrote: > Alan, > > Corosync (heartbeat) network is not connected to switch. The network is > connected between server to server directly. See my comment about direct hookups. My experience is that they are prone to playing up for no apparent reason (NICs simply aren't designed or tested well enough for this kind of connection mode) Managed Gb Switches are pretty cheap compared to the hours you'll waste trying to make it go. Put a couple in between the servers. From ajb2 at mssl.ucl.ac.uk Mon Jan 9 14:14:46 2012 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 09 Jan 2012 14:14:46 +0000 Subject: [Linux-cluster] rhel 6.2 network bonding interface in cluster environment In-Reply-To: References: <003901ccce8d$55a9be40$00fd3ac0$@precisionit.co.in> <4F0A79FA.7080408@alteeve.com> <4F0AE8A2.2010601@mssl.ucl.ac.uk> Message-ID: <4F0AF656.7070708@mssl.ucl.ac.uk> On 09/01/12 13:33, Rajagopal Swaminathan wrote: >> Switches used for this purpose are best completely isolated from the rest of >> the network and multicast traffic control should be DISABLED. >> > > I distinctly remember asking the network guys Multicast mode to be on > for the Heartbeat network (for the clusters that I have built). You need multicast. What you don't want, is any form of filtering based on packet rates (broadcast/multicast rate limiting). This gets in the way. I can't emphasise enough that the heartbeat equipment is best separated from everything else. A spanning tree rebuild initiated elsewhere in the LAN may be enough to cause an outage long enough to generate a fence event (Ethernet fabric switching is spreading, but spanning tree will be around for quite a while yet) From wmodes at ucsc.edu Mon Jan 9 15:57:14 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Mon, 09 Jan 2012 07:57:14 -0800 Subject: [Linux-cluster] GFS on CentOS - cman unable to start In-Reply-To: <7b4965e95aef00d06ba7be68951fb79b@mx.varna.net> References: <4F075BD3.3090702@ucsc.edu> <60c71479-72a3-47aa-a91d-fa1c91c3e9ef@lisa.itlinux.cl> <4F0A6706.6090308@ucsc.edu> <7b4965e95aef00d06ba7be68951fb79b@mx.varna.net> Message-ID: <4F0B0E5A.40401@ucsc.edu> Thanks, Kaloyan. Now we're talking. This is something I hadn't already tried yet. I will try it as soon as I get in. Wes On 1/9/2012 3:08 AM, Kaloyan Kovachev wrote: > Hi, > check /etc/sysconfig/cman maybe there is a different name present as > NODENAME ... remove the file (if present) or try to create one as: > > #CMAN_CLUSTER_TIMEOUT=120 > #CMAN_QUORUM_TIMEOUT=0 > #CMAN_SHUTDOWN_TIMEOUT=60 > FENCED_START_TIMEOUT=120 > ##FENCE_JOIN=no > #LOCK_FILE="/var/lock/subsys/cman" > CLUSTERNAME=ClusterName > NODENAME=NodeName > > > On Sun, 08 Jan 2012 20:03:18 -0800, Wes Modes wrote: >> The behavior of cman's resolving of cluster node names is less than >> clear, as per the RHEL bugzilla report. >> >> The hostname and cluster.conf match, as does /etc/hosts and uname -n. >> The short names and FQDN ping. 
I believe all the node cluster.conf are >> in sync, and all nodes are accessible to each other using either short >> or long names. >> >> You'll have to trust that I've tried everything obvious, and every >> possible combination of FQDN and short names in cluster.conf and >> hostname. That said, it is totally possible I missed something obvious. >> >> I suspect, there is something else going on and I don't know how to get >> at it. >> >> Wes >> >> >> On 1/6/2012 6:06 PM, Kevin Stanton wrote: >>>> Hi, >>>> I think CMAN expect that the names of the cluster nodes be the same >>> returned by the command "uname -n". >>> >>>> For what you write your nodes hostnames are: test01.gdao.ucsc.edu >>> and test02.gdao.ucsc.edu, but in cluster.conf you have declared only >>> "test01" and "test02". >>> >>> >>> >>> I haven't found this to be the case in the past. I actually use a >>> separate short name to reference each node which is different than the >>> hostname of the server itself. All I've ever had to do is make sure >>> it resolves correctly. You can do this either in DNS and/or in >>> /etc/hosts. I have found that it's a good idea to do both in case >>> your DNS server is a virtual machine and is not running for some >>> reason. In that case with /etc/hosts you can still start cman. >>> >>> >>> >>> I would make sure whatever node names you use in the cluster.conf will >>> resolve when you try to ping it from all nodes in the cluster. Also >>> make sure your cluster.conf is in sync between all nodes. >>> >>> >>> >>> -Kevin >>> >>> >>> >>> >>> >>> > ------------------------------------------------------------------------ >>> These servers are currently on the same host, but may not be in >>> the future. They are in a vm cluster (though honestly, I'm not >>> sure what this means yet). >>> >>> SElinux is on, but disabled. >>> Firewalling through iptables is turned off via >>> system-config-securitylevel >>> >>> There is no line currently in the cluster.conf that deals with >>> multicasting. >>> >>> Any other suggestions? >>> >>> Wes >>> >>> On 1/6/2012 12:05 PM, Luiz Gustavo Tonello wrote: >>> >>> Hi, >>> >>> >>> >>> This servers is on VMware? At the same host? >>> >>> SElinux is disable? iptables have something? >>> >>> >>> >>> In my environment I had a problem to start GFS2 with servers in >>> differents hosts. >>> >>> To clustering servers, was need migrate one server to the same >>> host of the other, and restart this. >>> >>> >>> >>> I think, one of the problem was because the virtual switchs. >>> >>> To solve, I changed a multicast IP, to use 225.0.0.13 at >>> cluster.conf >>> >>> >>> >>> And add a static route in both, to use default gateway. >>> >>> >>> >>> I don't know if it's correct, but this solve my problem. >>> >>> >>> >>> I hope that help you. >>> >>> >>> >>> Regards. >>> >>> >>> >>> On Fri, Jan 6, 2012 at 5:01 PM, Wes Modes >> > wrote: >>> >>> Hi, Steven. >>> >>> I've tried just about every possible combination of hostname and >>> cluster.conf. >>> >>> ping to test01 resolves to 128.114.31.112 >>> ping to test01.gdao.ucsc.edu >>> resolves to 128.114.31.112 >>> >>> It feels like the right thing is being returned. This feels like > it >>> might be a quirk (or bug possibly) of cman or openais. >>> >>> There are some old bug reports around this, for example >>> https://bugzilla.redhat.com/show_bug.cgi?id=488565. It sounds >>> like the >>> way that cman reports this error is anything but straightforward. >>> >>> Is there anyone who has encountered this error and found a > solution? 
>>> Wes >>> >>> >>> >>> On 1/6/2012 2:00 AM, Steven Whitehouse wrote: >>> > Hi, >>> > >>> > On Thu, 2012-01-05 at 13:54 -0800, Wes Modes wrote: >>> >> Howdy, y'all. I'm trying to set up GFS in a cluster on CentOS >>> systems >>> >> running on vmWare. The GFS FS is on a Dell Equilogic SAN. >>> >> >>> >> I keep running into the same problem despite many >>> differently-flavored >>> >> attempts to set up GFS. The problem comes when I try to start >>> cman, the >>> >> cluster management software. >>> >> >>> >> [root at test01]# service cman start >>> >> Starting cluster: >>> >> Loading modules... done >>> >> Mounting configfs... done >>> >> Starting ccsd... done >>> >> Starting cman... failed >>> >> cman not started: Can't find local node name in cluster.conf >>> >> /usr/sbin/cman_tool: aisexec daemon didn't start >>> >> >>> [FAILED] >>> >> >>> > This looks like what it says... whatever the node name is in >>> > cluster.conf, it doesn't exist when the name is looked up, or >>> possibly >>> > it does exist, but is mapped to the loopback address (it needs to >>> map to >>> > an address which is valid cluster-wide) >>> > >>> > Since your config files look correct, the next thing to check is >>> > what >>> > the resolver is actually returning. Try (for example) a ping to >>> test01 >>> > (you need to specify exactly the same form of the name as is used >>> > in >>> > cluster.conf) from test02 and see whether it uses the correct ip >>> > address, just in case the wrong thing is being returned. >>> > >>> > Steve. >>> > >>> >> [root at test01]# tail /var/log/messages >>> >> Jan 5 13:39:40 testbench06 ccsd[13194]: Unable to connect > to >>> >> cluster infrastructure after 1193640 seconds. >>> >> Jan 5 13:40:10 testbench06 ccsd[13194]: Unable to connect > to >>> >> cluster infrastructure after 1193670 seconds. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >>> >> Executive >>> >> Service RELEASE 'subrev 1887 version 0.80.6' >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >>> >> (C) >>> >> 2002-2006 MontaVista Software, Inc and contributors. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Copyright >>> >> (C) >>> >> 2006 Red Hat, Inc. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >>> >> Executive >>> >> Service: started and ready to provide service. >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] local >>> node name >>> >> "test01.gdao.ucsc.edu " not found >>> in cluster.conf >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >>> reading CCS >>> >> info, cannot start >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] Error >>> >> reading >>> >> config from CCS >>> >> Jan 5 13:40:24 testbench06 openais[3939]: [MAIN ] AIS >>> >> Executive >>> >> exiting (reason: could not read the main configuration file). >>> >> >>> >> Here are details of my configuration: >>> >> >>> >> [root at test01]# rpm -qa | grep cman >>> >> cman-2.0.115-85.el5_7.2 >>> >> >>> >> [root at test01]# echo $HOSTNAME >>> >> test01.gdao.ucsc.edu >>> >> >>> >> [root at test01]# hostname >>> >> test01.gdao.ucsc.edu >>> >> >>> >> [root at test01]# cat /etc/hosts >>> >> # Do not remove the following line, or various programs >>> >> # that require network functionality will fail. 
>>> >> 128.114.31.112 test01 test01.gdao test01.gdao.ucsc.edu >>> >>> >> 128.114.31.113 test02 test02.gdao test02.gdao.ucsc.edu >>> >>> >> 127.0.0.1 localhost.localdomain localhost >>> >> ::1 localhost6.localdomain6 localhost6 >>> >> >>> >> [root at test01]# sestatus >>> >> SELinux status: enabled >>> >> SELinuxfs mount: /selinux >>> >> Current mode: permissive >>> >> Mode from config file: permissive >>> >> Policy version: 21 >>> >> Policy from config file: targeted >>> >> >>> >> [root at test01]# cat /etc/cluster/cluster.conf >>> >> >>> >> >>> >> post_join_delay="120"/> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >> >> ipaddr="gdvcenter.ucsc.edu " >>> login="root" passwd="1hateAmazon.com" >>> >> vmlogin="root" vmpasswd="esxpass" >>> >> >>> > port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"/> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> I've seen much discussion of this problem, but no definitive >>> solutions. >>> >> Any help you can provide will be welcome. >>> >> >>> >> Wes Modes >>> >> >>> >> -- >>> >> Linux-cluster mailing list >>> >> Linux-cluster at redhat.com >>> >> https://www.redhat.com/mailman/listinfo/linux-cluster >>> > >>> > -- >>> > Linux-cluster mailing list >>> > Linux-cluster at redhat.com >>> > https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> >>> >>> -- >>> Luiz Gustavo P Tonello. >>> >>> >>> >>> -- >>> >>> Linux-cluster mailing list >>> >>> Linux-cluster at redhat.com >>> >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >>> >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rajendra.roka at pacificmags.com.au Mon Jan 9 21:57:29 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Tue, 10 Jan 2012 08:57:29 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster Message-ID: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> I am having issue with mysql service in RHEL6.2 cluster. 
While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: #68: Failed to start service:mysql; return value: 1 Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: Stopping service service:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6180]: Stopping Service mysql:mysql Jan 10 08:47:55 atp-wwdev1 rgmanager[6202]: Checking Existence Of File /var/run/cluster/mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 08:47:55 atp-wwdev1 rgmanager[6224]: Stopping Service mysql:mysql > Succeed Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: Service service:mysql is recovering Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: #71: Relocating failed service service:mysql Jan 10 08:47:59 atp-wwdev1 rgmanager[1842]: Service service:mysql is stopped Can you please help me with the above problem? My cluster.conf file is follows: /etc/my.cnf file is follows: [mysqld] datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/cluster/mysql.pid Thanks Raj Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From td3201 at gmail.com Mon Jan 9 22:36:31 2012 From: td3201 at gmail.com (Terry) Date: Mon, 9 Jan 2012 16:36:31 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0AF3E9.9060907@mssl.ucl.ac.uk> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> Message-ID: So here's what I have done so far: 1. Created new cluster based on RHEL 6. 2. Created resources and services from scratch to match that in the old cluster (fsid, mount points, everything). I am using Congra (luci/ricci) just to ensure I am using the right syntax. 3. Gave access to storage volumes (iscsi) to new cluster node 4. pvscan/vgscan/lvscan 5. Disabled NFS services on old cluster 6. Enabled the NFS services on the new cluster That's it. Life's good for the volumes on the cluster. I am yet to transfer my postgres stuff but I am moving from 8.3 to 9.0 so that will be a new volume and postgres installation so nothing exciting there. 
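For reference, the storage handover in steps 3 through 6 boils down to something like the sketch below. This is only an illustration: the iSCSI portal address, volume group name, and service name are placeholders, not the actual values from this cluster.

# on the new RHEL 6 node: discover and log in to the existing iSCSI volumes
iscsiadm -m discovery -t sendtargets -p 192.168.1.10   # placeholder portal
iscsiadm -m node --login

# re-read LVM metadata so the old cluster's volume groups become visible
pvscan && vgscan && lvscan
vgchange -ay vg_nfsdata                                # placeholder VG name

# hand the service over: disable it on the old cluster, enable it on the new one
clusvcadm -d nfs-svc    # run on an old-cluster node
clusvcadm -e nfs-svc    # run on a new-cluster node

Because the filesystems are plain ext3/4 rather than a cluster filesystem, the important point is that only one cluster has the service (and therefore the mount) enabled at any time.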
On Mon, Jan 9, 2012 at 8:04 AM, Alan Brown wrote: > On 09/01/12 13:34, Fabio M. Di Nitto wrote: > > Something i forgot to mention in the other email, is that for example, >> you can just move the LUNs from your SAN from one cluster to another >> assuming you are running GFS2 and that will work. >> > > And assuming that you have 2 clusters. This might be a possiblity shortly. > > > It would be _nice_ to have NFSv4 support working and supported in a GFS2 >>> cluster. >>> >> >> Steven can answer to this one.. but I think the point is more >> active/active vs active/passive (IIRC from previous discussions). >> > > We break up NFS serving into one service (ip) per FS. > > Any given FS is only served from one node because NFSv3 doesnt play nicely > with anything else, including other instances of itself. > > Bringing all the NFS services all onto one node is perfectly possible but > it's still a bunch of individual services. > > Running all NFS on one box turns into a choke point several times/day due > to the loads involved. The protocol just doesn't scale very well. > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/**mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbruna at it-linux.cl Mon Jan 9 22:34:22 2012 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Mon, 09 Jan 2012 19:34:22 -0300 (CLST) Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> Message-ID: Has the mysql user the permissions to write on the /var/run/cluster directory? ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > I am having issue with mysql service in RHEL6.2 cluster. While > starting service I receive the following error in /var/log/message > Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service > mysql:mysql > Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service > mysql:mysql > Failed - Timeout Error > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" > returned 1 (generic error) > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: #68: Failed to start > service:mysql; return value: 1 > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: Stopping service > service:mysql > Jan 10 08:47:54 atp-wwdev1 rgmanager[6180]: Stopping Service > mysql:mysql > Jan 10 08:47:55 atp-wwdev1 rgmanager[6202]: Checking Existence Of > File /var/run/cluster/mysql.pid [mysql:mysql] > Failed - File > Doesn't Exist > Jan 10 08:47:55 atp-wwdev1 rgmanager[6224]: Stopping Service > mysql:mysql > Succeed > Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: Service service:mysql is > recovering > Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: #71: Relocating failed > service service:mysql > Jan 10 08:47:59 atp-wwdev1 rgmanager[1842]: Service service:mysql is > stopped > Can you please help me with the above problem? 
> My cluster.conf file is follows: > > > > > > > > > > > > > > > > > > restricted="0"> > > > > > > > host="10.26.240.190" mountpoint="/var/lib/mysql" name="filesystem" > no_unmount="on"/> > name="MySQL server" shutdown_wait="2" startup_wait="0"/> > > name="access_ip" recovery="relocate"> > > > name="mysql" recovery="relocate"> > name="mysql" shutdown_wait="2" startup_wait="2"/> > > name="storage" recovery="relocate"> > > > > post_join_delay="3"/> > > > > > > > /etc/my.cnf file is follows: > [mysqld] > datadir=/var/lib/mysql > socket=/var/lib/mysql/mysql.sock > user=mysql > # Disabling symbolic-links is recommended to prevent assorted > security risks > symbolic-links=0 > [mysqld_safe] > log-error=/var/log/mysqld.log > pid-file=/var/run/cluster/mysql.pid > Thanks > Raj > Important Notice: > This message and its attachments are confidential and may contain > information which is protected by copyright. It is intended solely > for the named addressee. If you are not the authorised recipient (or > responsible for delivery of the message to the authorised > recipient), you must not use, disclose, print, copy or deliver this > message or its attachments to anyone. If you receive this email in > error, please contact the sender immediately and permanently delete > this message and its attachments from your system. > Any content of this message and its attachments that does not relate > to the official business of Pacific Magazines Pty Limited must be > taken not to have been sent or endorsed by it. No representation is > made that this email or its attachments are without defect or that > the contents express views other than those of the sender. > Please consider the environment - do you really need to print this > email? > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From rajendra.roka at pacificmags.com.au Mon Jan 9 23:09:25 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Tue, 10 Jan 2012 10:09:25 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B64@nsw-mmp-exch1.snl.7net.com.au> Yes it has. drwx--x--x. 3 mysql root 4096 Jan 9 13:45 cluster Thanks From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patricio A. Bruna Sent: Tuesday, 10 January 2012 9:34 AM To: linux clustering Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster Has the mysql user the permissions to write on the /var/run/cluster directory? ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ________________________________ I am having issue with mysql service in RHEL6.2 cluster. 
While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: #68: Failed to start service:mysql; return value: 1 Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: Stopping service service:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6180]: Stopping Service mysql:mysql Jan 10 08:47:55 atp-wwdev1 rgmanager[6202]: Checking Existence Of File /var/run/cluster/mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 08:47:55 atp-wwdev1 rgmanager[6224]: Stopping Service mysql:mysql > Succeed Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: Service service:mysql is recovering Jan 10 08:47:55 atp-wwdev1 rgmanager[1842]: #71: Relocating failed service service:mysql Jan 10 08:47:59 atp-wwdev1 rgmanager[1842]: Service service:mysql is stopped Can you please help me with the above problem? My cluster.conf file is follows: /etc/my.cnf file is follows: [mysqld] datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/cluster/mysql.pid Thanks Raj Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2893 bytes Desc: image001.png URL: From rmitchel at redhat.com Mon Jan 9 23:12:08 2012 From: rmitchel at redhat.com (Ryan Mitchell) Date: Tue, 10 Jan 2012 09:12:08 +1000 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <4F0B7448.9060008@redhat.com> On 01/10/2012 07:57 AM, Roka, Rajendra wrote: > > *I am having issue with mysql service in RHEL6.2 cluster. While > starting service I receive the following error in /var/log/message* > > Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql > > Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service > mysql:mysql > Failed - Timeout Error > > Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" > returned 1 (generic error) > > I'm pretty sure the first problem is that mysql doesn't start before the script times out. All subsequent errors are trying to clean up from the failed start and can be ignored. There won't be a pid file if the service did not start or if it was cleanly shut down outside of rgmanager. Try increasing the startup_wait (something large until you find its successful, like 60). Its currently waiting 2 seconds. Also, I don't think you want to have the VIP and the service that uses it (mysql) in different services. They should be in the same service, because they always have to run on the same node (they aren't independent). Same goes for the filesystem resource if that is required by MYSQL. Perhaps something like the following?: Lastly, you have created a fence device but you haven't assigned it to the nodes so they currently have no fencing devices. Make sure you do that and test fencing before doing anything important with this cluster. Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rajendra.roka at pacificmags.com.au Tue Jan 10 00:48:35 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Tue, 10 Jan 2012 11:48:35 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <4F0B7448.9060008@redhat.com> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> <4F0B7448.9060008@redhat.com> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> I have changed the resources and service in cluster.conf as follows: But no luck with the following message: Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service service:mysql Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address 10.26.240.95/24 to eth0 Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" returned 1 (generic error) Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start service:mysql; return value: 1 Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: Stopping service service:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5742]: Stopping Service mysql:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5764]: Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 11:44:02 atp-wwdev1 rgmanager[5786]: Stopping Service mysql:mysql > Succeed Jan 10 11:44:02 atp-wwdev1 rgmanager[5837]: Removing IPv4 address 10.26.240.95/24 from eth0 Jan 10 11:44:04 atp-wwdev1 rgmanager[5924]: unmounting /var/lib/mysql Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: Service service:mysql is recovering Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: #71: Relocating failed service service:mysql Jan 10 11:45:14 atp-wwdev1 rgmanager[1690]: Service service:mysql is stopped Also changed the my.conf to: pid-file=/var/run/cluster/mysql/mysql.pid Cheers From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan Mitchell Sent: Tuesday, 10 January 2012 10:12 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster On 01/10/2012 07:57 AM, Roka, Rajendra wrote: I am having issue with mysql service in RHEL6.2 cluster. While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) I'm pretty sure the first problem is that mysql doesn't start before the script times out. All subsequent errors are trying to clean up from the failed start and can be ignored. There won't be a pid file if the service did not start or if it was cleanly shut down outside of rgmanager. Try increasing the startup_wait (something large until you find its successful, like 60). Its currently waiting 2 seconds. Also, I don't think you want to have the VIP and the service that uses it (mysql) in different services. They should be in the same service, because they always have to run on the same node (they aren't independent). Same goes for the filesystem resource if that is required by MYSQL. 
Perhaps something like the following?: Lastly, you have created a fence device but you haven't assigned it to the nodes so they currently have no fencing devices. Make sure you do that and test fencing before doing anything important with this cluster. Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Tue Jan 10 00:59:39 2012 From: linux at alteeve.com (Digimer) Date: Mon, 09 Jan 2012 19:59:39 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> Message-ID: <4F0B8D7B.7030502@alteeve.com> On 01/09/2012 05:36 PM, Terry wrote: > So here's what I have done so far: > 1. Created new cluster based on RHEL 6. > 2. Created resources and services from scratch to match that in the old > cluster (fsid, mount points, everything). I am using Congra (luci/ricci) > just to ensure I am using the right syntax. > 3. Gave access to storage volumes (iscsi) to new cluster node > 4. pvscan/vgscan/lvscan > 5. Disabled NFS services on old cluster > 6. Enabled the NFS services on the new cluster > > That's it. Life's good for the volumes on the cluster. I am yet to > transfer my postgres stuff but I am moving from 8.3 to 9.0 so that will > be a new volume and postgres installation so nothing exciting there. Thanks for reporting back. I'm glad to hear it worked out well. Did you have to change your gfs part to gfs2? -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From Gert.Wieberdink at enovation.nl Tue Jan 10 11:12:08 2012 From: Gert.Wieberdink at enovation.nl (Gert Wieberdink) Date: Tue, 10 Jan 2012 12:12:08 +0100 Subject: [Linux-cluster] (no subject) Message-ID: <8634845864125D4D9B397A3E598995980C9497F45A@MBX.emd.enovation.net> RHCS/GFS2 support team, I would like to inform you about a serious GFS2 problem we encountered last week. Please find a detailed description below. I have enclosed a tarfile containing detailed information about this problem. Description Two-node cluster is used as a test cluster without any load. Only functionality is tested, no performance tests. The RHCS services that run on this cluster are rather standard services. 
In a 2-day timeframe we had two occurrences of this problem which were both very similar. On the 2nd node, a Perl script tried to write some info to a file on the GFS2 filesystem, but the process hung at that time. From the GFS2 lockdump info we saw one W-lock associated with an inode and it turned out that the inode was a directory on GFS2. Every command executed on that file (eg. ls -l) or on this directory resulted in a hang of that process (eg. du ). The processes that hung all had the D-state (uninterruptable sleep). However, from the 1st node all files and directories were accessible without any problem. Even ls -lR executed on the 1st node from top of the GFS2 filesystem traversed the full directory tree without problems. We suspect that the offending directory has got a W-lock and that there is no lock owner anymore. So, it does not look like a 'global' file system hang, but it seems to to be a local problem on the 2nd node, where the major part of the GFS2 is also accessible from the 2nd node, except the dir with the lock. Needless to say that this causes the application to be unavailable. We are unable to reproduce the problem. 1st occurrence. After collecting information, we rebooted the 2nd node and after the reboot it joined the 1st node in the cluster without any problem. 2nd occurrence. This happened 2 days later in the same way on the same node. After collecting information, we now also ran gfs2_fsck on the GFS2 filesystem before letting it join the cluster. No errors, orphans, corruption was reported. After the fsck we started the cluster software on the 2nd node and the 2nd node joined the cluster without any problem. Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info, etc.) was collected in a tarball (enov_additional_info.tar). Additional information in additional_info.tar - enov_clusterinfo_app2.txt.gz containing - /etc/cluster.conf - gfs2_hangalyzer output from 2nd node - cman_tool - group_tool < -v, dump, dump fence, dump gfs2> - ccs_tool - openais-cfgtool -s - clustat -fl - Process status information of all processes - gfs2_tool gettune /gfsdata - enov_sysrq-t_app2.txt.gz - enov_glocks_app2.txt.gz - enov_debugfs_dlm_app2.tar.gz Contains compressed tarball of dlm directory from debugfs filesystem from 2nd node. Environment 2-node cluster running CentOS 5.7, with RedHat Cluster Suite and GFS2. Latest updates for OS and RHCS/GFS2 (as per Jan 8, 2012) are installed. Kernel version 2.6.18-274.12.1.el5PAE. One GFS2 filesystem (20G) on HP/LeftHand Networks iSCSI SAN volume. iSCSI initiator version 6.2.0.872-10.el5. Thanking you in advance for your cooperation. If you need additional information to help to solve this problem, please let me know. With kind regards, G. Wieberdink Sr. Engineer at E.Novation gert.wieberdink at enovation.nl -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: enov_additional_info.tar Type: application/x-tar Size: 102400 bytes Desc: enov_additional_info.tar URL: From td3201 at gmail.com Tue Jan 10 15:04:13 2012 From: td3201 at gmail.com (Terry) Date: Tue, 10 Jan 2012 09:04:13 -0600 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: <4F0B8D7B.7030502@alteeve.com> References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> <4F0B8D7B.7030502@alteeve.com> Message-ID: On Mon, Jan 9, 2012 at 6:59 PM, Digimer wrote: > On 01/09/2012 05:36 PM, Terry wrote: > > So here's what I have done so far: > > 1. Created new cluster based on RHEL 6. > > 2. Created resources and services from scratch to match that in the old > > cluster (fsid, mount points, everything). I am using Congra (luci/ricci) > > just to ensure I am using the right syntax. > > 3. Gave access to storage volumes (iscsi) to new cluster node > > 4. pvscan/vgscan/lvscan > > 5. Disabled NFS services on old cluster > > 6. Enabled the NFS services on the new cluster > > > > That's it. Life's good for the volumes on the cluster. I am yet to > > transfer my postgres stuff but I am moving from 8.3 to 9.0 so that will > > be a new volume and postgres installation so nothing exciting there. > > Thanks for reporting back. I'm glad to hear it worked out well. Did you > have to change your gfs part to gfs2? > > -- > Digimer > E-Mail: digimer at alteeve.com > Freenode handle: digimer > Papers and Projects: http://alteeve.com > Node Assassin: http://nodeassassin.org > "omg my singularity battery is dead again. > stupid hawking radiation." - epitron > I am not using GFS. All ext3/4. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Tue Jan 10 15:52:18 2012 From: linux at alteeve.com (Digimer) Date: Tue, 10 Jan 2012 10:52:18 -0500 Subject: [Linux-cluster] centos5 to RHEL6 migration In-Reply-To: References: <4F0A532A.2000202@alteeve.com> <4F0AAACE.7080602@mssl.ucl.ac.uk> <4F0AB505.2020402@redhat.com> <4F0AEB2E.2060203@mssl.ucl.ac.uk> <4F0AECDA.9060402@redhat.com> <4F0AF3E9.9060907@mssl.ucl.ac.uk> <4F0B8D7B.7030502@alteeve.com> Message-ID: <4F0C5EB2.9060209@alteeve.com> > I am not using GFS. All ext3/4. Well then, that would make it easy to deal with, I suppose. :P -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From dkelson at gurulabs.com Wed Jan 11 19:42:12 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Wed, 11 Jan 2012 12:42:12 -0700 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands Message-ID: <1326310932.4540.11.camel@mentor.gurulabs.com> There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, it supports SPC-3 compliant persistent reservations so that it can be used with fence_scsi. I encountered a bug in the iSCSI target (an easy workaround is available) and it was very helpful to see the actual commands that fence_scsi was running. I now have a fully working 3 node RHEL6.2 cluster with a Fedora 16 iSCSI target with working SCSI fencing. Please consider applying this patch so, that if logging is enabled, the actual command being run will be logged as well. 
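As background for this kind of debugging, the SCSI-3 reservation state that fence_scsi manipulates can be inspected by hand with sg_persist. A rough sketch follows; the device path is a placeholder, and on a target affected by the bug mentioned below the larger allocation length from the workaround may be needed on these commands as well.

# list the keys registered on the shared LUN (one per cluster node)
sg_persist --in --read-keys --device=/dev/sdb

# show the current reservation holder and reservation type
sg_persist --in --read-reservation --device=/dev/sdb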
Dax Kelson Guru Labs Workaround details -- the bug should be fixed when the scatterlist conversion is completed by Andy Grover, but for now modifying the allocation length used by the sg_persist commands to 512 by adding '-l 512' to the sg_persist command lines is the workaround. --- fence_scsi.org 2012-01-11 12:27:52.234042483 -0700 +++ fence_scsi 2012-01-10 18:09:34.301813562 -0700 @@ -208,7 +208,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -245,7 +245,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err) (cmd=$cmd)"); return ($err); } @@ -265,7 +265,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err, cmd=$cmd)"); return ($err); } @@ -285,7 +285,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -305,7 +305,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -325,7 +325,7 @@ # log_error ("$self (err=$err)"); # } - log_debug ("$self (err=$err)"); + log_debug ("$self (err=$err cmd=$cmd)"); return ($err); } @@ -342,7 +342,7 @@ ## note that it is not necessarily an error is $err is non-zero, ## so just log the device and status and continue. - log_debug ("$self (dev=$dev, status=$err)"); + log_debug ("$self (dev=$dev, status=$err, cmd=$cmd)"); return ($err); } @@ -425,7 +425,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -447,7 +447,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -479,7 +479,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -576,7 +576,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); @@ -602,7 +602,7 @@ my $err = ($?>>8); if ($err != 0) { - log_error ("$self (err=$err)"); + log_error ("$self (err=$err cmd=$cmd)"); } # die "[error]: $self\n" if ($?>>8); From florian at hastexo.com Wed Jan 11 20:32:21 2012 From: florian at hastexo.com (Florian Haas) Date: Wed, 11 Jan 2012 21:32:21 +0100 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands In-Reply-To: <1326310932.4540.11.camel@mentor.gurulabs.com> References: <1326310932.4540.11.camel@mentor.gurulabs.com> Message-ID: On Wed, Jan 11, 2012 at 8:42 PM, Dax Kelson wrote: > There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, > it supports SPC-3 compliant persistent reservations so that it can be > used with fence_scsi. "Unlike tgt"? I thought tgt does support PR since its 1.0 release. In fact I seem to recall that implementing PR was what prompted Tomo to move to 1.0. Are you saying that tgt targets don't work with fence_iscsi? Cheers, Florian -- Need help with High Availability? 
http://www.hastexo.com/now From dkelson at gurulabs.com Wed Jan 11 20:43:59 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Wed, 11 Jan 2012 13:43:59 -0700 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands In-Reply-To: References: <1326310932.4540.11.camel@mentor.gurulabs.com> Message-ID: <1326314639.4540.19.camel@mentor.gurulabs.com> On Wed, 2012-01-11 at 21:32 +0100, Florian Haas wrote: > On Wed, Jan 11, 2012 at 8:42 PM, Dax Kelson wrote: > > There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, > > it supports SPC-3 compliant persistent reservations so that it can be > > used with fence_scsi. > > "Unlike tgt"? I thought tgt does support PR since its 1.0 release. In > fact I seem to recall that implementing PR was what prompted Tomo to > move to 1.0. Are you saying that tgt targets don't work with > fence_iscsi? > > Cheers, > Florian My understanding is that tgt has support for PR but not the PR_OUT_PREEMPT_AND_ABORT service action necessary for I/O fencing. Maybe this has changed in the last year. Dax Kelson Guru Labs From florian at hastexo.com Wed Jan 11 21:19:01 2012 From: florian at hastexo.com (Florian Haas) Date: Wed, 11 Jan 2012 22:19:01 +0100 Subject: [Linux-cluster] [PATCH] fence_scsi log actual commands In-Reply-To: <1326314639.4540.19.camel@mentor.gurulabs.com> References: <1326310932.4540.11.camel@mentor.gurulabs.com> <1326314639.4540.19.camel@mentor.gurulabs.com> Message-ID: On Wed, Jan 11, 2012 at 9:43 PM, Dax Kelson wrote: > On Wed, 2012-01-11 at 21:32 +0100, Florian Haas wrote: >> On Wed, Jan 11, 2012 at 8:42 PM, Dax Kelson wrote: >> > There is a new Linux iSCSI target in the Linux kernel 3.1. Unlike tgt, >> > it supports SPC-3 compliant persistent reservations so that it can be >> > used with fence_scsi. >> >> "Unlike tgt"? I thought tgt does support PR since its 1.0 release. In >> fact I seem to recall that implementing PR was what prompted Tomo to >> move to 1.0. Are you saying that tgt targets don't work with >> fence_iscsi? >> >> Cheers, >> Florian > > My understanding is that tgt has support for PR but not the > PR_OUT_PREEMPT_AND_ABORT service action necessary for I/O fencing. Ah, that sounds about right (iirc). Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now From rajendra.roka at pacificmags.com.au Thu Jan 12 03:11:01 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Thu, 12 Jan 2012 14:11:01 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au><4F0B7448.9060008@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> Any more suggestions on this? 
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Roka, Rajendra Sent: Tuesday, 10 January 2012 11:49 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster I have changed the resources and service in cluster.conf as follows: But no luck with the following message: Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service service:mysql Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address 10.26.240.95/24 to eth0 Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" returned 1 (generic error) Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start service:mysql; return value: 1 Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: Stopping service service:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5742]: Stopping Service mysql:mysql Jan 10 11:44:02 atp-wwdev1 rgmanager[5764]: Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't Exist Jan 10 11:44:02 atp-wwdev1 rgmanager[5786]: Stopping Service mysql:mysql > Succeed Jan 10 11:44:02 atp-wwdev1 rgmanager[5837]: Removing IPv4 address 10.26.240.95/24 from eth0 Jan 10 11:44:04 atp-wwdev1 rgmanager[5924]: unmounting /var/lib/mysql Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: Service service:mysql is recovering Jan 10 11:44:04 atp-wwdev1 rgmanager[1690]: #71: Relocating failed service service:mysql Jan 10 11:45:14 atp-wwdev1 rgmanager[1690]: Service service:mysql is stopped Also changed the my.conf to: pid-file=/var/run/cluster/mysql/mysql.pid Cheers From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan Mitchell Sent: Tuesday, 10 January 2012 10:12 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster On 01/10/2012 07:57 AM, Roka, Rajendra wrote: I am having issue with mysql service in RHEL6.2 cluster. While starting service I receive the following error in /var/log/message Jan 10 08:47:52 atp-wwdev1 rgmanager[6015]: Starting Service mysql:mysql Jan 10 08:47:54 atp-wwdev1 rgmanager[6096]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 08:47:54 atp-wwdev1 rgmanager[1842]: start on mysql "mysql" returned 1 (generic error) I'm pretty sure the first problem is that mysql doesn't start before the script times out. All subsequent errors are trying to clean up from the failed start and can be ignored. There won't be a pid file if the service did not start or if it was cleanly shut down outside of rgmanager. Try increasing the startup_wait (something large until you find its successful, like 60). Its currently waiting 2 seconds. Also, I don't think you want to have the VIP and the service that uses it (mysql) in different services. They should be in the same service, because they always have to run on the same node (they aren't independent). Same goes for the filesystem resource if that is required by MYSQL. Perhaps something like the following?: Lastly, you have created a fence device but you haven't assigned it to the nodes so they currently have no fencing devices. Make sure you do that and test fencing before doing anything important with this cluster. 
Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rmitchel at redhat.com Thu Jan 12 04:00:39 2012 From: rmitchel at redhat.com (Ryan Mitchell) Date: Thu, 12 Jan 2012 14:00:39 +1000 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au><4F0B7448.9060008@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> Message-ID: <4F0E5AE7.1080607@redhat.com> On 01/12/2012 01:11 PM, Roka, Rajendra wrote: > > Any more suggestions on this? 
> According to the new log, it still timed out after 60 seconds, so either that wasn't long enough either, or there is a misconfiguration and the database can't start because of it: > ** > > Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node > > Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service > service:mysql > > Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address > 10.26.240.95/24 to eth0 > > Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql > > Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service > mysql:mysql > Failed - Timeout Error > > Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" > returned 1 (generic error) > > Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start > service:mysql; return value: 1 > What does it say in your mysql log? The resource script runs the command to start the database and then waits for it to return success. It waited 60 seconds, and hadn't received any notice that the database started or not, so it gave up. Look in the logs to see if there is any indication as to why the database won't start. It could be because you have the wrong configuration in /etc/my.cnf, no permissions on some critical directories, or the resource script is misconfigured. Also, you should investigate whether you can manually start the database (after mounting the NFS mount and adding the VIP of course) outside of cluster (and compare working and failing mysql logs). Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tc3driver at gmail.com Thu Jan 12 04:20:01 2012 From: tc3driver at gmail.com (Bill G.) Date: Wed, 11 Jan 2012 20:20:01 -0800 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <4F0E5AE7.1080607@redhat.com> References: <508450C6AB960E4299CA64E838597F6202F22B63@nsw-mmp-exch1.snl.7net.com.au> <4F0B7448.9060008@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B66@nsw-mmp-exch1.snl.7net.com.au> <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> <4F0E5AE7.1080607@redhat.com> Message-ID: Really dumb question... do you have mysql installed? What happens when you try to start mysql stand alone? Is mysql already running? Is there anything in /var/log/messages? anything in the mysql logs? On Wed, Jan 11, 2012 at 8:00 PM, Ryan Mitchell wrote: > ** > On 01/12/2012 01:11 PM, Roka, Rajendra wrote: > > Any more suggestions on this?**** > > According to the new log, it still timed out after 60 seconds, so either > that wasn't long enough either, or there is a misconfiguration and the > database can't start because of it: > > ** > > Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node *** > * > > Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service > service:mysql**** > > Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address > 10.26.240.95/24 to eth0**** > > Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql** > ** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > > Failed - Timeout Error**** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" > returned 1 (generic error)**** > > Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start > service:mysql; return value: 1**** > > > What does it say in your mysql log? 
The resource script runs the command > to start the database and then waits for it to return success. It waited > 60 seconds, and hadn't received any notice that the database started or > not, so it gave up. > > Look in the logs to see if there is any indication as to why the database > won't start. It could be because you have the wrong configuration in > /etc/my.cnf, no permissions on some critical directories, or the resource > script is misconfigured. Also, you should investigate whether you can > manually start the database (after mounting the NFS mount and adding the > VIP of course) outside of cluster (and compare working and failing mysql > logs). > > > Regards, > > Ryan Mitchell > Software Maintenance Engineer > Support Engineering Group > Red Hat, Inc. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Thanks, Bill G. tc3driver at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajendra.roka at pacificmags.com.au Thu Jan 12 04:39:43 2012 From: rajendra.roka at pacificmags.com.au (Roka, Rajendra) Date: Thu, 12 Jan 2012 15:39:43 +1100 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <4F0E5AE7.1080607@redhat.com> References: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> <4F0E5AE7.1080607@redhat.com> Message-ID: <508450C6AB960E4299CA64E838597F6202F22B6B@nsw-mmp-exch1.snl.7net.com.au> Yes it starts if I do manually: [root at atp-wwdev1 ~]# mount -t nfs 10.26.240.190:/nfs/mysql /var/lib/mysql/ [root at atp-wwdev1 ~]# /etc/init.d/mysqld start Starting mysqld: [ OK ] [root at atp-wwdev1 ~]# cat /var/log/mysqld.log 120112 15:28:57 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 120112 15:28:58 InnoDB: Started; log sequence number 0 44233 120112 15:28:58 [Note] Event Scheduler: Loaded 0 events 120112 15:28:58 [Note] /usr/libexec/mysqld: ready for connections. Version: '5.1.52' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution [root at atp-wwdev1 ~]# /etc/init.d/mysqld stop Stopping mysqld: [ OK ] root at atp-wwdev1 ~]# cat /var/log/mysqld.log 120112 15:29:39 [Note] /usr/libexec/mysqld: Normal shutdown 120112 15:29:39 [Note] Event Scheduler: Purging the queue. 0 events 120112 15:29:39 InnoDB: Starting shutdown... 120112 15:29:43 InnoDB: Shutdown completed; log sequence number 0 44233 120112 15:29:43 [Note] /usr/libexec/mysqld: Shutdown complete But if I start with cluster, it doesnot give any error message in /var/log/mysqld.log Once again my cluster.conf is follows: And my.cnf is follows: [mysqld] datadir=/var/lib/mysql socket=/var/lib/mysql/mysql.sock user=mysql # Disabling symbolic-links is recommended to prevent assorted security risks symbolic-links=0 [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/cluster/mysql/mysql.pid If you need any more info, please let me know. Thanks From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan Mitchell Sent: Thursday, 12 January 2012 3:01 PM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster On 01/12/2012 01:11 PM, Roka, Rajendra wrote: Any more suggestions on this? 
According to the new log, it still timed out after 60 seconds, so either that wasn't long enough either, or there is a misconfiguration and the database can't start because of it: Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service service:mysql Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address 10.26.240.95/24 to eth0 Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > Failed - Timeout Error Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" returned 1 (generic error) Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start service:mysql; return value: 1 What does it say in your mysql log? The resource script runs the command to start the database and then waits for it to return success. It waited 60 seconds, and hadn't received any notice that the database started or not, so it gave up. Look in the logs to see if there is any indication as to why the database won't start. It could be because you have the wrong configuration in /etc/my.cnf, no permissions on some critical directories, or the resource script is misconfigured. Also, you should investigate whether you can manually start the database (after mounting the NFS mount and adding the VIP of course) outside of cluster (and compare working and failing mysql logs). Regards, Ryan Mitchell Software Maintenance Engineer Support Engineering Group Red Hat, Inc. Important Notice: This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. Please consider the environment - do you really need to print this email? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tc3driver at gmail.com Thu Jan 12 05:17:49 2012 From: tc3driver at gmail.com (Bill G.) Date: Wed, 11 Jan 2012 21:17:49 -0800 Subject: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster In-Reply-To: <508450C6AB960E4299CA64E838597F6202F22B6B@nsw-mmp-exch1.snl.7net.com.au> References: <508450C6AB960E4299CA64E838597F6202F22B6A@nsw-mmp-exch1.snl.7net.com.au> <4F0E5AE7.1080607@redhat.com> <508450C6AB960E4299CA64E838597F6202F22B6B@nsw-mmp-exch1.snl.7net.com.au> Message-ID: Ok more dumb things... In the past I have had problems bringing up VIPs that have the subnet mask bits in the address try changing this line: to this Also remove it from the ip ref= tag as well... Then try starting the service. also it may be easier to enable debug logging to help figure out what is going on with the service... but I am betting the change to the ip will probably work. 
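On the debug-logging suggestion, one way to see exactly where the start hangs is to drive the service through rg_test, which runs each resource agent in the foreground and prints every step. A rough sketch, assuming the service is still named "mysql" and that rgmanager is not simultaneously trying to manage it on that node:

# show how rgmanager assembles the resource tree from cluster.conf
rg_test test /etc/cluster/cluster.conf

# start, then stop, the whole service in the foreground, step by step
rg_test test /etc/cluster/cluster.conf start service mysql
rg_test test /etc/cluster/cluster.conf stop service mysql

# rgmanager's own log is also worth tailing during a clustered start attempt
tail -f /var/log/cluster/rgmanager.log

The foreground output should make it clear whether the agent is stuck waiting on the pid file, on the listen address, or on something else entirely.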
HTH, Bill On Wed, Jan 11, 2012 at 8:39 PM, Roka, Rajendra < rajendra.roka at pacificmags.com.au> wrote: > *Yes it starts if I do manually:* > > ** ** > > [root at atp-wwdev1 ~]# mount -t nfs 10.26.240.190:/nfs/mysql /var/lib/mysql/ > **** > > [root at atp-wwdev1 ~]# /etc/init.d/mysqld start**** > > Starting mysqld: [ OK ]**** > > ** ** > > [root at atp-wwdev1 ~]# cat /var/log/mysqld.log**** > > 120112 15:28:57 mysqld_safe Starting mysqld daemon with databases from > /var/lib/mysql**** > > 120112 15:28:58 InnoDB: Started; log sequence number 0 44233**** > > 120112 15:28:58 [Note] Event Scheduler: Loaded 0 events**** > > 120112 15:28:58 [Note] /usr/libexec/mysqld: ready for connections.**** > > Version: '5.1.52' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source > distribution**** > > ** ** > > [root at atp-wwdev1 ~]# /etc/init.d/mysqld stop**** > > Stopping mysqld: [ OK ]**** > > root at atp-wwdev1 ~]# cat /var/log/mysqld.log**** > > 120112 15:29:39 [Note] /usr/libexec/mysqld: Normal shutdown**** > > 120112 15:29:39 [Note] Event Scheduler: Purging the queue. 0 events**** > > 120112 15:29:39 InnoDB: Starting shutdown...**** > > 120112 15:29:43 InnoDB: Shutdown completed; log sequence number 0 44233** > ** > > 120112 15:29:43 [Note] /usr/libexec/mysqld: Shutdown complete**** > > ** ** > > *But if I start with cluster, it doesnot give any error message in > /var/log/mysqld.log* > > * * > > *Once again my cluster.conf is follows:* > > **** > > **** > > **** > > *** > * > > **** > > **** > > **** > > votes="1">**** > > **** > > **** > > **** > > **** > > **** > > **** > > **** > > **** > > **** > > ordered="1" restricted="0">**** > > **** > > **** > > **** > > **** > > **** > > sleeptime="2"/>**** > > listen_address="10.26.24.95" name="mysql" shutdown_wait="60" > startup_wait="60"/>**** > > fstype="nfs" host="10.26.240.190" mountpoint="/var/lib/mysql" > name="storage" no_unmount="on"/>**** > > **** > > name="mysql" recovery="relocate">**** > > **** > > **** > > **** > > **** > > **** > > post_join_delay="3"/>**** > > **** > > **** > > **** > > **** > > **** > > **** > > ** ** > > *And my.cnf is follows:* > > [mysqld]**** > > datadir=/var/lib/mysql**** > > socket=/var/lib/mysql/mysql.sock**** > > user=mysql**** > > # Disabling symbolic-links is recommended to prevent assorted security > risks**** > > symbolic-links=0**** > > ** ** > > [mysqld_safe]**** > > log-error=/var/log/mysqld.log**** > > pid-file=/var/run/cluster/mysql/mysql.pid**** > > ** ** > > If you need any more info, please let me know.**** > > ** ** > > Thanks**** > > ** ** > > ** ** > > ** ** > > ** ** > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Ryan Mitchell > *Sent:* Thursday, 12 January 2012 3:01 PM > > *To:* linux-cluster at redhat.com > *Subject:* Re: [Linux-cluster] Issue with mysql service in RHEL6.2 cluster > **** > > ** ** > > On 01/12/2012 01:11 PM, Roka, Rajendra wrote: **** > > Any more suggestions on this?**** > > According to the new log, it still timed out after 60 seconds, so either > that wasn't long enough either, or there is a misconfiguration and the > database can't start because of it: > > **** > > Jan 10 11:42:57 atp-wwdev1 modcluster: Starting service: mysql on node *** > * > > Jan 10 11:42:57 atp-wwdev1 rgmanager[1690]: Starting stopped service > service:mysql**** > > Jan 10 11:42:58 atp-wwdev1 rgmanager[5252]: Adding IPv4 address > 10.26.240.95/24 to eth0**** > > Jan 10 11:43:01 atp-wwdev1 rgmanager[5401]: Starting Service mysql:mysql** > 
** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[5657]: Starting Service mysql:mysql > > Failed - Timeout Error**** > > Jan 10 11:44:01 atp-wwdev1 rgmanager[1690]: start on mysql "mysql" > returned 1 (generic error)**** > > Jan 10 11:44:02 atp-wwdev1 rgmanager[1690]: #68: Failed to start > service:mysql; return value: 1**** > > > What does it say in your mysql log? The resource script runs the command > to start the database and then waits for it to return success. It waited > 60 seconds, and hadn't received any notice that the database started or > not, so it gave up. > > Look in the logs to see if there is any indication as to why the database > won't start. It could be because you have the wrong configuration in > /etc/my.cnf, no permissions on some critical directories, or the resource > script is misconfigured. Also, you should investigate whether you can > manually start the database (after mounting the NFS mount and adding the > VIP of course) outside of cluster (and compare working and failing mysql > logs). > > Regards, > > Ryan Mitchell > Software Maintenance Engineer > Support Engineering Group > Red Hat, Inc.**** > > Important Notice: > This message and its attachments are confidential and may contain information which is protected by copyright. It is intended solely for the named addressee. If you are not the authorised recipient (or responsible for delivery of the message to the authorised recipient), you must not use, disclose, print, copy or deliver this message or its attachments to anyone. If you receive this email in error, please contact the sender immediately and permanently delete this message and its attachments from your system. > Any content of this message and its attachments that does not relate to the official business of Pacific Magazines Pty Limited must be taken not to have been sent or endorsed by it. No representation is made that this email or its attachments are without defect or that the contents express views other than those of the sender. > > Please consider the environment - do you really need to print this email? > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Thanks, Bill G. tc3driver at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From scooter at cgl.ucsf.edu Thu Jan 12 22:50:43 2012 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Thu, 12 Jan 2012 14:50:43 -0800 Subject: [Linux-cluster] GFS2 mounts taking a *very* long time Message-ID: <4F0F63C3.1010309@cgl.ucsf.edu> Greetings all, We've got a 4 node cluster running RHEL 6.2. As part of the cluster, we've got several gfs2 filesystem. We've often noticed that when we reboot a single node in the cluster, the gfs2 mounts take a long time -- eventually getting the 120 second delay messages. When we migrated to 6.2, the default mount script echoed the filesystem being mounted, and we discovered that the long delays were filesystem-dependent. In particular, two filesystems were causing all of the problems, both of which had >1M files in them. We also noticed that dlm_recoverd on one of the other nodes accumulates a lot of time when this is happening. Is this expected? Are there non-ilnear handshaking algorithms between the mounting node and the cluster that are dependent on the number of files? Thanks in advance! -- scooter From pbruna at it-linux.cl Thu Jan 12 23:40:13 2012 From: pbruna at it-linux.cl (Patricio A. 
Bruna) Date: Thu, 12 Jan 2012 20:40:13 -0300 (CLST) Subject: [Linux-cluster] GFS2 mounts taking a *very* long time In-Reply-To: <4F0F63C3.1010309@cgl.ucsf.edu> Message-ID: <593e1c15-bfa2-4f0b-9053-8b34248d67db@lisa.itlinux.cl> Hi scooter, Logs would be welcome ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > Greetings all, > We've got a 4 node cluster running RHEL 6.2. As part of the > cluster, we've got several gfs2 filesystem. We've often noticed that > when we reboot a single node in the cluster, the gfs2 mounts take a > long > time -- eventually getting the 120 second delay messages. When we > migrated to 6.2, the default mount script echoed the filesystem being > mounted, and we discovered that the long delays were > filesystem-dependent. In particular, two filesystems were causing all > of the problems, both of which had >1M files in them. We also noticed > that dlm_recoverd on one of the other nodes accumulates a lot of time > when this is happening. Is this expected? Are there non-ilnear > handshaking algorithms between the mounting node and the cluster that > are dependent on the number of files? > Thanks in advance! > -- scooter > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From scooter at cgl.ucsf.edu Fri Jan 13 00:26:05 2012 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Thu, 12 Jan 2012 16:26:05 -0800 Subject: [Linux-cluster] GFS2 mounts taking a *very* long time In-Reply-To: <593e1c15-bfa2-4f0b-9053-8b34248d67db@lisa.itlinux.cl> References: <593e1c15-bfa2-4f0b-9053-8b34248d67db@lisa.itlinux.cl> Message-ID: <4F0F7A1D.8070601@cgl.ucsf.edu> Hi Patricio, Sure thing -- which logs would help? I don't think the kernel logs would be of much use, and when the dlm_recoverd process is going it doesn't log anything, so it's not clear what would be useful, here. -- scooter On 01/12/2012 03:40 PM, Patricio A. Bruna wrote: > Hi scooter, > Logs would be welcome > > ------------------------------------ > Patricio Bruna V. > IT Linux Ltda. > www.it-linux.cl > Twitter > Fono : (+56-2) 333 0578 > M?vil: (+56-9) 8899 6618 > > > > ------------------------------------------------------------------------ > > Greetings all, > We've got a 4 node cluster running RHEL 6.2. As part of the > cluster, we've got several gfs2 filesystem. We've often noticed that > when we reboot a single node in the cluster, the gfs2 mounts take > a long > time -- eventually getting the 120 second delay messages. When we > migrated to 6.2, the default mount script echoed the filesystem being > mounted, and we discovered that the long delays were > filesystem-dependent. In particular, two filesystems were causing > all > of the problems, both of which had >1M files in them. We also > noticed > that dlm_recoverd on one of the other nodes accumulates a lot of time > when this is happening. Is this expected? Are there non-ilnear > handshaking algorithms between the mounting node and the cluster that > are dependent on the number of files? > > Thanks in advance! 
> > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 2893 bytes Desc: not available URL: From zheka at uvt.cz Fri Jan 13 00:52:04 2012 From: zheka at uvt.cz (Yevheniy Demchenko) Date: Fri, 13 Jan 2012 02:52:04 +0200 Subject: [Linux-cluster] GFS2 mounts taking a *very* long time In-Reply-To: <4F0F63C3.1010309@cgl.ucsf.edu> References: <4F0F63C3.1010309@cgl.ucsf.edu> Message-ID: <49D5F414-AFB6-49CF-A02B-B80BDFDB6F89@uvt.cz> Hi! This patched version of dlm will probably resolve your issue, please try it. http://www.bosson.eu/temp/dlm-kmod-1.0-1.el6.src.rpm See detailed description in the list earlier ( Subject: [Linux-cluster] [PATCH] dlm: faster dlm recovery ) And yes, mounts and umounts with unpatched dlm are proportional to N*N, where N is a number of files. Sincerely, Yevheniy Demchenko On Jan 13, 2012, at 00:50 , Scooter Morris wrote: > Greetings all, > We've got a 4 node cluster running RHEL 6.2. As part of the cluster, we've got several gfs2 filesystem. We've often noticed that when we reboot a single node in the cluster, the gfs2 mounts take a long time -- eventually getting the 120 second delay messages. When we migrated to 6.2, the default mount script echoed the filesystem being mounted, and we discovered that the long delays were filesystem-dependent. In particular, two filesystems were causing all of the problems, both of which had >1M files in them. We also noticed that dlm_recoverd on one of the other nodes accumulates a lot of time when this is happening. Is this expected? Are there non-ilnear handshaking algorithms between the mounting node and the cluster that are dependent on the number of files? > > Thanks in advance! > > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From wmodes at ucsc.edu Fri Jan 13 20:30:13 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Fri, 13 Jan 2012 12:30:13 -0800 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN Message-ID: <4F109455.5090607@ucsc.edu> I have some general clustered filesystem questions for you. I'm wading through the confusing and often contradictory web sources RE clustering. I struggled through the initial setup of the GFS software, and am now working to create a shared GFS disk. But all of this brings up some general questions: 1) First several online sources have pointed me to the Microsoft Clustered Filesystem doc to set up my linux clustered FSs on vmWare. Though it deals with MSCS, I can see that it has some applicability. However, I have yet to find a step-by-step guide to linux clustered filesystems. Is there a better suited document to guide me thorough the process of creating shared filesystems on CentOS/RHEL on vmWare across boxes? 2) Is it necessary to create a private network for access to the shared filesystem as the MSCS doc suggests? 3) So far I've been looking at GFS because it is native to CentOS/RHEL. Is there a better non-commercial/free choice? 
4) Is there a clustered filesystem method that supports vmWare HA? This is important to us. 5) Seems there at least three different methods to set up GFS (using parted, using lvmconf, and using iSCSI). If I go with GFS, which method should I use? Clustering seems to have a steep learning curve, but I'm laboriously climbing the slope! Thanks for your help. Wes Modes UCSC Library ITS Programmer/Analyst From linux at alteeve.com Fri Jan 13 20:44:28 2012 From: linux at alteeve.com (Digimer) Date: Fri, 13 Jan 2012 15:44:28 -0500 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> References: <4F109455.5090607@ucsc.edu> Message-ID: <4F1097AC.6030409@alteeve.com> On 01/13/2012 03:30 PM, Wes Modes wrote: > I have some general clustered filesystem questions for you. I'm wading > through the confusing and often contradictory web sources RE > clustering. I struggled through the initial setup of the GFS software, > and am now working to create a shared GFS disk. But all of this brings > up some general questions: > > 1) First several online sources have pointed me to the Microsoft > Clustered Filesystem doc to set up my linux clustered FSs on vmWare. > Though it deals with MSCS, I can see that it has some applicability. > However, I have yet to find a step-by-step guide to linux clustered > filesystems. Is there a better suited document to guide me thorough the > process of creating shared filesystems on CentOS/RHEL on vmWare across > boxes? > > 2) Is it necessary to create a private network for access to the shared > filesystem as the MSCS doc suggests? > > 3) So far I've been looking at GFS because it is native to > CentOS/RHEL. Is there a better non-commercial/free choice? > > 4) Is there a clustered filesystem method that supports vmWare HA? > This is important to us. > > 5) Seems there at least three different methods to set up GFS (using > parted, using lvmconf, and using iSCSI). If I go with GFS, which method > should I use? > > Clustering seems to have a steep learning curve, but I'm laboriously > climbing the slope! Thanks for your help. > > Wes Modes > UCSC Library ITS > Programmer/Analyst Hi Wes, I can't speak to windows or VMWare as I have near-null experience with both. So allow me to speak in general terms; 1. GFS2 is my preferred clustered file system, but it requires distributed locking as provided by DLM, which is part of the Red Hat Cluster Suite. 2. A private storage network is not required, but it is usually a good idea simply because of how much traffic storage uses and how easy it is to saturate a link and cause problems for other network stuff. 3. No. OCFS2 is the only other clustered file system I am aware of, and it's under the control of Oracle. I shall say no more. 4. I'm not familiar with what requirements VMWare HA has. Can you elaborate? In short though, all nodes take common storage and mount them as local partitions/filesystems. Once done, your GFS2 partition is, effectively, just another file system. 5. The storage layer and the file system should be independent of one another. So long as the back-end storage presents said storage as raw space to the nodes, GFS2 and the cluster shouldn't care. As for managing that storage... That is effectively up to you. Personally, I like to use clustered LVM on the raw storage, then create my GFS2 file system on an LV. Of course, you can put the file system directly on the raw storage and forego cLVM. 
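For illustration only, a bare-bones version of that cLVM-then-GFS2 layering might look like the sketch below. The device, VG/LV and cluster names are placeholders, and cman (membership, DLM and fencing) has to be up before clvmd:

lvmconf --enable-cluster                  # switch lvm.conf to cluster-wide locking
service cman start                        # membership, DLM and fencing first
service clvmd start                       # then the clustered LVM daemon
pvcreate /dev/sdX                         # the raw shared LUN (placeholder device)
vgcreate shared_vg /dev/sdX
vgcreate: wait, use lvcreate next           # (comment removed)
lvcreate -n shared_lv -L 500G shared_vg
mkfs.gfs2 -p lock_dlm -t mycluster:shared_fs -j 2 /dev/shared_vg/shared_lv   # one journal per node
mount -t gfs2 /dev/shared_vg/shared_lv /mnt/shared                           # repeat on every node

The value passed to -t has to match the cluster name in cluster.conf; a mismatch there is one of the more common reasons the later mount step fails.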
I'm not sure how much this will help, given you want to use VMWare, but I've got a tutorial that, among other steps, walks you through setting up the base cluster , fencing (which is *required* for *any* shared storage) and configuring and using the clustered LVM and GFS2 tools; https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From linux at alteeve.com Fri Jan 13 20:45:59 2012 From: linux at alteeve.com (Digimer) Date: Fri, 13 Jan 2012 15:45:59 -0500 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> References: <4F109455.5090607@ucsc.edu> Message-ID: <4F109807.6070507@alteeve.com> I forgot to mention; Friendly cluster folks can be found on freenode at #linux-cluster :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From dkelson at gurulabs.com Fri Jan 13 20:50:13 2012 From: dkelson at gurulabs.com (Dax Kelson) Date: Fri, 13 Jan 2012 13:50:13 -0700 Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> References: <4F109455.5090607@ucsc.edu> Message-ID: <1326487813.3314.8.camel@mentor.gurulabs.com> A few comments below. On Fri, 2012-01-13 at 12:30 -0800, Wes Modes wrote: > I have some general clustered filesystem questions for you. I'm wading > > 1) First several online sources have pointed me to the Microsoft > Clustered Filesystem doc to set up my linux clustered FSs on vmWare. > Though it deals with MSCS, I can see that it has some applicability. > However, I have yet to find a step-by-step guide to linux clustered > filesystems. Is there a better suited document to guide me thorough the > process of creating shared filesystems on CentOS/RHEL on vmWare across > boxes? http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/index.html http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html > 2) Is it necessary to create a private network for access to the shared > filesystem as the MSCS doc suggests? Required? No. > 3) So far I've been looking at GFS because it is native to > CentOS/RHEL. Is there a better non-commercial/free choice? Not really. Probably the next most popular is OCFS2. > 4) Is there a clustered filesystem method that supports vmWare HA? > This is important to us. Not sure what you mean. Do you mean fencing? > 5) Seems there at least three different methods to set up GFS (using > parted, using lvmconf, and using iSCSI). If I go with GFS, which method > should I use? GFS2 requires shared storage such as SAN, iSCSI or DRBD. Pick one. >From the RH docs, "While a GFS2 file system may be used outside of LVM, Red Hat supports only GFS2 file systems that are created on a CLVM logical volume." On RHEL6 and clones, clvmd requires cman. GFS2 requires fencing for safety and reliability suitable for production. Dax Kelson Guru Labs From pbruna at it-linux.cl Fri Jan 13 22:26:56 2012 From: pbruna at it-linux.cl (Patricio A. 
Bruna) Date: Fri, 13 Jan 2012 19:26:56 -0300 (CLST) Subject: [Linux-cluster] Clustered filesystem questions for shared storage on CentOS/vmWare/SAN In-Reply-To: <4F109455.5090607@ucsc.edu> Message-ID: <2578fc6c-f9cf-489f-ba18-13a8ee665bad@lisa.itlinux.cl> Hi, I used to use GFS for a while, but it has several requirements and some make it very inflexible. These days i'm all for GlusterFS (www.glusterfs.org) Gluster is a distributed filesystem, recently adquired by Red Hat. Gluster provides the main benefit of GFS, Cluster Filesystem, but without so many constrains. ------------------------------------ Patricio Bruna V. IT Linux Ltda. www.it-linux.cl Twitter Fono : (+56-2) 333 0578 M?vil: (+56-9) 8899 6618 ----- Mensaje original ----- > I have some general clustered filesystem questions for you. I'm > wading > through the confusing and often contradictory web sources RE > clustering. I struggled through the initial setup of the GFS > software, > and am now working to create a shared GFS disk. But all of this > brings > up some general questions: > 1) First several online sources have pointed me to the Microsoft > Clustered Filesystem doc to set up my linux clustered FSs on vmWare. > Though it deals with MSCS, I can see that it has some applicability. > However, I have yet to find a step-by-step guide to linux clustered > filesystems. Is there a better suited document to guide me thorough > the > process of creating shared filesystems on CentOS/RHEL on vmWare > across > boxes? > 2) Is it necessary to create a private network for access to the > shared > filesystem as the MSCS doc suggests? > 3) So far I've been looking at GFS because it is native to > CentOS/RHEL. Is there a better non-commercial/free choice? > 4) Is there a clustered filesystem method that supports vmWare HA? > This is important to us. > 5) Seems there at least three different methods to set up GFS (using > parted, using lvmconf, and using iSCSI). If I go with GFS, which > method > should I use? > Clustering seems to have a steep learning curve, but I'm laboriously > climbing the slope! Thanks for your help. > Wes Modes > UCSC Library ITS > Programmer/Analyst > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: zimbra_gold_partner.png Type: image/png Size: 2893 bytes Desc: not available URL: From td3201 at gmail.com Sat Jan 14 16:27:35 2012 From: td3201 at gmail.com (Terry) Date: Sat, 14 Jan 2012 10:27:35 -0600 Subject: [Linux-cluster] LVM not available on 2/6 clustered volumes on reboot Message-ID: All of my nodes have experienced this issue but I can't determine root cause. After reboot, 2/6 of my volumes are set to NOT available. I either have to do a vgscan or vgchange -ay on the volume group to then set the LV to available. Doing an lvchange -ay before doing the vgchange or vgscan results in this error: Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: table: 253:45: linear: dm-linear: Device lookup failed Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: ioctl: error adding target to table I am sure I can hack in a vgchange or something but clvmd does a vgscan I believe during the startup process not sure this workaround would even help. I just need to be pointed down a path to try to determine root cause here. Thanks! 
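A few commands that may help narrow down whether this is a startup-ordering problem or a device-visibility problem (assuming the affected volumes are clustered VGs that the clvmd init script activates at boot; <vgname> is a placeholder):

chkconfig --list | egrep 'cman|clvmd'   # clvmd has to start after cman and after the storage is visible
vgs -o vg_name,vg_attr                  # a 'c' in the attributes marks a clustered VG
lvs -o lv_name,vg_name,lv_attr          # an 'a' in the attributes means the LV is active
vgchange -ay <vgname>                   # manual activation, the same thing the workaround above does

If clvmd is ordered before cman, or before the backing devices appear, that alone could explain LVs that come up inactive after a reboot.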
-------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sat Jan 14 17:47:01 2012 From: linux at alteeve.com (Digimer) Date: Sat, 14 Jan 2012 12:47:01 -0500 Subject: [Linux-cluster] LVM not available on 2/6 clustered volumes on reboot In-Reply-To: References: Message-ID: <4F11BF95.7090707@alteeve.com> On 01/14/2012 11:27 AM, Terry wrote: > All of my nodes have experienced this issue but I can't determine root > cause. After reboot, 2/6 of my volumes are set to NOT available. I > either have to do a vgscan or vgchange -ay on the volume group to then > set the LV to available. > > Doing an lvchange -ay before doing the vgchange or vgscan results in > this error: > Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: table: 253:45: > linear: dm-linear: Device lookup failed > Jan 14 10:14:25 omadvnfs01c kernel: device-mapper: ioctl: error adding > target to table > > I am sure I can hack in a vgchange or something but clvmd does a vgscan > I believe during the startup process not sure this workaround would even > help. I just need to be pointed down a path to try to determine root > cause here. > > Thanks! First thing that comes to mind is that LVM is starting before the devices are available. Some questions; * Do you mean clustered LVM? * How and when is (c)lvm started? * How and when is the backing device connected? * What kind of cluster is this? What versions? * What are the relevant configuration files? -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "omg my singularity battery is dead again. stupid hawking radiation." - epitron From ga at steadfasttelecom.com Sun Jan 22 19:19:41 2012 From: ga at steadfasttelecom.com (Gilad Abada) Date: Sun, 22 Jan 2012 14:19:41 -0500 Subject: [Linux-cluster] crm issue Message-ID: Hi Guys I am new to the world of clustering. I am working on Ubuntu 10.04 LTS 64 bit and im running into a weird issue. When I am in crm -> configure after I type primitive if I try to tab anything out it doesnt work. It seems like its frozen. The only way to get out is to CTRL + C. Also this may be a related issue if i go to crm -> configure -> edit and actually make an edit, I am trying to add: primitive drbd_disk ocf:linbit:drbd \ params drbd_resource="disk0" \ op monitor interval="15s" primitive fs_drbd ocf:heartbeat:Filesystem \ params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3" ms ms_drbd drbd_disk \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" colocation mnt_on_master inf: fs_drbd ms_drbd:Master order mount_after_drbd inf: ms_drbd:promote fs_drbd:start then I :wq! and it freezes again and i have to CTRL + C I am hoping its a bad config issue on my side? Also if anyone has any good links for n00bs on clustering with ubuntu please send them along this is pretty overwhelming. Thanks so much!! Gill -- Gilad Abada SteadFast Telecommunications, Inc. Call us to find out how much you can save with VoIP! V: 212.589.1001 F: 212.589.1011 For 35 years, Steadfast Telecommunications has been providing state-of-the-art communications technology to businesses and government agencies - large and small. Steadfast Telecommunications tailors Unified Communications and Voice-Over IP Solutions to single-site offices or multi-site and worldwide enterprises.?? Make your virtual office a reality.? Enjoy the freedom to travel while remaining connected to your office. 
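One way to take the interactive crm shell out of the picture while this gets sorted out is to put the same lines in a plain file and load it in batch mode. This is only a sketch; drbd.crm is a made-up filename and the resource definitions are the ones from the message above:

cat > drbd.crm <<'EOF'
primitive drbd_disk ocf:linbit:drbd \
        params drbd_resource="disk0" \
        op monitor interval="15s"
primitive fs_drbd ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3"
ms ms_drbd drbd_disk \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation mnt_on_master inf: fs_drbd ms_drbd:Master
order mount_after_drbd inf: ms_drbd:promote fs_drbd:start
EOF
crm configure load update drbd.crm   # merge the edits from the file instead of typing them interactively
crm configure show                   # confirm what ended up in the configuration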
From df.cluster at gmail.com Mon Jan 23 06:36:32 2012 From: df.cluster at gmail.com (Dan Frincu) Date: Mon, 23 Jan 2012 08:36:32 +0200 Subject: [Linux-cluster] crm issue In-Reply-To: References: Message-ID: Hi, On Sun, Jan 22, 2012 at 9:19 PM, Gilad Abada wrote: > Hi Guys > > I am new to the world of clustering. > > I am working on Ubuntu 10.04 LTS 64 bit and im running into a weird issue. > > When I am in crm -> configure after I type primitive if I try to tab > anything out it doesnt work. It seems like its frozen. > > The only way to get out is to CTRL + C. Maybe this helps http://www.gossamer-threads.com/lists/linuxha/pacemaker/77423?do=post_view_threaded#77423 Regards, Dan > > Also this may be a related issue if i go to crm -> configure -> edit > and actually make an edit, I am trying to add: > > primitive drbd_disk ocf:linbit:drbd \ > ? ? ? ?params drbd_resource="disk0" \ > ? ? ? ?op monitor interval="15s" > primitive fs_drbd ocf:heartbeat:Filesystem \ > ? ? ? ?params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3" > ms ms_drbd drbd_disk \ > ? ? ? ?meta master-max="1" master-node-max="1" clone-max="2" > clone-node-max="1" notify="true" > colocation mnt_on_master inf: fs_drbd ms_drbd:Master > order mount_after_drbd inf: ms_drbd:promote fs_drbd:start > > then I :wq! > and it freezes again and i have to CTRL + C > > I am hoping its a bad config issue on my side? > > Also if anyone has any good links for n00bs on clustering with ubuntu > please send them along this is pretty overwhelming. > > Thanks so much!! > > Gill > > > -- > Gilad Abada > > SteadFast Telecommunications, Inc. > > Call us to find out how much you can save with VoIP! > > V: 212.589.1001 > F: 212.589.1011 > > > For 35 years, Steadfast Telecommunications has been providing > state-of-the-art communications technology to businesses and > government agencies - large and small. Steadfast Telecommunications > tailors Unified Communications and Voice-Over IP Solutions to > single-site offices or multi-site and worldwide enterprises.?? Make > your virtual office a reality.? Enjoy the freedom to travel while > remaining connected to your office. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Dan Frincu CCNA, RHCE From ga at steadfasttelecom.com Mon Jan 23 18:12:12 2012 From: ga at steadfasttelecom.com (Gilad Abada) Date: Mon, 23 Jan 2012 13:12:12 -0500 Subject: [Linux-cluster] crm issue In-Reply-To: References: Message-ID: Hi Dan, Thank you!! That worked. On Mon, Jan 23, 2012 at 1:36 AM, Dan Frincu wrote: > Hi, > > On Sun, Jan 22, 2012 at 9:19 PM, Gilad Abada wrote: >> Hi Guys >> >> I am new to the world of clustering. >> >> I am working on Ubuntu 10.04 LTS 64 bit and im running into a weird issue. >> >> When I am in crm -> configure after I type primitive if I try to tab >> anything out it doesnt work. It seems like its frozen. >> >> The only way to get out is to CTRL + C. > > Maybe this helps > http://www.gossamer-threads.com/lists/linuxha/pacemaker/77423?do=post_view_threaded#77423 > > Regards, > Dan > >> >> Also this may be a related issue if i go to crm -> configure -> edit >> and actually make an edit, I am trying to add: >> >> primitive drbd_disk ocf:linbit:drbd \ >> ? ? ? ?params drbd_resource="disk0" \ >> ? ? ? ?op monitor interval="15s" >> primitive fs_drbd ocf:heartbeat:Filesystem \ >> ? ? ? ?params device="/dev/drbd/by-res/disk0" directory="/mnt" fstype="ext3" >> ms ms_drbd drbd_disk \ >> ? ? ? 
?meta master-max="1" master-node-max="1" clone-max="2" >> clone-node-max="1" notify="true" >> colocation mnt_on_master inf: fs_drbd ms_drbd:Master >> order mount_after_drbd inf: ms_drbd:promote fs_drbd:start >> >> then I :wq! >> and it freezes again and i have to CTRL + C >> >> I am hoping its a bad config issue on my side? >> >> Also if anyone has any good links for n00bs on clustering with ubuntu >> please send them along this is pretty overwhelming. >> >> Thanks so much!! >> >> Gill >> >> >> -- >> Gilad Abada >> >> SteadFast Telecommunications, Inc. >> >> Call us to find out how much you can save with VoIP! >> >> V: 212.589.1001 >> F: 212.589.1011 >> >> >> For 35 years, Steadfast Telecommunications has been providing >> state-of-the-art communications technology to businesses and >> government agencies - large and small. Steadfast Telecommunications >> tailors Unified Communications and Voice-Over IP Solutions to >> single-site offices or multi-site and worldwide enterprises.?? Make >> your virtual office a reality.? Enjoy the freedom to travel while >> remaining connected to your office. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Dan Frincu > CCNA, RHCE > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gilad Abada SteadFast Telecommunications, Inc. Call us to find out how much you can save with VoIP! V: 212.589.1001 F: 212.589.1011 For 35 years, Steadfast Telecommunications has been providing state-of-the-art communications technology to businesses and government agencies - large and small. Steadfast Telecommunications tailors Unified Communications and Voice-Over IP Solutions to single-site offices or multi-site and worldwide enterprises.?? Make your virtual office a reality.? Enjoy the freedom to travel while remaining connected to your office. From kortux at gmail.com Tue Jan 24 20:57:57 2012 From: kortux at gmail.com (Miguel Angel Guerrero) Date: Tue, 24 Jan 2012 15:57:57 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect Message-ID: Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes using a dedicated interface. So, when I unplug the drbd network cable, both nodes power off immediatly (i tried using crossover cable and both nodes connected to a switch, but both scenarios fail), and the logs doesn't seem to show something useful. In a previous thread on this list, it is recommended to deactivate ACPID daemon, even at BIOS level, but I'm still having troubles. If I simulate a physical disconnection with ifdown command in some node, this node reboots with no hassle, but unpluging the cable kills both nodes. I think the first scenario is correct, but the second one is not what I expect. Thanks for your help the next are my cluster.conf -- Att: ------------------------------------ Miguel Angel Guerrero Usuario GNU/Linux Registrado #353531 ------------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... URL: From wmodes at ucsc.edu Tue Jan 24 21:01:53 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Tue, 24 Jan 2012 13:01:53 -0800 Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem Message-ID: <4F1F1C41.5030701@ucsc.edu> I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. 
I created the filesystem by mapping an RDM through VMWare to the guest OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the filesystem and the underlying architecture. I've included the log I created to document the process below. I've already increased the size of the LUN on the SAN. Now, how do I increase the size of the GFS2 filesystem and the LVM beneath it? Do I need to do something with the PV and VG as well? Thanks in advance for your help. Wes Here is the log of the process I used to create the filesystem: With the RDM created and all the daemons started (luci, ricci, cman) now I can config GFS. Make sure they are running on all of our nodes. We can even see the RDM on the guest systems: [root at test03]# ls /dev/sdb /dev/sdb [root at test04]# ls /dev/sdb /dev/sdb So we are doing this using lvm clustering: http://emrahbaysal.blogspot.com/2011/03/gfs-cluster-on-vmware-vsphere-rh... and http://linuxdynasty.org/215/howto-setup-gfs2-with-clustering/ We've already set up gfs daemons and fencing and whatnot. Before we start to create the LVM2 volumes and Proceed to GFS2, we will need to enable clustering in LVM2. [root at test03]# lvmconf --enable-cluster I try to create the cluster FS [root at test03]# pvcreate /dev/sdb connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Physical volume "/dev/sdb" successfully created One internet source says: >> That indicates that you have cluster locking enabled but that the cluster LVM >> daemon (clvmd) is not running. So let's start it, [root at test03]# service clvmd status clvmd is stopped [root at test03]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active clvmd not running on node test04 [ OK ] [root at test03]# chkconfig clvmd on Okay, over on the other node: [root at test04]# service clvmd status clvmd is stopped [root at test04]# service clvmd start Starting clvmd: clvmd could not connect to cluster manager Consult syslog for more information [root at test04]# service cman status groupd is stopped [root at test04]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... done [ OK ] [root at test04]# chkconfig cman on [root at test04]# service luci status luci is running... [root at test04]# service ricci status ricci (pid 4381) is running... [root at test04]# chkconfig ricci on [root at test04]# chkconfig luci on [root at test04]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active [ OK ] And this time, no complaints: [root at test03]# service clvmd restart Restarting clvmd: [ OK ] Try again with pvcreate: [root at test03]# pvcreate /dev/sdb Physical volume "/dev/sdb" successfully created Create volume group: [root at test03]# vgcreate gdcache_vg /dev/sdb Clustered volume group "gdcache_vg" successfully created Create logical volume: [root at test03]# lvcreate -n gdcache_lv -L 2T gdcache_vg Logical volume "gdcache_lv" created Create GFS filesystem, ahem, GFS2 filesystem. I screwed this up the first time. [root at test03]# mkfs.gfs2 -j 8 -p lock_dlm -t gdcluster:gdcache -j 4 /dev/mapper/gdcache_vg-gdcache_lv This will destroy any data on /dev/mapper/gdcache_vg-gdcache_lv. It appears to contain a gfs filesystem. 
Are you sure you want to proceed? [y/n] y Device: /dev/mapper/gdcache_vg-gdcache_lv Blocksize: 4096 Device Size 2048.00 GB (536870912 blocks) Filesystem Size: 2048.00 GB (536870910 blocks) Journals: 4 Resource Groups: 8192 Locking Protocol: "lock_dlm" Lock Table: "gdcluster:gdcache" UUID: 0542628C-D8B8-2480-F67D-081435F38606 Okay! And! Finally! We mount it! [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data /sbin/mount.gfs: fs is for a different cluster /sbin/mount.gfs: error mounting lockproto lock_dlm Wawawwah. Bummer. /var/log/messages says: Jan 19 14:21:05 test03 gfs_controld[3369]: mount: fs requires cluster="gdcluster" current="gdao_cluster" Someone on the interwebs concurs: the cluster name defined in /etc/cluster/cluster.conf is different from the one tagged on the GFS volume. Okay, so looking at cluster.conf: [root at test03]# vi /etc/cluster/cluster.conf Let's change that to match how I named the cluster in the above cfg_mkfs [root at test03]# vi /etc/cluster/cluster.conf And restart some stuff: [root at test03]# /etc/init.d/gfs2 stop [root at test03]# service luci stop Shutting down luci: service ricci [ OK ] [root at test03]# service ricci stop Shutting down ricci: [ OK ] [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... failed /usr/sbin/cman_tool: Error leaving cluster: Device or resource busy [FAILED] [root at test03]# cman_tool leave force [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... done Stopping ccsd... done Unmounting configfs... done [ OK ] AAAARRRRGGGHGHHH [root at test03]# service ricci start Starting ricci: [ OK ] [root at test03]# service luci start Starting luci: [ OK ] Point your web browser to https://test03.gdao.ucsc.edu:8084 to access luci [root at test03]# service gfs2 start [root at test03]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... failed [FAILED] I had to reboot. [root at test03]# service luci status luci is running... [root at test03]# service ricci status ricci (pid 4385) is running... [root at test03]# service cman status cman is running. [root at test03]# service gfs2 status Okay, again? [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Did that just work? And on test04 [root at test04]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Okay, how about a test: [root at test03]# touch /data/killme And then we look on the other node: [root at test04]# ls /data killme Holy shit. I've been working so hard for this moment that I don't completely know what to do now. Question is, now that I have two working nodes, can I duplicate it? Okay, finish up: [root at test03]# chkconfig rgmanager on [root at test03]# service rgmanager start Starting Cluster Service Manager: [ OK ] [root at test03]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 and on the other node: [root at test04]# chkconfig rgmanager on [root at test04]# service rgmanager start Starting Cluster Service Manager: [root at test04]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 And it works. Hell, yeah. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linux at alteeve.com Tue Jan 24 21:09:55 2012 From: linux at alteeve.com (Digimer) Date: Tue, 24 Jan 2012 16:09:55 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: References: Message-ID: <4F1F1E23.6080308@alteeve.com> On 01/24/2012 03:57 PM, Miguel Angel Guerrero wrote: > Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, > gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes > using a dedicated interface. So, when I unplug the drbd network cable, > both nodes power off immediatly (i tried using crossover cable and both > nodes connected to a switch, but both scenarios fail), and the logs > doesn't seem to show something useful. In a previous thread on this > list, it is recommended to deactivate ACPID daemon, even at BIOS level, > but I'm still having troubles. > > If I simulate a physical disconnection with ifdown command in some node, > this node reboots with no hassle, but unpluging the cable kills both > nodes. I think the first scenario is correct, but the second one is not > what I expect. > > Thanks for your help the next are my cluster.conf This is likely caused by both nodes getting their fence calls off before one of them dies. How do you have DRBD configured? Specifically, what fence handler are you using? If you're interested in testing, I have rewritten lon's obliterate-peer.sh and added explicit delays to help resolve this exact issue. https://github.com/digimer/rhcs_fence Alternatively, add a 'sleep 10' or similar to one of your existing fence handlers and you should find that the node with the delay consistently loses while the other node remains up. -- Digimer E-Mail: digimer at alteeve.com Papers and Projects: https://alteeve.com From rpeterso at redhat.com Tue Jan 24 21:24:12 2012 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 24 Jan 2012 16:24:12 -0500 (EST) Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem In-Reply-To: <4F1F1C41.5030701@ucsc.edu> Message-ID: <17a7e975-d459-41b3-a5ed-2b3d9958c4de@zmail16.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. | I | created the filesystem by mapping an RDM through VMWare to the guest | OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the | filesystem and the underlying architecture. I've included the log I | created to document the process below. | | I've already increased the size of the LUN on the SAN. Now, how do I | increase the size of the GFS2 filesystem and the LVM beneath it? Do | I | need to do something with the PV and VG as well? | | Thanks in advance for your help. | | Wes Hi Wes, Yep, you do need to start cman service before clvmd. If you've already extended the volume with lvresize or lvextend, then the procedure to expand the GFS2 file system to use that extra space is simple: 1. mount it on both nodes 2. gfs2_grow /mnt/point (your mount point) If it was my file system, I'd umount it at that point and do sync just to be on the safe side. Some older versions of the software didn't always sync the statfs information correctly, etc. It shouldn't be necessary, but it doesn't hurt to do it, right? Then mount it again. 
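As a sketch only, using the volume names from the log earlier in the thread: if the extra space comes from growing the existing LUN rather than adding a second device, the physical volume has to be told about the new size before the logical volume can be extended. The /dev/sdb path, the rescan step and the +1T figure are assumptions here:

echo 1 > /sys/block/sdb/device/rescan             # have the guest re-read the LUN's new size
pvresize /dev/sdb                                 # grow the PV to fill the enlarged device
vgs gdcache_vg                                    # free extents should now show up
lvextend -L +1T /dev/gdcache_vg/gdcache_lv
mount -t gfs2 /dev/gdcache_vg/gdcache_lv /data    # on both nodes
gfs2_grow /data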
Regards, Bob Peterson Red Hat File Systems From wmodes at ucsc.edu Tue Jan 24 21:25:33 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Tue, 24 Jan 2012 13:25:33 -0800 Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem Message-ID: <4F1F21CD.3000702@ucsc.edu> I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. I created the filesystem by mapping an RDM through VMWare to the guest OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the filesystem and the underlying architecture. I've included the log I created to document the process below. I've already increased the size of the LUN on the SAN. Now, how do I increase the size of the GFS2 filesystem and the LVM beneath it? Do I need to do something with the PV and VG as well? Thanks in advance for your help. Wes Here is the log of the process I used to create the filesystem: With the RDM created and all the daemons started (luci, ricci, cman) now I can config GFS. Make sure they are running on all of our nodes. We can even see the RDM on the guest systems: [root at test03]# ls /dev/sdb /dev/sdb [root at test04]# ls /dev/sdb /dev/sdb So we are doing this using lvm clustering: http://emrahbaysal.blogspot.com/2011/03/gfs-cluster-on-vmware-vsphere-rh... and http://linuxdynasty.org/215/howto-setup-gfs2-with-clustering/ We've already set up gfs daemons and fencing and whatnot. Before we start to create the LVM2 volumes and Proceed to GFS2, we will need to enable clustering in LVM2. [root at test03]# lvmconf --enable-cluster I try to create the cluster FS [root at test03]# pvcreate /dev/sdb connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Physical volume "/dev/sdb" successfully created One internet source says: >> That indicates that you have cluster locking enabled but that the cluster LVM >> daemon (clvmd) is not running. So let's start it, [root at test03]# service clvmd status clvmd is stopped [root at test03]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active clvmd not running on node test04 [ OK ] [root at test03]# chkconfig clvmd on Okay, over on the other node: [root at test04]# service clvmd status clvmd is stopped [root at test04]# service clvmd start Starting clvmd: clvmd could not connect to cluster manager Consult syslog for more information [root at test04]# service cman status groupd is stopped [root at test04]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... done [ OK ] [root at test04]# chkconfig cman on [root at test04]# service luci status luci is running... [root at test04]# service ricci status ricci (pid 4381) is running... 
[root at test04]# chkconfig ricci on [root at test04]# chkconfig luci on [root at test04]# service clvmd start Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VolGroup00" now active [ OK ] And this time, no complaints: [root at test03]# service clvmd restart Restarting clvmd: [ OK ] Try again with pvcreate: [root at test03]# pvcreate /dev/sdb Physical volume "/dev/sdb" successfully created Create volume group: [root at test03]# vgcreate gdcache_vg /dev/sdb Clustered volume group "gdcache_vg" successfully created Create logical volume: [root at test03]# lvcreate -n gdcache_lv -L 2T gdcache_vg Logical volume "gdcache_lv" created Create GFS filesystem, ahem, GFS2 filesystem. I screwed this up the first time. [root at test03]# mkfs.gfs2 -j 8 -p lock_dlm -t gdcluster:gdcache -j 4 /dev/mapper/gdcache_vg-gdcache_lv This will destroy any data on /dev/mapper/gdcache_vg-gdcache_lv. It appears to contain a gfs filesystem. Are you sure you want to proceed? [y/n] y Device: /dev/mapper/gdcache_vg-gdcache_lv Blocksize: 4096 Device Size 2048.00 GB (536870912 blocks) Filesystem Size: 2048.00 GB (536870910 blocks) Journals: 4 Resource Groups: 8192 Locking Protocol: "lock_dlm" Lock Table: "gdcluster:gdcache" UUID: 0542628C-D8B8-2480-F67D-081435F38606 Okay! And! Finally! We mount it! [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data /sbin/mount.gfs: fs is for a different cluster /sbin/mount.gfs: error mounting lockproto lock_dlm Wawawwah. Bummer. /var/log/messages says: Jan 19 14:21:05 test03 gfs_controld[3369]: mount: fs requires cluster="gdcluster" current="gdao_cluster" Someone on the interwebs concurs: the cluster name defined in /etc/cluster/cluster.conf is different from the one tagged on the GFS volume. Okay, so looking at cluster.conf: [root at test03]# vi /etc/cluster/cluster.conf Let's change that to match how I named the cluster in the above cfg_mkfs [root at test03]# vi /etc/cluster/cluster.conf And restart some stuff: [root at test03]# /etc/init.d/gfs2 stop [root at test03]# service luci stop Shutting down luci: service ricci [ OK ] [root at test03]# service ricci stop Shutting down ricci: [ OK ] [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... failed /usr/sbin/cman_tool: Error leaving cluster: Device or resource busy [FAILED] [root at test03]# cman_tool leave force [root at test03]# service cman stop Stopping cluster: Stopping fencing... done Stopping cman... done Stopping ccsd... done Unmounting configfs... done [ OK ] AAAARRRRGGGHGHHH [root at test03]# service ricci start Starting ricci: [ OK ] [root at test03]# service luci start Starting luci: [ OK ] Point your web browser to https://test03.gdao.ucsc.edu:8084 to access luci [root at test03]# service gfs2 start [root at test03]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... failed [FAILED] I had to reboot. [root at test03]# service luci status luci is running... [root at test03]# service ricci status ricci (pid 4385) is running... [root at test03]# service cman status cman is running. [root at test03]# service gfs2 status Okay, again? [root at test03]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Did that just work? And on test04 [root at test04]# mount /dev/mapper/gdcache_vg-gdcache_lv /data Okay, how about a test: [root at test03]# touch /data/killme And then we look on the other node: [root at test04]# ls /data killme Holy shit. 
I've been working so hard for this moment that I don't completely know what to do now. Question is, now that I have two working nodes, can I duplicate it? Okay, finish up: [root at test03]# chkconfig rgmanager on [root at test03]# service rgmanager start Starting Cluster Service Manager: [ OK ] [root at test03]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 and on the other node: [root at test04]# chkconfig rgmanager on [root at test04]# service rgmanager start Starting Cluster Service Manager: [root at test04]# vi /etc/fstab /dev/mapper/gdcache_vg-gdcache_lv /data gfs2 defaults,noatime,nodiratime 0 0 And it works. Hell, yeah. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kortux at gmail.com Tue Jan 24 21:34:42 2012 From: kortux at gmail.com (Miguel Angel Guerrero) Date: Tue, 24 Jan 2012 16:34:42 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: <4F1F1E23.6080308@alteeve.com> References: <4F1F1E23.6080308@alteeve.com> Message-ID: Digimer i use your manual ;) https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial in a test environment y desactivate drbd daemon for testing but with or without drbd daemon running, the problem persist I use the next handler and fencing policy in drbd fencing resource-and-stonith; outdate-peer "/sbin/obliterate-peer.sh"; Digimer when you suggest add "sleep 10"' is in drbd.conf? On Tue, Jan 24, 2012 at 4:09 PM, Digimer wrote: > On 01/24/2012 03:57 PM, Miguel Angel Guerrero wrote: > > Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, > > gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes > > using a dedicated interface. So, when I unplug the drbd network cable, > > both nodes power off immediatly (i tried using crossover cable and both > > nodes connected to a switch, but both scenarios fail), and the logs > > doesn't seem to show something useful. In a previous thread on this > > list, it is recommended to deactivate ACPID daemon, even at BIOS level, > > but I'm still having troubles. > > > > If I simulate a physical disconnection with ifdown command in some node, > > this node reboots with no hassle, but unpluging the cable kills both > > nodes. I think the first scenario is correct, but the second one is not > > what I expect. > > > > Thanks for your help the next are my cluster.conf > > This is likely caused by both nodes getting their fence calls off before > one of them dies. > > How do you have DRBD configured? Specifically, what fence handler are > you using? If you're interested in testing, I have rewritten lon's > obliterate-peer.sh and added explicit delays to help resolve this exact > issue. > > https://github.com/digimer/rhcs_fence > > Alternatively, add a 'sleep 10' or similar to one of your existing fence > handlers and you should find that the node with the delay consistently > loses while the other node remains up. > > -- > Digimer > E-Mail: digimer at alteeve.com > Papers and Projects: https://alteeve.com > -- Atte: ------------------------------------ Miguel Angel Guerrero Usuario GNU/Linux Registrado #353531 ------------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linux at alteeve.com Tue Jan 24 21:42:56 2012 From: linux at alteeve.com (Digimer) Date: Tue, 24 Jan 2012 16:42:56 -0500 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: References: <4F1F1E23.6080308@alteeve.com> Message-ID: <4F1F25E0.80002@alteeve.com> On 01/24/2012 04:34 PM, Miguel Angel Guerrero wrote: > Digimer i use your manual ;) > > https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial > > in a test environment y desactivate drbd daemon for testing but with or > without drbd daemon running, the problem persist > I use the next handler and fencing policy in drbd > > fencingresource-and-stonith; > outdate-peer"/sbin/obliterate-peer.sh"; > > Digimer when you suggest add "sleep 10"' is in drbd.conf? That's awesome! :) No, you would put the sleep at the start of obliterate-peer.sh on one node only. If this works, would you be willing to test 'rhcs_fence' for me? It's new, and could use some testing. It automatically adds a delay based on the node's cluster ID, with no delay for the node with ID of "1". If so, here is how to install it on both nodes; wget -c https://raw.github.com/digimer/rhcs_fence/master/rhcs_fence chmod 755 rhcs_fence mv rhcs_fence /usr/sbin/ Then in drbd.conf, change: outdate-peer "/sbin/obliterate-peer.sh"; to outdate-peer "/usr/sbin/rhcs_fence"; Cheers. :) -- Digimer E-Mail: digimer at alteeve.com Papers and Projects: https://alteeve.com From wmodes at ucsc.edu Tue Jan 24 22:19:58 2012 From: wmodes at ucsc.edu (Wes Modes) Date: Tue, 24 Jan 2012 14:19:58 -0800 Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem In-Reply-To: <17a7e975-d459-41b3-a5ed-2b3d9958c4de@zmail16.collab.prod.int.phx2.redhat.com> References: <17a7e975-d459-41b3-a5ed-2b3d9958c4de@zmail16.collab.prod.int.phx2.redhat.com> Message-ID: <4F1F2E8E.4010308@ucsc.edu> I have not extended the volume. That was precisely my question. I already understand how to grow the GFS2 filesystem (conceptually). As per https://alteeve.com/w/Grow_a_GFS2_Partition. I've tried to increase the size of the volume with lvextend, but it's not having it. [root at test03]# lvextend -L +2T /dev/sdb Path required for Logical Volume "sdb" Please provide a volume group name Run `lvextend --help' for more information. [root at test03]# lvextend -L +2T /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb Extending logical volume gdcache_lv to 4.00 TB Insufficient free space: 524288 extents needed, but only 3 available [root at test03]# lvextend -L +2000G /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb Extending logical volume gdcache_lv to 3.95 TB Insufficient free space: 512000 extents needed, but only 3 available [root at test03]# lvextend -L +1999G /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb Extending logical volume gdcache_lv to 3.95 TB Insufficient free space: 511744 extents needed, but only 3 available I assume I need to expand the underlying PV or VG. But how? Wes On 1/24/2012 1:24 PM, Bob Peterson wrote: > ----- Original Message ----- > | I am running CentOS with a GFS2 filesystem on a Dell EqualLogic SAN. > | I > | created the filesystem by mapping an RDM through VMWare to the guest > | OS. I used pvcreate, vgcreate, lvcreate, and mkfs.gfs2 to create the > | filesystem and the underlying architecture. I've included the log I > | created to document the process below. > | > | I've already increased the size of the LUN on the SAN. Now, how do I > | increase the size of the GFS2 filesystem and the LVM beneath it? Do > | I > | need to do something with the PV and VG as well? 
> | > | Thanks in advance for your help. > | > | Wes > > Hi Wes, > > Yep, you do need to start cman service before clvmd. > > If you've already extended the volume with lvresize or lvextend, > then the procedure to expand the GFS2 file system to use that > extra space is simple: > > 1. mount it on both nodes > 2. gfs2_grow /mnt/point (your mount point) > > If it was my file system, I'd umount it at that point and do sync > just to be on the safe side. Some older versions of the software > didn't always sync the statfs information correctly, etc. > It shouldn't be necessary, but it doesn't hurt to do it, right? > Then mount it again. > > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Jan 24 22:30:41 2012 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 24 Jan 2012 17:30:41 -0500 (EST) Subject: [Linux-cluster] Expanding a LUN and a GFS2 filesystem In-Reply-To: <4F1F2E8E.4010308@ucsc.edu> Message-ID: <28f21da3-b41b-496c-9be0-4104a0a7df91@zmail16.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | I have not extended the volume. That was precisely my question. I | already understand how to grow the GFS2 filesystem (conceptually). | As | per https://alteeve.com/w/Grow_a_GFS2_Partition. | | I've tried to increase the size of the volume with lvextend, but it's | not having it. | | [root at test03]# lvextend -L +2T /dev/sdb | Path required for Logical Volume "sdb" | Please provide a volume group name | Run `lvextend --help' for more information. | [root at test03]# lvextend -L +2T /dev/mapper/gdcache_vg-gdcache_lv | /dev/sdb | Extending logical volume gdcache_lv to 4.00 TB | Insufficient free space: 524288 extents needed, but only 3 | available | [root at test03]# lvextend -L +2000G | /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb | Extending logical volume gdcache_lv to 3.95 TB | Insufficient free space: 512000 extents needed, but only 3 | available | [root at test03]# lvextend -L +1999G | /dev/mapper/gdcache_vg-gdcache_lv /dev/sdb | Extending logical volume gdcache_lv to 3.95 TB | Insufficient free space: 511744 extents needed, but only 3 | available | | I assume I need to expand the underlying PV or VG. But how? | | Wes In order to make the volume bigger, you need to lvresize or lvextend it. In order to do that, you need to make the volume group bigger. If your volume group has no more space, you can add storage devices to it with a command like this: vgextend gdcache_vg /dev/sdt /dev/sdu /dev/sdv (assuming you want to add those devices to the vg) Once you've done that, you can extend the lv with lvresize or lvextend. So something like: vgextend gdcache_vg /dev/sdt /dev/sdu /dev/sdv lvresize -L+1T /dev/gdcache_vg/gdcache_lv mount -t gfs2 /dev/gdcache_vg/gdcache_lv /mnt/gfs2 gfs2_grow /mnt/gfs2 Regards, Bob Peterson Red Hat File Systems From jayesh.shinde at netcore.co.in Wed Jan 25 07:50:32 2012 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 25 Jan 2012 13:20:32 +0530 Subject: [Linux-cluster] Few queries about fence working Message-ID: <4F1FB448.6060709@netcore.co.in> Hi all , I have few queries about fence working. I am using 2 different the 2 node cluster with Dell and IBM hardware in two different IDC. Recently I came across the network failure problem at different time and I found my 2 nodes are power off state. 
Below is how the situation happened with my 2 different 2 node cluster. With 2 node IBM node cluster with SAN :-- ============================== 1) Network connectivity was failed totally for few minutes. 2) And as per the /var/log/messages both servers failed to fence to each other and both server was UP as it is with all services. 3) But the "clustat" was showing serves are not in cluster mode and "regmanger" status was stop. 4) I simply reboot the server. 5) After that I found both server in power off stat. with another 2 node Dell server with DRBD :-- ================================= 1) Network connectivity was failed totally. 2) DRAC ip was unavailable so fence failed from both server. 3) after some time I fond the servers are shutdown. In normal conditions both cluster work properly my queries are now :-- =============== 1) What could be the reason for power off ? 2) Does cluster's fencing method caused for the power off of server ( i.e because of previous failed fence ) ? 3) Is there any test cases mentioned on net / blog / wiki about the fence , i.e different situation under which fence works. Please guide. Thanks & Regards Jayesh Shinde -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Wed Jan 25 08:29:27 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 25 Jan 2012 09:29:27 +0100 Subject: [Linux-cluster] Few queries about fence working In-Reply-To: <4F1FB448.6060709@netcore.co.in> References: <4F1FB448.6060709@netcore.co.in> Message-ID: Can you show me your cluster config? 2012/1/25 jayesh.shinde > ** > Hi all , > > I have few queries about fence working. > > I am using 2 different the 2 node cluster with Dell and IBM hardware in > two different IDC. > Recently I came across the network failure problem at different time and I > found my 2 nodes are power off state. > > Below is how the situation happened with my 2 different 2 node cluster. > > With 2 node IBM node cluster with SAN :-- > ============================== > 1) Network connectivity was failed totally for few minutes. > 2) And as per the /var/log/messages both servers failed to fence to each > other and both server was UP as it is with all services. > 3) But the "clustat" was showing serves are not in cluster mode and > "regmanger" status was stop. > 4) I simply reboot the server. > 5) After that I found both server in power off stat. > > > with another 2 node Dell server with DRBD :-- > ================================= > 1) Network connectivity was failed totally. > 2) DRAC ip was unavailable so fence failed from both server. > 3) after some time I fond the servers are shutdown. > > In normal conditions both cluster work properly > > my queries are now :-- > =============== > 1) What could be the reason for power off ? > 2) Does cluster's fencing method caused for the power off of server ( > i.e because of previous failed fence ) ? > 3) Is there any test cases mentioned on net / blog / wiki about the fence > , i.e different situation under which fence works. > > Please guide. > > Thanks & Regards > Jayesh Shinde > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From emi2fast at gmail.com Wed Jan 25 09:20:21 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 25 Jan 2012 10:20:21 +0100 Subject: [Linux-cluster] Halt nodes in cluster with cable disconnect In-Reply-To: References: Message-ID: Hello Miguel Talking about the problem when both nodes gets poweroff, this is called fencing-race, Redhat has this problem from so much time and the only fix was made it fence delay delay="30" man fence_ipmilan And I thinks you can look for a quorum qdisk 2012/1/24 Miguel Angel Guerrero > Hi i'm trying to setup a centos cluster with two nodes with cman, drbd, > gfs2 and i'm using ipmi for fencing. DRBD is set up between the nodes using > a dedicated interface. So, when I unplug the drbd network cable, both nodes > power off immediatly (i tried using crossover cable and both nodes > connected to a switch, but both scenarios fail), and the logs doesn't seem > to show something useful. In a previous thread on this list, it is > recommended to deactivate ACPID daemon, even at BIOS level, but I'm still > having troubles. > > If I simulate a physical disconnection with ifdown command in some node, > this node reboots with no hassle, but unpluging the cable kills both nodes. > I think the first scenario is correct, but the second one is not what I > expect. > > Thanks for your help the next are my cluster.conf > > > > > > > > > action="reboot"/> > > > > > > > action="reboot"/> > > > > > > ipaddr="192.168.201.220" lanplus="1" login="ADMIN" name="ipmi1" > passwd="itac321"/> > ipaddr="192.168.201.186" lanplus="1" login="ADMIN" name="ipmi2" > passwd="itac321"/> > > > > > -- > Att: > ------------------------------------ > Miguel Angel Guerrero > Usuario GNU/Linux Registrado #353531 > ------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From jayesh.shinde at netcore.co.in Wed Jan 25 09:38:45 2012 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Wed, 25 Jan 2012 15:08:45 +0530 Subject: [Linux-cluster] Few queries about fence working In-Reply-To: References: <4F1FB448.6060709@netcore.co.in> Message-ID: <4F1FCDA5.4010909@netcore.co.in> Dear Emmanuel Segura, Find the config below. Because of policy I have removed some login details. #############