From bergman at merctech.com Fri Apr 1 20:42:34 2011 From: bergman at merctech.com (bergman at merctech.com) Date: Fri, 1 Apr 2011 16:42:34 -0400 Subject: [Linux-cluster] clustat stuck In-Reply-To: <019b01c879e5$5f8a4cf0$0777b5c2@gda07ak> References: <1204150957.13733.3.camel@ayanami.boston.devel.redhat.com> <019b01c879e5$5f8a4cf0$0777b5c2@gda07ak> Message-ID: <20110401164234.46e1f8f4@mirchi.uphs.upenn.edu> The pithy ruminations from frederic randriamora on Oct 29, 2010 4:30:03 pm entitled"RE: [Linux-cluster] clustat stuck" were: ==> Hi, ==> ==> I have a 4 node cluster, with multipathed qdisk on a san. The nodes are ==> running redhat 5.4. I've got a 3 node cluster, with multipathed qdisk on a SAN. The nodes are running CentOS 5.5: Linux 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux lvm2-cluster-2.02.56-7.el5_5.4 cman-2.0.115-34.el5_5.4 rgmanager-2.0.52-6.el5.centos.8 openais-0.80.6-16.el5_5.9 ==> ==> After a minor change made in cluster.conf on node3 properly propagated ==> by ccs_tool update, clustat is no longer correctly responding in the ==> other 3 nodes. In my case, I failed a service from node3 ==> node2, but made no cluster configuration changes. ==> node3 is neither nodeid 1 nor qdisk master. ==> ==> clustat on node3 runs fine Similar. On node2, clustat works fine. ==> ==> clustat on the other nodes ==> ==> either hangs with ==> connect(8, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"...}, 110 ==> from strace ==> ==> ==> or times out with ==> Timed out waiting for a response from Resource Group Manager ==> without displaying the still running services ==> Exactly the same behavior here. ==> cman_tool services et al. are just fine everywhere, ==> Agreed. The actual sevices are running on each node. The report from cman_tool is correct, but querying the cluster with "clustat" or operations with "cluscvadm" hang or timeout. ==> Although all the services are running fine, I cannot move/stop them ==> anymore with clusvcadm. ==> ==> How to get out of that situation? Is there any solution to this issue? Thanks, Mark From michaelm at plumbersstock.com Fri Apr 1 19:20:15 2011 From: michaelm at plumbersstock.com (Michael McGlothlin) Date: Fri, 1 Apr 2011 13:20:15 -0600 Subject: [Linux-cluster] Replication for iSCSI target? Message-ID: I'm experimenting with setting up an iSCSI target that has data replication between three nodes. Right now I'm trying Glusterfs which seems workable but I'm not sure how it'll handle it if more than one node is trying to access the same target device (1TB sparse file) at the same time. Has anyone set something like this up before and can give me some hints? I was looking at GFS before but it appeared to not do replication? DRBD seemed like a possibility but having more than two nodes sounded as if it might be an issue. Each server has dual quad-core Xeon processors, 64GB RAM, 8 2TB drives, and 10Gb Ethernet so I hope hardware won't be a limitation. We've constantly had trouble with every iSCSI, SAN, NAS, etc we've tried so I want to make something that is completely void of any single point of failure. Thanks, Michael McGlothlin From nhuczp at gmail.com Sun Apr 3 04:43:20 2011 From: nhuczp at gmail.com (chenzp) Date: Sun, 3 Apr 2011 12:43:20 +0800 Subject: [Linux-cluster] Replication for iSCSI target? In-Reply-To: References: Message-ID: use drbd! 2011/4/2 Michael McGlothlin > I'm experimenting with setting up an iSCSI target that has data > replication between three nodes. 
Right now I'm trying Glusterfs which > seems workable but I'm not sure how it'll handle it if more than one > node is trying to access the same target device (1TB sparse file) at > the same time. Has anyone set something like this up before and can > give me some hints? I was looking at GFS before but it appeared to not > do replication? DRBD seemed like a possibility but having more than > two nodes sounded as if it might be an issue. > > Each server has dual quad-core Xeon processors, 64GB RAM, 8 2TB > drives, and 10Gb Ethernet so I hope hardware won't be a limitation. > We've constantly had trouble with every iSCSI, SAN, NAS, etc we've > tried so I want to make something that is completely void of any > single point of failure. > > > Thanks, > Michael McGlothlin > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Abraham.Alawi at team.telstra.com Sun Apr 3 23:33:47 2011 From: Abraham.Alawi at team.telstra.com (Alawi, Abraham) Date: Mon, 4 Apr 2011 09:33:47 +1000 Subject: [Linux-cluster] Replication for iSCSI target? In-Reply-To: References: Message-ID: <8EBFB8471A4BFE45B31EAC433ED57C027E83B3BBAA@WSMSG3103V.srv.dir.telstra.com> Customer's requirements drive the solution, there are many ways to achieve HA storage. I don't know your exact setup and resources but I reckon iSCSI will be just an unnecessary overhead layer if you are using GlusterFS. You can use DRBD with a clustered file system (e.g. gfs, ocfs, ..) or with xfs/ext3/4 + NFS, you may consider using G/NBD as well. Again, the solution is driven by the customer's requirements and the available resources. GlusterFS is an awesome solution, provides HA & performance but it has some limited capabilities too, like it doesn't support POSIX ACL. Lastly my 2cents for a clustered file system, it sounds like a sexy solution but if it's deployed in a small scale (e.g. < 5 nodes) it might easily become a nightmare. For small scales HA storage without losing stability I'd recommend DRBD + XFS/EXT3/4 + NFS. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of chenzp Sent: Sunday, 3 April 2011 2:43 PM To: linux clustering Subject: Re: [Linux-cluster] Replication for iSCSI target? use drbd! 2011/4/2 Michael McGlothlin > I'm experimenting with setting up an iSCSI target that has data replication between three nodes. Right now I'm trying Glusterfs which seems workable but I'm not sure how it'll handle it if more than one node is trying to access the same target device (1TB sparse file) at the same time. Has anyone set something like this up before and can give me some hints? I was looking at GFS before but it appeared to not do replication? DRBD seemed like a possibility but having more than two nodes sounded as if it might be an issue. Each server has dual quad-core Xeon processors, 64GB RAM, 8 2TB drives, and 10Gb Ethernet so I hope hardware won't be a limitation. We've constantly had trouble with every iSCSI, SAN, NAS, etc we've tried so I want to make something that is completely void of any single point of failure. Thanks, Michael McGlothlin -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From list at fajar.net Mon Apr 4 02:27:45 2011 From: list at fajar.net (Fajar A. 
Nugraha) Date: Mon, 4 Apr 2011 09:27:45 +0700 Subject: [Linux-cluster] Replication for iSCSI target? In-Reply-To: <8EBFB8471A4BFE45B31EAC433ED57C027E83B3BBAA@WSMSG3103V.srv.dir.telstra.com> References: <8EBFB8471A4BFE45B31EAC433ED57C027E83B3BBAA@WSMSG3103V.srv.dir.telstra.com> Message-ID: On Mon, Apr 4, 2011 at 6:33 AM, Alawi, Abraham wrote: > ?GlusterFS is an awesome solution, provides HA & performance but it has some > limited capabilities too, like it doesn?t support POSIX ACL. Lastly my > 2cents for a clustered file system, it sounds like a sexy solution but if > it?s deployed in a small scale (e.g. < 5 nodes) it might easily become a > nightmare. For small scales HA storage without losing stability I?d > recommend DRBD + XFS/EXT3/4 + NFS. Also don't forget that for any clustered solution (including DRBD), there will be a performance penalty (which comes from cache invalidation, sync writes, etc.) which might or might not be acceptable, depending on your needs. For simple setup I actually suggest you just stick with one or more storage appliance (Netapp if you can afford it, or Nexenta community edition if you want to use common hardware) and optionally setup replication between the storage appliance. -- Fajar From fdinitto at redhat.com Mon Apr 4 15:35:45 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 04 Apr 2011 17:35:45 +0200 Subject: [Linux-cluster] fence-agents 3.1.3 stable release Message-ID: <4D99E551.9050605@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the fence-agents 3.1.3 release. This release contains a few bug fixes and a new fence_vmware_soap that integrates and supports everything supported by the SOAP API v4.1 (including vSphere). NOTE the new vmware agent requires python-suds. The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.3.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. 
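Building from the tarball should just be the usual autotools sequence; this is only the generic recipe (directory name assumed from the tarball name, and the usual build dependencies assumed to be installed already):

  wget https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.3.tar.xz
  tar xJf fence-agents-3.1.3.tar.xz
  cd fence-agents-3.1.3
  ./configure
  make
  make install
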
Happy clustering, Fabio Under the hood (from 3.1.2): Arnaud Quette (1): eaton_snmp: fix list offset Brandon Perkins (1): Fence_rhevm needs to change "RUNNING" status to "UP" status as the REST-AP Jim Ramsay (1): Allow fence_scsi to use any valid hexadecimal key Lon Hohberger (1): fence-agents: Accept other values for "true" Marek 'marx' Grac (10): fence_rhevm: Update URL to RHEV-M REST API fence_cisco_ucs: Support for sub-organization fence_cisco_ucs: Fix for support for sub-organization library: Add support for UUID (-U / --uuid / uuid) fence_vmware_soap: New fence agent for VMWare using SOAP API fence_cisco_ucs, fence_rhevm: Problems with SSL support fence_ipmilan: returns incorrect status on monitor op if chassis is powere fence_ipmilan: Correct return code for diag operation library: Add support for URL session instead of hostname bugfix uuid Ryan O'Hara (2): fence_scsi: fix typo when opening logfile fence_scsi: grep for keys should be case insensitive configure.ac | 1 + fence/agents/Makefile.am | 1 + fence/agents/cisco_ucs/fence_cisco_ucs.py | 15 ++- fence/agents/eaton_snmp/fence_eaton_snmp.py | 8 +- fence/agents/ipmilan/ipmilan.c | 13 ++ fence/agents/lib/fencing.py.py | 32 ++++- fence/agents/rhevm/fence_rhevm.py | 5 +- fence/agents/scsi/fence_scsi.pl | 12 +- fence/agents/vmware_soap/Makefile.am | 17 +++ fence/agents/vmware_soap/fence_vmware_soap.py | 175 +++++++++++++++++++++++++ 10 files changed, 261 insertions(+), 18 deletions(-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJNmeVQAAoJEFA6oBJjVJ+O0mMP+wdZLnYAWRKhFFm8pdllvbbw ZndPK13Xg/DtH5MkKiSjEl/aAmz7HwPW3Ml5XY5hlC6T6kE9V1nlA5/rfvU+2LQP IjDDCr9RiEMmsqJOxhNjzyPX38k5HyasEnpRdPfrW7yPwOg4KuWiqTaxGJzN6mKa /5OoXOKH1oadYUDcWGOT4ffYIo8ZlTAcpJmE1qp7ZP0h3w/o5XuFjCDRdMMJJc0Z VvYpAlfRYEvZy5pAYyEaMGpPv+Yw6PsOR2pFAwi75I9diTuwVqqyrnyPU8XZgTvB vMgY5OvbQcIwu9nXATH9QmAtvY805CyQmHVJHC56nUsNT1QzuW1AId63V0jUr6+F kkUyzerjN3nta41HAWEHdP0LY8F+0kNRQ5zPbwzvL9/mxqnOTz6MgxYy5lVJvMSh llIfsGZp+ZqmMbQw3ZXpAUT1lDibybhLlLvrTNF0iE6Am1gEMUmgJ0tCuzUG5H6K 6vbisPrv0ZT0IdQBZ3DPDuRH1VpmQ6gvUHd2pFkF+qKTKEzdL4Ns51X/lFXYPrU5 R2TQyr3alO0YKyIkDeDRJmHNYo6hDNjpe8lF9OTUu2Cw+Vih2W6FJ7m5SkcmFeGq Wu8v3/D56yvP9RVzSNobu1Z7NsTMYLuPEqHuvRQugtB3x+uFdCyNy+tV5jfyRb0W jJL7y9/UAcT+r05+Mdwe =kYZf -----END PGP SIGNATURE----- From mbreid at thepei.com Mon Apr 4 15:38:02 2011 From: mbreid at thepei.com (Mike Reid) Date: Mon, 04 Apr 2011 09:38:02 -0600 Subject: [Linux-cluster] Prevent locked I/O in two-node OCFS2 cluster? (DRBD 8.3.8 / Ubuntu 10.10) Message-ID: Hello, I am posting here as a recommendation from an ocfs2-users response for my original post: http://oss.oracle.com/pipermail/ocfs2-users/2011-April/005046.html Excerpt: I am running a two-node web cluster on OCFS2 via DRBD Primary/Primary (v8.3.8) and Pacemaker. Everything seems to be working great, except during testing of hard-boot scenarios. Whenever I hard-boot one of the nodes, the other node is successfully fenced and marked ?Outdated? * However, this locks up I/O on the still active node and prevents any operations within the cluster :( I have even forced DRBD into StandAlone mode while in this state, but that does not resolve the I/O lock either. ...does anyone know if this is possible using OCFS2 (maintaining an active cluster in Primary/Primary when the other node has a failure? E.g. Be it forced, controlled, etc) Is ?qdisk? a requirement for this to work with Pacemaker? 
NOTE: On a reply to my original post (URL above) I also provided an example CIB that I have been using during testing. -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaelm at plumbersstock.com Mon Apr 4 20:08:53 2011 From: michaelm at plumbersstock.com (Michael McGlothlin) Date: Mon, 4 Apr 2011 14:08:53 -0600 Subject: [Linux-cluster] Replication for iSCSI target? In-Reply-To: References: <8EBFB8471A4BFE45B31EAC433ED57C027E83B3BBAA@WSMSG3103V.srv.dir.telstra.com> Message-ID: All the storage appliances I've looked at were either severely limited (2TB limit, no replication) or expensive ($4000 per node) and I'm not a fan of black boxes which in my experience make the easy stuff easier and the hard stuff a nightmare. I've got 15+ years of Linux (RedHat preferred) experience and more with various other Unixes and have experimented quite a bit with clustering but have never used it in production so I just want to get any feedback I can from people who've already done this. I'm not afraid of configuring things from the command line, compiling code, etc. It's my understanding that ESXi doesn't allow the use of random file systems such as ZFS or Gluster directly so they'll have to be accessed by NFS or iSCSI. NFS is probably easier but iSCSI seems more cluster friendly as you can easily have multiple IP addresses for the same data store. The general idea is so each physical machine will host a VM that is a cluster filesystem node and will by default access that filesystem from it's local VM (off the local RAID) and if the local node should stop responding it will automatically switch to another node. And by identifying all the nodes as being the same data store then if the physical machine goes down there should be no problem in automatically bringing those VMs back up on another physical host. So I'm pretty sure we want a clustering filesystem that supports replication across all nodes. Speed isn't nearly the issue for me that reliability is. DRBD sounds good in that it works at the block level and I'm pretty sure that as VMware locks individual VMs so that only one ESXi server can access it at a time that there would be no stepping on toes leading to filesystem corruption. Gluster sounds good in that it doesn't make me jump through hoops to set up more than two nodes and I do expect to be expanding to more than the current three nodes in the near future. Either one is simple enough to get working for the basics but I'm just trying to figure out the best configuration for using it in the iSCSI configuration. Thanks, Michael McGlothlin From list at fajar.net Mon Apr 4 21:51:49 2011 From: list at fajar.net (Fajar A. Nugraha) Date: Tue, 5 Apr 2011 04:51:49 +0700 Subject: [Linux-cluster] Replication for iSCSI target? In-Reply-To: References: <8EBFB8471A4BFE45B31EAC433ED57C027E83B3BBAA@WSMSG3103V.srv.dir.telstra.com> Message-ID: On Tue, Apr 5, 2011 at 3:08 AM, Michael McGlothlin wrote: > All the storage appliances I've looked at were either severely limited > (2TB limit, no replication) or expensive ($4000 per node) Nexentastor Community Edition has 18 TB usage capacity limit, and should have active-passive replication support. > Speed isn't nearly the issue for me that > reliability is. DRBD sounds good in that it works at the block level Give drbd a try then. Just make sure you do enough tests first. 
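A quick smoke test is easy enough once a resource is defined in drbd.conf; the resource name r0 below is just a placeholder for the example, not something from your setup:

  # on both nodes, after defining a resource named r0 in drbd.conf
  drbdadm create-md r0
  drbdadm up r0
  # on one node only, to kick off the initial sync
  drbdadm -- --overwrite-data-of-peer primary r0
  # watch connection and sync state
  cat /proc/drbd
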
I had a test setup with no fencing (yes, I know it's bad, that's why it's only a test) with dbrd+ocfs2, and sometimes a node failure cause unpredicatable cluster-wide error (both nodes restarting, split brain, etc). In production environment timeout for fencing should be lower than other subsystems (e.g. drbd). -- Fajar From rossnick-lists at cybercat.ca Tue Apr 5 16:48:56 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Tue, 5 Apr 2011 12:48:56 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 Message-ID: Hi ! I've got my cluster now setup in it's final position at the colo facility, and we've got an APC ap-8941 power bar. At the moment, our fencing is configured with ipmilan via our RMM3 modules on our intel servers. But I'd like to add a backup fence device, being the apc. I can't seem to make it work. On our apc bar, I enabled ssh and disabled telnet. I can ssh from our cluster nodes to the ip of the apc bar and perform operations, altough connectin via ssh takes about 2 or 3 seconds. I try to call manual fence_apc from the command line like so : fence_apc -a ip -l user -p pass -n node101 -x -v and I get very rapidly : Unable to connect/login to fencing device Netstat shows me a time_wait connection, so it has made a tcp connection. Any hints ? From fdinitto at redhat.com Wed Apr 6 06:23:14 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 06 Apr 2011 08:23:14 +0200 Subject: [Linux-cluster] fence_apc and Apc AP-8941 In-Reply-To: References: Message-ID: <4D9C06D2.1040106@redhat.com> On 4/5/2011 6:48 PM, Nicolas Ross wrote: > Hi ! > > I've got my cluster now setup in it's final position at the colo > facility, and we've got an APC ap-8941 power bar. At the moment, our > fencing is configured with ipmilan via our RMM3 modules on our intel > servers. But I'd like to add a backup fence device, being the apc. > > I can't seem to make it work. On our apc bar, I enabled ssh and disabled > telnet. I can ssh from our cluster nodes to the ip of the apc bar and > perform operations, altough connectin via ssh takes about 2 or 3 > seconds. I try to call manual fence_apc from the command line like so : > > fence_apc -a ip -l user -p pass -n node101 -x -v > > and I get very rapidly : > > Unable to connect/login to fencing device > > Netstat shows me a time_wait connection, so it has made a tcp connection. > > Any hints ? It would be very useful if you could collect the output from the verbose log and send it to Marek (in CC). Also, what version of agents are you using? OS? Fabio From amit.jathar at alepo.com Wed Apr 6 11:40:02 2011 From: amit.jathar at alepo.com (Amit Jathar) Date: Wed, 6 Apr 2011 11:40:02 +0000 Subject: [Linux-cluster] Need help on the femce_vmware configuration ... !! Message-ID: Hi, I am facing following issues:- I have installed VMware-VIPerl-1.6.0-104313.x86_64.tar.gz & VMware-vSphere-Perl-SDK-4.1.0-254719.x86_64.tar.gz on the RHEL6 image on the Esxi server. My ESXi server hostname is "esx5" & the VM image running on it is "Cluster_1" 1) If I run the command using the Esxi server credentials, I get following output :- [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a x.x.x.x -o status -l root -p password -n esx5 fence_vmware_helper returned Cannot find vm Esxistd! 2) If I run the command using the credential of the VM image running on the Esxi server, I get following output :- [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a 172.16.201.132 -o status -l root -p password -n OEL6_VIP_1 fence_vmware_helper returned Cannot connect to server! 
VMware error:SOAP request error - possibly a protocol issue: 404 Not Found
Not Found
The requested URL /sdk/webService was not found on this server.
Apache/2.2.15 (Oracle) Server at x.x.x.x Port 443
I can ping that node "esx5" :- PING esx5.localdomain (x.x.x.x) 56(84) bytes of data. 64 bytes from esx5.localdomain (x.x.x.x): icmp_seq=1 ttl=64 time=0.064 ms Can you please help me in finding the issue ? Thanks, Amit ________________________________ This email (message and any attachment) is confidential and may be privileged. If you are not certain that you are the intended recipient, please notify the sender immediately by replying to this message, and delete all copies of this message and attachments. Any other use of this email by you is prohibited. ________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From nathan at nathanpowell.org Wed Apr 6 12:10:14 2011 From: nathan at nathanpowell.org (Nathan Powell) Date: Wed, 6 Apr 2011 08:10:14 -0400 Subject: [Linux-cluster] Need help on the femce_vmware configuration ... !! In-Reply-To: References: Message-ID: On Wed, Apr 6, 2011 at 7:40 AM, Amit Jathar wrote: > My ESXi server hostname is "esx5" & the VM image running on it is > ?Cluster_1? ... > [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a x.x.x.x -o status -l root -p > password -n esx5 I have never used fence_vmware however a quick look at that command, and then at the man page reveals that '-n esx5' shouldn't work, if what you say above is true. -n Physical plug number on device or name of virtual machine -- Nathan Powell Linux System Administrator "Worry never robs tomorrow of it's sorrow. It only saps today of it's joy." ~ Leo Buscaglia From amit.jathar at alepo.com Wed Apr 6 12:32:40 2011 From: amit.jathar at alepo.com (Amit Jathar) Date: Wed, 6 Apr 2011 12:32:40 +0000 Subject: [Linux-cluster] Need help on the femce_vmware configuration ... !! In-Reply-To: References: Message-ID: Hi Nathan, Thanks for the reply . The esx5 is the hostname of the ESXi server. Also, I don't know what is the physical plug number? How to get that value? Thanks, Amit -----Original Message----- From: Nathan Powell [mailto:nathan at nathanpowell.org] Sent: Wednesday, April 06, 2011 5:40 PM To: linux clustering Cc: Amit Jathar Subject: Re: [Linux-cluster] Need help on the femce_vmware configuration ... !! On Wed, Apr 6, 2011 at 7:40 AM, Amit Jathar wrote: > My ESXi server hostname is "esx5" & the VM image running on it is > "Cluster_1" ... > [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a x.x.x.x -o status -l > root -p password -n esx5 I have never used fence_vmware however a quick look at that command, and then at the man page reveals that '-n esx5' shouldn't work, if what you say above is true. -n Physical plug number on device or name of virtual machine -- Nathan Powell Linux System Administrator "Worry never robs tomorrow of it's sorrow. It only saps today of it's joy." ~ Leo Buscaglia ________________________________ This email (message and any attachment) is confidential and may be privileged. If you are not certain that you are the intended recipient, please notify the sender immediately by replying to this message, and delete all copies of this message and attachments. Any other use of this email by you is prohibited. ________________________________ From nathan at nathanpowell.org Wed Apr 6 12:33:31 2011 From: nathan at nathanpowell.org (Nathan Powell) Date: Wed, 6 Apr 2011 08:33:31 -0400 Subject: [Linux-cluster] Need help on the femce_vmware configuration ... !! In-Reply-To: References: Message-ID: It says OR the name of the virtual machine. Not the host, the guest. 
On Wed, Apr 6, 2011 at 8:32 AM, Amit Jathar wrote: > Hi Nathan, > > Thanks for the reply . > The esx5 is the hostname of the ESXi server. > Also, I don't know what is the physical plug number? How to get that value? > > Thanks, > Amit > > -----Original Message----- > From: Nathan Powell [mailto:nathan at nathanpowell.org] > Sent: Wednesday, April 06, 2011 5:40 PM > To: linux clustering > Cc: Amit Jathar > Subject: Re: [Linux-cluster] Need help on the femce_vmware configuration ... !! > > On Wed, Apr 6, 2011 at 7:40 AM, Amit Jathar wrote: >> My ESXi server hostname is "esx5" & the VM image running on it is >> "Cluster_1" > > ... > >> [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a x.x.x.x -o status -l >> root -p password -n esx5 > > I have never used fence_vmware however a quick look at that command, and then at the man page reveals that '-n esx5' shouldn't work, if what you say above is true. > > ? -n ? ? ? ?Physical plug number on device or name of virtual machine > > -- > Nathan Powell > Linux System Administrator > > "Worry never robs tomorrow of it's sorrow. It only saps today of it's joy." ~ Leo Buscaglia > > > ________________________________ > This email (message and any attachment) is confidential and may be privileged. If you are not certain that you are the intended recipient, please notify the sender immediately by replying to this message, and delete all copies of this message and attachments. Any other use of this email by you is prohibited. > ________________________________ > > > -- Nathan Powell Linux System Administrator "Worry never robs tomorrow of it's sorrow. It only saps today of it's joy." ~ Leo Buscaglia From amit.jathar at alepo.com Wed Apr 6 12:42:43 2011 From: amit.jathar at alepo.com (Amit Jathar) Date: Wed, 6 Apr 2011 12:42:43 +0000 Subject: [Linux-cluster] Need help on the femce_vmware configuration ... !! In-Reply-To: References: Message-ID: Right. The hostname& IP of the guest machine also tried but it gives error :- [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a 172.16.201.132 -o status -l root -p password -n OEL6_VIP_1 fence_vmware_helper returned Cannot connect to server! VMware error:SOAP request error - possibly a protocol issue: 404 Not Found

Not Found
The requested URL /sdk/webService was not found on this server.
Apache/2.2.15 (Oracle) Server at x.x.x.x Port 443
-----Original Message----- From: Nathan Powell [mailto:nathan at nathanpowell.org] Sent: Wednesday, April 06, 2011 6:04 PM To: Amit Jathar Cc: linux clustering Subject: Re: [Linux-cluster] Need help on the femce_vmware configuration ... !! It says OR the name of the virtual machine. Not the host, the guest. On Wed, Apr 6, 2011 at 8:32 AM, Amit Jathar wrote: > Hi Nathan, > > Thanks for the reply . > The esx5 is the hostname of the ESXi server. > Also, I don't know what is the physical plug number? How to get that value? > > Thanks, > Amit > > -----Original Message----- > From: Nathan Powell [mailto:nathan at nathanpowell.org] > Sent: Wednesday, April 06, 2011 5:40 PM > To: linux clustering > Cc: Amit Jathar > Subject: Re: [Linux-cluster] Need help on the femce_vmware configuration ... !! > > On Wed, Apr 6, 2011 at 7:40 AM, Amit Jathar wrote: >> My ESXi server hostname is "esx5" & the VM image running on it is >> "Cluster_1" > > ... > >> [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a x.x.x.x -o status -l >> root -p password -n esx5 > > I have never used fence_vmware however a quick look at that command, and then at the man page reveals that '-n esx5' shouldn't work, if what you say above is true. > > -n Physical plug number on device or name of virtual > machine > > -- > Nathan Powell > Linux System Administrator > > "Worry never robs tomorrow of it's sorrow. It only saps today of it's > joy." ~ Leo Buscaglia > > > ________________________________ > This email (message and any attachment) is confidential and may be privileged. If you are not certain that you are the intended recipient, please notify the sender immediately by replying to this message, and delete all copies of this message and attachments. Any other use of this email by you is prohibited. > ________________________________ > > > -- Nathan Powell Linux System Administrator "Worry never robs tomorrow of it's sorrow. It only saps today of it's joy." ~ Leo Buscaglia ________________________________ This email (message and any attachment) is confidential and may be privileged. If you are not certain that you are the intended recipient, please notify the sender immediately by replying to this message, and delete all copies of this message and attachments. Any other use of this email by you is prohibited. ________________________________ From nathan at nathanpowell.org Wed Apr 6 12:46:10 2011 From: nathan at nathanpowell.org (Nathan Powell) Date: Wed, 6 Apr 2011 08:46:10 -0400 Subject: [Linux-cluster] Need help on the femce_vmware configuration ... !! In-Reply-To: References: Message-ID: On Wed, Apr 6, 2011 at 8:42 AM, Amit Jathar wrote: > Right. The hostname& IP of the guest machine also tried but it gives error :- > [root at OEL6_VIP_1 ~]# /usr/sbin/fence_vmware -a 172.16.201.132 -o status -l root -p password -n OEL6_VIP_1 > fence_vmware_helper returned Cannot connect to server! That's a different error. And now you have something to troubleshoot. ;) My point was just that you were not using the command properly. Since I have no experience with this tool, you will have to wait for someone else to come along now. -- Nathan Powell Linux System Administrator "Worry never robs tomorrow of it's sorrow. It only saps today of it's joy." 
~ Leo Buscaglia From rossnick-lists at cybercat.ca Wed Apr 6 13:02:35 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 6 Apr 2011 09:02:35 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com> Message-ID: <48C9F07371214F2AA678938FDE5BD3B5@versa> (...) >> >> fence_apc -a ip -l user -p pass -n node101 -x -v >> >> and I get very rapidly : >> >> Unable to connect/login to fencing device >> >> Netstat shows me a time_wait connection, so it has made a tcp connection. >> >> Any hints ? > > It would be very useful if you could collect the output from the verbose > log and send it to Marek (in CC). > > Also, what version of agents are you using? OS? I am on RHEL6, with fence-agents version 3.0.12.8.el6_0.3 (so, up 2 date). When executed, the command above only display the error I mentionned (unable to connect). If I add --debug-file to the command line, the file id creates is empty. I also tried by re-enabeling telnet instead of ssh, and I got the same result, except that now the debug file looks like : -------------------- telnet> set binary Negotiating binary mode with remote host. telnet> open 1.1.1.1 -23 Trying 1.1.1.1... Connected to 1.1.1.1. Escape character is '^]'. User Name : user Password -------------------- I replaced username and ip with fake ones. Regards, From fdinitto at redhat.com Wed Apr 6 13:38:38 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 06 Apr 2011 15:38:38 +0200 Subject: [Linux-cluster] fence_apc and Apc AP-8941 In-Reply-To: <48C9F07371214F2AA678938FDE5BD3B5@versa> References: <4D9C06D2.1040106@redhat.com> <48C9F07371214F2AA678938FDE5BD3B5@versa> Message-ID: <4D9C6CDE.4080402@redhat.com> Nicolas, please report the issue via GSS. Marek can start looking into it. Fabio On 4/6/2011 3:02 PM, Nicolas Ross wrote: > (...) > >>> >>> fence_apc -a ip -l user -p pass -n node101 -x -v >>> >>> and I get very rapidly : >>> >>> Unable to connect/login to fencing device >>> >>> Netstat shows me a time_wait connection, so it has made a tcp >>> connection. >>> >>> Any hints ? >> >> It would be very useful if you could collect the output from the verbose >> log and send it to Marek (in CC). >> >> Also, what version of agents are you using? OS? > > I am on RHEL6, with fence-agents version 3.0.12.8.el6_0.3 (so, up 2 date). > > When executed, the command above only display the error I mentionned > (unable to connect). If I add --debug-file to the command line, the file > id creates is empty. > > I also tried by re-enabeling telnet instead of ssh, and I got the same > result, except that now the debug file looks like : > > -------------------- > telnet> set binary > Negotiating binary mode with remote host. > telnet> open 1.1.1.1 -23 > Trying 1.1.1.1... > Connected to 1.1.1.1. > Escape character is '^]'. > > User Name : user > Password > -------------------- > I replaced username and ip with fake ones. > > Regards, > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From mgrac at redhat.com Wed Apr 6 14:19:57 2011 From: mgrac at redhat.com (Marek Grac) Date: Wed, 06 Apr 2011 16:19:57 +0200 Subject: [Linux-cluster] fence_apc and Apc AP-8941 In-Reply-To: <48C9F07371214F2AA678938FDE5BD3B5@versa> References: <4D9C06D2.1040106@redhat.com> <48C9F07371214F2AA678938FDE5BD3B5@versa> Message-ID: <4D9C768D.4060106@redhat.com> On 04/06/2011 03:02 PM, Nicolas Ross wrote: > (...) 
> >>> >>> fence_apc -a ip -l user -p pass -n node101 -x -v >>> >>> and I get very rapidly : >>> >>> Unable to connect/login to fencing device >>> >>> Netstat shows me a time_wait connection, so it has made a tcp >>> connection. >>> >>> Any hints ? >> > I am on RHEL6, with fence-agents version 3.0.12.8.el6_0.3 (so, up 2 > date). > > When executed, the command above only display the error I mentionned > (unable to connect). If I add --debug-file to the command line, the > file id creates is empty. > > I also tried by re-enabeling telnet instead of ssh, and I got the same > result, except that now the debug file looks like : > > -------------------- > telnet> set binary > Negotiating binary mode with remote host. > telnet> open 1.1.1.1 -23 > Trying 1.1.1.1... > Connected to 1.1.1.1. > Escape character is '^]'. > > User Name : user > Password > -------------------- > I replaced username and ip with fake ones. > > Regards, If response is too fast then problem is in connecting information/process. If it took long enough (timeout problem) then it can be problem with change in command prompt. If it is possible please send me what it is displayed when you are trying to do it manually. m, From rossnick-lists at cybercat.ca Wed Apr 6 14:51:08 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 6 Apr 2011 10:51:08 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com><48C9F07371214F2AA678938FDE5BD3B5@versa> <4D9C6CDE.4080402@redhat.com> Message-ID: > Nicolas, please report the issue via GSS. > > Marek can start looking into it. > > Fabio > Sorry, what's GSS ? Is it bugzilla.redhat.com ? From rossnick-lists at cybercat.ca Wed Apr 6 15:06:09 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 6 Apr 2011 11:06:09 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com> <48C9F07371214F2AA678938FDE5BD3B5@versa> <4D9C768D.4060106@redhat.com> Message-ID: <8C77012A023D431CB5AA58E6CC676A35@versa> (...) >> When executed, the command above only display the error I mentionned >> (unable to connect). If I add --debug-file to the command line, the file >> id creates is empty. >> >> I also tried by re-enabeling telnet instead of ssh, and I got the same >> result, except that now the debug file looks like : >> >> -------------------- >> telnet> set binary >> Negotiating binary mode with remote host. >> telnet> open 1.1.1.1 -23 >> Trying 1.1.1.1... >> Connected to 1.1.1.1. >> Escape character is '^]'. >> >> User Name : user >> Password >> -------------------- >> I replaced username and ip with fake ones. >> >> Regards, > > If response is too fast then problem is in connecting information/process. > If it took long enough (timeout problem) then it can be problem with > change in command prompt. If it is possible please send me what it is > displayed when you are trying to do it manually. Response is indeed very fast when trying with the agent. With tcpdump on the node I try with the agent, I see ssh packets go to and from the apc switch. 
When ssh-iing to the apc switch, it takes about 2 or 3 seconds before I get the password prompt, and then I see : ------------------------------ user at 1.1.1.1's password: American Power Conversion Network Management Card AOS v5.1.2 (c) Copyright 2009 All Rights Reserved RPDU 2g v5.1.0 ------------------------------------------------------------------------------- Name : Unknown Date : 04/06/2011 Contact : Unknown Time : 10:52:44 Location : Unknown User : Device Manager Up Time : 0 Days 1 Hour 52 Minutes Stat : P+ N4+ N6+ A+ Type ? for command listing Use tcpip command for IP address(-i), subnet(-s), and gateway(-g) apc>qConnection to 1.1.1.1 closed. ------------------------------ I think I have found the problem... Looking at fence_apc, I see : options["ssh_options"] = "-1 -c blowfish" Now, if I add this to my ssh command like so : ssh -1 -c blowfish user at 1.1.1.1 I get : Protocol major versions differ: 1 vs. 2 So, there is no ssh version 1 on this version of the apc switrch. I commented out that line in /usr/sbin/fence_apc, and now the fence agent is able to establish the connection, but it cannot go any further. The only thing that shows in my debug file is : ---------------------------- user at 1.1.1.1's password: ---------------------------- From fdinitto at redhat.com Wed Apr 6 19:27:09 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 06 Apr 2011 21:27:09 +0200 Subject: [Linux-cluster] fence_apc and Apc AP-8941 In-Reply-To: References: <4D9C06D2.1040106@redhat.com><48C9F07371214F2AA678938FDE5BD3B5@versa> <4D9C6CDE.4080402@redhat.com> Message-ID: <4D9CBE8D.9010806@redhat.com> On 04/06/2011 04:51 PM, Nicolas Ross wrote: >> Nicolas, please report the issue via GSS. >> >> Marek can start looking into it. >> >> Fabio >> > > Sorry, what's GSS ? Is it bugzilla.redhat.com ? Red Hat Global Support Service... the one you contact to report customer/product related issues. No it's not bugzilla. Fabio From danielgore at yaktech.com Wed Apr 6 22:42:45 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Wed, 06 Apr 2011 18:42:45 -0400 Subject: [Linux-cluster] nfs4 kerberos Message-ID: <1302129765.23236.2.camel@hawku> I am trying to get Kerberos authenticated high available NFS service running. I have looked at the cookbook, but it does not cover this. Any ideas? Thank you Dan -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From cthulhucalling at gmail.com Wed Apr 6 23:14:27 2011 From: cthulhucalling at gmail.com (Ian Hayes) Date: Wed, 6 Apr 2011 16:14:27 -0700 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <1302129765.23236.2.camel@hawku> References: <1302129765.23236.2.camel@hawku> Message-ID: I've done some work on clustering NFSv4 using Kerberos at a previous job.... I probably did this completely wrong, but I did get it working. The big gotcha that I had was that all cluster members need the same keytab for the NFS service. I also had to have the active node change its hostname to match the keytab before it started up NFS. There are the usual NFS4 specific stuff you need to do like /etc/exports and building the pseudo filesystem. I did a few bind mounts to get everything under the pseudo-fs. Obviously I'm assuming that you have NFS4 working on a single-node environment and therefore know what to do to get that working (ie, keytabs for the clients). The cluster I had built was hosting NFS4 and Samba, with a shared GFS filesystem on an iSCSI backend. It ran pretty decent for secondhand test equipment. 
I was actually able to benchmark the GFS performance while I tuned the GFS with a little script that wrote out randomly sized files. I did some extensive build documentation of how to build a Kerberized NFS4 cluster, but I doubt my old employer would be willing to release them. But like Henry Jones, Sr., I wrote them down so I wouldn't have to remember them. On Wed, Apr 6, 2011 at 3:42 PM, Daniel R. Gore wrote: > I am trying to get Kerberos authenticated high available NFS service > running. I have looked at the cookbook, but it does not cover this. > > Any ideas? > > Thank you > > Dan > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From danielgore at yaktech.com Thu Apr 7 00:23:37 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Wed, 06 Apr 2011 20:23:37 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> Message-ID: <1302135817.23236.19.camel@hawku> Ian, Thanks for the info. My cluster is only a two node cluster. I have NFSv4 with Kerberos working on both node separately. I went and created a virtual IP on each node with the same IP to accommodate the floating IP. I associated the virtual IP with a new DNS name (fserv) and ensured forward and reverse look-up works. I create Kerberos host and nfs principals for fserv and added the associated keys to /etc/krb5.keytab on each node. Unfortunately, it still does not work and I am sure one of the reasons is because the "uname -n" comes up as the node name and not fserv. I also suspect that the nfs service that gets started through Redhat's HA service does not use the /etc/exports file on the nodes. How did you manage to change the nodes name when the nfs server was started? What worries me about that is then other services will like fail. Any guidance is appreciated. Thanks. Dan On Wed, 2011-04-06 at 16:14 -0700, Ian Hayes wrote: > I've done some work on clustering NFSv4 using Kerberos at a previous > job.... I probably did this completely wrong, but I did get it > working. The big gotcha that I had was that all cluster members need > the same keytab for the NFS service. I also had to have the active > node change its hostname to match the keytab before it started up NFS. > There are the usual NFS4 specific stuff you need to do > like /etc/exports and building the pseudo filesystem. I did a few bind > mounts to get everything under the pseudo-fs. Obviously I'm assuming > that you have NFS4 working on a single-node environment and therefore > know what to do to get that working (ie, keytabs for the clients). > > The cluster I had built was hosting NFS4 and Samba, with a shared GFS > filesystem on an iSCSI backend. It ran pretty decent for secondhand > test equipment. I was actually able to benchmark the GFS performance > while I tuned the GFS with a little script that wrote out randomly > sized files. > > I did some extensive build documentation of how to build a Kerberized > NFS4 cluster, but I doubt my old employer would be willing to release > them. But like Henry Jones, Sr., I wrote them down so I wouldn't have > to remember them. > > On Wed, Apr 6, 2011 at 3:42 PM, Daniel R. Gore > wrote: > I am trying to get Kerberos authenticated high available NFS > service > running. 
I have looked at the cookbook, but it does not cover > this. > > Any ideas? > > Thank you > > Dan > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From danielgore at yaktech.com Thu Apr 7 01:01:00 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Wed, 06 Apr 2011 21:01:00 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <1302135817.23236.19.camel@hawku> References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> Message-ID: <1302138060.24349.7.camel@hawku> I also found this thread, after many searches. http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html As I read through it, there appears to be a patch for rpc.gssd which allows for the daemon to be started and associated with multiple hosts. I do not want to compile rpc.gssd and it appears the patch is from over two years ago. I would hope that RHEL6 would have rpc.gssd patched to meet this requirement, but no documentation appear to exist for how to use it. On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > Ian, > > Thanks for the info. > > My cluster is only a two node cluster. I have NFSv4 with Kerberos > working on both node separately. I went and created a virtual IP on > each node with the same IP to accommodate the floating IP. I associated > the virtual IP with a new DNS name (fserv) and ensured forward and > reverse look-up works. I create Kerberos host and nfs principals for > fserv and added the associated keys to /etc/krb5.keytab on each node. > > Unfortunately, it still does not work and I am sure one of the reasons > is because the "uname -n" comes up as the node name and not fserv. > > I also suspect that the nfs service that gets started through Redhat's > HA service does not use the /etc/exports file on the nodes. > > How did you manage to change the nodes name when the nfs server was > started? What worries me about that is then other services will like > fail. > > Any guidance is appreciated. > > Thanks. > > Dan > > On Wed, 2011-04-06 at 16:14 -0700, Ian Hayes wrote: > > I've done some work on clustering NFSv4 using Kerberos at a previous > > job.... I probably did this completely wrong, but I did get it > > working. The big gotcha that I had was that all cluster members need > > the same keytab for the NFS service. I also had to have the active > > node change its hostname to match the keytab before it started up NFS. > > There are the usual NFS4 specific stuff you need to do > > like /etc/exports and building the pseudo filesystem. I did a few bind > > mounts to get everything under the pseudo-fs. Obviously I'm assuming > > that you have NFS4 working on a single-node environment and therefore > > know what to do to get that working (ie, keytabs for the clients). > > > > The cluster I had built was hosting NFS4 and Samba, with a shared GFS > > filesystem on an iSCSI backend. It ran pretty decent for secondhand > > test equipment. 
I was actually able to benchmark the GFS performance > > while I tuned the GFS with a little script that wrote out randomly > > sized files. > > > > I did some extensive build documentation of how to build a Kerberized > > NFS4 cluster, but I doubt my old employer would be willing to release > > them. But like Henry Jones, Sr., I wrote them down so I wouldn't have > > to remember them. > > > > On Wed, Apr 6, 2011 at 3:42 PM, Daniel R. Gore > > wrote: > > I am trying to get Kerberos authenticated high available NFS > > service > > running. I have looked at the cookbook, but it does not cover > > this. > > > > Any ideas? > > > > Thank you > > > > Dan > > > > > > -- > > This message has been scanned for viruses and > > dangerous content by MailScanner, and is > > believed to be clean. > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > This message has been scanned for viruses and > > dangerous content by MailScanner, and is > > believed to be clean. > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jumanjiman at gmail.com Thu Apr 7 01:23:41 2011 From: jumanjiman at gmail.com (Paul Morgan) Date: Wed, 6 Apr 2011 21:23:41 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <1302129765.23236.2.camel@hawku> References: <1302129765.23236.2.camel@hawku> Message-ID: On Apr 6, 2011 6:52 PM, "Daniel R. Gore" wrote: > > I am trying to get Kerberos authenticated high available NFS service > running. I have looked at the cookbook, but it does not cover this. > > Any ideas? I punted. I created a xen vm on rhel 5, where rhcs managed the vm as a resource... live migration across the physical cluster nodes, relatively minor aggravation if the vm went away and was restarted. -------------- next part -------------- An HTML attachment was scrubbed... URL: From marco.huang at auckland.ac.nz Thu Apr 7 01:38:25 2011 From: marco.huang at auckland.ac.nz (Marco Huang) Date: Thu, 7 Apr 2011 13:38:25 +1200 Subject: [Linux-cluster] gfs2_grow - Error writing new rindex entries; aborted. Message-ID: <8F6D445F-59AC-4AF0-B38A-9738574DC1ED@auckland.ac.nz> Hi, We are running Cenos5.5 (gfs2-utils.x86_64 v0.1.62-20.el5) cluster. We want to add another 5T disk space on the filesystem - expand from 19T to 25T, however it doesn't grow over 20T. There was no error when I did test run. So just wondering if there is a limitation of gfs2_grow which only can grow gfs2 filesystem upto 20T ? Is there anyone had same experience? Test run # gfs2_grow -T /mnt/fsbackup (Test mode--File system will not be changed) FS: Mount Point: /mnt/fsbackup FS: Device: /dev/mapper/fsbackup-fsbackup01 FS: Size: 4882811901 (0x12309cbfd) FS: RG size: 524244 (0x7ffd4) DEV: Size: 6103514112 (0x16bcc3c00) The file system grew by 4768368MB. gfs2_grow complete. Actual run # gfs2_grow /mnt/fsbackup FS: Mount Point: /mnt/fsbackup FS: Device: /dev/mapper/fsbackup-fsbackup01 FS: Size: 4882811901 (0x12309cbfd) FS: RG size: 524244 (0x7ffd4) DEV: Size: 6103514112 (0x16bcc3c00) The file system grew by 4768368MB. Error writing new rindex entries;aborted. gfs2_grow complete. 
Before gfs2_grow # df -h /mnt/fsbackup/ Filesystem Size Used Avail Use% Mounted on /dev/mapper/fsbackup-fsbackup01 19T 19T 651M 100% /mnt/fsbackup # gfs2_tool df /mnt/fsbackup: SB lock proto = "lock_dlm" SB lock table = "FSC:fsbackup01" SB ondisk format = 1801 SB multihost format = 1900 Block size = 4096 Journals = 8 Resource Groups = 10112 Mounted lock proto = "lock_dlm" Mounted lock table = "FSC:fsbackup01" Mounted host data = "jid=0:id=65539:first=1" Journal number = 0 Lock module flags = 0 Local flocks = FALSE Local caching = FALSE Type Total Blocks Used Blocks Free Blocks use% ------------------------------------------------------------------------ data 5300270360 4882310027 417960333 92% inodes 447901122 29940789 417960333 7% After gfs2_grow # df -h /mnt/fsbackup/ Filesystem Size Used Avail Use% Mounted on /dev/mapper/fsbackup-fsbackup01 20T 19T 1.6T 93% /mnt/fsbackup # gfs2_tool df /mnt/fsbackup: SB lock proto = "lock_dlm" SB lock table = "FSC:fsbackup01" SB ondisk format = 1801 SB multihost format = 1900 Block size = 4096 Journals = 8 Resource Groups = 9314 Mounted lock proto = "lock_dlm" Mounted lock table = "FSC:fsbackup01" Mounted host data = "jid=0:id=65539:first=1" Journal number = 0 Lock module flags = 0 Local flocks = FALSE Local caching = FALSE Type Total Blocks Used Blocks Free Blocks use% ------------------------------------------------------------------------ data 4882476584 4882310009 166575 100% inodes 30107364 29940789 166575 99% cheers -- Marco -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Thu Apr 7 01:52:18 2011 From: cthulhucalling at gmail.com (Ian Hayes) Date: Wed, 6 Apr 2011 18:52:18 -0700 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> Message-ID: Shouldnt have to recompile rpc.gssd. On failover I migrated the ip address first, made portmapper a depend on the ip, rpc.gssd depend on portmap and nfsd depend on rpc. As for the hostname, I went with the inelegant solution of putting a 'hostname' command in the start functions of the portmapper script since that fires first in my config. On Apr 6, 2011 6:06 PM, "Daniel R. Gore" wrote: I also found this thread, after many searches. http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html As I read through it, there appears to be a patch for rpc.gssd which allows for the daemon to be started and associated with multiple hosts. I do not want to compile rpc.gssd and it appears the patch is from over two years ago. I would hope that RHEL6 would have rpc.gssd patched to meet this requirement, but no documentation appear to exist for how to use it. On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > Ian, > > Thanks for the info. > >... -------------- next part -------------- An HTML attachment was scrubbed... URL: From marco.huang at auckland.ac.nz Thu Apr 7 03:22:42 2011 From: marco.huang at auckland.ac.nz (Marco Huang) Date: Thu, 7 Apr 2011 15:22:42 +1200 Subject: [Linux-cluster] gfs2_grow - Error writing new rindex entries; aborted. In-Reply-To: <8F6D445F-59AC-4AF0-B38A-9738574DC1ED@auckland.ac.nz> References: <8F6D445F-59AC-4AF0-B38A-9738574DC1ED@auckland.ac.nz> Message-ID: <39B09866-B1AC-4E93-BC0B-2EF625087E42@auckland.ac.nz> Problem resolved by removing some large files before do gfs2_grow, as it requires some free space to process "grow". 
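For the record, the sequence that worked was roughly the following (same mount point as above; which large file you delete is up to you):

  df -h /mnt/fsbackup          # see how much space is really free before growing
  rm /mnt/fsbackup/<some-large-file>
  gfs2_grow /mnt/fsbackup
  gfs2_tool df /mnt/fsbackup   # check that the new resource groups are there
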
refer to: https://bugzilla.redhat.com/show_bug.cgi?id=490649 On 7/04/2011, at 1:38 PM, Marco Huang wrote: > Hi, > > We are running Cenos5.5 (gfs2-utils.x86_64 v0.1.62-20.el5) cluster. We want to add another 5T disk space on the filesystem - expand from 19T to 25T, however it doesn't grow over 20T. There was no error when I did test run. So just wondering if there is a limitation of gfs2_grow which only can grow gfs2 filesystem upto 20T ? Is there anyone had same experience? > > Test run > # gfs2_grow -T /mnt/fsbackup > (Test mode--File system will not be changed) > FS: Mount Point: /mnt/fsbackup > FS: Device: /dev/mapper/fsbackup-fsbackup01 > FS: Size: 4882811901 (0x12309cbfd) > FS: RG size: 524244 (0x7ffd4) > DEV: Size: 6103514112 (0x16bcc3c00) > The file system grew by 4768368MB. > gfs2_grow complete. > > Actual run > # gfs2_grow /mnt/fsbackup > FS: Mount Point: /mnt/fsbackup > FS: Device: /dev/mapper/fsbackup-fsbackup01 > FS: Size: 4882811901 (0x12309cbfd) > FS: RG size: 524244 (0x7ffd4) > DEV: Size: 6103514112 (0x16bcc3c00) > The file system grew by 4768368MB. > Error writing new rindex entries;aborted. > gfs2_grow complete. > > > Before gfs2_grow > # df -h /mnt/fsbackup/ > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/fsbackup-fsbackup01 > 19T 19T 651M 100% /mnt/fsbackup > > # gfs2_tool df > /mnt/fsbackup: > SB lock proto = "lock_dlm" > SB lock table = "FSC:fsbackup01" > SB ondisk format = 1801 > SB multihost format = 1900 > Block size = 4096 > Journals = 8 > Resource Groups = 10112 > Mounted lock proto = "lock_dlm" > Mounted lock table = "FSC:fsbackup01" > Mounted host data = "jid=0:id=65539:first=1" > Journal number = 0 > Lock module flags = 0 > Local flocks = FALSE > Local caching = FALSE > > Type Total Blocks Used Blocks Free Blocks use% > ------------------------------------------------------------------------ > data 5300270360 4882310027 417960333 92% > inodes 447901122 29940789 417960333 7% > > After gfs2_grow > # df -h /mnt/fsbackup/ > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/fsbackup-fsbackup01 > 20T 19T 1.6T 93% /mnt/fsbackup > > > # gfs2_tool df > /mnt/fsbackup: > SB lock proto = "lock_dlm" > SB lock table = "FSC:fsbackup01" > SB ondisk format = 1801 > SB multihost format = 1900 > Block size = 4096 > Journals = 8 > Resource Groups = 9314 > Mounted lock proto = "lock_dlm" > Mounted lock table = "FSC:fsbackup01" > Mounted host data = "jid=0:id=65539:first=1" > Journal number = 0 > Lock module flags = 0 > Local flocks = FALSE > Local caching = FALSE > > Type Total Blocks Used Blocks Free Blocks use% > ------------------------------------------------------------------------ > data 4882476584 4882310009 166575 100% > inodes 30107364 29940789 166575 99% > > > > cheers > -- > Marco > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Thu Apr 7 05:55:19 2011 From: cthulhucalling at gmail.com (Ian Hayes) Date: Wed, 6 Apr 2011 22:55:19 -0700 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> Message-ID: I whipped up a quick NFS4 cluster in Xen after I got home, and tried to remember what I did to make it work. After I bit, it all fell back into place. This is quick and dirty, and not how I would do things in production, but it's a good start. 
Note that I didn't set up a shared filesystem, but that should be academic at this point 1) Create your nfs/nfsserver.mydomain keytab 2) Copy keytab to both node1 and node2 3) Modify /etc/init.d/portmap- in the start function, add "hostname nfsserver.mydomain". In the stop function, add "hostname nodeX.mydomain" 4) Drop something that looks like the attached cluster.conf file in /etc/cluster 5) Set up your exports: /exports gss/krb5p(rw,async,fsid=0) 6) Start CMAN and RGManager 7) ? 8) Profit - mount -t nfs4 nfsserver.mydomain:/ /mnt/exports -o sec=krb5p The trick here is that we change the hostname before any Kerberized services start, so it will be happy when it tries to read the keytab. Also, I use all Script resources instead of the NFS resource. I never really liked it, and I'm old and set in my ways. This works, and I'm certain that it reads /etc/exports. First, we set up the IP, then start each necessary daemon as a dependency for the next. I've been bouncing the service back and forth for the last 10 minutes and only suffering from a complaint of a stale NFS mount on my client whenever I failover. On Wed, Apr 6, 2011 at 6:52 PM, Ian Hayes wrote: > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip address > first, made portmapper a depend on the ip, rpc.gssd depend on portmap and > nfsd depend on rpc. As for the hostname, I went with the inelegant solution > of putting a 'hostname' command in the start functions of the portmapper > script since that fires first in my config. > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" wrote: > > I also found this thread, after many searches. > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > As I read through it, there appears to be a patch for rpc.gssd which > allows for the daemon to be started and associated with multiple hosts. > I do not want to compile rpc.gssd and it appears the patch is from over > two years ago. I would hope that RHEL6 would have rpc.gssd patched to > meet this requirement, but no documentation appear to exist for how to > use it. > > > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > > Ian, > > > > Thanks for the info. > > > >... > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From danielgore at yaktech.com Thu Apr 7 10:34:19 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Thu, 07 Apr 2011 06:34:19 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> Message-ID: <1302172459.25191.0.camel@hawku> Thanks Ian! I will try and work on it today. Dan On Wed, 2011-04-06 at 22:55 -0700, Ian Hayes wrote: > I whipped up a quick NFS4 cluster in Xen after I got home, and tried > to remember what I did to make it work. After I bit, it all fell back > into place. This is quick and dirty, and not how I would do things in > production, but it's a good start. Note that I didn't set up a shared > filesystem, but that should be academic at this point > > 1) Create your nfs/nfsserver.mydomain keytab > 2) Copy keytab to both node1 and node2 > 3) Modify /etc/init.d/portmap- in the start function, add "hostname > nfsserver.mydomain". In the stop function, add "hostname > nodeX.mydomain" > 4) Drop something that looks like the attached cluster.conf file > in /etc/cluster > 5) Set up your exports: /exports gss/krb5p(rw,async,fsid=0) > 6) Start CMAN and RGManager > 7) ? 
> 8) Profit - mount -t nfs4 nfsserver.mydomain:/ /mnt/exports -o > sec=krb5p > > The trick here is that we change the hostname before any Kerberized > services start, so it will be happy when it tries to read the keytab. > Also, I use all Script resources instead of the NFS resource. I never > really liked it, and I'm old and set in my ways. This works, and I'm > certain that it reads /etc/exports. First, we set up the IP, then > start each necessary daemon as a dependency for the next. I've been > bouncing the service back and forth for the last 10 minutes and only > suffering from a complaint of a stale NFS mount on my client whenever > I failover. > > > > > > votes="1"> > > > nodename="node1.mydomain"/> > > > > votes="1"> > > > nodename="node2.mydomain"/> > > > > > > > name="Fence_Manual"/> > > > > restricted="1"> > name="node1.mydomain" priority="1"/> > name="node2.mydomain" priority="1"/> > > > > > > > > > > > > > On Wed, Apr 6, 2011 at 6:52 PM, Ian Hayes > wrote: > Shouldnt have to recompile rpc.gssd. On failover I migrated > the ip address first, made portmapper a depend on the ip, > rpc.gssd depend on portmap and nfsd depend on rpc. As for the > hostname, I went with the inelegant solution of putting a > 'hostname' command in the start functions of the portmapper > script since that fires first in my config. > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > > wrote: > > > > I also found this thread, after many searches. > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > > > As I read through it, there appears to be a patch for > > rpc.gssd which > > allows for the daemon to be started and associated with > > multiple hosts. > > I do not want to compile rpc.gssd and it appears the patch > > is from over > > two years ago. I would hope that RHEL6 would have rpc.gssd > > patched to > > meet this requirement, but no documentation appear to exist > > for how to > > use it. > > > > > > > > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > > > Ian, > > > > > > Thanks for the info. > > > > > > > >... > > > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Colin.Simpson at iongeo.com Thu Apr 7 10:44:00 2011 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Thu, 07 Apr 2011 11:44:00 +0100 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> Message-ID: <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> That's interesting about making the portmapper dependant on the IP, was this for the same reason I'm seeing just now. I used the method from NFS cookbook where I pseudo load balancing by distributing my NFS exports across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) replies on the node IP and not the service IP, and this breaks NFSv3 mounts from RHEL5 clients with iptables stateful firewalls. I opened a bug on this one and have a call open with RH (via Dell) on this: https://bugzilla.redhat.com/show_bug.cgi?id=689589 But I too would like a good clean method of doing kerberized NFSv4 on a RHEL6 cluster. I thought NFSv4 being so central to RHEL6 this would be easy on a RHEL6 cluster (without using XEN)? Can the cookbook be updated? 
Which brings up another point. The RHEL cluster documentation is good, however it doesn't really help you implement a working cluster too easily (beyond the apache example), it's a bit reference orientated. I found myself googling around for examples of different RA types. Is there a more hands on set of docs around (or book)? It could almost do with a cookbook for every RA! Thanks Colin On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip > address first, made portmapper a depend on the ip, rpc.gssd depend on > portmap and nfsd depend on rpc. As for the hostname, I went with the > inelegant solution of putting a 'hostname' command in the start > functions of the portmapper script since that fires first in my > config. > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > > wrote: > > > > I also found this thread, after many searches. > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > > > As I read through it, there appears to be a patch for rpc.gssd which > > allows for the daemon to be started and associated with multiple > > hosts. > > I do not want to compile rpc.gssd and it appears the patch is from > > over > > two years ago. I would hope that RHEL6 would have rpc.gssd patched > > to > > meet this requirement, but no documentation appear to exist for how > > to > > use it. > > > > > > > > > > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > > > Ian, > > > > > > Thanks for the info. > > > > > >... > > > > plain text document attachment (ATT114553.txt) > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. From sophana78 at gmail.com Thu Apr 7 11:42:50 2011 From: sophana78 at gmail.com (Sophana K) Date: Thu, 7 Apr 2011 13:42:50 +0200 Subject: [Linux-cluster] very poor rm performance with gfs2 (el6) Message-ID: Hello I'm currently testing gfs2 under EL6 (2.6.32-71.18.1.el6.x86_64) 11 nodes over gigabit iscsi (target is IET on EL5) Application is a compute cluster with lot of files of various sizes. Application need close to write coherency and cache coherency. I can't tell yet about performance. It looks not bad, iscsi target shows a bandwidth of about 80Mbytes/s (mostly reads during compute). I have to compare to NFS to tell if it is good. Major problem: I noticed very very poor rm performance, down to 1 to 2 files erased per second in a rm -rf. Some of these files are not very big: a few kilo bytes. Cluster is idle during rm. mkfs.gfs2 was made just before compute test was made. here are some small strace taken from rm -rf: 12:32:38.824237 unlinkat(6, "cycle.c", 0) = 0 12:32:39.226444 unlinkat(6, "cycle.h", 0) = 0 12:32:40.277366 unlinkat(6, "cycle_.h", 0) = 0 ... 
13:38:09.719444 unlinkat(5, "BoxGrp._d988887c.+.vlog.vlog", 0) = 0 13:38:10.125942 unlinkat(5, "BoxGrp._d8335657.+.vlog.vlog.tog", 0) = 0 13:38:10.126037 unlinkat(5, "BoxGrp._8f2f6edf.+.vlog.vlog", 0) = 0 13:38:11.139070 unlinkat(5, "BoxGrp._cfc41f42..pkl", 0) = 0 13:38:11.139217 unlinkat(5, "BoxGrp._aec657d9..hier", 0) = 0 13:38:11.147072 unlinkat(5, "BoxGrp._cfc41f42..hier", 0) = 0 13:38:11.353478 unlinkat(5, "BoxGrp._47b2df42.+.vlog.vlog.tog", 0) = 0 13:38:11.575804 unlinkat(5, "BoxGrp._32107b60..hier", 0) = 0 13:38:12.637765 unlinkat(5, "BoxGrp._2e795200.+.vlog.vlog", 0) = 0 13:38:13.049503 unlinkat(5, "BoxGrp._32107b60..pkl", 0) = 0 13:38:13.468287 unlinkat(5, "BoxGrp._403da4c3..pkl", 0) = 0 13:38:14.107020 unlinkat(5, "BoxGrp._f07dc01b..pkl", 0) = 0 13:38:14.112050 unlinkat(5, "BoxGrp._27228204..hier", 0) = 0 13:38:14.144840 unlinkat(5, "BoxGrp._35ef97b2.+.vlog.vlog", 0) = 0 13:38:15.593250 unlinkat(5, "BoxGrp._47b2df42..pkl", 0) = 0 13:38:15.593437 unlinkat(5, "BoxGrp._e8cd3f8d..hier", 0) = 0 13:38:15.830560 unlinkat(5, "BoxGrp._35ef97b2.+.vlog.vlog.tog", 0) = 0 13:38:16.414388 unlinkat(5, "BoxGrp._403da4c3.+.vlog.vlog", 0) = 0 13:38:16.619618 unlinkat(5, "BoxGrp._d8335657..hier", 0) = 0 13:38:16.824062 unlinkat(5, "BoxGrp._d988887c.+.vlog.vlog.tog", 0) = 0 13:38:16.824397 unlinkat(5, "BoxGrp._808fb9e4..pkl", 0) = 0 13:38:17.077052 unlinkat(5, "BoxGrp._341e8e1f..pkl", 0) = 0 13:38:17.087327 unlinkat(5, "BoxGrp._47b2df42.+.vlog.vlog", 0) = 0 13:38:17.899539 unlinkat(5, "BoxGrp._341e8e1f..hier", 0) = 0 13:38:18.100736 unlinkat(5, "BoxGrp._d8335657.+.vlog.vlog", 0) = 0 13:38:18.319269 unlinkat(5, "BoxGrp._70bf96ba..pkl", 0) = 0 13:38:18.757326 unlinkat(5, "BoxGrp._d988887c..hier", 0) = 0 13:38:18.975545 unlinkat(5, "BoxGrp._35ef97b2..pkl", 0) = 0 13:38:19.568433 unlinkat(5, "BoxGrp._bb280eee..pkl", 0) = 0 Please advise if there is a solution for this. Thanks From danielgore at yaktech.com Thu Apr 7 12:08:50 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Thu, 07 Apr 2011 08:08:50 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> Message-ID: <1302178130.26066.19.camel@hawku> A better solution for NFSv4 in a cluster is really required. A better cookbook with more real life likely scenarios for clustering solutions would be really helpful. How many people actually setup the complex three layered solutions depicted, as compared to people setting up simple two/three node servers to for authorization, authentication, file and license serving. It appears that the small business applicable system is completely ignored. On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: > That's interesting about making the portmapper dependant on the IP, was > this for the same reason I'm seeing just now. I used the method from NFS > cookbook where I pseudo load balancing by distributing my NFS exports > across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) > replies on the node IP and not the service IP, and this breaks NFSv3 > mounts from RHEL5 clients with iptables stateful firewalls. > > I opened a bug on this one and have a call open with RH (via Dell) on > this: > https://bugzilla.redhat.com/show_bug.cgi?id=689589 > > But I too would like a good clean method of doing kerberized NFSv4 on a > RHEL6 cluster. 
I thought NFSv4 being so central to RHEL6 this would be > easy on a RHEL6 cluster (without using XEN)? Can the cookbook be > updated? > > Which brings up another point. The RHEL cluster documentation is good, > however it doesn't really help you implement a working cluster too > easily (beyond the apache example), it's a bit reference orientated. I > found myself googling around for examples of different RA types. Is > there a more hands on set of docs around (or book)? It could almost do > with a cookbook for every RA! > > Thanks > > Colin > > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip > > address first, made portmapper a depend on the ip, rpc.gssd depend on > > portmap and nfsd depend on rpc. As for the hostname, I went with the > > inelegant solution of putting a 'hostname' command in the start > > functions of the portmapper script since that fires first in my > > config. > > > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > > > wrote: > > > > > > I also found this thread, after many searches. > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > > > > > As I read through it, there appears to be a patch for rpc.gssd which > > > allows for the daemon to be started and associated with multiple > > > hosts. > > > I do not want to compile rpc.gssd and it appears the patch is from > > > over > > > two years ago. I would hope that RHEL6 would have rpc.gssd patched > > > to > > > meet this requirement, but no documentation appear to exist for how > > > to > > > use it. > > > > > > > > > > > > > > > > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > > > > Ian, > > > > > > > > Thanks for the info. > > > > > > > >... > > > > > > > plain text document attachment (ATT114553.txt) > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rpeterso at redhat.com Thu Apr 7 12:58:31 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 7 Apr 2011 08:58:31 -0400 (EDT) Subject: [Linux-cluster] very poor rm performance with gfs2 (el6) In-Reply-To: Message-ID: <307619244.793840.1302181111217.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Hello | | I'm currently testing gfs2 under EL6 (2.6.32-71.18.1.el6.x86_64) 11 | nodes over gigabit iscsi (target is IET on EL5) | Application is a compute cluster with lot of files of various sizes. | Application need close to write coherency and cache coherency. | | I can't tell yet about performance. It looks not bad, iscsi target | shows a bandwidth of about 80Mbytes/s (mostly reads during compute). I | have to compare to NFS to tell if it is good. 
| | Major problem: I noticed very very poor rm performance, down to 1 to 2 | files erased per second in a rm -rf. | Some of these files are not very big: a few kilo bytes. Cluster is | idle during rm. mkfs.gfs2 was made just before compute test was made. (snip) | Please advise if there is a solution for this. | | Thanks Hi, I've been working on a set of patches to improve gfs2 unlink (aka rm) performance. Here is the bug I'm using to track my work: https://bugzilla.redhat.com/show_bug.cgi?id=681902 I don't know if the bug is public or if anyone has the ability to see it, but I can't change that (sorry). I currently have a "proof of concept" patch that speeds up unlink by 19% in the tests I'm doing, which is "rm -fR /mnt/gfs2/*". It also speeds up "ls -lR *" by 42%. Of course, mileage will vary depending on what you're doing. (No guarantees, etc.) The patch still has some problems and still needs some work. At this point it's just "proof of concept" so don't get your hopes up. (When I straighten out the patch, the performance may not be as good.) It will take a while before anything is available in an official release. Regards, Bob Peterson Red Hat File Systems From rossnick-lists at cybercat.ca Thu Apr 7 16:57:07 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 7 Apr 2011 12:57:07 -0400 Subject: [Linux-cluster] very poor rm performance with gfs2 (el6) References: <307619244.793840.1302181111217.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: (...) > I've been working on a set of patches to improve gfs2 unlink (aka rm) > performance. Here is the bug I'm using to track my work: > > https://bugzilla.redhat.com/show_bug.cgi?id=681902 Just a quick question. When updates to gfs are made like this, in what form does it comes ? Update to gfs2-utils package ? If that's so, do I need to re-mount my filesystem to take advantage of such update ? Regards, From rpeterso at redhat.com Thu Apr 7 17:15:44 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 7 Apr 2011 13:15:44 -0400 (EDT) Subject: [Linux-cluster] very poor rm performance with gfs2 (el6) In-Reply-To: Message-ID: <795218420.799834.1302196544980.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Just a quick question. When updates to gfs are made like this, in what | form | does it comes ? Update to gfs2-utils package ? If that's so, do I need | to | re-mount my filesystem to take advantage of such update ? | | Regards, Hi, I'm not sure I completely understand the question, but: GFS2 is part of the base kernel, so you'll automatically get the patch when a new kernel is installed that contains the patch. When you install a new kernel on a node, you should reboot the node, and then you'll get the fix automatically. As for the patch itself, the bugzilla record is normally used to track the progress. Patches always go upstream first before they're accepted into RHEL, and in the case of gfs2, there's a public mailing list called "cluster-devel" where upstream patches are posted for the gfs2 kernel development tree. The development tree itself is: http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git Did that answer your question? 
Regards, Bob Peterson Red Hat File Systems From cthulhucalling at gmail.com Thu Apr 7 18:07:34 2011 From: cthulhucalling at gmail.com (Ian Hayes) Date: Thu, 7 Apr 2011 11:07:34 -0700 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <1302178130.26066.19.camel@hawku> References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> <1302178130.26066.19.camel@hawku> Message-ID: I had written up a rather large set of build documentation for many common clustered services. NFS4, Samba, Postfix/Cyrus, Squid and some other stuff. But those docs stayed with my employer, so.... I don't think I've seen this cookbook, is it some wiki-type thing where new docs can be contributed? On Thu, Apr 7, 2011 at 5:08 AM, Daniel R. Gore wrote: > A better solution for NFSv4 in a cluster is really required. > > > A better cookbook with more real life likely scenarios for clustering > solutions would be really helpful. How many people actually setup the > complex three layered solutions depicted, as compared to people setting > up simple two/three node servers to for authorization, authentication, > file and license serving. It appears that the small business applicable > system is completely ignored. > > > On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: > > That's interesting about making the portmapper dependant on the IP, was > > this for the same reason I'm seeing just now. I used the method from NFS > > cookbook where I pseudo load balancing by distributing my NFS exports > > across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) > > replies on the node IP and not the service IP, and this breaks NFSv3 > > mounts from RHEL5 clients with iptables stateful firewalls. > > > > I opened a bug on this one and have a call open with RH (via Dell) on > > this: > > https://bugzilla.redhat.com/show_bug.cgi?id=689589 > > > > But I too would like a good clean method of doing kerberized NFSv4 on a > > RHEL6 cluster. I thought NFSv4 being so central to RHEL6 this would be > > easy on a RHEL6 cluster (without using XEN)? Can the cookbook be > > updated? > > > > Which brings up another point. The RHEL cluster documentation is good, > > however it doesn't really help you implement a working cluster too > > easily (beyond the apache example), it's a bit reference orientated. I > > found myself googling around for examples of different RA types. Is > > there a more hands on set of docs around (or book)? It could almost do > > with a cookbook for every RA! > > > > Thanks > > > > Colin > > > > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > > > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip > > > address first, made portmapper a depend on the ip, rpc.gssd depend on > > > portmap and nfsd depend on rpc. As for the hostname, I went with the > > > inelegant solution of putting a 'hostname' command in the start > > > functions of the portmapper script since that fires first in my > > > config. > > > > > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > > > > wrote: > > > > > > > > I also found this thread, after many searches. > > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > > > > > > > As I read through it, there appears to be a patch for rpc.gssd which > > > > allows for the daemon to be started and associated with multiple > > > > hosts. > > > > I do not want to compile rpc.gssd and it appears the patch is from > > > > over > > > > two years ago. 
I would hope that RHEL6 would have rpc.gssd patched > > > > to > > > > meet this requirement, but no documentation appear to exist for how > > > > to > > > > use it. > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > > > > > Ian, > > > > > > > > > > Thanks for the info. > > > > > > > > > >... > > > > > > > > > > plain text document attachment (ATT114553.txt) > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > This email and any files transmitted with it are confidential and are > intended solely for the use of the individual or entity to whom they are > addressed. If you are not the original recipient or the person responsible > for delivering the email to the intended recipient, be advised that you have > received this email in error, and that any use, dissemination, forwarding, > printing, or copying of this email is strictly prohibited. If you received > this email in error, please immediately notify the sender and delete the > original. > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rossnick-lists at cybercat.ca Thu Apr 7 18:10:31 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 7 Apr 2011 14:10:31 -0400 Subject: [Linux-cluster] very poor rm performance with gfs2 (el6) References: <795218420.799834.1302196544980.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <0F0CECF28C9740D69D6443433A2E1F8C@versa> > GFS2 is part of the base kernel, so you'll automatically get > the patch when a new kernel is installed that contains the > patch. When you install a new kernel on a node, you should > reboot the node, and then you'll get the fix automatically. That's what I wanted to know. I need to reboot the node to take adventage of the future fix... Regards, From danielgore at yaktech.com Thu Apr 7 18:58:16 2011 From: danielgore at yaktech.com (danielgore at yaktech.com) Date: Thu, 7 Apr 2011 14:58:16 -0400 (EDT) Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> <1302178130.26066.19.camel@hawku> Message-ID: <4414.160.107.87.10.1302202696.squirrel@www.yaktech.com> Ian, You can find it here; http://sourceware.org/cluster/doc/nfscookbook.pdf > I had written up a rather large set of build documentation for many common > clustered services. NFS4, Samba, Postfix/Cyrus, Squid and some other > stuff. > But those docs stayed with my employer, so.... I don't think I've seen > this > cookbook, is it some wiki-type thing where new docs can be contributed? > > On Thu, Apr 7, 2011 at 5:08 AM, Daniel R. Gore > wrote: > >> A better solution for NFSv4 in a cluster is really required. >> >> >> A better cookbook with more real life likely scenarios for clustering >> solutions would be really helpful. 
How many people actually setup the >> complex three layered solutions depicted, as compared to people setting >> up simple two/three node servers to for authorization, authentication, >> file and license serving. It appears that the small business applicable >> system is completely ignored. >> >> >> On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: >> > That's interesting about making the portmapper dependant on the IP, >> was >> > this for the same reason I'm seeing just now. I used the method from >> NFS >> > cookbook where I pseudo load balancing by distributing my NFS exports >> > across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) >> > replies on the node IP and not the service IP, and this breaks NFSv3 >> > mounts from RHEL5 clients with iptables stateful firewalls. >> > >> > I opened a bug on this one and have a call open with RH (via Dell) on >> > this: >> > https://bugzilla.redhat.com/show_bug.cgi?id=689589 >> > >> > But I too would like a good clean method of doing kerberized NFSv4 on >> a >> > RHEL6 cluster. I thought NFSv4 being so central to RHEL6 this would be >> > easy on a RHEL6 cluster (without using XEN)? Can the cookbook be >> > updated? >> > >> > Which brings up another point. The RHEL cluster documentation is good, >> > however it doesn't really help you implement a working cluster too >> > easily (beyond the apache example), it's a bit reference orientated. I >> > found myself googling around for examples of different RA types. Is >> > there a more hands on set of docs around (or book)? It could almost do >> > with a cookbook for every RA! >> > >> > Thanks >> > >> > Colin >> > >> > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: >> > > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip >> > > address first, made portmapper a depend on the ip, rpc.gssd depend >> on >> > > portmap and nfsd depend on rpc. As for the hostname, I went with the >> > > inelegant solution of putting a 'hostname' command in the start >> > > functions of the portmapper script since that fires first in my >> > > config. >> > > >> > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" >> > > > wrote: >> > > > >> > > > I also found this thread, after many searches. >> > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html >> > > > >> > > > As I read through it, there appears to be a patch for rpc.gssd >> which >> > > > allows for the daemon to be started and associated with multiple >> > > > hosts. >> > > > I do not want to compile rpc.gssd and it appears the patch is from >> > > > over >> > > > two years ago. I would hope that RHEL6 would have rpc.gssd >> patched >> > > > to >> > > > meet this requirement, but no documentation appear to exist for >> how >> > > > to >> > > > use it. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: >> > > > > Ian, >> > > > > >> > > > > Thanks for the info. >> > > > > >> > > > >... >> > > > >> > > >> > > plain text document attachment (ATT114553.txt) >> > > -- >> > > Linux-cluster mailing list >> > > Linux-cluster at redhat.com >> > > https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> > This email and any files transmitted with it are confidential and are >> intended solely for the use of the individual or entity to whom they are >> addressed. 
If you are not the original recipient or the person >> responsible >> for delivering the email to the intended recipient, be advised that you >> have >> received this email in error, and that any use, dissemination, >> forwarding, >> printing, or copying of this email is strictly prohibited. If you >> received >> this email in error, please immediately notify the sender and delete the >> original. >> > >> > >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> >> >> >> -- >> This message has been scanned for viruses and >> dangerous content by MailScanner, and is >> believed to be clean. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From danielgore at yaktech.com Thu Apr 7 23:16:44 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Thu, 07 Apr 2011 19:16:44 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <4414.160.107.87.10.1302202696.squirrel@www.yaktech.com> References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> <1302178130.26066.19.camel@hawku> <4414.160.107.87.10.1302202696.squirrel@www.yaktech.com> Message-ID: <1302218204.29194.7.camel@hawku> Still could not get it to work. I tried changing the host name that rpcbind binds to during start up with arguments in the /etc/sysconfig/rpcbind file. RPCBIND_ARGS="hostname fserv.mydomain" rpcbind started correctly with not errors. I then restarted the other rpc daemons and nfs. Got the same error: rpc.svcgssd indicates "wrong principal" I know the ip is working correctly because I can ssh into using the file server name (fserv.mydomain). Looking for more ideas! Thanks. Dan On Thu, 2011-04-07 at 14:58 -0400, danielgore at yaktech.com wrote: > Ian, > > You can find it here; > > > http://sourceware.org/cluster/doc/nfscookbook.pdf > > > I had written up a rather large set of build documentation for many common > > clustered services. NFS4, Samba, Postfix/Cyrus, Squid and some other > > stuff. > > But those docs stayed with my employer, so.... I don't think I've seen > > this > > cookbook, is it some wiki-type thing where new docs can be contributed? > > > > On Thu, Apr 7, 2011 at 5:08 AM, Daniel R. Gore > > wrote: > > > >> A better solution for NFSv4 in a cluster is really required. > >> > >> > >> A better cookbook with more real life likely scenarios for clustering > >> solutions would be really helpful. How many people actually setup the > >> complex three layered solutions depicted, as compared to people setting > >> up simple two/three node servers to for authorization, authentication, > >> file and license serving. It appears that the small business applicable > >> system is completely ignored. > >> > >> > >> On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: > >> > That's interesting about making the portmapper dependant on the IP, > >> was > >> > this for the same reason I'm seeing just now. 
I used the method from > >> NFS > >> > cookbook where I pseudo load balancing by distributing my NFS exports > >> > across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) > >> > replies on the node IP and not the service IP, and this breaks NFSv3 > >> > mounts from RHEL5 clients with iptables stateful firewalls. > >> > > >> > I opened a bug on this one and have a call open with RH (via Dell) on > >> > this: > >> > https://bugzilla.redhat.com/show_bug.cgi?id=689589 > >> > > >> > But I too would like a good clean method of doing kerberized NFSv4 on > >> a > >> > RHEL6 cluster. I thought NFSv4 being so central to RHEL6 this would be > >> > easy on a RHEL6 cluster (without using XEN)? Can the cookbook be > >> > updated? > >> > > >> > Which brings up another point. The RHEL cluster documentation is good, > >> > however it doesn't really help you implement a working cluster too > >> > easily (beyond the apache example), it's a bit reference orientated. I > >> > found myself googling around for examples of different RA types. Is > >> > there a more hands on set of docs around (or book)? It could almost do > >> > with a cookbook for every RA! > >> > > >> > Thanks > >> > > >> > Colin > >> > > >> > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > >> > > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip > >> > > address first, made portmapper a depend on the ip, rpc.gssd depend > >> on > >> > > portmap and nfsd depend on rpc. As for the hostname, I went with the > >> > > inelegant solution of putting a 'hostname' command in the start > >> > > functions of the portmapper script since that fires first in my > >> > > config. > >> > > > >> > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > >> > > > wrote: > >> > > > > >> > > > I also found this thread, after many searches. > >> > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > >> > > > > >> > > > As I read through it, there appears to be a patch for rpc.gssd > >> which > >> > > > allows for the daemon to be started and associated with multiple > >> > > > hosts. > >> > > > I do not want to compile rpc.gssd and it appears the patch is from > >> > > > over > >> > > > two years ago. I would hope that RHEL6 would have rpc.gssd > >> patched > >> > > > to > >> > > > meet this requirement, but no documentation appear to exist for > >> how > >> > > > to > >> > > > use it. > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > >> > > > > Ian, > >> > > > > > >> > > > > Thanks for the info. > >> > > > > > >> > > > >... > >> > > > > >> > > > >> > > plain text document attachment (ATT114553.txt) > >> > > -- > >> > > Linux-cluster mailing list > >> > > Linux-cluster at redhat.com > >> > > https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > >> > This email and any files transmitted with it are confidential and are > >> intended solely for the use of the individual or entity to whom they are > >> addressed. If you are not the original recipient or the person > >> responsible > >> for delivering the email to the intended recipient, be advised that you > >> have > >> received this email in error, and that any use, dissemination, > >> forwarding, > >> printing, or copying of this email is strictly prohibited. If you > >> received > >> this email in error, please immediately notify the sender and delete the > >> original. 
> >> > > >> > > >> > > >> > -- > >> > Linux-cluster mailing list > >> > Linux-cluster at redhat.com > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > >> > >> > >> > >> -- > >> This message has been scanned for viruses and > >> dangerous content by MailScanner, and is > >> believed to be clean. > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- > > This message has been scanned for viruses and > > dangerous content by MailScanner, and is > > believed to be clean. > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From cthulhucalling at gmail.com Thu Apr 7 23:30:37 2011 From: cthulhucalling at gmail.com (Ian Hayes) Date: Thu, 7 Apr 2011 16:30:37 -0700 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: <1302218204.29194.7.camel@hawku> References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> <1302178130.26066.19.camel@hawku> <4414.160.107.87.10.1302202696.squirrel@www.yaktech.com> <1302218204.29194.7.camel@hawku> Message-ID: Hmm. I think you're overcomplicating it a bit. Instead of tweaking /etc/sysconfig/rpcbind, I did this: Edit /etc/init.d/portmap start() { hostname nfsserver.mydomain echo -n $"Starting $prog: " daemon portmap $PMAP_ARGS RETVAL=$? echo [ $RETVAL -eq 0 ] && touch /var/lock/subsys/portmap return $RETVAL } Start order is : IP, portmap, rpcgssd, rpcidmapd, nfs. My goal was to get the hostname changed to whateve the service principal in Kerberos was named to before the Kerberized daemons start up. You can also play by manually changing the hostname on one of the nodes and firing up the service just to see that everything works. On Thu, Apr 7, 2011 at 4:16 PM, Daniel R. Gore wrote: > Still could not get it to work. > > I tried changing the host name that rpcbind binds to during start up > with arguments in the /etc/sysconfig/rpcbind file. > > RPCBIND_ARGS="hostname fserv.mydomain" > > rpcbind started correctly with not errors. I then restarted the other > rpc daemons and nfs. > > Got the same error: rpc.svcgssd indicates "wrong principal" > > I know the ip is working correctly because I can ssh into using the file > server name (fserv.mydomain). > > Looking for more ideas! > > Thanks. > > Dan > > On Thu, 2011-04-07 at 14:58 -0400, danielgore at yaktech.com wrote: > > Ian, > > > > You can find it here; > > > > > > http://sourceware.org/cluster/doc/nfscookbook.pdf > > > > > I had written up a rather large set of build documentation for many > common > > > clustered services. NFS4, Samba, Postfix/Cyrus, Squid and some other > > > stuff. > > > But those docs stayed with my employer, so.... I don't think I've seen > > > this > > > cookbook, is it some wiki-type thing where new docs can be contributed? > > > > > > On Thu, Apr 7, 2011 at 5:08 AM, Daniel R. Gore > > > wrote: > > > > > >> A better solution for NFSv4 in a cluster is really required. > > >> > > >> > > >> A better cookbook with more real life likely scenarios for clustering > > >> solutions would be really helpful. 
How many people actually setup the > > >> complex three layered solutions depicted, as compared to people > setting > > >> up simple two/three node servers to for authorization, authentication, > > >> file and license serving. It appears that the small business > applicable > > >> system is completely ignored. > > >> > > >> > > >> On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: > > >> > That's interesting about making the portmapper dependant on the IP, > > >> was > > >> > this for the same reason I'm seeing just now. I used the method from > > >> NFS > > >> > cookbook where I pseudo load balancing by distributing my NFS > exports > > >> > across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) > > >> > replies on the node IP and not the service IP, and this breaks NFSv3 > > >> > mounts from RHEL5 clients with iptables stateful firewalls. > > >> > > > >> > I opened a bug on this one and have a call open with RH (via Dell) > on > > >> > this: > > >> > https://bugzilla.redhat.com/show_bug.cgi?id=689589 > > >> > > > >> > But I too would like a good clean method of doing kerberized NFSv4 > on > > >> a > > >> > RHEL6 cluster. I thought NFSv4 being so central to RHEL6 this would > be > > >> > easy on a RHEL6 cluster (without using XEN)? Can the cookbook be > > >> > updated? > > >> > > > >> > Which brings up another point. The RHEL cluster documentation is > good, > > >> > however it doesn't really help you implement a working cluster too > > >> > easily (beyond the apache example), it's a bit reference orientated. > I > > >> > found myself googling around for examples of different RA types. Is > > >> > there a more hands on set of docs around (or book)? It could almost > do > > >> > with a cookbook for every RA! > > >> > > > >> > Thanks > > >> > > > >> > Colin > > >> > > > >> > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > > >> > > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip > > >> > > address first, made portmapper a depend on the ip, rpc.gssd depend > > >> on > > >> > > portmap and nfsd depend on rpc. As for the hostname, I went with > the > > >> > > inelegant solution of putting a 'hostname' command in the start > > >> > > functions of the portmapper script since that fires first in my > > >> > > config. > > >> > > > > >> > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" < > danielgore at yaktech.com> > > >> > > > wrote: > > >> > > > > > >> > > > I also found this thread, after many searches. > > >> > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > >> > > > > > >> > > > As I read through it, there appears to be a patch for rpc.gssd > > >> which > > >> > > > allows for the daemon to be started and associated with multiple > > >> > > > hosts. > > >> > > > I do not want to compile rpc.gssd and it appears the patch is > from > > >> > > > over > > >> > > > two years ago. I would hope that RHEL6 would have rpc.gssd > > >> patched > > >> > > > to > > >> > > > meet this requirement, but no documentation appear to exist for > > >> how > > >> > > > to > > >> > > > use it. > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > > >> > > > > Ian, > > >> > > > > > > >> > > > > Thanks for the info. > > >> > > > > > > >> > > > >... 
> > >> > > > > > >> > > > > >> > > plain text document attachment (ATT114553.txt) > > >> > > -- > > >> > > Linux-cluster mailing list > > >> > > Linux-cluster at redhat.com > > >> > > https://www.redhat.com/mailman/listinfo/linux-cluster > > >> > > > >> > This email and any files transmitted with it are confidential and > are > > >> intended solely for the use of the individual or entity to whom they > are > > >> addressed. If you are not the original recipient or the person > > >> responsible > > >> for delivering the email to the intended recipient, be advised that > you > > >> have > > >> received this email in error, and that any use, dissemination, > > >> forwarding, > > >> printing, or copying of this email is strictly prohibited. If you > > >> received > > >> this email in error, please immediately notify the sender and delete > the > > >> original. > > >> > > > >> > > > >> > > > >> > -- > > >> > Linux-cluster mailing list > > >> > Linux-cluster at redhat.com > > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > > >> > > > >> > > >> > > >> > > >> -- > > >> This message has been scanned for viruses and > > >> dangerous content by MailScanner, and is > > >> believed to be clean. > > >> > > >> -- > > >> Linux-cluster mailing list > > >> Linux-cluster at redhat.com > > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > >> > > > > > > -- > > > This message has been scanned for viruses and > > > dangerous content by MailScanner, and is > > > believed to be clean. > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From danielgore at yaktech.com Thu Apr 7 23:51:34 2011 From: danielgore at yaktech.com (Daniel R. Gore) Date: Thu, 07 Apr 2011 19:51:34 -0400 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku> <1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku> <1302173040.17187.37.camel@cowie.iouk.ioroot.tld> <1302178130.26066.19.camel@hawku> <4414.160.107.87.10.1302202696.squirrel@www.yaktech.com> <1302218204.29194.7.camel@hawku> Message-ID: <1302220294.29194.10.camel@hawku> Ian, Thanks for the feedback. RHEL 6 replaced portmap with rpcbind, but you solution should work the same. I will give it a try tomorrow and let you know. Thanks for the starting order also. I was wondering if I got that correct. Dan On Thu, 2011-04-07 at 16:30 -0700, Ian Hayes wrote: > Hmm. I think you're overcomplicating it a bit. Instead of > tweaking /etc/sysconfig/rpcbind, I did this: > > Edit /etc/init.d/portmap > start() { > hostname nfsserver.mydomain > echo -n $"Starting $prog: " > daemon portmap $PMAP_ARGS > RETVAL=$? > echo > [ $RETVAL -eq 0 ] && touch /var/lock/subsys/portmap > return $RETVAL > } > > Start order is : IP, portmap, rpcgssd, rpcidmapd, nfs. My goal was to > get the hostname changed to whateve the service principal in Kerberos > was named to before the Kerberized daemons start up. > > You can also play by manually changing the hostname on one of the > nodes and firing up the service just to see that everything works. > > On Thu, Apr 7, 2011 at 4:16 PM, Daniel R. 
Gore > wrote: > Still could not get it to work. > > I tried changing the host name that rpcbind binds to during > start up > with arguments in the /etc/sysconfig/rpcbind file. > > RPCBIND_ARGS="hostname fserv.mydomain" > > rpcbind started correctly with not errors. I then restarted > the other > rpc daemons and nfs. > > Got the same error: rpc.svcgssd indicates "wrong principal" > > I know the ip is working correctly because I can ssh into > using the file > server name (fserv.mydomain). > > Looking for more ideas! > > Thanks. > > Dan > > > On Thu, 2011-04-07 at 14:58 -0400, danielgore at yaktech.com > wrote: > > Ian, > > > > You can find it here; > > > > > > http://sourceware.org/cluster/doc/nfscookbook.pdf > > > > > I had written up a rather large set of build documentation > for many common > > > clustered services. NFS4, Samba, Postfix/Cyrus, Squid and > some other > > > stuff. > > > But those docs stayed with my employer, so.... I don't > think I've seen > > > this > > > cookbook, is it some wiki-type thing where new docs can be > contributed? > > > > > > On Thu, Apr 7, 2011 at 5:08 AM, Daniel R. Gore > > > wrote: > > > > > >> A better solution for NFSv4 in a cluster is really > required. > > >> > > >> > > >> A better cookbook with more real life likely scenarios > for clustering > > >> solutions would be really helpful. How many people > actually setup the > > >> complex three layered solutions depicted, as compared to > people setting > > >> up simple two/three node servers to for authorization, > authentication, > > >> file and license serving. It appears that the small > business applicable > > >> system is completely ignored. > > >> > > >> > > >> On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: > > >> > That's interesting about making the portmapper > dependant on the IP, > > >> was > > >> > this for the same reason I'm seeing just now. I used > the method from > > >> NFS > > >> > cookbook where I pseudo load balancing by distributing > my NFS exports > > >> > across my nodes. Sadly the RHEL 6 portmapper > replacement (rpcbind) > > >> > replies on the node IP and not the service IP, and this > breaks NFSv3 > > >> > mounts from RHEL5 clients with iptables stateful > firewalls. > > >> > > > >> > I opened a bug on this one and have a call open with RH > (via Dell) on > > >> > this: > > >> > https://bugzilla.redhat.com/show_bug.cgi?id=689589 > > >> > > > >> > But I too would like a good clean method of doing > kerberized NFSv4 on > > >> a > > >> > RHEL6 cluster. I thought NFSv4 being so central to > RHEL6 this would be > > >> > easy on a RHEL6 cluster (without using XEN)? Can the > cookbook be > > >> > updated? > > >> > > > >> > Which brings up another point. The RHEL cluster > documentation is good, > > >> > however it doesn't really help you implement a working > cluster too > > >> > easily (beyond the apache example), it's a bit > reference orientated. I > > >> > found myself googling around for examples of different > RA types. Is > > >> > there a more hands on set of docs around (or book)? It > could almost do > > >> > with a cookbook for every RA! > > >> > > > >> > Thanks > > >> > > > >> > Colin > > >> > > > >> > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > > >> > > Shouldnt have to recompile rpc.gssd. On failover I > migrated the ip > > >> > > address first, made portmapper a depend on the ip, > rpc.gssd depend > > >> on > > >> > > portmap and nfsd depend on rpc. 
As for the hostname, > I went with the > > >> > > inelegant solution of putting a 'hostname' command in > the start > > >> > > functions of the portmapper script since that fires > first in my > > >> > > config. > > >> > > > > >> > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > > > >> > > > wrote: > > >> > > > > > >> > > > I also found this thread, after many searches. > > >> > > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > > >> > > > > > >> > > > As I read through it, there appears to be a patch > for rpc.gssd > > >> which > > >> > > > allows for the daemon to be started and associated > with multiple > > >> > > > hosts. > > >> > > > I do not want to compile rpc.gssd and it appears > the patch is from > > >> > > > over > > >> > > > two years ago. I would hope that RHEL6 would have > rpc.gssd > > >> patched > > >> > > > to > > >> > > > meet this requirement, but no documentation appear > to exist for > > >> how > > >> > > > to > > >> > > > use it. > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore > wrote: > > >> > > > > Ian, > > >> > > > > > > >> > > > > Thanks for the info. > > >> > > > > > > >> > > > >... > > >> > > > > > >> > > > > >> > > plain text document attachment (ATT114553.txt) > > >> > > -- > > >> > > Linux-cluster mailing list > > >> > > Linux-cluster at redhat.com > > >> > > https://www.redhat.com/mailman/listinfo/linux-cluster > > >> > > > >> > This email and any files transmitted with it are > confidential and are > > >> intended solely for the use of the individual or entity > to whom they are > > >> addressed. If you are not the original recipient or the > person > > >> responsible > > >> for delivering the email to the intended recipient, be > advised that you > > >> have > > >> received this email in error, and that any use, > dissemination, > > >> forwarding, > > >> printing, or copying of this email is strictly > prohibited. If you > > >> received > > >> this email in error, please immediately notify the sender > and delete the > > >> original. > > >> > > > >> > > > >> > > > >> > -- > > >> > Linux-cluster mailing list > > >> > Linux-cluster at redhat.com > > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > > >> > > > >> > > >> > > >> > > >> -- > > >> This message has been scanned for viruses and > > >> dangerous content by MailScanner, and is > > >> believed to be clean. > > >> > > >> -- > > >> Linux-cluster mailing list > > >> Linux-cluster at redhat.com > > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > >> > > > > > > -- > > > This message has been scanned for viruses and > > > dangerous content by MailScanner, and is > > > believed to be clean. > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
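A quick way to sanity-check the hostname-switch approach described in this thread on a single node, before wiring it into the init scripts, is sketched below. It is only a sketch: it assumes the floating name nfsserver.mydomain, a service principal nfs/nfsserver.mydomain in /etc/krb5.keytab, and the stock RHEL 6 init scripts (rpcbind, rpcgssd, rpcidmapd, nfs); adjust the names to your own setup.

# Minimal manual test of the hostname/keytab trick on one node.
hostname nfsserver.mydomain

# The principal rpc.svcgssd will look for must be in the keytab:
klist -k /etc/krb5.keytab | grep nfs/nfsserver.mydomain \
    || echo "WARNING: no nfs/nfsserver.mydomain entry in the keytab"

# Restart the stack in dependency order (RHEL 6 uses rpcbind, not portmap):
for svc in rpcbind rpcgssd rpcidmapd nfs; do
    service $svc restart
done

# Then, from a client:
#   mount -t nfs4 nfsserver.mydomain:/ /mnt/exports -o sec=krb5p

If the klist check fails, a "wrong principal" complaint from rpc.svcgssd is to be expected; fixing the keytab first is likely cheaper than debugging the cluster configuration.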
From mgrac at redhat.com Fri Apr 8 08:17:40 2011 From: mgrac at redhat.com (Marek Grac) Date: Fri, 08 Apr 2011 10:17:40 +0200 Subject: [Linux-cluster] fence_apc and Apc AP-8941 In-Reply-To: <8C77012A023D431CB5AA58E6CC676A35@versa> References: <4D9C06D2.1040106@redhat.com> <48C9F07371214F2AA678938FDE5BD3B5@versa> <4D9C768D.4060106@redhat.com> <8C77012A023D431CB5AA58E6CC676A35@versa> Message-ID: <4D9EC4A4.3060201@redhat.com> Hi, On 04/06/2011 05:06 PM, Nicolas Ross wrote: >> If response is too fast then problem is in connecting >> information/process. If it took long enough (timeout problem) then it >> can be problem with change in command prompt. If it is possible >> please send me what it is displayed when you are trying to do it >> manually. > > > Response is indeed very fast when trying with the agent. With tcpdump > on the node I try with the agent, I see ssh packets go to and from the > apc switch. > > When ssh-iing to the apc switch, it takes about 2 or 3 seconds before > I get the password prompt, and then I see : > ------------------------------ > user at 1.1.1.1's password: > > > American Power Conversion Network Management Card AOS > v5.1.2 > (c) Copyright 2009 All Rights Reserved RPDU 2g v5.1.0 > ------------------------------------------------------------------------------- > > Name : Unknown Date : 04/06/2011 > Contact : Unknown Time : 10:52:44 > Location : Unknown User : Device > Manager > Up Time : 0 Days 1 Hour 52 Minutes Stat : P+ N4+ > N6+ A+ > > > Type ? for command listing > Use tcpip command for IP address(-i), subnet(-s), and gateway(-g) > > apc>qConnection to 1.1.1.1 closed. > ------------------------------ > > I think I have found the problem... Looking at fence_apc, I see : > > options["ssh_options"] = "-1 -c blowfish" > > Now, if I add this to my ssh command like so : > > ssh -1 -c blowfish user at 1.1.1.1 > > I get : > > Protocol major versions differ: 1 vs. 2 So they drop support for ssh v1. Unfortunately old versions were not usable with v2. I can make ssh_options tunable and not only pre-set. > > So, there is no ssh version 1 on this version of the apc switrch. I > commented out that line in /usr/sbin/fence_apc, and now the fence > agent is able to establish the connection, but it cannot go any further. Add "cmd_prompt" into device_opt in fence_apc. Then you will have possibility to set --command-prompt to "apc>". Both fixes will be simple, feel free to create bugzilla entry for them. m,
From rossnick-lists at cybercat.ca Fri Apr 8 14:57:41 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 8 Apr 2011 10:57:41 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com><48C9F07371214F2AA678938FDE5BD3B5@versa><4D9C768D.4060106@redhat.com><8C77012A023D431CB5AA58E6CC676A35@versa> <4D9EC4A4.3060201@redhat.com> Message-ID: <7899FDCE451648D6B8F48FCCF9B35B4F@versa> >> Protocol major versions differ: 1 vs. 2 > > So they drop support for ssh v1. Unfortunately old versions were not > usable with v2. I can make ssh_options tunable and not only pre-set. > >> >> So, there is no ssh version 1 on this version of the apc switrch. I >> commented out that line in /usr/sbin/fence_apc, and now the fence agent >> is able to establish the connection, but it cannot go any further. > > Add "cmd_prompt" into device_opt in fence_apc. Then you will have > possibility to set --command-prompt to "apc>". > > Both fixes will be simple, feel free to create bugzilla entry for them. Ok, we are progressing. I did create a support case, as suggested by Fabio. It's case # 447666 I get a little further, but now it also seems that the command have also changed Now my log shows : ------------------------- American Power Conversion Network Management Card AOS v5.1.2 (c) Copyright 2009 All Rights Reserved RPDU 2g v5.1.0 ------------------------------------------------------------------------------- Name : Unknown Date : 04/08/2011 Contact : Unknown Time : 10:49:39 Location : Unknown User : Device Manager Up Time : 0 Days 2 Hours 18 Minutes Stat : P+ N4+ N6+ A+ Type ? for command listing Use tcpip command for IP address(-i), subnet(-s), and gateway(-g) apc>1 E101: Command Not Found apc>2 E101: Command Not Found apc> ------------------------- When I connect manually, here are the commands availaible : bkLowLoad bkNearOver bkOverLoad bkReading bkRestrictn devLowLoad devNearOver devOverLoad devReading devStartDly humLow humMin humReading olCancelCmd olDlyOff olDlyOn olDlyReboot olGroups olName olOff olOffDelay olOn olOnDelay olRbootTime olReboot olStatus phLowLoad phNearOver phOverLoad phReading phRestrictn prodInfo sensorName tempHigh tempMax tempReading userList whoami So the command to send would be : olReboot node101 Like so : ------------------------- apc>olReboot node101 E000: Success apc> ------------------------- and then "quit" to logout.
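For anyone hitting the same AP8941 firmware, a rough way to exercise the fencing path by hand is sketched below. It assumes the two fixes Marek describes are already in place (tunable ssh_options and a --command-prompt option for fence_apc); exact option names can differ between fence-agents releases, and the address, credentials and outlet name are placeholders taken from this thread.

# 1) Plain SSH check -- the newer firmware only speaks protocol 2,
#    so force it and expect to land at the "apc>" prompt:
ssh -2 user@1.1.1.1

# 2) Once the agent accepts a custom prompt, query outlet status first:
fence_apc -x -a 1.1.1.1 -l user -p secret -n node101 \
    --command-prompt "apc>" -o status

# 3) Only when status works, test the action the cluster will actually use:
fence_apc -x -a 1.1.1.1 -l user -p secret -n node101 \
    --command-prompt "apc>" -o reboot

On this firmware the reboot is driven by olReboot rather than the old numbered menu, which is presumably why the stock agent gets "E101: Command Not Found" after logging in.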
From rossnick-lists at cybercat.ca Fri Apr 8 19:00:39 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 8 Apr 2011 15:00:39 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com><48C9F07371214F2AA678938FDE5BD3B5@versa><4D9C768D.4060106@redhat.com><8C77012A023D431CB5AA58E6CC676A35@versa> <4D9EC4A4.3060201@redhat.com> Message-ID: <1141834537364B079148969B6F08188F@versa> >> So, there is no ssh version 1 on this version of the apc switrch. I >> commented out that line in /usr/sbin/fence_apc, and now the fence >> agent is able to establish the connection, but it cannot go any further. > > Add "cmd_prompt" into device_opt in fence_apc. Then you will have > possibility to set --command-prompt to "apc>". > > Both fixes will be simple, feel free to create bugzilla entry for them. I submited but # 694894 for this. Let's take it there. From meisam.mohammadkhani at gmail.com Sun Apr 10 14:29:01 2011 From: meisam.mohammadkhani at gmail.com (Meisam Mohammadkhani) Date: Sun, 10 Apr 2011 17:59:01 +0330 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: References: Message-ID: Hi All, I'm new to GFS. I'm searching around a solution for our enterprise application that is responsible to save(and manipulate) historical data of industrial devices. Now, we have two stations that works like hot redundant of each other. Our challenge is in case of failure. For now, our application is responsible to handling fault by synchronizing the files that changed during the fault, by itself. Our application is running on two totally independent machines (one as redundant) and so each one has its own disk. We are searching around a solution like a "high available transparent file system" that makes the fault transparent to the application, so in case of fault, redundant machine still can access the files even the master machine is down (replica issue or such a thing). Is there fail-over feature in GFS that satisfy our requirement? Actually, my question is that can GFS help us in our case? Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sun Apr 10 14:57:45 2011 From: linux at alteeve.com (Digimer) Date: Sun, 10 Apr 2011 10:57:45 -0400 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: References: Message-ID: <4DA1C569.7020600@alteeve.com> On 04/10/2011 10:29 AM, Meisam Mohammadkhani wrote: > Hi All, > > I'm new to GFS. I'm searching around a solution for our enterprise > application that is responsible to save(and manipulate) historical data > of industrial devices. Now, we have two stations that works like hot > redundant of each other. Our challenge is in case of failure. For now, > our application is responsible to handling fault by synchronizing the > files that changed during the fault, by itself. Our application is > running on two totally independent machines (one as redundant) and so > each one has its own disk. > We are searching around a solution like a "high available transparent > file system" that makes the fault transparent to the application, so in > case of fault, redundant machine still can access the files even the > master machine is down (replica issue or such a thing). > Is there fail-over feature in GFS that satisfy our requirement? > Actually, my question is that can GFS help us in our case? 
> > Regards Without knowing your performance requirements or available hardware, let me suggest: DRBD between the two nodes GFS2 on the DRBD resource. This way, you can use DRBD in Primary/Primary mount and mount the GFS2 share on both nodes at the same time. GFS2 required DLM, distributed lock manager, so you will need a minimal cluster setup. To answer your question directly; GFS2 does not need to fail over as it's available on all quorate cluster nodes at all times. If you just want to ensure that the data is synchronized between both nodes at all times, and you don't need to actually read/write from the backup node, then you could get away with just DRBD in Primary/Secondary mode with a normal FS like ext3. Of course, this would require manual recovery in the even of a failure, but the setup overhead would be a lot less. If either of these sound reasonable, let me know and I can help give you more specific suggestions. Let me know what you have in way of hardware (generally; NICs, Switches, etc). -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From ooolinux at 163.com Sun Apr 10 15:08:05 2011 From: ooolinux at 163.com (yue) Date: Sun, 10 Apr 2011 23:08:05 +0800 (CST) Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: References: Message-ID: <2bf8c53d.b709.12f3ff49bda.Coremail.ooolinux@163.com> What is fencing? Fencing is the act of forecefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test. Due to user reports of nodes hanging during fencing, OCFS2 1.2.5 no longer uses "panic" for fencing. Instead, by default, it uses "machine restart". This should not only prevent nodes from hanging during fencing but also allow for nodes to quickly restart and rejoin the cluster. While this change is internal in nature, we are documenting this so as to make users aware that they are no longer going to see the familiar panic stack trace during fencing. Instead they will see the message"*** ocfs2 is very sorry to be fencing this system by restarting ***" and that too probably only as part of the messages captured on the netdump/netconsole server. If perchance the user wishes to use panic to fence (maybe to see the familiar oops stack trace or on the advise of customer support to diagnose frequent reboots), one can do so by issuing the following command after the O2CB cluster is online. # echo 1 > /proc/fs/ocfs2_nodemanager/fence_method Please note that this change is local to a node. At 2011-04-10 22:29:01?"Meisam Mohammadkhani" wrote: Hi All, I'm new to GFS. I'm searching around a solution for our enterprise application that is responsible to save(and manipulate) historical data of industrial devices. Now, we have two stations that works like hot redundant of each other. Our challenge is in case of failure. For now, our application is responsible to handling fault by synchronizing the files that changed during the fault, by itself. Our application is running on two totally independent machines (one as redundant) and so each one has its own disk. 
We are searching around a solution like a "high available transparent file system" that makes the fault transparent to the application, so in case of fault, redundant machine still can access the files even the master machine is down (replica issue or such a thing). Is there fail-over feature in GFS that satisfy our requirement? Actually, my question is that can GFS help us in our case? Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From meisam.mohammadkhani at gmail.com Sun Apr 10 19:06:48 2011 From: meisam.mohammadkhani at gmail.com (Meisam Mohammadkhani) Date: Sun, 10 Apr 2011 23:36:48 +0430 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: <4DA1C569.7020600@alteeve.com> References: <4DA1C569.7020600@alteeve.com> Message-ID: Dear Digimer, First of all, thanks for your reply. I'm not familiar with DRBD, but according to my little searches it's a solution for high availability in Linux operating system. But, actually our application uses .net as its framework, so it is dependent to Windows-based operating systems and using the DRBD may facing us with some new challenges. Also because of commodity nature of our machines, using the windows solutions needs a windows server on machines that is heavy for them. Using DRBD, force us to run our application in virtual machines that decrease the performance according to hardware spec. So we thought that maybe a "high available transparent file system" can be a good solution for this case. Even if the file system was not so cross-platform, we maybe be able to handle it with virtual machines which use the physical disks as their storage. I will appreciate your opinion. Regards On Sun, Apr 10, 2011 at 7:27 PM, Digimer wrote: > On 04/10/2011 10:29 AM, Meisam Mohammadkhani wrote: > > Hi All, > > > > I'm new to GFS. I'm searching around a solution for our enterprise > > application that is responsible to save(and manipulate) historical data > > of industrial devices. Now, we have two stations that works like hot > > redundant of each other. Our challenge is in case of failure. For now, > > our application is responsible to handling fault by synchronizing the > > files that changed during the fault, by itself. Our application is > > running on two totally independent machines (one as redundant) and so > > each one has its own disk. > > We are searching around a solution like a "high available transparent > > file system" that makes the fault transparent to the application, so in > > case of fault, redundant machine still can access the files even the > > master machine is down (replica issue or such a thing). > > Is there fail-over feature in GFS that satisfy our requirement? > > Actually, my question is that can GFS help us in our case? > > > > Regards > > Without knowing your performance requirements or available hardware, let > me suggest: > > DRBD between the two nodes > GFS2 on the DRBD resource. > > This way, you can use DRBD in Primary/Primary mount and mount the GFS2 > share on both nodes at the same time. GFS2 required DLM, distributed > lock manager, so you will need a minimal cluster setup. To answer your > question directly; GFS2 does not need to fail over as it's available on > all quorate cluster nodes at all times. > > If you just want to ensure that the data is synchronized between both > nodes at all times, and you don't need to actually read/write from the > backup node, then you could get away with just DRBD in Primary/Secondary > mode with a normal FS like ext3. 
Of course, this would require manual > recovery in the even of a failure, but the setup overhead would be a lot > less. > > If either of these sound reasonable, let me know and I can help give you > more specific suggestions. Let me know what you have in way of hardware > (generally; NICs, Switches, etc). > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sun Apr 10 20:12:19 2011 From: linux at alteeve.com (Digimer) Date: Sun, 10 Apr 2011 16:12:19 -0400 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: References: <4DA1C569.7020600@alteeve.com> Message-ID: <4DA20F23.5020409@alteeve.com> On 04/10/2011 03:06 PM, Meisam Mohammadkhani wrote: > Dear Digimer, > > First of all, thanks for your reply. > I'm not familiar with DRBD, but according to my little searches it's a > solution for high availability in Linux operating system. But, actually > our application uses .net as its framework, so it is dependent to > Windows-based operating systems and using the DRBD may facing us with > some new challenges. Also because of commodity nature of our machines, > using the windows solutions needs a windows server on machines that is > heavy for them. Using DRBD, force us to run our application in virtual > machines that decrease the performance according to hardware spec. So we > thought that maybe a "high available transparent file system" can be a > good solution for this case. Even if the file system was not so > cross-platform, we maybe be able to handle it with virtual machines > which use the physical disks as their storage. > I will appreciate your opinion. > > Regards Hi, Well, I must say, I am a bit confused as this is the Linux cluster mail list. I assumed you were using Linux. :P If you are running a relatively modern version of windows (ie: 2008 R2, iirc), then you can run the Windows as a paravirtualized guest on a server with hardware virtualization support, which most modern machines have. Particularly higher-end equipment. You would need to run Linux on the hosts, but you could minimize the resources used by that host and dedicate most of the resources to the VM with relatively minor performance hit. Now, to back up, I can't say I can advice the use of a cluster without you being able or willing to go through the fairly steep learning curve. It's not *hard*, per-se, but there are many bits that have to work together for a cluster to be stable. This inter-dependence also means that there are many creative ways that things could go wrong. Without sufficient time to learn and experience with Linux, those problems could be too much to justify this solution. I'm afraid I know nothing about clustering or shared file systems in the Windows world. Perhaps your vendor could provide you with some insight into pure-windows solutions? If you're a windows shop, that might make the most sense as it's a platform you are already familiar with. 
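To make the DRBD dual-primary plus GFS2 suggestion earlier in this thread a little more concrete, a minimal sketch in DRBD 8.3-era syntax could look like the following. The host names, backing devices, addresses and split-brain policies are placeholders and assumptions only, not tested values:

-------------------------
resource r0 {
    protocol C;
    net {
        allow-two-primaries;                 # both nodes may be Primary at once
        after-sb-0pri discard-zero-changes;  # example split-brain policies only
        after-sb-1pri discard-secondary;
    }
    startup {
        become-primary-on both;
    }
    on nodeA {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.10.1:7788;
        meta-disk internal;
    }
    on nodeB {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.10.2:7788;
        meta-disk internal;
    }
}
-------------------------

On top of /dev/drbd0 one would then create the shared file system with something like "mkfs.gfs2 -p lock_dlm -t <clustername>:data -j 2 /dev/drbd0" and mount it on both nodes, which is only safe once cman, fencing and the DLM are up as described above.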
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From meisam.mohammadkhani at gmail.com Sun Apr 10 20:40:53 2011 From: meisam.mohammadkhani at gmail.com (Meisam Mohammadkhani) Date: Mon, 11 Apr 2011 01:10:53 +0430 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: <4DA20F23.5020409@alteeve.com> References: <4DA1C569.7020600@alteeve.com> <4DA20F23.5020409@alteeve.com> Message-ID: I'm familiar with Linux and clusters, but actually my knowledge is around HPC clusters, not HA clusters. So you are right about the "training time", but I will try to handle it ;) According to that Linux world has a great open source projects around high availability, file systems and so on, that was my suggest to our corporation head masters to use these solutions "transparently" for application, may handle some parts of application responsibility. I still thinks that using Linux solutions can help us transparently and we can handle it with virtual machines advantages. So I will appreciate your solutions in Linux world. ;) Regards On Mon, Apr 11, 2011 at 12:42 AM, Digimer wrote: > On 04/10/2011 03:06 PM, Meisam Mohammadkhani wrote: > > Dear Digimer, > > > > First of all, thanks for your reply. > > I'm not familiar with DRBD, but according to my little searches it's a > > solution for high availability in Linux operating system. But, actually > > our application uses .net as its framework, so it is dependent to > > Windows-based operating systems and using the DRBD may facing us with > > some new challenges. Also because of commodity nature of our machines, > > using the windows solutions needs a windows server on machines that is > > heavy for them. Using DRBD, force us to run our application in virtual > > machines that decrease the performance according to hardware spec. So we > > thought that maybe a "high available transparent file system" can be a > > good solution for this case. Even if the file system was not so > > cross-platform, we maybe be able to handle it with virtual machines > > which use the physical disks as their storage. > > I will appreciate your opinion. > > > > Regards > > Hi, > > Well, I must say, I am a bit confused as this is the Linux cluster > mail list. I assumed you were using Linux. :P > > If you are running a relatively modern version of windows (ie: 2008 > R2, iirc), then you can run the Windows as a paravirtualized guest on a > server with hardware virtualization support, which most modern machines > have. Particularly higher-end equipment. You would need to run Linux on > the hosts, but you could minimize the resources used by that host and > dedicate most of the resources to the VM with relatively minor > performance hit. > > Now, to back up, I can't say I can advice the use of a cluster without > you being able or willing to go through the fairly steep learning curve. > It's not *hard*, per-se, but there are many bits that have to work > together for a cluster to be stable. This inter-dependence also means > that there are many creative ways that things could go wrong. Without > sufficient time to learn and experience with Linux, those problems could > be too much to justify this solution. > > I'm afraid I know nothing about clustering or shared file systems in > the Windows world. Perhaps your vendor could provide you with some > insight into pure-windows solutions? If you're a windows shop, that > might make the most sense as it's a platform you are already familiar with. 
> > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Sun Apr 10 22:56:09 2011 From: gordan at bobich.net (Gordan Bobic) Date: Sun, 10 Apr 2011 23:56:09 +0100 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: References: <4DA1C569.7020600@alteeve.com> Message-ID: <4DA23589.4080702@bobich.net> On 10/04/2011 20:06, Meisam Mohammadkhani wrote: > Dear Digimer, > > First of all, thanks for your reply. > I'm not familiar with DRBD, but according to my little searches it's a > solution for high availability in Linux operating system. Considering you are asking on a Linux mailing list, that much should have been obvious pre-emptively. :) > But, actually > our application uses .net as its framework, so it is dependent to > Windows-based operating systems and using the DRBD may facing us with > some new challenges. Then I fear your asking here may leave you rather out of luck. RHCS, GFS and DRBD are all Linux-only. What you could potentially do is use the Linux storage servers as a back end with GFS on top of DRBD, and exporting the storage via Samba with CTDB. Your windows nodes could then connect to those. This will obviously double your hardware requirements, though. > Also because of commodity nature of our machines, > using the windows solutions needs a windows server on machines that is > heavy for them. Using DRBD, force us to run our application in virtual > machines that decrease the performance according to hardware spec. So we > thought that maybe a "high available transparent file system" can be a > good solution for this case. Even if the file system was not so > cross-platform, we maybe be able to handle it with virtual machines > which use the physical disks as their storage. > I will appreciate your opinion. If I am correctly following what you are saying, you want to use DRBD as the backing device which you want to export as a raw disk to your Windows VMs. That will only work if Windows have a clustering file system capable of concurrent access from multiple nodes. If you try to use standard NTFS on top of it, it will get corrupted as soon as you start writing to it from both nodes. You may be able to get some mileage out of Windows Cluster Shared Volumes, but how good that is (or whether it's any good at all for your application), I have no idea. You're better off asking that question on a Windows specific forum. Gordan From gordan at bobich.net Sun Apr 10 22:58:37 2011 From: gordan at bobich.net (Gordan Bobic) Date: Sun, 10 Apr 2011 23:58:37 +0100 Subject: [Linux-cluster] Fwd: High Available Transparent File System In-Reply-To: References: <4DA1C569.7020600@alteeve.com> <4DA20F23.5020409@alteeve.com> Message-ID: <4DA2361D.4070101@bobich.net> On 10/04/2011 21:40, Meisam Mohammadkhani wrote: > I'm familiar with Linux and clusters, but actually my knowledge is > around HPC clusters, not HA clusters. So you are right about the > "training time", but I will try to handle it ;) According to that Linux > world has a great open source projects around high availability, file > systems and so on, that was my suggest to our corporation head masters > to use these solutions "transparently" for application, may handle some > parts of application responsibility. 
I still thinks that using Linux > solutions can help us transparently and we can handle it with virtual > machines advantages. So I will appreciate your solutions in Linux world. ;) The problem you have is that you need concurrent access from Windows machines, and that will only work using some kind of a bodge solution like application servers accessing the data over Samba backed CIFS shares. Gordan From l.santeramo at brgm.fr Mon Apr 11 09:01:23 2011 From: l.santeramo at brgm.fr (Santeramo Luc) Date: Mon, 11 Apr 2011 09:01:23 +0000 Subject: [Linux-cluster] nfs4 kerberos In-Reply-To: References: <1302129765.23236.2.camel@hawku><1302135817.23236.19.camel@hawku> <1302138060.24349.7.camel@hawku><1302173040.17187.37.camel@cowie.iouk.ioroot.tld><1302178130.26066.19.camel@hawku><4414.160.107.87.10.1302202696.squirrel@www.yaktech.com><1302218204.29194.7.camel@hawku> Message-ID: <8C1F883FF5D0C34ABEAB0ACEF300212514BA28@srv154.brgm.fr> Ian, First, thanks for all those information. You have spent a lot of time to make it work on an active/passive cluster, but I was wondering if you tried to make it work on an active/active cluster ? Do you think that using VMs is the only solution ? Thanks, -- Luc De : linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Ian Hayes Envoy? : vendredi 8 avril 2011 01:31 ? : linux clustering Objet : Re: [Linux-cluster] nfs4 kerberos Hmm. I think you're overcomplicating it a bit. Instead of tweaking /etc/sysconfig/rpcbind, I did this: Edit /etc/init.d/portmap start() { hostname nfsserver.mydomain echo -n $"Starting $prog: " daemon portmap $PMAP_ARGS RETVAL=$? echo [ $RETVAL -eq 0 ] && touch /var/lock/subsys/portmap return $RETVAL } Start order is : IP, portmap, rpcgssd, rpcidmapd, nfs. My goal was to get the hostname changed to whateve the service principal in Kerberos was named to before the Kerberized daemons start up. You can also play by manually changing the hostname on one of the nodes and firing up the service just to see that everything works. On Thu, Apr 7, 2011 at 4:16 PM, Daniel R. Gore > wrote: Still could not get it to work. I tried changing the host name that rpcbind binds to during start up with arguments in the /etc/sysconfig/rpcbind file. RPCBIND_ARGS="hostname fserv.mydomain" rpcbind started correctly with not errors. I then restarted the other rpc daemons and nfs. Got the same error: rpc.svcgssd indicates "wrong principal" I know the ip is working correctly because I can ssh into using the file server name (fserv.mydomain). Looking for more ideas! Thanks. Dan On Thu, 2011-04-07 at 14:58 -0400, danielgore at yaktech.com wrote: > Ian, > > You can find it here; > > > http://sourceware.org/cluster/doc/nfscookbook.pdf > > > I had written up a rather large set of build documentation for many common > > clustered services. NFS4, Samba, Postfix/Cyrus, Squid and some other > > stuff. > > But those docs stayed with my employer, so.... I don't think I've seen > > this > > cookbook, is it some wiki-type thing where new docs can be contributed? > > > > On Thu, Apr 7, 2011 at 5:08 AM, Daniel R. Gore > > >wrote: > > > >> A better solution for NFSv4 in a cluster is really required. > >> > >> > >> A better cookbook with more real life likely scenarios for clustering > >> solutions would be really helpful. How many people actually setup the > >> complex three layered solutions depicted, as compared to people setting > >> up simple two/three node servers to for authorization, authentication, > >> file and license serving. 
It appears that the small business applicable > >> system is completely ignored. > >> > >> > >> On Thu, 2011-04-07 at 11:44 +0100, Colin Simpson wrote: > >> > That's interesting about making the portmapper dependant on the IP, > >> was > >> > this for the same reason I'm seeing just now. I used the method from > >> NFS > >> > cookbook where I pseudo load balancing by distributing my NFS exports > >> > across my nodes. Sadly the RHEL 6 portmapper replacement (rpcbind) > >> > replies on the node IP and not the service IP, and this breaks NFSv3 > >> > mounts from RHEL5 clients with iptables stateful firewalls. > >> > > >> > I opened a bug on this one and have a call open with RH (via Dell) on > >> > this: > >> > https://bugzilla.redhat.com/show_bug.cgi?id=689589 > >> > > >> > But I too would like a good clean method of doing kerberized NFSv4 on > >> a > >> > RHEL6 cluster. I thought NFSv4 being so central to RHEL6 this would be > >> > easy on a RHEL6 cluster (without using XEN)? Can the cookbook be > >> > updated? > >> > > >> > Which brings up another point. The RHEL cluster documentation is good, > >> > however it doesn't really help you implement a working cluster too > >> > easily (beyond the apache example), it's a bit reference orientated. I > >> > found myself googling around for examples of different RA types. Is > >> > there a more hands on set of docs around (or book)? It could almost do > >> > with a cookbook for every RA! > >> > > >> > Thanks > >> > > >> > Colin > >> > > >> > On Thu, 2011-04-07 at 02:52 +0100, Ian Hayes wrote: > >> > > Shouldnt have to recompile rpc.gssd. On failover I migrated the ip > >> > > address first, made portmapper a depend on the ip, rpc.gssd depend > >> on > >> > > portmap and nfsd depend on rpc. As for the hostname, I went with the > >> > > inelegant solution of putting a 'hostname' command in the start > >> > > functions of the portmapper script since that fires first in my > >> > > config. > >> > > > >> > > > On Apr 6, 2011 6:06 PM, "Daniel R. Gore" > > >> > > > wrote: > >> > > > > >> > > > I also found this thread, after many searches. > >> > > > http://linux-nfs.org/pipermail/nfsv4/2009-April/010583.html > >> > > > > >> > > > As I read through it, there appears to be a patch for rpc.gssd > >> which > >> > > > allows for the daemon to be started and associated with multiple > >> > > > hosts. > >> > > > I do not want to compile rpc.gssd and it appears the patch is from > >> > > > over > >> > > > two years ago. I would hope that RHEL6 would have rpc.gssd > >> patched > >> > > > to > >> > > > meet this requirement, but no documentation appear to exist for > >> how > >> > > > to > >> > > > use it. > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > On Wed, 2011-04-06 at 20:23 -0400, Daniel R. Gore wrote: > >> > > > > Ian, > >> > > > > > >> > > > > Thanks for the info. > >> > > > > > >> > > > >... > >> > > > > >> > > > >> > > plain text document attachment (ATT114553.txt) > >> > > -- > >> > > Linux-cluster mailing list > >> > > Linux-cluster at redhat.com > >> > > https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > >> > This email and any files transmitted with it are confidential and are > >> intended solely for the use of the individual or entity to whom they are > >> addressed. 
If you are not the original recipient or the person > >> responsible > >> for delivering the email to the intended recipient, be advised that you > >> have > >> received this email in error, and that any use, dissemination, > >> forwarding, > >> printing, or copying of this email is strictly prohibited. If you > >> received > >> this email in error, please immediately notify the sender and delete the > >> original. > >> > > >> > > >> > > >> > -- > >> > Linux-cluster mailing list > >> > Linux-cluster at redhat.com > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > >> > >> > >> > >> -- > >> This message has been scanned for viruses and > >> dangerous content by MailScanner, and is > >> believed to be clean. > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- > > This message has been scanned for viruses and > > dangerous content by MailScanner, and is > > believed to be clean. > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ********************************************************************************************** Pensez a l'environnement avant d'imprimer ce message Think Environment before printing Le contenu de ce mel et de ses pieces jointes est destine a l'usage exclusif du (des) destinataire(s) designe (s) comme tel(s). En cas de reception par erreur, le signaler a son expediteur et ne pas en divulguer le contenu. L'absence de virus a ete verifiee a l'emission, il convient neanmoins de s'assurer de l'absence de contamination a sa reception. The contents of this email and any attachments are confidential. They are intended for the named recipient (s) only. If you have received this email in error please notify the system manager or the sender immediately and do not disclose the contents to anyone or make copies. eSafe scanned this email for viruses, vandals and malicious content. ********************************************************************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From sherwin.tobias at nsn.com Tue Apr 12 08:29:12 2011 From: sherwin.tobias at nsn.com (Tobias, Sherwin (NSN - PH/Makati City)) Date: Tue, 12 Apr 2011 16:29:12 +0800 Subject: [Linux-cluster] RHCS logging debug level Message-ID: <9F94419BEEEECC4C8330D6A7771D384CB9252F@SGSIEXC009.nsn-intra.net> Hello. Can you help me increase the logging level (more detailed one) of cluster.log? I am troubleshooting our RHCS setup. Regards Sherwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Tue Apr 12 15:03:01 2011 From: linux at alteeve.com (Digimer) Date: Tue, 12 Apr 2011 11:03:01 -0400 Subject: [Linux-cluster] >1500 MTU on EL5 causes things to go sideways Message-ID: <4DA469A5.3060301@alteeve.com> Hi all, I've got a two-node EL5.6 cluster with a handful of interfaces on the nodes. When I do not specify MTU=x, everything works great. However, I want to use jumbo frames, and so the fun begins. As soon as I define MTU=2000 (for example), then cman on one note will start but not stop (the other node stops fine). 
Also, 'ccs_tool update /etc/cluster/cluster.conf' fails with: ==== [root at xenmaster003 ~]# ccs_tool update /etc/cluster/cluster.confFailed to receive COMM_UPDATE_NOTICE_ACK from xenmaster004.iplink.net. Hint: Check the log on xenmaster004.iplink.net for reason. Failed to update config file. ==== Nothing at all gets written to the target node's log files though. Is there some subtle magic needed to get jumbo frames working in the cluster? Am I missing the forest for the trees? :) -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Tue Apr 12 15:16:33 2011 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 12 Apr 2011 16:16:33 +0100 Subject: [Linux-cluster] >1500 MTU on EL5 causes things to go sideways In-Reply-To: <4DA469A5.3060301@alteeve.com> References: <4DA469A5.3060301@alteeve.com> Message-ID: <4DA46CD1.4060107@bobich.net> Digimer wrote: > I've got a two-node EL5.6 cluster with a handful of interfaces on the > nodes. When I do not specify MTU=x, everything works great. However, I > want to use jumbo frames, and so the fun begins. > > As soon as I define MTU=2000 (for example), then cman on one note will > start but not stop (the other node stops fine). Also, 'ccs_tool update > /etc/cluster/cluster.conf' fails with: > > ==== > [root at xenmaster003 ~]# ccs_tool update /etc/cluster/cluster.confFailed > to receive COMM_UPDATE_NOTICE_ACK from xenmaster004.iplink.net. > Hint: Check the log on xenmaster004.iplink.net for reason. > > Failed to update config file. > ==== > > Nothing at all gets written to the target node's log files though. > > Is there some subtle magic needed to get jumbo frames working in the > cluster? Am I missing the forest for the trees? :) This may be a silly question, but as a first pass: 1) Do you have the MTU set on both nodes to the same value at the same time? 2) Have you confirmed that your NICs actually work properly with jumbo frames? Some Gb NICs can't handle them. 3) Does your switch support jumbo frames and have you explicitly enabled them on the switch? I have used jumbo frames up to 16KB (on Intel NICs with a cross-over cable on a 2-node cluster) with no problems. Gordan From linux at alteeve.com Tue Apr 12 15:22:34 2011 From: linux at alteeve.com (Digimer) Date: Tue, 12 Apr 2011 11:22:34 -0400 Subject: [Linux-cluster] >1500 MTU on EL5 causes things to go sideways In-Reply-To: <4DA46CD1.4060107@bobich.net> References: <4DA469A5.3060301@alteeve.com> <4DA46CD1.4060107@bobich.net> Message-ID: <4DA46E3A.9020404@alteeve.com> On 04/12/2011 11:16 AM, Gordan Bobic wrote: > Digimer wrote: > >> I've got a two-node EL5.6 cluster with a handful of interfaces on the >> nodes. When I do not specify MTU=x, everything works great. However, I >> want to use jumbo frames, and so the fun begins. >> >> As soon as I define MTU=2000 (for example), then cman on one note will >> start but not stop (the other node stops fine). Also, 'ccs_tool update >> /etc/cluster/cluster.conf' fails with: >> >> ==== >> [root at xenmaster003 ~]# ccs_tool update /etc/cluster/cluster.confFailed >> to receive COMM_UPDATE_NOTICE_ACK from xenmaster004.iplink.net. >> Hint: Check the log on xenmaster004.iplink.net for reason. >> >> Failed to update config file. >> ==== >> >> Nothing at all gets written to the target node's log files though. >> >> Is there some subtle magic needed to get jumbo frames working in the >> cluster? Am I missing the forest for the trees? 
:) > > This may be a silly question, but as a first pass: No worries, I routinely make silly mistakes, so silly questions are appropriate. :) > 1) Do you have the MTU set on both nodes to the same value at the same > time? Yes. > 2) Have you confirmed that your NICs actually work properly with jumbo > frames? Some Gb NICs can't handle them. Yes. > 3) Does your switch support jumbo frames and have you explicitly enabled > them on the switch? Yes and yes. > I have used jumbo frames up to 16KB (on Intel NICs with a cross-over > cable on a 2-node cluster) with no problems. > > Gordan Thanks for the questions. Still answer-hunting though. :P -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From ajb2 at mssl.ucl.ac.uk Tue Apr 12 15:33:06 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Tue, 12 Apr 2011 16:33:06 +0100 Subject: [Linux-cluster] >1500 MTU on EL5 causes things to go sideways In-Reply-To: <4DA469A5.3060301@alteeve.com> References: <4DA469A5.3060301@alteeve.com> Message-ID: <4DA470B2.8020408@mssl.ucl.ac.uk> Digimer wrote: > > As soon as I define MTU=2000 (for example), then cman on one note will > start but not stop (the other node stops fine). Also, 'ccs_tool update > /etc/cluster/cluster.conf' fails with: Have you configured the interfaces themselves to use jumbo frames? Does the switch support jumbo frames? Is path mtu discovery enabled? (should only matter if there's a router in the way) From linux at alteeve.com Tue Apr 12 15:41:21 2011 From: linux at alteeve.com (Digimer) Date: Tue, 12 Apr 2011 11:41:21 -0400 Subject: [Linux-cluster] >1500 MTU on EL5 causes things to go sideways In-Reply-To: <4DA470B2.8020408@mssl.ucl.ac.uk> References: <4DA469A5.3060301@alteeve.com> <4DA470B2.8020408@mssl.ucl.ac.uk> Message-ID: <4DA472A1.5070606@alteeve.com> On 04/12/2011 11:33 AM, Alan Brown wrote: > Digimer wrote: >> >> As soon as I define MTU=2000 (for example), then cman on one note will >> start but not stop (the other node stops fine). Also, 'ccs_tool update >> /etc/cluster/cluster.conf' fails with: > > Have you configured the interfaces themselves to use jumbo frames? > > Does the switch support jumbo frames? > > Is path mtu discovery enabled? (should only matter if there's a router > in the way) The nodes are connected via a common switch, which supports large jumbo frames (10k) and jumbo frames are enabled. Each node's NIC is set to the same MTU=x number. No routers involved, and the NICs themselves also support 9kb JFs. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From carlopmart at gmail.com Thu Apr 14 10:44:39 2011 From: carlopmart at gmail.com (carlopmart) Date: Thu, 14 Apr 2011 12:44:39 +0200 Subject: [Linux-cluster] Using specific physical interface to migrate vms Message-ID: <4DA6D017.1060403@gmail.com> Hi all, I have installed two RHEL6 KVM hosts with cman/openais/rgmanager (RHCS) with latest uptades to support virtual guest migrations using a NFS resource as a shared storage. I have five physical networks interfaces on each host. I would like to use eth1 on both to accomplish vm migration task, but hostnames are binded to eth0. How can I configure cluster.conf file to assign eth1 interface when vm live migration will be required?? Thanks. 
-- CL Martinez carlopmart {at} gmail {d0t} com From gianluca.cecchi at gmail.com Fri Apr 15 09:01:00 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Fri, 15 Apr 2011 11:01:00 +0200 Subject: [Linux-cluster] Using specific physical interface to migrate vms Message-ID: On Thu, 14 Apr 2011 12:44:39 +0200 carlopmart wrote: > How can I configure cluster.conf file to assign eth1 interface when vm live migration will be required?? Already replied on rhelv6 list. Next time please don't cross post. Send to one list and eventually only after some time choose another list if more appropriate than the first one.... Gianluca From rossnick-lists at cybercat.ca Fri Apr 15 15:00:16 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 15 Apr 2011 11:00:16 -0400 Subject: [Linux-cluster] Slowness of GFS2 Message-ID: <7CBCEAC9F7BE4E658B6CDBD81DFBC9F8@versa> Hi ! We are slowly migrating services to our new cluster... We curently have an 8 node cluster with a 16 1tb disk enclosure in 7 x raid1 pairs, with 2 global spares. the first 2 arrays are in one vg, wich is in turn seperated in a 100 gig, a 300 and a 1tb lvs, all in gfs2. I use the firts lv as a global utility partition where I put my developement directories (source trees for apache, php, etc), and other binary utilities. This partition is mounted all the time on all nodes (it's in /etc/fstab). Now, if I connect to the first node, do some make distclean, ./configure, make and make install in on of my source directory, httpd for instance. The make distclean for exemple takes about 30 seconds or so. Now, I logout from the first node and I move to the next node where I need to do such update, and go into the same directory and do the same thing. Now it takes forever. I stoped after 3 or 4 minutes. There was no other node useing that directory at that time. Here's my fstab entry : /dev/VGa/gfs /gfs gfs2 defaults,noatime,noquota 0 0 What can I do or what parameter can I tune to help improve this kind of performance ? From swhiteho at redhat.com Fri Apr 15 15:45:24 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 15 Apr 2011 16:45:24 +0100 Subject: [Linux-cluster] Slowness of GFS2 In-Reply-To: <7CBCEAC9F7BE4E658B6CDBD81DFBC9F8@versa> References: <7CBCEAC9F7BE4E658B6CDBD81DFBC9F8@versa> Message-ID: <1302882324.2692.8.camel@dolmen> Hi, On Fri, 2011-04-15 at 11:00 -0400, Nicolas Ross wrote: > Hi ! > > We are slowly migrating services to our new cluster... We curently have an 8 > node cluster with a 16 1tb disk enclosure in 7 x raid1 pairs, with 2 global > spares. > > the first 2 arrays are in one vg, wich is in turn seperated in a 100 gig, a > 300 and a 1tb lvs, all in gfs2. > > I use the firts lv as a global utility partition where I put my developement > directories (source trees for apache, php, etc), and other binary utilities. > This partition is mounted all the time on all nodes (it's in /etc/fstab). > > Now, if I connect to the first node, do some make distclean, ./configure, > make and make > install in on of my source directory, httpd for instance. The make distclean > for exemple takes about 30 seconds or so. > > Now, I logout from the first node and I move to the next node where I need > to do such update, and go into the same directory and do the same thing. Now > it takes forever. I stoped after 3 or 4 minutes. There was no other node > useing that directory at that time. 
> > Here's my fstab entry : > > /dev/VGa/gfs /gfs gfs2 defaults,noatime,noquota 0 0 > > What can I do or what parameter can I tune to help improve this kind of > performance ? > Did you do a sync on the node that you were moving away from before starting work on the new node? That should help to speed up the change of node. I suspect that the issue is that the utilities are working over a large number of small files, and those will take some time to migrate to the new node, since it will be accessing each file sequentially. There have been a few performance improvements recently which may help, depending on exactly which kernel you are using at the moment, Steve. From rossnick-lists at cybercat.ca Sat Apr 16 02:13:26 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 15 Apr 2011 22:13:26 -0400 Subject: [Linux-cluster] Slowness of GFS2 In-Reply-To: <1302882324.2692.8.camel@dolmen> References: <7CBCEAC9F7BE4E658B6CDBD81DFBC9F8@versa> <1302882324.2692.8.camel@dolmen> Message-ID: (...) >> What can I do or what parameter can I tune to help improve this kind of >> performance ? >> > Did you do a sync on the node that you were moving away from before > starting work on the new node? That should help to speed up the change > of node. I suspect that the issue is that the utilities are working over > a large number of small files, and those will take some time to migrate > to the new node, since it will be accessing each file sequentially. > > There have been a few performance improvements recently which may help, > depending on exactly which kernel you are using at the moment, Thanks, I will try that. I am at 2.6.32-71.24.1.el6.x86_64, wich I think is the latest RHEL6 kernel. From meisam.mohammadkhani at gmail.com Sun Apr 17 05:25:53 2011 From: meisam.mohammadkhani at gmail.com (Meisam Mohammadkhani) Date: Sun, 17 Apr 2011 09:55:53 +0430 Subject: [Linux-cluster] Recovery Tools Message-ID: Hi, There is ordinary recovery tools in market that is useful when our data corrupt or loose, but they support ordinary file systems. My question is for any reason, if our data was corrupted under GFS, is there any recovery tools for retrieving lost data?! Can we use ordinary recovery tools?! Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From td3201 at gmail.com Sun Apr 17 20:52:53 2011 From: td3201 at gmail.com (Terry) Date: Sun, 17 Apr 2011 15:52:53 -0500 Subject: [Linux-cluster] problems with clvmd Message-ID: As a result of a strange situation where our licensing for storage dropped off, I need to join a centos 5.6 node to a now single node cluster. I got it joined to the cluster but I am having issues with CLVMD. Any lvm operations on both boxes hang. For example, vgscan. I have increased debugging and I don't see any logs. The VGs aren't being populated in /dev/mapper. This WAS working right after I joined it to the cluster and now it's not for some unknown reason. Not sure where to take this at this point. I did find one weird startup log that I am not sure what it means yet: [root at omadvnfs01a ~]# dmesg | grep dlm dlm: no local IP address has been set dlm: cannot start dlm lowcomms -107 dlm: Using TCP for communications dlm: connecting to 2 [root at omadvnfs01a ~]# ps xauwwww | grep dlm root 5476 0.0 0.0 24736 760 ? Ss 15:34 0:00 /sbin/dlm_controld root 5502 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_astd] root 5503 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_scand] root 5504 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_recv] root 5505 0.0 0.0 0 0 ? 
S< 15:34 0:00 [dlm_send] root 5506 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_recoverd] root 5546 0.0 0.0 0 0 ? S< 15:35 0:00 [dlm_recoverd] [root at omadvnfs01a ~]# lsmod | grep dlm lock_dlm 52065 0 gfs2 529037 1 lock_dlm dlm 160065 17 lock_dlm configfs 62045 2 dlm centos server: [root at omadvnfs01a ~]# rpm -q cman rgmanager lvm2-cluster cman-2.0.115-68.el5 rgmanager-2.0.52-9.el5.centos lvm2-cluster-2.02.74-3.el5_6.1 [root at omadvnfs01a ~]# ls /dev/mapper/ | grep -v mpath control VolGroup00-LogVol00 VolGroup00-LogVol01 rhel server: [root at omadvnfs01b network-scripts]# rpm -q cman rgmanager lvm2-cluster cman-2.0.115-34.el5 rgmanager-2.0.52-6.el5 lvm2-cluster-2.02.56-7.el5_5.4 [root at omadvnfs01b network-scripts]# ls /dev/mapper/ | grep -v mpath control vg_data01a-lv_data01a vg_data01b-lv_data01b vg_data01c-lv_data01c vg_data01d-lv_data01d vg_data01e-lv_data01e vg_data01h-lv_data01h vg_data01i-lv_data01i VolGroup00-LogVol00 VolGroup00-LogVol01 VolGroup02-lv_data00 [root at omadvnfs01b network-scripts]# clustat Cluster Status for omadvnfs01 @ Sun Apr 17 15:44:52 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ omadvnfs01a.sec.jel.lc 1 Online, rgmanager omadvnfs01b.sec.jel.lc 2 Online, Local, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:omadvnfs01-nfs-a omadvnfs01b.sec.jel.lc started service:omadvnfs01-nfs-b omadvnfs01b.sec.jel.lc started service:omadvnfs01-nfs-c omadvnfs01b.sec.jel.lc started service:omadvnfs01-nfs-h omadvnfs01b.sec.jel.lc started service:omadvnfs01-nfs-i omadvnfs01b.sec.jel.lc started service:postgresql omadvnfs01b.sec.jel.lc started [root at omadvnfs01a ~]# cman_tool nodes Node Sts Inc Joined Name 1 M 1892 2011-04-17 15:34:24 omadvnfs01a.sec.jel.lc 2 M 1896 2011-04-17 15:34:24 omadvnfs01b.sec.jel.lc From ccaulfie at redhat.com Mon Apr 18 08:48:34 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Mon, 18 Apr 2011 09:48:34 +0100 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: Message-ID: <4DABFAE2.1090502@redhat.com> On 17/04/11 21:52, Terry wrote: > As a result of a strange situation where our licensing for storage > dropped off, I need to join a centos 5.6 node to a now single node > cluster. I got it joined to the cluster but I am having issues with > CLVMD. Any lvm operations on both boxes hang. For example, vgscan. > I have increased debugging and I don't see any logs. The VGs aren't > being populated in /dev/mapper. This WAS working right after I joined > it to the cluster and now it's not for some unknown reason. Not sure > where to take this at this point. I did find one weird startup log > that I am not sure what it means yet: > [root at omadvnfs01a ~]# dmesg | grep dlm > dlm: no local IP address has been set > dlm: cannot start dlm lowcomms -107 > dlm: Using TCP for communications > dlm: connecting to 2 > That message usually means that dlm_controld has failed to start. Try starting the cman daemons (groupd, dlm_controld) manually with the -D switch and read the output which might give some clues to why it's not working. Chrissie > [root at omadvnfs01a ~]# ps xauwwww | grep dlm > root 5476 0.0 0.0 24736 760 ? Ss 15:34 0:00 > /sbin/dlm_controld > root 5502 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_astd] > root 5503 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_scand] > root 5504 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_recv] > root 5505 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_send] > root 5506 0.0 0.0 0 0 ? S< 15:34 0:00 [dlm_recoverd] > root 5546 0.0 0.0 0 0 ? 
S< 15:35 0:00 [dlm_recoverd] > > [root at omadvnfs01a ~]# lsmod | grep dlm > lock_dlm 52065 0 > gfs2 529037 1 lock_dlm > dlm 160065 17 lock_dlm > configfs 62045 2 dlm > > > centos server: > [root at omadvnfs01a ~]# rpm -q cman rgmanager lvm2-cluster > cman-2.0.115-68.el5 > rgmanager-2.0.52-9.el5.centos > lvm2-cluster-2.02.74-3.el5_6.1 > > [root at omadvnfs01a ~]# ls /dev/mapper/ | grep -v mpath > control > VolGroup00-LogVol00 > VolGroup00-LogVol01 > > rhel server: > [root at omadvnfs01b network-scripts]# rpm -q cman rgmanager lvm2-cluster > cman-2.0.115-34.el5 > rgmanager-2.0.52-6.el5 > lvm2-cluster-2.02.56-7.el5_5.4 > > [root at omadvnfs01b network-scripts]# ls /dev/mapper/ | grep -v mpath > control > vg_data01a-lv_data01a > vg_data01b-lv_data01b > vg_data01c-lv_data01c > vg_data01d-lv_data01d > vg_data01e-lv_data01e > vg_data01h-lv_data01h > vg_data01i-lv_data01i > VolGroup00-LogVol00 > VolGroup00-LogVol01 > VolGroup02-lv_data00 > > [root at omadvnfs01b network-scripts]# clustat > Cluster Status for omadvnfs01 @ Sun Apr 17 15:44:52 2011 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > omadvnfs01a.sec.jel.lc 1 > Online, rgmanager > omadvnfs01b.sec.jel.lc 2 > Online, Local, rgmanager > > Service Name > Owner (Last) State > ------- ---- > ----- ------ ----- > service:omadvnfs01-nfs-a > omadvnfs01b.sec.jel.lc > started > service:omadvnfs01-nfs-b > omadvnfs01b.sec.jel.lc > started > service:omadvnfs01-nfs-c > omadvnfs01b.sec.jel.lc > started > service:omadvnfs01-nfs-h > omadvnfs01b.sec.jel.lc > started > service:omadvnfs01-nfs-i > omadvnfs01b.sec.jel.lc > started > service:postgresql > omadvnfs01b.sec.jel.lc > started > > > [root at omadvnfs01a ~]# cman_tool nodes > Node Sts Inc Joined Name > 1 M 1892 2011-04-17 15:34:24 omadvnfs01a.sec.jel.lc > 2 M 1896 2011-04-17 15:34:24 omadvnfs01b.sec.jel.lc > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From td3201 at gmail.com Mon Apr 18 13:38:55 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 08:38:55 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: <4DABFAE2.1090502@redhat.com> References: <4DABFAE2.1090502@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield wrote: > On 17/04/11 21:52, Terry wrote: >> >> As a result of a strange situation where our licensing for storage >> dropped off, I need to join a centos 5.6 node to a now single node >> cluster. ?I got it joined to the cluster but I am having issues with >> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >> I have increased debugging and I don't see any logs. ?The VGs aren't >> being populated in /dev/mapper. ?This WAS working right after I joined >> it to the cluster and now it's not for some unknown reason. ?Not sure >> where to take this at this point. ? I did find one weird startup log >> that I am not sure what it means yet: >> [root at omadvnfs01a ~]# dmesg | grep dlm >> dlm: no local IP address has been set >> dlm: cannot start dlm lowcomms -107 >> dlm: Using TCP for communications >> dlm: connecting to 2 >> > > > That message usually means that dlm_controld has failed to start. Try > starting the cman daemons (groupd, dlm_controld) manually with the -D switch > and read the output which might give some clues to why it's not working. > > Chrissie > Hi Chrissie, I thought of that but I see dlm started on both nodes. See right below. >> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >> root ? ? 
?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >> /sbin/dlm_controld >> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >> [dlm_astd] >> root ? ? ?5503 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >> [dlm_scand] >> root ? ? ?5504 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >> [dlm_recv] >> root ? ? ?5505 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >> [dlm_send] >> root ? ? ?5506 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >> [dlm_recoverd] >> root ? ? ?5546 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:35 ? 0:00 >> [dlm_recoverd] >> >> [root at omadvnfs01a ~]# lsmod | grep dlm >> lock_dlm ? ? ? ? ? ? ? 52065 ?0 >> gfs2 ? ? ? ? ? ? ? ? ?529037 ?1 lock_dlm >> dlm ? ? ? ? ? ? ? ? ? 160065 ?17 lock_dlm >> configfs ? ? ? ? ? ? ? 62045 ?2 dlm >> >> >> centos server: >> [root at omadvnfs01a ~]# rpm -q cman rgmanager lvm2-cluster >> cman-2.0.115-68.el5 >> rgmanager-2.0.52-9.el5.centos >> lvm2-cluster-2.02.74-3.el5_6.1 >> >> [root at omadvnfs01a ~]# ls /dev/mapper/ | grep -v mpath >> control >> VolGroup00-LogVol00 >> VolGroup00-LogVol01 >> >> rhel server: >> [root at omadvnfs01b network-scripts]# rpm -q cman rgmanager lvm2-cluster >> cman-2.0.115-34.el5 >> rgmanager-2.0.52-6.el5 >> lvm2-cluster-2.02.56-7.el5_5.4 >> >> [root at omadvnfs01b network-scripts]# ls /dev/mapper/ | grep -v mpath >> control >> vg_data01a-lv_data01a >> vg_data01b-lv_data01b >> vg_data01c-lv_data01c >> vg_data01d-lv_data01d >> vg_data01e-lv_data01e >> vg_data01h-lv_data01h >> vg_data01i-lv_data01i >> VolGroup00-LogVol00 >> VolGroup00-LogVol01 >> VolGroup02-lv_data00 >> >> [root at omadvnfs01b network-scripts]# clustat >> Cluster Status for omadvnfs01 @ Sun Apr 17 15:44:52 2011 >> Member Status: Quorate >> >> ?Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ID >> Status >> ?------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ---- >> ------ >> ?omadvnfs01a.sec.jel.lc ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1 >> Online, rgmanager >> ?omadvnfs01b.sec.jel.lc ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?2 >> Online, Local, rgmanager >> >> ?Service Name >> Owner (Last) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State >> ?------- ---- >> ----- ------ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ----- >> ?service:omadvnfs01-nfs-a >> omadvnfs01b.sec.jel.lc >> started >> ?service:omadvnfs01-nfs-b >> omadvnfs01b.sec.jel.lc >> started >> ?service:omadvnfs01-nfs-c >> omadvnfs01b.sec.jel.lc >> started >> ?service:omadvnfs01-nfs-h >> omadvnfs01b.sec.jel.lc >> started >> ?service:omadvnfs01-nfs-i >> omadvnfs01b.sec.jel.lc >> started >> ?service:postgresql >> omadvnfs01b.sec.jel.lc >> started >> >> >> [root at omadvnfs01a ~]# cman_tool nodes >> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >> ? ?1 ? M ? 1892 ? 2011-04-17 15:34:24 ?omadvnfs01a.sec.jel.lc >> ? ?2 ? M ? 1896 ? 
2011-04-17 15:34:24 ?omadvnfs01b.sec.jel.lc >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From ccaulfie at redhat.com Mon Apr 18 13:57:37 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Mon, 18 Apr 2011 14:57:37 +0100 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> Message-ID: <4DAC4351.9090503@redhat.com> On 18/04/11 14:38, Terry wrote: > On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield > wrote: >> On 17/04/11 21:52, Terry wrote: >>> >>> As a result of a strange situation where our licensing for storage >>> dropped off, I need to join a centos 5.6 node to a now single node >>> cluster. I got it joined to the cluster but I am having issues with >>> CLVMD. Any lvm operations on both boxes hang. For example, vgscan. >>> I have increased debugging and I don't see any logs. The VGs aren't >>> being populated in /dev/mapper. This WAS working right after I joined >>> it to the cluster and now it's not for some unknown reason. Not sure >>> where to take this at this point. I did find one weird startup log >>> that I am not sure what it means yet: >>> [root at omadvnfs01a ~]# dmesg | grep dlm >>> dlm: no local IP address has been set >>> dlm: cannot start dlm lowcomms -107 >>> dlm: Using TCP for communications >>> dlm: connecting to 2 >>> >> >> >> That message usually means that dlm_controld has failed to start. Try >> starting the cman daemons (groupd, dlm_controld) manually with the -D switch >> and read the output which might give some clues to why it's not working. >> >> Chrissie >> > > > Hi Chrissie, > > I thought of that but I see dlm started on both nodes. See right below. > >>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>> root 5476 0.0 0.0 24736 760 ? Ss 15:34 0:00 >>> /sbin/dlm_controld >>> root 5502 0.0 0.0 0 0 ? S< 15:34 0:00 Well, that's encouraging in a way! But it's evidently not started fully or the DLM itself would be working. So I still recommend starting it with -D to see how far it gets. Chrissie From td3201 at gmail.com Mon Apr 18 13:57:34 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 08:57:34 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 8:38 AM, Terry wrote: > On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield > wrote: >> On 17/04/11 21:52, Terry wrote: >>> >>> As a result of a strange situation where our licensing for storage >>> dropped off, I need to join a centos 5.6 node to a now single node >>> cluster. ?I got it joined to the cluster but I am having issues with >>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>> I have increased debugging and I don't see any logs. ?The VGs aren't >>> being populated in /dev/mapper. ?This WAS working right after I joined >>> it to the cluster and now it's not for some unknown reason. ?Not sure >>> where to take this at this point. ? I did find one weird startup log >>> that I am not sure what it means yet: >>> [root at omadvnfs01a ~]# dmesg | grep dlm >>> dlm: no local IP address has been set >>> dlm: cannot start dlm lowcomms -107 >>> dlm: Using TCP for communications >>> dlm: connecting to 2 >>> >> >> >> That message usually means that dlm_controld has failed to start. 
Try >> starting the cman daemons (groupd, dlm_controld) manually with the -D switch >> and read the output which might give some clues to why it's not working. >> >> Chrissie >> > > > Hi Chrissie, > > I thought of that but I see dlm started on both nodes. ?See right below. > >>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>> /sbin/dlm_controld >>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>> [dlm_astd] >>> root ? ? ?5503 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>> [dlm_scand] >>> root ? ? ?5504 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>> [dlm_recv] >>> root ? ? ?5505 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>> [dlm_send] >>> root ? ? ?5506 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>> [dlm_recoverd] >>> root ? ? ?5546 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:35 ? 0:00 >>> [dlm_recoverd] >>> >>> [root at omadvnfs01a ~]# lsmod | grep dlm >>> lock_dlm ? ? ? ? ? ? ? 52065 ?0 >>> gfs2 ? ? ? ? ? ? ? ? ?529037 ?1 lock_dlm >>> dlm ? ? ? ? ? ? ? ? ? 160065 ?17 lock_dlm >>> configfs ? ? ? ? ? ? ? 62045 ?2 dlm >>> >>> >>> centos server: >>> [root at omadvnfs01a ~]# rpm -q cman rgmanager lvm2-cluster >>> cman-2.0.115-68.el5 >>> rgmanager-2.0.52-9.el5.centos >>> lvm2-cluster-2.02.74-3.el5_6.1 >>> >>> [root at omadvnfs01a ~]# ls /dev/mapper/ | grep -v mpath >>> control >>> VolGroup00-LogVol00 >>> VolGroup00-LogVol01 >>> >>> rhel server: >>> [root at omadvnfs01b network-scripts]# rpm -q cman rgmanager lvm2-cluster >>> cman-2.0.115-34.el5 >>> rgmanager-2.0.52-6.el5 >>> lvm2-cluster-2.02.56-7.el5_5.4 >>> >>> [root at omadvnfs01b network-scripts]# ls /dev/mapper/ | grep -v mpath >>> control >>> vg_data01a-lv_data01a >>> vg_data01b-lv_data01b >>> vg_data01c-lv_data01c >>> vg_data01d-lv_data01d >>> vg_data01e-lv_data01e >>> vg_data01h-lv_data01h >>> vg_data01i-lv_data01i >>> VolGroup00-LogVol00 >>> VolGroup00-LogVol01 >>> VolGroup02-lv_data00 >>> >>> [root at omadvnfs01b network-scripts]# clustat >>> Cluster Status for omadvnfs01 @ Sun Apr 17 15:44:52 2011 >>> Member Status: Quorate >>> >>> ?Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ID >>> Status >>> ?------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ---- >>> ------ >>> ?omadvnfs01a.sec.jel.lc ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1 >>> Online, rgmanager >>> ?omadvnfs01b.sec.jel.lc ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?2 >>> Online, Local, rgmanager >>> >>> ?Service Name >>> Owner (Last) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State >>> ?------- ---- >>> ----- ------ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ----- >>> ?service:omadvnfs01-nfs-a >>> omadvnfs01b.sec.jel.lc >>> started >>> ?service:omadvnfs01-nfs-b >>> omadvnfs01b.sec.jel.lc >>> started >>> ?service:omadvnfs01-nfs-c >>> omadvnfs01b.sec.jel.lc >>> started >>> ?service:omadvnfs01-nfs-h >>> omadvnfs01b.sec.jel.lc >>> started >>> ?service:omadvnfs01-nfs-i >>> omadvnfs01b.sec.jel.lc >>> started >>> ?service:postgresql >>> omadvnfs01b.sec.jel.lc >>> started >>> >>> >>> [root at omadvnfs01a ~]# cman_tool nodes >>> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >>> ? ?1 ? M ? 1892 ? 2011-04-17 15:34:24 ?omadvnfs01a.sec.jel.lc >>> ? ?2 ? M ? 1896 ? 
2011-04-17 15:34:24 ?omadvnfs01b.sec.jel.lc >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > Ok, started all the CMAN elements manually as you suggested. I started them in order as in the init script. Here's the only error that I see. I can post the other debug messages if you think they'd be useful but this is the only one that stuck out to me. [root at omadvnfs01a ~]# /sbin/dlm_controld -D 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 1303134840 set_ccs_options 480 1303134840 cman: node 2 added 1303134840 set_configfs_node 2 10.198.1.111 local 0 1303134840 cman: node 3 added 1303134840 set_configfs_node 3 10.198.1.110 local 1 From td3201 at gmail.com Mon Apr 18 14:11:35 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 09:11:35 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: <4DAC4351.9090503@redhat.com> References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield wrote: > On 18/04/11 14:38, Terry wrote: >> >> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >> ?wrote: >>> >>> On 17/04/11 21:52, Terry wrote: >>>> >>>> As a result of a strange situation where our licensing for storage >>>> dropped off, I need to join a centos 5.6 node to a now single node >>>> cluster. ?I got it joined to the cluster but I am having issues with >>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>> being populated in /dev/mapper. ?This WAS working right after I joined >>>> it to the cluster and now it's not for some unknown reason. ?Not sure >>>> where to take this at this point. ? I did find one weird startup log >>>> that I am not sure what it means yet: >>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>> dlm: no local IP address has been set >>>> dlm: cannot start dlm lowcomms -107 >>>> dlm: Using TCP for communications >>>> dlm: connecting to 2 >>>> >>> >>> >>> That message usually means that dlm_controld has failed to start. Try >>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>> switch >>> and read the output which might give some clues to why it's not working. >>> >>> Chrissie >>> >> >> >> Hi Chrissie, >> >> I thought of that but I see dlm started on both nodes. ?See right below. >> >>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>> /sbin/dlm_controld >>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ? ?15:34 ? 0:00 > > > Well, that's encouraging in a way! But it's evidently not started fully or > the DLM itself would be working. So I still recommend starting it with -D to > see how far it gets. > > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > I think we had posts cross. Here's my latest: Ok, started all the CMAN elements manually as you suggested. I started them in order as in the init script. Here's the only error that I see. I can post the other debug messages if you think they'd be useful but this is the only one that stuck out to me. 
[root at omadvnfs01a ~]# /sbin/dlm_controld -D 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 1303134840 set_ccs_options 480 1303134840 cman: node 2 added 1303134840 set_configfs_node 2 10.198.1.111 local 0 1303134840 cman: node 3 added 1303134840 set_configfs_node 3 10.198.1.110 local 1 From kkovachev at varna.net Mon Apr 18 14:13:37 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 18 Apr 2011 17:13:37 +0300 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> Message-ID: On Mon, 18 Apr 2011 08:57:34 -0500, Terry wrote: > On Mon, Apr 18, 2011 at 8:38 AM, Terry wrote: >> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >> wrote: >>> On 17/04/11 21:52, Terry wrote: >>>> >>>> As a result of a strange situation where our licensing for storage >>>> dropped off, I need to join a centos 5.6 node to a now single node >>>> cluster. I got it joined to the cluster but I am having issues with >>>> CLVMD. Any lvm operations on both boxes hang. For example, vgscan. >>>> I have increased debugging and I don't see any logs. The VGs aren't >>>> being populated in /dev/mapper. This WAS working right after I joined >>>> it to the cluster and now it's not for some unknown reason. Not sure >>>> where to take this at this point. I did find one weird startup log >>>> that I am not sure what it means yet: >>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>> dlm: no local IP address has been set >>>> dlm: cannot start dlm lowcomms -107 >>>> dlm: Using TCP for communications >>>> dlm: connecting to 2 >>>> >>> >>> >>> That message usually means that dlm_controld has failed to start. Try >>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>> switch >>> and read the output which might give some clues to why it's not working. >>> >>> Chrissie >>> >> >> >> Hi Chrissie, >> >> I thought of that but I see dlm started on both nodes. See right below. >> >>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>> root 5476 0.0 0.0 24736 760 ? Ss 15:34 0:00 >>>> /sbin/dlm_controld >>>> root 5502 0.0 0.0 0 0 ? S< 15:34 0:00 >>>> [dlm_astd] >>>> root 5503 0.0 0.0 0 0 ? S< 15:34 0:00 >>>> [dlm_scand] >>>> root 5504 0.0 0.0 0 0 ? S< 15:34 0:00 >>>> [dlm_recv] >>>> root 5505 0.0 0.0 0 0 ? S< 15:34 0:00 >>>> [dlm_send] >>>> root 5506 0.0 0.0 0 0 ? S< 15:34 0:00 >>>> [dlm_recoverd] >>>> root 5546 0.0 0.0 0 0 ? 
S< 15:35 0:00 >>>> [dlm_recoverd] >>>> >>>> [root at omadvnfs01a ~]# lsmod | grep dlm >>>> lock_dlm 52065 0 >>>> gfs2 529037 1 lock_dlm >>>> dlm 160065 17 lock_dlm >>>> configfs 62045 2 dlm >>>> >>>> >>>> centos server: >>>> [root at omadvnfs01a ~]# rpm -q cman rgmanager lvm2-cluster >>>> cman-2.0.115-68.el5 >>>> rgmanager-2.0.52-9.el5.centos >>>> lvm2-cluster-2.02.74-3.el5_6.1 >>>> >>>> [root at omadvnfs01a ~]# ls /dev/mapper/ | grep -v mpath >>>> control >>>> VolGroup00-LogVol00 >>>> VolGroup00-LogVol01 >>>> >>>> rhel server: >>>> [root at omadvnfs01b network-scripts]# rpm -q cman rgmanager lvm2-cluster >>>> cman-2.0.115-34.el5 >>>> rgmanager-2.0.52-6.el5 >>>> lvm2-cluster-2.02.56-7.el5_5.4 >>>> >>>> [root at omadvnfs01b network-scripts]# ls /dev/mapper/ | grep -v mpath >>>> control >>>> vg_data01a-lv_data01a >>>> vg_data01b-lv_data01b >>>> vg_data01c-lv_data01c >>>> vg_data01d-lv_data01d >>>> vg_data01e-lv_data01e >>>> vg_data01h-lv_data01h >>>> vg_data01i-lv_data01i >>>> VolGroup00-LogVol00 >>>> VolGroup00-LogVol01 >>>> VolGroup02-lv_data00 >>>> >>>> [root at omadvnfs01b network-scripts]# clustat >>>> Cluster Status for omadvnfs01 @ Sun Apr 17 15:44:52 2011 >>>> Member Status: Quorate >>>> >>>> Member Name ID >>>> Status >>>> ------ ---- ---- >>>> ------ >>>> omadvnfs01a.sec.jel.lc 1 >>>> Online, rgmanager >>>> omadvnfs01b.sec.jel.lc 2 >>>> Online, Local, rgmanager >>>> >>>> Service Name >>>> Owner (Last) State >>>> ------- ---- >>>> ----- ------ ----- >>>> service:omadvnfs01-nfs-a >>>> omadvnfs01b.sec.jel.lc >>>> started >>>> service:omadvnfs01-nfs-b >>>> omadvnfs01b.sec.jel.lc >>>> started >>>> service:omadvnfs01-nfs-c >>>> omadvnfs01b.sec.jel.lc >>>> started >>>> service:omadvnfs01-nfs-h >>>> omadvnfs01b.sec.jel.lc >>>> started >>>> service:omadvnfs01-nfs-i >>>> omadvnfs01b.sec.jel.lc >>>> started >>>> service:postgresql >>>> omadvnfs01b.sec.jel.lc >>>> started >>>> >>>> >>>> [root at omadvnfs01a ~]# cman_tool nodes >>>> Node Sts Inc Joined Name >>>> 1 M 1892 2011-04-17 15:34:24 omadvnfs01a.sec.jel.lc >>>> 2 M 1896 2011-04-17 15:34:24 omadvnfs01b.sec.jel.lc >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> > > > > Ok, started all the CMAN elements manually as you suggested. I > started them in order as in the init script. Here's the only error > that I see. I can post the other debug messages if you think they'd > be useful but this is the only one that stuck out to me. > > [root at omadvnfs01a ~]# /sbin/dlm_controld -D > 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 > 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 what does "lsmod | egrep -e 'configfs' -e 'dlm'" say? 
> 1303134840 set_ccs_options 480 > 1303134840 cman: node 2 added > 1303134840 set_configfs_node 2 10.198.1.111 local 0 > 1303134840 cman: node 3 added > 1303134840 set_configfs_node 3 10.198.1.110 local 1 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ccaulfie at redhat.com Mon Apr 18 14:26:34 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Mon, 18 Apr 2011 15:26:34 +0100 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> Message-ID: <4DAC4A1A.1090709@redhat.com> On 18/04/11 15:11, Terry wrote: > On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield > wrote: >> On 18/04/11 14:38, Terry wrote: >>> >>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>> wrote: >>>> >>>> On 17/04/11 21:52, Terry wrote: >>>>> >>>>> As a result of a strange situation where our licensing for storage >>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>> cluster. I got it joined to the cluster but I am having issues with >>>>> CLVMD. Any lvm operations on both boxes hang. For example, vgscan. >>>>> I have increased debugging and I don't see any logs. The VGs aren't >>>>> being populated in /dev/mapper. This WAS working right after I joined >>>>> it to the cluster and now it's not for some unknown reason. Not sure >>>>> where to take this at this point. I did find one weird startup log >>>>> that I am not sure what it means yet: >>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>> dlm: no local IP address has been set >>>>> dlm: cannot start dlm lowcomms -107 >>>>> dlm: Using TCP for communications >>>>> dlm: connecting to 2 >>>>> >>>> >>>> >>>> That message usually means that dlm_controld has failed to start. Try >>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>> switch >>>> and read the output which might give some clues to why it's not working. >>>> >>>> Chrissie >>>> >>> >>> >>> Hi Chrissie, >>> >>> I thought of that but I see dlm started on both nodes. See right below. >>> >>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>> root 5476 0.0 0.0 24736 760 ? Ss 15:34 0:00 >>>>> /sbin/dlm_controld >>>>> root 5502 0.0 0.0 0 0 ? S< 15:34 0:00 >> >> >> Well, that's encouraging in a way! But it's evidently not started fully or >> the DLM itself would be working. So I still recommend starting it with -D to >> see how far it gets. >> >> >> Chrissie >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > I think we had posts cross. Here's my latest: > > Ok, started all the CMAN elements manually as you suggested. I > started them in order as in the init script. Here's the only error > that I see. I can post the other debug messages if you think they'd > be useful but this is the only one that stuck out to me. > > [root at omadvnfs01a ~]# /sbin/dlm_controld -D > 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 > 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 > 1303134840 set_ccs_options 480 > 1303134840 cman: node 2 added > 1303134840 set_configfs_node 2 10.198.1.111 local 0 > 1303134840 cman: node 3 added > 1303134840 set_configfs_node 3 10.198.1.110 local 1 > Can I see the whole set please ? It looks like dlm_controld might be stalled registering with groupd. 
Chrissie From td3201 at gmail.com Mon Apr 18 14:34:59 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 09:34:59 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 9:13 AM, Kaloyan Kovachev wrote: > On Mon, 18 Apr 2011 08:57:34 -0500, Terry wrote: >> On Mon, Apr 18, 2011 at 8:38 AM, Terry wrote: >>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>> wrote: >>>> On 17/04/11 21:52, Terry wrote: >>>>> >>>>> As a result of a strange situation where our licensing for storage >>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>> cluster. ?I got it joined to the cluster but I am having issues with >>>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>>> being populated in /dev/mapper. ?This WAS working right after I > joined >>>>> it to the cluster and now it's not for some unknown reason. ?Not sure >>>>> where to take this at this point. ? I did find one weird startup log >>>>> that I am not sure what it means yet: >>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>> dlm: no local IP address has been set >>>>> dlm: cannot start dlm lowcomms -107 >>>>> dlm: Using TCP for communications >>>>> dlm: connecting to 2 >>>>> >>>> >>>> >>>> That message usually means that dlm_controld has failed to start. Try >>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>> switch >>>> and read the output which might give some clues to why it's not > working. >>>> >>>> Chrissie >>>> >>> >>> >>> Hi Chrissie, >>> >>> I thought of that but I see dlm started on both nodes. ?See right > below. >>> >>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>>> /sbin/dlm_controld >>>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>>>> [dlm_astd] >>>>> root ? ? ?5503 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>>>> [dlm_scand] >>>>> root ? ? ?5504 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>>>> [dlm_recv] >>>>> root ? ? ?5505 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>>>> [dlm_send] >>>>> root ? ? ?5506 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:34 ? 0:00 >>>>> [dlm_recoverd] >>>>> root ? ? ?5546 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ?15:35 ? 0:00 >>>>> [dlm_recoverd] >>>>> >>>>> [root at omadvnfs01a ~]# lsmod | grep dlm >>>>> lock_dlm ? ? ? ? ? ? ? 52065 ?0 >>>>> gfs2 ? ? ? ? ? ? ? ? ?529037 ?1 lock_dlm >>>>> dlm ? ? ? ? ? ? ? ? ? 160065 ?17 lock_dlm >>>>> configfs ? ? ? ? ? ? ? 
62045 ?2 dlm >>>>> >>>>> >>>>> centos server: >>>>> [root at omadvnfs01a ~]# rpm -q cman rgmanager lvm2-cluster >>>>> cman-2.0.115-68.el5 >>>>> rgmanager-2.0.52-9.el5.centos >>>>> lvm2-cluster-2.02.74-3.el5_6.1 >>>>> >>>>> [root at omadvnfs01a ~]# ls /dev/mapper/ | grep -v mpath >>>>> control >>>>> VolGroup00-LogVol00 >>>>> VolGroup00-LogVol01 >>>>> >>>>> rhel server: >>>>> [root at omadvnfs01b network-scripts]# rpm -q cman rgmanager > lvm2-cluster >>>>> cman-2.0.115-34.el5 >>>>> rgmanager-2.0.52-6.el5 >>>>> lvm2-cluster-2.02.56-7.el5_5.4 >>>>> >>>>> [root at omadvnfs01b network-scripts]# ls /dev/mapper/ | grep -v mpath >>>>> control >>>>> vg_data01a-lv_data01a >>>>> vg_data01b-lv_data01b >>>>> vg_data01c-lv_data01c >>>>> vg_data01d-lv_data01d >>>>> vg_data01e-lv_data01e >>>>> vg_data01h-lv_data01h >>>>> vg_data01i-lv_data01i >>>>> VolGroup00-LogVol00 >>>>> VolGroup00-LogVol01 >>>>> VolGroup02-lv_data00 >>>>> >>>>> [root at omadvnfs01b network-scripts]# clustat >>>>> Cluster Status for omadvnfs01 @ Sun Apr 17 15:44:52 2011 >>>>> Member Status: Quorate >>>>> >>>>> ?Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ID >>>>> Status >>>>> ?------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ---- >>>>> ------ >>>>> ?omadvnfs01a.sec.jel.lc > 1 >>>>> Online, rgmanager >>>>> ?omadvnfs01b.sec.jel.lc > 2 >>>>> Online, Local, rgmanager >>>>> >>>>> ?Service Name >>>>> Owner (Last) > State >>>>> ?------- ---- >>>>> ----- ------ > ----- >>>>> ?service:omadvnfs01-nfs-a >>>>> omadvnfs01b.sec.jel.lc >>>>> started >>>>> ?service:omadvnfs01-nfs-b >>>>> omadvnfs01b.sec.jel.lc >>>>> started >>>>> ?service:omadvnfs01-nfs-c >>>>> omadvnfs01b.sec.jel.lc >>>>> started >>>>> ?service:omadvnfs01-nfs-h >>>>> omadvnfs01b.sec.jel.lc >>>>> started >>>>> ?service:omadvnfs01-nfs-i >>>>> omadvnfs01b.sec.jel.lc >>>>> started >>>>> ?service:postgresql >>>>> omadvnfs01b.sec.jel.lc >>>>> started >>>>> >>>>> >>>>> [root at omadvnfs01a ~]# cman_tool nodes >>>>> Node ?Sts ? Inc ? Joined ? ? ? ? ? ? ? Name >>>>> ? ?1 ? M ? 1892 ? 2011-04-17 15:34:24 ?omadvnfs01a.sec.jel.lc >>>>> ? ?2 ? M ? 1896 ? 2011-04-17 15:34:24 ?omadvnfs01b.sec.jel.lc >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >> >> >> >> Ok, started all the CMAN elements manually as you suggested. ?I >> started them in order as in the init script. Here's the only error >> that I see. ?I can post the other debug messages if you think they'd >> be useful but this is the only one that stuck out to me. >> >> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 > > what does "lsmod | egrep -e 'configfs' -e 'dlm'" say? 
> >> 1303134840 set_ccs_options 480 >> 1303134840 cman: node 2 added >> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >> 1303134840 cman: node 3 added >> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >> >> -- [root at omadvnfs01a ~]# lsmod | egrep -e 'configfs' -e 'dlm' lock_dlm 52065 0 gfs2 529037 1 lock_dlm dlm 160065 5 lock_dlm configfs 62045 2 dlm [root at omadvnfs01b log]# lsmod | egrep -e 'configfs' -e 'dlm' lock_dlm 52065 0 gfs2 524204 1 lock_dlm dlm 160065 19 gfs,lock_dlm configfs 62045 2 dlm From td3201 at gmail.com Mon Apr 18 14:49:56 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 09:49:56 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: <4DAC4A1A.1090709@redhat.com> References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield wrote: > On 18/04/11 15:11, Terry wrote: >> >> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield >> ?wrote: >>> >>> On 18/04/11 14:38, Terry wrote: >>>> >>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>>> ? ?wrote: >>>>> >>>>> On 17/04/11 21:52, Terry wrote: >>>>>> >>>>>> As a result of a strange situation where our licensing for storage >>>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>>> cluster. ?I got it joined to the cluster but I am having issues with >>>>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>>>> being populated in /dev/mapper. ?This WAS working right after I joined >>>>>> it to the cluster and now it's not for some unknown reason. ?Not sure >>>>>> where to take this at this point. ? I did find one weird startup log >>>>>> that I am not sure what it means yet: >>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>>> dlm: no local IP address has been set >>>>>> dlm: cannot start dlm lowcomms -107 >>>>>> dlm: Using TCP for communications >>>>>> dlm: connecting to 2 >>>>>> >>>>> >>>>> >>>>> That message usually means that dlm_controld has failed to start. Try >>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>>> switch >>>>> and read the output which might give some clues to why it's not >>>>> working. >>>>> >>>>> Chrissie >>>>> >>>> >>>> >>>> Hi Chrissie, >>>> >>>> I thought of that but I see dlm started on both nodes. ?See right below. >>>> >>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>>>> /sbin/dlm_controld >>>>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ? ? ?15:34 ? 0:00 >>> >>> >>> Well, that's encouraging in a way! But it's evidently not started fully >>> or >>> the DLM itself would be working. So I still recommend starting it with -D >>> to >>> see how far it gets. >>> >>> >>> Chrissie >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> I think we had posts cross. ?Here's my latest: >> >> Ok, started all the CMAN elements manually as you suggested. ?I >> started them in order as in the init script. Here's the only error >> that I see. ?I can post the other debug messages if you think they'd >> be useful but this is the only one that stuck out to me. 
>> >> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >> 1303134840 set_ccs_options 480 >> 1303134840 cman: node 2 added >> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >> 1303134840 cman: node 3 added >> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >> > > Can I see the whole set please ? It looks like dlm_controld might be stalled > registering with groupd. > > Chrissie > > -- Here you go. Thank you very much for the help. Each daemon's output that I started is below. [root at omadvnfs01a log]# /sbin/ccsd -n Starting ccsd 2.0.115: Built: Mar 6 2011 00:47:03 Copyright (C) Red Hat, Inc. 2004 All rights reserved. No Daemon:: SET cluster.conf (cluster name = omadvnfs01, version = 71) found. Remote copy of cluster.conf is from quorate node. Local version # : 71 Remote version #: 71 Remote copy of cluster.conf is from quorate node. Local version # : 71 Remote version #: 71 Remote copy of cluster.conf is from quorate node. Local version # : 71 Remote version #: 71 Remote copy of cluster.conf is from quorate node. Local version # : 71 Remote version #: 71 Initial status:: Quorate [root at omadvnfs01a ~]# /sbin/fenced -D 1303134822 cman: node 2 added 1303134822 cman: node 3 added 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc 1303134822 listen 4 member 5 groupd 7 1303134861 client 3: join default 1303134861 delay post_join 3s post_fail 0s 1303134861 added 2 nodes from ccs 1303134861 setid default 65537 1303134861 start default 1 members 2 3 1303134861 do_recovery stop 0 start 1 finish 0 1303134861 finish default 1 [root at omadvnfs01a ~]# /sbin/dlm_controld -D 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 1303134840 set_ccs_options 480 1303134840 cman: node 2 added 1303134840 set_configfs_node 2 10.198.1.111 local 0 1303134840 cman: node 3 added 1303134840 set_configfs_node 3 10.198.1.110 local 1 [root at omadvnfs01a ~]# /sbin/groupd -D 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1 1303134809 setup_cpg groupd_handle 6b8b456700000000 1303134809 groupd confchg total 2 left 0 joined 1 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1 1303134822 client connection 3 1303134822 got client 3 setup 1303134822 setup fence 0 1303134840 client connection 4 1303134840 got client 4 setup 1303134840 setup dlm 1 1303134853 client connection 5 1303134853 got client 5 setup 1303134853 setup gfs 2 1303134861 got client 3 join 1303134861 0:default got join 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001 1303134861 0:default cpg_join ok 1303134861 0:default waiting for first cpg event 1303134861 client connection 7 1303134861 0:default waiting for first cpg event 1303134861 got client 7 get_group 1303134861 0:default waiting for first cpg event 1303134861 0:default waiting for first cpg event 1303134861 0:default confchg left 0 joined 1 total 2 1303134861 0:default process_node_join 3 1303134861 0:default cpg add node 2 total 1 1303134861 0:default cpg add node 3 total 2 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1 1303134861 0:default queue join event for nodeid 3 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN 1303134861 0:default app node init: add 3 total 1 1303134861 0:default app node init: add 2 total 2 1303134861 0:default waiting for 1 more stopped messages 
before JOIN_ALL_STOPPED 3 1303134861 0:default mark node 2 stopped 1303134861 0:default set global_id 10001 from 2 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED 1303134861 0:default action for app: setid default 65537 1303134861 0:default action for app: start default 1 2 2 2 3 1303134861 client connection 7 1303134861 got client 7 get_group 1303134861 0:default mark node 2 started 1303134861 client connection 7 1303134861 got client 7 get_group 1303134861 got client 3 start_done 1303134861 0:default send started 1303134861 0:default mark node 3 started 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED 1303134861 0:default action for app: finish default 1 1303134862 client connection 7 1303134862 got client 7 get_group [root at omadvnfs01a ~]# /sbin/gfs_controld -D 1303134853 config_no_withdraw 0 1303134853 config_no_plock 0 1303134853 config_plock_rate_limit 100 1303134853 config_plock_ownership 0 1303134853 config_drop_resources_time 10000 1303134853 config_drop_resources_count 10 1303134853 config_drop_resources_age 10000 1303134853 protocol 1.0.0 1303134853 listen 3 1303134853 cpg 6 1303134853 groupd 7 1303134853 uevent 8 1303134853 plocks 10 1303134853 plock need_fsid_translation 1 1303134853 plock cpg message size: 336 bytes 1303134853 setup done From td3201 at gmail.com Mon Apr 18 19:17:22 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 14:17:22 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 9:49 AM, Terry wrote: > On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield > wrote: >> On 18/04/11 15:11, Terry wrote: >>> >>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield >>> ?wrote: >>>> >>>> On 18/04/11 14:38, Terry wrote: >>>>> >>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>>>> ? ?wrote: >>>>>> >>>>>> On 17/04/11 21:52, Terry wrote: >>>>>>> >>>>>>> As a result of a strange situation where our licensing for storage >>>>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>>>> cluster. ?I got it joined to the cluster but I am having issues with >>>>>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>>>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>>>>> being populated in /dev/mapper. ?This WAS working right after I joined >>>>>>> it to the cluster and now it's not for some unknown reason. ?Not sure >>>>>>> where to take this at this point. ? I did find one weird startup log >>>>>>> that I am not sure what it means yet: >>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>>>> dlm: no local IP address has been set >>>>>>> dlm: cannot start dlm lowcomms -107 >>>>>>> dlm: Using TCP for communications >>>>>>> dlm: connecting to 2 >>>>>>> >>>>>> >>>>>> >>>>>> That message usually means that dlm_controld has failed to start. Try >>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>>>> switch >>>>>> and read the output which might give some clues to why it's not >>>>>> working. >>>>>> >>>>>> Chrissie >>>>>> >>>>> >>>>> >>>>> Hi Chrissie, >>>>> >>>>> I thought of that but I see dlm started on both nodes. ?See right below. >>>>> >>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>>>>> /sbin/dlm_controld >>>>>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ? ? ?15:34 ? 
0:00 >>>> >>>> >>>> Well, that's encouraging in a way! But it's evidently not started fully >>>> or >>>> the DLM itself would be working. So I still recommend starting it with -D >>>> to >>>> see how far it gets. >>>> >>>> >>>> Chrissie >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> I think we had posts cross. ?Here's my latest: >>> >>> Ok, started all the CMAN elements manually as you suggested. ?I >>> started them in order as in the init script. Here's the only error >>> that I see. ?I can post the other debug messages if you think they'd >>> be useful but this is the only one that stuck out to me. >>> >>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >>> 1303134840 set_ccs_options 480 >>> 1303134840 cman: node 2 added >>> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >>> 1303134840 cman: node 3 added >>> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >>> >> >> Can I see the whole set please ? It looks like dlm_controld might be stalled >> registering with groupd. >> >> Chrissie >> >> -- > > Here you go. ?Thank you very much for the help. ?Each daemon's output > that I started is below. > > [root at omadvnfs01a log]# /sbin/ccsd -n > Starting ccsd 2.0.115: > ?Built: Mar ?6 2011 00:47:03 > ?Copyright (C) Red Hat, Inc. ?2004 ?All rights reserved. > ?No Daemon:: SET > > cluster.conf (cluster name = omadvnfs01, version = 71) found. > Remote copy of cluster.conf is from quorate node. > ?Local version # : 71 > ?Remote version #: 71 > Remote copy of cluster.conf is from quorate node. > ?Local version # : 71 > ?Remote version #: 71 > Remote copy of cluster.conf is from quorate node. > ?Local version # : 71 > ?Remote version #: 71 > Remote copy of cluster.conf is from quorate node. 
> ?Local version # : 71 > ?Remote version #: 71 > Initial status:: Quorate > > [root at omadvnfs01a ~]# /sbin/fenced -D > 1303134822 cman: node 2 added > 1303134822 cman: node 3 added > 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc > 1303134822 listen 4 member 5 groupd 7 > 1303134861 client 3: join default > 1303134861 delay post_join 3s post_fail 0s > 1303134861 added 2 nodes from ccs > 1303134861 setid default 65537 > 1303134861 start default 1 members 2 3 > 1303134861 do_recovery stop 0 start 1 finish 0 > 1303134861 finish default 1 > > [root at omadvnfs01a ~]# /sbin/dlm_controld -D > 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 > 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 > 1303134840 set_ccs_options 480 > 1303134840 cman: node 2 added > 1303134840 set_configfs_node 2 10.198.1.111 local 0 > 1303134840 cman: node 3 added > 1303134840 set_configfs_node 3 10.198.1.110 local 1 > > > [root at omadvnfs01a ~]# /sbin/groupd -D > 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1 > 1303134809 setup_cpg groupd_handle 6b8b456700000000 > 1303134809 groupd confchg total 2 left 0 joined 1 > 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1 > 1303134822 client connection 3 > 1303134822 got client 3 setup > 1303134822 setup fence 0 > 1303134840 client connection 4 > 1303134840 got client 4 setup > 1303134840 setup dlm 1 > 1303134853 client connection 5 > 1303134853 got client 5 setup > 1303134853 setup gfs 2 > 1303134861 got client 3 join > 1303134861 0:default got join > 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001 > 1303134861 0:default cpg_join ok > 1303134861 0:default waiting for first cpg event > 1303134861 client connection 7 > 1303134861 0:default waiting for first cpg event > 1303134861 got client 7 get_group > 1303134861 0:default waiting for first cpg event > 1303134861 0:default waiting for first cpg event > 1303134861 0:default confchg left 0 joined 1 total 2 > 1303134861 0:default process_node_join 3 > 1303134861 0:default cpg add node 2 total 1 > 1303134861 0:default cpg add node 3 total 2 > 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1 > 1303134861 0:default queue join event for nodeid 3 > 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN > 1303134861 0:default app node init: add 3 total 1 > 1303134861 0:default app node init: add 2 total 2 > 1303134861 0:default waiting for 1 more stopped messages before > JOIN_ALL_STOPPED > > ?3 > 1303134861 0:default mark node 2 stopped > 1303134861 0:default set global_id 10001 from 2 > 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED > 1303134861 0:default action for app: setid default 65537 > 1303134861 0:default action for app: start default 1 2 2 2 3 > 1303134861 client connection 7 > 1303134861 got client 7 get_group > 1303134861 0:default mark node 2 started > 1303134861 client connection 7 > 1303134861 got client 7 get_group > 1303134861 got client 3 start_done > 1303134861 0:default send started > 1303134861 0:default mark node 3 started > 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED > 1303134861 0:default action for app: finish default 1 > 1303134862 client connection 7 > 1303134862 got client 7 get_group > > > [root at omadvnfs01a ~]# /sbin/gfs_controld -D > 1303134853 config_no_withdraw 0 > 1303134853 config_no_plock 0 > 1303134853 config_plock_rate_limit 100 > 1303134853 config_plock_ownership 0 > 1303134853 
config_drop_resources_time 10000 > 1303134853 config_drop_resources_count 10 > 1303134853 config_drop_resources_age 10000 > 1303134853 protocol 1.0.0 > 1303134853 listen 3 > 1303134853 cpg 6 > 1303134853 groupd 7 > 1303134853 uevent 8 > 1303134853 plocks 10 > 1303134853 plock need_fsid_translation 1 > 1303134853 plock cpg message size: 336 bytes > 1303134853 setup done > Another gap that I just found is I forgot to specify a fencing method for the new centos node. I put that in and now the rhel node wants to fence it so I am letting it do that then i'll see where i end up. From td3201 at gmail.com Mon Apr 18 19:46:15 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 14:46:15 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 2:17 PM, Terry wrote: > On Mon, Apr 18, 2011 at 9:49 AM, Terry wrote: >> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield >> wrote: >>> On 18/04/11 15:11, Terry wrote: >>>> >>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield >>>> ?wrote: >>>>> >>>>> On 18/04/11 14:38, Terry wrote: >>>>>> >>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>>>>> ? ?wrote: >>>>>>> >>>>>>> On 17/04/11 21:52, Terry wrote: >>>>>>>> >>>>>>>> As a result of a strange situation where our licensing for storage >>>>>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>>>>> cluster. ?I got it joined to the cluster but I am having issues with >>>>>>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>>>>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>>>>>> being populated in /dev/mapper. ?This WAS working right after I joined >>>>>>>> it to the cluster and now it's not for some unknown reason. ?Not sure >>>>>>>> where to take this at this point. ? I did find one weird startup log >>>>>>>> that I am not sure what it means yet: >>>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>>>>> dlm: no local IP address has been set >>>>>>>> dlm: cannot start dlm lowcomms -107 >>>>>>>> dlm: Using TCP for communications >>>>>>>> dlm: connecting to 2 >>>>>>>> >>>>>>> >>>>>>> >>>>>>> That message usually means that dlm_controld has failed to start. Try >>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>>>>> switch >>>>>>> and read the output which might give some clues to why it's not >>>>>>> working. >>>>>>> >>>>>>> Chrissie >>>>>>> >>>>>> >>>>>> >>>>>> Hi Chrissie, >>>>>> >>>>>> I thought of that but I see dlm started on both nodes. ?See right below. >>>>>> >>>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>>>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>>>>>> /sbin/dlm_controld >>>>>>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ? ? ?15:34 ? 0:00 >>>>> >>>>> >>>>> Well, that's encouraging in a way! But it's evidently not started fully >>>>> or >>>>> the DLM itself would be working. So I still recommend starting it with -D >>>>> to >>>>> see how far it gets. >>>>> >>>>> >>>>> Chrissie >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>>> I think we had posts cross. ?Here's my latest: >>>> >>>> Ok, started all the CMAN elements manually as you suggested. ?I >>>> started them in order as in the init script. Here's the only error >>>> that I see. 
?I can post the other debug messages if you think they'd >>>> be useful but this is the only one that stuck out to me. >>>> >>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >>>> 1303134840 set_ccs_options 480 >>>> 1303134840 cman: node 2 added >>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >>>> 1303134840 cman: node 3 added >>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >>>> >>> >>> Can I see the whole set please ? It looks like dlm_controld might be stalled >>> registering with groupd. >>> >>> Chrissie >>> >>> -- >> >> Here you go. ?Thank you very much for the help. ?Each daemon's output >> that I started is below. >> >> [root at omadvnfs01a log]# /sbin/ccsd -n >> Starting ccsd 2.0.115: >> ?Built: Mar ?6 2011 00:47:03 >> ?Copyright (C) Red Hat, Inc. ?2004 ?All rights reserved. >> ?No Daemon:: SET >> >> cluster.conf (cluster name = omadvnfs01, version = 71) found. >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Initial status:: Quorate >> >> [root at omadvnfs01a ~]# /sbin/fenced -D >> 1303134822 cman: node 2 added >> 1303134822 cman: node 3 added >> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc >> 1303134822 listen 4 member 5 groupd 7 >> 1303134861 client 3: join default >> 1303134861 delay post_join 3s post_fail 0s >> 1303134861 added 2 nodes from ccs >> 1303134861 setid default 65537 >> 1303134861 start default 1 members 2 3 >> 1303134861 do_recovery stop 0 start 1 finish 0 >> 1303134861 finish default 1 >> >> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >> 1303134840 set_ccs_options 480 >> 1303134840 cman: node 2 added >> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >> 1303134840 cman: node 3 added >> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >> >> >> [root at omadvnfs01a ~]# /sbin/groupd -D >> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1 >> 1303134809 setup_cpg groupd_handle 6b8b456700000000 >> 1303134809 groupd confchg total 2 left 0 joined 1 >> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1 >> 1303134822 client connection 3 >> 1303134822 got client 3 setup >> 1303134822 setup fence 0 >> 1303134840 client connection 4 >> 1303134840 got client 4 setup >> 1303134840 setup dlm 1 >> 1303134853 client connection 5 >> 1303134853 got client 5 setup >> 1303134853 setup gfs 2 >> 1303134861 got client 3 join >> 1303134861 0:default got join >> 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001 >> 1303134861 0:default cpg_join ok >> 1303134861 0:default waiting for first cpg event >> 1303134861 client connection 7 >> 1303134861 0:default waiting for first cpg event >> 1303134861 got client 7 get_group >> 1303134861 0:default waiting for first cpg event >> 1303134861 0:default waiting for first cpg event >> 1303134861 0:default confchg left 0 joined 1 total 2 >> 1303134861 0:default process_node_join 3 >> 1303134861 0:default cpg add node 2 total 
1 >> 1303134861 0:default cpg add node 3 total 2 >> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1 >> 1303134861 0:default queue join event for nodeid 3 >> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN >> 1303134861 0:default app node init: add 3 total 1 >> 1303134861 0:default app node init: add 2 total 2 >> 1303134861 0:default waiting for 1 more stopped messages before >> JOIN_ALL_STOPPED >> >> ?3 >> 1303134861 0:default mark node 2 stopped >> 1303134861 0:default set global_id 10001 from 2 >> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED >> 1303134861 0:default action for app: setid default 65537 >> 1303134861 0:default action for app: start default 1 2 2 2 3 >> 1303134861 client connection 7 >> 1303134861 got client 7 get_group >> 1303134861 0:default mark node 2 started >> 1303134861 client connection 7 >> 1303134861 got client 7 get_group >> 1303134861 got client 3 start_done >> 1303134861 0:default send started >> 1303134861 0:default mark node 3 started >> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED >> 1303134861 0:default action for app: finish default 1 >> 1303134862 client connection 7 >> 1303134862 got client 7 get_group >> >> >> [root at omadvnfs01a ~]# /sbin/gfs_controld -D >> 1303134853 config_no_withdraw 0 >> 1303134853 config_no_plock 0 >> 1303134853 config_plock_rate_limit 100 >> 1303134853 config_plock_ownership 0 >> 1303134853 config_drop_resources_time 10000 >> 1303134853 config_drop_resources_count 10 >> 1303134853 config_drop_resources_age 10000 >> 1303134853 protocol 1.0.0 >> 1303134853 listen 3 >> 1303134853 cpg 6 >> 1303134853 groupd 7 >> 1303134853 uevent 8 >> 1303134853 plocks 10 >> 1303134853 plock need_fsid_translation 1 >> 1303134853 plock cpg message size: 336 bytes >> 1303134853 setup done >> > > Another gap that I just found is I forgot to specify a fencing method > for the new centos node. ?I put that in and now the rhel node wants to > fence it so I am letting it do that then i'll see where i end up. > Node came up with no problems then started services manually: service cman start service clvmd start (keep in mind that I commented out the vgscan in that script, otherwise it times out) service rgmanager start The node enters the cluster and everything looks fine but no cluster LVM devices. The other node does see dlm start on the centos node: Apr 18 14:37:06 omadvnfs01b kernel: dlm: got connection from 3 On a hunch, tried this on the RHEL node: [root at omadvnfs01b ~]# clvmd -R Error resetting node omadvnfs01b.sec.jel.lc: Command timed out I think the RHEL node is broke but it has working services on it. I am OK with stopping all services but not sure how to get the cluster devices working on the new centos node. I have every intention of formatting the RHEL node but need to understand what I am getting into before I start shutting things down on that node. How can I forcefully make the centos node aware of the existing LVM configuration? Thanks! 
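(A minimal check sequence for this situation, offered only as a sketch: the VG name vg_data01a is taken from the /dev/mapper listing earlier in the thread, and exact output differs between cman/lvm2 releases.

  cman_tool status                     # confirm the node is a quorate cluster member
  cman_tool services                   # fence/dlm/gfs groups should not be stuck joining
  grep locking_type /etc/lvm/lvm.conf  # should be 3 (clustered locking) on every node
  clvmd -d                             # run clvmd in the foreground with debug output
  vgs                                  # clustered VGs carry the "c" attribute flag
  vgchange -ay vg_data01a              # try activating one VG once clvmd responds

Activation does not rewrite the VG metadata, so if clvmd answers, the clustered logical volumes should simply reappear under /dev/mapper.)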
From td3201 at gmail.com Mon Apr 18 23:33:51 2011 From: td3201 at gmail.com (Terry) Date: Mon, 18 Apr 2011 18:33:51 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> Message-ID: On Mon, Apr 18, 2011 at 2:46 PM, Terry wrote: > On Mon, Apr 18, 2011 at 2:17 PM, Terry wrote: >> On Mon, Apr 18, 2011 at 9:49 AM, Terry wrote: >>> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield >>> wrote: >>>> On 18/04/11 15:11, Terry wrote: >>>>> >>>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield >>>>> ?wrote: >>>>>> >>>>>> On 18/04/11 14:38, Terry wrote: >>>>>>> >>>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>>>>>> ? ?wrote: >>>>>>>> >>>>>>>> On 17/04/11 21:52, Terry wrote: >>>>>>>>> >>>>>>>>> As a result of a strange situation where our licensing for storage >>>>>>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>>>>>> cluster. ?I got it joined to the cluster but I am having issues with >>>>>>>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>>>>>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>>>>>>> being populated in /dev/mapper. ?This WAS working right after I joined >>>>>>>>> it to the cluster and now it's not for some unknown reason. ?Not sure >>>>>>>>> where to take this at this point. ? I did find one weird startup log >>>>>>>>> that I am not sure what it means yet: >>>>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>>>>>> dlm: no local IP address has been set >>>>>>>>> dlm: cannot start dlm lowcomms -107 >>>>>>>>> dlm: Using TCP for communications >>>>>>>>> dlm: connecting to 2 >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> That message usually means that dlm_controld has failed to start. Try >>>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>>>>>> switch >>>>>>>> and read the output which might give some clues to why it's not >>>>>>>> working. >>>>>>>> >>>>>>>> Chrissie >>>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi Chrissie, >>>>>>> >>>>>>> I thought of that but I see dlm started on both nodes. ?See right below. >>>>>>> >>>>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>>>>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>>>>>>> /sbin/dlm_controld >>>>>>>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ? ? ?15:34 ? 0:00 >>>>>> >>>>>> >>>>>> Well, that's encouraging in a way! But it's evidently not started fully >>>>>> or >>>>>> the DLM itself would be working. So I still recommend starting it with -D >>>>>> to >>>>>> see how far it gets. >>>>>> >>>>>> >>>>>> Chrissie >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> >>>>> I think we had posts cross. ?Here's my latest: >>>>> >>>>> Ok, started all the CMAN elements manually as you suggested. ?I >>>>> started them in order as in the init script. Here's the only error >>>>> that I see. ?I can post the other debug messages if you think they'd >>>>> be useful but this is the only one that stuck out to me. 
>>>>> >>>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >>>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >>>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >>>>> 1303134840 set_ccs_options 480 >>>>> 1303134840 cman: node 2 added >>>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >>>>> 1303134840 cman: node 3 added >>>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >>>>> >>>> >>>> Can I see the whole set please ? It looks like dlm_controld might be stalled >>>> registering with groupd. >>>> >>>> Chrissie >>>> >>>> -- >>> >>> Here you go. ?Thank you very much for the help. ?Each daemon's output >>> that I started is below. >>> >>> [root at omadvnfs01a log]# /sbin/ccsd -n >>> Starting ccsd 2.0.115: >>> ?Built: Mar ?6 2011 00:47:03 >>> ?Copyright (C) Red Hat, Inc. ?2004 ?All rights reserved. >>> ?No Daemon:: SET >>> >>> cluster.conf (cluster name = omadvnfs01, version = 71) found. >>> Remote copy of cluster.conf is from quorate node. >>> ?Local version # : 71 >>> ?Remote version #: 71 >>> Remote copy of cluster.conf is from quorate node. >>> ?Local version # : 71 >>> ?Remote version #: 71 >>> Remote copy of cluster.conf is from quorate node. >>> ?Local version # : 71 >>> ?Remote version #: 71 >>> Remote copy of cluster.conf is from quorate node. >>> ?Local version # : 71 >>> ?Remote version #: 71 >>> Initial status:: Quorate >>> >>> [root at omadvnfs01a ~]# /sbin/fenced -D >>> 1303134822 cman: node 2 added >>> 1303134822 cman: node 3 added >>> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc >>> 1303134822 listen 4 member 5 groupd 7 >>> 1303134861 client 3: join default >>> 1303134861 delay post_join 3s post_fail 0s >>> 1303134861 added 2 nodes from ccs >>> 1303134861 setid default 65537 >>> 1303134861 start default 1 members 2 3 >>> 1303134861 do_recovery stop 0 start 1 finish 0 >>> 1303134861 finish default 1 >>> >>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >>> 1303134840 set_ccs_options 480 >>> 1303134840 cman: node 2 added >>> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >>> 1303134840 cman: node 3 added >>> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >>> >>> >>> [root at omadvnfs01a ~]# /sbin/groupd -D >>> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1 >>> 1303134809 setup_cpg groupd_handle 6b8b456700000000 >>> 1303134809 groupd confchg total 2 left 0 joined 1 >>> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1 >>> 1303134822 client connection 3 >>> 1303134822 got client 3 setup >>> 1303134822 setup fence 0 >>> 1303134840 client connection 4 >>> 1303134840 got client 4 setup >>> 1303134840 setup dlm 1 >>> 1303134853 client connection 5 >>> 1303134853 got client 5 setup >>> 1303134853 setup gfs 2 >>> 1303134861 got client 3 join >>> 1303134861 0:default got join >>> 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001 >>> 1303134861 0:default cpg_join ok >>> 1303134861 0:default waiting for first cpg event >>> 1303134861 client connection 7 >>> 1303134861 0:default waiting for first cpg event >>> 1303134861 got client 7 get_group >>> 1303134861 0:default waiting for first cpg event >>> 1303134861 0:default waiting for first cpg event >>> 1303134861 0:default confchg left 0 joined 1 total 2 >>> 1303134861 0:default process_node_join 3 >>> 1303134861 0:default cpg add node 2 total 1 >>> 1303134861 0:default 
cpg add node 3 total 2 >>> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1 >>> 1303134861 0:default queue join event for nodeid 3 >>> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN >>> 1303134861 0:default app node init: add 3 total 1 >>> 1303134861 0:default app node init: add 2 total 2 >>> 1303134861 0:default waiting for 1 more stopped messages before >>> JOIN_ALL_STOPPED >>> >>> ?3 >>> 1303134861 0:default mark node 2 stopped >>> 1303134861 0:default set global_id 10001 from 2 >>> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED >>> 1303134861 0:default action for app: setid default 65537 >>> 1303134861 0:default action for app: start default 1 2 2 2 3 >>> 1303134861 client connection 7 >>> 1303134861 got client 7 get_group >>> 1303134861 0:default mark node 2 started >>> 1303134861 client connection 7 >>> 1303134861 got client 7 get_group >>> 1303134861 got client 3 start_done >>> 1303134861 0:default send started >>> 1303134861 0:default mark node 3 started >>> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED >>> 1303134861 0:default action for app: finish default 1 >>> 1303134862 client connection 7 >>> 1303134862 got client 7 get_group >>> >>> >>> [root at omadvnfs01a ~]# /sbin/gfs_controld -D >>> 1303134853 config_no_withdraw 0 >>> 1303134853 config_no_plock 0 >>> 1303134853 config_plock_rate_limit 100 >>> 1303134853 config_plock_ownership 0 >>> 1303134853 config_drop_resources_time 10000 >>> 1303134853 config_drop_resources_count 10 >>> 1303134853 config_drop_resources_age 10000 >>> 1303134853 protocol 1.0.0 >>> 1303134853 listen 3 >>> 1303134853 cpg 6 >>> 1303134853 groupd 7 >>> 1303134853 uevent 8 >>> 1303134853 plocks 10 >>> 1303134853 plock need_fsid_translation 1 >>> 1303134853 plock cpg message size: 336 bytes >>> 1303134853 setup done >>> >> >> Another gap that I just found is I forgot to specify a fencing method >> for the new centos node. ?I put that in and now the rhel node wants to >> fence it so I am letting it do that then i'll see where i end up. >> > > Node came up with no problems then started services manually: > service cman start > service clvmd start ?(keep in mind that I commented out the vgscan in > that script, otherwise it times out) > service rgmanager start > > The node enters the cluster and everything looks fine but no cluster > LVM devices. ?The other node does see dlm start on the centos node: > Apr 18 14:37:06 omadvnfs01b kernel: dlm: got connection from 3 > > On a hunch, tried this on the RHEL node: > [root at omadvnfs01b ~]# clvmd -R > Error resetting node omadvnfs01b.sec.jel.lc: Command timed out > > I think the RHEL node is broke but it has working services on it. ?I > am OK with stopping all services but not sure how to get the cluster > devices working on the new centos node. ? I have every intention of > formatting the RHEL node but need to understand what I am getting into > before I start shutting things down on that node. ?How can I > forcefully make the centos node aware of the existing LVM > configuration? > > Thanks! > Since this seems to be an unusual problem, let me ask some different questions: What happens if I shutdown all cluster services on the RHEL node, remove it from the cluster and then simply create a new LVM configuration (new pvs, vgs, lvs, etc) and cluster configuration on the centos box and start over? Will the data be intact? FYI, the data itself is a postgresql database and some NFS volumes on ext3 and gfs. 
Sorry for dominating the list today, just need some ideas on possible directions. From list at fajar.net Tue Apr 19 04:58:29 2011 From: list at fajar.net (Fajar A. Nugraha) Date: Tue, 19 Apr 2011 11:58:29 +0700 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> Message-ID: On Tue, Apr 19, 2011 at 6:33 AM, Terry wrote: > What happens if I shutdown all cluster services on the RHEL node, > remove it from the cluster and then simply create a new LVM > configuration (new pvs, vgs, lvs, etc) and cluster configuration on > the centos box and start over? ? Will the data be intact? Creating a new VG on top of existing PVs means wiping all data. > ? FYI, the > data itself is a postgresql database and some NFS volumes on ext3 and > gfs. > > Sorry for dominating the list today, just need some ideas on possible > directions. If you just want to access the data, IIRC you should be able to simply do vgchange -cn VG_name (turn off cluster locking for that VG. See "man vgchange"). Then (possibly) "pvscan;vgchange -ay" If you also have some gfs/gfs2 filesystem, you should be able to mount them with "-o lock_nolock" -- Fajar From jmd_singhsaini at yahoo.com Tue Apr 19 06:34:42 2011 From: jmd_singhsaini at yahoo.com (Harvinder Singh Binder) Date: Tue, 19 Apr 2011 12:04:42 +0530 (IST) Subject: [Linux-cluster] about cluster Message-ID: <545603.34798.qm@web94807.mail.in2.yahoo.com> Good afternoon sir, Please send me whole detail about cluster such as what is cluster, requirement,configration etc Harvinder Singh S/O Baldev Raj, VPO Barwa Teh. Anandpur Sahib, Dist. Ropar, PunjabE-Mail ID:- ? ? jmd_singhsaini at yahoo.com From andrew at beekhof.net Tue Apr 19 07:11:21 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 19 Apr 2011 09:11:21 +0200 Subject: [Linux-cluster] about cluster In-Reply-To: <545603.34798.qm@web94807.mail.in2.yahoo.com> References: <545603.34798.qm@web94807.mail.in2.yahoo.com> Message-ID: Only if you do my homework assignment first. On Tue, Apr 19, 2011 at 8:34 AM, Harvinder Singh Binder wrote: > Good afternoon sir, > ? ? ? ? ? ? ? ? ? ? ? ?Please send me whole detail about cluster such as what is cluster, requirement,configration etc > Harvinder Singh S/O Baldev Raj, VPO Barwa Teh. Anandpur Sahib, Dist. Ropar, PunjabE-Mail ID:- ? ? jmd_singhsaini at yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From list at fajar.net Tue Apr 19 07:53:30 2011 From: list at fajar.net (Fajar A. Nugraha) Date: Tue, 19 Apr 2011 14:53:30 +0700 Subject: [Linux-cluster] about cluster In-Reply-To: <545603.34798.qm@web94807.mail.in2.yahoo.com> References: <545603.34798.qm@web94807.mail.in2.yahoo.com> Message-ID: On Tue, Apr 19, 2011 at 1:34 PM, Harvinder Singh Binder wrote: > Good afternoon sir, > ? ? ? ? ? ? ? ? ? ? ? ?Please send me whole detail about cluster such as what is cluster, requirement,configration etc http://lmgtfy.com/?q=redhat+linux+cluster -- Fajar From wahyu at vivastor.co.id Tue Apr 19 09:05:04 2011 From: wahyu at vivastor.co.id (Wahyu Darmawan) Date: Tue, 19 Apr 2011 16:05:04 +0700 Subject: [Linux-cluster] about cluster In-Reply-To: References: <545603.34798.qm@web94807.mail.in2.yahoo.com> Message-ID: On Tue, Apr 19, 2011 at 2:53 PM, Fajar A. 
Nugraha wrote: > On Tue, Apr 19, 2011 at 1:34 PM, Harvinder Singh Binder > wrote: > > Good afternoon sir, > > Please send me whole detail about cluster such as > what is cluster, requirement,configration etc > > http://lmgtfy.com/?q=redhat+linux+cluster > > -- > Fajar > > Nice share Om Fajar.. ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Tue Apr 19 09:59:34 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 19 Apr 2011 10:59:34 +0100 Subject: [Linux-cluster] problems with clvmd In-Reply-To: References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> Message-ID: <4DAD5D06.9030106@redhat.com> On 18/04/11 15:49, Terry wrote: > On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield > wrote: >> On 18/04/11 15:11, Terry wrote: >>> >>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield >>> wrote: >>>> >>>> On 18/04/11 14:38, Terry wrote: >>>>> >>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>>>> wrote: >>>>>> >>>>>> On 17/04/11 21:52, Terry wrote: >>>>>>> >>>>>>> As a result of a strange situation where our licensing for storage >>>>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>>>> cluster. I got it joined to the cluster but I am having issues with >>>>>>> CLVMD. Any lvm operations on both boxes hang. For example, vgscan. >>>>>>> I have increased debugging and I don't see any logs. The VGs aren't >>>>>>> being populated in /dev/mapper. This WAS working right after I joined >>>>>>> it to the cluster and now it's not for some unknown reason. Not sure >>>>>>> where to take this at this point. I did find one weird startup log >>>>>>> that I am not sure what it means yet: >>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>>>> dlm: no local IP address has been set >>>>>>> dlm: cannot start dlm lowcomms -107 >>>>>>> dlm: Using TCP for communications >>>>>>> dlm: connecting to 2 >>>>>>> >>>>>> >>>>>> >>>>>> That message usually means that dlm_controld has failed to start. Try >>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>>>> switch >>>>>> and read the output which might give some clues to why it's not >>>>>> working. >>>>>> >>>>>> Chrissie >>>>>> >>>>> >>>>> >>>>> Hi Chrissie, >>>>> >>>>> I thought of that but I see dlm started on both nodes. See right below. >>>>> >>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>>>> root 5476 0.0 0.0 24736 760 ? Ss 15:34 0:00 >>>>>>> /sbin/dlm_controld >>>>>>> root 5502 0.0 0.0 0 0 ? S< 15:34 0:00 >>>> >>>> >>>> Well, that's encouraging in a way! But it's evidently not started fully >>>> or >>>> the DLM itself would be working. So I still recommend starting it with -D >>>> to >>>> see how far it gets. >>>> >>>> >>>> Chrissie >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> I think we had posts cross. Here's my latest: >>> >>> Ok, started all the CMAN elements manually as you suggested. I >>> started them in order as in the init script. Here's the only error >>> that I see. I can post the other debug messages if you think they'd >>> be useful but this is the only one that stuck out to me. 
>>> >>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >>> 1303134840 set_ccs_options 480 >>> 1303134840 cman: node 2 added >>> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >>> 1303134840 cman: node 3 added >>> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >>> >> >> Can I see the whole set please ? It looks like dlm_controld might be stalled >> registering with groupd. >> >> Chrissie >> >> -- > > Here you go. Thank you very much for the help. Each daemon's output > that I started is below. > > [root at omadvnfs01a log]# /sbin/ccsd -n > Starting ccsd 2.0.115: > Built: Mar 6 2011 00:47:03 > Copyright (C) Red Hat, Inc. 2004 All rights reserved. > No Daemon:: SET > > cluster.conf (cluster name = omadvnfs01, version = 71) found. > Remote copy of cluster.conf is from quorate node. > Local version # : 71 > Remote version #: 71 > Remote copy of cluster.conf is from quorate node. > Local version # : 71 > Remote version #: 71 > Remote copy of cluster.conf is from quorate node. > Local version # : 71 > Remote version #: 71 > Remote copy of cluster.conf is from quorate node. > Local version # : 71 > Remote version #: 71 > Initial status:: Quorate > > [root at omadvnfs01a ~]# /sbin/fenced -D > 1303134822 cman: node 2 added > 1303134822 cman: node 3 added > 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc > 1303134822 listen 4 member 5 groupd 7 > 1303134861 client 3: join default > 1303134861 delay post_join 3s post_fail 0s > 1303134861 added 2 nodes from ccs > 1303134861 setid default 65537 > 1303134861 start default 1 members 2 3 > 1303134861 do_recovery stop 0 start 1 finish 0 > 1303134861 finish default 1 > > [root at omadvnfs01a ~]# /sbin/dlm_controld -D > 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 > 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 > 1303134840 set_ccs_options 480 > 1303134840 cman: node 2 added > 1303134840 set_configfs_node 2 10.198.1.111 local 0 > 1303134840 cman: node 3 added > 1303134840 set_configfs_node 3 10.198.1.110 local 1 > > > [root at omadvnfs01a ~]# /sbin/groupd -D > 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1 > 1303134809 setup_cpg groupd_handle 6b8b456700000000 > 1303134809 groupd confchg total 2 left 0 joined 1 > 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1 > 1303134822 client connection 3 > 1303134822 got client 3 setup > 1303134822 setup fence 0 > 1303134840 client connection 4 > 1303134840 got client 4 setup > 1303134840 setup dlm 1 > 1303134853 client connection 5 > 1303134853 got client 5 setup > 1303134853 setup gfs 2 > 1303134861 got client 3 join > 1303134861 0:default got join > 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001 > 1303134861 0:default cpg_join ok > 1303134861 0:default waiting for first cpg event > 1303134861 client connection 7 > 1303134861 0:default waiting for first cpg event > 1303134861 got client 7 get_group > 1303134861 0:default waiting for first cpg event > 1303134861 0:default waiting for first cpg event > 1303134861 0:default confchg left 0 joined 1 total 2 > 1303134861 0:default process_node_join 3 > 1303134861 0:default cpg add node 2 total 1 > 1303134861 0:default cpg add node 3 total 2 > 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1 > 1303134861 0:default queue join event for nodeid 3 > 1303134861 0:default process_current_event 
300020001 3 JOIN_BEGIN > 1303134861 0:default app node init: add 3 total 1 > 1303134861 0:default app node init: add 2 total 2 > 1303134861 0:default waiting for 1 more stopped messages before > JOIN_ALL_STOPPED > That looks like a service error. Is fencing started and working? Check the output of cman_tool services or group_tool Chrissie From td3201 at gmail.com Tue Apr 19 13:32:05 2011 From: td3201 at gmail.com (Terry) Date: Tue, 19 Apr 2011 08:32:05 -0500 Subject: [Linux-cluster] problems with clvmd In-Reply-To: <4DAD5D06.9030106@redhat.com> References: <4DABFAE2.1090502@redhat.com> <4DAC4351.9090503@redhat.com> <4DAC4A1A.1090709@redhat.com> <4DAD5D06.9030106@redhat.com> Message-ID: On Tue, Apr 19, 2011 at 4:59 AM, Christine Caulfield wrote: > On 18/04/11 15:49, Terry wrote: >> >> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield >> ?wrote: >>> >>> On 18/04/11 15:11, Terry wrote: >>>> >>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield >>>> ? ?wrote: >>>>> >>>>> On 18/04/11 14:38, Terry wrote: >>>>>> >>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield >>>>>> ? ? ?wrote: >>>>>>> >>>>>>> On 17/04/11 21:52, Terry wrote: >>>>>>>> >>>>>>>> As a result of a strange situation where our licensing for storage >>>>>>>> dropped off, I need to join a centos 5.6 node to a now single node >>>>>>>> cluster. ?I got it joined to the cluster but I am having issues with >>>>>>>> CLVMD. ?Any lvm operations on both boxes hang. ?For example, vgscan. >>>>>>>> I have increased debugging and I don't see any logs. ?The VGs aren't >>>>>>>> being populated in /dev/mapper. ?This WAS working right after I >>>>>>>> joined >>>>>>>> it to the cluster and now it's not for some unknown reason. ?Not >>>>>>>> sure >>>>>>>> where to take this at this point. ? I did find one weird startup log >>>>>>>> that I am not sure what it means yet: >>>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm >>>>>>>> dlm: no local IP address has been set >>>>>>>> dlm: cannot start dlm lowcomms -107 >>>>>>>> dlm: Using TCP for communications >>>>>>>> dlm: connecting to 2 >>>>>>>> >>>>>>> >>>>>>> >>>>>>> That message usually means that dlm_controld has failed to start. Try >>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D >>>>>>> switch >>>>>>> and read the output which might give some clues to why it's not >>>>>>> working. >>>>>>> >>>>>>> Chrissie >>>>>>> >>>>>> >>>>>> >>>>>> Hi Chrissie, >>>>>> >>>>>> I thought of that but I see dlm started on both nodes. ?See right >>>>>> below. >>>>>> >>>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm >>>>>>>> root ? ? ?5476 ?0.0 ?0.0 ?24736 ? 760 ? ? ? ? ?Ss ? 15:34 ? 0:00 >>>>>>>> /sbin/dlm_controld >>>>>>>> root ? ? ?5502 ?0.0 ?0.0 ? ? ?0 ? ? 0 ? ? ? ? ?S< ? ? ? ? ?15:34 >>>>>>>> 0:00 >>>>> >>>>> >>>>> Well, that's encouraging in a way! But it's evidently not started fully >>>>> or >>>>> the DLM itself would be working. So I still recommend starting it with >>>>> -D >>>>> to >>>>> see how far it gets. >>>>> >>>>> >>>>> Chrissie >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>>> I think we had posts cross. ?Here's my latest: >>>> >>>> Ok, started all the CMAN elements manually as you suggested. ?I >>>> started them in order as in the init script. Here's the only error >>>> that I see. ?I can post the other debug messages if you think they'd >>>> be useful but this is the only one that stuck out to me. 
>>>> >>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >>>> 1303134840 set_ccs_options 480 >>>> 1303134840 cman: node 2 added >>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >>>> 1303134840 cman: node 3 added >>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >>>> >>> >>> Can I see the whole set please ? It looks like dlm_controld might be >>> stalled >>> registering with groupd. >>> >>> Chrissie >>> >>> -- >> >> Here you go. ?Thank you very much for the help. ?Each daemon's output >> that I started is below. >> >> [root at omadvnfs01a log]# /sbin/ccsd -n >> Starting ccsd 2.0.115: >> ?Built: Mar ?6 2011 00:47:03 >> ?Copyright (C) Red Hat, Inc. ?2004 ?All rights reserved. >> ? No Daemon:: SET >> >> cluster.conf (cluster name = omadvnfs01, version = 71) found. >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Remote copy of cluster.conf is from quorate node. >> ?Local version # : 71 >> ?Remote version #: 71 >> Initial status:: Quorate >> >> [root at omadvnfs01a ~]# /sbin/fenced -D >> 1303134822 cman: node 2 added >> 1303134822 cman: node 3 added >> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc >> 1303134822 listen 4 member 5 groupd 7 >> 1303134861 client 3: join default >> 1303134861 delay post_join 3s post_fail 0s >> 1303134861 added 2 nodes from ccs >> 1303134861 setid default 65537 >> 1303134861 start default 1 members 2 3 >> 1303134861 do_recovery stop 0 start 1 finish 0 >> 1303134861 finish default 1 >> >> [root at omadvnfs01a ~]# /sbin/dlm_controld -D >> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2 >> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2 >> 1303134840 set_ccs_options 480 >> 1303134840 cman: node 2 added >> 1303134840 set_configfs_node 2 10.198.1.111 local 0 >> 1303134840 cman: node 3 added >> 1303134840 set_configfs_node 3 10.198.1.110 local 1 >> >> >> [root at omadvnfs01a ~]# /sbin/groupd -D >> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1 >> 1303134809 setup_cpg groupd_handle 6b8b456700000000 >> 1303134809 groupd confchg total 2 left 0 joined 1 >> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1 >> 1303134822 client connection 3 >> 1303134822 got client 3 setup >> 1303134822 setup fence 0 >> 1303134840 client connection 4 >> 1303134840 got client 4 setup >> 1303134840 setup dlm 1 >> 1303134853 client connection 5 >> 1303134853 got client 5 setup >> 1303134853 setup gfs 2 >> 1303134861 got client 3 join >> 1303134861 0:default got join >> 1303134861 0:default is cpg client 6 name 0_default handle >> 6633487300000001 >> 1303134861 0:default cpg_join ok >> 1303134861 0:default waiting for first cpg event >> 1303134861 client connection 7 >> 1303134861 0:default waiting for first cpg event >> 1303134861 got client 7 get_group >> 1303134861 0:default waiting for first cpg event >> 1303134861 0:default waiting for first cpg event >> 1303134861 0:default confchg left 0 joined 1 total 2 >> 1303134861 0:default process_node_join 3 >> 1303134861 0:default cpg add node 2 total 1 >> 1303134861 0:default cpg add node 3 total 2 >> 1303134861 0:default make_event_id 300020001 nodeid 3 
memb_count 2 type 1 >> 1303134861 0:default queue join event for nodeid 3 >> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN >> 1303134861 0:default app node init: add 3 total 1 >> 1303134861 0:default app node init: add 2 total 2 >> 1303134861 0:default waiting for 1 more stopped messages before >> JOIN_ALL_STOPPED >> > > That looks like a service error. Is fencing started and working? Check the > output of cman_tool services or group_tool > > Chrissie > > -- Another point that I saw is the output of clustat looks good on the centos node, but the centos node appears offline to the rhel node. Here's that clustat as well as group_tool and cman_tool from both nodes: centos: [root at omadvnfs01a ~]# clustat Cluster Status for omadvnfs01 @ Mon Apr 18 18:25:58 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ omadvnfs01b.sec.jel.lc 2 Online, rgmanager omadvnfs01a.sec.jel.lc 3 Online, Local, rgmanager ... [root at omadvnfs01a ~]# group_tool -v ls type level name id state node id local_done fence 0 default 00010001 none [2 3] dlm 1 clvmd 00040002 none [2 3] dlm 1 rgmanager 00030002 none [2 3] [root at omadvnfs01a ~]# cman_tool status Version: 6.2.0 Config Version: 72 Cluster Name: omadvnfs01 Cluster Id: 44973 Cluster Member: Yes Cluster Generation: 1976 Membership state: Cluster-Member Nodes: 2 Expected votes: 1 Total votes: 2 Quorum: 1 Active subsystems: 9 Flags: 2node Dirty Ports Bound: 0 11 177 Node name: omadvnfs01a.sec.jel.lc Node ID: 3 Multicast addresses: 239.192.175.93 Node addresses: 10.198.1.110 rhel: [root at omadvnfs01b ~]# clustat Cluster Status for omadvnfs01 @ Tue Apr 19 08:29:07 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ omadvnfs01b.sec.jel.lc 2 Online, Local, rgmanager omadvnfs01a.sec.jel.lc 3 Offline, rgmanager ... [root at omadvnfs01b ~]# group_tool -v ls type level name id state node id local_done fence 0 default 00010001 none [2 3] dlm 1 gfs_data00 00020002 none [2] dlm 1 rgmanager 00030002 none [2 3] dlm 1 clvmd 00040002 none [2 3] gfs 2 gfs_data00 00010002 none [2] [root at omadvnfs01b ~]# cman_tool status Version: 6.2.0 Config Version: 72 Cluster Name: omadvnfs01 Cluster Id: 44973 Cluster Member: Yes Cluster Generation: 1976 Membership state: Cluster-Member Nodes: 2 Expected votes: 1 Total votes: 2 Quorum: 1 Active subsystems: 9 Flags: 2node Dirty Ports Bound: 0 11 177 Node name: omadvnfs01b.sec.jel.lc Node ID: 2 Multicast addresses: 239.192.175.93 Node addresses: 10.198.1.111 Thanks! From linux at alteeve.com Tue Apr 19 13:43:43 2011 From: linux at alteeve.com (Digimer) Date: Tue, 19 Apr 2011 09:43:43 -0400 Subject: [Linux-cluster] about cluster In-Reply-To: <545603.34798.qm@web94807.mail.in2.yahoo.com> References: <545603.34798.qm@web94807.mail.in2.yahoo.com> Message-ID: <4DAD918F.6030609@alteeve.com> On 04/19/2011 02:34 AM, Harvinder Singh Binder wrote: > Good afternoon sir, > Please send me whole detail about cluster such as what is cluster, requirement,configration etc > Harvinder Singh S/O Baldev Raj, VPO Barwa Teh. Anandpur Sahib, Dist. Ropar, PunjabE-Mail ID:- jmd_singhsaini at yahoo.com This is a bit like asking someone to send you all the details on how to be a programmer. It is simply not possible. First; * Clusters can be designed for High Availability, Performance or Scalability. * What specific applications or services do you want to cluster and are they supported by any cluster stack. * What cluster foundation; Corosync or Heartbeat? What resource manager, Pacemaker or rgmanager? 
* How big will you make your cluster? Will it spread over a wide area between data centers? You need to come back to the list with *much* more specific questions. You will also find people here more willing to help if you show what work you've done on your own. As Andrew said; This kind of question is like asking for someone to do your homework for you. Clustering is not hard, but it is complex. You must be willing to spend a good amount of time studying and practicing before you ever want to put a cluster in to production. Are you willing and able to put that time into it? If so, show it by studying and then asking questions. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From rossnick-lists at cybercat.ca Tue Apr 19 14:37:16 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Tue, 19 Apr 2011 10:37:16 -0400 Subject: [Linux-cluster] about cluster References: <545603.34798.qm@web94807.mail.in2.yahoo.com> Message-ID: <8B94A4716CDE41A9ACAA47063B3C28A3@versa> > > Good afternoon sir, > > Please send me whole detail about cluster such as what is cluster, > > requirement,configration etc > > http://lmgtfy.com/?q=redhat+linux+cluster :LOL: That made my day... From mammadshah at hotmail.com Tue Apr 19 14:47:24 2011 From: mammadshah at hotmail.com (Muhammad Ammad Shah) Date: Tue, 19 Apr 2011 20:47:24 +0600 Subject: [Linux-cluster] GFS version Message-ID: Hello, I am using RHEL 5.3 and formated the shared volume using gfs. how can i know that its GFS version 1 or GFS version 2? Thanks, Muhammad Ammad Shah From swhiteho at redhat.com Tue Apr 19 14:56:54 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 19 Apr 2011 15:56:54 +0100 Subject: [Linux-cluster] GFS version In-Reply-To: References: Message-ID: <1303225014.2702.47.camel@dolmen> Hi, On Tue, 2011-04-19 at 20:47 +0600, Muhammad Ammad Shah wrote: > Hello, > > > I am using RHEL 5.3 and formated the shared volume using gfs. how can i know that its GFS version 1 or GFS version 2? > > > > Thanks, > Muhammad Ammad Shah > Well you should be able to use file -s to figure that out, but since early versions of file can't tell the difference, that won't work for you on 5.3, I'm afraid. You could try to mount it and then look to see whether you were successful with gfs or gfs2. Also, you could try using gfs_tool or gfs2_tool to print the info out from the sb depending on which is easier, Steve. From rpeterso at redhat.com Tue Apr 19 15:32:01 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 19 Apr 2011 11:32:01 -0400 (EDT) Subject: [Linux-cluster] GFS version In-Reply-To: Message-ID: <574804652.33011.1303227121262.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Hello, | | | I am using RHEL 5.3 and formated the shared volume using gfs. how can | i know that its GFS version 1 or GFS version 2? Hi, One way is to use gfs2_edit to find this out like so: [root at kool ~]# gfs2_edit -x -p root /dev/kool_vg/test | grep seems *** This seems to be a GFS-1 file system *** If it's gfs2, it doesn't print anything. Regards, Bob Peterson Red Hat File Systems From mguazzardo76 at gmail.com Tue Apr 19 17:58:38 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Tue, 19 Apr 2011 14:58:38 -0300 Subject: [Linux-cluster] Question with RHCS and oracle Message-ID: Hello: I would like to Install an Oracle database with RHCS (RH 5.5) I have some issues. Can I make a cluster without using RAC for listener and VIP? 
only using the Redhat cluster services. Or if mandatory to use RAC? Any hint will be appreciated. Thanks in advance Regards, -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvaro.fernandez at sivsa.com Tue Apr 19 19:27:56 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Tue, 19 Apr 2011 21:27:56 +0200 Subject: [Linux-cluster] Question with RHCS and oracle References: Message-ID: <607D6181D9919041BE792D70EF2AEC48018D8C2D@LIMENS.sivsa.int> Hi, No, it's not mandatory to use RAC. You can use Cluster Suite HA services to provide a cold failover cluster for Oracle database, and this is supported by both Oracle and Redhat as a valid deployment. You can install Oracle Standard edition or Enterprise editions for that. Check Lon notes (for example) in http://people.redhat.com/lhh/oracle-rhel5-notes-0.6/oracle-notes.html regards, alvaro Hello: I would like to Install an Oracle database with RHCS (RH 5.5) I have some issues. Can I make a cluster without using RAC for listener and VIP? only using the Redhat cluster services. Or if mandatory to use RAC? Any hint will be appreciated. Thanks in advance Regards, -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael at ulimit.org Tue Apr 19 19:31:20 2011 From: michael at ulimit.org (Michael Pye) Date: Tue, 19 Apr 2011 20:31:20 +0100 Subject: [Linux-cluster] Question with RHCS and oracle In-Reply-To: References: Message-ID: <4DADE308.2000006@ulimit.org> On 19/04/2011 18:58, Marcelo Guazzardo wrote: > I would like to Install an Oracle database with RHCS (RH 5.5) > I have some issues. Can I make a cluster without using RAC for listener and > VIP? only using the Redhat cluster services. Or if mandatory to use RAC? > Any hint will be appreciated. If you want an active-passive cluster, that is fine to do just with RHCS, no requirement for RAC. The RH docs have an example of how to do an oracle cold-failover cluster for oracle. If you want active-active then you will require RAC. Michael From mguazzardo76 at gmail.com Wed Apr 20 00:33:46 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Tue, 19 Apr 2011 21:33:46 -0300 Subject: [Linux-cluster] Question with RHCS and oracle In-Reply-To: <4DADE308.2000006@ulimit.org> References: <4DADE308.2000006@ulimit.org> Message-ID: Michael and Alvaro. Thanks for your reply. I 'll research how install RHCS with an active-pasive cluster. Regards! 2011/4/19 Michael Pye > On 19/04/2011 18:58, Marcelo Guazzardo wrote: > > I would like to Install an Oracle database with RHCS (RH 5.5) > > I have some issues. Can I make a cluster without using RAC for listener > and > > VIP? only using the Redhat cluster services. Or if mandatory to use RAC? > > Any hint will be appreciated. > > If you want an active-passive cluster, that is fine to do just with > RHCS, no requirement for RAC. The RH docs have an example of how to do > an oracle cold-failover cluster for oracle. > > If you want active-active then you will require RAC. > > Michael > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... 
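For a feel of what such a cold-failover setup looks like once the service is defined, day-to-day handling is just rgmanager commands -- the service and node names here are invented, not taken from any real configuration:

clustat                                      # where is the Oracle service running now?
clusvcadm -r oracledb -m node2.example.com   # relocate it to the other node
clusvcadm -d oracledb                        # stop it cluster-wide (e.g. for maintenance)
clusvcadm -e oracledb                        # start it again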
URL: From mguazzardo76 at gmail.com Wed Apr 20 00:49:47 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Tue, 19 Apr 2011 21:49:47 -0300 Subject: [Linux-cluster] Question with RHCS and oracle In-Reply-To: References: <4DADE308.2000006@ulimit.org> Message-ID: Hello again Anyone here have read the follow paper http: www.perftuning.com/pdf/white_paper_linux_cluster.pdf ? That i see very well explained. If anyone have another page with tutorials, thanks in advance. Regards, Marcelo PS: Sorry for my english! 2011/4/19 Marcelo Guazzardo > Michael and Alvaro. > Thanks for your reply. I 'll research how install RHCS with an > active-pasive cluster. > Regards! > > > > 2011/4/19 Michael Pye > >> On 19/04/2011 18:58, Marcelo Guazzardo wrote: >> > I would like to Install an Oracle database with RHCS (RH 5.5) >> > I have some issues. Can I make a cluster without using RAC for listener >> and >> > VIP? only using the Redhat cluster services. Or if mandatory to use RAC? >> > Any hint will be appreciated. >> >> If you want an active-passive cluster, that is fine to do just with >> RHCS, no requirement for RAC. The RH docs have an example of how to do >> an oracle cold-failover cluster for oracle. >> >> If you want active-active then you will require RAC. >> >> Michael >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Jankowski at hp.com Wed Apr 20 01:35:44 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 20 Apr 2011 01:35:44 +0000 Subject: [Linux-cluster] Question with RHCS and oracle In-Reply-To: References: <4DADE308.2000006@ulimit.org> Message-ID: <036B68E61A28CA49AC2767596576CD596F64D6138C@GVW1113EXC.americas.hpqcorp.net> Marcelo, The paper you mentioned is now 6 years old. Quite a bit has changed in RHEL CS and Oracle DB since. If you need RAC and you are doing this for a living, I recommend that you invest in these two books: For Oracle 11g (published in 2010): http://www.amazon.com/Pro-Oracle-Database-11g-Linux/dp/1430229586/ref=sr_1_1?s=books&ie=UTF8&qid=1303262530&sr=1-1 For Oracle 10g (published in 2006): http://www.amazon.com/Pro-Oracle-Database-11g-Linux/dp/1430229586/ref=sr_1_1?s=books&ie=UTF8&qid=1303262530&sr=1-1 Each has 800+ pages of very solid information. However, as always with books, they are obsolete the day they are published. Oracle 11g R2 RAC is covered in the first book in one chapter only. I hope this helps, Regards, Chris Jankowski From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Wednesday, 20 April 2011 10:50 To: linux clustering Subject: Re: [Linux-cluster] Question with RHCS and oracle Hello again Anyone here have read the follow paper http:www.perftuning.com/pdf/white_paper_linux_cluster.pdf ? That i see very well explained. If anyone have another page with tutorials, thanks in advance. Regards, Marcelo PS: Sorry for my english! 2011/4/19 Marcelo Guazzardo > Michael and Alvaro. Thanks for your reply. I 'll research how install RHCS with an active-pasive cluster. Regards! 2011/4/19 Michael Pye > On 19/04/2011 18:58, Marcelo Guazzardo wrote: > I would like to Install an Oracle database with RHCS (RH 5.5) > I have some issues. 
Can I make a cluster without using RAC for listener and > VIP? only using the Redhat cluster services. Or if mandatory to use RAC? > Any hint will be appreciated. If you want an active-passive cluster, that is fine to do just with RHCS, no requirement for RAC. The RH docs have an example of how to do an oracle cold-failover cluster for oracle. If you want active-active then you will require RAC. Michael -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Wed Apr 20 02:29:25 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Tue, 19 Apr 2011 23:29:25 -0300 Subject: [Linux-cluster] Question with RHCS and oracle In-Reply-To: <036B68E61A28CA49AC2767596576CD596F64D6138C@GVW1113EXC.americas.hpqcorp.net> References: <4DADE308.2000006@ulimit.org> <036B68E61A28CA49AC2767596576CD596F64D6138C@GVW1113EXC.americas.hpqcorp.net> Message-ID: HI Chris, thanks for your repply. I was reading the Lon's notes, I have seem that he uses a active-passive conf, without gfs. I installed a mysql cluster, but, I used gfs. If I use and active-passive conf, I believe that I shouldn't use gfs. Anyone could explain me how to configure a cluster without gfs?. Thanks.! 2011/4/19 Jankowski, Chris > Marcelo, > > > > The paper you mentioned is now 6 years old. Quite a bit has changed in > RHEL CS and Oracle DB since. > > > > If you need RAC and you are doing this for a living, I recommend that you > invest in these two books: > > > > For Oracle 11g (published in 2010): > > > http://www.amazon.com/Pro-Oracle-Database-11g-Linux/dp/1430229586/ref=sr_1_1?s=books&ie=UTF8&qid=1303262530&sr=1-1 > > > > For Oracle 10g (published in 2006): > > > http://www.amazon.com/Pro-Oracle-Database-11g-Linux/dp/1430229586/ref=sr_1_1?s=books&ie=UTF8&qid=1303262530&sr=1-1 > > > > Each has 800+ pages of very solid information. > > > > However, as always with books, they are obsolete the day they are > published. Oracle 11g R2 RAC is covered in the first book in one chapter > only. > > > > I hope this helps, > > > > Regards, > > > > Chris Jankowski > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Wednesday, 20 April 2011 10:50 > *To:* linux clustering > *Subject:* Re: [Linux-cluster] Question with RHCS and oracle > > > > Hello again > > Anyone here have read the follow paper http: > www.perftuning.com/pdf/white_paper_linux_cluster.pdf ? > That i see very well explained. If anyone have another page with tutorials, > thanks in advance. > Regards, > Marcelo > PS: Sorry for my english! > > 2011/4/19 Marcelo Guazzardo > > Michael and Alvaro. > Thanks for your reply. I 'll research how install RHCS with an > active-pasive cluster. > Regards! > > > > 2011/4/19 Michael Pye > > On 19/04/2011 18:58, Marcelo Guazzardo wrote: > > I would like to Install an Oracle database with RHCS (RH 5.5) > > I have some issues. Can I make a cluster without using RAC for listener > and > > VIP? only using the Redhat cluster services. Or if mandatory to use RAC? > > Any hint will be appreciated. > > If you want an active-passive cluster, that is fine to do just with > RHCS, no requirement for RAC. 
The RH docs have an example of how to do > an oracle cold-failover cluster for oracle. > > If you want active-active then you will require RAC. > > Michael > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Jankowski at hp.com Wed Apr 20 03:03:33 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 20 Apr 2011 03:03:33 +0000 Subject: [Linux-cluster] Question with RHCS and oracle In-Reply-To: References: <4DADE308.2000006@ulimit.org> <036B68E61A28CA49AC2767596576CD596F64D6138C@GVW1113EXC.americas.hpqcorp.net> Message-ID: <036B68E61A28CA49AC2767596576CD596F64D6141D@GVW1113EXC.americas.hpqcorp.net> Marcelo, You can use standard RHEL CS manuals as a starting point: Red_Hat_Enterprise_Linux-6-Cluster_Suite_Overview-en-US.pdf Red_Hat_Enterprise_Linux-6-Cluster_Administration-en-US.pdf Remember that you still need to have shared storage, which in enterprise class systems means FC storage arrays or at least iSCSI based arrays. Both require proper configuration to achieve full redundancy of access. For FC this means configuration of device mapper multipath. There is RHEL manual for this as well. Then you probably would like to use LVM for your storage. There is a little known and badly documented way of using LVM with tags. You probably want that. Then you layer either ext4fs or XFS on top. No need for GFS. You just do not install it. Nor do you need CLVM. Regards, Chris Jankowski From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Wednesday, 20 April 2011 12:29 To: linux clustering Subject: Re: [Linux-cluster] Question with RHCS and oracle HI Chris, thanks for your repply. I was reading the Lon's notes, I have seem that he uses a active-passive conf, without gfs. I installed a mysql cluster, but, I used gfs. If I use and active-passive conf, I believe that I shouldn't use gfs. Anyone could explain me how to configure a cluster without gfs?. Thanks.! 2011/4/19 Jankowski, Chris > Marcelo, The paper you mentioned is now 6 years old. Quite a bit has changed in RHEL CS and Oracle DB since. If you need RAC and you are doing this for a living, I recommend that you invest in these two books: For Oracle 11g (published in 2010): http://www.amazon.com/Pro-Oracle-Database-11g-Linux/dp/1430229586/ref=sr_1_1?s=books&ie=UTF8&qid=1303262530&sr=1-1 For Oracle 10g (published in 2006): http://www.amazon.com/Pro-Oracle-Database-11g-Linux/dp/1430229586/ref=sr_1_1?s=books&ie=UTF8&qid=1303262530&sr=1-1 Each has 800+ pages of very solid information. However, as always with books, they are obsolete the day they are published. Oracle 11g R2 RAC is covered in the first book in one chapter only. 
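To make the LVM-with-tags suggestion above a little more concrete, here is a rough sketch of the usual HA-LVM pattern -- VG, LV and host names are made up, and this is an outline rather than a tested recipe:

multipath -ll                                # first confirm the shared FC/iSCSI paths are healthy
# In /etc/lvm/lvm.conf on each node, limit activation to local VGs plus VGs
# tagged with the node's own name, for example:
#   volume_list = [ "vg_system", "@node1.example.com" ]
# then rebuild the initrd so the restriction also applies at boot.
vgchange --addtag $(hostname) vg_oracle      # tag the shared VG on the node that should own it
vgchange -ay vg_oracle                       # activate it locally -- no clvmd involved
mkfs.ext4 /dev/vg_oracle/lv_oradata          # plain ext4 (or xfs) on top, no GFS
# On failover, the rgmanager lvm/fs resources (or the admin) move the tag and
# the mount to the surviving node.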
I hope this helps, Regards, Chris Jankowski From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Wednesday, 20 April 2011 10:50 To: linux clustering Subject: Re: [Linux-cluster] Question with RHCS and oracle Hello again Anyone here have read the follow paper http:www.perftuning.com/pdf/white_paper_linux_cluster.pdf ? That i see very well explained. If anyone have another page with tutorials, thanks in advance. Regards, Marcelo PS: Sorry for my english! 2011/4/19 Marcelo Guazzardo > Michael and Alvaro. Thanks for your reply. I 'll research how install RHCS with an active-pasive cluster. Regards! 2011/4/19 Michael Pye > On 19/04/2011 18:58, Marcelo Guazzardo wrote: > I would like to Install an Oracle database with RHCS (RH 5.5) > I have some issues. Can I make a cluster without using RAC for listener and > VIP? only using the Redhat cluster services. Or if mandatory to use RAC? > Any hint will be appreciated. If you want an active-passive cluster, that is fine to do just with RHCS, no requirement for RAC. The RH docs have an example of how to do an oracle cold-failover cluster for oracle. If you want active-active then you will require RAC. Michael -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From dlugi at forum-polska.com Wed Apr 20 07:50:03 2011 From: dlugi at forum-polska.com (dlugi) Date: Wed, 20 Apr 2011 09:50:03 +0200 Subject: [Linux-cluster] Solution for HPC Message-ID: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> Hi Gurus, I would like to ask You about something. Since few days I`m preparing 3D fluid simulation. The problem is that my simulation is rendered only on one core. CPU usage provading information that only 1 core is 100% used by process. In my opinion this software doesnt support multithreading thats why everything is calculated on one core. Is it possible to build some kind of HPC cluster where this single process could be distributed for several machines ? I`m not thinking about dividing this job for several small peaces and distributing them. I`m thinking about infrastructure where single process could use CPU power from several machines at the same time. Is it possible to do this on RH or Fedora ? cheers Konrad From list at fajar.net Wed Apr 20 07:58:49 2011 From: list at fajar.net (Fajar A. Nugraha) Date: Wed, 20 Apr 2011 14:58:49 +0700 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> Message-ID: On Wed, Apr 20, 2011 at 2:50 PM, dlugi wrote: > Is it possible to build some kind of HPC cluster where this single process > could be distributed for several machines ? No. That's not what HPC is all about. 
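A quick way to confirm that diagnosis on the running job, before spending anything on hardware (the PID below is a placeholder):

SIM_PID=12345                                # placeholder: PID of the running simulation
top -H -p "$SIM_PID"                         # -H shows the individual threads live
ps -L -o pid,lwp,pcpu,comm -p "$SIM_PID"     # one-shot view of its threads (LWP column)
# If only one thread ever sits at ~100% CPU, extra cores or extra nodes will
# not speed the job up until the program itself is parallelised.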
-- Fajar
From dlugi at forum-polska.com Wed Apr 20 08:08:08 2011 From: dlugi at forum-polska.com (dlugi) Date: Wed, 20 Apr 2011 10:08:08 +0200 Subject: [Linux-cluster] Solution for HPC In-Reply-To: References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> Message-ID: <0cc2e4287dd05faa8642ef78bcf97d05@forum-polska.com> On Wed, 20 Apr 2011 14:58:49 +0700, "Fajar A. Nugraha" wrote: > On Wed, Apr 20, 2011 at 2:50 PM, dlugi > wrote: >> Is it possible to build some kind of HPC cluster where this single >> process >> could be distributed for several machines ? > > No. That's not what HPC is all about. Is it possible maybe with some other distribution ? Connecting few machines to work as one ?
From shariq.siddiqui at yahoo.com Wed Apr 20 08:24:39 2011 From: shariq.siddiqui at yahoo.com (Shariq Siddiqui) Date: Wed, 20 Apr 2011 01:24:39 -0700 (PDT) Subject: [Linux-cluster] GFS version In-Reply-To: Message-ID: <458846.77273.qm@web39801.mail.mud.yahoo.com> Dear Ammad, You can also tell the two apart simply by trying to mount the volume: # mount -t gfs /source /destination-folder If that mounts, the file system is GFS; if mount reports that it is not a GFS file system, try # mount -t gfs2 /source /destination-folder instead. Best Regards, Shariq Siddiqui --- On Tue, 4/19/11, Muhammad Ammad Shah wrote: From: Muhammad Ammad Shah Subject: [Linux-cluster] GFS version To: "Linux Cluster" Date: Tuesday, April 19, 2011, 9:47 AM Hello, I am using RHEL 5.3 and formated the shared volume using gfs. how can i know that its GFS version 1 or GFS version 2? Thanks, Muhammad Ammad Shah -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL:
From kkovachev at varna.net Wed Apr 20 08:48:27 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 20 Apr 2011 11:48:27 +0300 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <0cc2e4287dd05faa8642ef78bcf97d05@forum-polska.com> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> <0cc2e4287dd05faa8642ef78bcf97d05@forum-polska.com> Message-ID: On Wed, 20 Apr 2011 10:08:08 +0200, dlugi wrote: > On Wed, 20 Apr 2011 14:58:49 +0700, "Fajar A. Nugraha" > wrote: >> On Wed, Apr 20, 2011 at 2:50 PM, dlugi >> wrote: >>> Is it possible to build some kind of HPC cluster where this single >>> process >>> could be distributed for several machines ? >> >> No. That's not what HPC is all about. > > Is it possible maybe with some other distribution ? Connecting few > machines to work as one ? > No, it is not possible. Even though there are ways of connecting a few machines to work as one, that will not help: the result is still a single machine with multiple processors/cores, and your application cannot make use of them. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster
From gordan at bobich.net Wed Apr 20 08:57:51 2011 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 20 Apr 2011 09:57:51 +0100 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> Message-ID: <4DAEA00F.3040909@bobich.net> dlugi wrote: > Hi Gurus, > > I would like to ask You about something. Since few days I`m preparing > 3D fluid simulation. 
The problem is that my simulation is rendered only > on one core. CPU usage provading information that only 1 core is 100% > used by process. In my opinion this software doesnt support > multithreading thats why everything is calculated on one core. > > Is it possible to build some kind of HPC cluster where this single > process could be distributed for several machines ? > I`m not thinking about dividing this job for several small peaces and > distributing them. I`m thinking about infrastructure where single > process could use CPU power from several machines at the same time. > > Is it possible to do this on RH or Fedora ? There is no such thing - period. On any OS. If your application is single-process/single-thread, it will only scale vertically. Gordan From gordan at bobich.net Wed Apr 20 08:59:40 2011 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 20 Apr 2011 09:59:40 +0100 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <0cc2e4287dd05faa8642ef78bcf97d05@forum-polska.com> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> <0cc2e4287dd05faa8642ef78bcf97d05@forum-polska.com> Message-ID: <4DAEA07C.20408@bobich.net> dlugi wrote: > On Wed, 20 Apr 2011 14:58:49 +0700, "Fajar A. Nugraha" > wrote: >> On Wed, Apr 20, 2011 at 2:50 PM, dlugi wrote: >>> Is it possible to build some kind of HPC cluster where this single >>> process >>> could be distributed for several machines ? >> >> No. That's not what HPC is all about. > > Is it possible maybe with some other distribution ? Connecting few > machines to work as one ? Sort of, but you won't gain anything since your application is single-threaded. If your application forks multiple processes that work independently, then you could use something like Kerrighed to merge multiple machines into one big virtual SMP machine. But if SMP doesn't help you, then Kerrighed certainly won't either. Gordan From Chris.Jankowski at hp.com Wed Apr 20 09:05:52 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 20 Apr 2011 09:05:52 +0000 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F64D616C4@GVW1113EXC.americas.hpqcorp.net> Konrad, The first thing to do is to recompile your application using a parallelizing compiler with proper parameter equal to the number of cores on your server. This of course assumes that you have the source code for your application. For a properly written Fortran and C application a modern parallelizing compiler would do a great job. Note that today you may have easily 48 real physical cores i.e. 96 independent parallel threads of execution with hyperthreading turned on a modern Intel x86_64 server such as HP DL980 G7. Then the next step is to tune the application on the source code level to increase its parallelism such that it can actually use the 96 threads. Only then, if the elapsed time of your processing is still unacceptably long (weeks), you would move to a HPTC cluster. This is very expensive - the Infiniband interconnects do not come cheap and you still need to put in a few man years of work to tune your code for the HPTC cluster. I hope this helps. 
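As a small illustration of that first step -- assuming the simulation is C/C++ with source available and the compiler is GCC; whether these flags gain anything depends entirely on the code, and the file and binary names are made up:

gcc -O3 -fopenmp -o fluidsim fluidsim.c                      # honours existing #pragma omp annotations
gcc -O3 -ftree-parallelize-loops=8 -o fluidsim fluidsim.c    # ask GCC to auto-parallelise simple loops
OMP_NUM_THREADS=8 ./fluidsim                                 # choose the thread count at run time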
Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of dlugi Sent: Wednesday, 20 April 2011 17:50 To: linux-cluster at redhat.com Subject: [Linux-cluster] Solution for HPC Hi Gurus, I would like to ask You about something. Since few days I`m preparing 3D fluid simulation. The problem is that my simulation is rendered only on one core. CPU usage provading information that only 1 core is 100% used by process. In my opinion this software doesnt support multithreading thats why everything is calculated on one core. Is it possible to build some kind of HPC cluster where this single process could be distributed for several machines ? I`m not thinking about dividing this job for several small peaces and distributing them. I`m thinking about infrastructure where single process could use CPU power from several machines at the same time. Is it possible to do this on RH or Fedora ? cheers Konrad -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From list at fajar.net Wed Apr 20 09:17:12 2011 From: list at fajar.net (Fajar A. Nugraha) Date: Wed, 20 Apr 2011 16:17:12 +0700 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <036B68E61A28CA49AC2767596576CD596F64D616C4@GVW1113EXC.americas.hpqcorp.net> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> <036B68E61A28CA49AC2767596576CD596F64D616C4@GVW1113EXC.americas.hpqcorp.net> Message-ID: On Wed, Apr 20, 2011 at 4:05 PM, Jankowski, Chris wrote: > Only then, if the elapsed time of your processing is still unacceptably long (weeks), you would move to a HPTC cluster. This is very expensive - the Infiniband interconnects do not come cheap and you still need to put in a few man years of work to tune your code for the HPTC cluster. Well said :D -- Fajar From dlugi at forum-polska.com Wed Apr 20 09:40:24 2011 From: dlugi at forum-polska.com (dlugi) Date: Wed, 20 Apr 2011 11:40:24 +0200 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <036B68E61A28CA49AC2767596576CD596F64D616C4@GVW1113EXC.americas.hpqcorp.net> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> <036B68E61A28CA49AC2767596576CD596F64D616C4@GVW1113EXC.americas.hpqcorp.net> Message-ID: <97d5de62c5c44f8edbd970a2f3da6d20@forum-polska.com> Ok I understand. Anyway thanks for explanation. My software is Blender. I dont know why for such a thing like fluid dynamics somebody somebody coded software without multithreading. I will ask on Blender forum maybe there is any good version with support for multithreading. Thanks for fast and profi support :) I appriaciate. Konrad On Wed, 20 Apr 2011 09:05:52 +0000, "Jankowski, Chris" wrote: > Konrad, > > The first thing to do is to recompile your application using a > parallelizing compiler with proper parameter equal to the number of > cores on your server. This of course assumes that you have the > source > code for your application. > > For a properly written Fortran and C application a modern > parallelizing compiler would do a great job. > > Note that today you may have easily 48 real physical cores i.e. 96 > independent parallel threads of execution with hyperthreading turned > on a modern Intel x86_64 server such as HP DL980 G7. > > Then the next step is to tune the application on the source code > level to increase its parallelism such that it can actually use the > 96 > threads. 
> Only then, if the elapsed time of your processing is still > unacceptably long (weeks), you would move to a HPTC cluster. This is > very expensive - the Infiniband interconnects do not come cheap and > you still need to put in a few man years of work to tune your code > for > the HPTC cluster. > > I hope this helps. > > Regards, > > Chris Jankowski > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of dlugi > Sent: Wednesday, 20 April 2011 17:50 > To: linux-cluster at redhat.com > Subject: [Linux-cluster] Solution for HPC > > Hi Gurus, > > I would like to ask You about something. Since few days I`m > preparing 3D fluid simulation. The problem is that my simulation is > rendered only on one core. CPU usage provading information that only > 1 > core is 100% used by process. In my opinion this software doesnt > support > multithreading thats why everything is calculated on one core. > > Is it possible to build some kind of HPC cluster where this single > process could be distributed for several machines ? > I`m not thinking about dividing this job for several small peaces > and > distributing them. I`m thinking about infrastructure where single > process could use CPU power from several machines at the same time. > > Is it possible to do this on RH or Fedora ? > > cheers > > Konrad > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster
From list at fajar.net Wed Apr 20 09:49:15 2011 From: list at fajar.net (Fajar A. Nugraha) Date: Wed, 20 Apr 2011 16:49:15 +0700 Subject: [Linux-cluster] Solution for HPC In-Reply-To: <97d5de62c5c44f8edbd970a2f3da6d20@forum-polska.com> References: <4d8c6193b0d58ae49299c8164ba34849@forum-polska.com> <036B68E61A28CA49AC2767596576CD596F64D616C4@GVW1113EXC.americas.hpqcorp.net> <97d5de62c5c44f8edbd970a2f3da6d20@forum-polska.com> Message-ID: On Wed, Apr 20, 2011 at 4:40 PM, dlugi wrote: > Ok I understand. Anyway thanks for explanation. My software is Blender. I > dont know why for such a thing like fluid dynamics somebody somebody coded > software without multithreading. I will ask on Blender forum maybe there is > any good version with support for multithreading. Isn't it always there, but off by default? http://wbs.nsf.tc/articles/article8_e.html -- Fajar
From gnetravali at sonusnet.com Wed Apr 20 09:55:13 2011 From: gnetravali at sonusnet.com (Netravali Ganesh) Date: Wed, 20 Apr 2011 15:25:13 +0530 Subject: [Linux-cluster] Vlan interfaces Message-ID: Hi, I have a two node cluster configured. I have created bonding interface and configured the cluster IPs using below option in cluster.conf.