From corey.kovacs at gmail.com Sun May 1 09:32:39 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Sun, 1 May 2011 10:32:39 +0100 Subject: [Linux-cluster] How do you HA your storage? In-Reply-To: <4DBC401A.6000902@bulbous.org> References: <1304154086.10889.1446718041@webmail.messagingengine.com> <4DBBD19C.1000606@bulbous.org> <4DBBDE59.9000107@bulbous.org> <4DBC401A.6000902@bulbous.org> Message-ID: You could probably do what you want using san level mirroring across two sans and device-mapper-multipath. I believe the sans will automatically put the alternate san copies into read/write if it cannot communicate with the first if it's configured to do so but I don't have access to that capability on my EVA for lack of license. Actually, it woulnd't require a whole other san but if I was mirroring things, that's what I'd opt for. This is the kind of problem that DMP was designed to handle. If you are booting from the san, you may have some other tweeks but in general I think DMP is still the way to go. Good luck -C On Sat, Apr 30, 2011 at 6:00 PM, urgrue wrote: > On 30/4/11 14:27, Corey Kovacs wrote: > > This has nothing to do with any network. It's all over the fiber... > > True, my bad, I was thinking of DRBD. > >> Points in time? It's a raid 1, it's relatively instant. It's more >> complex to manage a failover in the way you describe if anything. > > I didn't mean that. What I meant is with any enterprise storage filer I can > walk in and take a point in time snapshot of my entire datacenter - all > hundreds of servers - with almost no effort. And restore it. That's a pretty > fantastic thing to be able to do before, say, a major upgrade on hundreds of > servers. And you manage all of it in one place. Take a situation like if the > company decides it needs a third copy of the data. It'd be a fun job to map > and configure the third LUN on 500 servers, when on the SAN it'd be a a few > minutes to configure. Or if that third copy needs to be async instead, I > don't even think you can do that with LVM or software raid. > Host-based mirroring is great for many situations, but when it comes to > larger environments, I think most companies tend to prefer SAN mirroring. > >> Well, my $0.02 anyway. >> >> -C >> >> On Sat, Apr 30, 2011 at 11:03 AM, urgrue ?wrote: >>> >>> Yes, these work, but then I'm having each server handle the job of >>> mirroring >>> their own disks, which has some disadvantages. Network usage instead of >>> fiber, more complex management of points-in-time compared to a nice big >>> fat >>> centralized SAN, etc. In my experience most companies favor SAN-level >>> replication. >>> The challenge is just getting Linux to recover gracefully when the SAN >>> fails >>> over. Worst case you can just reboot, but, that's not very HA. >>> >>> >>> On 30/4/11 13:23, Corey Kovacs wrote: >>>> >>>> What you seem to be describing is the mirror target for device mapper. >>>> >>>> Another alternative would be to setup a software raid using multipath'd >>>> luns. >>>> >>>> SANVOL1 ? ? ? ? ? ?SANVOL2 >>>> ? ?| ? ? ? ? ? ? ? ? ? ? ? ? ? | >>>> ? ?\ ? ? ? ? ? ? ? ? ? ? ? ? ?/ >>>> ? ? \ ? ? ? ? ? ? ? ? ? ? ? / >>>> ? ? ? \ ? ? ? ? ? ? ? ? ? / >>>> ? ? MPATH1 ? ?MPATH2 >>>> ? ? ? ? ?\ ? ? ? ? ? ? / >>>> ? ? ? ?RAID 1 DEV >>>> ? ? ? ? ? ? ? ?| >>>> ? ? ? ? ? ? ?PV >>>> ? ? ? ? ? ? ? ?| >>>> ? ? ? ? ? ? ? VG >>>> ? ? ? ? ? ? ? ?| >>>> ? ? ? ? ? ? ? LV >>>> >>>> That might work >>>> >>>> -C >>>> >>>> >>>> On Sat, Apr 30, 2011 at 10:08 AM, urgrue ? 
?wrote: >>>>> >>>>> But, how do you get dm-multipath to consider two different LUNs to be >>>>> in >>>>> fact two paths to the same device? >>>>> I mean, normally multipath has two paths to one device. >>>>> When we're talking about san-level mirroring, we've got two paths to >>>>> two >>>>> different devices (which just happen to contain identical data). >>>>> >>>>> On 30/4/11 11:47, Kit Gerrits wrote: >>>>>> >>>>>> With dual-controller arrays, dm-multipath ?keeps checking if the >>>>>> current >>>>>> device is still responding and switches to a different path if it is >>>>>> not. >>>>>> (for examply, by reading sector 0) >>>>>> >>>>>> With SAN failover, you may need to tell the secondary SAN LUN to go >>>>>> into >>>>>> read-write mode. >>>>>> Unfortunately, I am not familiar with tying this into RHEL. >>>>>> (also, sector 0 will already be readable on the secundary LUN, but not >>>>>> writable) >>>>>> >>>>>> Maybe there is a write test, which tries to write to both SANs >>>>>> The one which allows write access will become the active LUN. >>>>>> >>>>>> If you can switch your SANs inside 30 seconds, you might even be able >>>>>> to >>>>>> salvage/execute pending write operations. >>>>>> >>>>>> >>>>>> Regards, >>>>>> >>>>>> Kit >>>>>> >>>>>> -----Original Message----- >>>>>> From: linux-cluster-bounces at redhat.com >>>>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of urgrue >>>>>> Sent: zaterdag 30 april 2011 11:01 >>>>>> To: linux-cluster at redhat.com >>>>>> Subject: [Linux-cluster] How do you HA your storage? >>>>>> >>>>>> I'm struggling to find the best way to deal with SAN failover. >>>>>> By this I mean the common scenario where you have SAN-based mirroring. >>>>>> It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but >>>>>> how >>>>>> can >>>>>> you minimize the impact and manual effort to recover from losing a >>>>>> LUN, >>>>>> and >>>>>> needing to somehow get your system to realize the data is now on a >>>>>> different >>>>>> LUN (the now-active mirror)? >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From Chris.Jankowski at hp.com Sun May 1 11:22:03 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Sun, 1 May 2011 11:22:03 +0000 Subject: [Linux-cluster] How do you HA your storage? 
In-Reply-To: <4DBC3BED.6080104@bulbous.org> References: <1304154086.10889.1446718041@webmail.messagingengine.com> <036B68E61A28CA49AC2767596576CD596F6579D294@GVW1113EXC.americas.hpqcorp.net> <4DBC3BED.6080104@bulbous.org> Message-ID: <036B68E61A28CA49AC2767596576CD596F6579D2AB@GVW1113EXC.americas.hpqcorp.net> I think you might not appreciate subtleties involved in making the decision of failover and concurrency issues between different LUNs. Be that as it may, I'd suggest to close this discussion at this point, as it has nothing to do with the Linux cluster and everything to do with general BC and DR in multi-site environment. There are specialized forums for this. Regards, Chris -----Original Message----- From: urgrue [mailto:urgrue at bulbous.org] Sent: Sunday, 1 May 2011 02:42 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] How do you HA your storage? I do have RAID, multipath over multiple fabrics, etc. But what you're not at all protected from is major SAN failure, or a datacenter outage, for example. Which happens, and if you've got more than a few datacenters and dozens of SAN filers, you know they happen actually way too often for you to not miss a graceful, predictable recovery procedure. So like everyone else, you've got cluster nodes in each datacenter, and all of them connected to the same SAN. Everything will recover quite nicely from just about every type of failure - except failure of the SAN itself. Your cluster nodes in your backup datacenter will not be happy to see the disks disappear. You can activate your backup filer(s) in seconds - all your hundreds of passive nodes actually do now have functioning copies of the data and could/should be able to get back to work - but getting all of them to actually realize it and get back to work, can be hours of messy manual work. I wouldn't think it'd be very difficult to handle this gracefully, all the basic functionality is already there in multipath and LVM. I think it would be a pretty big deal in the enterprise world to be able to transparently switch SANs like this. As far as I know only z/os can do this currently and even then it's built around a very specific, complicated and expensive storage configuration. And there's a whole industry around "san virtualization" just because of this kind of sitautions, that would become obsolete overnight if the OS itself could handle it natively. On 30/4/11 16:29, Jankowski, Chris wrote: > I am just wondering, why would you like to do it this way? > > If you have SAN then by implication you have a storage array on the SAN. This storage array will normally have capability to give you highly available storage through RAID{1,5,6}. Moreover, any decent array will also provide redundancy in case of a failure of one of is controllers. Then standard dual fabric FC SAN configuration will give you multiple paths to the controllers of the array - normally at least 4 paths. What remains to be done on the servers is to configure device mapper multipath to fit your SAN configuration and capabilities of the array. Most modern arrays these days are active-active and support ALUA extensions. > > Nothing specifically needs to be done in the cluster software. This works the same way as for a single host. > > Are you trying to build a stretched cluster across multiple sites with a SAN array in each? 
> > Regards, > > Chris Jankowski > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of urgrue > Sent: Saturday, 30 April 2011 19:01 > To: linux-cluster at redhat.com > Subject: [Linux-cluster] How do you HA your storage? > > I'm struggling to find the best way to deal with SAN failover. > By this I mean the common scenario where you have SAN-based mirroring. > It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but how > can you minimize the impact and manual effort to recover from losing a > LUN, and needing to somehow get your system to realize the data is now > on a different LUN (the now-active mirror)? > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ercan.karadeniz at vodafone.com Sun May 1 21:57:23 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Sun, 1 May 2011 23:57:23 +0200 Subject: [Linux-cluster] GFS2 daemon hangs during boot process Message-ID: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> Hi Linux-Cluster-List-Members, I'm a newbie in RHCS. I have visited recently the RH436 training. Currently I'm trying to get more experience by doing some hands-on on the course labs. My setup is as follows: * Physical Server where Dom-0 is running * 2 x xen virtual machines * 2 nodes cluster (via Conga) The two node cluster is setup by using Conga. The node1 and node2 are xen virtual machines. Everything worked so far. For fencing I'm using fence_xvmd. That is also working without any problems. To test the multicasting with different address (than the default one), I have done some changes on the multicast address and rebooted both nodes. Apparently when I start node1 or node2 (xm console node1 -c ) they hang during boot process on the "GFS2 daemon". I have tried to login via using the "Single" mode as boot parameter regrettably this didn't help. My question is how can I overcome this deadlock situation. I need somehow to boot both nodes and change my recent changes related to the multicasting address in the cluster.conf file. However I cannot login to any of the nodes? Furthermore is there a change within the xen virtual machine to get in to the interactive boot mode? It would be great if you can give me some hint here. Many thanks in advance! Warm regards from Düsseldorf/Germany Ercan Karadeniz -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sun May 1 22:09:19 2011 From: linux at alteeve.com (Digimer) Date: Sun, 01 May 2011 18:09:19 -0400 Subject: [Linux-cluster] GFS2 daemon hangs during boot process In-Reply-To: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> References: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> Message-ID: <4DBDDA0F.4070509@alteeve.com> On 05/01/2011 05:57 PM, Karadeniz, Ercan, VF-Group wrote: > Hi Linux-Cluster-List-Members, > > I'm a newbie in RHCS. I have visited recently the RH436 training. > Currently I'm trying to get more experience by doing some hands-on on > the course labs. > > My setup is as follows: > > * Physical Server where Dom-0 is running > * 2 x xen virtual machines > * 2 nodes cluster (via Conga) > > The two node cluster is setup by using Conga. The node1 and node2 are > xen virtual machines.
Everything worked so far. For fencing I'm using > fence_xvmd. That is also working without any problems. > > To test the multicasting with different address (than the default one), > I have done some changes on the multicast address and rebooted both > nodes. Apparently when I start node1 or node2 (xm console node1 -c ) > they hang during boot process on the "GFS2 daemon". > > I have tried to login via using the "Single" mode as boot parameter > regrettably this didn't help. > > My question is how can I overcome this deadlock situation. I need > somehow to boot both nodes and change my recent changes related to the > multicasting address in the cluster.conf file. However I cannot login to > any of the nodes? Furthermore is there a change within the xen virtual > machine to get in to the interactive boot mode? > > It would be great if you can give me some hint here. > > Many thanks in advance! You could try booting the VM using the RHEL5 ISO as the first boot device, then boot into rescue mode. This should allow you to mount the system and edit you /etc/fstab and/or /etc/cluster/cluster.conf. As a side note, for testing, I like to 'chkconfig cman off'. This way, I know that even if I totally screw up, I'll always be able to boot into the host OS. :) Further, I'd set the GFS2 entry in fstab to not use 'defaults', but instead to use 'rw,suid,dev,exec,nouser,async'. This excludes the 'auto' option, so that a failure to mount the GFS2 partition won't cause dom0 to drop to single-user mode. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From mgrac at redhat.com Mon May 2 07:33:58 2011 From: mgrac at redhat.com (Marek Grac) Date: Mon, 02 May 2011 09:33:58 +0200 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: <4DBA71EA.9070303@redhat.com> Message-ID: <4DBE5E66.80802@redhat.com> Hi, On 04/29/2011 10:15 AM, Parvez Shaikh wrote: > Hi Marek, > > Can we give this option in cluster.conf file for bladecenter fencing > device or method for cluster.conf you should add ... missing_as_off="1" ... to fence configuration > > For IPMI, fencing is there similar option? > There is no such method for IPMI. m, From ercan.karadeniz at vodafone.com Mon May 2 11:35:27 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Mon, 2 May 2011 13:35:27 +0200 Subject: [Linux-cluster] GFS2 daemon hangs during boot process In-Reply-To: <4DBDDA0F.4070509@alteeve.com> References: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> <4DBDDA0F.4070509@alteeve.com> Message-ID: <84220C5308E5B146BFA374E47437FB550674AFD3@VF-MBX12.internal.vodafone.com> Many thanks for the hints. Regards, Ercan -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Montag, 2. Mai 2011 00:09 To: linux clustering Cc: Karadeniz, Ercan, VF-Group Subject: Re: [Linux-cluster] GFS2 daemon hangs during boot process On 05/01/2011 05:57 PM, Karadeniz, Ercan, VF-Group wrote: > Hi Linux-Cluster-List-Members, > > I'm a newbie in RHCS. I have visited recently the RH436 training. > Currently I'm trying to get more experience by doing some hands-on on > the course labs. > > My setup is as follows: > > * Physical Server where Dom-0 is running > * 2 x xen virtual machines > * 2 nodes cluster (via Conga) > > The two node cluster is setup by using Conga. The node1 and node2 are > xen virtual machines.
That is also working without any problems. > > To test the multicasting with different address (than the default > one), I have done some changes on the multicast address and rebooted > both nodes. Apparently when I start node1 or node2 (xm console node1 > -c ) they hang during boot process on the "GFS2 daemon". > > I have tried to login via using the "Single" mode as boot parameter > regrettably this didn't help. > > My question is how can I overcome this deadlock situation. I need > somehow to boot both nodes and change my recent changes related to the > multicasting address in the cluster.conf file. However I cannot login > to any of the nodes? Furthermore is there a change within the xen > virtual machine to get in to the interactive boot mode? > > It would be great if you can give me some hint here. > > Many thanks in advance! You could try booting the VM using the RHEL5 ISO as the first boot device, then boot into rescue mode. This should allow you to mount the system and edit you /etc/fstab and/or /etc/cluster/cluster.conf. As a side note, for testing, I like to 'chkconfig cman off'. This way, I know that even if I totally screw up, I'll always be able to boot into the host OS. :) Further, I'd set the GFS2 entry in fstab to not use 'defaults', but instead to use 'rw,suid,dev,exec,nouser,async'. This excludes the 'auto' option, so that a failure to mount the GFS2 partition won't cause dom0 to drop to single-user mode. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From parvez.h.shaikh at gmail.com Mon May 2 13:19:17 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Mon, 2 May 2011 18:49:17 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: <4DBE5E66.80802@redhat.com> References: <4DBA71EA.9070303@redhat.com> <4DBE5E66.80802@redhat.com> Message-ID: Hi Marek, I tried the option missing_as_off="1" and now I get an another error - fenced[18433]: fence "node5.sscdomain" failed fenced[18433]: fencing node "node5.sscdomain" Sniplet of cluster.conf file is - .... .... Did I miss something? Thanks Parvez On Mon, May 2, 2011 at 1:03 PM, Marek Grac wrote: > Hi, > > > On 04/29/2011 10:15 AM, Parvez Shaikh wrote: > >> Hi Marek, >> >> Can we give this option in cluster.conf file for bladecenter fencing >> device or method >> > > for cluster.conf you should add ... missing_as_off="1" ... to fence > configuration > > > >> For IPMI, fencing is there similar option? >> >> > There is no such method for IPMI. > > > m, > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Wed May 4 15:37:27 2011 From: linux at alteeve.com (Digimer) Date: Wed, 04 May 2011 11:37:27 -0400 Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' Message-ID: <4DC172B7.3040206@alteeve.com> This is a little concerning... Can someone confirm that I didn't screw up before I lodge a bug? 
[root at xenmaster003 ~]# rpm -q cman gfs2-utils cman-2.0.115-68.el5_6.3 gfs2-utils-0.1.62-28.el5 [root at xenmaster003 ~]# lvextend -L +50G /dev/drbd_sh1_vg0/cluster_files /dev/drbd3 Extending logical volume cluster_files to 250.00 GB Logical volume cluster_files successfully resized [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files /cluster_files/ (Test mode--File system will not be changed) FS: Mount Point: /cluster_files FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files FS: Size: 52428798 (0x31ffffe) FS: RG size: 65535 (0xffff) DEV: Size: 65536000 (0x3e80000) The file system grew by 51200MB. FS: Mount Point: /cluster_files FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files FS: Size: 52428798 (0x31ffffe) FS: RG size: 65535 (0xffff) DEV: Size: 65536000 (0x3e80000) The file system grew by 51200MB. gfs2_grow complete. [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files /cluster_files/ FS: Mount Point: /cluster_files FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files FS: Size: 52428798 (0x31ffffe) FS: RG size: 65535 (0xffff) DEV: Size: 65536000 (0x3e80000) The file system grew by 51200MB. Error: The device has grown by less than one Resource Group (RG). The device grew by 0MB. One RG is 255MB for this file system. gfs2_grow complete. [root at xenmaster003 ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/md2 57G 2.7G 51G 6% / /dev/md0 251M 52M 187M 22% /boot tmpfs 7.7G 0 7.7G 0% /dev/shm /dev/mapper/drbd_sh0_vg0-xen_shared 56G 259M 56G 1% /xen_shared /dev/mapper/drbd_sh1_vg0-cluster_files 250G 145G 106G 58% /cluster_files -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From rpeterso at redhat.com Wed May 4 15:54:30 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 4 May 2011 11:54:30 -0400 (EDT) Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' In-Reply-To: <4DC172B7.3040206@alteeve.com> Message-ID: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | This is a little concerning... Can someone confirm that I didn't screw | up before I lodge a bug? | | [root at xenmaster003 ~]# rpm -q cman gfs2-utils | cman-2.0.115-68.el5_6.3 | gfs2-utils-0.1.62-28.el5 | | [root at xenmaster003 ~]# lvextend -L +50G | /dev/drbd_sh1_vg0/cluster_files | /dev/drbd3 | Extending logical volume cluster_files to 250.00 GB | Logical volume cluster_files successfully resized | [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files | /cluster_files/ | (Test mode--File system will not be changed) | FS: Mount Point: /cluster_files | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files | FS: Size: 52428798 (0x31ffffe) | FS: RG size: 65535 (0xffff) | DEV: Size: 65536000 (0x3e80000) | The file system grew by 51200MB. | FS: Mount Point: /cluster_files | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files | FS: Size: 52428798 (0x31ffffe) | FS: RG size: 65535 (0xffff) | DEV: Size: 65536000 (0x3e80000) | The file system grew by 51200MB. | gfs2_grow complete. | | [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files | /cluster_files/ | FS: Mount Point: /cluster_files | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files | FS: Size: 52428798 (0x31ffffe) | FS: RG size: 65535 (0xffff) | DEV: Size: 65536000 (0x3e80000) | The file system grew by 51200MB. | Error: The device has grown by less than one Resource Group (RG). | The device grew by 0MB. One RG is 255MB for this file system. | gfs2_grow complete. 
| | [root at xenmaster003 ~]# df -h | Filesystem Size Used Avail Use% Mounted on | /dev/md2 57G 2.7G 51G 6% / | /dev/md0 251M 52M 187M 22% /boot | tmpfs 7.7G 0 7.7G 0% /dev/shm | /dev/mapper/drbd_sh0_vg0-xen_shared | 56G 259M 56G 1% /xen_shared | /dev/mapper/drbd_sh1_vg0-cluster_files | 250G 145G 106G 58% /cluster_files | | -- | Digimer | E-Mail: digimer at alteeve.com | AN!Whitepapers: http://alteeve.com | Node Assassin: http://nodeassassin.org Hi, Hm...This sounds like a bug to me. I'd open the bug record. Regards, Bob Peterson Red Hat File Systems From linux at alteeve.com Wed May 4 15:59:37 2011 From: linux at alteeve.com (Digimer) Date: Wed, 04 May 2011 11:59:37 -0400 Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' In-Reply-To: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4DC177E9.3060107@alteeve.com> On 05/04/2011 11:54 AM, Bob Peterson wrote: > ----- Original Message ----- > | This is a little concerning... Can someone confirm that I didn't screw > | up before I lodge a bug? > | > | [root at xenmaster003 ~]# rpm -q cman gfs2-utils > | cman-2.0.115-68.el5_6.3 > | gfs2-utils-0.1.62-28.el5 > | > | [root at xenmaster003 ~]# lvextend -L +50G > | /dev/drbd_sh1_vg0/cluster_files > | /dev/drbd3 > | Extending logical volume cluster_files to 250.00 GB > | Logical volume cluster_files successfully resized > | [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | (Test mode--File system will not be changed) > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | Error: The device has grown by less than one Resource Group (RG). > | The device grew by 0MB. One RG is 255MB for this file system. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# df -h > | Filesystem Size Used Avail Use% Mounted on > | /dev/md2 57G 2.7G 51G 6% / > | /dev/md0 251M 52M 187M 22% /boot > | tmpfs 7.7G 0 7.7G 0% /dev/shm > | /dev/mapper/drbd_sh0_vg0-xen_shared > | 56G 259M 56G 1% /xen_shared > | /dev/mapper/drbd_sh1_vg0-cluster_files > | 250G 145G 106G 58% /cluster_files > | > | -- > | Digimer > | E-Mail: digimer at alteeve.com > | AN!Whitepapers: http://alteeve.com > | Node Assassin: http://nodeassassin.org > > Hi, > > Hm...This sounds like a bug to me. I'd open the bug record. > > Regards, > > Bob Peterson > Red Hat File Systems Will do, thanks for the prompt reply. 
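For reference, a quick way to cross-check whether a grow actually took effect is to compare the size of the underlying block device with the size the mounted filesystem itself reports. A minimal sketch, assuming the stock RHEL 5 util-linux and gfs2-utils tools and reusing the device and mount point names from the output above:

# Size of the logical volume, in bytes
blockdev --getsize64 /dev/mapper/drbd_sh1_vg0-cluster_files

# Block counts as seen by the mounted GFS2 filesystem
gfs2_tool df /cluster_files

If the device reports noticeably more space than the filesystem, the grow has not really been applied yet, and gfs2_grow (without -T) can be re-run and the two sizes compared again.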
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Wed May 4 16:08:39 2011 From: linux at alteeve.com (Digimer) Date: Wed, 04 May 2011 12:08:39 -0400 Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' In-Reply-To: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4DC17A07.6070004@alteeve.com> On 05/04/2011 11:54 AM, Bob Peterson wrote: > ----- Original Message ----- > | This is a little concerning... Can someone confirm that I didn't screw > | up before I lodge a bug? > | > | [root at xenmaster003 ~]# rpm -q cman gfs2-utils > | cman-2.0.115-68.el5_6.3 > | gfs2-utils-0.1.62-28.el5 > | > | [root at xenmaster003 ~]# lvextend -L +50G > | /dev/drbd_sh1_vg0/cluster_files > | /dev/drbd3 > | Extending logical volume cluster_files to 250.00 GB > | Logical volume cluster_files successfully resized > | [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | (Test mode--File system will not be changed) > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | Error: The device has grown by less than one Resource Group (RG). > | The device grew by 0MB. One RG is 255MB for this file system. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# df -h > | Filesystem Size Used Avail Use% Mounted on > | /dev/md2 57G 2.7G 51G 6% / > | /dev/md0 251M 52M 187M 22% /boot > | tmpfs 7.7G 0 7.7G 0% /dev/shm > | /dev/mapper/drbd_sh0_vg0-xen_shared > | 56G 259M 56G 1% /xen_shared > | /dev/mapper/drbd_sh1_vg0-cluster_files > | 250G 145G 106G 58% /cluster_files > | > | -- > | Digimer > | E-Mail: digimer at alteeve.com > | AN!Whitepapers: http://alteeve.com > | Node Assassin: http://nodeassassin.org > > Hi, > > Hm...This sounds like a bug to me. I'd open the bug record. > > Regards, > > Bob Peterson > Red Hat File Systems For the archives: https://bugzilla.redhat.com/show_bug.cgi?id=702050 -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From ercan.karadeniz at vodafone.com Sat May 7 17:48:39 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Sat, 7 May 2011 19:48:39 +0200 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature Message-ID: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> Hi All, I have a two node cluster setup, with a httpd service and with the resource IP, GFS file system (iscsi => /dev/sda2 /var/www/html) and httpd. Now when I start the service the GFS file system gets not automatically mounted. 
I have also tried to relocate the service between both nodes (node1 and node2). However the result has not changed. Moreover I have checked the logs but did not see any error messages. The used OS is RHEL 5.4. Is this a normal behaviour of the RHCS or is this a bug or am I doing something wrong? Since I'm a newbie, I will be thankful for any hint. Have a nice weekend! Warm regards, Ercan Karadeniz -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Sun May 8 10:39:58 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Sun, 8 May 2011 16:09:58 +0530 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature In-Reply-To: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> References: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> Message-ID: Greetings, On Sat, May 7, 2011 at 11:18 PM, Karadeniz, Ercan, VF-Group wrote: > Hi All, > Can you post the config file here? -- Regards, Rajagopal From ercan.karadeniz at vodafone.com Sun May 8 10:56:35 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Sun, 8 May 2011 12:56:35 +0200 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature In-Reply-To: References: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> Message-ID: <84220C5308E5B146BFA374E47437FB55067EB49A@VF-MBX12.internal.vodafone.com> Hi Rajagopal, Please find enclosed my cluster.conf file. Regards, Ercan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sonntag, 8. Mai 2011 12:40 To: linux clustering Subject: Re: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature Greetings, On Sat, May 7, 2011 at 11:18 PM, Karadeniz, Ercan, VF-Group wrote: > Hi All, > Can you post the config file here? -- Regards, Rajagopal -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 2232 bytes Desc: cluster.conf URL: From fdinitto at redhat.com Mon May 9 11:25:28 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 09 May 2011 13:25:28 +0200 Subject: [Linux-cluster] new RHCS upstream wiki Message-ID: <4DC7CF28.30602@redhat.com> Hi all, we are in the process of moving the old cluster wiki (http://sourceware.org/cluster/wiki/) to: https://fedorahosted.org/cluster/wiki/HomePage All pages from the old wiki have been imported and we are in the process to reformat the pages to match the new trac-wiki notation. If you own any page or content, please make sure to verify that the content is correct. In the process I also spotted an insane amount of spam, if you have 5 minutes to spare to help cleaning that up, https://fedorahosted.org/cluster/wiki/TitleIndex is a good starting point. The old wiki will be made readonly soon and any change will be discarded. If necessary I have a backup stored on my harddisk. Please update all your URLs. Fabio From mra at webtel.pl Mon May 9 13:32:39 2011 From: mra at webtel.pl (mr) Date: Mon, 09 May 2011 15:32:39 +0200 Subject: [Linux-cluster] gfs2 setting quota problem Message-ID: <4DC7ECF7.4060302@webtel.pl> Hello, I'm having problem to init gfs2 quota on my existing FS. 
I have 2TB gfs2 FS which is being used in 50%. I have decided to set up quotas. Setting warning and limit levels seemed OK - no errors (athought I had to reset all my existing setting gfs2_quota reset...) New quota calculation ends with the following error: gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument Getting some quota values fails - I'm always getting "value: 0.0" :( I have no idea what is wrong... Sombody could help? thx in advance Details: 2.6.18-194.11.1.el5 /tmp/test type gfs2 (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) gfs2-utils.i386 0.1.62-28.el5_6.1 kmod-gfs.i686 0.1.34-2.el5 cman.i386 2.0.98-1.el5_3.4 -- Best Regards, MR From swhiteho at redhat.com Mon May 9 18:07:21 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 09 May 2011 19:07:21 +0100 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <4DC7ECF7.4060302@webtel.pl> References: <4DC7ECF7.4060302@webtel.pl> Message-ID: <1304964441.2813.9.camel@menhir> Hi, On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > Hello, > I'm having problem to init gfs2 quota on my existing FS. > > I have 2TB gfs2 FS which is being used in 50%. I have decided to set up > quotas. Setting warning and limit levels seemed OK - no errors (athought > I had to reset all my existing setting gfs2_quota reset...) New quota > calculation ends with the following error: > > gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > Are you using selinux? The gfs2_quota tool tries to mount the GFS2 metafs in order to make the changes that you requested. For some reason it seems this mount is failing. > Getting some quota values fails - I'm always getting "value: 0.0" :( > > I have no idea what is wrong... Sombody could help? thx in advance > > Details: > 2.6.18-194.11.1.el5 > /tmp/test type gfs2 > (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > gfs2-utils.i386 0.1.62-28.el5_6.1 > kmod-gfs.i686 0.1.34-2.el5 > cman.i386 2.0.98-1.el5_3.4 > > > Is this CentOS or a real RHEL installation? Steve. From raju.rajsand at gmail.com Tue May 10 03:25:31 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Tue, 10 May 2011 08:55:31 +0530 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature In-Reply-To: <84220C5308E5B146BFA374E47437FB55067EB49A@VF-MBX12.internal.vodafone.com> References: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> <84220C5308E5B146BFA374E47437FB55067EB49A@VF-MBX12.internal.vodafone.com> Message-ID: Greetings, On Sun, May 8, 2011 at 4:26 PM, Karadeniz, Ercan, VF-Group wrote: > Hi Rajagopal, > > Please find enclosed my cluster.conf file. > Just in case, why not mount the GFS in rc.local in both the nodes? Not an elegent solution. but usually works. -- Regards, Rajagopal From mra at webtel.pl Tue May 10 06:06:38 2011 From: mra at webtel.pl (mr) Date: Tue, 10 May 2011 08:06:38 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <1304964441.2813.9.camel@menhir> References: <4DC7ECF7.4060302@webtel.pl> <1304964441.2813.9.camel@menhir> Message-ID: <4DC8D5EE.7050001@webtel.pl> Hi, Steven Whitehouse pisze: > Hi, > > On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > >> Hello, >> I'm having problem to init gfs2 quota on my existing FS. >> >> I have 2TB gfs2 FS which is being used in 50%. I have decided to set up >> quotas. Setting warning and limit levels seemed OK - no errors (athought >> I had to reset all my existing setting gfs2_quota reset...) 
New quota >> calculation ends with the following error: >> >> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument >> >> > Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > metafs in order to make the changes that you requested. For some reason > it seems this mount is failing. > Selinux is diabled. I'm also able to mount gfs2meta manually. > >> Getting some quota values fails - I'm always getting "value: 0.0" :( >> >> I have no idea what is wrong... Sombody could help? thx in advance >> >> Details: >> 2.6.18-194.11.1.el5 >> /tmp/test type gfs2 >> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) >> gfs2-utils.i386 0.1.62-28.el5_6.1 >> kmod-gfs.i686 0.1.34-2.el5 >> cman.i386 2.0.98-1.el5_3.4 >> >> >> >> > Is this CentOS or a real RHEL installation? > Centos. > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Pozdrawiam, Miko?aj Radzewicz Tel. (022) 257 43 36 Webtel - Interactive Solutions Interaktywne rozwi?zania dla biznesu i marketingu Webtel Sp.z o.o. ul. Marynarska 11, 02-674 Warszawa S?d Rejonowy dla m. st. Warszawy, XIII Wydzia? Gospodarczy Krajowego Rejestru S?dowego KRS: 0000088129, NIP: 525-10-61-332, kapita? zak?adowy: 1 745 700 PLN www.webtel.pl Niniejsza wiadomo??, wraz z wszelkimi za??cznikami, jest poufna i przeznaczona wy??cznie do wiadomo?ci adresata. W przypadku omy?kowego otrzymania tej wiadomo?ci, prosimy o poinformowanie nadawcy oraz nie u?ywanie, nie przekazywanie i nie kopiowanie zawartych w niej tre?ci, kt?re jest prawnie zabronione. From koubat at fzu.cz Tue May 10 12:33:45 2011 From: koubat at fzu.cz (Tomas Kouba) Date: Tue, 10 May 2011 14:33:45 +0200 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager Message-ID: <4DC930A9.9040803@fzu.cz> Hello HA magicians, I would like to bring our services to a more reliable level and I was googling around some basic information about RH cluster suite. I am not quite able to answer 2 questions so I'd like to ask here: 1) What is the starting documentation that you would recommend to a linux administrator who would like to setup a HA cluster? I have found http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ is it good even though I use clone of RHEL? (Scientific Linux 6). 2) Which resource manager would you recommend? rgmanager or pacemaker? The following pages favor pacemaker but the documentation usually says rgmanager: http://www.spinics.net/lists/cluster/msg16401.html http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker Best regards, -- Tomas Kouba From adas at redhat.com Tue May 10 13:27:07 2011 From: adas at redhat.com (Abhijith Das) Date: Tue, 10 May 2011 09:27:07 -0400 (EDT) Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <4DC8D5EE.7050001@webtel.pl> Message-ID: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > From: "mr" > To: "linux clustering" > Sent: Tuesday, May 10, 2011 1:06:38 AM > Subject: Re: [Linux-cluster] gfs2 setting quota problem > Hi, > Steven Whitehouse pisze: > > Hi, > > > > On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > > > >> Hello, > >> I'm having problem to init gfs2 quota on my existing FS. > >> > >> I have 2TB gfs2 FS which is being used in 50%. I have decided to > >> set up > >> quotas. Setting warning and limit levels seemed OK - no errors > >> (athought > >> I had to reset all my existing setting gfs2_quota reset...) 
New > >> quota > >> calculation ends with the following error: > >> > >> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > >> > >> > > Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > > metafs in order to make the changes that you requested. For some > > reason > > it seems this mount is failing. > > > Selinux is diabled. I'm also able to mount gfs2meta manually. > > > >> Getting some quota values fails - I'm always getting "value: 0.0" > >> :( > >> > >> I have no idea what is wrong... Sombody could help? thx in advance > >> > >> Details: > >> 2.6.18-194.11.1.el5 > >> /tmp/test type gfs2 > >> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > >> gfs2-utils.i386 0.1.62-28.el5_6.1 > >> kmod-gfs.i686 0.1.34-2.el5 > >> cman.i386 2.0.98-1.el5_3.4 > >> > >> > >> > >> > > Is this CentOS or a real RHEL installation? > > > Centos. > > Steve. > > Hi, I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. Thanks! --Abhi From andrew at beekhof.net Tue May 10 13:33:25 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 10 May 2011 15:33:25 +0200 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager In-Reply-To: <4DC930A9.9040803@fzu.cz> References: <4DC930A9.9040803@fzu.cz> Message-ID: On Tue, May 10, 2011 at 2:33 PM, Tomas Kouba wrote: > Hello HA magicians, > > I would like to bring our services to a more reliable level and I was > googling around some > basic information about RH cluster suite. > I am not quite able to answer 2 questions so I'd like to ask here: > > 1) What is the starting documentation that you would recommend to a linux > administrator who would > like to setup a HA cluster? I have found > http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ > is it good even though I use clone of RHEL? (Scientific Linux 6). > > 2) Which resource manager would you recommend? rgmanager or pacemaker? > The following pages favor pacemaker but the documentation usually says > rgmanager: > http://www.spinics.net/lists/cluster/msg16401.html > http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker Well I'm going to say Pacemaker + http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ But then you'd expect that since I wrote both :-) I think its fair to say that Pacemaker is better than rgmanager, but the stars didn't align in time for RHEL6.0 so full support wasn't an option. That said, you're using SL6 so community support on mailing lists such as this one might well be sufficient. Oh, but the cluster GUI only supports rgmanager if thats important to you. Pacemaker does have a shiny integrated CLI though. From linux at alteeve.com Tue May 10 14:06:49 2011 From: linux at alteeve.com (Digimer) Date: Tue, 10 May 2011 10:06:49 -0400 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager In-Reply-To: References: <4DC930A9.9040803@fzu.cz> Message-ID: <4DC94679.7070507@alteeve.com> On 05/10/2011 09:33 AM, Andrew Beekhof wrote: > On Tue, May 10, 2011 at 2:33 PM, Tomas Kouba wrote: >> Hello HA magicians, >> >> I would like to bring our services to a more reliable level and I was >> googling around some >> basic information about RH cluster suite. 
>> I am not quite able to answer 2 questions so I'd like to ask here: >> >> 1) What is the starting documentation that you would recommend to a linux >> administrator who would >> like to setup a HA cluster? I have found >> http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ >> is it good even though I use clone of RHEL? (Scientific Linux 6). >> >> 2) Which resource manager would you recommend? rgmanager or pacemaker? >> The following pages favor pacemaker but the documentation usually says >> rgmanager: >> http://www.spinics.net/lists/cluster/msg16401.html >> http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker > > Well I'm going to say Pacemaker + > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ > But then you'd expect that since I wrote both :-) > > I think its fair to say that Pacemaker is better than rgmanager, but > the stars didn't align in time for RHEL6.0 so full support wasn't an > option. That said, you're using SL6 so community support on mailing > lists such as this one might well be sufficient. > > Oh, but the cluster GUI only supports rgmanager if thats important to you. > Pacemaker does have a shiny integrated CLI though. I'd agree with Andrew that Pacemaker is better, but I'd also say that rgmanager has it's bright spots, too. :) Pacemaker is far more flexible with regard to the resource management side of things, and rgmanager will be phased out over the next few years in favour of pacemaker. The two biggest arguments I'd make in favour of rgmanager are; If your resource managements needs are within it's capabilities and you are running RHCS, then it can be configured within the main cluster.conf file. It is also an old and well tested solution, which some find value in. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From andrew at beekhof.net Tue May 10 14:38:39 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 10 May 2011 16:38:39 +0200 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager In-Reply-To: <4DC94679.7070507@alteeve.com> References: <4DC930A9.9040803@fzu.cz> <4DC94679.7070507@alteeve.com> Message-ID: On Tue, May 10, 2011 at 4:06 PM, Digimer wrote: > On 05/10/2011 09:33 AM, Andrew Beekhof wrote: >> On Tue, May 10, 2011 at 2:33 PM, Tomas Kouba wrote: >>> Hello HA magicians, >>> >>> I would like to bring our services to a more reliable level and I was >>> googling around some >>> basic information about RH cluster suite. >>> I am not quite able to answer 2 questions so I'd like to ask here: >>> >>> 1) What is the starting documentation that you would recommend to a linux >>> administrator who would >>> like to setup a HA cluster? I have found >>> http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ >>> is it good even though I use clone of RHEL? (Scientific Linux 6). >>> >>> 2) Which resource manager would you recommend? rgmanager or pacemaker? 
>>> The following pages favor pacemaker but the documentation usually says >>> rgmanager: >>> http://www.spinics.net/lists/cluster/msg16401.html >>> http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker >> >> Well I'm going to say Pacemaker + >> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ >> But then you'd expect that since I wrote both :-) >> >> I think its fair to say that Pacemaker is better than rgmanager, but >> the stars didn't align in time for RHEL6.0 so full support wasn't an >> option. ?That said, you're using SL6 so community support on mailing >> lists such as this one might well be sufficient. >> >> Oh, but the cluster GUI only supports rgmanager if thats important to you. >> Pacemaker does have a shiny integrated CLI though. > > I'd agree with Andrew that Pacemaker is better, but I'd also say that > rgmanager has it's bright spots, too. :) No argument there. > Pacemaker is far more flexible with regard to the resource management > side of things, and rgmanager will be phased out over the next few years > in favour of pacemaker. > > The two biggest arguments I'd make in favour of rgmanager are; If your > resource managements needs are within it's capabilities and you are > running RHCS, then it can be configured within the main cluster.conf > file. It is also an old and well tested solution, which some find value in. Pacemaker will be celebrating its 8th anniversary this year. So it's not a spring chicken either ;-) From victor.ramirez at prhin.net Wed May 11 16:09:24 2011 From: victor.ramirez at prhin.net (Victor Ramirez) Date: Wed, 11 May 2011 12:09:24 -0400 Subject: [Linux-cluster] Fencing problem on Cluster Suite 3.0.12: fenced throws agent error when invoking fence_xvm Message-ID: I cannot get fence_virt in multicast mode to fence automatically even though I can do it manually with the fence_node command. Lemme start from the beginning. I have a 2 node cluster, each node is a kvm guest running on a different physical host. All machines are RHEL 6 x64 and the cluster suite version is 3.0.12. Like I mentioned before, cluster is configured correctly since I can fence manually with the fence_node command, but when I trigger fenced to call fence_xvm automatically, fence_xvm fails silently with error status 1 and no multicast packet is sent. fence_xvm command does not write its output anywhere when invoked by fenced so I cannot know why it fails, but I suspect that it may be trying to use serial communication instead of multicast. During troubleshooting, I made a script to run in place of fence_xvm in order to write the piped arguments into a log file and the arguments seem to be correct: domain=prhin01-vm01 nodename=prhin01-vm01 agent=fence_xvm debug=5 Also, I used tcpdump to determine that no multicast packet was being sent by fence_xvm. I downloaded the fence-virt code but I am not too keen on debugging linux C code as I am a lowly Java webapp developer. More info can be found here: https://access.redhat.com/discussion/fencing-problem-cluster-suite-3012-fenced-throws-agent-error-when-invoking-fencexvm What else can I try to coax fence_xvm to work? How can I make fence_xvm write to its output to a log file? Can I call fence_virt instead and use a parameter to force multicast mode? Should I give up multicast and try to use some vmchannel scheme? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ramiblanco at gmail.com Wed May 11 20:34:18 2011 From: ramiblanco at gmail.com (Ramiro Blanco) Date: Wed, 11 May 2011 17:34:18 -0300 Subject: [Linux-cluster] Write Performance Issues with GFS2 Message-ID: Hi, I have a 4 node cluster running gfs2 on top of a EMC SAN for a while now, and since couple of months ago we are randomly experiencing heavy write slowdowns. Write rate goes down from about 30 MB/s to 10 kB/s. It can affect 1 node or more at the same time. Umount and mount solves the problem on the affected node, but after some random time (hours, days) happens again. Operating system: Centos 5.6 x86_64 Kernel: 2.6.18-238.9.1.el5 Cman: cman-2.0.115-68.el5_6.3 Gfs2-utils: gfs2-utils-0.1.62-28.el5_6.1 3 nodes fibre channel 4gb 1 node on iscsi 1gb I've read that there's a bug concerning slow writes, but i think that affects newer kernels, isn't that right? Is there any other bug that could be the root of this? Cheers, -- Ramiro Blanco From amrossi at linux.it Wed May 11 22:07:58 2011 From: amrossi at linux.it (Andrea Modesto Rossi) Date: Thu, 12 May 2011 00:07:58 +0200 (CEST) Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: References: Message-ID: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> On Mer, 11 Maggio 2011 10:34 pm, Ramiro Blanco wrote: > Hi, > > I have a 4 node cluster running gfs2 on top of a EMC SAN for a while > now, and since couple of months ago we are randomly experiencing heavy > write slowdowns. Write rate goes down from about 30 MB/s to 10 kB/s. Hi! i've got a similar issue. In may case for example, an SCP copy begin with 30MB/s but after about 15 minuts it is less then 300Kb/s Why? -- Andrea Modesto Rossi Fedora Ambassador From adrew at redhat.com Wed May 11 22:15:58 2011 From: adrew at redhat.com (Adam Drew) Date: Wed, 11 May 2011 18:15:58 -0400 (EDT) Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> Message-ID: <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> Hello, There's multiple reasons such things could be happening and though both of you see similar symptoms the underlying causes may be different. I highly suggest opening cases with Red Hat Support if you are Red Hat customers. As far as what it could be... well, there's a lot to pick from. One thing that comes to mind is: https://bugzilla.redhat.com/show_bug.cgi?id=683155 But that depends on the size of the file being written vs. rgrp size. Lock contention is also a possibility of course. I'd start with that bug and go from there. Again, Red Hat Support may be able to really assist you in this. Thanks, Adam Drew ----- Original Message ----- > On Mer, 11 Maggio 2011 10:34 pm, Ramiro Blanco wrote: > > Hi, > > > > I have a 4 node cluster running gfs2 on top of a EMC SAN for a while > > now, and since couple of months ago we are randomly experiencing > > heavy > > write slowdowns. Write rate goes down from about 30 MB/s to 10 kB/s. > > Hi! > > i've got a similar issue. In may case for example, an SCP copy begin > with > 30MB/s but after about 15 minuts it is less then 300Kb/s > > Why? 
> > > -- > Andrea Modesto Rossi > Fedora Ambassador > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ramiblanco at gmail.com Wed May 11 23:32:57 2011 From: ramiblanco at gmail.com (Ramiro Blanco) Date: Wed, 11 May 2011 20:32:57 -0300 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> References: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: 2011/5/11 Adam Drew : > Hello, > > There's multiple reasons such things could be happening and though both of you see similar symptoms the underlying causes may be different. I highly suggest opening cases with Red Hat Support if you are Red Hat customers. > I'll do that. > As far as what it could be... well, there's a lot to pick from. One thing that comes to mind is: > > https://bugzilla.redhat.com/show_bug.cgi?id=683155 Can't access that one: "You are not authorized to access bug #683155" > > But that depends on the size of the file being written vs. rgrp size. Lock contention is also a possibility of course. I'd start with that bug and go from there. Again, Red Hat Support may be able to really assist you in this. > In my case, no mather what the file size is, it could be 100k or 1gb, the same performance is reached. Cheers, -- Ramiro Blanco From sufyan.khan at its.ws Thu May 12 06:27:13 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Thu, 12 May 2011 09:27:13 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Message-ID: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1534 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: script_db.sh Type: application/octet-stream Size: 814 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: startdb.sh Type: application/octet-stream Size: 448 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: stopdb.sh Type: application/octet-stream Size: 414 bytes Desc: not available URL: From Chris.Jankowski at hp.com Thu May 12 06:44:19 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 12 May 2011 06:44:19 +0000 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> Message-ID: <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> Sufyan, What username does the instance of Oracle DB run as? 
Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan From sufyan.khan at its.ws Thu May 12 07:22:01 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Thu, 12 May 2011 10:22:01 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> Message-ID: <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> First of all thanks for you quick response. Secondly please note: the working "cluster.conf" file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cluster.conf Type: application/octet-stream Size: 1457 bytes Desc: not available URL: From Andre.Gerbatsch at globalfoundries.com Fri May 13 10:09:40 2011 From: Andre.Gerbatsch at globalfoundries.com (Gerbatsch, Andre) Date: Fri, 13 May 2011 12:09:40 +0200 Subject: [Linux-cluster] qdiskd does not call heuristics regularly? Message-ID: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> Hello, Im at a point where I have different answers from different experts, read "qdiskd" source code by myself and would be happy if someone could help me: I expected in my configuration (see below) that a heuristics script will be called on a regularly bases (every "interval" s) to have a chance to influence quorumd scores if something happened with the cluster node. What I see is, that there were some cycles during quorum device initialization, after that heuristics is called "from time to time". Question: is this the expected behavior ? If yes, is there a chance to call heuristics regularly ? Question2: how can I determine the cman/qdisk version I use.. cman_1_0_??? (see rpm -qi cman) The final effect is: if I disconnect one node in a 2-node cluster from network the "wrong" node won - and heuristics had no influence on the fencing decision. Thank you in advance for any response Andre ================================================= == rpm -qi cman Name : cman Relocations: (not relocatable) Version : 2.0.115 Vendor: Red Hat, Inc. Release : 68.el5_6.1 Build Date: Mon Dec 20 19:28:36 2010 Install Date: Thu Apr 28 11:11:43 2011 Build Host: ls20-bc2-14.build.redhat.com Group : System Environment/Base Source RPM: cman-2.0.115-68.el5_6.1.src.rpm Size : 2619414 License: GPL Signature : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186 Packager : Red Hat, Inc. URL : http://sources.redhat.com/cluster/ Summary : cman - The Cluster Manager Description : cman - The Cluster Manager == cluster.conf: .. .. == > ps -eLf | grep qdiskd root 3976 1 3976 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 3978 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 4226 0 3 08:59 ? 00:00:00 qdiskd -Q root 21613 12673 21613 0 1 10:45 pts/0 00:00:00 grep qdiskd == strace "score thread" (hopefully :-) = it seems simply waiting for some timer... clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, Process 3978 detached .. 
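On the final point above (the "wrong" node winning when the network is cut): the heuristic can only influence that if it actually tests the private interconnect rather than just logging and returning 0. A minimal sketch of such a check, with the target address as a placeholder:

#!/bin/sh
# Ping something on the private interconnect that is not the other cluster
# node (switch management address, gateway, ...). The IP is a placeholder.
TARGET=192.168.10.254
ping -c 1 -w 2 $TARGET >/dev/null 2>&1
rval=$?
echo "pvtlink: $(date) $0 target=$TARGET rval=$rval" >> /root/root/cluster/checkpvtlink.log
exit $rval

With a check like this, the node that loses the private network should fail its heuristic, drop below min_score and lose its qdiskd vote, instead of the outcome being effectively arbitrary. Coming back to the strace output above: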
seems to me that this is the score thread with a "wrong" h->nextrun.. but I think I simply do not understand smthg.. cman/qdiskd/score.c: from http://git.fedorahosted.org/git/?p=cluster.git;a=summary 99 fork_heuristic(struct h_data *h) 100 { ... 110 now = time(NULL); 111 if (now < h->nextrun) 112 return 0; 113 114 h->nextrun = now + h->interval; 115 116 pid = fork(); == output from heuristic testscript > cat checkpvtlink.sh #!/bin/sh rval=0 echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log exit $rval > tail checkpvtlink.log dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ?? dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 Andre Gerbatsch MTS IT Systems Engineer Tel +49 (0) 351 277-1762 Fax +49 (0) 351 277-91762 andre.gerbatsch at globalfoundries.com GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland, Sitz Dresden I Registergericht Dresden HRA 4896 From Andre.Gerbatsch at globalfoundries.com Fri May 13 12:00:23 2011 From: Andre.Gerbatsch at globalfoundries.com (Gerbatsch, Andre) Date: Fri, 13 May 2011 14:00:23 +0200 Subject: [Linux-cluster] qdiskd does not call heuristics regularly? In-Reply-To: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> References: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> Message-ID: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> .. small correction of the qdiskd->heuristic script timing: dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <--qdiskd restart, rval=1 dummy: Fri May 13 08:59:21 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:26 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:31 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:36 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:41 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:51 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:56 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <--changed script, rval=0 dummy: Fri May 13 09:00:01 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:00:06 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:00:11 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- until this point ok (dt=5s) dummy: Fri May 13 09:01:53 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- below: ?? every 103s ? 
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- ?? no regular checks ? dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerbatsch, Andre Sent: Freitag, 13. Mai 2011 12:10 To: 'linux-cluster at redhat.com' Subject: [Linux-cluster] qdiskd does not call heuristics regularly? Hello, Im at a point where I have different answers from different experts, read "qdiskd" source code by myself and would be happy if someone could help me: I expected in my configuration (see below) that a heuristics script will be called on a regularly bases (every "interval" s) to have a chance to influence quorumd scores if something happened with the cluster node. What I see is, that there were some cycles during quorum device initialization, after that heuristics is called "from time to time". Question: is this the expected behavior ? If yes, is there a chance to call heuristics regularly ? Question2: how can I determine the cman/qdisk version I use.. cman_1_0_??? (see rpm -qi cman) The final effect is: if I disconnect one node in a 2-node cluster from network the "wrong" node won - and heuristics had no influence on the fencing decision. Thank you in advance for any response Andre ================================================= == rpm -qi cman Name : cman Relocations: (not relocatable) Version : 2.0.115 Vendor: Red Hat, Inc. Release : 68.el5_6.1 Build Date: Mon Dec 20 19:28:36 2010 Install Date: Thu Apr 28 11:11:43 2011 Build Host: ls20-bc2-14.build.redhat.com Group : System Environment/Base Source RPM: cman-2.0.115-68.el5_6.1.src.rpm Size : 2619414 License: GPL Signature : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186 Packager : Red Hat, Inc. URL : http://sources.redhat.com/cluster/ Summary : cman - The Cluster Manager Description : cman - The Cluster Manager == cluster.conf: .. .. == > ps -eLf | grep qdiskd root 3976 1 3976 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 3978 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 4226 0 3 08:59 ? 00:00:00 qdiskd -Q root 21613 12673 21613 0 1 10:45 pts/0 00:00:00 grep qdiskd == strace "score thread" (hopefully :-) = it seems simply waiting for some timer... 
clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, Process 3978 detached .. seems to me that this is the score thread with a "wrong" h->nextrun.. but I think I simply do not understand smthg.. cman/qdiskd/score.c: from http://git.fedorahosted.org/git/?p=cluster.git;a=summary 99 fork_heuristic(struct h_data *h) 100 { ... 110 now = time(NULL); 111 if (now < h->nextrun) 112 return 0; 113 114 h->nextrun = now + h->interval; 115 116 pid = fork(); == output from heuristic testscript > cat checkpvtlink.sh #!/bin/sh rval=0 echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log exit $rval > tail checkpvtlink.log dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ?? dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 Andre Gerbatsch MTS IT Systems Engineer Tel +49 (0) 351 277-1762 Fax +49 (0) 351 277-91762 andre.gerbatsch at globalfoundries.com GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland, Sitz Dresden I Registergericht Dresden HRA 4896 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From klusterfsck at outofoptions.net Fri May 13 16:12:10 2011 From: klusterfsck at outofoptions.net (Kluster Fsck) Date: Fri, 13 May 2011 12:12:10 -0400 Subject: [Linux-cluster] Virtual Network Message-ID: <4DCD585A.1090304@outofoptions.net> I inherited an old cluster that RH won't support. Red Hat Linux Advanced Server release 2.1AS. The last day the old sys admin the cluster went down and never joined. (Customer owned equipment and the UPS is failed) As a quick fix I hard coded the address on the active node. 
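For anyone in the same position, the hand-rolled version of that quick fix is just an interface alias plus a gratuitous ARP so clients follow the address. The interface, addresses and peer hostname below are placeholders, and this is only a stopgap until the cluster software is managing the address again:

# On the node that should NOT own the shared address, make sure it is gone:
ssh other-node "ifconfig eth0:1 down"
# On the node that should own it, bring it up as an alias and announce it:
ifconfig eth0:1 192.168.0.50 netmask 255.255.255.0 up
arping -c 3 -A -I eth0 192.168.0.50   # gratuitous ARP; skip this line if your arping lacks -A

Doing it in that order keeps the two machines from fighting over the same IP after a power bump.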
Life was good until last night when another power bump occured and the other machine grabbed control. This is EOL hardware/software and we are working to get off of this in the next couple of weeks. My question. What is the mechanism for bringing up the shared address? After taking the hard coded nic down I tried: service cluster stop/start. I tried bringing up the preferred node a little ahead of the non-preferred node and then tried allowing it to come up completely before brining it up on the second node. Thanks for listening. From ajb2 at mssl.ucl.ac.uk Fri May 13 21:49:59 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Fri, 13 May 2011 22:49:59 +0100 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: References: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4DCDA787.5070508@mssl.ucl.ac.uk> On 12/05/11 00:32, Ramiro Blanco wrote: >> https://bugzilla.redhat.com/show_bug.cgi?id=683155 > Can't access that one: "You are not authorized to access bug #683155" There's no reason this bug should be private, however it's addressed in test kernel kernel-2.6.18-248.el5 Steve/Bob, how about opening this one up for public view? From rpeterso at redhat.com Fri May 13 22:21:05 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 13 May 2011 18:21:05 -0400 (EDT) Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <4DCDA787.5070508@mssl.ucl.ac.uk> Message-ID: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | On 12/05/11 00:32, Ramiro Blanco wrote: | | >> https://bugzilla.redhat.com/show_bug.cgi?id=683155 | > Can't access that one: "You are not authorized to access bug | > #683155" | | There's no reason this bug should be private, however it's addressed | in | test kernel kernel-2.6.18-248.el5 | | Steve/Bob, how about opening this one up for public view? Sounds okay to me. Not sure how that's done, and not sure if I have the right authority in bugzilla to do it. Regards, Bob Peterson Red Hat File Systems From ajb2 at mssl.ucl.ac.uk Sat May 14 13:01:09 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sat, 14 May 2011 14:01:09 +0100 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4DCE7D15.8090709@mssl.ucl.ac.uk> On 13/05/11 23:21, Bob Peterson wrote: > | Steve/Bob, how about opening this one up for public view? > > Sounds okay to me. Not sure how that's done, and not sure if I have > the right authority in bugzilla to do it. I'm not entirely sure either but as the creator I think all you have to do is uncheck the private/developers boxes. AB From unknownboogyman at gmail.com Sat May 14 15:18:05 2011 From: unknownboogyman at gmail.com (Steve) Date: Sat, 14 May 2011 11:18:05 -0400 Subject: [Linux-cluster] (no subject) Message-ID: Hello all, Currently my group at college is working on a Senior Project and have created it pretty much successfully. We have a group of four test computers in a cluster before we go along with the eight we plan on. Right now we have tried a cluster software openMosix, or just follow the link below. Well, the dependencies didn't work and we couldn't install it. 
So, my question is does anyone know of stress testing software for CentOS clustering? Just regular stress testing too, like processing speed, hard-drive, yatta yatta. Basically, just four computers clustered over Ethernet (yes, I know, it'll most likely be slow). If you need anymore information, just let me know. This is the link for the one that didn't work: http://www.openmosixview.com/omtest/ -- -Steve -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Sat May 14 15:36:47 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Sat, 14 May 2011 21:06:47 +0530 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> Message-ID: Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: > First of all thanks for you quick response. > > Secondly please note: the working "cluster.conf" file is attached here, > the > previous file was not correct. > Yes the orainfra is the user name. > > Any othere clue please. > > sufyan > > > > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris > Sent: Thursday, May 12, 2011 9:44 AM > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > Sufyan, > > What username does the instance of Oracle DB run as? Is this "orainfra" or > some other username? > > The scripts assume a user named "orainfra". > If you use a different username then you need to modify the scripts > accordingly. > > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan > Sent: Thursday, 12 May 2011 16:27 > To: 'linux clustering' > Subject: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > Dear All > > I need to setup HA cluster for mu oracle dabase. > I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 > I > created RG a shared file system "/emc01" ext3 , shared IP and DB script to > monitor the DB. > My cluster starts perfectly and fail over on shutting down primary node, > also stopping shared IP fails node to failover node. > But on kill PMON , or LSNR process the node does not fails and keep showing > the status services running on primary node. > > I JUST NEED TO KNOW WHERE IS THE PROBLEM. > > ATTACHED IS DB scripts and "cluster.conf" file. > > Thanks in advance for help. > > Sufyan > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sufyan.khan at its.ws Sat May 14 19:21:25 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sat, 14 May 2011 22:21:25 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> Message-ID: <007301cc126c$200506b0$600f1410$@its.ws> Yes , you can see in attached script From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh Sent: Saturday, May 14, 2011 6:37 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: First of all thanks for you quick response. Secondly please note: the working "cluster.conf" file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: script_db.sh Type: application/octet-stream Size: 814 bytes Desc: not available URL: From sufyan.khan at its.ws Sat May 14 19:25:37 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sat, 14 May 2011 22:25:37 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> Message-ID: <007901cc126c$b5ae03b0$210a0b10$@its.ws> I can run the script by command as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development Description: Description: ITS Logo.pngT. + (965) 22409100 ext. 379 M. + (965) 99871684 F. + (965) 22405201 E. sufyan.khan at its.ws Description: Description: degital signeture.png From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh Sent: Saturday, May 14, 2011 6:37 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: First of all thanks for you quick response. Secondly please note: the working "cluster.conf" file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 130 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image002.png Type: image/png Size: 175 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.png Type: image/png Size: 5748 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.png Type: image/png Size: 8941 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.jpg Type: image/jpeg Size: 3350 bytes Desc: not available URL: From raju.rajsand at gmail.com Sat May 14 21:13:30 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Sun, 15 May 2011 02:43:30 +0530 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <007901cc126c$b5ae03b0$210a0b10$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> Message-ID: Greetings, On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: > I can run the script by command as root, but do see the script is running > in background as a daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? IMHO, They should be -- Regards, Rajagopal -------------- next part -------------- An HTML attachment was scrubbed... URL: From sufyan.khan at its.ws Sun May 15 07:06:21 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sun, 15 May 2011 10:06:21 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> Message-ID: <000901cc12ce$99532df0$cbf989d0$@its.ws> There is writing mistake, I cannot see the script is running in background. Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: I can run the script by command as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? IMHO, They should be -- Regards, Rajagopal -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Sun May 15 08:28:52 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Sun, 15 May 2011 05:28:52 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <000901cc12ce$99532df0$cbf989d0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> Message-ID: HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. 
I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan Khan > There is writing mistake, I *cannot* see the script is running in > background. > > > > Off course I stop the cluster then I run the manual script. > > > > > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Rajagopal Swaminathan > *Sent:* Sunday, May 15, 2011 12:14 AM > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > Greetings, > > On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: > > I ca nd as root, but do see the script is running in background as a > daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster > controlled services? > > ! IMHO, The >Regards, > > Rajagopal > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From sufyan.khan at its.ws Sun May 15 13:49:54 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sun, 15 May 2011 16:49:54 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> Message-ID: <007801cc1306$f93e6850$ebbb38f0$@its.ws> Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan Khan There is writing mistake, I cannot see the script is running in background. Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! 
IMHO, The >Regards, Rajagopal -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Sun May 15 16:02:42 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Sun, 15 May 2011 13:02:42 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <007801cc1306$f93e6850$ebbb38f0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> Message-ID: ok, this script checks the listener and db , if you select "base" in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan > Thanks for the tips, > > > > I need only oracle database to be monitor by Cluster NOT OPMN ( application > server) > > Any clue > > > > sufyan > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Sunday, May 15, 2011 11:29 AM > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > HI sufyan > > > Maybe is irrelevant, but, do you try with oracledb.sh script?. > I use that script and all work fine for me.... > > Regards, > Marcelo > PS: that script is placed in /usr/share/cluster/oracledb.sh, you must > modify the oracle settings, like orauser, db_instance_name, and db_virtual > name. > > 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> > > There is writing mistake, I *cannot* see the script is running in > background. > > > > Off course I stop the cluster then I run the manual script. > > > > > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Rajagopal Swaminathan > *Sent:* Sunday, May 15, 2011 12:14 AM > > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > Greetings, > > I ca nd as root, but do see the script is running in background as a > daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster > controlled services? > > ! IMHO, The >Regards, > > Rajagopal > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sufyan.khan at its.ws Sun May 15 18:14:10 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sun, 15 May 2011 21:14:10 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> Message-ID: <008e01cc132b$e4692800$ad3b7800$@its.ws> Will you share if it is not confidential. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 7:03 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon ok, this script checks the listener and db , if you select "base" in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> There is writing mistake, I cannot see the script is running in background. Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! IMHO, The >Regards, Rajagopal -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com From linux at alteeve.com Mon May 16 03:39:10 2011 From: linux at alteeve.com (Digimer) Date: Sun, 15 May 2011 23:39:10 -0400 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial Message-ID: <4DD09C5E.70508@alteeve.com> Two years ago, I set out to learn clustering. I decided the best way to ensure that I learned it properly would be to write down, as a tutorial. I expect many warts to be found, but I think it is done enough to "officially" announce it, in hopes that it might help others. This tutorial shows how to build a 2-node cluster using Red Hat's Cluster Service Stable 2, using rgmanager for resource management, DRBD and Clustered LVM for shared storage, GFS2 for definition file storage and Xen for virtualization. 
The tutorial can be found here: http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial Anyone who has been around the #linux-cluster IRC channel has probably heard me talking about this tutorial. I need to give a tremendous thank you to many of the regulars in that channel. I've put a "thanks" section at the end, but it is woefully short of all the people who have helped me over the last two years. :) Any and all feedback, particularly critical ones, are welcome and appreciated! -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From Chris.Jankowski at hp.com Mon May 16 04:29:04 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 16 May 2011 04:29:04 +0000 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: <4DD09C5E.70508@alteeve.com> References: <4DD09C5E.70508@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> Digimer, I think you published an earlier version before. Isn't it the time to introduce versioning, release dates and also list of deltas from version to version? Mundane things, I know. But if you want to make this a useful document for others they are all very necessary, I think. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: Monday, 16 May 2011 13:39 To: linux clustering Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial Two years ago, I set out to learn clustering. I decided the best way to ensure that I learned it properly would be to write down, as a tutorial. I expect many warts to be found, but I think it is done enough to "officially" announce it, in hopes that it might help others. This tutorial shows how to build a 2-node cluster using Red Hat's Cluster Service Stable 2, using rgmanager for resource management, DRBD and Clustered LVM for shared storage, GFS2 for definition file storage and Xen for virtualization. The tutorial can be found here: http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial Anyone who has been around the #linux-cluster IRC channel has probably heard me talking about this tutorial. I need to give a tremendous thank you to many of the regulars in that channel. I've put a "thanks" section at the end, but it is woefully short of all the people who have helped me over the last two years. :) Any and all feedback, particularly critical ones, are welcome and appreciated! -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." 
-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Mon May 16 04:33:30 2011 From: linux at alteeve.com (Digimer) Date: Mon, 16 May 2011 00:33:30 -0400 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> References: <4DD09C5E.70508@alteeve.com> <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4DD0A91A.3090701@alteeve.com> On 05/16/2011 12:29 AM, Jankowski, Chris wrote: > Digimer, > > I think you published an earlier version before. > Isn't it the time to introduce versioning, release dates and also list of deltas from version to version? > > Mundane things, I know. But if you want to make this a useful document for others they are all very necessary, I think. > > Regards, > > Chris Jankowski I mentioned it to some people off-list as it was being developed, but this is the first "official" announcement/release. You comment is valid, and partly addressed by the medium of being a wiki. Changes through time can be seen and tracked using the "History" button at the top of the page. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From corey.kovacs at gmail.com Mon May 16 04:58:41 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Mon, 16 May 2011 05:58:41 +0100 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: <4DD0A91A.3090701@alteeve.com> References: <4DD09C5E.70508@alteeve.com> <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> <4DD0A91A.3090701@alteeve.com> Message-ID: Nice job, I am sure this will help quite a few... -C On Mon, May 16, 2011 at 5:33 AM, Digimer wrote: > On 05/16/2011 12:29 AM, Jankowski, Chris wrote: > > Digimer, > > > > I think you published an earlier version before. > > Isn't it the time to introduce versioning, release dates and also list of > deltas from version to version? > > > > Mundane things, I know. But if you want to make this a useful document > for others they are all very necessary, I think. > > > > Regards, > > > > Chris Jankowski > > I mentioned it to some people off-list as it was being developed, but > this is the first "official" announcement/release. You comment is valid, > and partly addressed by the medium of being a wiki. Changes through time > can be seen and tracked using the "History" button at the top of the page. > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From swhiteho at redhat.com Mon May 16 09:15:56 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 16 May 2011 10:15:56 +0100 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <1305537356.2855.1.camel@menhir> Hi, On Fri, 2011-05-13 at 18:21 -0400, Bob Peterson wrote: > ----- Original Message ----- > | On 12/05/11 00:32, Ramiro Blanco wrote: > | > | >> https://bugzilla.redhat.com/show_bug.cgi?id=683155 > | > Can't access that one: "You are not authorized to access bug > | > #683155" > | > | There's no reason this bug should be private, however it's addressed > | in > | test kernel kernel-2.6.18-248.el5 > | > | Steve/Bob, how about opening this one up for public view? > > Sounds okay to me. Not sure how that's done, and not sure if I have > the right authority in bugzilla to do it. > You can just untick all the boxes which restrict it to certain groups, which I've now done, Steve. From mammadshah at hotmail.com Mon May 16 09:34:42 2011 From: mammadshah at hotmail.com (Muhammad Ammad Shah) Date: Mon, 16 May 2011 15:34:42 +0600 Subject: [Linux-cluster] rhel5.5 GFS2 Message-ID: Hi, I want to force mount the filesystem before starting other services and when relocating the services to another node, the other services should be stopped before filesystem should be unmounted on active node. Thanks, Muhammad Ammad Shah From mra at webtel.pl Mon May 16 11:29:06 2011 From: mra at webtel.pl (mr) Date: Mon, 16 May 2011 13:29:06 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> Message-ID: <4DD10A82.3000904@webtel.pl> Hello, No, I don't think so... Anyway I was able to finally set the quotes on my FS after two fails. I rebooted the server and moved the mounting point from /tmp/test to /mnt/test... It seems real strange to me but it worked. I can not find any reasonable explanation of that.... As I saw in strace gfs2 uses /tmp/ to mount its meta so maybe it was sth with that... The other thing is I'm trying to use gfs2_quota in chroot env. After some tests and changes I'm able to use gfs2_quota get command without any errors but gfs2_quota limit and gfs2_quota warn make some error although it works... "Warning: This filesystem doesn't seem to have the new quota list format or the quota list is corrupt. list, check and init operation performance will suffer due to this. It is recommended that you run the 'gfs2_quota reset' operation to reset the quota file. All current quota information will be lost and you will have to reassign all quota limits and warnings" In "real" env everything is ok, without any errors. I have already mount /dev, /proc and /sys in chroot env... I have noticed in strace output that the salt in chroot env is not being generated during quota tasks: fg. good ("real"): oldumount("/tmp/.gfs2meta.4Hd5aR") bad: oldumount("/tmp/.gfs2meta") I think sth is missing.... Any ideas? 
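One thing worth ruling out, since gfs2_quota creates and mounts a temporary directory for the GFS2 meta filesystem (the /tmp/.gfs2meta.XXXXXX entries in the straces): check that the chroot offers everything that mount needs. A rough sketch of the preparation, where the chroot path and user are placeholders and /mnt/test is the GFS2 mount point mentioned above:

#!/bin/sh
# Minimal chroot preparation for running gfs2_quota (paths are examples).
CHROOT=/var/chroot/gfs2admin
mkdir -p $CHROOT/tmp $CHROOT/dev $CHROOT/proc $CHROOT/sys $CHROOT/mnt/test
chmod 1777 $CHROOT/tmp                      # writable /tmp for the temporary meta mount point
mount --bind /dev  $CHROOT/dev
mount --bind /proc $CHROOT/proc
mount --bind /sys  $CHROOT/sys
mount --bind /mnt/test $CHROOT/mnt/test     # the GFS2 mount itself must be visible inside
chroot $CHROOT gfs2_quota get -u someuser -f /mnt/test

The missing random suffix on /tmp/.gfs2meta inside the chroot suggests the temporary-directory step falls back to a fixed name there, so a stale /tmp/.gfs2meta left over from an earlier failed run inside the chroot is also worth removing before retrying.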
Abhijith Das pisze: > ----- Original Message ----- > >> From: "mr" >> To: "linux clustering" >> Sent: Tuesday, May 10, 2011 1:06:38 AM >> Subject: Re: [Linux-cluster] gfs2 setting quota problem >> Hi, >> Steven Whitehouse pisze: >> >>> Hi, >>> >>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: >>> >>> >>>> Hello, >>>> I'm having problem to init gfs2 quota on my existing FS. >>>> >>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to >>>> set up >>>> quotas. Setting warning and limit levels seemed OK - no errors >>>> (athought >>>> I had to reset all my existing setting gfs2_quota reset...) New >>>> quota >>>> calculation ends with the following error: >>>> >>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument >>>> >>>> >>>> >>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 >>> metafs in order to make the changes that you requested. For some >>> reason >>> it seems this mount is failing. >>> >>> >> Selinux is diabled. I'm also able to mount gfs2meta manually. >> >>>> Getting some quota values fails - I'm always getting "value: 0.0" >>>> :( >>>> >>>> I have no idea what is wrong... Sombody could help? thx in advance >>>> >>>> Details: >>>> 2.6.18-194.11.1.el5 >>>> /tmp/test type gfs2 >>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) >>>> gfs2-utils.i386 0.1.62-28.el5_6.1 >>>> kmod-gfs.i686 0.1.34-2.el5 >>>> cman.i386 2.0.98-1.el5_3.4 >>>> >>>> >>>> >>>> >>>> >>> Is this CentOS or a real RHEL installation? >>> >>> >> Centos. >> >>> Steve. >>> >>> > > Hi, > > I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? > > I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. > > Thanks! > --Abhi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- mr From swhiteho at redhat.com Mon May 16 11:43:12 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 16 May 2011 12:43:12 +0100 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <4DD10A82.3000904@webtel.pl> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> <4DD10A82.3000904@webtel.pl> Message-ID: <1305546192.2855.7.camel@menhir> Hi, On Mon, 2011-05-16 at 13:29 +0200, mr wrote: > Hello, > No, I don't think so... > > Anyway I was able to finally set the quotes on my FS after two fails. I > rebooted the server and moved the mounting point from /tmp/test to > /mnt/test... It seems real strange to me but it worked. I can not find > any reasonable explanation of that.... As I saw in strace gfs2 uses > /tmp/ to mount its meta so maybe it was sth with that... > Possibly... > The other thing is I'm trying to use gfs2_quota in chroot env. After > some tests and changes I'm able to use gfs2_quota get command without > any errors but gfs2_quota limit and gfs2_quota warn make some error > although it works... > > "Warning: This filesystem doesn't seem to have the new quota list format > or the quota list is corrupt. list, check and init operation performance > will suffer due to this. It is recommended that you run the 'gfs2_quota > reset' operation to reset the quota file. 
All current quota information > will be lost and you will have to reassign all quota limits and warnings" > That sounds like a pretty old version of gfs2_quota. > In "real" env everything is ok, without any errors. I have already mount > /dev, /proc and /sys in chroot env... > > I have noticed in strace output that the salt in chroot env is not > being generated during quota tasks: fg. > > good ("real"): > oldumount("/tmp/.gfs2meta.4Hd5aR") > > bad: > oldumount("/tmp/.gfs2meta") > > I think sth is missing.... Any ideas? > One of the problems with CentOS is that it doesn't have our more recent fixes. If you used Fedora or another more uptodate distro then this problem should have long since been fixed. Also with the latest Fedora (Abhi should be able to confirm the exact version) then the standard system quota tools are available to use with GFS2. The plan is to get rid of gfs2_quota (probably fairly shortly in Fedora, but it will stay much longer in RHEL - until the end of the release, of course) and use exclusively the system quota-tools package, Steve. > Abhijith Das pisze: > > ----- Original Message ----- > > > >> From: "mr" > >> To: "linux clustering" > >> Sent: Tuesday, May 10, 2011 1:06:38 AM > >> Subject: Re: [Linux-cluster] gfs2 setting quota problem > >> Hi, > >> Steven Whitehouse pisze: > >> > >>> Hi, > >>> > >>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > >>> > >>> > >>>> Hello, > >>>> I'm having problem to init gfs2 quota on my existing FS. > >>>> > >>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to > >>>> set up > >>>> quotas. Setting warning and limit levels seemed OK - no errors > >>>> (athought > >>>> I had to reset all my existing setting gfs2_quota reset...) New > >>>> quota > >>>> calculation ends with the following error: > >>>> > >>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > >>>> > >>>> > >>>> > >>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > >>> metafs in order to make the changes that you requested. For some > >>> reason > >>> it seems this mount is failing. > >>> > >>> > >> Selinux is diabled. I'm also able to mount gfs2meta manually. > >> > >>>> Getting some quota values fails - I'm always getting "value: 0.0" > >>>> :( > >>>> > >>>> I have no idea what is wrong... Sombody could help? thx in advance > >>>> > >>>> Details: > >>>> 2.6.18-194.11.1.el5 > >>>> /tmp/test type gfs2 > >>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > >>>> gfs2-utils.i386 0.1.62-28.el5_6.1 > >>>> kmod-gfs.i686 0.1.34-2.el5 > >>>> cman.i386 2.0.98-1.el5_3.4 > >>>> > >>>> > >>>> > >>>> > >>>> > >>> Is this CentOS or a real RHEL installation? > >>> > >>> > >> Centos. > >> > >>> Steve. > >>> > >>> > > > > Hi, > > > > I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? > > > > I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. > > > > Thanks! 
> > --Abhi > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > From mguazzardo76 at gmail.com Mon May 16 12:20:07 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Mon, 16 May 2011 09:20:07 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <008e01cc132b$e4692800$ad3b7800$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> Message-ID: Hy sufyan I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, If you have any doubt, just let me know I hope that helps you Regards, Marcelo 2011/5/15 Sufyan Khan > Will you share if it is not confidential. > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Sunday, May 15, 2011 7:03 PM > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > ok, this script checks the listener a! nd db , i uot; in database type. > > I 've worked with oracle10g r2 , and it worked fine for me. > Thanks > > 2011/5/15 Sufyan Khan > > Thanks for the tips, > > > > I need only oracle database to be monitor by Cluster NOT OPMN ( application > server) > > Any clue > > > > sufyan > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Sunday, May 15, 2011 11:29 AM > > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > HI sufyan > > > > Maybe is irrelevant, but, do you try with oracledb.sh script?. > I use that script and all work fine for me.... > > Regards, > Marcelo > PS: that script is placed in /usr/share/cluster/oracledb.sh, you must > modify the oracle settings, like orauser, db_instance_name, and db_virtual > name. > > 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> > > There is writing mistake, I *cannot* see the script is running in > background.! > > Off course I stop the cluster then I run the manual script. > > > > > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Rajagopal Swaminathan > *Sent:* Sunday, May 15, 2011 12:14 AM > > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > Greetings, > > I ca nd as root, but do see the script is running in background as a > daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster > controlled services? > > ! IMHO, The >Regards, > > ! 
Rajagopal v> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-clust! er > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: oracledb.sh Type: application/x-sh Size: 21744 bytes Desc: not available URL: From Gert.Wieberdink at enovation.nl Mon May 16 12:21:18 2011 From: Gert.Wieberdink at enovation.nl (Gert Wieberdink) Date: Mon, 16 May 2011 14:21:18 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <1305546192.2855.7.camel@menhir> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> <4DD10A82.3000904@webtel.pl> <1305546192.2855.7.camel@menhir> Message-ID: <8634845864125D4D9B397A3E598995980555800F5C@MBX.emd.enovation.net> bij deze Met vriendelijke groet/With kind regards, Gert Wieberdink Sr. Engineer -----Oorspronkelijk bericht----- Van: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Namens Steven Whitehouse Verzonden: maandag 16 mei 2011 13:43 Aan: linux clustering Onderwerp: Re: [Linux-cluster] gfs2 setting quota problem Hi, On Mon, 2011-05-16 at 13:29 +0200, mr wrote: > Hello, > No, I don't think so... > > Anyway I was able to finally set the quotes on my FS after two fails. I > rebooted the server and moved the mounting point from /tmp/test to > /mnt/test... It seems real strange to me but it worked. I can not find > any reasonable explanation of that.... As I saw in strace gfs2 uses > /tmp/ to mount its meta so maybe it was sth with that... > Possibly... > The other thing is I'm trying to use gfs2_quota in chroot env. After > some tests and changes I'm able to use gfs2_quota get command without > any errors but gfs2_quota limit and gfs2_quota warn make some error > although it works... > > "Warning: This filesystem doesn't seem to have the new quota list format > or the quota list is corrupt. list, check and init operation performance > will suffer due to this. It is recommended that you run the 'gfs2_quota > reset' operation to reset the quota file. All current quota information > will be lost and you will have to reassign all quota limits and warnings" > That sounds like a pretty old version of gfs2_quota. > In "real" env everything is ok, without any errors. I have already mount > /dev, /proc and /sys in chroot env... > > I have noticed in strace output that the salt in chroot env is not > being generated during quota tasks: fg. > > good ("real"): > oldumount("/tmp/.gfs2meta.4Hd5aR") > > bad: > oldumount("/tmp/.gfs2meta") > > I think sth is missing.... Any ideas? > One of the problems with CentOS is that it doesn't have our more recent fixes. If you used Fedora or another more uptodate distro then this problem should have long since been fixed. 
Also with the latest Fedora (Abhi should be able to confirm the exact version) then the standard system quota tools are available to use with GFS2. The plan is to get rid of gfs2_quota (probably fairly shortly in Fedora, but it will stay much longer in RHEL - until the end of the release, of course) and use exclusively the system quota-tools package, Steve. > Abhijith Das pisze: > > ----- Original Message ----- > > > >> From: "mr" > >> To: "linux clustering" > >> Sent: Tuesday, May 10, 2011 1:06:38 AM > >> Subject: Re: [Linux-cluster] gfs2 setting quota problem > >> Hi, > >> Steven Whitehouse pisze: > >> > >>> Hi, > >>> > >>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > >>> > >>> > >>>> Hello, > >>>> I'm having problem to init gfs2 quota on my existing FS. > >>>> > >>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to > >>>> set up > >>>> quotas. Setting warning and limit levels seemed OK - no errors > >>>> (athought > >>>> I had to reset all my existing setting gfs2_quota reset...) New > >>>> quota > >>>> calculation ends with the following error: > >>>> > >>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > >>>> > >>>> > >>>> > >>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > >>> metafs in order to make the changes that you requested. For some > >>> reason > >>> it seems this mount is failing. > >>> > >>> > >> Selinux is diabled. I'm also able to mount gfs2meta manually. > >> > >>>> Getting some quota values fails - I'm always getting "value: 0.0" > >>>> :( > >>>> > >>>> I have no idea what is wrong... Sombody could help? thx in advance > >>>> > >>>> Details: > >>>> 2.6.18-194.11.1.el5 > >>>> /tmp/test type gfs2 > >>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > >>>> gfs2-utils.i386 0.1.62-28.el5_6.1 > >>>> kmod-gfs.i686 0.1.34-2.el5 > >>>> cman.i386 2.0.98-1.el5_3.4 > >>>> > >>>> > >>>> > >>>> > >>>> > >>> Is this CentOS or a real RHEL installation? > >>> > >>> > >> Centos. > >> > >>> Steve. > >>> > >>> > > > > Hi, > > > > I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? > > > > I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. > > > > Thanks! 
> > --Abhi > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From sufyan.khan at its.ws Mon May 16 13:01:44 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Mon, 16 May 2011 16:01:44 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> Message-ID: <009701cc13c9$68fd80a0$3af881e0$@its.ws> Thanks Marcelo Thanks for support and help Let me try and update From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Monday, May 16, 2011 3:20 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hy sufyan I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, If you have any doubt, just let me know I hope that helps you Regards, Marcelo 2011/5/15 Sufyan Khan Will you share if it is not confidential. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 7:03 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon ok, this script checks the listener a! nd db , i uot; in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> There is writing mistake, I cannot see the script is running in background.! Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! IMHO, The >Regards, ! 
Rajagopal v> -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From rhayden.public at gmail.com Mon May 16 13:16:50 2011 From: rhayden.public at gmail.com (Robert Hayden) Date: Mon, 16 May 2011 08:16:50 -0500 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <007301cc126c$200506b0$600f1410$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007301cc126c$200506b0$600f1410$@its.ws> Message-ID: On Sat, May 14, 2011 at 2:21 PM, Sufyan Khan wrote: > > Yes , you can see in attached script I can very well be miss reading the script, but with the status function, you are returning a "0" or a "1" appropriately, but I am not sure that return value is the return value for the script_db.sh. Isn't that just the return value for the status function? Meaning, you need to set the RETVAL variable in the status function to be then returned at the end of the bash script. I don't code in bash much, so RETVAL may be a special variable. I attempted to boil down the script to test. #!/bin/bash . /etc/rc.d/init.d/functions status() { return 1 } case "$1" in status) status ;; *) echo $" Not Applicable" exit 1 esac When I run the above, I see the "0" being returned. [root ~]# ./status.ksh Not Applicable [root ~]# ./status.ksh status exiting script with [root ~]# echo $? 0 echo "exiting script with $RETVAL" exit $RETVAL > > > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh > Sent: Saturday, May 14, 2011 6:37 PM > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > Hi Sufyan > > Does your status function r! eturn 0 o down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? > > On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: > > First of all thanks for you quick response. > > Secondly please note: ?the working "cluster.conf" file is attached here, the > previous file was not correct. > Yes the ?orainfra is the user name. > > Any othere clue please. > > sufyan > > > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris > Sent: Thursday, May 12, 2011 9:44 AM > To: linux clustering > Subject: Re: [! Linux-clu iling over on killin PMON > > deamon > > Sufyan, > > What username does the instance of Oracle DB run as? Is this "orainfra" or > some other username? > > The scripts assume a user named "orainfra". > If you use a different username then you need to modify the scripts > accordingly. 
> > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan > Sent: Thursday, 12 May 2011 16:27 > To: 'linux clustering' > Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > Dear All > > I need to setup HA cluster for mu oracle dabase. > I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I > created RG a ?shared fil! e system shared IP and DB script to > monitor the DB. > My cluster starts perfectly and fail over on shutting down primary node, > also stopping shared IP ?fails node to failover node. > But on kill PMON , or LSNR process the node does not fails and keep showing > the status services running on primary node. > > I JUST NEED TO KNOW WHERE IS THE PROBLEM. > > ATTACHED IS DB scripts and "cluster.conf" file. > > Thanks in advance for help. > > Sufyan > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://! www.redha nux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From mra at webtel.pl Mon May 16 13:34:09 2011 From: mra at webtel.pl (mr) Date: Mon, 16 May 2011 15:34:09 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <1305546192.2855.7.camel@menhir> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> <4DD10A82.3000904@webtel.pl> <1305546192.2855.7.camel@menhir> Message-ID: <4DD127D1.1090204@webtel.pl> ok, but why gfs2_quota works fine in "real" env and errors/warnings only appera in "chroot" env then... If this is a issue of binaries I should have seen them in both real and "chroot" env, right? Steven Whitehouse pisze: > Hi, > > On Mon, 2011-05-16 at 13:29 +0200, mr wrote: > >> Hello, >> No, I don't think so... >> >> Anyway I was able to finally set the quotes on my FS after two fails. I >> rebooted the server and moved the mounting point from /tmp/test to >> /mnt/test... It seems real strange to me but it worked. I can not find >> any reasonable explanation of that.... As I saw in strace gfs2 uses >> /tmp/ to mount its meta so maybe it was sth with that... >> >> > Possibly... > > >> The other thing is I'm trying to use gfs2_quota in chroot env. After >> some tests and changes I'm able to use gfs2_quota get command without >> any errors but gfs2_quota limit and gfs2_quota warn make some error >> although it works... >> >> "Warning: This filesystem doesn't seem to have the new quota list format >> or the quota list is corrupt. list, check and init operation performance >> will suffer due to this. It is recommended that you run the 'gfs2_quota >> reset' operation to reset the quota file. All current quota information >> will be lost and you will have to reassign all quota limits and warnings" >> >> > That sounds like a pretty old version of gfs2_quota. > > >> In "real" env everything is ok, without any errors. I have already mount >> /dev, /proc and /sys in chroot env... >> >> I have noticed in strace output that the salt in chroot env is not >> being generated during quota tasks: fg. >> >> good ("real"): >> oldumount("/tmp/.gfs2meta.4Hd5aR") >> >> bad: >> oldumount("/tmp/.gfs2meta") >> >> I think sth is missing.... Any ideas? 
>> >> > One of the problems with CentOS is that it doesn't have our more recent > fixes. If you used Fedora or another more uptodate distro then this > problem should have long since been fixed. Also with the latest Fedora > (Abhi should be able to confirm the exact version) then the standard > system quota tools are available to use with GFS2. > > The plan is to get rid of gfs2_quota (probably fairly shortly in Fedora, > but it will stay much longer in RHEL - until the end of the release, of > course) and use exclusively the system quota-tools package, > > Steve. > > >> Abhijith Das pisze: >> >>> ----- Original Message ----- >>> >>> >>>> From: "mr" >>>> To: "linux clustering" >>>> Sent: Tuesday, May 10, 2011 1:06:38 AM >>>> Subject: Re: [Linux-cluster] gfs2 setting quota problem >>>> Hi, >>>> Steven Whitehouse pisze: >>>> >>>> >>>>> Hi, >>>>> >>>>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: >>>>> >>>>> >>>>> >>>>>> Hello, >>>>>> I'm having problem to init gfs2 quota on my existing FS. >>>>>> >>>>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to >>>>>> set up >>>>>> quotas. Setting warning and limit levels seemed OK - no errors >>>>>> (athought >>>>>> I had to reset all my existing setting gfs2_quota reset...) New >>>>>> quota >>>>>> calculation ends with the following error: >>>>>> >>>>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument >>>>>> >>>>>> >>>>>> >>>>>> >>>>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 >>>>> metafs in order to make the changes that you requested. For some >>>>> reason >>>>> it seems this mount is failing. >>>>> >>>>> >>>>> >>>> Selinux is diabled. I'm also able to mount gfs2meta manually. >>>> >>>> >>>>>> Getting some quota values fails - I'm always getting "value: 0.0" >>>>>> :( >>>>>> >>>>>> I have no idea what is wrong... Sombody could help? thx in advance >>>>>> >>>>>> Details: >>>>>> 2.6.18-194.11.1.el5 >>>>>> /tmp/test type gfs2 >>>>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) >>>>>> gfs2-utils.i386 0.1.62-28.el5_6.1 >>>>>> kmod-gfs.i686 0.1.34-2.el5 >>>>>> cman.i386 2.0.98-1.el5_3.4 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> Is this CentOS or a real RHEL installation? >>>>> >>>>> >>>>> >>>> Centos. >>>> >>>> >>>>> Steve. >>>>> >>>>> >>>>> >>> Hi, >>> >>> I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? >>> >>> I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. >>> >>> Thanks! >>> --Abhi >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- mr From Colin.Simpson at iongeo.com Mon May 16 17:27:25 2011 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Mon, 16 May 2011 18:27:25 +0100 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: References: Message-ID: <1305566845.4224.34.camel@cowie.iouk.ioroot.tld> Recently I constructed a cluster for Intranet services. 
I too had to dig around for information to get all this going, it wasn't easy to find (I kind of hoped RH would have more recipes and worked examples out there for all the different services). So I also decided to write up my setup too, and as it looks pretty similar technology underlying (DRBD, CLVMD and GFS2) but as I required different services I thought I'd mention it here. Sadly my howto isn't as neat and tidy as yours (just in a blog) but covers: File Services (NFS) Printing Services (CUPS) DHCP DNS Server (named) Clustered Samba (ctdb) Intranet Web Service (HTTP) http://catsysadminblog.blogspot.com/2011/04/building-rhel-6centos-6-ha-cluster-for.html Hopefully might help someone else out there Thanks Colin On Mon, 2011-05-16 at 05:58 +0100, Corey Kovacs wrote: > Nice job, I am sure this will help quite a few... > > -C > > On Mon, May 16, 2011 at 5:33 AM, Digimer wrote: > On 05/16/2011 12:29 AM, Jankowski, Chris wrote: > > Digimer, > > > > I think you published an earlier version before. > > Isn't it the time to introduce versioning, release dates and > also list of deltas from version to version? > > > > Mundane things, I know. But if you want to make this a > useful document for others they are all very necessary, I > think. > > > > Regards, > > > > Chris Jankowski > > > I mentioned it to some people off-list as it was being > developed, but > this is the first "official" announcement/release. You comment > is valid, > and partly addressed by the medium of being a wiki. Changes > through time > can be seen and tracked using the "History" button at the top > of the page. > > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > "I feel confined, only free to expand myself within > boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > plain text document attachment (ATT666054.txt) > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. From lhh at redhat.com Mon May 16 21:39:20 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 16 May 2011 17:39:20 -0400 Subject: [Linux-cluster] rg_test for testing other resource agent functions? In-Reply-To: <20110322212940.GF13584@mip.aaaaa.org> References: <20110304194923.GX934@mip.aaaaa.org> <20110307214919.GJ17423@redhat.com> <20110322212940.GF13584@mip.aaaaa.org> Message-ID: <20110516213919.GA23451@redhat.com> On Tue, Mar 22, 2011 at 04:29:40PM -0500, Ofer Inbar wrote: > > That could be useful. > Do you have any plans to distribute this tool with cluster suite? > -- Cos > (ancient thread resurrection) https://github.com/lhh/ccs2cib The 'rgm_flatten' command is in there. -- Lon Hohberger - Red Hat, Inc. 
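For the thread subject itself: rgmanager ships an rg_test utility that can exercise individual resource agents against a copy of cluster.conf without a running cluster. A minimal sketch of the usual invocations — the configuration path is the stock one, "myservice" is a placeholder for a <service> name from the local config, and the subcommand names are as shipped with the RHEL 5 rgmanager, so check rg_test's own usage output on other releases:

# show the resource tree rgmanager would build from the configuration
rg_test test /etc/cluster/cluster.conf

# dry run: print the operation ordering without calling the agents
rg_test noop /etc/cluster/cluster.conf start service myservice

# actually run the start and stop phases of one service's agents
rg_test test /etc/cluster/cluster.conf start service myservice
rg_test test /etc/cluster/cluster.conf stop service myservice

Note that the stop phase really stops the service, so this is normally run on a test node or while rgmanager is idle.
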
From alvaro.fernandez at sivsa.com Mon May 16 22:03:24 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Tue, 17 May 2011 00:03:24 +0200 Subject: [Linux-cluster] q about post_fail_delay Message-ID: <607D6181D9919041BE792D70EF2AEC480195B102@LIMENS.sivsa.int> Hi, Do using a post_fail_delay > 0, when triggered, blocks running resources on the node, if one is not using GFS? . For example, if one only uses a couple of fs resources locally mounted in HA configuration, not shared filesystems at all. Regards, alvaro -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Mon May 16 22:52:44 2011 From: linux at alteeve.com (Digimer) Date: Mon, 16 May 2011 18:52:44 -0400 Subject: [Linux-cluster] q about post_fail_delay In-Reply-To: <607D6181D9919041BE792D70EF2AEC480195B102@LIMENS.sivsa.int> References: <607D6181D9919041BE792D70EF2AEC480195B102@LIMENS.sivsa.int> Message-ID: <4DD1AABC.8080202@alteeve.com> On 05/16/2011 06:03 PM, Alvaro Jose Fernandez wrote: > Hi, > > Do using a post_fail_delay > 0, when triggered, blocks running resources > on the node, if one is not using GFS? . For example, if one only uses a > couple of fs resources locally mounted in HA configuration, not shared > filesystems at all. > > Regards, > > alvaro I believe that all IO blocks because the cluster is not able to ensure that messages arrived to all nodes in the same order, as the silent/failed node stopped responding. This is a trait called "virtual synchrony". -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From mguazzardo76 at gmail.com Mon May 16 23:17:14 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Mon, 16 May 2011 20:17:14 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007301cc126c$200506b0$600f1410$@its.ws> Message-ID: Hy Sufyan Morning, I forgot nombrar the source that I 've followed to made a cluster this is http://people.redhat.com/lhh/oracle-rhel5-notes-0.6/oracle-notes.html Good Luck! Regards, 2011/5/16 Robert Hayden > On Sat, May 14, 2011 at 2:21 PM, Sufyan Khan wrote: > > > > Yes , you can see in attached script > > I can very well be miss reading the script, but with the status > function, you are returning a "0" or a "1" appropriately, but I am not > sure that return value is the return value for the script_db.sh. > Isn't that just the return value for the status function? Meaning, > you need to set the RETVAL variable in the status function to be then > returned at the end of the bash script. I don't code in bash much, so > RETVAL may be a special variable. I attempted to boil down the script > to test. > > #!/bin/bash > . /etc/rc.d/init.d/functions > > status() { > return 1 > } > > case "$1" in > status) > status > ;; > *) > echo $" Not Applicable" > exit 1 > esac > > When I run the above, I see the "0" being returned. > [root ~]# ./status.ksh > Not Applicable > [root ~]# ./status.ksh status > exiting script with > [root ~]# echo $? 
> 0 > > > echo "exiting script with $RETVAL" > exit $RETVAL > > > > > > > > > > > > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh > > Sent: Saturday, May 14, 2011 6:37 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > > > > > > > Hi Sufyan > > > > Does your status function r! eturn 0 o down respectively (i.e. have you > tested it works outside script_db.sh) when run as "root"? > > > > On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan > wrote: > > > > First of all thanks for you quick response. > > > > Secondly please note: the working "cluster.conf" file is attached here, > the > > previous file was not correct. > > Yes the orainfra is the user name. > > > > Any othere clue please. > > > > sufyan > > > > > > > > > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris > > Sent: Thursday, May 12, 2011 9:44 AM > > To: linux clustering > > Subject: Re: [! Linux-clu iling over on killin PMON > > > > deamon > > > > Sufyan, > > > > What username does the instance of Oracle DB run as? Is this "orainfra" > or > > some other username? > > > > The scripts assume a user named "orainfra". > > If you use a different username then you need to modify the scripts > > accordingly. > > > > Regards, > > > > Chris Jankowski > > > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan > > Sent: Thursday, 12 May 2011 16:27 > > To: 'linux clustering' > > Subject: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > > > Dear All > > > > I need to setup HA cluster for mu oracle dabase. > > I have setup two node cluster using "System-Config-Cluster .." on RHEL > 5.5 I > > created RG a shared fil! e system shared IP and DB script to > > monitor the DB. > > My cluster starts perfectly and fail over on shutting down primary node, > > also stopping shared IP fails node to failover node. > > But on kill PMON , or LSNR process the node does not fails and keep > showing > > the status services running on primary node. > > > > I JUST NEED TO KNOW WHERE IS THE PROBLEM. > > > > ATTACHED IS DB scripts and "cluster.conf" file. > > > > Thanks in advance for help. > > > > Sufyan > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://! www.redha nux-cluster > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Tue May 17 16:06:27 2011 From: carlopmart at gmail.com (carlopmart) Date: Tue, 17 May 2011 18:06:27 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% Message-ID: <4DD29D03.9080901@gmail.com> Hi all, I am running cman-3.0.12-23.el6_0.6.i686 with corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems running in KVM, each on a dedicated host. 
I have observed several times that corosync goes cpu to 95-99% in only one node. Is this a bug?? -- CL Martinez carlopmart {at} gmail {d0t} com -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Tue May 17 18:13:23 2011 From: sdake at redhat.com (Steven Dake) Date: Tue, 17 May 2011 11:13:23 -0700 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD29D03.9080901@gmail.com> References: <4DD29D03.9080901@gmail.com> Message-ID: <4DD2BAC3.50509@redhat.com> On 05/17/2011 09:06 AM, carlopmart wrote: > Hi all, > > I am running cman-3.0.12-23.el6_0.6.i686 with corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems running in KVM, each on a dedicated host. > > I have observed several times that corosync goes cpu to 95-99% in only one node. > > Is this a bug?? > > > -- > CL Martinez > carlopmart {at} gmail {d0t} com > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster yes Believe this is fixed in 1.3.1 From carlopmart at gmail.com Tue May 17 18:25:01 2011 From: carlopmart at gmail.com (carlopmart) Date: Tue, 17 May 2011 20:25:01 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD2BAC3.50509@redhat.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> Message-ID: <4DD2BD7D.5070704@gmail.com> On 05/17/2011 08:13 PM, Steven Dake wrote: > On 05/17/2011 09:06 AM, carlopmart wrote: >> Hi all, >> >> I am running cman-3.0.12-23.el6_0.6.i686 with corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems running in KVM, each on a dedicated host. >> >> I have observed several times that corosync goes cpu to 95-99% in only one node. >> >> Is this a bug?? >> >> >> -- >> CL Martinez >> carlopmart {at} gmail {d0t} com >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > yes > > Believe this is fixed in 1.3.1 > Thanks Steven ... But is it released for rhel6?? -- CL Martinez carlopmart {at} gmail {d0t} com From sdake at redhat.com Tue May 17 19:20:48 2011 From: sdake at redhat.com (Steven Dake) Date: Tue, 17 May 2011 12:20:48 -0700 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD2BD7D.5070704@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> Message-ID: <4DD2CA90.6090802@redhat.com> On 05/17/2011 11:25 AM, carlopmart wrote: > On 05/17/2011 08:13 PM, Steven Dake wrote: >> On 05/17/2011 09:06 AM, carlopmart wrote: >>> Hi all, >>> >>> I am running cman-3.0.12-23.el6_0.6.i686 with >>> corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems >>> running in KVM, each on a dedicated host. >>> >>> I have observed several times that corosync goes cpu to 95-99% in >>> only one node. >>> >>> Is this a bug?? >>> >>> >>> -- >>> CL Martinez >>> carlopmart {at} gmail {d0t} com >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> yes >> >> Believe this is fixed in 1.3.1 >> > > Thanks Steven ... But is it released for rhel6?? > RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 please open a support ticket. There is no SLA for bugzilla/mailing lists, and I can't modify shipped RHEL 6.0.z packages without support tickets. 
Regards -steve From carlopmart at gmail.com Tue May 17 19:28:50 2011 From: carlopmart at gmail.com (carlopmart) Date: Tue, 17 May 2011 21:28:50 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD2CA90.6090802@redhat.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> Message-ID: <4DD2CC72.80404@gmail.com> On 05/17/2011 09:20 PM, Steven Dake >>> yes >>> >>> Believe this is fixed in 1.3.1 >>> >> >> Thanks Steven ... But is it released for rhel6?? >> > > RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 > please open a support ticket. There is no SLA for bugzilla/mailing > lists, and I can't modify shipped RHEL 6.0.z packages without support > tickets. > > Regards > -steve Thanks Steve. -- CL Martinez carlopmart {at} gmail {d0t} com From sufyan.khan at its.ws Tue May 17 19:40:46 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Tue, 17 May 2011 22:40:46 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> Message-ID: <002801cc14ca$52ed6260$f8c82720$@its.ws> Hi Marcelo I am succeeded to run oracle DB and its restarting automatically with killing pmon process. Thanks to all. I have another question (sorry I am new to RHEL cluster) my oracle application server and DB server has different HOME directory, if I used oracledb.sh , service fails in startup because in the oracledb.sh the HOME directory is same for PMON and OPMN process. What could be the solution , I am using ricci and luci. sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Monday, May 16, 2011 3:20 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hy sufyan I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, If you have any doubt, just let me know I hope that helps you Regards, Marcelo 2011/5/15 Sufyan Khan Will you share if it is not confidential. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 7:03 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon ok, this script checks the listener a! nd db , i uot; in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... 
Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> There is writing mistake, I cannot see the script is running in background.! Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! IMHO, The >Regards, ! Rajagopal v> -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Wed May 18 00:42:35 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Tue, 17 May 2011 21:42:35 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <002801cc14ca$52ed6260$f8c82720$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> <002801cc14ca$52ed6260$f8c82720$@its.ws> Message-ID: 2011/5/17 Sufyan Khan > Hi Marcelo > > > > I am succeeded to run oracle DB and its restarting automatically with > killing pmon process. > > > > Thanks to all. > > > > I have another question (sorry I am new to RHEL cluster) my oracle > application server and DB server has different HOME directory, if I used > oracledb.sh , service fails in startup because in the oracledb.sh the HOME > directory is same for PMON and OPMN process. > > What could be the solution , I am using ricci and luci. > > > > Sufyan Sorry, I am not DBA. I don't know how help you, Maybe in this list there are a dba who can help you Regards, Marcelo. -------------- next part -------------- An HTML attachment was scrubbed... URL: From munishdh at yahoo.com Wed May 18 02:45:44 2011 From: munishdh at yahoo.com (Munish) Date: Wed, 18 May 2011 10:45:44 +0800 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <002801cc14ca$52ed6260$f8c82720$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> <002801cc14ca$52ed6260$f8c82720$@its.ws> Message-ID: Where was the problem ? What has been done to fix it? 
Cheers!!! Munish On May 18, 2011, at 3:40 AM, Sufyan Khan wrote: > Hi Marcelo > > > > I am succeeded to run oracle DB and its restarting automatically with killing pmon process. > > > > Thanks to all. > > > > I have another question (sorry I am new to RHEL cluster) my oracle application server and DB server has different HOME directory, if I used oracledb.sh , service fails in startup because in the oracledb.sh the HOME directory is same for PMON and OPMN process. > > What could be the solution , I am using ricci and luci. > > > > sufyan > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo > Sent: Monday, May 16, 2011 3:20 PM > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > Hy sufyan > > I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). > In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, > > If you have any doubt, just let me know > I h! ope that Marcelo > > > 2011/5/15 Sufyan Khan > > Will you share if it is not confidential. > > > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo > Sent: Sunday, May 15, 2011 7:03 PM > > > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > ok, this script checks the listener a! nd db , i uot; in database type. > > > I 've worked with oracle10g r2 , and it worked fine for me. > Thanks > > 2011/5/15 Sufyan Khan > > Thanks for the tips, > > > > I need only oracle database to be monitor by Cluster NOT OPMN ( application server) > > Any clue > > > > sufyan > > > > From: linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo > Sent: Sunday, May 15, 2011 11:29 AM > > > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > HI sufyan > > > > Maybe is irrelevant, but, do you try with oracledb.sh script?. > I use that script and all work fine for me.... > > Regards, > Marcelo > PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings,! like ora nd db_virtual name. > > 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> > > There is writing mistake, I cannot see the script is running in background.! > > Off course I stop the cluster then I run the manual script. > > > > > > > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan > Sent: Sunday, May 15, 2011 12:14 AM > > > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > Greetings, > > I ca nd as root, but do see the script is running in background as a daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? > > ! IMHO, The >Regards, > > ! 
Rajagopal v> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- Marcel ailto:mguazzardo76 at gmail.com" target="_blank">mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-clust! er > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linu! x-cluster f="https://www.redhat.com/mailman/listinfo/linux-cluster" target="_blank">https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajb2 at mssl.ucl.ac.uk Wed May 18 15:14:58 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Wed, 18 May 2011 16:14:58 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed Message-ID: <4DD3E272.7080709@mssl.ucl.ac.uk> Bob, Steve, Dave, Is there any progress on tuning the size of the tables (RHEL5) to allow larger values and see if they help things as far as caching goes? It would be advantageous to tweak the dentry limits too - the kernel limits this to 10% and attempts to increase are throttled back. This doesn't scale for larger memory sizes on fileservers and I think it's a hangover from 4Gb ram days. AB From swhiteho at redhat.com Wed May 18 15:31:55 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 18 May 2011 16:31:55 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <4DD3E272.7080709@mssl.ucl.ac.uk> References: <4DD3E272.7080709@mssl.ucl.ac.uk> Message-ID: <1305732715.5294.32.camel@menhir> Hi, On Wed, 2011-05-18 at 16:14 +0100, Alan Brown wrote: > Bob, Steve, Dave, > > Is there any progress on tuning the size of the tables (RHEL5) to allow > larger values and see if they help things as far as caching goes? > There is a bz open, and you should ask for that to be linked to one of your support cases, if it hasn't already been. I thought we'd concluded though that this didn't actually affect your particular workload. > It would be advantageous to tweak the dentry limits too - the kernel > limits this to 10% and attempts to increase are throttled back. > Yes, I've not forgotten this. I've been working on some similar issues recently and I'll explore this more fully once I'm done with the writeback side of things. > This doesn't scale for larger memory sizes on fileservers and I think > it's a hangover from 4Gb ram days. > > AB > Yes, it might well be, so we should certainly look into it. Again though, please ensure that you raise this through support so that (a) it doesn't get missed by accident and (b) that we are all in the loop. If there are not tickets open for these, then we need to resolve that in order to push this forward, Steve. From lhh at redhat.com Wed May 18 15:41:39 2011 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 18 May 2011 11:41:39 -0400 Subject: [Linux-cluster] qdiskd does not call heuristics regularly? 
In-Reply-To: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> References: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> Message-ID: <20110518154138.GN11022@redhat.com> On Fri, May 13, 2011 at 02:00:23PM +0200, Gerbatsch, Andre wrote: > > . small correction of the qdiskd->heuristic script timing: > dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <--qdiskd restart, rval=1 http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=a47bc261ef58cb056077c448c06a7c518dd4191d -- Lon Hohberger - Red Hat, Inc. From Benjamin.Navaro at loto-quebec.com Wed May 18 16:54:44 2011 From: Benjamin.Navaro at loto-quebec.com (Navaro Benjamin) Date: Wed, 18 May 2011 12:54:44 -0400 Subject: [Linux-cluster] CLVM - Locking Disabled In-Reply-To: <20110518154138.GN11022@redhat.com> References: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com><495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> <20110518154138.GN11022@redhat.com> Message-ID: Hi list, Is it normal for a fresh install that CLVM says that the locking is disabled while locking_type is set to 3 in lvm.conf ? [root at myhost ~]# clvmd -d CLVMD[e2bd6170]: May 18 12:42:24 CLVMD started CLVMD[e2bd6170]: May 18 12:42:24 Connected to CMAN CLVMD[e2bd6170]: May 18 12:42:24 CMAN initialisation complete CLVMD[e2bd6170]: May 18 12:42:25 DLM initialisation complete CLVMD[e2bd6170]: May 18 12:42:25 Cluster ready, doing some more initialisation CLVMD[e2bd6170]: May 18 12:42:25 starting LVM thread CLVMD[e2bd6170]: May 18 12:42:25 clvmd ready for work CLVMD[e2bd6170]: May 18 12:42:25 Using timeout of 60 seconds CLVMD[42aa8940]: May 18 12:42:25 LVM thread function started File descriptor 5 (/dev/zero) leaked on lvm invocation. Parent PID 6240: clvmd WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[42aa8940]: May 18 12:42:25 LVM thread waiting for work I guess it's related to this following warning when trying to list the vg's (while clvmd is up) : [root at myhost ~]# vgs connect() failed on local socket: No such file or directory WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. VG #PV #LV #SN Attr VSize VFree vg00 1 7 0 wz--n- 24.28G 10.44G This prevents me from creating a clustered VG (actually I can create a clustered VG, but not the LV inside). [root at myhost ~]# vgcreate -c y vggfs01 /dev/sdb2 connect() failed on local socket: No such file or directory WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. No physical volume label read from /dev/sdb2 Physical volume "/dev/sdb2" successfully created Clustered volume group "vggfs01" successfully created [root at myhost ~]# lvcreate -L 500M -n lvgfs01 vggfs01 connect() failed on local socket: No such file or directory WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Skipping clustered volume group vggfs01 [root at myhost ~]# The final goal is to build a GFS shared storage between 3 nodes. The cman part seems to be OK for the three nodes : [root at lhnq501l ~]# cman_tool services type level name id state fence 0 default 00010001 none [1 2 3] dlm 1 rgmanager 00020003 none [1 2 3] dlm 1 clvmd 00010003 none [1 2 3] This is my first RHEL cluster, and I'm not sure where to investigate right now. 
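(A sketch of the usual first checks for this fallback-to-file-based-locking symptom, for the record. Package and service names below are the RHEL 5 ones, and /var/run/lvm/clvmd.sock is the customary socket location; treat the exact paths as assumptions.)

# lvm2-cluster provides clvmd and the lvmconf helper; it must be installed on every node
rpm -q lvm2-cluster

# enable cluster-wide locking (writes locking_type = 3 into /etc/lvm/lvm.conf)
lvmconf --enable-cluster

# run clvmd as a service on every node rather than one-off in the foreground
service clvmd restart
chkconfig clvmd on

# "connect() failed on local socket" usually means this socket is absent,
# i.e. clvmd is not running (or has died) on the node where vgs was run
ls -l /var/run/lvm/clvmd.sock

# re-check: the file-based-locking warning should be gone and the
# clustered VG should now be visible to lvcreate
vgs
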
If anyone has ever seen this behaviour, any comment is appreciated, Thanks, - Ben. Mise en garde concernant la confidentialite : Le present message, comprenant tout fichier qui y est joint, est envoye a l'intention exclusive de son destinataire; il est de nature confidentielle et peut constituer une information protegee par le secret professionnel. Si vous n'etes pas le destinataire, nous vous avisons que toute impression, copie, distribution ou autre utilisation de ce message est strictement interdite. Si vous avez recu ce courriel par erreur, veuillez en aviser immediatement l'expediteur par retour de courriel et supprimer le courriel. Merci! Confidentiality Warning: This message, including any attachment, is sent only for the use of the intended recipient; it is confidential and may constitute privileged information. If you are not the intended recipient, you are hereby notified that any printing, copying, distribution or other use of this message is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email, and delete it. Thank you! From ajb2 at mssl.ucl.ac.uk Wed May 18 17:34:45 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Wed, 18 May 2011 18:34:45 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <1305732715.5294.32.camel@menhir> References: <4DD3E272.7080709@mssl.ucl.ac.uk> <1305732715.5294.32.camel@menhir> Message-ID: <4DD40335.4010406@mssl.ucl.ac.uk> Steven Whitehouse wrote: > Hi, > > On Wed, 2011-05-18 at 16:14 +0100, Alan Brown wrote: >> Bob, Steve, Dave, >> >> Is there any progress on tuning the size of the tables (RHEL5) to allow >> larger values and see if they help things as far as caching goes? >> > There is a bz open, I thought so, but I can't find it. > and you should ask for that to be linked to one of > your support cases, if it hasn't already been. I thought we'd concluded > though that this didn't actually affect your particular workload. Increasing them to 4096 hasn't but larger numbers might. >> It would be advantageous to tweak the dentry limits too - the kernel >> limits this to 10% and attempts to increase are throttled back. >> > Yes, I've not forgotten this. I've been working on some similar issues > recently and I'll explore this more fully once I'm done with the > writeback side of things. Do you have a BZ for this one? >> This doesn't scale for larger memory sizes on fileservers and I think >> it's a hangover from 4Gb ram days. >> >> AB >> > Yes, it might well be, so we should certainly look into it. Again > though, please ensure that you raise this through support so that (a) it > doesn't get missed by accident and (b) that we are all in the loop. If > there are not tickets open for these, then we need to resolve that in > order to push this forward, willdo. > > Steve. 
> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From swhiteho at redhat.com Wed May 18 17:52:24 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 18 May 2011 18:52:24 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <4DD40335.4010406@mssl.ucl.ac.uk> References: <4DD3E272.7080709@mssl.ucl.ac.uk> <1305732715.5294.32.camel@menhir> <4DD40335.4010406@mssl.ucl.ac.uk> Message-ID: <1305741144.5294.39.camel@menhir> Hi, On Wed, 2011-05-18 at 18:34 +0100, Alan Brown wrote: > Steven Whitehouse wrote: > > Hi, > > > > On Wed, 2011-05-18 at 16:14 +0100, Alan Brown wrote: > >> Bob, Steve, Dave, > >> > >> Is there any progress on tuning the size of the tables (RHEL5) to allow > >> larger values and see if they help things as far as caching goes? > >> > > There is a bz open, > > I thought so, but I can't find it. > Its #678102, which you are on the cc list of. It probably needs a RHEL5 bug as well. Bryn posted a patch to it to make the change, but I'm not sure of the current status. I'm copying in Dave Teigland so that he can comment on the current status. > > and you should ask for that to be linked to one of > > your support cases, if it hasn't already been. I thought we'd concluded > > though that this didn't actually affect your particular workload. > > Increasing them to 4096 hasn't but larger numbers might. > > >> It would be advantageous to tweak the dentry limits too - the kernel > >> limits this to 10% and attempts to increase are throttled back. > >> > > Yes, I've not forgotten this. I've been working on some similar issues > > recently and I'll explore this more fully once I'm done with the > > writeback side of things. > > Do you have a BZ for this one? > The writeback issues are under #676626 at the moment, although this is a slightly different issue to what that bug was originally opened for. There isn't a bug for the dentries issue as that needs to have a ticket opened first, and then a bz opened by support if appropriate. I've copied in Bryn so that he can pick this up and make sure that it is done, Steve. From teigland at redhat.com Wed May 18 18:12:34 2011 From: teigland at redhat.com (David Teigland) Date: Wed, 18 May 2011 14:12:34 -0400 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <1305741144.5294.39.camel@menhir> References: <4DD3E272.7080709@mssl.ucl.ac.uk> <1305732715.5294.32.camel@menhir> <4DD40335.4010406@mssl.ucl.ac.uk> <1305741144.5294.39.camel@menhir> Message-ID: <20110518181234.GB3381@redhat.com> On Wed, May 18, 2011 at 06:52:24PM +0100, Steven Whitehouse wrote: > > >> Is there any progress on tuning the size of the tables (RHEL5) to allow > > >> larger values and see if they help things as far as caching goes? > > >> > > > There is a bz open, > > > > I thought so, but I can't find it. > > > Its #678102, which you are on the cc list of. It probably needs a RHEL5 > bug as well. Bryn posted a patch to it to make the change, but I'm not > sure of the current status. I'm copying in Dave Teigland so that he can > comment on the current status. > > > > and you should ask for that to be linked to one of > > > your support cases, if it hasn't already been. I thought we'd concluded > > > though that this didn't actually affect your particular workload. > > > > Increasing them to 4096 hasn't but larger numbers might. I'd suggest applying Bryn's vmalloc patch, and trying a higher value to see if it has any effect. 
If it does, we can certainly get that patch and larger default values queued up for various releases. Thanks, Dave From klusterfsck at outofoptions.net Thu May 19 14:27:58 2011 From: klusterfsck at outofoptions.net (Kluster Fsck) Date: Thu, 19 May 2011 10:27:58 -0400 Subject: [Linux-cluster] Cannot migrate VM's Message-ID: <4DD528EE.9030206@outofoptions.net> I upgraded Red Hat Enterprise Linux Server release 5.6 to try and solve some problems with an inherited broken cluster. After some effort I was able to migrate to the upgraded machine last night. This morning I upgraded the second machine and all seemed to go well until I tried to migrate the VM's back. The system just hangs and nothing happens whether I use virsh or Virtual Machine Manager. From the machine with the vm's currently running: May 19 09:48:26 julius libvirtd: 09:48:26.384: error : qemuDomainMigrateSetMaxDowntime:11792 : invalid argument in qemuDomainMigrateSetMaxDowntime: unsupported flags (0xbc614e) May 19 09:48:26 julius libvirtd: 09:48:26.400: error : qemuDomainMigrateSetMaxDowntime:11792 : invalid argument in qemuDomainMigrateSetMaxDowntime: unsupported flags (0xbc614e) May 19 09:49:04 julius libvirtd: 09:49:04.957: error : qemuDomainWaitForMigrationComplete:5066 : operation failed: Migration was cancelled by client From the machine trying to migrate too: May 19 09:43:11 justinian libvirtd: 09:43:11.682: warning : qemudStartup:1662 : Unable to create cgroup for driver: No such device or address May 19 09:48:25 justinian libvirtd: 09:48:25.202: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.208: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.230: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.235: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.256: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.262: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.279: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.285: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.347: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.353: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:49:13 justinian libvirtd: 09:49:13.006: error : qemuDomainObjBeginJob:362 : Timed out during operation: cannot acquire state change lock May 19 09:49:13 justinian libvirtd: 09:49:13.827: error : qemudDomainBlockStats:9500 : Requested operation is not valid: domain is not running From: http://libvirt.org/drvqemu.html I tried: mkdir /dev/cgroup mount -t cgroup none /dev/cgroup -o devices [root at julius vmdata]# mount -t cgroup none /dev/cgroup -o devices mount: unknown filesystem type 'cgroup' Any 
help would be appreciated. Thank You Ken Lowther
From rossnick-lists at cybercat.ca Thu May 19 15:14:28 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 19 May 2011 11:14:28 -0400 Subject: [Linux-cluster] RedHat EL6.1 Message-ID: <90062EBE94844CE19B00124C4559D9C2@versa> Hi all ! We are running our cluster with RHEL6, and now 6.1 is out. We have an 8 node cluster, and I want to know is it "safe" to update on a running cluster ? We use GFS2 on a FC network. Is it just a matter of taking the first node, moving its service to another one, yum update, reboot, and move the next one ? Thanks, Regards,
From fdinitto at redhat.com Thu May 19 16:59:39 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 19 May 2011 18:59:39 +0200 Subject: [Linux-cluster] RedHat EL6.1 In-Reply-To: <90062EBE94844CE19B00124C4559D9C2@versa> References: <90062EBE94844CE19B00124C4559D9C2@versa> Message-ID: <4DD54C7B.1040106@redhat.com> On 05/19/2011 05:14 PM, Nicolas Ross wrote: > Is it just a matter of taking the first node, moving its service to > another one, yum update, reboot, and move the next one ? Please contact GSS that will point you to the correct documentation to perform the upgrade. In general: take the first node, move its services to another, shut down all cluster services (cman), yum update, reboot, move to the next one. Fabio
From ableisch at redhat.com Fri May 20 11:51:30 2011 From: ableisch at redhat.com (Andreas Bleischwitz) Date: Fri, 20 May 2011 13:51:30 +0200 Subject: [Linux-cluster] Mirrored LVM device and recovery Message-ID: <4DD655C2.6080406@redhat.com> Hello all, we are currently facing some handling issues using mirrored LVM lvols in a cluster: We have two different storage systems which should be mirrored using host-based mirroring. AFAIK cmirrored lvols are the only supported mirroring solution under RHEL 5.6. So we have three multipath devices which are used for 2 data and one log volume. We added these three pvs to one volumegroup and created the logical volume using the following command: lvcreate -m1 -L 10G -n lv_mirrored /dev/mpath/mpath0p1 /dev/mpath2p1 /dev/mpath/mpath1p1 The volume replicates ok and everything is fine.... until we remove one storage side of the mirror. Then LVM simply removes the missing pv and the mirror is removed - which I think is ok, as long as it is recreated after re-adding the failed mirror side. Unfortunately LVM doesn't do any such thing - is there a special configuration option which we missed? And keep in mind: there might be a huge number of LVs which have to be re-mirrored.
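A pointer that may be relevant here, offered as a sketch rather than a verified fix for this exact setup: lvm.conf has mirror fault policies that decide whether a failed mirror image is simply dropped (the behaviour described above) or re-allocated from the remaining PVs, and lvconvert can rebuild the mirror once the failed side is visible again. "vg00" below is a placeholder volume group name, and the policy names should be checked against the lvm.conf shipped with that lvm2 version:

  # /etc/lvm/lvm.conf, activation section:
  #   "remove"   - default: drop the failed image, the LV falls back to linear
  #   "allocate" - try to rebuild the failed image on other PVs in the VG
  mirror_log_fault_policy   = "allocate"
  mirror_image_fault_policy = "allocate"

  # manual re-mirroring once the failed storage side is back:
  vgextend vg00 /dev/mpath/mpath1p1      # re-add the PV if it was dropped from the VG
  lvconvert --repair vg00/lv_mirrored    # rebuild the mirror leg (and log) as needed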
So manual interaction shouldn't be the default option ;) Regards, Andreas
From rossnick-lists at cybercat.ca Fri May 20 12:37:30 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 20 May 2011 08:37:30 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com><48C9F07371214F2AA678938FDE5BD3B5@versa><4D9C768D.4060106@redhat.com><8C77012A023D431CB5AA58E6CC676A35@versa> <4D9EC4A4.3060201@redhat.com> Message-ID: <7D713A7F1242489EB185872AC93ED25B@versa> > Add "cmd_prompt" into device_opt in fence_apc. Then you will have > possibility to set --command-prompt to "apc>". > > Both fixes will be simple, feel free to create bugzilla entry for them. Hi ! It appears that development management won't fix the problem : https://bugzilla.redhat.com/show_bug.cgi?id=694894 It's not all that bad, since I now use fence_apc_snmp instead. Regards,
From rossnick-lists at cybercat.ca Fri May 20 14:12:10 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 20 May 2011 10:12:10 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com><4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> Message-ID: <3B50BA7445114813AE429BEE51A2BA52@versa> >>> Believe this is fixed in 1.3.1 >>> >> >> Thanks Steven ... But is it released for rhel6?? >> > > RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 > please open a support ticket. There is no SLA for bugzilla/mailing > lists, and I can't modify shipped RHEL 6.0.z packages without support > tickets. I am also observing this kind of behaviour, but at a different level. We have an 8 node cluster composed of dual quad-core xeon. I have now updated all the nodes to RHEL 6.1, cman is at 3.0.12-41.el6. And from time to time, for no apparent reason, one random node has a peak in cpu usage, where it's corosync that eats CPU for a minute or so. During that time services on that node respond very slowly and ssh shell access is very rough and slow as hell...
From carlopmart at gmail.com Sat May 21 09:42:32 2011 From: carlopmart at gmail.com (carlopmart) Date: Sat, 21 May 2011 11:42:32 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <3B50BA7445114813AE429BEE51A2BA52@versa> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com><4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> Message-ID: <4DD78908.2030801@gmail.com> On 05/20/2011 04:12 PM, Nicolas Ross wrote: >>>> Believe this is fixed in 1.3.1 >>>> >>> >>> Thanks Steven ... But is it released for rhel6??
>>> >> >> RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 >> please open a support ticket. There is no SLA for bugzilla/mailing >> lists, and I can't modify shipped RHEL 6.0.z packages without support >> tickets. > > I am also observing this kind of behaviour, but at a different level. We > have an 8 node cluster composed of dual quad-core xeon. I have now > updated all the nodes to RHEL 6.1, cman is at 3.0.12-41.el6. > > And from time to time, for no apparent reason, one random node has a peak > in cpu usage, where it's corosync that eats CPU for a minute or so. > During that time services on that node respond very slowly and ssh > shell access is very rough and slow as hell... Steven, is this problem confirmed to rhel6.1?? It seems that I need to downgrade my servers to rhel5.x ... -- CL Martinez carlopmart {at} gmail {d0t} com
From rossnick-lists at cybercat.ca Sat May 21 13:12:09 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Sat, 21 May 2011 09:12:09 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD78908.2030801@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> Message-ID: <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> > Steven, is this problem confirmed to rhel6.1?? It seems that I need to downgrade my servers to rhel5.x ... I've opened a support case at redhat for this. While collecting the sosreport for redhat, I found out in my /var/log/messages file something about gfs2_quotad being stalled for more than 120 seconds. Thought I had disabled quotas with the noquota option. It appears that it's "quota=off". Since I cannot change the cluster config and remount the filesystems at the moment, I did not make the change to test it. It might help you.
From carlopmart at gmail.com Sat May 21 19:07:09 2011 From: carlopmart at gmail.com (carlopmart) Date: Sat, 21 May 2011 21:07:09 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> Message-ID: <4DD80D5D.10004@gmail.com> On 05/21/2011 03:12 PM, Nicolas Ross wrote: > >> Steven, is this problem confirmed to rhel6.1?? It seems that I need to downgrade my servers to rhel5.x ... > > I've opened a support case at redhat for this. While collecting the sosreport for redhat, I found out in my /var/log/messages file something about gfs2_quotad being stalled for more than 120 seconds. Thought I had disabled quotas with the noquota option. It appears that it's "quota=off". Since I cannot change the cluster config and remount the filesystems at the moment, I did not make the change to test it. > > It might help you. > Thanks Nicolas. what bugzilla id is it??
-- CL Martinez carlopmart {at} gmail {d0t} com From rossnick-lists at cybercat.ca Sun May 22 02:24:07 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Sat, 21 May 2011 22:24:07 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD80D5D.10004@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com> Message-ID: <4DD873C7.8080402@cybercat.ca> >> I've opened a support case at redhat for this. While collecting the >> sosreport for redhat, I found ot in my var/log/message file something >> about gfs2_quotad being stalled for more than 120 seconds. Tought I >> disabled quotas with the noquota option. It appears that it's >> "quota=off". Since I cannot chane thecluster config and remount the >> filessystems at the moment, I did not made the change to tes it. >> >> It might helps you. >> > > Thanks Nicolas. what bugzilla id is?? It's not a bugzilla, it's a support case. From fdinitto at redhat.com Wed May 25 07:34:57 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 25 May 2011 09:34:57 +0200 Subject: [Linux-cluster] fence-agents 3.1.4 stable release Message-ID: <4DDCB121.1020204@redhat.com> Welcome to the fence-agents 3.1.4 release. This release contains a few bug fixes and a new fence_xenapi contributed by Matt Clark that supports Citrix XenServer and XCP. The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.4.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio Under the hood (from 3.1.3): Cedric Buissart (1): ipmilan help: login same as -l Fabio M. Di Nitto (4): Fix file permissions build: add missing file from tarball release fence_rsa: readd test info build: allow selection of agents to build and fix configure help output Lon Hohberger (1): Revert "fence_ipmilan: Correct return code for diag operation" Marek 'marx' Grac (1): fence_ipmilan: Correct return code for diag operation Matt Clark (5): New fencing script for Citrix XenServer and XCP. Updated to include xenapi script. Updated to include xenapi script in Makefile.am. Clean up of fence_xenapi patches. Moved XenAPI.py to lib directory and added to Makefile.am. Cleanup of fence_xenapi patches. Added copyright information to doc/COPYRIGHT. Fixed static reference to lib directory in fence_xenapi.py. Fixed static reference to RELEASE_VERSION and BUILD_DATE in fence_xenapi.py. configure.ac | 38 +++++- doc/COPYRIGHT | 4 + fence/agents/Makefile.am | 40 +----- fence/agents/ipmilan/ipmilan.c | 4 +- fence/agents/lib/Makefile.am | 8 +- fence/agents/lib/XenAPI.py.py | 209 ++++++++++++++++++++++++++ fence/agents/rsa/fence_rsa.py | 1 + fence/agents/xenapi/Makefile.am | 17 ++ fence/agents/xenapi/fence_xenapi.py | 231 +++++++++++++++++++++++++++++ 9 files changed, 505 insertions(+), 47 deletions(-) From hiroysato at gmail.com Sat May 28 12:39:37 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Sat, 28 May 2011 21:39:37 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? 
Message-ID: Dear members. I'm newbie Red Hat cluster. Could you point me to good documentation about command line interface?? ( cman_tool, css_tool, ccs_test, fence_ack_manual ..) Especially the following topics. * How to rejoin to node. * How to leave from node. * How to use fence_ack_manual * How to manage cluster with command line tools. One of my problem is here. The status of gfs3 which in my test cluster is JOIN_STOP_WAIT. I don't know how to re-join it. # /usr/sbin/cman_tool services type level name id state fence 0 default 00000000 JOIN_STOP_WAIT I found a keyword 'fenced_override'. This file. should be named pipe. Howevre I can't find that file in /var/run/cluter directory in my clusters. fenced working on all of clusters. Sincerely. * Environment CentOS 5.6 * Configurations [the cluster.conf posted here was stripped by the list archive] -- Hiroyuki Sato
From linux at alteeve.com Sat May 28 17:31:12 2011 From: linux at alteeve.com (Digimer) Date: Sat, 28 May 2011 13:31:12 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: Message-ID: <4DE13160.7080709@alteeve.com> On 05/28/2011 08:39 AM, Hiroyuki Sato wrote: > Dear members. > > I'm newbie Red Hat cluster. Welcome! > Could you point me to good documentation about command line interface?? > ( cman_tool, css_tool, ccs_test, fence_ack_manual ..) The man pages for these tools are well documented. > fence_ack_manual This is not supported in any way, shape or form. You *must* use a proper fence device. Do your servers have IPMI (or OEM version like DRAC, iLO, etc?). Please read this: http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Concept.3B_Virtual_Synchrony Specifically; "Concept; Virtual Synchrony" and "Concept; Fencing" > Especially the following topics. > > * How to rejoin to node. > * How to leave from node. Starting and stopping the cman service will cause the node to join and leave, respectively. You can do it manually if you wish, please check the man pages. > * How to use fence_ack_manual Again, you can't. It is not supported. > * How to manage cluster with command line tools. ccs_tool is the main program to look at. > One of my problem is here. > > The status of gfs3 which in my test cluster is JOIN_STOP_WAIT. > I don't know how to re-join it. > > # /usr/sbin/cman_tool services > type level name id state > fence 0 default 00000000 JOIN_STOP_WAIT Without a working fence device, the cluster will block forever. As far as I know, once a fence call has been issued, there is nothing that can be done to abort it. I'd suggest pulling the power on the node, boot it cleanly and start cman. > I found a keyword 'fenced_override'. This file. should be named pipe. > Howevre I can't find that file in /var/run/cluter directory in my clusters. > fenced working on all of clusters. Again, it's not supported. > Sincerely. > > > * Environment > > CentOS 5.6 > > > * Configurations > [quoted cluster.conf stripped by the list archive] This is wrong, 'expected_votes' is the number of nodes in the cluster (plus qdisk votes, if you are using it). > [remainder of the quoted cluster.conf stripped by the list archive] If you are on IRC, join #linux-cluster, it is also a great place to get help. I am usually there and will be happy to help you get a) fencing working and b) get the rest working. Welcome to clustering!
:) -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From hiroysato at gmail.com Sun May 29 10:14:41 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Sun, 29 May 2011 19:14:41 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE13160.7080709@alteeve.com> References: <4DE13160.7080709@alteeve.com> Message-ID: Hello Digimer. Thank you for your information. This is the document that I'm looking for!!. This doc is very very usuful. Thanks!!. I want to ask one thing. Please take a look my cluster configration again. Mainly I want to use GNBD on gfs_clientX. GNBD server is gfs2, and gfs3. And gfs_client's hardwhere does not support IPMI, iLO..., Because That machine is Desktop computers. And no APC like UPS. The desktop machine is just support Wake On LAN. What fence device should I use?? I'm thinking fence_wake_on_lan is proper fence device. but that is nothing.. Thank you for your advice. Regards. 2011/5/29 Digimer : > On 05/28/2011 08:39 AM, Hiroyuki Sato wrote: >> >> Dear members. >> >> I'm newbie Red Hat cluster. > > Welcome! > >> Could you point me to good documentation about command line interface?? > >> ? ( cman_tool, css_tool, ccs_test, fence_ack_manual ..) > > The man pages for these tools are well documented. > >> ?fence_ack_manual > > bat> > > This is not supported in any way, shape or form. You *must* use a proper > fence device. Do your servers have IPMI (or OEM version like DRAC, iLO, > etc?). > > Please read this: > > http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Concept.3B_Virtual_Synchrony > > Specifically; "Concept; Virtual Synchrony" and "Concept; Fencing" > >> Especially the following topics. >> >> ? * How to rejoin to node. >> ? * How to leave from node. > > Starting and stopping the cman service will cause the node to join and > leave, respectively. You can do it manually if you wish, please check the > man pages. > >> ? * How to use fence_ack_manual > > Again, you can't. It is not supported. > >> ? * How to manage cluster with command line tools. > > ccs_tool is the main program to look at. > >> One of my problem is here. >> >> The status of gfs3 which in my test cluster is JOIN_STOP_WAIT. >> I don't know how to re-join it. >> >> # /usr/sbin/cman_tool services >> type ? ? ? ? ? ? level name ? ? id ? ? ? state >> fence ? ? ? ? ? ?0 ? ? default ?00000000 JOIN_STOP_WAIT > > Without a working fence device, the cluster will block forever. As far as I > know, once a fence call has been issued, there is nothing that can be done > to abort it. I'd suggest pulling the power on the node, boot it cleanly and > start cman. > >> I found a keyword 'fenced_override'. This file. should be named pipe. >> Howevre I can't find that file in /var/run/cluter directory in my >> clusters. >> fenced working on all of clusters. > > Again, it's not supported. > >> Sincerely. >> >> >> * Environment >> >> ? CentOS 5.6 >> >> >> * Configurations >> >> >> >> ? > > This is wrong, 'expected_votes' is the number of nodes in the cluster (plus > qdisk votes, if you are using it). > >> ? >> ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? 
? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? >> ? >> ? ? >> ? >> ? >> ? ? >> ? ? >> ? >> > > If you are on IRC, join #linux-cluster, it is also a great place to get > help. I am usually there and will be happy to help you get a) fencing > working and b) get the rest working. > > Welcome to clustering! :) > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: ?http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > -- Hiroyuki Sato From linux at alteeve.com Sun May 29 16:00:57 2011 From: linux at alteeve.com (Digimer) Date: Sun, 29 May 2011 12:00:57 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> Message-ID: <4DE26DB9.6020905@alteeve.com> On 05/29/2011 06:14 AM, Hiroyuki Sato wrote: > Hello Digimer. > > Thank you for your information. > > This is the document that I'm looking for!!. > This doc is very very usuful. Thanks!!. Wonderful, I'm glad you find it useful. :) > I want to ask one thing. > > Please take a look my cluster configration again. Will do, comments will be in-line. > Mainly I want to use GNBD on gfs_clientX. > GNBD server is gfs2, and gfs3. > > And gfs_client's hardwhere does not support IPMI, iLO..., > Because That machine is Desktop computers. > > And no APC like UPS. > > The desktop machine is just support Wake On LAN. > > What fence device should I use?? > I'm thinking fence_wake_on_lan is proper fence device. > but that is nothing.. The least expensive option for a commercial product would be APC's switched PDU. You have 13 machines, so you would need either 2 of the 1U models, or 1 of the 0U models. If you are in North America, you can use these: http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900 or http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7931 If you are in Japan, you'll need to select the best one of these: http://www.apc.com/products/family/index.cfm?id=70&ISOCountryCode=JP Whichever you get, you can use the 'fence_apc' fence agent. > Thank you for your advice. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regards. Outside of the "fence_manual" issue, this looks fine. You will probably want to get the GFS and GNBD stuff into rgmanager, but that can come later after you have fencing working and the core of the cluster tested and working. Take a look at this: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/s1-gnbd-mp-sn.html It discusses fencing with GNBD. Below is the start of the Red Hat document on GNBD in EL5 that you may find helpful, if you haven't read it already. 
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/ch-gnbd.html Let me know if you want/need any more help. I'll be happy to see what I can do. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From tom+linux-cluster at oneshoeco.com Mon May 30 09:45:27 2011 From: tom+linux-cluster at oneshoeco.com (Tom Lanyon) Date: Mon, 30 May 2011 19:15:27 +0930 Subject: [Linux-cluster] How do you HA your storage? In-Reply-To: <1304154086.10889.1446718041@webmail.messagingengine.com> References: <1304154086.10889.1446718041@webmail.messagingengine.com> Message-ID: On 30/04/2011, at 6:31 PM, urgrue wrote: > > I'm struggling to find the best way to deal with SAN failover. > By this I mean the common scenario where you have SAN-based mirroring. > It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but how > can you minimize the impact and manual effort to recover from losing a > LUN, and needing to somehow get your system to realize the data is now > on a different LUN (the now-active mirror)? urgrue, As others have mentioned, this may be a little off-topic for the list. However, I reply in support of hopefully providing an answer to your original question. In my experience the destination array of storage-based (i.e. array-to-array) replication is able to present the replication target LUN with the same ID (e.g. WWN) as that of the source LUN on the source array. In this scenario, you would present the replicated LUN on the destination array to your server(s), and your multipathing (i.e. device-mapper-multipath) software would essentially see it as another path to the same device. You obviously need to ensure that the priority of these paths are such that no I/O operations will traverse them unless the paths to the source array have failed. In the case of a failure on the source array, it's paths will (hopefully!) be marked as failed, your multipath software will start queueing I/O, the destination array will detect the source array failure and switch its LUN presentation to read/write and your multipathing software will resume I/O on the new paths. There's a lot to consider here. Such live failover can often be asking for trouble, and given the total failure rates of high-end storage equipment is quite minimal, I'd only implement if absolutely required. The above assumes synchronous replication between the arrays. Hope this helps somewhat. Tom From Chris.Jankowski at hp.com Mon May 30 11:14:37 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 30 May 2011 11:14:37 +0000 Subject: [Linux-cluster] How do you HA your storage? In-Reply-To: References: <1304154086.10889.1446718041@webmail.messagingengine.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F6710F09B@GVW1113EXC.americas.hpqcorp.net> There is a school of thought among practitioners of Business Continuity that says: HA != DR The two cover different domains and mixing the two concepts leads to horror stories. Essentially, HA covers a single (small or large) component failure. If components can be duplicated and work in parallel (e.g. disks, paths, controllers) then failure of one component may be transparent to the end users. If they carry state e.g. a server then you replace the element and recover stable state - hence a HA cluster. 
The action taken is automatic and the outcome can be guaranteed if only one component failed. DR covers multiple and not necessarily simultaneous component failures. They may result from large catastrophic events such as a fire in the data centre. As the extent of damage is not known then a human must be in the loop - to declare a disaster and initiate execution of a disaster recovery plan. Software has horrible problems distinguishing between a hole in the ground from a massive bomb blast and a puff of smoke from a little short circuit in a power supply (:-)). Humans do better here. The results of execution of a disaster recovery plan can be achieved by very careful design for geographical separation, so a disaster does not invalidate redundancy. The execution itself can be automated, but is initiated by a human - push button solution. Typically DR is layered on top of HA e.g. HA clusters in each location to protect against single component failures and data replication from the active to the DR site to maintain complete state in geographically distant location. The typical cost ratios are 1=>4=>16 for single system => HA cluster => complete DR solution. That is why there are very few properly designed, built, tested and maintained DR solutions based on two HA clusters and replication. --------- I believe that you are trying to configure a stretched cluster that would provide some automatic DR capabilities. The problem with stretched cluster solutions is that they do not normally take into consideration multiple, non-simultaneous component failures. I suggest that you think carefully what happens in such system depending on which fibre melts first and which disk seizes up first in a fire. You will soon find out that the software lacks the notion of locally consistent groups. The only cluster that ever did that location stuff right was DEC VMS cluster 25 years ago. Stretched VMS clusters did work correctly. The cost was horrendous though. --------- You can also try to make storage somebody else's problem by using a storage array that enables you to build HA geographically extended configuration. Believe it or not, there is one like that - P4000 from HP (formerly from Left Hand Networks). Of course, you still would need to properly design and configure such extended configuration, but it is a fully supported solution from the vendor. You can play with it by downloading evaluation copies of the software - VSA - Virtual Storage Appliance from HP site. Regards, Chris Jankowski Once you ado -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tom Lanyon Sent: Monday, 30 May 2011 19:45 To: linux clustering Subject: Re: [Linux-cluster] How do you HA your storage? On 30/04/2011, at 6:31 PM, urgrue wrote: > > I'm struggling to find the best way to deal with SAN failover. > By this I mean the common scenario where you have SAN-based mirroring. > It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but how > can you minimize the impact and manual effort to recover from losing a > LUN, and needing to somehow get your system to realize the data is now > on a different LUN (the now-active mirror)? urgrue, As others have mentioned, this may be a little off-topic for the list. However, I reply in support of hopefully providing an answer to your original question. In my experience the destination array of storage-based (i.e. array-to-array) replication is able to present the replication target LUN with the same ID (e.g. 
WWN) as that of the source LUN on the source array. In this scenario, you would present the replicated LUN on the destination array to your server(s), and your multipathing (i.e. device-mapper-multipath) software would essentially see it as another path to the same device. You obviously need to ensure that the priority of these paths are such that no I/O operations will traverse them unless the paths to the source array have failed. In the case of a failure on the source array, it's paths will (hopefully!) be marked as failed, your multipath software will start queueing I/O, the destination array will detect the source array failure and switch its LUN presentation to read/write and your multipathing software will resume I/O on the new paths. There's a lot to consider here. Such live failover can often be asking for trouble, and given the total failure rates of high-end storage equipment is quite minimal, I'd only implement if absolutely required. The above assumes synchronous replication between the arrays. Hope this helps somewhat. Tom -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From hiroysato at gmail.com Mon May 30 11:35:38 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Mon, 30 May 2011 20:35:38 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE26DB9.6020905@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> Message-ID: Hello Digimer Thank you for your advice. * GNBD I've already succeed to mount GNBD. locking_type = 1 Should I change lock_type = 3 ?, If not, what problem will be occur?? * fence_apc some of reason, I can't get use APC switch. (That configuration example is test environment. ) so I asked alternative solution. * fence_wol I can't find fence_wake_on_lan. so I'm thinking to create it. WOL supports Power on and Power off ( I'll test later ). So, It's will be fence tool. And I downloaded fence_na, It was written in Perl script. so I want to change fence_na to use wol command. Could you point me to good reference to build fence_wol. (Of course!!. fence_na is good reference) Thank you for your advice again. Regards. 2011/5/30 Digimer : > On 05/29/2011 06:14 AM, Hiroyuki Sato wrote: >> >> Hello Digimer. >> >> Thank you for your information. >> >> This is the document that I'm looking for!!. >> This doc is very very usuful. Thanks!!. > > Wonderful, I'm glad you find it useful. :) > >> I want to ask one thing. >> >> Please take a look my cluster configration again. > > Will do, comments will be in-line. > >> Mainly I want to use GNBD on gfs_clientX. >> GNBD server is gfs2, and gfs3. >> >> And gfs_client's hardwhere does not support IPMI, iLO..., >> Because That machine is Desktop computers. >> >> And no APC like UPS. >> >> The desktop machine is just support Wake On LAN. >> >> What fence device should I use?? >> I'm thinking fence_wake_on_lan is proper fence device. >> but that is nothing.. > > The least expensive option for a commercial product would be APC's switched > PDU. You have 13 machines, so you would need either 2 of the 1U models, or 1 > of the 0U models. 
> > If you are in North America, you can use these: > > http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900 > > or > > http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7931 > > If you are in Japan, you'll need to select the best one of these: > > http://www.apc.com/products/family/index.cfm?id=70&ISOCountryCode=JP > > Whichever you get, you can use the 'fence_apc' fence agent. > >> Thank you for your advice. >> >> >> >> ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? >> ? >> ? ? >> ? >> ? >> ? ? >> ? ? >> ? >> >> >> Regards. > > Outside of the "fence_manual" issue, this looks fine. You will probably want > to get the GFS and GNBD stuff into rgmanager, but that can come later after > you have fencing working and the core of the cluster tested and working. > > Take a look at this: > > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/s1-gnbd-mp-sn.html > > It discusses fencing with GNBD. Below is the start of the Red Hat document > on GNBD in EL5 that you may find helpful, if you haven't read it already. > > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/ch-gnbd.html > > Let me know if you want/need any more help. I'll be happy to see what I can > do. > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > -- Hiroyuki Sato From linux at alteeve.com Mon May 30 12:26:53 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 08:26:53 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> Message-ID: <4DE38D0D.8010800@alteeve.com> On 05/30/2011 07:35 AM, Hiroyuki Sato wrote: > Hello Digimer > > Thank you for your advice. > > * GNBD > I've already succeed to mount GNBD. > locking_type = 1 > Should I change lock_type = 3 ?, > If not, what problem will be occur?? To be honest, I'm not familiar with GNBD. The locking needs to use DLM I do believe, so check the documentation to ensure that is the case. > * fence_apc > some of reason, I can't get use APC switch. > (That configuration example is test environment. ) > so I asked alternative solution. Ah, ok. > * fence_wol > > I can't find fence_wake_on_lan. so I'm thinking to create it. > WOL supports Power on and Power off ( I'll test later ). > So, It's will be fence tool. > > And I downloaded fence_na, It was written in Perl script. 
> so I want to change fence_na to use wol command. > > > Could you point me to good reference to build fence_wol. > (Of course!!. fence_na is good reference) Does wake-on-lan allow for: a) Forcing a node to power off, or does it just start an ACPI shutdown? b) Can you check that the node is successfully off using wol? Unless wol can force a node off (ie: in the case of a hung OS) and can return the current power state of the node, then I would be hesitant to use it. As a general question though; You will need to write a script that follows the FenceAgentAPI: https://fedorahosted.org/cluster/wiki/FenceAgentAPI You could make a few Node Assassin devices if you have access to Arduino boards and don't mind soldering. However, you have 13 nodes, so you'd need five of them... Not sure if that is feasible. If you want to to a test cluster with fewer nodes though, it's probably more reasonable. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From Ralph.Grothe at itdz-berlin.de Mon May 30 12:28:34 2011 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Mon, 30 May 2011 14:28:34 +0200 Subject: [Linux-cluster] How to integrate a custom resource agent into RHCS? Message-ID: Hi, I hope this is the right forum. So bear with me Pacemaker aficionados et alii when I talk about Red Hat Cluster Suite (RHCS). That's the clusterware product I am given to set up the cluster and I'm not free to chose another software of my liking. Though this may sound ridiculous, since days I've been labouring to get a fairly simple custom resource agent (hence RA) to be acknowledged by RHCS and correctly executed through its rgmanager. When scripting my RA I mostly adhered to http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html apart from where RHCS RAs differs from general OCF. I put my RA in /usr/share/cluster and afterwards restarted rgmanager on all nodes. When I try to start the service whereof my RA's managed resource is part of the service though gets started but not my resource, as if it wasn't part of the service at all. When I try to start my resource via rg_test nothing happens apart from this obscure log entry [root at aruba:~] # rg_test test /etc/cluster/cluster.conf start aDIStn_sec Running in test mode. Entity: line 2: parser error : Char 0x0 out of allowed range ^ Entity: line 2: parser error : Premature end of data in tag error line 1 ^ [root at aruba:~] # echo $? 0 [root at aruba:~] # grep rg_test /var/log/cluster.log|tail -1 May 30 13:54:55 aruba rg_test: [28643]: Cannot dump meta-data because '/usr/share/cluster/default.metadata' is missing Though this is true [root at aruba:~] # ls -l /usr/share/cluster/default.metadata ls: /usr/share/cluster/default.metadata: No such file or directory there isn't such a file part of the installed clusterware at all either [root at aruba:~] # yum groupinfo Clustering|tail -10|xargs rpm -ql|grep -c default\\.metadata 0 And besides, I don't understand this error because since I wrote my RA according to above mentioned RA Developer's Guide it of course dumps its metadata [root at aruba:~] # /usr/share/cluster/aDIStn_sec.sh meta-data|grep action (note, RHCS deviates from OCF here in naming its actions verify-all instead of validate-all and status instead of monitor. 
But both refer to the same case block in my RA) I also don't understand the "Char 0x0 out of allowed range" error from the XML parser. If it really refers to line 2 of my cluster.conf this looks pretty ok to me [root at aruba:~] # sed -n 2p /etc/cluster/cluster.conf If I run a validity check of the XML of my cluster.conf against RHCS's RNG schema I also get an incomprehensible error about extra elements in interleave. Nevertheless, all other resources of my cluster which rely on RHCS's standard RAs are managed ok by the clusterware. [root at aruba:~] # declare -f cluconf_valid cluconf_valid () { xmllint --noout --relaxng /usr/share/system-config-cluster/misc/cluster.ng ${1:-/etc/cluster/cluster.conf} } [root at aruba:~] # cluconf_valid Relax-NG validity error : Extra element cman in interleave /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error : Element cluster failed to validate content /etc/cluster/cluster.conf fails to validate Btw. is there a schema file available to check an RA's metadata for validity? Of course did I test my RA script for correct functionality when used like an init script (to which end I provide the required environment of OCF_RESKEY_parameter(s)), and it starts, stops and monitors my resource as intended. Can anyone help? Regards Ralph From linux at alteeve.com Mon May 30 13:15:37 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 09:15:37 -0400 Subject: [Linux-cluster] How to integrate a custom resource agent into RHCS? In-Reply-To: References: Message-ID: <4DE39879.4070406@alteeve.com> On 05/30/2011 08:28 AM, Ralph.Grothe at itdz-berlin.de wrote: > Hi, > > I hope this is the right forum. So bear with me Pacemaker > aficionados et alii when I talk about Red Hat Cluster Suite > (RHCS). > That's the clusterware product I am given to set up the cluster > and I'm not free to chose another software of my liking. > > Though this may sound ridiculous, since days I've been labouring > to get a fairly simple custom resource agent (hence RA) to be > acknowledged by RHCS and correctly executed through its > rgmanager. > > When scripting my RA I mostly adhered to > http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html apart > from where RHCS RAs differs from general OCF. > > I put my RA in /usr/share/cluster and afterwards restarted > rgmanager on all nodes. > > When I try to start the service whereof my RA's managed resource > is part of the service though gets started but not my resource, > as if it wasn't part of the service at all. > > > When I try to start my resource via rg_test nothing happens apart > from this obscure log entry > > > [root at aruba:~] > # rg_test test /etc/cluster/cluster.conf start aDIStn_sec > Running in test mode. > Entity: line 2: parser error : Char 0x0 out of allowed range > > ^ > Entity: line 2: parser error : Premature end of data in tag error > line 1 > > ^ > [root at aruba:~] > # echo $? 
> 0 > > [root at aruba:~] > # grep rg_test /var/log/cluster.log|tail -1 > May 30 13:54:55 aruba rg_test: [28643]: Cannot dump > meta-data because '/usr/share/cluster/default.metadata' is > missing > > > Though this is true > > [root at aruba:~] > # ls -l /usr/share/cluster/default.metadata > ls: /usr/share/cluster/default.metadata: No such file or > directory > > there isn't such a file part of the installed clusterware at all > either > > [root at aruba:~] > # yum groupinfo Clustering|tail -10|xargs rpm -ql|grep -c > default\\.metadata > 0 > > And besides, I don't understand this error because since I wrote > my RA according to above mentioned RA Developer's Guide it of > course dumps its metadata > > > [root at aruba:~] > # /usr/share/cluster/aDIStn_sec.sh meta-data|grep action > > > > > > > > > > > (note, RHCS deviates from OCF here in naming its actions > verify-all instead of validate-all and status instead of monitor. > But both refer to the same case block in my RA) > > > I also don't understand the "Char 0x0 out of allowed range" error > from the XML parser. > > If it really refers to line 2 of my cluster.conf this looks > pretty ok to me > > > [root at aruba:~] > # sed -n 2p /etc/cluster/cluster.conf > > > > If I run a validity check of the XML of my cluster.conf against > RHCS's RNG schema I also get an incomprehensible error about > extra elements in interleave. > > Nevertheless, all other resources of my cluster which rely on > RHCS's standard RAs are managed ok by the clusterware. > > > > [root at aruba:~] > # declare -f cluconf_valid > cluconf_valid () > { > xmllint --noout --relaxng > /usr/share/system-config-cluster/misc/cluster.ng > ${1:-/etc/cluster/cluster.conf} > } > [root at aruba:~] > # cluconf_valid > Relax-NG validity error : Extra element cman in interleave > /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity > error : Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > > Btw. is there a schema file available to check an RA's metadata > for validity? > > > > Of course did I test my RA script for correct functionality when > used like an init script (to which end I provide the required > environment of OCF_RESKEY_parameter(s)), > and it starts, stops and monitors my resource as intended. > > > Can anyone help? > > > Regards > Ralph Can you paste in your cluster.conf file? Please only alter the passwords. Generally speaking, if your scripts can work like init.d script (taking start/stop/status arguments), then you should be able to use the "script" resource type. I am not too familiar with OCF, I am afraid, but I think I can help with RHCS as that is what I am most familiar with. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From hiroysato at gmail.com Mon May 30 14:49:15 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Mon, 30 May 2011 23:49:15 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE38D0D.8010800@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> Message-ID: Hello Digimer. Thank you for your advice. It is very very useful information for me. > a) Forcing a node to power off, or does it just start an ACPI shutdown? Maybe ok. I'll test it. > b) Can you check that the node is successfully off using wol? 
I'm not sure, I'll test it. Could you tell me one more thing. Where fenced will call fence agent?? It is mean that the following * Can I check where fenced daemon will call fence_agent when I execute fence_node?? (that message send to master fenced, or localhost??) * And Can I check ``where are master'' with command?? (If fenced is master-slave type) * Can I control master priority. (for example I want to specify gfs1, gfs2, gfs3 as fenced master) Thanks again Regards. 2011/5/30 Digimer : > On 05/30/2011 07:35 AM, Hiroyuki Sato wrote: >> >> Hello Digimer >> >> Thank you for your advice. >> >> * GNBD >> ? I've already succeed to mount GNBD. >> ? locking_type = 1 >> ? Should I change lock_type = 3 ?, >> ? If not, what problem will be occur?? > > To be honest, I'm not familiar with GNBD. The locking needs to use DLM I do > believe, so check the documentation to ensure that is the case. > >> * fence_apc >> ?some of reason, I can't get use APC switch. >> ?(That configuration example is test environment. ) >> ?so I asked alternative solution. > > Ah, ok. > >> * fence_wol >> >> ?I can't find fence_wake_on_lan. so I'm thinking to create it. >> ?WOL supports Power on and Power off ( I'll test later ). >> ?So, It's will be fence tool. >> >> ?And I downloaded fence_na, It was written in Perl script. >> ?so I want to change fence_na to use wol command. >> >> >> ?Could you point me to good reference to build fence_wol. >> ?(Of course!!. fence_na is good reference) > > Does wake-on-lan allow for: > > a) Forcing a node to power off, or does it just start an ACPI shutdown? > b) Can you check that the node is successfully off using wol? > > Unless wol can force a node off (ie: in the case of a hung OS) and can > return the current power state of the node, then I would be hesitant to use > it. > > As a general question though; You will need to write a script that follows > the FenceAgentAPI: > > https://fedorahosted.org/cluster/wiki/FenceAgentAPI > > You could make a few Node Assassin devices if you have access to Arduino > boards and don't mind soldering. However, you have 13 nodes, so you'd need > five of them... Not sure if that is feasible. If you want to to a test > cluster with fewer nodes though, it's probably more reasonable. > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > -- Hiroyuki Sato From linux at alteeve.com Mon May 30 15:06:45 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 11:06:45 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> Message-ID: <4DE3B285.1010105@alteeve.com> On 05/30/2011 10:49 AM, Hiroyuki Sato wrote: > Hello Digimer. > > Thank you for your advice. > It is very very useful information for me. > >> a) Forcing a node to power off, or does it just start an ACPI shutdown? > > Maybe ok. I'll test it. To test, hang the host (echo c > /proc/sysrq-trigger), then try to force it to power off with wol. If this succeeds, you are in business. I have my doubts though. >> b) Can you check that the node is successfully off using wol? > > I'm not sure, I'll test it. Please do. If you can though, it will make IPMI far less needed. :) > Could you tell me one more thing. > > Where fenced will call fence agent?? 
> It is mean that the following > > * Can I check where fenced daemon will call fence_agent when I > execute fence_node?? > (that message send to master fenced, or localhost??) > * And Can I check ``where are master'' with command?? (If fenced is > master-slave type) > * Can I control master priority. > (for example I want to specify gfs1, gfs2, gfs3 as fenced master) > > Thanks again > > Regards. I'm not sure about the internals of cman, so I am not sure which machine actually sends the fence command. I do know that it has to come from a machine with quorum, and I do believe it is handled by the cluster manager. It's not like pacemaker where a DC is clearly defined. I'll try to sort out how the internals work and will let you know. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From linux at alteeve.com Mon May 30 15:23:55 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 11:23:55 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> Message-ID: <4DE3B68B.9080605@alteeve.com> On 05/30/2011 10:49 AM, Hiroyuki Sato wrote: > Where fenced will call fence agent?? > It is mean that the following > > * Can I check where fenced daemon will call fence_agent when I > execute fence_node?? > (that message send to master fenced, or localhost??) > * And Can I check ``where are master'' with command?? (If fenced is > master-slave type) > * Can I control master priority. > (for example I want to specify gfs1, gfs2, gfs3 as fenced master) > > Thanks again > > Regards. It looks like the node with the lowest ID that is quorate sends the actual fence call. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From kkovachev at varna.net Mon May 30 15:30:41 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 30 May 2011 18:30:41 +0300 Subject: [Linux-cluster] =?utf-8?q?=5BQ=5D_Good_documentation_about_comman?= =?utf-8?q?d_line_interface=3F=3F?= In-Reply-To: <4DE3B285.1010105@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> <4DE3B285.1010105@alteeve.com> Message-ID: <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> On Mon, 30 May 2011 11:06:45 -0400, Digimer wrote: > On 05/30/2011 10:49 AM, Hiroyuki Sato wrote: >> Hello Digimer. >> >> Thank you for your advice. >> It is very very useful information for me. >> >>> a) Forcing a node to power off, or does it just start an ACPI shutdown? >> >> Maybe ok. I'll test it. > > To test, hang the host (echo c > /proc/sysrq-trigger), then try to force > it to power off with wol. If this succeeds, you are in business. I have > my doubts though. > >>> b) Can you check that the node is successfully off using wol? >> >> I'm not sure, I'll test it. > > Please do. If you can though, it will make IPMI far less needed. :) > >> Could you tell me one more thing. >> >> Where fenced will call fence agent?? >> It is mean that the following >> >> * Can I check where fenced daemon will call fence_agent when I >> execute fence_node?? >> (that message send to master fenced, or localhost??) 
>> * And Can I check ``where are master'' with command?? (If fenced is >> master-slave type) >> * Can I control master priority. >> (for example I want to specify gfs1, gfs2, gfs3 as fenced master) >> >> Thanks again >> >> Regards. > > I'm not sure about the internals of cman, so I am not sure which machine > actually sends the fence command. I do know that it has to come from a > machine with quorum, and I do believe it is handled by the cluster > manager. It's not like pacemaker where a DC is clearly defined. > > I'll try to sort out how the internals work and will let you know. Not sure where i got this information from (i think it was on this list), but for sure: the node with the lowest ID, which is quorate, will take the responsibility to call the fencing script From linux at alteeve.com Mon May 30 15:34:30 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 11:34:30 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> <4DE3B285.1010105@alteeve.com> <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> Message-ID: <4DE3B906.4080005@alteeve.com> On 05/30/2011 11:30 AM, Kaloyan Kovachev wrote: >> actually sends the fence command. I do know that it has to come from a >> machine with quorum, and I do believe it is handled by the cluster >> manager. It's not like pacemaker where a DC is clearly defined. >> >> I'll try to sort out how the internals work and will let you know. > > Not sure where i got this information from (i think it was on this list), > but for sure: the node with the lowest ID, which is quorate, will take the > responsibility to call the fencing script Indeed, you are right. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From hiroysato at gmail.com Mon May 30 16:54:30 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Tue, 31 May 2011 01:54:30 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE3B906.4080005@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> <4DE3B285.1010105@alteeve.com> <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> <4DE3B906.4080005@alteeve.com> Message-ID: Hello Digimer and Kaloyan Thank you for your information. I'll set gfs1, gfs2 and gfs3 with lowest ID (ex, 1,2,3). I found the following Notes in fenced/recover.c recover.c Notes: - When fenced is started, the complete list is initialized to all the nodes in cluster.conf. - fence_victims actually only runs on one of the nodes in the domain so that a victim isn't fenced by everyone. - The node to run fence_victims is the node with lowest id that's in both complete and prev lists. - This node will never be a node that's just joining since by definition the joining node wasn't in the last complete group. - An exception to this is when there is just one node in the group in which case it's chosen even if it wasn't in the last complete group. - There's also a leaving list that parallels the victims list but are not fenced. Here is call procedures. recover.c do_recovery fence_victims dispatch_fence_agent agent.c dispatch_fence_agent use_device run_agent exec fence_XXXX Regards. 
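For readers following the fence_wol idea from earlier in this thread, a very rough sketch of the agent that ends up being exec'd by run_agent is shown below. It only illustrates the calling convention described on the FenceAgentAPI page Digimer linked: fenced hands the agent its arguments as name=value lines on stdin, and exit 0 means success. The key names mac and ipaddr and the ether-wake call are illustrative assumptions, not anything the API defines, and as Digimer already noted wake-on-lan cannot force a hung node off or report its power state, so a real deployment would still need IPMI, a PDU or similar behind power_off and node_alive.

#!/bin/bash
# Sketch of a FenceAgentAPI-style agent, NOT a working fence_wol.
# fenced passes arguments as name=value pairs, one per line, on stdin.

ACTION="reboot"; MAC=""; IPADDR=""

while read -r line; do
    case "$line" in
        action=*|option=*) ACTION="${line#*=}" ;;  # key name varies by agent generation
        mac=*)             MAC="${line#*=}" ;;     # assumed cluster.conf attribute
        ipaddr=*)          IPADDR="${line#*=}" ;;  # assumed cluster.conf attribute
    esac
done

power_on() {
    # ether-wake (from net-tools) sends the magic packet; wakeonlan is
    # a Perl alternative closer in spirit to fence_na.
    ether-wake "$MAC"
}

power_off() {
    # Wake-on-lan has no way to force a node off. Returning failure
    # here keeps fenced from believing a hung node was really killed.
    return 1
}

node_alive() {
    # Weak substitute for a power-state query: only proves the OS is up.
    ping -c 1 -w 2 "$IPADDR" >/dev/null 2>&1
}

case "$ACTION" in
    on)              power_on ;;
    off)             power_off ;;
    reboot)          power_off && power_on ;;
    status|monitor)  node_alive ;;
    metadata)        exit 0 ;;  # newer agents print their parameter description here; omitted
    *)               exit 1 ;;
esac

The script exits with the status of whichever branch ran, which is all fenced looks at; everything beyond that stdin/exit-code contract is deliberately left out.
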
2011/5/31 Digimer : > On 05/30/2011 11:30 AM, Kaloyan Kovachev wrote: >>> >>> actually sends the fence command. I do know that it has to come from a >>> machine with quorum, and I do believe it is handled by the cluster >>> manager. It's not like pacemaker where a DC is clearly defined. >>> >>> I'll try to sort out how the internals work and will let you know. >> >> Not sure where i got this information from (i think it was on this list), >> but for sure: the node with the lowest ID, which is quorate, will take the >> responsibility to call the fencing script > > Indeed, you are right. :) > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Hiroyuki Sato From swap_project at yahoo.com Mon May 30 19:17:07 2011 From: swap_project at yahoo.com (Srija) Date: Mon, 30 May 2011 12:17:07 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <17723.13598.qm@web112805.mail.gq1.yahoo.com> Hi, I am very new to the redhat cluster. Need some help and suggession for the cluster configuration. We have sixteen node cluster of OS : Linux Server release 5.5 (Tikanga) kernel : 2.6.18-194.3.1.el5xen. The problem is sometimes the cluster is getting broken. The solution is (still yet)to reboot the sixteen nodes. Otherwise the nodes are not joining We are using clvm and not using any quorum disk. The quorum is by default. When it is getting broken, clustat commands shows evrything offline except the node from where the clustat command executed. If we execute vgs, lvs command, those commands are getting hung. Here is at present the clustat report ------------------------------------- [server1]# clustat Cluster Status for newcluster @ Mon May 30 14:55:10 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ server1 1 Online server2 2 Online, Local server3 3 Online server4 4 Online server5 5 Online server6 6 Online server7 7 Online server8 8 Online server9 9 Online server10 10 Online server11 11 Online server12 12 Online server13 13 Online server14 14 Online server15 15 Online server16 16 Online Here the cman_tool status output from one server -------------------------------------------------- [server1 ~]# cman_tool status Version: 6.2.0 Config Version: 23 Cluster Name: newcluster Cluster Id: 53322 Cluster Member: Yes Cluster Generation: 11432 Membership state: Cluster-Member Nodes: 16 Expected votes: 16 Total votes: 16 Quorum: 9 Active subsystems: 8 Flags: Dirty Ports Bound: 0 11 Node name: server1 Node ID: 1 Multicast addresses: xxx.xxx.xxx.xx Node addresses: 192.168.xxx.xx Here is the cluster.conf file. ------------------------------ [ ... sinp .....] .......... 
Here is the lvm.conf file -------------------------- devices { dir = "/dev" scan = [ "/dev" ] preferred_names = [ ] filter = [ "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] cache_dir = "/etc/lvm/cache" cache_file_prefix = "" write_cache_state = 1 sysfs_scan = 1 md_component_detection = 1 md_chunk_alignment = 1 data_alignment_detection = 1 data_alignment = 0 data_alignment_offset_detection = 1 ignore_suspended_devices = 0 } log { verbose = 0 syslog = 1 overwrite = 0 level = 0 indent = 1 command_names = 0 prefix = " " } backup { backup = 1 backup_dir = "/etc/lvm/backup" archive = 1 archive_dir = "/etc/lvm/archive" retain_min = 10 retain_days = 30 } shell { history_size = 100 } global { library_dir = "/usr/lib64" umask = 077 test = 0 units = "h" si_unit_consistency = 0 activation = 1 proc = "/proc" locking_type = 3 wait_for_locks = 1 fallback_to_clustered_locking = 1 fallback_to_local_locking = 1 locking_dir = "/var/lock/lvm" prioritise_write_locks = 1 } activation { udev_sync = 1 missing_stripe_filler = "error" reserved_stack = 256 reserved_memory = 8192 process_priority = -18 mirror_region_size = 512 readahead = "auto" mirror_log_fault_policy = "allocate" mirror_image_fault_policy = "remove" } dmeventd { mirror_library = "libdevmapper-event-lvm2mirror.so" snapshot_library = "libdevmapper-event-lvm2snapshot.so" } If you need more information, I can provide ... Thanks for your help Priya From kkovachev at varna.net Mon May 30 20:05:38 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 30 May 2011 23:05:38 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <17723.13598.qm@web112805.mail.gq1.yahoo.com> References: <17723.13598.qm@web112805.mail.gq1.yahoo.com> Message-ID: Hi, when your cluster gets broken, most likely the reason is, there is a network problem (switch restart or multicast traffic is lost for a while) on the interface where serverX-priv IPs are configured. Having a quorum disk may help by giving a quorum vote to one of the servers, so it can fence the others, but the best thing to do is to fix your network and preferably add a redundant link for the cluster communication to avoid breakage in the first place On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija wrote: > Hi, > > I am very new to the redhat cluster. Need some help and suggession for the > cluster configuration. > We have sixteen node cluster of > > OS : Linux Server release 5.5 (Tikanga) > kernel : 2.6.18-194.3.1.el5xen. > > The problem is sometimes the cluster is getting broken. The solution is > (still yet)to reboot the > sixteen nodes. Otherwise the nodes are not joining > > We are using clvm and not using any quorum disk. The quorum is by default. > > When it is getting broken, clustat commands shows evrything offline > except the node from where > the clustat command executed. If we execute vgs, lvs command, those > commands are getting hung. 
> > Here is at present the clustat report > ------------------------------------- > > [server1]# clustat > Cluster Status for newcluster @ Mon May 30 14:55:10 2011 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > server1 1 Online > server2 2 Online, Local > server3 3 Online > server4 4 Online > server5 5 Online > server6 6 Online > server7 7 Online > server8 8 Online > server9 9 Online > server10 10 Online > server11 11 Online > server12 12 Online > server13 13 Online > server14 14 Online > server15 15 Online > server16 16 Online > > Here the cman_tool status output from one server > -------------------------------------------------- > > [server1 ~]# cman_tool status > Version: 6.2.0 > Config Version: 23 > Cluster Name: newcluster > Cluster Id: 53322 > Cluster Member: Yes > Cluster Generation: 11432 > Membership state: Cluster-Member > Nodes: 16 > Expected votes: 16 > Total votes: 16 > Quorum: 9 > Active subsystems: 8 > Flags: Dirty > Ports Bound: 0 11 > Node name: server1 > Node ID: 1 > Multicast addresses: xxx.xxx.xxx.xx > Node addresses: 192.168.xxx.xx > > > Here is the cluster.conf file. > ------------------------------ > > > > > > > > > > > > > > > > > > > > > > > > > > [ ... sinp .....] > > > > > > > > > > > > > > > name="ilo-server1r" passwd="xxxxx"/> > .......... > name="ilo-server16r" passwd="xxxxx"/> > > > > > > > Here is the lvm.conf file > -------------------------- > > devices { > > dir = "/dev" > scan = [ "/dev" ] > preferred_names = [ ] > filter = [ "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] > cache_dir = "/etc/lvm/cache" > cache_file_prefix = "" > write_cache_state = 1 > sysfs_scan = 1 > md_component_detection = 1 > md_chunk_alignment = 1 > data_alignment_detection = 1 > data_alignment = 0 > data_alignment_offset_detection = 1 > ignore_suspended_devices = 0 > } > > log { > > verbose = 0 > syslog = 1 > overwrite = 0 > level = 0 > indent = 1 > command_names = 0 > prefix = " " > } > > backup { > > backup = 1 > backup_dir = "/etc/lvm/backup" > archive = 1 > archive_dir = "/etc/lvm/archive" > retain_min = 10 > retain_days = 30 > } > > shell { > > history_size = 100 > } > global { > library_dir = "/usr/lib64" > umask = 077 > test = 0 > units = "h" > si_unit_consistency = 0 > activation = 1 > proc = "/proc" > locking_type = 3 > wait_for_locks = 1 > fallback_to_clustered_locking = 1 > fallback_to_local_locking = 1 > locking_dir = "/var/lock/lvm" > prioritise_write_locks = 1 > } > > activation { > udev_sync = 1 > missing_stripe_filler = "error" > reserved_stack = 256 > reserved_memory = 8192 > process_priority = -18 > mirror_region_size = 512 > readahead = "auto" > mirror_log_fault_policy = "allocate" > mirror_image_fault_policy = "remove" > } > dmeventd { > > mirror_library = "libdevmapper-event-lvm2mirror.so" > snapshot_library = "libdevmapper-event-lvm2snapshot.so" > } > > > If you need more information, I can provide ... > > Thanks for your help > Priya > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swap_project at yahoo.com Tue May 31 01:22:00 2011 From: swap_project at yahoo.com (Srija) Date: Mon, 30 May 2011 18:22:00 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <584441.81591.qm@web112812.mail.gq1.yahoo.com> Thanks for your quick reply. I talked to the network people , but they are saying everything is good at their end. Is there anyway at the server end, to figure it for the switch restart or multicast traffic? 
I think you have already checked the cluster.conf file.. Except quorum disk, do you think that the cluster configuration is sufficient for handling the sixteen node cluster!! thanks again . regards --- On Mon, 5/30/11, Kaloyan Kovachev wrote: > From: Kaloyan Kovachev > Subject: Re: [Linux-cluster] Cluster environment issue > To: "linux clustering" > Date: Monday, May 30, 2011, 4:05 PM > Hi, > when your cluster gets broken, most likely the reason is, > there is a > network problem (switch restart or multicast traffic is > lost for a while) > on the interface where serverX-priv IPs are configured. > Having a quorum > disk may help by giving a quorum vote to one of the > servers, so it can > fence the others, but the best thing to do is to fix your > network and > preferably add a redundant link for the cluster > communication to avoid > breakage in the first place > > On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija > wrote: > > Hi, > > > > I am very new to the redhat cluster. Need some help > and suggession for > the > > cluster configuration. > > We have sixteen node cluster of > > > >? ? ? ? ? ???OS > : Linux Server release 5.5 (Tikanga) > >? ? ? ? ? > ???kernel :? 2.6.18-194.3.1.el5xen. > > > > The problem is sometimes the cluster is getting? > broken. The solution is > > (still yet)to reboot the > > sixteen nodes. Otherwise the nodes are not joining > > > > We are using? clvm and not using any quorum disk. > The quorum is by > default. > > > > When it is getting broken, clustat commands > shows? evrything? offline > > except the node from where > > the clustat command executed.? If we execute vgs, > lvs command, those > > commands are getting hung. > > > > Here is at present the clustat report > > ------------------------------------- > > > > [server1]# clustat > > Cluster Status for newcluster @ Mon May 30 14:55:10 > 2011 > > Member Status: Quorate > > > >? Member Name? ? ? ? ? > ? ? ? ? ? ? > ID???Status > >? ------ ----? ? ? ? ? > ? ? ? ? ? ? ---- ------ > >? server1? ? ? ? ? ? > ? ? ? ? ? ? ? 1 Online > >? server2? ? ? ? ? ? > ? ? ? ? ? ? ? 2 Online, > Local > >? server3? ? ? ? ? ? > ? ? ? ? ? ? ? 3 Online > >? server4? ? ? ? ? ? > ? ? ? ? ? ? ? 4 Online > >? server5? ? ? ? ? ? > ? ? ? ? ? ? ? 5 Online > >? server6? ? ? ? ? ? > ? ? ? ? ? ? ? 6 Online > >? server7? ? ? ? ? ? > ? ? ? ? ? ? ? 7 Online > >? server8? ? ? ? ? ? > ? ? ? ? ? ? ? 8 Online > >? server9? ? ? ? ? ? > ? ? ? ? ? ? ? 9 Online > >? server10? ? ? ? ? > ? ? ? ? ? ? > ???10 Online > >? server11? ? ? ? ? > ? ? ? ? ? ? > ???11 Online > >? server12? ? ? ? ? > ? ? ? ? ? ? > ???12 Online > >? server13? ? ? ? ? > ? ? ? ? ? ? > ???13 Online > >? server14? ? ? ? ? > ? ? ? ? ? ? > ???14 Online > >? server15? ? ? ? ? > ? ? ? ? ? ? > ???15 Online > >? server16? ? ? ? ? > ? ? ? ? ? ? > ???16 Online > > > > Here the cman_tool status? output? from one > server > > -------------------------------------------------- > > > > [server1 ~]# cman_tool status > > Version: 6.2.0 > > Config Version: 23 > > Cluster Name: newcluster > > Cluster Id: 53322 > > Cluster Member: Yes > > Cluster Generation: 11432 > > Membership state: Cluster-Member > > Nodes: 16 > > Expected votes: 16 > > Total votes: 16 > > Quorum: 9? > > Active subsystems: 8 > > Flags: Dirty > > Ports Bound: 0 11? > > Node name: server1 > > Node ID: 1 > > Multicast addresses: xxx.xxx.xxx.xx > > Node addresses: 192.168.xxx.xx > > > > > > Here is the cluster.conf file. 
> > ------------------------------ > > > > > > name="newcluster"> > > post_join_delay="15"/> > > > > > > > > votes="1"> > >? ? ? ? ? ? ? ? > ? > >? ? ? ? ? ? ? ? > ? > >? ? ? ? ? ? ? ? > ? > > > > > > votes="1"> > >? ? ? > ??? > >? ? ? ??? name="ilo-server2r"/> > >? ? ? ??? > > > > > > votes="1"> > >? ? ? > ??? > >? ? ? ??? name="ilo-server3r"/> > >? ? ? ??? > > > > > > [ ... sinp .....] > > > > votes="1"> > >? ? ? ? name="1"> > >? ? ? ? name="ilo-server16r"/> > >? ? ? ? > > > > > > > > > > > > > > > > > > > >? ? ? ??? agent="fence_ilo" hostname="server1r" login="Admin" > >? ? ? > ???name="ilo-server1r" passwd="xxxxx"/> > >? ? ? ???.......... > >? ? ? ??? agent="fence_ilo" hostname="server16r" > login="Admin" > >? ? ? > ???name="ilo-server16r" passwd="xxxxx"/> > > > > > > > > > > > > > > Here is the lvm.conf file > > -------------------------- > > > > devices { > > > >? ???dir = "/dev" > >? ???scan = [ "/dev" ] > >? ???preferred_names = [ ] > >? ???filter = [ > "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] > >? ???cache_dir = "/etc/lvm/cache" > >? ???cache_file_prefix = "" > >? ???write_cache_state = 1 > >? ???sysfs_scan = 1 > >? ???md_component_detection = 1 > >? ???md_chunk_alignment = 1 > >? ???data_alignment_detection = 1 > >? ???data_alignment = 0 > >? > ???data_alignment_offset_detection = 1 > >? ???ignore_suspended_devices = 0 > > } > > > > log { > > > >? ???verbose = 0 > >? ???syslog = 1 > >? ???overwrite = 0 > >? ???level = 0 > >? ???indent = 1 > >? ???command_names = 0 > >? ???prefix = "? " > > } > > > > backup { > > > >? ???backup = 1 > >? ???backup_dir = > "/etc/lvm/backup" > >? ???archive = 1 > >? ???archive_dir = > "/etc/lvm/archive" > >? ???retain_min = 10 > >? ???retain_days = 30 > > } > > > > shell { > > > >? ???history_size = 100 > > } > > global { > >? ???library_dir = "/usr/lib64" > >? ???umask = 077 > >? ???test = 0 > >? ???units = "h" > >? ???si_unit_consistency = 0 > >? ???activation = 1 > >? ???proc = "/proc" > >? ???locking_type = 3 > >? ???wait_for_locks = 1 > >? ???fallback_to_clustered_locking > = 1 > >? ???fallback_to_local_locking = 1 > >? ???locking_dir = "/var/lock/lvm" > >? ???prioritise_write_locks = 1 > > } > > > > activation { > >? ???udev_sync = 1 > >? ???missing_stripe_filler = > "error" > >? ???reserved_stack = 256 > >? ???reserved_memory = 8192 > >? ???process_priority = -18 > >? ???mirror_region_size = 512 > >? ???readahead = "auto" > >? ???mirror_log_fault_policy = > "allocate" > >? ???mirror_image_fault_policy = > "remove" > > } > > dmeventd { > > > >? ???mirror_library = > "libdevmapper-event-lvm2mirror.so" > >? ???snapshot_library = > "libdevmapper-event-lvm2snapshot.so" > > } > > > > > > If you need more? information,? I can > provide ... > > > > Thanks for your help > > Priya > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From hiroysato at gmail.com Tue May 31 03:03:58 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Tue, 31 May 2011 12:03:58 +0900 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <584441.81591.qm@web112812.mail.gq1.yahoo.com> References: <584441.81591.qm@web112812.mail.gq1.yahoo.com> Message-ID: Hello I'm not sure, This is useful or not. Have you ever checked ``ping some_where'' on domU when cluster is broken?? ( I thought you are using Xen, because you are using 2.6.18-194.3.1.el5xen. 
) If it does not respond anything, you should check iptables. (ex, disable iptables) -- Hiroyuki Sato 2011/5/31 Srija : > Thanks for your quick reply. > > I talked to the network people , but they are saying everything is good at their end. Is there anyway at the server end, to figure it ?for the switch restart or multicast traffic? > > I think you have already checked the cluster.conf file.. Except quorum disk, do you think that the cluster configuration is sufficient for handling the sixteen node cluster!! > > thanks again . > regards > > --- On Mon, 5/30/11, Kaloyan Kovachev wrote: > >> From: Kaloyan Kovachev >> Subject: Re: [Linux-cluster] Cluster environment issue >> To: "linux clustering" >> Date: Monday, May 30, 2011, 4:05 PM >> Hi, >> ?when your cluster gets broken, most likely the reason is, >> there is a >> network problem (switch restart or multicast traffic is >> lost for a while) >> on the interface where serverX-priv IPs are configured. >> Having a quorum >> disk may help by giving a quorum vote to one of the >> servers, so it can >> fence the others, but the best thing to do is to fix your >> network and >> preferably add a redundant link for the cluster >> communication to avoid >> breakage in the first place >> >> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija >> wrote: >> > Hi, >> > >> > I am very new to the redhat cluster. Need some help >> and suggession for >> the >> > cluster configuration. >> > We have sixteen node cluster of >> > >> >? ? ? ? ? ???OS >> : Linux Server release 5.5 (Tikanga) >> > >> ???kernel :? 2.6.18-194.3.1.el5xen. >> > >> > The problem is sometimes the cluster is getting >> broken. The solution is >> > (still yet)to reboot the >> > sixteen nodes. Otherwise the nodes are not joining >> > >> > We are using? clvm and not using any quorum disk. >> The quorum is by >> default. >> > >> > When it is getting broken, clustat commands >> shows? evrything? offline >> > except the node from where >> > the clustat command executed.? If we execute vgs, >> lvs command, those >> > commands are getting hung. >> > >> > Here is at present the clustat report >> > ------------------------------------- >> > >> > [server1]# clustat >> > Cluster Status for newcluster @ Mon May 30 14:55:10 >> 2011 >> > Member Status: Quorate >> > >> >? Member Name >> >> ID???Status >> >? ------ ---- >> ? ? ? ? ? ? ---- ------ >> >? server1 >> ? ? ? ? ? ? ? 1 Online >> >? server2 >> ? ? ? ? ? ? ? 2 Online, >> Local >> >? server3 >> ? ? ? ? ? ? ? 3 Online >> >? server4 >> ? ? ? ? ? ? ? 4 Online >> >? server5 >> ? ? ? ? ? ? ? 5 Online >> >? server6 >> ? ? ? ? ? ? ? 6 Online >> >? server7 >> ? ? ? ? ? ? ? 7 Online >> >? server8 >> ? ? ? ? ? ? ? 8 Online >> >? server9 >> ? ? ? ? ? ? ? 9 Online >> >? server10 >> >> ???10 Online >> >? server11 >> >> ???11 Online >> >? server12 >> >> ???12 Online >> >? server13 >> >> ???13 Online >> >? server14 >> >> ???14 Online >> >? server15 >> >> ???15 Online >> >? server16 >> >> ???16 Online >> > >> > Here the cman_tool status? output? 
from one >> server >> > -------------------------------------------------- >> > >> > [server1 ~]# cman_tool status >> > Version: 6.2.0 >> > Config Version: 23 >> > Cluster Name: newcluster >> > Cluster Id: 53322 >> > Cluster Member: Yes >> > Cluster Generation: 11432 >> > Membership state: Cluster-Member >> > Nodes: 16 >> > Expected votes: 16 >> > Total votes: 16 >> > Quorum: 9 >> > Active subsystems: 8 >> > Flags: Dirty >> > Ports Bound: 0 11 >> > Node name: server1 >> > Node ID: 1 >> > Multicast addresses: xxx.xxx.xxx.xx >> > Node addresses: 192.168.xxx.xx >> > >> > >> > Here is the cluster.conf file. >> > ------------------------------ >> > >> > >> > > name="newcluster"> >> > > post_join_delay="15"/> >> > >> > >> > >> > > votes="1"> >> > >> ? >> > >> ? >> > >> ? >> > >> > >> > > votes="1"> >> > >> ??? >> >? ? ? ???> name="ilo-server2r"/> >> >? ? ? ??? >> > >> > >> > > votes="1"> >> > >> ??? >> >? ? ? ???> name="ilo-server3r"/> >> >? ? ? ??? >> > >> > >> > [ ... sinp .....] >> > >> > > votes="1"> >> >? ? ? ? > name="1"> >> >? ? ? ? > name="ilo-server16r"/> >> >? ? ? ? >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >? ? ? ???> agent="fence_ilo" hostname="server1r" login="Admin" >> > >> ???name="ilo-server1r" passwd="xxxxx"/> >> >? ? ? ???.......... >> >? ? ? ???> agent="fence_ilo" hostname="server16r" >> login="Admin" >> > >> ???name="ilo-server16r" passwd="xxxxx"/> >> > >> > >> > >> > >> > >> > >> > Here is the lvm.conf file >> > -------------------------- >> > >> > devices { >> > >> >? ???dir = "/dev" >> >? ???scan = [ "/dev" ] >> >? ???preferred_names = [ ] >> >? ???filter = [ >> "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] >> >? ???cache_dir = "/etc/lvm/cache" >> >? ???cache_file_prefix = "" >> >? ???write_cache_state = 1 >> >? ???sysfs_scan = 1 >> >? ???md_component_detection = 1 >> >? ???md_chunk_alignment = 1 >> >? ???data_alignment_detection = 1 >> >? ???data_alignment = 0 >> > >> ???data_alignment_offset_detection = 1 >> >? ???ignore_suspended_devices = 0 >> > } >> > >> > log { >> > >> >? ???verbose = 0 >> >? ???syslog = 1 >> >? ???overwrite = 0 >> >? ???level = 0 >> >? ???indent = 1 >> >? ???command_names = 0 >> >? ???prefix = "? " >> > } >> > >> > backup { >> > >> >? ???backup = 1 >> >? ???backup_dir = >> "/etc/lvm/backup" >> >? ???archive = 1 >> >? ???archive_dir = >> "/etc/lvm/archive" >> >? ???retain_min = 10 >> >? ???retain_days = 30 >> > } >> > >> > shell { >> > >> >? ???history_size = 100 >> > } >> > global { >> >? ???library_dir = "/usr/lib64" >> >? ???umask = 077 >> >? ???test = 0 >> >? ???units = "h" >> >? ???si_unit_consistency = 0 >> >? ???activation = 1 >> >? ???proc = "/proc" >> >? ???locking_type = 3 >> >? ???wait_for_locks = 1 >> >? ???fallback_to_clustered_locking >> = 1 >> >? ???fallback_to_local_locking = 1 >> >? ???locking_dir = "/var/lock/lvm" >> >? ???prioritise_write_locks = 1 >> > } >> > >> > activation { >> >? ???udev_sync = 1 >> >? ???missing_stripe_filler = >> "error" >> >? ???reserved_stack = 256 >> >? ???reserved_memory = 8192 >> >? ???process_priority = -18 >> >? ???mirror_region_size = 512 >> >? ???readahead = "auto" >> >? ???mirror_log_fault_policy = >> "allocate" >> >? ???mirror_image_fault_policy = >> "remove" >> > } >> > dmeventd { >> > >> >? ???mirror_library = >> "libdevmapper-event-lvm2mirror.so" >> >? ???snapshot_library = >> "libdevmapper-event-lvm2snapshot.so" >> > } >> > >> > >> > If you need more? information,? I can >> provide ... 
>> > >> > Thanks for your help >> > Priya >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From fdinitto at redhat.com Tue May 31 04:36:05 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 31 May 2011 06:36:05 +0200 Subject: [Linux-cluster] [Cluster-devel] new RHCS upstream wiki In-Reply-To: <4DC7CF28.30602@redhat.com> References: <4DC7CF28.30602@redhat.com> Message-ID: <4DE47035.9040405@redhat.com> On 05/09/2011 01:25 PM, Fabio M. Di Nitto wrote: > Hi all, > > we are in the process of moving the old cluster wiki > (http://sourceware.org/cluster/wiki/) to: > > https://fedorahosted.org/cluster/wiki/HomePage The relocation is now complete and the old wiki is redirecting users to the new one. I'd like to thanks Digimer for doing the heavy lifting of fixing all pages. The very last thing left to do is to create a proper default home page with a summary and maybe a logo... anybody would like to suggest one? the winner will get a month free support on IRC #linux-cluster ;) Fabio From kkovachev at varna.net Tue May 31 09:18:51 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Tue, 31 May 2011 12:18:51 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <584441.81591.qm@web112812.mail.gq1.yahoo.com> References: <584441.81591.qm@web112812.mail.gq1.yahoo.com> Message-ID: <9537f2a4eb5ae1c11038deed2e3fe40f@mx.varna.net> On Mon, 30 May 2011 18:22:00 -0700 (PDT), Srija wrote: > Thanks for your quick reply. > > I talked to the network people , but they are saying everything is good at > their end. Is there anyway at the server end, to figure it for the switch > restart or multicast traffic? > If it is a switch restart you will have in your logs the interface going down/up, but more problematic is to find a short drop of the multicast traffic (even with a ping script you may miss it) which is more likely the case, as your cluster is working fine, but suddenly looses connection to all nodes at the same time. You may ask the network people to check for STP changes and double check the multicast configuration and you may also try to use broadcast instead of multicast or use a dedicated switch. > I think you have already checked the cluster.conf file.. Except quorum > disk, do you think that the cluster configuration is sufficient for > handling the sixteen node cluster!! > The config is OK ... probably add specific multicast address in the cman section to avoid surprises, but the default is also fine. To confirm it is a multicast drop (if you are lucky not ti miss it) - on just one of the nodes enable icmp broadcasts: echo 0 >/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts then ping from another node, check if just a single one replies (change to your interface and multicast address) ping -I ethX -b -L 239.x.x.x -c 1 and finaly run this script until the cluster gets broken while [ $((`ping -I ethX -w 1 -b -L 239.x.x.x -c 1 | grep -c ' 0% packet loss'`)) -eq 1 ]; do sleep 1; done; echo "missed ping at `date`" if you get 'missed ping' at the same when cluster goes down - it is confirmed :) > thanks again . 
> regards > > --- On Mon, 5/30/11, Kaloyan Kovachev wrote: > >> From: Kaloyan Kovachev >> Subject: Re: [Linux-cluster] Cluster environment issue >> To: "linux clustering" >> Date: Monday, May 30, 2011, 4:05 PM >> Hi, >> when your cluster gets broken, most likely the reason is, >> there is a >> network problem (switch restart or multicast traffic is >> lost for a while) >> on the interface where serverX-priv IPs are configured. >> Having a quorum >> disk may help by giving a quorum vote to one of the >> servers, so it can >> fence the others, but the best thing to do is to fix your >> network and >> preferably add a redundant link for the cluster >> communication to avoid >> breakage in the first place >> >> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija >> wrote: >> > Hi, >> > >> > I am very new to the redhat cluster. Need some help >> and suggession for >> the >> > cluster configuration. >> > We have sixteen node cluster of >> > >> > OS >> : Linux Server release 5.5 (Tikanga) >> > >> kernel : 2.6.18-194.3.1.el5xen. >> > >> > The problem is sometimes the cluster is getting >> broken. The solution is >> > (still yet)to reboot the >> > sixteen nodes. Otherwise the nodes are not joining >> > >> > We are using clvm and not using any quorum disk. >> The quorum is by >> default. >> > >> > When it is getting broken, clustat commands >> shows evrything offline >> > except the node from where >> > the clustat command executed. If we execute vgs, >> lvs command, those >> > commands are getting hung. >> > >> > Here is at present the clustat report >> > ------------------------------------- >> > >> > [server1]# clustat >> > Cluster Status for newcluster @ Mon May 30 14:55:10 >> 2011 >> > Member Status: Quorate >> > >> > Member Name >> >> ID Status >> > ------ ---- >> ---- ------ >> > server1 >> 1 Online >> > server2 >> 2 Online, >> Local >> > server3 >> 3 Online >> > server4 >> 4 Online >> > server5 >> 5 Online >> > server6 >> 6 Online >> > server7 >> 7 Online >> > server8 >> 8 Online >> > server9 >> 9 Online >> > server10 >> >> 10 Online >> > server11 >> >> 11 Online >> > server12 >> >> 12 Online >> > server13 >> >> 13 Online >> > server14 >> >> 14 Online >> > server15 >> >> 15 Online >> > server16 >> >> 16 Online >> > >> > Here the cman_tool status output from one >> server >> > -------------------------------------------------- >> > >> > [server1 ~]# cman_tool status >> > Version: 6.2.0 >> > Config Version: 23 >> > Cluster Name: newcluster >> > Cluster Id: 53322 >> > Cluster Member: Yes >> > Cluster Generation: 11432 >> > Membership state: Cluster-Member >> > Nodes: 16 >> > Expected votes: 16 >> > Total votes: 16 >> > Quorum: 9 >> > Active subsystems: 8 >> > Flags: Dirty >> > Ports Bound: 0 11 >> > Node name: server1 >> > Node ID: 1 >> > Multicast addresses: xxx.xxx.xxx.xx >> > Node addresses: 192.168.xxx.xx >> > >> > >> > Here is the cluster.conf file. >> > ------------------------------ >> > >> > >> > > name="newcluster"> >> > > post_join_delay="15"/> >> > >> > >> > >> > > votes="1"> >> > >> >> > >> >> > >> >> > >> > >> > > votes="1"> >> > >> >> > > name="ilo-server2r"/> >> > >> > >> > >> > > votes="1"> >> > >> >> > > name="ilo-server3r"/> >> > >> > >> > >> > [ ... sinp .....] >> > >> > > votes="1"> >> > > name="1"> >> > > name="ilo-server16r"/> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > > agent="fence_ilo" hostname="server1r" login="Admin" >> > >> name="ilo-server1r" passwd="xxxxx"/> >> > .......... 
>> > > agent="fence_ilo" hostname="server16r" >> login="Admin" >> > >> name="ilo-server16r" passwd="xxxxx"/> >> > >> > >> > >> > >> > >> > >> > Here is the lvm.conf file >> > -------------------------- >> > >> > devices { >> > >> > dir = "/dev" >> > scan = [ "/dev" ] >> > preferred_names = [ ] >> > filter = [ >> "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] >> > cache_dir = "/etc/lvm/cache" >> > cache_file_prefix = "" >> > write_cache_state = 1 >> > sysfs_scan = 1 >> > md_component_detection = 1 >> > md_chunk_alignment = 1 >> > data_alignment_detection = 1 >> > data_alignment = 0 >> > >> data_alignment_offset_detection = 1 >> > ignore_suspended_devices = 0 >> > } >> > >> > log { >> > >> > verbose = 0 >> > syslog = 1 >> > overwrite = 0 >> > level = 0 >> > indent = 1 >> > command_names = 0 >> > prefix = " " >> > } >> > >> > backup { >> > >> > backup = 1 >> > backup_dir = >> "/etc/lvm/backup" >> > archive = 1 >> > archive_dir = >> "/etc/lvm/archive" >> > retain_min = 10 >> > retain_days = 30 >> > } >> > >> > shell { >> > >> > history_size = 100 >> > } >> > global { >> > library_dir = "/usr/lib64" >> > umask = 077 >> > test = 0 >> > units = "h" >> > si_unit_consistency = 0 >> > activation = 1 >> > proc = "/proc" >> > locking_type = 3 >> > wait_for_locks = 1 >> > fallback_to_clustered_locking >> = 1 >> > fallback_to_local_locking = 1 >> > locking_dir = "/var/lock/lvm" >> > prioritise_write_locks = 1 >> > } >> > >> > activation { >> > udev_sync = 1 >> > missing_stripe_filler = >> "error" >> > reserved_stack = 256 >> > reserved_memory = 8192 >> > process_priority = -18 >> > mirror_region_size = 512 >> > readahead = "auto" >> > mirror_log_fault_policy = >> "allocate" >> > mirror_image_fault_policy = >> "remove" >> > } >> > dmeventd { >> > >> > mirror_library = >> "libdevmapper-event-lvm2mirror.so" >> > snapshot_library = >> "libdevmapper-event-lvm2snapshot.so" >> > } >> > >> > >> > If you need more information, I can >> provide ... >> > >> > Thanks for your help >> > Priya >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From hlawatschek at atix.de Tue May 31 12:16:43 2011 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Tue, 31 May 2011 14:16:43 +0200 (CEST) Subject: [Linux-cluster] How to integrate a custom resource agent into RHCS? In-Reply-To: Message-ID: <1531003971.3181.1306844203629.JavaMail.root@axgroupware01-1.gallien.atix> Hi Ralph, could you post your RA script and the service definition element from your cluster.conf? Best regards, Mark ----- "Ralph Grothe" wrote: > Hi, > > I hope this is the right forum. So bear with me Pacemaker > aficionados et alii when I talk about Red Hat Cluster Suite > (RHCS). > That's the clusterware product I am given to set up the cluster > and I'm not free to chose another software of my liking. > > Though this may sound ridiculous, since days I've been labouring > to get a fairly simple custom resource agent (hence RA) to be > acknowledged by RHCS and correctly executed through its > rgmanager. > > When scripting my RA I mostly adhered to > http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html apart > from where RHCS RAs differs from general OCF. 
> > I put my RA in /usr/share/cluster and afterwards restarted > rgmanager on all nodes. > > When I try to start the service whereof my RA's managed resource > is part of the service though gets started but not my resource, > as if it wasn't part of the service at all. > > > When I try to start my resource via rg_test nothing happens apart > from this obscure log entry > > > [root at aruba:~] > # rg_test test /etc/cluster/cluster.conf start aDIStn_sec > Running in test mode. > Entity: line 2: parser error : Char 0x0 out of allowed range > > ^ > Entity: line 2: parser error : Premature end of data in tag error > line 1 > > ^ > [root at aruba:~] > # echo $? > 0 > > [root at aruba:~] > # grep rg_test /var/log/cluster.log|tail -1 > May 30 13:54:55 aruba rg_test: [28643]: Cannot dump > meta-data because '/usr/share/cluster/default.metadata' is > missing > > > Though this is true > > [root at aruba:~] > # ls -l /usr/share/cluster/default.metadata > ls: /usr/share/cluster/default.metadata: No such file or > directory > > there isn't such a file part of the installed clusterware at all > either > > [root at aruba:~] > # yum groupinfo Clustering|tail -10|xargs rpm -ql|grep -c > default\\.metadata > 0 > > And besides, I don't understand this error because since I wrote > my RA according to above mentioned RA Developer's Guide it of > course dumps its metadata > > > [root at aruba:~] > # /usr/share/cluster/aDIStn_sec.sh meta-data|grep action > > > > > > > > > > > (note, RHCS deviates from OCF here in naming its actions > verify-all instead of validate-all and status instead of monitor. > But both refer to the same case block in my RA) > > > I also don't understand the "Char 0x0 out of allowed range" error > from the XML parser. > > If it really refers to line 2 of my cluster.conf this looks > pretty ok to me > > > [root at aruba:~] > # sed -n 2p /etc/cluster/cluster.conf > > > > If I run a validity check of the XML of my cluster.conf against > RHCS's RNG schema I also get an incomprehensible error about > extra elements in interleave. > > Nevertheless, all other resources of my cluster which rely on > RHCS's standard RAs are managed ok by the clusterware. > > > > [root at aruba:~] > # declare -f cluconf_valid > cluconf_valid () > { > xmllint --noout --relaxng > /usr/share/system-config-cluster/misc/cluster.ng > ${1:-/etc/cluster/cluster.conf} > } > [root at aruba:~] > # cluconf_valid > Relax-NG validity error : Extra element cman in interleave > /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity > error : Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > > Btw. is there a schema file available to check an RA's metadata > for validity? > > > > Of course did I test my RA script for correct functionality when > used like an init script (to which end I provide the required > environment of OCF_RESKEY_parameter(s)), > and it starts, stops and monitors my resource as intended. > > > Can anyone help? 
> > > Regards > Ralph > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Mark Hlawatschek ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de http://www.linux-subscriptions.com From hlawatschek at atix.de Tue May 31 12:26:39 2011 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Tue, 31 May 2011 14:26:39 +0200 (CEST) Subject: [Linux-cluster] Mirrored LVM device and recovery In-Reply-To: <4DD655C2.6080406@redhat.com> Message-ID: <1373329283.3184.1306844799557.JavaMail.root@axgroupware01-1.gallien.atix> Hi Andreas, your system works as designed. If a storage leg of a mirrored LVM volume fails, it simply gets removed from the LVM mirror. LVM does not provide automatic resync if the storage is available again. Best regards, Mark ----- "Andreas Bleischwitz" wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hello all, > > we are currently facing some handling issues using mirrored LVM-lvols > in > a cluster: > > We have two diffent storage systems which should be mirrored using > host-based mirroring. > AFAIK cmirrored lvols are the only supported mirroring solution under > RHEL56. So we have three multipath devices which are used for 2 data > and > one log-volume. > We added these three pvs to one volumegroup and created the logical > volume using the following command: > lvcreate -m1 -L 10G -n lv_mirrored /dev/mpath/mpath0p1 /dev/mpath2p1 > /dev/mpath/mpath1p1 > > The volume replicates ok and everything is fine.... until we remove > one > storage-side of the mirror. Then LVM simply removes the missing pv > and > the mirror is simply removed - which I think is ok; if it will be > recreated after readding the failed mirror-side. > Unfortunately LVM doesn't do anything such - is there a special > configuration-option which we missed? > > And keep in mind: there might be a huge amount of lvms which have to > be > re-mirrored. 
So manual interaction shouldn't be the default option ;) > > Regards, > > Andreas > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.14 (GNU/Linux) > Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org/ > > iQIcBAEBAgAGBQJN1lXCAAoJEK45y/Z6LXho/4wP/AjjoOGX3pRoe3XRARumkTKW > C/+8Jjm4+aC4VP1ycVkHZhrdGlzy4QmTtCFTCv40AgPU2YT/Aq+PqfNnn4SNqpwR > c815zd9Gk+uQwwR55kloX232eZzFEw2wVa9PWxOmKwaeuYSEOz8GmLVZrPVc3V9p > MNr6wkV5gzTzhC2v75KOZ4PchOiuYEDbhCd5GFDKmpyTeHTq/uNW2yRnjInAX8L9 > 8UCJ1JEzo4ry2mIBK1J+du5YtKx4uDLB893rgbf+T5Cci3hsLJ9/gfF1VU80b+o/ > uVc5t31rwUwMaFSyt9wtEhMQB0ggbyiQqzzjSP5wnnakd6lbJKhB06wM5XqGuUkS > ZetkZdH+etALFpt3PrV7F4+LDwGnP7Hw438czKjD+Xk21fd7idSo3vhtWjArPgKp > L+b5fxB8JoUGN7x2S3239aDMI6BmxTTZ+QnsamYzSy0IdHYghPSjPSsx8H5laJWd > I03F2sfPWwB8vWVweHvNbxfFjZfmEaawoMqGanoGktj/RYgvUpPZJD+YHDVGXohN > VoRVmB+t4JVSWb15BzOhzkAI//LtXjSHmtcnBuYQf8G0Q2v/r0x/hv04F9/0fQ0l > dPlU1vh244fh0nG5BMCJlKPcdcpGJnGy4kIKOknOi+NuI2ZxxvSLIY5WbrqAwkBX > QXYt6plJ5DgzWCa8fNYN > =0I6Z > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Mark Hlawatschek ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de http://www.linux-subscriptions.com From agk at redhat.com Tue May 31 12:37:14 2011 From: agk at redhat.com (Alasdair G Kergon) Date: Tue, 31 May 2011 13:37:14 +0100 Subject: [Linux-cluster] Mirrored LVM device and recovery In-Reply-To: <4DD655C2.6080406@redhat.com> References: <4DD655C2.6080406@redhat.com> Message-ID: <20110531123713.GJ11145@agk-dp.fab.redhat.com> On Fri, May 20, 2011 at 01:51:30PM +0200, Andreas Bleischwitz wrote: > Unfortunately LVM doesn't do anything such - is there a special > configuration-option which we missed? Try mirror_image_fault_policy = "allocate" in the activation section of lvm.conf. Alasdair From swap_project at yahoo.com Tue May 31 13:51:35 2011 From: swap_project at yahoo.com (Srija) Date: Tue, 31 May 2011 06:51:35 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <832312.52153.qm@web112812.mail.gq1.yahoo.com> Thanks again for the reply. Yes, this cluster environment is of xen hosts. When the cluster is detatched, all the guests are pingable, there is no issue for that. Only as I said , clustat command shows everything 'offline', also can't able to execute the lvm related commands. iptables are 'off' already in this cluster environment. regards. --- On Mon, 5/30/11, Hiroyuki Sato wrote: > From: Hiroyuki Sato > Subject: Re: [Linux-cluster] Cluster environment issue > To: "linux clustering" > Date: Monday, May 30, 2011, 11:03 PM > Hello > > I'm not sure, This is useful or not. > > Have you ever checked ``ping some_where'' on domU when > cluster is broken?? > ( I thought you are using Xen, because you are using > 2.6.18-194.3.1.el5xen. ) > If it does not respond anything, you should check > iptables. > (ex, disable iptables) > > -- > Hiroyuki Sato > > 2011/5/31 Srija : > > Thanks for your quick reply. > > > > I talked to the network people , but they are saying > everything is good at their end. Is there anyway at the > server end, to figure it ?for the switch restart or > multicast traffic? > > > > I think you have already checked the cluster.conf > file.. Except quorum disk, do you think that the cluster > configuration is sufficient for handling the sixteen node > cluster!! > > > > thanks again . 
> > regards > > > > --- On Mon, 5/30/11, Kaloyan Kovachev > wrote: > > > >> From: Kaloyan Kovachev > >> Subject: Re: [Linux-cluster] Cluster environment > issue > >> To: "linux clustering" > >> Date: Monday, May 30, 2011, 4:05 PM > >> Hi, > >> ?when your cluster gets broken, most likely the > reason is, > >> there is a > >> network problem (switch restart or multicast > traffic is > >> lost for a while) > >> on the interface where serverX-priv IPs are > configured. > >> Having a quorum > >> disk may help by giving a quorum vote to one of > the > >> servers, so it can > >> fence the others, but the best thing to do is to > fix your > >> network and > >> preferably add a redundant link for the cluster > >> communication to avoid > >> breakage in the first place > >> > >> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija > > >> wrote: > >> > Hi, > >> > > >> > I am very new to the redhat cluster. Need > some help > >> and suggession for > >> the > >> > cluster configuration. > >> > We have sixteen node cluster of > >> > > >> >? ? ? ? ? ???OS > >> : Linux Server release 5.5 (Tikanga) > >> > > >> ???kernel :? 2.6.18-194.3.1.el5xen. > >> > > >> > The problem is sometimes the cluster is > getting > >> broken. The solution is > >> > (still yet)to reboot the > >> > sixteen nodes. Otherwise the nodes are not > joining > >> > > >> > We are using? clvm and not using any quorum > disk. > >> The quorum is by > >> default. > >> > > >> > When it is getting broken, clustat commands > >> shows? evrything? offline > >> > except the node from where > >> > the clustat command executed.? If we execute > vgs, > >> lvs command, those > >> > commands are getting hung. > >> > > >> > Here is at present the clustat report > >> > ------------------------------------- > >> > > >> > [server1]# clustat > >> > Cluster Status for newcluster @ Mon May 30 > 14:55:10 > >> 2011 > >> > Member Status: Quorate > >> > > >> >? Member Name > >> > >> ID???Status > >> >? ------ ---- > >> ? ? ? ? ? ? ---- ------ > >> >? server1 > >> ? ? ? ? ? ? ? 1 Online > >> >? server2 > >> ? ? ? ? ? ? ? 2 Online, > >> Local > >> >? server3 > >> ? ? ? ? ? ? ? 3 Online > >> >? server4 > >> ? ? ? ? ? ? ? 4 Online > >> >? server5 > >> ? ? ? ? ? ? ? 5 Online > >> >? server6 > >> ? ? ? ? ? ? ? 6 Online > >> >? server7 > >> ? ? ? ? ? ? ? 7 Online > >> >? server8 > >> ? ? ? ? ? ? ? 8 Online > >> >? server9 > >> ? ? ? ? ? ? ? 9 Online > >> >? server10 > >> > >> ???10 Online > >> >? server11 > >> > >> ???11 Online > >> >? server12 > >> > >> ???12 Online > >> >? server13 > >> > >> ???13 Online > >> >? server14 > >> > >> ???14 Online > >> >? server15 > >> > >> ???15 Online > >> >? server16 > >> > >> ???16 Online > >> > > >> > Here the cman_tool status? output? from > one > >> server > >> > > -------------------------------------------------- > >> > > >> > [server1 ~]# cman_tool status > >> > Version: 6.2.0 > >> > Config Version: 23 > >> > Cluster Name: newcluster > >> > Cluster Id: 53322 > >> > Cluster Member: Yes > >> > Cluster Generation: 11432 > >> > Membership state: Cluster-Member > >> > Nodes: 16 > >> > Expected votes: 16 > >> > Total votes: 16 > >> > Quorum: 9 > >> > Active subsystems: 8 > >> > Flags: Dirty > >> > Ports Bound: 0 11 > >> > Node name: server1 > >> > Node ID: 1 > >> > Multicast addresses: xxx.xxx.xxx.xx > >> > Node addresses: 192.168.xxx.xx > >> > > >> > > >> > Here is the cluster.conf file. 
> >> > ------------------------------ > >> > > >> > > >> > config_version="23" > >> name="newcluster"> > >> > post_fail_delay="0" > >> post_join_delay="15"/> > >> > > >> > > >> > > >> > nodeid="1" > >> votes="1"> > >> > > >> ? > >> > > >> ? name="ilo-server1r"/> > >> > > >> ? > >> > > >> > > >> > nodeid="3" > >> votes="1"> > >> > > >> ??? > >> >? ? ? ??? >> name="ilo-server2r"/> > >> >? ? ? ??? > >> > > >> > > >> > nodeid="2" > >> votes="1"> > >> > > >> ??? > >> >? ? ? ??? >> name="ilo-server3r"/> > >> >? ? ? ??? > >> > > >> > > >> > [ ... sinp .....] > >> > > >> > nodeid="16" > >> votes="1"> > >> >? ? ? ? >> name="1"> > >> >? ? ? ? >> name="ilo-server16r"/> > >> >? ? ? ? > >> > > >> > > >> > > >> > > >> > > >> > plock_rate_limit="0"/> > >> > > >> > > >> > > >> >? ? ? ??? >> agent="fence_ilo" hostname="server1r" > login="Admin" > >> > > >> ???name="ilo-server1r" passwd="xxxxx"/> > >> >? ? ? ???.......... > >> >? ? ? ??? >> agent="fence_ilo" hostname="server16r" > >> login="Admin" > >> > > >> ???name="ilo-server16r" passwd="xxxxx"/> > >> > > >> > > >> > > >> > > >> > > >> > > >> > Here is the lvm.conf file > >> > -------------------------- > >> > > >> > devices { > >> > > >> >? ???dir = "/dev" > >> >? ???scan = [ "/dev" ] > >> >? ???preferred_names = [ ] > >> >? ???filter = [ > >> "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] > >> >? ???cache_dir = "/etc/lvm/cache" > >> >? ???cache_file_prefix = "" > >> >? ???write_cache_state = 1 > >> >? ???sysfs_scan = 1 > >> >? ???md_component_detection = 1 > >> >? ???md_chunk_alignment = 1 > >> >? ???data_alignment_detection = 1 > >> >? ???data_alignment = 0 > >> > > >> ???data_alignment_offset_detection = 1 > >> >? ???ignore_suspended_devices = 0 > >> > } > >> > > >> > log { > >> > > >> >? ???verbose = 0 > >> >? ???syslog = 1 > >> >? ???overwrite = 0 > >> >? ???level = 0 > >> >? ???indent = 1 > >> >? ???command_names = 0 > >> >? ???prefix = "? " > >> > } > >> > > >> > backup { > >> > > >> >? ???backup = 1 > >> >? ???backup_dir = > >> "/etc/lvm/backup" > >> >? ???archive = 1 > >> >? ???archive_dir = > >> "/etc/lvm/archive" > >> >? ???retain_min = 10 > >> >? ???retain_days = 30 > >> > } > >> > > >> > shell { > >> > > >> >? ???history_size = 100 > >> > } > >> > global { > >> >? ???library_dir = "/usr/lib64" > >> >? ???umask = 077 > >> >? ???test = 0 > >> >? ???units = "h" > >> >? ???si_unit_consistency = 0 > >> >? ???activation = 1 > >> >? ???proc = "/proc" > >> >? ???locking_type = 3 > >> >? ???wait_for_locks = 1 > >> >? ???fallback_to_clustered_locking > >> = 1 > >> >? ???fallback_to_local_locking = 1 > >> >? ???locking_dir = "/var/lock/lvm" > >> >? ???prioritise_write_locks = 1 > >> > } > >> > > >> > activation { > >> >? ???udev_sync = 1 > >> >? ???missing_stripe_filler = > >> "error" > >> >? ???reserved_stack = 256 > >> >? ???reserved_memory = 8192 > >> >? ???process_priority = -18 > >> >? ???mirror_region_size = 512 > >> >? ???readahead = "auto" > >> >? ???mirror_log_fault_policy = > >> "allocate" > >> >? ???mirror_image_fault_policy = > >> "remove" > >> > } > >> > dmeventd { > >> > > >> >? ???mirror_library = > >> "libdevmapper-event-lvm2mirror.so" > >> >? ???snapshot_library = > >> "libdevmapper-event-lvm2snapshot.so" > >> > } > >> > > >> > > >> > If you need more? information,? I can > >> provide ... 
> >> > > >> > Thanks for your help > >> > Priya > >> > > >> > -- > >> > Linux-cluster mailing list > >> > Linux-cluster at redhat.com > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From claudio.martin at abilene.it Tue May 31 16:22:01 2011 From: claudio.martin at abilene.it (Martin Claudio) Date: Tue, 31 May 2011 18:22:01 +0200 Subject: [Linux-cluster] quorum dissolved but resources are still alive Message-ID: <4DE515A9.40003@abilene.it> Hi, i have a problem with a 2 node cluster with this conf: all is ok but when node 2 goes down quorum dissolved but resources is not stopped, here log: clurgmgrd[1302]: #1: Quorum Dissolved kernel: dlm: closing connection to node 2 openais[971]: [CLM ] r(0) ip(10.1.1.11) openais[971]: [CLM ] Members Left: openais[971]: [CLM ] r(0) ip(10.1.1.12) openais[971]: [CLM ] Members Joined: openais[971]: [CMAN ] quorum lost, blocking activity openais[971]: [CLM ] CLM CONFIGURATION CHANGE openais[971]: [CLM ] New Configuration: openais[971]: [CLM ] r(0) ip(10.1.1.11) openais[971]: [CLM ] Members Left: openais[971]: [CLM ] Members Joined: openais[971]: [SYNC ] This node is within the primary component and will provide service. openais[971]: [TOTEM] entering OPERATIONAL state. openais[971]: [CLM ] got nodejoin message 10.1.1.11 openais[971]: [CPG ] got joinlist message from node 1 ccsd[964]: Cluster is not quorate. Refusing connection. cluster recognized that quorum is dissolved but resource manager doesn't stop resource, ip address is still alive, filesystem is still mount, i'll expect an emergency shutdown but it does not happen.... -- Distinti Saluti Claudio Martin Abilene Net Solutions S.r.l. From linux at alteeve.com Tue May 31 17:05:17 2011 From: linux at alteeve.com (Digimer) Date: Tue, 31 May 2011 13:05:17 -0400 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE515A9.40003@abilene.it> References: <4DE515A9.40003@abilene.it> Message-ID: <4DE51FCD.5050904@alteeve.com> On 05/31/2011 12:22 PM, Martin Claudio wrote: > Hi, > > i have a problem with a 2 node cluster with this conf: > > > > > > > > > > > There are a couple of problems here; You need: With a two-node, quorum is effectively useless, as a single node is allowed to continue. Also, without proper fencing, things will not fail properly. This means that you are in somewhat of an undefined area. Can you setup proper fencing, make the change and then try again? If the problem persists, please paste your entire cluster.conf (please only alter passwords) along with the relevant sections of logs from both nodes? -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." 
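The list archiver stripped the XML out of the two messages above, so for anyone reading the archive: the directive Digimer is describing, which lets the surviving node of a two-node cluster stay quorate on its own, is normally written as

  <cman expected_votes="1" two_node="1"/>

and the fencing he is asking for means each <clusternode> carries a <fence> block pointing at a real device, along these lines (the device name, agent and the ipaddr/login/passwd values are placeholders only, not a recommendation for any particular hardware):

  <clusternode name="node1" nodeid="1">
    <fence>
      <method name="1">
        <device name="ipmi-node1"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="ipmi-node1" ipaddr="10.0.0.1" login="admin" passwd="secret"/>
  </fencedevices>

The second node gets an equivalent stanza. This is only the general shape of the configuration the stripped XML referred to; the qdisk and fencing discussion in the replies that follow still applies.
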
From ajb2 at mssl.ucl.ac.uk Tue May 31 17:56:56 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Tue, 31 May 2011 18:56:56 +0100 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE51FCD.5050904@alteeve.com> References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com> Message-ID: <4DE52BE8.8020009@mssl.ucl.ac.uk> Digimer wrote: > > With a two-node, quorum is effectively useless, as a single node is > allowed to continue. That's what qdiskd is for. It's also useful in larger clusters. > Also, without proper fencing, things will not fail > properly. This means that you are in somewhat of an undefined area. Undefined = likely to cause data corruption. The OP needs to sort this out first before going on to anything else. From claudio.martin at abilene.it Tue May 31 18:33:03 2011 From: claudio.martin at abilene.it (Martin Claudio) Date: Tue, 31 May 2011 20:33:03 +0200 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE51FCD.5050904@alteeve.com> References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com> Message-ID: <4DE5345F.6000804@abilene.it> Il 31/05/2011 19.05, Digimer ha scritto: > > There are a couple of problems here; You need: > > > > With a two-node, quorum is effectively useless, as a single node is > allowed to continue. Also, without proper fencing, things will not fail > properly. This means that you are in somewhat of an undefined area. > > Can you setup proper fencing, make the change and then try > again? If the problem persists, please paste your entire cluster.conf > (please only alter passwords) along with the relevant sections of logs > from both nodes? > i know that quorum in a "two way cluster" is useless, but i need to config cluster in this way : node 1 votes 1 node 2 votes 2 quorum 2 When all nodes are working total votes is 3, quorum is 2 and all is working fine... if link between nodes is down node 1 alone has no quorum ( votes = 1 ) and it has to shutdown his resources while node 2 has quorum ( votes = 2) and it has to bring up resources. In this way i avoid "split brain situation". I know that in this config i have a single-point-of-failure, infact if node 2 goes down, also node 1 goes down ( no quorum ) but for me is ok... I also plannig to implement some way to fencing nodes, but at the moment it's only a simulation lab.... Anyway i still have the problem, node without quorum has not shutdown resources, any help for me plese? Distinti Saluti Claudio Martin Abilene Net Solutions S.r.l. From linux at alteeve.com Tue May 31 18:34:43 2011 From: linux at alteeve.com (Digimer) Date: Tue, 31 May 2011 14:34:43 -0400 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE52BE8.8020009@mssl.ucl.ac.uk> References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com> <4DE52BE8.8020009@mssl.ucl.ac.uk> Message-ID: <4DE534C3.8030302@alteeve.com> On 05/31/2011 01:56 PM, Alan Brown wrote: > Digimer wrote: >> >> With a two-node, quorum is effectively useless, as a single node is >> allowed to continue. > > That's what qdiskd is for. It's also useful in larger clusters. Agreed, but there are 2 caveats that need addressing; 1. qdisk requires a SAN (DRBD will not do). 2. qdisk works up to 16 nodes only. >> Also, without proper fencing, things will not fail properly. This >> means that you are in somewhat of an undefined area. > > Undefined = likely to cause data corruption. > > The OP needs to sort this out first before going on to anything else. 
-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From linux at alteeve.com  Tue May 31 18:56:05 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 31 May 2011 14:56:05 -0400
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <4DE5345F.6000804@abilene.it>
References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com>
	<4DE5345F.6000804@abilene.it>
Message-ID: <4DE539C5.9000700@alteeve.com>

On 05/31/2011 02:33 PM, Martin Claudio wrote:
> I am also planning to implement some form of fencing, but at the moment
> this is only a simulation lab.

Please read this:

http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Concept.3B_Fencing

> Anyway, I still have the problem: the node without quorum does not shut
> down its resources. Any help for me, please?

We'd like to help you, but we've been here before. Without getting
fencing working, there is no real sense in going forward. Please take
the time now to get fencing working. The cluster stack has no concept
of "a test cluster"; all clusters are treated as mission critical by
the software.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From rossnick-lists at cybercat.ca  Tue May 31 19:00:11 2011
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 31 May 2011 15:00:11 -0400
Subject: [Linux-cluster] Corosync goes cpu to 95-99%
References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com>
	<4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com>
	<3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com>
	<0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com>
	<4DD873C7.8080402@cybercat.ca>
Message-ID: <22E7D11CD5E64E338A66811F31F06238@versa>

>>> I've opened a support case at redhat for this. While collecting the
>>> sosreport for redhat, I found out in my /var/log/messages file
>>> something about gfs2_quotad being stalled for more than 120 seconds.
>>> Thought I disabled quotas with the noquota option. It appears that
>>> it's "quota=off". Since I cannot change the cluster config and
>>> remount the filesystems at the moment, I did not make the change to
>>> test it.
>>
>> Thanks Nicolas. What is the bugzilla id?
>
> It's not a bugzilla, it's a support case.

Hi!

FYI, my support ticket is still open, and GSS are searching for the
cause of the problem. In the meantime, they suggested that I start
corosync with the -p option and see if that changes anything.

I wanted to know how to do that, since it's cman that starts corosync?

Regards,

From hlawatschek at atix.de  Tue May 31 19:10:26 2011
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 31 May 2011 21:10:26 +0200 (CEST)
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <1977462131.3330.1306868972931.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>

Martin,

I did some testing with RHEL5.6 and no additional asynchronous updates.
I remember that it worked as you expected: if rgmanager notices that
quorum has dissolved, it triggers an emergency shutdown for all services
running on the nodes that lost quorum.

Which version of rgmanager are you using?
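(If you are not sure, "rpm -q rgmanager cman" on both nodes will show
the exact package versions.)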
Best regards,
Mark

----- "Martin Claudio" wrote:

> Hi,
>
> I have a problem with a 2-node cluster with this conf:
>
>
>
>
> all is ok, but when node 2 goes down quorum is dissolved but the
> resources are not stopped. Here is the log:
>
> clurgmgrd[1302]: #1: Quorum Dissolved
> kernel: dlm: closing connection to node 2
> openais[971]: [CLM ] r(0) ip(10.1.1.11)
> openais[971]: [CLM ] Members Left:
> openais[971]: [CLM ] r(0) ip(10.1.1.12)
> openais[971]: [CLM ] Members Joined:
> openais[971]: [CMAN ] quorum lost, blocking activity
> openais[971]: [CLM ] CLM CONFIGURATION CHANGE
> openais[971]: [CLM ] New Configuration:
> openais[971]: [CLM ] r(0) ip(10.1.1.11)
> openais[971]: [CLM ] Members Left:
> openais[971]: [CLM ] Members Joined:
> openais[971]: [SYNC ] This node is within the primary component and will
> provide service.
> openais[971]: [TOTEM] entering OPERATIONAL state.
> openais[971]: [CLM ] got nodejoin message 10.1.1.11
> openais[971]: [CPG ] got joinlist message from node 1
> ccsd[964]: Cluster is not quorate. Refusing connection.
>
> The cluster recognized that quorum is dissolved, but the resource
> manager doesn't stop the resources: the IP address is still alive and
> the filesystem is still mounted. I'd expect an emergency shutdown, but
> it does not happen.
>
> --
> Best regards,
> Claudio Martin
> Abilene Net Solutions S.r.l.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Mark Hlawatschek
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de
http://www.linux-subscriptions.com

Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930
USt.-Id.: DE209485962
Vorstand: Thomas Merz (Vors.), Marc Grimme, Mark Hlawatschek, Jan R. Bergrath
Vorsitzender des Aufsichtsrats: Dr. Martin Buss

From linux at alteeve.com  Tue May 31 19:13:31 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 31 May 2011 15:13:31 -0400
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>
References: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <4DE53DDB.8080501@alteeve.com>

On 05/31/2011 03:10 PM, Mark Hlawatschek wrote:
> Martin,
>
> I did some testing with RHEL5.6 and no additional asynchronous updates.
> I remember that it worked as you expected: if rgmanager notices that
> quorum has dissolved, it triggers an emergency shutdown for all services
> running on the nodes that lost quorum.
>
> Which version of rgmanager are you using?
>
> Best regards,
> Mark

The "openais" log prefixes lead me to believe it's EL5.x.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From bergman at merctech.com  Tue May 31 19:35:54 2011
From: bergman at merctech.com (bergman at merctech.com)
Date: Tue, 31 May 2011 15:35:54 -0400
Subject: [Linux-cluster] recommended method for changing quorum device
Message-ID: <21868.1306870554@localhost>

I've got a 3-node RHCS cluster and the quorum device is on a SAN disk
array that needs to be replaced.
The relevant versions are:

	CentOS 5.6 (2.6.18-238.9.1.el5)
	openais-0.80.6-28.el5_6.1
	cman-2.0.115-68.el5_6.3
	rgmanager-2.0.52-9.el5.centos.1

Currently the cluster is configured with each node having one vote and
the quorum device having 2 votes, to allow operation in the event of
multiple node failures.

I'd like to know if there's any recommended method for changing the
quorum disk "in place", without shutting down the cluster.

The following approaches come to mind:

1. Create a new quorum device (multipath, mkqdisk).

   Ensure that at least 2 of the 3 nodes are up.

   Change the cluster configuration to use the new path to the new
   device instead of the old device.

   Commit the change to the cluster.

2. Create a new quorum device (multipath, mkqdisk).

   Ensure that at least 2 of the 3 nodes are up.

   Change the cluster configuration to not use any quorum device.

   Commit the change to the cluster.

   Change the cluster configuration to use the new quorum device.

   Commit the change to the cluster.

3. Create a new quorum device (multipath, mkqdisk).

   Change the cluster configuration to use both quorum devices.

   Commit the change to the cluster.

   --------------------------------------------------
   Note: the 'mkqdisk' manual page (dated July 2006) states:

       using multiple different devices is currently not supported

   Is that still accurate?
   --------------------------------------------------

   Change the cluster configuration to use just the new quorum device
   instead of the old device.

   Commit the change to the cluster.

Thanks for any suggestions.

Mark

From claudio.martin at abilene.it  Tue May 31 19:40:48 2011
From: claudio.martin at abilene.it (Martin Claudio)
Date: Tue, 31 May 2011 21:40:48 +0200
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <4DE53DDB.8080501@alteeve.com>
References: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>
	<4DE53DDB.8080501@alteeve.com>
Message-ID: <4DE54440.5040404@abilene.it>

First of all, thanks to everybody for helping me.

RHEL 5.5
rgmanager-2.0.52-6.0.1.el5
cman-2.0.115-34.el5

Best regards,
Claudio Martin
Abilene Net Solutions S.r.l.

On 31/05/2011 21.13, Digimer wrote:
> On 05/31/2011 03:10 PM, Mark Hlawatschek wrote:
>> Martin,
>>
>> I did some testing with RHEL5.6 and no additional asynchronous updates.
>> I remember that it worked as you expected: if rgmanager notices that
>> quorum has dissolved, it triggers an emergency shutdown for all services
>> running on the nodes that lost quorum.
>>
>> Which version of rgmanager are you using?
>>
>> Best regards,
>> Mark
>
> The "openais" log prefixes lead me to believe it's EL5.x.
>

From linux at alteeve.com  Tue May 31 19:42:16 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 31 May 2011 15:42:16 -0400
Subject: [Linux-cluster] recommended method for changing quorum device
In-Reply-To: <21868.1306870554@localhost>
References: <21868.1306870554@localhost>
Message-ID: <4DE54498.1000509@alteeve.com>

On 05/31/2011 03:35 PM, bergman at merctech.com wrote:
> I've got a 3-node RHCS cluster and the quorum device is on a SAN disk
> array that needs to be replaced. The relevant versions are:
>
> CentOS 5.6 (2.6.18-238.9.1.el5)
> openais-0.80.6-28.el5_6.1
> cman-2.0.115-68.el5_6.3
> rgmanager-2.0.52-9.el5.centos.1
>
> Currently the cluster is configured with each node having one vote and
> the quorum device having 2 votes, to allow operation in the event of
> multiple node failures.
> I'd like to know if there's any recommended method for changing the
> quorum disk "in place", without shutting down the cluster.
>
> The following approaches come to mind:
>
> 1. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to use the new path to the new
>    device instead of the old device.
>
>    Commit the change to the cluster.
>
> 2. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to not use any quorum device.
>
>    Commit the change to the cluster.
>
>    Change the cluster configuration to use the new quorum device.
>
>    Commit the change to the cluster.
>
> 3. Create a new quorum device (multipath, mkqdisk).
>
>    Change the cluster configuration to use both quorum devices.
>
>    Commit the change to the cluster.
>
>    --------------------------------------------------
>    Note: the 'mkqdisk' manual page (dated July 2006) states:
>
>        using multiple different devices is currently not supported
>
>    Is that still accurate?
>    --------------------------------------------------
>
>    Change the cluster configuration to use just the new quorum device
>    instead of the old device.
>
>    Commit the change to the cluster.
>
> Thanks for any suggestions.
>
> Mark

With the caveat that I have not done this and make no claims to being
an expert: option 2 strikes me as the best choice.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From sdake at redhat.com  Tue May 31 19:47:35 2011
From: sdake at redhat.com (Steven Dake)
Date: Tue, 31 May 2011 12:47:35 -0700
Subject: [Linux-cluster] Corosync goes cpu to 95-99%
In-Reply-To: <22E7D11CD5E64E338A66811F31F06238@versa>
References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com>
	<4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com>
	<3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com>
	<0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com>
	<4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa>
Message-ID: <4DE545D7.1080703@redhat.com>

On 05/31/2011 12:00 PM, Nicolas Ross wrote:
>>>> I've opened a support case at redhat for this. While collecting the
>>>> sosreport for redhat, I found out in my /var/log/messages file
>>>> something about gfs2_quotad being stalled for more than 120 seconds.
>>>> Thought I disabled quotas with the noquota option. It appears that
>>>> it's "quota=off". Since I cannot change the cluster config and
>>>> remount the filesystems at the moment, I did not make the change to
>>>> test it.
>>>
>>> Thanks Nicolas. What is the bugzilla id?
>>
>> It's not a bugzilla, it's a support case.
>
> Hi!
>
> FYI, my support ticket is still open, and GSS are searching for the
> cause of the problem. In the meantime, they suggested that I start
> corosync with the -p option and see if that changes anything.
>
> I wanted to know how to do that, since it's cman that starts corosync?
>

cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P
option to it.
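
For reference, the change is just appending the option to the existing
"cman_tool ... join" invocation in /etc/rc.d/init.d/cman. The exact line
differs between releases, so treat the snippet below as an illustration
only (the timeout and options variables are placeholders, not
necessarily what your script uses):

    # before (illustrative)
    cman_tool -t $CMAN_JOIN_TIMEOUT -w join $cman_join_opts

    # after
    cman_tool -t $CMAN_JOIN_TIMEOUT -w join -P $cman_join_opts

cman will need to be restarted on that node for the option to take
effect.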
Regards
-steve

> Regards,
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From hlawatschek at atix.de  Tue May 31 20:22:44 2011
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 31 May 2011 22:22:44 +0200 (CEST)
Subject: [Linux-cluster] recommended method for changing quorum device
In-Reply-To: <713182518.3335.1306873331051.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <215272920.3337.1306873364406.JavaMail.root@axgroupware01-1.gallien.atix>

Mark,

without guarantee ;-) I believe that the following method should work:

1. make sure that all 3 nodes are running and part of the cluster
2. stop qdiskd on all nodes (#service qdiskd stop)
3. create the new quorum disk (#mkqdisk ...)
4. modify cluster.conf (and increment config_version)
5. #ccs_tool update /etc/cluster/cluster.conf
6. start qdiskd on all nodes (#service qdiskd start)

Kind regards,
Mark

----- bergman at merctech.com wrote:

> I've got a 3-node RHCS cluster and the quorum device is on a SAN disk
> array that needs to be replaced. The relevant versions are:
>
> CentOS 5.6 (2.6.18-238.9.1.el5)
> openais-0.80.6-28.el5_6.1
> cman-2.0.115-68.el5_6.3
> rgmanager-2.0.52-9.el5.centos.1
>
> Currently the cluster is configured with each node having one vote and
> the quorum device having 2 votes, to allow operation in the event of
> multiple node failures.
>
> I'd like to know if there's any recommended method for changing the
> quorum disk "in place", without shutting down the cluster.
>
> The following approaches come to mind:
>
> 1. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to use the new path to the new
>    device instead of the old device.
>
>    Commit the change to the cluster.
>
> 2. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to not use any quorum device.
>
>    Commit the change to the cluster.
>
>    Change the cluster configuration to use the new quorum device.
>
>    Commit the change to the cluster.
>
> 3. Create a new quorum device (multipath, mkqdisk).
>
>    Change the cluster configuration to use both quorum devices.
>
>    Commit the change to the cluster.
>
>    --------------------------------------------------
>    Note: the 'mkqdisk' manual page (dated July 2006) states:
>
>        using multiple different devices is currently not supported
>
>    Is that still accurate?
>    --------------------------------------------------
>
>    Change the cluster configuration to use just the new quorum device
>    instead of the old device.
>
>    Commit the change to the cluster.
>
> Thanks for any suggestions.
> Mark
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Mark Hlawatschek
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de
http://www.linux-subscriptions.com

From rossnick-lists at cybercat.ca  Tue May 31 22:34:21 2011
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 31 May 2011 18:34:21 -0400
Subject: [Linux-cluster] Corosync goes cpu to 95-99%
In-Reply-To: <4DE545D7.1080703@redhat.com>
References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com>
	<4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com>
	<3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com>
	<0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com>
	<4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa>
	<4DE545D7.1080703@redhat.com>
Message-ID: <068AEB47E11A41C3A8EC25F71D30B82F@Inspiron>

>> FYI, my support ticket is still open, and GSS are searching for the
>> cause of the problem. In the meantime, they suggested that I start
>> corosync with the -p option and see if that changes anything.
>>
>> I wanted to know how to do that, since it's cman that starts corosync?
>>
>
> cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P
> option to it.

That did it. I will do it for a couple of nodes and see what happens.

Regards,
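
P.S. To see whether the option took effect after restarting cman, and
whether it actually helps, I plan to keep an eye on corosync on each
node with something like:

    ps -o pid,cls,rtprio,pcpu,comm -C corosync

That shows the scheduling class and realtime priority along with the
CPU usage, so it should be easy to compare the nodes running with -P
against the ones still running the old way.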