From fabbione at ubuntu.com Wed Nov 1 07:03:27 2006 From: fabbione at ubuntu.com (Fabio Massimo Di Nitto) Date: Wed, 01 Nov 2006 08:03:27 +0100 Subject: [Linux-cluster] [PATCH] Fix fence/agents/xvm/ip_lookup.c build with 2.6.19 headers Message-ID: <454846BF.8020805@ubuntu.com> hi guys, 2.6.19 headers did shuffle a couple of things around. The patch in attachment fix the build maintaining backward compatibility with older kernel headers. Please apply. Thanks Fabio -- I'm going to make him an offer he can't refuse. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 000_linux_2.6.19.dpatch URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 890 bytes Desc: OpenPGP digital signature URL: From cosimo at streppone.it Wed Nov 1 08:32:53 2006 From: cosimo at streppone.it (Cosimo Streppone) Date: Wed, 01 Nov 2006 09:32:53 +0100 Subject: [Linux-cluster] Samba share resource fails to start after upgrade to Cluster Suite 4U4 In-Reply-To: <1162239152.4518.333.camel@rei.boston.devel.redhat.com> References: <4538E16F.3060709@streppone.it> <1161974666.4518.208.camel@rei.boston.devel.redhat.com> <45427718.5030304@streppone.it> <1162239152.4518.333.camel@rei.boston.devel.redhat.com> Message-ID: <45485BB5.20409@streppone.it> Lon Hohberger wrote: > On Fri, 2006-10-27 at 23:16 +0200, Cosimo Streppone wrote: > >> >> >> >> > > The share name should not contain any slashes. With a valid name, the > resource agent generates a brain-dead example configuration for > use as something for the administrator to build on. > > Ok, so, "exportdb" or something would be a better name. Using that, > given the rest your service configuration, the agent will generate an > *example* configuration with two properties: > [...] I think there's an error on my part, probably. By that config, I meant to define: - an external filesystem, which is a (Yeah) SAN, that's mounted by the active node only, that contains the core applications; - another external smb filesystem, which is accessible as smb://share/exportdb, with "share" being a machine name, and "exportdb" the share name. (What a mess, I know). - The /etc/init.d/smb script that starts the samba server on the active node, for other shares that *my* server exposes. BTW, if it's relevant to you, the redhat support request I filed is the no. 1070734. -- Cosimo From dbrieck at gmail.com Wed Nov 1 13:50:49 2006 From: dbrieck at gmail.com (David Brieck Jr.) 
Date: Wed, 1 Nov 2006 08:50:49 -0500 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290610231039g739fe237m360f0a213d97e62f@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> <20060928195844.GB25242@redhat.com> <8c1094290609290651r62cec5f9n28278d6a81c3e6ef@mail.gmail.com> <8c1094290610100545j1f91d0a2h519cd9033b439390@mail.gmail.com> <8c1094290610230648x7ac8189dy3b2612fb9191125b@mail.gmail.com> <20061023171946.GA23495@ether.msp.redhat.com> <8c1094290610231032w5184472dn749f0bfbc7456624@mail.gmail.com> <8c1094290610231039g739fe237m360f0a213d97e62f@mail.gmail.com> Message-ID: <8c1094290611010550h733b0b33nb3dfab0b2bb21eb8@mail.gmail.com> Well, the problem has gotten even stranger, now a node is mysteriously crashing with nothing in the logs: Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 20260 state 0 Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 202e2 state 0 Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 303d7 state 0 Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 50159 state 0 Nov 1 06:29:19 http2 sshd(pam_unix)[24026]: session opened for user root by root(uid=0) Nov 1 06:45:02 http2 syslogd 1.4.1: restart. Nov 1 06:45:02 http2 syslog: syslogd startup succeeded Earlier in the day I had this crash on my GNBD server though (might not be related to my other problem, but hey, who knows), looks like it's related to DLM: Oct 31 10:35:55 storage1 gnbd_serv[5073]: server process 25402 exited because of signal 15 Oct 31 10:35:55 storage1 gnbd_serv[5073]: server process 25400 exited because of signal 15 Oct 31 10:39:45 storage1 kernel: rebuilt 1 resources Oct 31 10:39:45 storage1 kernel: backups rebuilt 98 resources Oct 31 10:39:45 storage1 kernel: clvmd purge requests Oct 31 10:39:45 storage1 kernel: backups purge requests Oct 31 10:39:45 storage1 kernel: clvmd purged 0 requests Oct 31 10:39:45 storage1 kernel: backups purged 0 requests Oct 31 10:39:45 storage1 kernel: configs mark waiting requests Oct 31 10:39:45 storage1 kernel: configs marked 0 requests Oct 31 10:39:45 storage1 kernel: configs purge locks of departed nodes Oct 31 10:39:45 storage1 kernel: configs purged 11 locks Oct 31 10:39:45 storage1 kernel: configs update remastered resources Oct 31 10:39:45 storage1 kernel: configs updated 1 resources Oct 31 10:39:45 storage1 kernel: configs rebuild locks Oct 31 10:39:45 storage1 kernel: configs rebuilt 1 locks Oct 31 10:39:45 storage1 kernel: configs recover event 230 done Oct 31 10:39:45 storage1 kernel: configs move flags 0,0,1 ids 229,230,230 Oct 31 10:39:45 storage1 kernel: configs process held requests Oct 31 10:39:45 storage1 kernel: configs processed 0 requests Oct 31 10:39:45 storage1 kernel: configs resend marked requests Oct 31 10:39:45 storage1 kernel: configs resent 0 requests Oct 31 10:39:45 storage1 kernel: configs recover event 230 finished Oct 31 10:39:45 storage1 kernel: clvmd mark waiting requests Oct 31 10:39:45 storage1 kernel: clvmd marked 0 requests Oct 31 10:39:46 storage1 kernel: clvmd purge locks of departed nodes Oct 31 10:39:46 storage1 kernel: clvmd purged 5 locks Oct 31 10:39:46 storage1 kernel: clvmd update remastered resources Oct 31 10:39:46 storage1 kernel: clvmd updated 0 resources Oct 31 10:39:46 storage1 kernel: clvmd rebuild locks Oct 31 10:39:46 storage1 kernel: 
clvmd rebuilt 0 locks Oct 31 10:39:46 storage1 kernel: clvmd recover event 230 done Oct 31 10:39:46 storage1 kernel: Magma mark waiting requests Oct 31 10:39:46 storage1 kernel: Magma marked 0 requests Oct 31 10:39:46 storage1 kernel: Magma purge locks of departed nodes Oct 31 10:39:46 storage1 kernel: Magma purged 0 locks Oct 31 10:39:46 storage1 kernel: Magma update remastered resources Oct 31 10:39:46 storage1 kernel: Magma updated 0 resources Oct 31 10:39:46 storage1 kernel: Magma rebuild locks Oct 31 10:39:46 storage1 kernel: Oct 31 10:39:46 storage1 kernel: DLM: Assertion failed on line 105 of file /home/buildcentos/rpmbuild/BUILD/dlm-kernel-2.6.9-42/hugemem/src/rebuild.c Oct 31 10:39:46 storage1 kernel: DLM: assertion: "root->res_newlkid_expect" Oct 31 10:39:46 storage1 kernel: DLM: time = 2164169409 Oct 31 10:39:46 storage1 kernel: newlkid_expect=0 Oct 31 10:39:46 storage1 kernel: Oct 31 10:39:46 storage1 kernel: ------------[ cut here ]------------ Oct 31 10:39:46 storage1 kernel: kernel BUG at /home/buildcentos/rpmbuild/BUILD/dlm-kernel-2.6.9-42/hugemem/src/rebuild.c:105! Oct 31 10:39:46 storage1 kernel: invalid operand: 0000 [#1] Oct 31 10:39:46 storage1 kernel: SMP Oct 31 10:39:46 storage1 kernel: Modules linked in: ip_vs_wlc ip_vs lock_dlm(U) gfs(U) lock_harness(U) mptctl mptbase dell_rbu parport_pc lp parport autofs4 i2c_dev i2c_core gnbd(U) dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandler iptable_filter ip_tables md5 ipv6 dm_mirror joydev button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 bonding(U) floppy sg ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod Oct 31 10:39:46 storage1 kernel: CPU: 0 Oct 31 10:39:46 storage1 kernel: EIP: 0060:[] Not tainted VLI Oct 31 10:39:46 storage1 kernel: EFLAGS: 00010246 (2.6.9-42.0.2.ELhugemem) Oct 31 10:39:46 storage1 kernel: EIP is at have_new_lkid+0x79/0xb7 [dlm] Oct 31 10:39:46 storage1 kernel: eax: 00000001 ebx: dd76a0ec ecx: e1069e3c edx: f8a340dd Oct 31 10:39:46 storage1 kernel: esi: dd76a150 edi: 009803dc ebp: 39f2e400 esp: e1069e38 Oct 31 10:39:46 storage1 kernel: ds: 007b es: 007b ss: 0068 Oct 31 10:39:46 storage1 kernel: Process dlm_recvd (pid: 4314, threadinfo=e1069000 task=e13c1630) Oct 31 10:39:46 storage1 kernel: Stack: f8a340dd f8a34136 00000000 f8a34086 00000069 f8a3403b f8a3411d 80fe9ac1 Oct 31 10:39:46 storage1 kernel: 000002e8 00060028 f8a2e46b 6b914018 00000001 00000020 6b914000 39f2e400 Oct 31 10:39:46 storage1 kernel: 00000001 6b914000 f8a2e9f6 000002e8 00004040 00001000 de541580 00000001 Oct 31 10:39:46 storage1 kernel: Call Trace: Oct 31 10:39:46 storage1 kernel: [] rebuild_rsbs_lkids_recv+0x99/0x106 [dlm] Oct 31 10:39:46 storage1 kernel: [] rcom_process_message+0x2e8/0x405 [dlm] Oct 31 10:39:46 storage1 kernel: [] process_recovery_comm+0x3c/0xa7 [dlm] Oct 31 10:39:46 storage1 kernel: [] midcomms_process_incoming_buffer+0x1bc/0x1f8 [dlm] Oct 31 10:39:46 storage1 kernel: [<02142d40>] buffered_rmqueue+0x17d/0x1a5 Oct 31 10:39:46 storage1 kernel: [<021204e9>] autoremove_wake_function+0x0/0x2d Oct 31 10:39:46 storage1 kernel: [<02142e1c>] __alloc_pages+0xb4/0x29d Oct 31 10:39:46 storage1 kernel: [] receive_from_sock+0x192/0x26c [dlm] Oct 31 10:39:46 storage1 kernel: [] dlm_recvd+0x0/0x95 [dlm] Oct 31 10:39:46 storage1 kernel: [] process_sockets+0x56/0x91 [dlm] Oct 31 10:39:46 storage1 kernel: [] dlm_recvd+0x85/0x95 [dlm] Oct 31 10:39:46 storage1 kernel: [<02133089>] kthread+0x73/0x9b Oct 31 10:39:46 storage1 kernel: [<02133016>] kthread+0x0/0x9b Oct 31 10:39:46 storage1 kernel: [<021041f5>] 
kernel_thread_helper+0x5/0xb Oct 31 10:39:46 storage1 kernel: Code: 41 a3 f8 68 3b 40 a3 f8 6a 69 68 86 40 a3 f8 e8 17 59 6f 09 ff 73 60 68 36 41 a3 f8 e8 0a 59 6f 09 68 dd 40 a3 f8 e8 00 59 6f 09 <0f> 0b 69 00 3b 40 a3 f8 83 c4 20 68 df 40 a3 f8 e8 55 50 6f 09 Oct 31 10:39:46 storage1 kernel: <0>Fatal exception: panic in 5 seconds From tscherf at redhat.com Wed Nov 1 15:34:35 2006 From: tscherf at redhat.com (Thorsten Scherf) Date: Wed, 01 Nov 2006 16:34:35 +0100 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <20061031135329.292530@leena> References: <20061031135329.292530@leena> Message-ID: <1162395275.9516.0.camel@tiffy.tuxgeek.de> Am Dienstag, den 31.10.2006, 13:53 -0600 schrieb isplist at logicore.net: > Anyone know of a CLEAR and easy to understand document on setting up a Piranha > LVS load balancing setup? I've read everything I can find and since mine won't > work, I'm just making it worse now. http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/pt-lvs.html clear enough? -- Thorsten Scherf Red Hat GmbH -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: Dies ist ein digital signierter Nachrichtenteil URL: From isplist at logicore.net Wed Nov 1 15:40:36 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 1 Nov 2006 09:40:36 -0600 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <1162395275.9516.0.camel@tiffy.tuxgeek.de> Message-ID: <200611194036.839699@leena> This is where I started, on the redhat site. Ended up with problems, have not been able to find enough to get past the errors. For example, the services won't turn on for some reason, at least Piranha says the daemon is stopped. Everything seems to be running when I ps from the command line. Anyhow, I'll keep looking, thanks for the lead. Mike >> Anyone know of a CLEAR and easy to understand document on setting up a >> Piranha LVS load balancing setup? I've read everything I can find and since > http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/pt-lvs.html > > clear enough? From teigland at redhat.com Wed Nov 1 15:39:46 2006 From: teigland at redhat.com (David Teigland) Date: Wed, 1 Nov 2006 09:39:46 -0600 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290611010550h733b0b33nb3dfab0b2bb21eb8@mail.gmail.com> References: <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> <20060928195844.GB25242@redhat.com> <8c1094290609290651r62cec5f9n28278d6a81c3e6ef@mail.gmail.com> <8c1094290610100545j1f91d0a2h519cd9033b439390@mail.gmail.com> <8c1094290610230648x7ac8189dy3b2612fb9191125b@mail.gmail.com> <20061023171946.GA23495@ether.msp.redhat.com> <8c1094290610231032w5184472dn749f0bfbc7456624@mail.gmail.com> <8c1094290610231039g739fe237m360f0a213d97e62f@mail.gmail.com> <8c1094290611010550h733b0b33nb3dfab0b2bb21eb8@mail.gmail.com> Message-ID: <20061101153945.GB16402@redhat.com> On Wed, Nov 01, 2006 at 08:50:49AM -0500, David Brieck Jr. 
wrote: > Oct 31 10:39:46 storage1 kernel: DLM: Assertion failed on line 105 of > file > /home/buildcentos/rpmbuild/BUILD/dlm-kernel-2.6.9-42/hugemem/src/rebuild.c > Oct 31 10:39:46 storage1 kernel: DLM: assertion: > "root->res_newlkid_expect" > Oct 31 10:39:46 storage1 kernel: DLM: time = 2164169409 > Oct 31 10:39:46 storage1 kernel: newlkid_expect=0 I've added a reference to this in https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208134 Dave From riaan at obsidian.co.za Wed Nov 1 16:02:12 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Wed, 01 Nov 2006 18:02:12 +0200 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <200611194036.839699@leena> References: <200611194036.839699@leena> Message-ID: <4548C504.4060903@obsidian.co.za> hi Mike are there any error messages when you try to start things up? Between the list members and Google, something should match or be able to give an indication of how to progress on your config. Are there any specific sections in the official documentation that you find unclear? (there are a number of open bugzillas on the RHCS documentation). However, the documentation is reasonably complete. IMHO, the chances are fairly small that the documentation is incomplete/unusable to such an extent that someone else would be writing / maintaining a separate, more understandable LVS howto. If the vendor-provided docs don't work for you, perhaps a generic howto (I found one at http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/ ) might give you better mileage. If you are not misreading the documentation (others have been able to set up IPVS using the RH docs), you may have stumbled on a mistake/omission. From personal experience I know that if I had to write up the problem that I am having, retracing my steps, rereading the docs, manpages and looking in detail at the error messages, there is a good chance that I might solve the problem myself, without submitting a query to either this list or Red Hat Support. greetings Riaan isplist at logicore.net wrote: > This is where I started, on the redhat site. Ended up with problems, have not > been able to find enough to get past the errors. > > For example, the services won't turn on for some reason, at least Piranha says > the daemon is stopped. Everything seems to be running when I ps from the > command line. > > Anyhow, I'll keep looking, thanks for the lead. > > Mike > >>> Anyone know of a CLEAR and easy to understand document on setting up a >>> Piranha LVS load balancing setup? I've read everything I can find and since > >> http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/pt-lvs.html >> >> clear enough? > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From isplist at logicore.net Wed Nov 1 16:06:14 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 1 Nov 2006 10:06:14 -0600 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <1162395799.5296.12.camel@localhost> Message-ID: <200611110614.758782@leena> I followed everything in the documentation for setting up and firing up. I might have missed or misunderstood something but I'm looking. On the first LVS machine, things seem to be running. Can't get to any sites but it seems to be running according to Piranha and a ps. 
On the second LVS backup, something prevents services from starting; Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: failed to bind to heartbeat address: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: pulse: cannot create heartbeat socket. running as root? Nov 1 09:54:32 lb53 pulse: pulse startup failed And nanny is not yet running on the secondary and I'm guessing it won't run until pulse is running first. Backup config is identical to primary, just not sure what to look for next but the error seems to be related to attaching an IP to an interface. This is where it's got me confused since the config is from the working primary. Mike On Wed, 01 Nov 2006 15:43:18 +0000, Alan Cooper wrote: > Have you started the nanny service? > > > On Wed, 2006-11-01 at 09:40 -0600, isplist at logicore.net wrote: >> This is where I started, on the redhat site. Ended up with problems, have >> not >> been able to find enough to get past the errors. >> >> For example, the services won't turn on for some reason, at least Piranha >> says >> the daemon is stopped. Everything seems to be running when I ps from the >> command line. >> >> Anyhow, I'll keep looking, thanks for the lead. >> >> Mike >> >>>> Anyone know of a CLEAR and easy to understand document on setting up a >>>> Piranha LVS load balancing setup? I've read everything I can find and >>>> since >>>> >>> http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/pt-lvs.html >>> >>> clear enough? >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Wed Nov 1 16:17:53 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 1 Nov 2006 10:17:53 -0600 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <4548C504.4060903@obsidian.co.za> Message-ID: <2006111101753.094316@leena> I agree that there is a lot of documentation and that it does seem to be complete. Searching and finding answers is where things sometimes get messed up since it's not often clear what versions or combinations folks are talking about when helping each other and ultimately, you if you read and try their advise :). > are there any error messages when you try to start things up? Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: failed to bind to heartbeat address: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: pulse: cannot create heartbeat socket. running as root? Nov 1 09:54:32 lb53 pulse: pulse startup failed As replied to another list member, this seems to be related to attaching an IP to an interface. Thing is, the machines are identical in every way that I know of so am not sure why this is suddenly happening. Maybe something is stuck somewhere? > Are there any specific sections in the official documentation that you > find unclear? (there are a number of open bugzillas on the RHCS No, all of those things are great documents. As I say, it's when one does get into trouble, if the answers aren't there, then one has to look on the net and possibly end up with new problems based on trying some of the solutions found. > (I found one at http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/ ) might > give you better mileage. I'll check this one out also. Thanks very much. 
> From personal experience I know that if I had to write up the problem > that I am having, retracing my steps, rereading the docs, manpages and > looking in detail at the error messages, there is a good chance that I > might solve the problem myself, without submitting a query to either > this list or Red Hat Support. I've been doing network stuff since the early 90's so believe me, I don't ask unless I've done some researching first, if that's what you mean. Seems much better to ask live members in a mailing list than to use suggestions from the countless sites found when searching for such solutions since they could make things worse. Mike From dbrieck at gmail.com Wed Nov 1 16:20:04 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Wed, 1 Nov 2006 11:20:04 -0500 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <20061101153945.GB16402@redhat.com> References: <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <20060928195844.GB25242@redhat.com> <8c1094290609290651r62cec5f9n28278d6a81c3e6ef@mail.gmail.com> <8c1094290610100545j1f91d0a2h519cd9033b439390@mail.gmail.com> <8c1094290610230648x7ac8189dy3b2612fb9191125b@mail.gmail.com> <20061023171946.GA23495@ether.msp.redhat.com> <8c1094290610231032w5184472dn749f0bfbc7456624@mail.gmail.com> <8c1094290610231039g739fe237m360f0a213d97e62f@mail.gmail.com> <8c1094290611010550h733b0b33nb3dfab0b2bb21eb8@mail.gmail.com> <20061101153945.GB16402@redhat.com> Message-ID: <8c1094290611010820q711d879cx7af8a790641e22b0@mail.gmail.com> On 11/1/06, David Teigland wrote: > On Wed, Nov 01, 2006 at 08:50:49AM -0500, David Brieck Jr. wrote: > > Oct 31 10:39:46 storage1 kernel: DLM: Assertion failed on line 105 of > > file > > /home/buildcentos/rpmbuild/BUILD/dlm-kernel-2.6.9-42/hugemem/src/rebuild.c > > Oct 31 10:39:46 storage1 kernel: DLM: assertion: > > "root->res_newlkid_expect" > > Oct 31 10:39:46 storage1 kernel: DLM: time = 2164169409 > > Oct 31 10:39:46 storage1 kernel: newlkid_expect=0 > > I've added a reference to this in > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208134 > > Dave > > Does this bug seem related to the other problems I've been having? This is the first I've gotten that message so I'm not sure if it's related to my other problems or just a coincidence. From isplist at logicore.net Wed Nov 1 16:26:35 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 1 Nov 2006 10:26:35 -0600 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <4548C504.4060903@obsidian.co.za> Message-ID: <2006111102635.523111@leena> Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: failed to bind to heartbeat address: Cannot assign requested address Nov 1 09:54:32 lb53 pulse: pulse: cannot create heartbeat socket. running as root? Nov 1 09:54:32 lb53 pulse: pulse startup failed --- The error seems to be that pulse cannot start because something is already running on port 539. Checking the system however shows that nothing is running on that port so? Mike From dbrieck at gmail.com Wed Nov 1 16:27:07 2006 From: dbrieck at gmail.com (David Brieck Jr.) 
Date: Wed, 1 Nov 2006 11:27:07 -0500 Subject: [Linux-cluster] Piranha/LVS/Load balancing In-Reply-To: <2006111101753.094316@leena> References: <4548C504.4060903@obsidian.co.za> <2006111101753.094316@leena> Message-ID: <8c1094290611010827k46c5378dsaccfea73d465ada5@mail.gmail.com> On 11/1/06, isplist at logicore.net wrote: > I agree that there is a lot of documentation and that it does seem to be > complete. Searching and finding answers is where things sometimes get messed > up since it's not often clear what versions or combinations folks are talking > about when helping each other and ultimately, you if you read and try their > advise :). > > > are there any error messages when you try to start things up? > > Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested > address > Nov 1 09:54:32 lb53 pulse: SIOCGIFADDR failed: Cannot assign requested > address > Nov 1 09:54:32 lb53 pulse: failed to bind to heartbeat address: Cannot assign > requested address > Nov 1 09:54:32 lb53 pulse: pulse: cannot create heartbeat socket. running as > root? > Nov 1 09:54:32 lb53 pulse: pulse startup failed > > As replied to another list member, this seems to be related to attaching an IP > to an interface. Thing is, the machines are identical in every way that I know > of so am not sure why this is suddenly happening. Maybe something is stuck > somewhere? > > > Are there any specific sections in the official documentation that you > > find unclear? (there are a number of open bugzillas on the RHCS > > No, all of those things are great documents. As I say, it's when one does get > into trouble, if the answers aren't there, then one has to look on the net and > possibly end up with new problems based on trying some of the solutions found. > > > (I found one at http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/ ) might > > give you better mileage. > > I'll check this one out also. Thanks very much. > > > From personal experience I know that if I had to write up the problem > > that I am having, retracing my steps, rereading the docs, manpages and > > looking in detail at the error messages, there is a good chance that I > > might solve the problem myself, without submitting a query to either > > this list or Red Hat Support. > > I've been doing network stuff since the early 90's so believe me, I don't ask > unless I've done some researching first, if that's what you mean. > > Seems much better to ask live members in a mailing list than to use > suggestions from the countless sites found when searching for such solutions > since they could make things worse. 
>
> Mike
>
>
Post the output of "cat /etc/sysconfig/ha/lvs.cf"

From isplist at logicore.net  Wed Nov 1 16:36:36 2006
From: isplist at logicore.net (isplist at logicore.net)
Date: Wed, 1 Nov 2006 10:36:36 -0600
Subject: [Linux-cluster] Piranha/LVS/Load balancing
In-Reply-To: <8c1094290611010827k46c5378dsaccfea73d465ada5@mail.gmail.com>
Message-ID: <2006111103636.828190@leena>

> Post the output of "cat /etc/sysconfig/ha/lvs.cf"

serial_no = 44
primary = 192.168.1.150
service = lvs
backup_active = 1
backup = 192.168.1.52
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = direct
debug_level = NONE
monitor_links = 0
virtual HTTP {
     active = 1
     address = 192.168.1.150 eth0:1
     vip_nmask = 255.255.255.0
     port = 80
     send = "GET / HTTP/1.0\r\n\r\n"
     expect = "HTTP"
     use_regex = 0
     load_monitor = none
     scheduler = wlc
     protocol = tcp
     timeout = 6
     reentry = 15
     quiesce_server = 0
     server cweb92 {
         address = 192.168.1.92
         active = 1
         weight = 1
     }
     server cweb93 {
         address = 192.168.1.93
         active = 1
         weight = 1
     }
     server cweb94 {
         address = 192.168.1.94
         active = 1
         weight = 1
     }
}

From isplist at logicore.net  Wed Nov 1 16:56:41 2006
From: isplist at logicore.net (isplist at logicore.net)
Date: Wed, 1 Nov 2006 10:56:41 -0600
Subject: [Linux-cluster] Piranha/LVS/Load balancing
In-Reply-To: <8c1094290611010827k46c5378dsaccfea73d465ada5@mail.gmail.com>
Message-ID: <2006111105641.972691@leena>

Found the problem, but it leaves me confused.

First, the log says that the problem is that pulse cannot start; looking into
that, it seems something else might be running on that port. Nothing else is
running on that port, yet pulse won't start. A reboot fixed that, yet netstat
looks identical; I don't get it.

Looked at the permissions of lvs.cf and sure enough, those aren't correct.
Change everything over and voila, Piranha reports that everything is running.
So I have to guess that my scp command is not working as I've set it up,
because permissions were changed when copying the file over to the redundant
server.

scp /etc/sysconfig/ha/lvs.cf 192.168.1.53:/etc/sysconfig/ha/lvs.cf

So, now it's all running, but I have new errors to deal with since the actual
servers are timing out (Nov 1 10:53:29 lb53 nanny[2963]: READ to
192.168.1.93:80 timed out). No biggie, this one should just be part of some
configuring now.

Mike

From dfenton at ucalgary.ca  Wed Nov 1 16:59:21 2006
From: dfenton at ucalgary.ca (Daryl Fenton)
Date: Wed, 01 Nov 2006 09:59:21 -0700
Subject: [Linux-cluster] /etc/exports script
Message-ID: <4548D269.5020101@ucalgary.ca>

Hello,

I was wondering if there is a simple script to put the contents of
/etc/exports into /etc/cluster/cluster.conf for Cluster Suite for AS v. 4.
I have seen that there is a way in Cluster Suite for AS v. 3, in the Cluster
Manager GUI, to import the exports file, but I have not been able to find an
easy way to do this in v4. Any help that would save me from writing my own
script to do this would be greatly appreciated.

Thanks

Daryl Fenton
dfenton at ucalgary.ca

From lhh at redhat.com  Wed Nov 1 19:12:20 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 01 Nov 2006 14:12:20 -0500
Subject: [Linux-cluster] [PATCH] Fix fence/agents/xvm/ip_lookup.c build with 2.6.19 headers
In-Reply-To: <454846BF.8020805@ubuntu.com>
References: <454846BF.8020805@ubuntu.com>
Message-ID: <1162408340.4518.370.camel@rei.boston.devel.redhat.com>

On Wed, 2006-11-01 at 08:03 +0100, Fabio Massimo Di Nitto wrote:
> hi guys,
>
> 2.6.19 headers did shuffle a couple of things around.
The patch in attachment > fix the build maintaining backward compatibility with older kernel headers. > > Please apply. Done in -head. -- Lon From lhh at redhat.com Wed Nov 1 19:39:08 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 01 Nov 2006 14:39:08 -0500 Subject: [Linux-cluster] Samba share resource fails to start after upgrade to Cluster Suite 4U4 In-Reply-To: <45485BB5.20409@streppone.it> References: <4538E16F.3060709@streppone.it> <1161974666.4518.208.camel@rei.boston.devel.redhat.com> <45427718.5030304@streppone.it> <1162239152.4518.333.camel@rei.boston.devel.redhat.com> <45485BB5.20409@streppone.it> Message-ID: <1162409948.4518.398.camel@rei.boston.devel.redhat.com> On Wed, 2006-11-01 at 09:32 +0100, Cosimo Streppone wrote: > Lon Hohberger wrote: > > > On Fri, 2006-10-27 at 23:16 +0200, Cosimo Streppone wrote: > > > >> > >> > >> > >> > > > > The share name should not contain any slashes. With a valid name, the > > resource agent generates a brain-dead example configuration for > > use as something for the administrator to build on. > > > > Ok, so, "exportdb" or something would be a better name. Using that, > > given the rest your service configuration, the agent will generate an > > *example* configuration with two properties: > > > [...] > > I think there's an error on my part, probably. > By that config, I meant to define: > > - an external filesystem, which is a (Yeah) SAN, that's mounted > by the active node only, that contains the core applications; > - another external smb filesystem, which is accessible as > smb://share/exportdb, with "share" being a machine name, and > "exportdb" the share name. (What a mess, I know). > - The /etc/init.d/smb script that starts the samba server > on the active node, for other shares that *my* server exposes. Easiest thing to do is flip things around: (You can do this by refs if you want). That will cause smb.sh to create a samba config called /etc/samba/smb.conf.share. It will be accessible only via 10.1.1.200 using netbios name "share" in workgroup "WORKGROUP", and providing access to /mnt/san as the share name "exportdb", i.e., clients accessing: //share/exportdb ... will see the contents of /mnt/san. Since you intend to run a machine-specific Samba config (e.g. one not managed by the cluster, presumably exporting local file systems or printers, etc.), you should: * Ensure that your /etc/samba/smb.conf is configured to *not* bind to the 10.1.1.200 address. So, you'll have to set "bind interfaces=" yourself in /etc/samba/smb.conf. * Generally speaking, machine-specific Samba shares should not usually bind to cluster IP addresses or be started as part of a cluster service. So, if you have machine-specific Samba shares you want started, you should probably just chkconfig --add them rather than putting them in as part of a cluster service. Otherwise, clients accessing the share will lose access any time you take down the cluster service (even though they could operate independently). The end result should look like: +---------[ Server1 ]----------+ +--[ Server2 ]--+ | "share" | | | | "share1" (floating IP) | | "share2" | | 10.1.1.101 10.1.1.200 | | 10.1.1.102 | | | | | | | | | | | | | | | | | | | | | | | /mnt/local1 /mnt/san | | /mnt/local2 | +------------------------------+ +---------------+ (local) (failover/HA) (local) The "share" pseudo-machine (including the "exportdb" export), the IP address 10.1.1.200, and /mnt/san can be provided by either node, but the local exports may only be provided by specific machines. 
In the above example, Server1 is providing "share", but you could move it to Server2 using the clusvcadm -r command. Hope that helps. -- Lon From lhh at redhat.com Wed Nov 1 21:04:04 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 01 Nov 2006 16:04:04 -0500 Subject: [Linux-cluster] RHEL4 cluster source RPMs incomplete In-Reply-To: <20061026165124.A10943@xos037.xos.nl> References: <200610121452.k9CEqYg13019@xos037.xos.nl> <1161873875.4518.82.camel@rei.boston.devel.redhat.com> <20061026165124.A10943@xos037.xos.nl> Message-ID: <1162415044.4518.400.camel@rei.boston.devel.redhat.com> On Thu, 2006-10-26 at 16:51 +0200, Jos Vos wrote: > On Thu, Oct 26, 2006 at 10:44:35AM -0400, Lon Hohberger wrote: > > > On Thu, 2006-10-12 at 16:52 +0200, Jos Vos wrote: > > > > Can someone of the cluster team maybe check the completeness of the > > > RHCS/RHGFS source RPMs in the RHEL4 updates tree on public ftp sites? > > > > > > It looks like some RHEL4 packages were not propagated to the public > > > sites. The last version of system-config-cluster is older than the > > > RHEL4 version, and I'm also missing the cman-kernel and gnbd-kernel > > > packages for the -42.0.3.EL kernel, while the new dlm-kernel and > > > GFS-kernel packages do exist. > > > > Ugh, are they still "not there" ? Sometimes, the packages are a bit > > behind. > > The kernel modules in the meantime arrived, system-config-cluster 1.0.27 > (this was an U4 update) never appeared... :-( > Until it appears, I've placed it here: http://people.redhat.com/lhh/system-config-cluster-1.0.27-1.0.src.rpm -- Lon From jos at xos.nl Wed Nov 1 21:09:40 2006 From: jos at xos.nl (Jos Vos) Date: Wed, 1 Nov 2006 22:09:40 +0100 Subject: [Linux-cluster] RHEL4 cluster source RPMs incomplete In-Reply-To: <1162415044.4518.400.camel@rei.boston.devel.redhat.com>; from lhh@redhat.com on Wed, Nov 01, 2006 at 04:04:04PM -0500 References: <200610121452.k9CEqYg13019@xos037.xos.nl> <1161873875.4518.82.camel@rei.boston.devel.redhat.com> <20061026165124.A10943@xos037.xos.nl> <1162415044.4518.400.camel@rei.boston.devel.redhat.com> Message-ID: <20061101220940.A4046@xos037.xos.nl> On Wed, Nov 01, 2006 at 04:04:04PM -0500, Lon Hohberger wrote: > Until it appears, I've placed it here: > > http://people.redhat.com/lhh/system-config-cluster-1.0.27-1.0.src.rpm Thanks, but it already has appeared on the public ftp sites. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From dist-list at LEXUM.UMontreal.CA Thu Nov 2 13:58:13 2006 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Thu, 02 Nov 2006 08:58:13 -0500 Subject: [Linux-cluster] LVS Director + TG3 = problem ? Message-ID: <4549F975.3000706@lexum.umontreal.ca> Hello, With RHEL 4.4 CS : My director : HP DL360 (TG3 RH module), 2 Gb NIC (bonding mode=0) behind a PIX firewall (100 Mb NIC). All servers connected to Gb CISCO switches I'm having a hard time with this server. the LVS seems to be ok (web sites are available for users ;) ). But, in the logs, I have a lot of nanny disabling real servers and then enabling them 1 second later. And when connecting using SSH, I have a connection time out or when the connection succeed, the ssh session freezes after several minutes. I talked to CISCO admin and traffic is not crazy in the director ports. I read in the past that the TG3 is not recommended by HP. Do you have that kind of issue with the mix hp broadcom NIC + tg3 RH module ? 
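In case it helps anyone compare, the driver actually bound and the per-slave
link-failure counts can be read with something like this (eth0 and bond0 are
only example interface names, substitute your own):

    ethtool -i eth0                           # driver name and version in use
    ethtool -S eth0 | grep -i -E 'err|drop'   # NIC error/drop counters
    cat /proc/net/bonding/bond0               # per-slave status and link failure count
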
Trying to tune the TCP stack, I add this in the sysctl.conf : # This setting changes the maximum network receive buffer size net.core.rmem_max= 16777216 # This setting changes the maximum network send buffer size net.core.wmem_max= 16777216 # This sets the kernel's minimum, default, and maximum TCP receive buffer sizes. net.ipv4.tcp_rmem=4096 87380 16777216 # This sets the kernel's minimum, default, and maximum TCP send buffer sizes. net.ipv4.tcp_wmem=4096 65536 16777216 Here is a ipvsadm output : Virtual Server version 1.2.0 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP 192.168.3.157:443 wlc persistent 300 mask 255.255.255.0 -> 192.168.3.29:443 Route 3 0 0 -> 192.168.3.25:443 Route 1 0 0 -> 192.168.3.27:443 Route 1 0 0 -> 192.168.3.26:443 Route 1 0 0 -> 192.168.3.24:443 Route 1 0 0 TCP 192.168.3.158:443 wlc persistent 300 mask 255.255.255.0 -> 192.168.3.25:443 Route 1 0 0 -> 192.168.3.27:443 Route 1 0 0 -> 192.168.3.26:443 Route 1 0 0 -> 192.168.3.24:443 Route 1 0 0 TCP 192.168.3.159:443 wlc persistent 300 mask 255.255.255.0 -> 192.168.3.29:443 Route 3 0 0 -> 192.168.3.25:443 Route 1 0 0 -> 192.168.3.27:443 Route 1 0 0 -> 192.168.3.26:443 Route 1 0 0 -> 192.168.3.24:443 Route 1 0 0 TCP 192.168.3.136:443 wlc persistent 300 -> 192.168.3.25:443 Route 1 0 1 -> 192.168.3.27:443 Route 1 0 0 -> 192.168.3.24:443 Route 1 0 0 -> 192.168.3.26:443 Route 1 0 0 TCP 192.168.3.147:80 wlc -> 192.168.3.24:80 Route 1 2 218 -> 192.168.3.25:80 Route 1 2 252 -> 192.168.3.26:80 Route 1 3 252 -> 192.168.3.27:80 Route 1 2 217 TCP 192.168.3.158:80 wlc -> 192.168.3.24:80 Route 1 0 47 -> 192.168.3.25:80 Route 1 0 40 -> 192.168.3.26:80 Route 1 0 48 -> 192.168.3.27:80 Route 1 0 47 TCP 192.168.3.159:80 wlc -> 192.168.3.24:80 Route 1 0 2 -> 192.168.3.27:80 Route 1 0 1 -> 192.168.3.26:80 Route 1 0 1 -> 192.168.3.25:80 Route 1 0 1 TCP 192.168.3.156:80 wlc -> 192.168.3.24:80 Route 1 0 0 -> 192.168.3.25:80 Route 1 0 0 -> 192.168.3.27:80 Route 1 0 0 -> 192.168.3.26:80 Route 1 0 0 TCP 192.168.3.157:80 wlc -> 192.168.3.25:80 Route 1 1 31 -> 192.168.3.24:80 Route 1 0 50 -> 192.168.3.26:80 Route 1 0 50 -> 192.168.3.27:80 Route 1 0 48 TCP 192.168.3.135:80 wlc -> 192.168.3.25:80 Route 1 0 0 -> 192.168.3.24:80 Route 1 0 1 -> 192.168.3.27:80 Route 1 0 1 -> 192.168.3.26:80 Route 1 0 0 TCP 192.168.3.132:80 wlc -> 192.168.3.24:80 Route 1 0 1 -> 192.168.3.25:80 Route 1 0 1 -> 192.168.3.27:80 Route 1 0 2 -> 192.168.3.26:80 Route 1 0 1 TCP 192.168.3.133:80 wlc -> 192.168.3.25:80 Route 1 0 0 -> 192.168.3.24:80 Route 1 0 0 -> 192.168.3.27:80 Route 1 0 0 From lists at brimer.org Thu Nov 2 14:26:07 2006 From: lists at brimer.org (Barry Brimer) Date: Thu, 2 Nov 2006 08:26:07 -0600 (CST) Subject: [Linux-cluster] LVS Director + TG3 = problem ? In-Reply-To: <4549F975.3000706@lexum.umontreal.ca> References: <4549F975.3000706@lexum.umontreal.ca> Message-ID: > I read in the past that the TG3 is not recommended by HP. Do you have that > kind of issue with the mix hp broadcom NIC + tg3 RH module ? In the past, HP recommended the bcm5700 driver. I also recommend it. In the last several months, HP has been posting their own version of the tg3 driver, although I have not tried it. I would assume that HP now recommends the use of their tg3 driver above all else. Barry From lhh at redhat.com Thu Nov 2 14:37:41 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 02 Nov 2006 09:37:41 -0500 Subject: [Linux-cluster] LVS Director + TG3 = problem ? 
In-Reply-To: <4549F975.3000706@lexum.umontreal.ca> References: <4549F975.3000706@lexum.umontreal.ca> Message-ID: <1162478261.4518.421.camel@rei.boston.devel.redhat.com> On Thu, 2006-11-02 at 08:58 -0500, FM wrote: > Hello, > With RHEL 4.4 CS : > My director : HP DL360 (TG3 RH module), 2 Gb NIC (bonding mode=0) behind > a PIX firewall (100 Mb NIC). All servers connected to Gb CISCO switches > > I'm having a hard time with this server. the LVS seems to be ok (web > sites are available for users ;) ). But, in the logs, I have a lot of > nanny disabling real servers and then enabling them 1 second later. And > when connecting using SSH, I have a connection time out or when the > connection succeed, the ssh session freezes after several minutes. I > talked to CISCO admin and traffic is not crazy in the director ports. > > I read in the past that the TG3 is not recommended by HP. Do you have > that kind of issue with the mix hp broadcom NIC + tg3 RH module ? Hi, I haven't experienced the same problem, but occasionally, my tg3 boxes don't acquire their DHCP addresses on boot because the link is (reportedly) not detected. At this point, I have to log in via remote console and manually try to bring up the interface (sometimes it takes 2-3 tries). Also, sometimes shell sessions pause for 3-5 seconds for no good reason. I haven't bothered to trace the problem further, because for me, it's just a minor annoyance. Maybe it's related? *shrug* -- Lon From isplist at logicore.net Thu Nov 2 17:25:55 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 11:25:55 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? Message-ID: <2006112112555.242286@leena> From what I've seen, LVS can act as an NAT gateway, direct routing or tunneling. The LVS machines are already under NAT under the firewall's. I cannot change my web servers to point to LVS as this causes too many complications but, each server does have two Ethernet interfaces. I cannot really use NAT as the machines are already under NAT to my firewall's so, what's the best way of dealing with this without creating another sub networking within? Mike From chawkins at bplinux.com Thu Nov 2 17:34:15 2006 From: chawkins at bplinux.com (Christopher Hawkins) Date: Thu, 2 Nov 2006 12:34:15 -0500 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <2006112112555.242286@leena> Message-ID: <200611021713.kA2HDEAA004930@mail2.ontariocreditcorp.com> > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > isplist at logicore.net > Sent: Thursday, November 02, 2006 12:26 PM > To: linux-cluster > Subject: [Linux-cluster] LVS: Not as a gateway? > > From what I've seen, LVS can act as an NAT gateway, direct > routing or tunneling. > > The LVS machines are already under NAT under the firewall's. > > I cannot change my web servers to point to LVS as this causes > too many complications but, each server does have two > Ethernet interfaces. > > I cannot really use NAT as the machines are already under NAT > to my firewall's so, what's the best way of dealing with this > without creating another sub networking within? > > Mike > You would probably want direct routing. "Direct" means on the same network as the director, and able to use the same gateway to the outside world. 
An outside client would access services by sending a packet to your firewall, which would forward it to the director, then the director would choose an LVS "real server" to send it to for processing, and then the real server that got it would reply "directly" to the client without further intervention from the director machine. Chris From isplist at logicore.net Thu Nov 2 17:45:24 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 11:45:24 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <200611021713.kA2HDEAA004930@mail2.ontariocreditcorp.com> Message-ID: <2006112114524.012422@leena> > You would probably want direct routing. "Direct" means on the same network > as the director, and able to use the same gateway to the outside world. Yes, it's how I've got it set up. Only problem is, the web servers need to see the LVS as their gateways no? The last error I seem to have to conquer is; /// Nov 2 11:24:49 lb52 nanny[3652]: READ to 192.168.1.94:80 timed out Nov 2 11:24:52 lb52 nanny[3650]: READ to 192.168.1.92:80 timed out Nov 2 11:24:54 lb52 nanny[3651]: READ to 192.168.1.93:80 timed out /// Just seems complicated as heck. Here I have firewall's taking care of NAT. Connections come into the network as real IP's, then are sent to the various machines which are NAT'd. So, if using LVS which are NAT'd under the firewall's, there's a double weirdness there. Not just in the NAT itself but in how cache, session and other services end up acting. I could change the LVS's to real IP's still protected by the firewall's I guess. > outside client would access services by sending a packet to your firewall, > which would forward it to the director, then the director would choose an > LVS "real server" to send it to for processing, and then the real server > that got it would reply "directly" to the client without further > intervention from the director machine. Oh, I was not sure about this then. From what I've read, it seemed that the LVS remains in the path once it is used. That would be fine so guess I just need to solve this problem first. Thanks. Mike From isplist at logicore.net Thu Nov 2 17:51:21 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 11:51:21 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <200611021713.kA2HDEAA004930@mail2.ontariocreditcorp.com> Message-ID: <2006112115121.994624@leena> What is confusing me is that everything on my network is behind the firewall, so NAT to begin with. Yet, the LVS setup requires a public address and a private one. Both LVS machines do have two NIC's. Both LVS and real servers use the same gateway, 192.168.1.1 for example. The client connections are sent directly to the real servers by using 'DIRECT' connections, I've set up LVS with 192.168.1.150 as the public floating IP that clients connect to. Yet, I keep getting these errors I noted in another reply. What am I missing here? I've not set up anything outside of LVS however because that's not clear to me. Should I be setting up 192.168.1.150 on my second LVS interface manually? Mike > You would probably want direct routing. "Direct" means on the same network > as the director, and able to use the same gateway to the outside world. 
An > outside client would access services by sending a packet to your firewall, > which would forward it to the director, then the director would choose an > LVS "real server" to send it to for processing, and then the real server > that got it would reply "directly" to the client without further > intervention from the director machine. > > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From johnson.eric at gmail.com Thu Nov 2 17:51:34 2006 From: johnson.eric at gmail.com (Eric Johnson) Date: Thu, 2 Nov 2006 12:51:34 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? Message-ID: Hi - I must be very dumb. I'm unable to do what I would consider to be the "Hello World" of user mode dlm. [ejohnson at gfsa01:~/projects/dlm-simple] cat foo.cpp #include #include #include #include #include extern "C" { #include } #include main(int argc, char **argv) { printf("hello world\n"); char lockName[] = "foobar" ; int initStatus = dlm_pthread_init(); printf("initStatus: %d, errno: %d\n",initStatus,errno); int lockid =0 ; int status = lock_resource( lockName, LKM_EXMODE, 0, &lockid ) ; printf("status: %d, errno: %d\n",status,errno); } [ejohnson at gfsa01:~/projects/dlm-simple] touch foo.cpp [ejohnson at gfsa01:~/projects/dlm-simple] make g++ -D_REENTRANT foo.cpp -o foo -ldlm -lpthread ./foo/home/dev/util/linux64/bin/ld: skipping incompatible /home/dev/util/linux64/lib/gcc/x86_64-unknown-linux-gnu/4.1.1/libstdc++.a when searching for -lstdc++ [ejohnson at gfsa01:~/projects/dlm-simple] ./foo hello world initStatus: -1, errno: 13 status: -1, errno: 13 I have googled around endlessly and been unable to find any insight as to what it is that I'm missing. I'll gladly take a terse - "Hey knucklehead, go read this page in more detail." -Eric From jwhiter at redhat.com Thu Nov 2 18:05:34 2006 From: jwhiter at redhat.com (Josef Whiter) Date: Thu, 2 Nov 2006 13:05:34 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: References: Message-ID: <20061102180533.GB4283@korben.rdu.redhat.com> Run it as root. Josef On Thu, Nov 02, 2006 at 12:51:34PM -0500, Eric Johnson wrote: > Hi - > > I must be very dumb. I'm unable to do what I would consider to be the > "Hello World" of user mode dlm. > > [ejohnson at gfsa01:~/projects/dlm-simple] cat foo.cpp > #include > #include > #include > #include > #include > > extern "C" { > #include > } > #include > > main(int argc, char **argv) > { > printf("hello world\n"); > char lockName[] = "foobar" ; > > int initStatus = dlm_pthread_init(); > printf("initStatus: %d, errno: %d\n",initStatus,errno); > > int lockid =0 ; > int status = lock_resource( lockName, LKM_EXMODE, 0, &lockid ) ; > printf("status: %d, errno: %d\n",status,errno); > } > [ejohnson at gfsa01:~/projects/dlm-simple] touch foo.cpp > > [ejohnson at gfsa01:~/projects/dlm-simple] make > g++ -D_REENTRANT foo.cpp -o foo -ldlm -lpthread > ./foo/home/dev/util/linux64/bin/ld: skipping incompatible > /home/dev/util/linux64/lib/gcc/x86_64-unknown-linux-gnu/4.1.1/libstdc++.a > when searching for -lstdc++ > > [ejohnson at gfsa01:~/projects/dlm-simple] ./foo > hello world > initStatus: -1, errno: 13 > status: -1, errno: 13 > > I have googled around endlessly and been unable to find any insight as > to what it is that I'm missing. > > I'll gladly take a terse - "Hey knucklehead, go read this page in more > detail." 
> > -Eric > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From johnson.eric at gmail.com Thu Nov 2 18:05:36 2006 From: johnson.eric at gmail.com (Eric Johnson) Date: Thu, 2 Nov 2006 13:05:36 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: <20061102180533.GB4283@korben.rdu.redhat.com> References: <20061102180533.GB4283@korben.rdu.redhat.com> Message-ID: It works as root. I know that much. I guess I thought it was possible to hit the dlm as just a plain user. No? -Eric On 11/2/06, Josef Whiter wrote: > Run it as root. > > Josef > > On Thu, Nov 02, 2006 at 12:51:34PM -0500, Eric Johnson wrote: > > Hi - > > > > I must be very dumb. I'm unable to do what I would consider to be the > > "Hello World" of user mode dlm. > > > > [ejohnson at gfsa01:~/projects/dlm-simple] cat foo.cpp > > #include > > #include > > #include > > #include > > #include > > > > extern "C" { > > #include > > } > > #include > > > > main(int argc, char **argv) > > { > > printf("hello world\n"); > > char lockName[] = "foobar" ; > > > > int initStatus = dlm_pthread_init(); > > printf("initStatus: %d, errno: %d\n",initStatus,errno); > > > > int lockid =0 ; > > int status = lock_resource( lockName, LKM_EXMODE, 0, &lockid ) ; > > printf("status: %d, errno: %d\n",status,errno); > > } > > [ejohnson at gfsa01:~/projects/dlm-simple] touch foo.cpp > > > > [ejohnson at gfsa01:~/projects/dlm-simple] make > > g++ -D_REENTRANT foo.cpp -o foo -ldlm -lpthread > > ./foo/home/dev/util/linux64/bin/ld: skipping incompatible > > /home/dev/util/linux64/lib/gcc/x86_64-unknown-linux-gnu/4.1.1/libstdc++.a > > when searching for -lstdc++ > > > > [ejohnson at gfsa01:~/projects/dlm-simple] ./foo > > hello world > > initStatus: -1, errno: 13 > > status: -1, errno: 13 > > > > I have googled around endlessly and been unable to find any insight as > > to what it is that I'm missing. > > > > I'll gladly take a terse - "Hey knucklehead, go read this page in more > > detail." > > > > -Eric > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From johnson.eric at gmail.com Thu Nov 2 17:51:34 2006 From: johnson.eric at gmail.com (Eric Johnson) Date: Thu, 2 Nov 2006 12:51:34 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? Message-ID: Hi - I must be very dumb. I'm unable to do what I would consider to be the "Hello World" of user mode dlm. 
[ejohnson at gfsa01:~/projects/dlm-simple] cat foo.cpp #include #include #include #include #include extern "C" { #include } #include main(int argc, char **argv) { printf("hello world\n"); char lockName[] = "foobar" ; int initStatus = dlm_pthread_init(); printf("initStatus: %d, errno: %d\n",initStatus,errno); int lockid =0 ; int status = lock_resource( lockName, LKM_EXMODE, 0, &lockid ) ; printf("status: %d, errno: %d\n",status,errno); } [ejohnson at gfsa01:~/projects/dlm-simple] touch foo.cpp [ejohnson at gfsa01:~/projects/dlm-simple] make g++ -D_REENTRANT foo.cpp -o foo -ldlm -lpthread ./foo/home/dev/util/linux64/bin/ld: skipping incompatible /home/dev/util/linux64/lib/gcc/x86_64-unknown-linux-gnu/4.1.1/libstdc++.a when searching for -lstdc++ [ejohnson at gfsa01:~/projects/dlm-simple] ./foo hello world initStatus: -1, errno: 13 status: -1, errno: 13 I have googled around endlessly and been unable to find any insight as to what it is that I'm missing. I'll gladly take a terse - "Hey knucklehead, go read this page in more detail." -Eric From jwhiter at redhat.com Thu Nov 2 18:15:44 2006 From: jwhiter at redhat.com (Josef Whiter) Date: Thu, 2 Nov 2006 13:15:44 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: References: <20061102180533.GB4283@korben.rdu.redhat.com> Message-ID: <20061102181543.GC4283@korben.rdu.redhat.com> AFAIK not yet, I know the work has been done, but if anything its in CVS, it hasn't been actually released yet. Josef On Thu, Nov 02, 2006 at 01:05:36PM -0500, Eric Johnson wrote: > It works as root. I know that much. I guess I thought it was > possible to hit the dlm as just a plain user. No? > > -Eric > > On 11/2/06, Josef Whiter wrote: > >Run it as root. > > > >Josef > > > >On Thu, Nov 02, 2006 at 12:51:34PM -0500, Eric Johnson wrote: > >> Hi - > >> > >> I must be very dumb. I'm unable to do what I would consider to be the > >> "Hello World" of user mode dlm. > >> > >> [ejohnson at gfsa01:~/projects/dlm-simple] cat foo.cpp > >> #include > >> #include > >> #include > >> #include > >> #include > >> > >> extern "C" { > >> #include > >> } > >> #include > >> > >> main(int argc, char **argv) > >> { > >> printf("hello world\n"); > >> char lockName[] = "foobar" ; > >> > >> int initStatus = dlm_pthread_init(); > >> printf("initStatus: %d, errno: %d\n",initStatus,errno); > >> > >> int lockid =0 ; > >> int status = lock_resource( lockName, LKM_EXMODE, 0, &lockid ) ; > >> printf("status: %d, errno: %d\n",status,errno); > >> } > >> [ejohnson at gfsa01:~/projects/dlm-simple] touch foo.cpp > >> > >> [ejohnson at gfsa01:~/projects/dlm-simple] make > >> g++ -D_REENTRANT foo.cpp -o foo -ldlm -lpthread > >> ./foo/home/dev/util/linux64/bin/ld: skipping incompatible > >> /home/dev/util/linux64/lib/gcc/x86_64-unknown-linux-gnu/4.1.1/libstdc++.a > >> when searching for -lstdc++ > >> > >> [ejohnson at gfsa01:~/projects/dlm-simple] ./foo > >> hello world > >> initStatus: -1, errno: 13 > >> status: -1, errno: 13 > >> > >> I have googled around endlessly and been unable to find any insight as > >> to what it is that I'm missing. > >> > >> I'll gladly take a terse - "Hey knucklehead, go read this page in more > >> detail." 
> >> > >> -Eric > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Thu Nov 2 18:45:31 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 2 Nov 2006 12:45:31 -0600 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: References: <20061102180533.GB4283@korben.rdu.redhat.com> Message-ID: <20061102184531.GF9431@redhat.com> On Thu, Nov 02, 2006 at 01:05:36PM -0500, Eric Johnson wrote: > It works as root. I know that much. I guess I thought it was > possible to hit the dlm as just a plain user. No? I'm not sure... maybe if you create your own private lockspace to use, which would mean not using the "default" lockspace routines like lock_resource(). I remember Patrick adding a fix a while back related to permissions... Dave From johnson.eric at gmail.com Thu Nov 2 19:07:25 2006 From: johnson.eric at gmail.com (Eric Johnson) Date: Thu, 2 Nov 2006 14:07:25 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: <20061102184531.GF9431@redhat.com> References: <20061102180533.GB4283@korben.rdu.redhat.com> <20061102184531.GF9431@redhat.com> Message-ID: Should dlm_pthread_init be affected as well? I'm unsure if that function is intended to be just for mucking with the "default" lockspace or for any locking function such as creating my own lockspace. -Eric On 11/2/06, David Teigland wrote: > On Thu, Nov 02, 2006 at 01:05:36PM -0500, Eric Johnson wrote: > > It works as root. I know that much. I guess I thought it was > > possible to hit the dlm as just a plain user. No? > > I'm not sure... maybe if you create your own private lockspace to use, > which would mean not using the "default" lockspace routines like > lock_resource(). I remember Patrick adding a fix a while back related to > permissions... > > Dave > > From teigland at redhat.com Thu Nov 2 19:18:49 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 2 Nov 2006 13:18:49 -0600 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: References: <20061102180533.GB4283@korben.rdu.redhat.com> <20061102184531.GF9431@redhat.com> Message-ID: <20061102191849.GG9431@redhat.com> On Thu, Nov 02, 2006 at 02:07:25PM -0500, Eric Johnson wrote: > Should dlm_pthread_init be affected as well? I'm unsure if that > function is intended to be just for mucking with the "default" > lockspace or for any locking function such as creating my own > lockspace. Looking at the examples in cluster/dlm/tests/usertest/ I think that you'll wan't dlm_ls_pthread_init() and not dlm_pthread_init() after you've created your lockspace. Dave From pbruna at it-linux.cl Thu Nov 2 19:46:00 2006 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Thu, 2 Nov 2006 16:46:00 -0300 (CLST) Subject: [Linux-cluster] CS and DRBD Message-ID: <23500065.31162496760586.JavaMail.root@lisa.it-linux.cl> Anyone has a working script for use DRBD with Cluster Suite? I need to use DRBD, but i dont know how to make it play with CS. I now CS can mount partitions, but i dont know if CS can mark a drbd device as primary. regards -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From isplist at logicore.net Thu Nov 2 20:50:34 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 14:50:34 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <200611021713.kA2HDEAA004930@mail2.ontariocreditcorp.com> Message-ID: <2006112145034.775003@leena> Maybe someone who's running DIRECT and all of the same internap IP's. can send me their lvs.cfg? I'm stumped. On Thu, 2 Nov 2006 12:34:15 -0500, Christopher Hawkins wrote: >> -----Original Message----- >> >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of >> isplist at logicore.net >> Sent: Thursday, November 02, 2006 12:26 PM >> To: linux-cluster >> Subject: [Linux-cluster] LVS: Not as a gateway? >> >> From what I've seen, LVS can act as an NAT gateway, direct >> routing or tunneling. >> >> The LVS machines are already under NAT under the firewall's. >> >> I cannot change my web servers to point to LVS as this causes >> too many complications but, each server does have two >> Ethernet interfaces. >> >> I cannot really use NAT as the machines are already under NAT >> to my firewall's so, what's the best way of dealing with this >> without creating another sub networking within? >> >> Mike >> > > You would probably want direct routing. "Direct" means on the same network > as the director, and able to use the same gateway to the outside world. An > outside client would access services by sending a packet to your firewall, > which would forward it to the director, then the director would choose an > LVS "real server" to send it to for processing, and then the real server > that got it would reply "directly" to the client without further > intervention from the director machine. > > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gsml at netops.gvtc.com Thu Nov 2 20:56:44 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Thu, 02 Nov 2006 14:56:44 -0600 Subject: [Linux-cluster] fc6 two-node cluster with gfs2 not working Message-ID: <454A5B8C.40508@netops.gvtc.com> So i've got two dell blades setup, with multipath. they even cluster together, but once one is up and has the gfs mounted the other can't start the gfs2 service. I'm basing my setup on how I was setting up gfs w/ rhel4, i realize this newer way has some more niceties to it, and i must be doing something wrong, but i am not seeing much documentation on the differences so i am just trying to pull this off this way. Basic rundown of setup: decently minimal non-X install local drives are not lvm'd (which btw, if you have 2 boxes setup differently, one w/ lvm and one w/o it makes clvm a pita) yum update yum install screen ntp cman lvm2-cluster gfs2-utils put on good firewall config (or turn it off, both behave the same) selinux turned down to permissive see attached multipath.conf and cluster.conf after updating the multipath.conf i do this: mkinitrd -f /boot/initrd-`uname -r` `uname -r` init 6; exit modprobe dm-multipath modprobe dm-round-robin service multipathd start that part looks just fine. then after updating the cluster.conf on each node i do 'ccs_tool addnodeids' (it said to do this when i tried to start cman the first time). then service cman start everything looks fine, pvcreate, vgcreate, lvcreate, mkfs.gfs2, voila we have a gfs formatted drive visible on both systems. i add the /etc/fstab entry, and create the mount point. next i start clvmd, then gfs2. 
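A condensed sketch of the bring-up sequence described above, using the volume group, logical volume, lock table and mount point names that appear later in this thread (pri_outMail, pri_outMail_lv0, outMail:data, /mnt/data); the multipath device name and the size are assumptions, and mkfs.gfs2 must only be run from one node:

  service cman start                                    # on both nodes
  pvcreate /dev/mapper/mpath0                           # shared multipath device (placeholder name)
  vgcreate pri_outMail /dev/mapper/mpath0
  lvcreate --size 100G --name pri_outMail_lv0 pri_outMail
  mkfs.gfs2 -p lock_dlm -t outMail:data -j 2 /dev/pri_outMail/pri_outMail_lv0   # one node only, one journal per node
  mkdir -p /mnt/data
  echo "/dev/pri_outMail/pri_outMail_lv0 /mnt/data gfs2 defaults 0 0" >> /etc/fstab
  service clvmd start                                   # then, on each node in turn
  service gfs2 start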
the first box starts gfs2 just fine, second won't, it hangs at this (from var/log/messages): Nov 1 22:41:07 box2 kernel: audit(1162442467.427:150): avc: denied { connectto } for pid=3724 comm="mount.gfs2" path=006766735F636F6E74726F6C645F736F6$ Nov 1 22:41:07 box2 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "outMail:data" Nov 1 22:41:07 box2 kernel: audit(1162442467.451:151): avc: denied { search } for pid=3724 comm="mount.gfs2" name="dlm" dev=debugfs ino=13186 scontext$ Nov 1 22:41:07 box2 kernel: dlm: data: recover 1 Nov 1 22:41:07 box2 kernel: GFS2: fsid=outMail:data.1: Joined cluster. Now mounting FS... Nov 1 22:41:07 box2 kernel: dlm: data: add member 1 Nov 1 22:41:07 box2 kernel: dlm: data: add member 2 Nov 1 22:49:07 box2 gfs_controld[3639]: mount: failed -17 Remember it is set to permissive. So I shut down the box that came up fine on its own, manually enabled the services on box2 (the box that wasnt coming up) and it works fine. Turned on the box1, and at boot it is hanging at the same place box2 was. I also realize that a 2 node cluster is not prefered, but its what i'm setting up, what i have access to at the moment, and honestly i'm not sure that i believe a 3rd box would help (but it might). Any suggestions? -greg -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? -anonymous -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cluster.conf.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: multipath.conf.txt URL: From teigland at redhat.com Thu Nov 2 21:28:48 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 2 Nov 2006 15:28:48 -0600 Subject: [Linux-cluster] fc6 two-node cluster with gfs2 not working In-Reply-To: <454A5B8C.40508@netops.gvtc.com> References: <454A5B8C.40508@netops.gvtc.com> Message-ID: <20061102212848.GA18926@redhat.com> On Thu, Nov 02, 2006 at 02:56:44PM -0600, Greg Swift wrote: > the first box starts gfs2 just fine, second won't, it hangs at this > (from var/log/messages): > > Nov 1 22:41:07 box2 kernel: audit(1162442467.427:150): avc: denied { > connectto } for pid=3724 comm="mount.gfs2" > path=006766735F636F6E74726F6C645F736F6$ > Nov 1 22:41:07 box2 kernel: GFS2: fsid=: Trying to join cluster > "lock_dlm", "outMail:data" > Nov 1 22:41:07 box2 kernel: audit(1162442467.451:151): avc: denied { > search } for pid=3724 comm="mount.gfs2" name="dlm" dev=debugfs ino=13186 > scontext$ > Nov 1 22:41:07 box2 kernel: dlm: data: recover 1 > Nov 1 22:41:07 box2 kernel: GFS2: fsid=outMail:data.1: Joined cluster. > Now mounting FS... > Nov 1 22:41:07 box2 kernel: dlm: data: add member 1 > Nov 1 22:41:07 box2 kernel: dlm: data: add member 2 > Nov 1 22:49:07 box2 gfs_controld[3639]: mount: failed -17 > > > Remember it is set to permissive. > > So I shut down the box that came up fine on its own, manually enabled > the services on box2 (the box that wasnt coming up) and it works fine. > Turned on the box1, and at boot it is hanging at the same place box2 was. > > I also realize that a 2 node cluster is not prefered, but its what i'm > setting up, what i have access to at the moment, and honestly i'm not > sure that i believe a 3rd box would help (but it might). A couple things about your cluster.conf 1. You probably want to set a post_join_delay of around 10 to avoid fencing at startup time. e.g. 2. 
It's "fencedevices" and "fencedevice", no "_". With both machines in a clean/reset state, do service cman start on both, then start clvmd on both, then _before_ doing anything with gfs, check the status; on both nodes run: $ cman_tool status $ cman_tool nodes $ group_tool -v then do "mount -v /dev/foo /dir" on one node then do group_tool -v on that node then do "mount -v /dev/foo /dir" on the other node then do group_tool -v on _both_ nodes Send the output of all that and we'll try to see where things are going off track. Dave From chawkins at bplinux.com Thu Nov 2 21:52:56 2006 From: chawkins at bplinux.com (Christopher Hawkins) Date: Thu, 2 Nov 2006 16:52:56 -0500 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <2006112145034.775003@leena> Message-ID: <200611022131.kA2LVsAA012346@mail2.ontariocreditcorp.com> > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > isplist at logicore.net > Sent: Thursday, November 02, 2006 3:51 PM > To: linux-cluster > Subject: RE: [Linux-cluster] LVS: Not as a gateway? > > Maybe someone who's running DIRECT and all of the same > internap IP's. can send me their lvs.cfg? I'm stumped. >Yes, it's how I've got it set up. Only problem is, the web servers need to see the LVS as their gateways no? No. I'm not familiar with the redhat way of doing this, but I know LVS. So if I trample on accepted redhat wisdom someone please correct me.... But straight, cross-platform LVS works like this: The real servers do not even need to know they are part of an LVS cluster. They can be setup like any typical server providing a service, like httpd or whatever, and then they get the virtual IP bound to their loopback interface so that they will accept packets with that address. However, they do not respond to ARP requests, therefore that IP can exist on all servers in this cluster and not cause communication problems. Ie., if you were to ping that address, only the director would respond. I have configured real servers before with just these three lines of code: ifconfig lo:1 $VIP netmask 255.255.255.255 up echo 1 > /proc/sys/net/ipv4/conf/all/hidden echo 1 > /proc/sys/net/ipv4/conf/lo/hidden That's it. If a real server can accept packets addressed to the virtual IP and does not respond to ARP on that IP, it's done, all set up. If nanny requires more, then maybe there is more, but that's the guts of the LVS stuff on real servers. The LVS director (and it's partner if there are two directors) accept all packets that are sent to the floating IP, whether they get there by NAT through a firewall, or directly from a machine on the same network, or anywhere else. Say I want a web page from this cluster. I send an http request to your floating IP. The director gets it, chooses a real server, and then forwards the packet there. The real server serves the request and replies directly to the client, using the virtual IP as the source in the packet header, such that client never realizes that it dealt with more than one machine. From the client side, it sent an http request to your virtual IP (hypothetically) 10.10.10.9 which got accepted by the director and it got a packet back from 10.10.10.9 (which is bound to the loopback on the real server, so it can use that address as its source). And voila, a transparent load balanced cluster. I can't answer a redhat specific question because I don't know the answer, but maybe this will help you diagnose what is not working. 
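To make the direct-routing mechanics above concrete, here is a minimal sketch of the equivalent director-side configuration in raw ipvsadm terms; piranha/pulse normally generates these rules itself, the 10.10.10.9 VIP is the hypothetical address from the paragraph above, and the 192.168.1.9x real servers are taken from the nanny log earlier in this thread:

  ipvsadm -A -t 10.10.10.9:80 -s wlc                    # define the virtual HTTP service on the VIP
  ipvsadm -a -t 10.10.10.9:80 -r 192.168.1.92:80 -g     # -g selects direct routing ("gatewaying")
  ipvsadm -a -t 10.10.10.9:80 -r 192.168.1.93:80 -g
  ipvsadm -a -t 10.10.10.9:80 -r 192.168.1.94:80 -g
  ipvsadm -L -n                                         # verify the virtual service and its real servers

The real servers themselves need nothing beyond the loopback VIP and the ARP hiding already shown above.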
Chris From gsml at netops.gvtc.com Thu Nov 2 22:48:24 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Thu, 02 Nov 2006 16:48:24 -0600 Subject: [Linux-cluster] fc6 two-node cluster with gfs2 not working In-Reply-To: <20061102223359.GC18926@redhat.com> References: <454A5B8C.40508@netops.gvtc.com> <20061102212848.GA18926@redhat.com> <454A6A11.6020406@netops.gvtc.com> <20061102223359.GC18926@redhat.com> Message-ID: <454A75B8.70904@netops.gvtc.com> David Teigland wrote: > On Thu, Nov 02, 2006 at 03:58:41PM -0600, Greg Swift wrote: > >>>> Nov 1 22:49:07 box2 gfs_controld[3639]: mount: failed -17 >>>> > > >> [root at goumang ~]# mount -v /mnt/data >> /sbin/mount.gfs2: mount /dev/pri_outMail/pri_outMail_lv0 /mnt/data >> /sbin/mount.gfs2: parse_opts: opts = "rw" >> /sbin/mount.gfs2: clear flag 1 for "rw", flags = 0 >> /sbin/mount.gfs2: parse_opts: flags = 0 >> /sbin/mount.gfs2: parse_opts: extra = "" >> /sbin/mount.gfs2: parse_opts: hostdata = "" >> /sbin/mount.gfs2: parse_opts: lockproto = "" >> /sbin/mount.gfs2: parse_opts: locktable = "" >> /sbin/mount.gfs2: message to gfs_controld: asking to join mountgroup: >> /sbin/mount.gfs2: write "join /mnt/data gfs2 lock_dlm outMail:data rw" >> /sbin/mount.gfs2: setup_mount_error_fd 4 5 >> /sbin/mount.gfs2: message from gfs_controld: response to join request: >> /sbin/mount.gfs2: lock_dlm_join: read "0" >> /sbin/mount.gfs2: message from gfs_controld: mount options: >> /sbin/mount.gfs2: lock_dlm_join: read "hostdata=jid=1:id=65538:first=0" >> /sbin/mount.gfs2: lock_dlm_join: hostdata: "hostdata=jid=1:id=65538:first=0" >> /sbin/mount.gfs2: lock_dlm_join: extra_plus: "hostdata=jid=1:id=65538:first=0" >> > > All the cluster infrastructure appears to be working ok, and no more > gfs_controld error in the syslog again I'm assuming. So, gfs on the > second node is either stuck doing i/o or it's stuck trying to get a dlm > lock. A "ps ax -o pid,stat,cmd,wchan" might show what it's blocked on. > You might also try the same thing with gfs1 (would eliminate the dlm as > the problem). It could also very well be a gfs2 or dlm bug that's been > fixed since the fc6 kernel froze -- we need to get some updates pushed > out. > > Dave > > Here is the output from the log file at the same time as what I included before. Nov 2 15:51:33 goumang kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "outMail:data" Nov 2 15:51:33 goumang kernel: dlm: data: recover 1 Nov 2 15:51:33 goumang kernel: GFS2: fsid=outMail:data.1: Joined cluster. Now mounting FS... Nov 2 15:51:33 goumang kernel: dlm: data: add member 2 Nov 2 15:51:33 goumang kernel: dlm: Initiating association with node 2 Nov 2 15:51:33 goumang kernel: dlm: data: add member 1 Nov 2 15:51:33 goumang kernel: dlm: Error sending to node 2 -32 (sorry i pulled it offlist for a minute by not hitting reply all. i re-attached output for archival purposes) -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? -anonymous -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: goumang.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: rushou.txt URL: From teigland at redhat.com Thu Nov 2 22:50:30 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 2 Nov 2006 16:50:30 -0600 Subject: [Linux-cluster] fc6 two-node cluster with gfs2 not working In-Reply-To: <454A75B8.70904@netops.gvtc.com> References: <454A5B8C.40508@netops.gvtc.com> <20061102212848.GA18926@redhat.com> <454A6A11.6020406@netops.gvtc.com> <20061102223359.GC18926@redhat.com> <454A75B8.70904@netops.gvtc.com> Message-ID: <20061102225030.GD18926@redhat.com> On Thu, Nov 02, 2006 at 04:48:24PM -0600, Greg Swift wrote: > Nov 2 15:51:33 goumang kernel: GFS2: fsid=: Trying to join cluster > "lock_dlm", "outMail:data" > Nov 2 15:51:33 goumang kernel: dlm: data: recover 1 > Nov 2 15:51:33 goumang kernel: GFS2: fsid=outMail:data.1: Joined > cluster. Now mounting FS... > Nov 2 15:51:33 goumang kernel: dlm: data: add member 2 > Nov 2 15:51:33 goumang kernel: dlm: Initiating association with node 2 > Nov 2 15:51:33 goumang kernel: dlm: data: add member 1 > Nov 2 15:51:33 goumang kernel: dlm: Error sending to node 2 -32 Ah, yes, that's the SCTP problems we've been having in the dlm. Just in the last couple days we've put back the TCP comms module and it seems to be doing better. We'll need to send some patches off for the FC6 kernel. Dave From gsml at netops.gvtc.com Thu Nov 2 23:07:51 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Thu, 02 Nov 2006 17:07:51 -0600 Subject: [Linux-cluster] fc6 two-node cluster with gfs2 not working In-Reply-To: <20061102225030.GD18926@redhat.com> References: <454A5B8C.40508@netops.gvtc.com> <20061102212848.GA18926@redhat.com> <454A6A11.6020406@netops.gvtc.com> <20061102223359.GC18926@redhat.com> <454A75B8.70904@netops.gvtc.com> <20061102225030.GD18926@redhat.com> Message-ID: <454A7A47.1090504@netops.gvtc.com> David Teigland wrote: > On Thu, Nov 02, 2006 at 04:48:24PM -0600, Greg Swift wrote: > >> Nov 2 15:51:33 goumang kernel: GFS2: fsid=: Trying to join cluster >> "lock_dlm", "outMail:data" >> Nov 2 15:51:33 goumang kernel: dlm: data: recover 1 >> Nov 2 15:51:33 goumang kernel: GFS2: fsid=outMail:data.1: Joined >> cluster. Now mounting FS... >> Nov 2 15:51:33 goumang kernel: dlm: data: add member 2 >> Nov 2 15:51:33 goumang kernel: dlm: Initiating association with node 2 >> Nov 2 15:51:33 goumang kernel: dlm: data: add member 1 >> Nov 2 15:51:33 goumang kernel: dlm: Error sending to node 2 -32 >> > > Ah, yes, that's the SCTP problems we've been having in the dlm. Just in > the last couple days we've put back the TCP comms module and it seems to > be doing better. We'll need to send some patches off for the FC6 kernel. > > Dave > > that makes me sad. i had needed/wanted to get this up before i left for a week to alleviate some other issues that require my manual intervention. ahh well.. how can i stay up on the status of that? thanks again for your assistance. -greg -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? -anonymous From isplist at logicore.net Thu Nov 2 23:08:43 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 17:08:43 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over Message-ID: <200611217843.148678@leena> One of the reasons I went with fibre channel was because of the redundancy it can give me as well as the centralized storage. 
There is one aspect of this I've not found any solutions to yet however in dealing with GFS, redundant paths. As you can see, I have 4 storage devices, three of which have dual paths in case of failure. The problem is that the system thinks this is duplicate and possibly an error as it sees it, I'm not sure. I thought I would ask before trying to put the dual paths to use. Here is an output; Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 not /dev/sda1 Found duplicate PV uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 not /dev/sdb1 Found duplicate PV 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 not /dev/sdc1 ACTIVE '/dev/VolGroup01/rimfire' [572.72 GB] inherit ACTIVE '/dev/VolGroup04/web' [318.85 GB] inherit ACTIVE '/dev/VolGroup03/qm' [745.76 GB] inherit ACTIVE '/dev/VolGroup02/sql' [745.76 GB] inherit I don't have a dual path on the VolGroup01 device because it's just manual storage I use here and then. The question is; How should I deal with the dual paths and how do I set up volumes/GFS to be fault tolerant in this respect. If one path fails, I'll want GFS to fail over to the second path to the same storage. Mike From gsml at netops.gvtc.com Thu Nov 2 23:19:02 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Thu, 02 Nov 2006 17:19:02 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <200611217843.148678@leena> References: <200611217843.148678@leena> Message-ID: <454A7CE6.6090109@netops.gvtc.com> isplist at logicore.net wrote: > One of the reasons I went with fibre channel was because of the redundancy it > can give me as well as the centralized storage. > There is one aspect of this I've not found any solutions to yet however in > dealing with GFS, redundant paths. > > As you can see, I have 4 storage devices, three of which have dual paths in > case of failure. The problem is that the system thinks this is duplicate and > possibly an error as it sees it, I'm not sure. > > I thought I would ask before trying to put the dual paths to use. Here is an > output; > > Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 not > /dev/sda1 > Found duplicate PV uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 not > /dev/sdb1 > Found duplicate PV 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 not > /dev/sdc1 > ACTIVE '/dev/VolGroup01/rimfire' [572.72 GB] inherit > ACTIVE '/dev/VolGroup04/web' [318.85 GB] inherit > ACTIVE '/dev/VolGroup03/qm' [745.76 GB] inherit > ACTIVE '/dev/VolGroup02/sql' [745.76 GB] inherit > > I don't have a dual path on the VolGroup01 device because it's just manual > storage I use here and then. > > The question is; How should I deal with the dual paths and how do I set up > volumes/GFS to be fault tolerant in this respect. If one path fails, I'll want > GFS to fail over to the second path to the same storage. > > Mike > use the device-mapper-multipath package You would do it beore you create your lvm volumes. Depending on the equipment it can be as easy as just removing the blacklist in the /etc/multipath.conf, but sometimes it might take contacting your vendor (like i had to w/ emc) to find the best setup to run the multipath cleanly. -greg -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? 
-anonymous From isplist at logicore.net Thu Nov 2 23:23:07 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 17:23:07 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <454A7CE6.6090109@netops.gvtc.com> Message-ID: <200611217237.537958@leena> Thanks. I'll look into this. Does this have to be installed on every server in the cluster? Contacting the vendor won't do me any good since I don't have support for 90% of the hardware being used :). Has to be something I can do on my own with what I've got. Mike >> The question is; How should I deal with the dual paths and how do I set up > use the device-mapper-multipath package You would do it beore you create > your lvm volumes. Depending on the equipment it can be as easy as just > removing the blacklist in the /etc/multipath.conf, but sometimes it > might take contacting your vendor (like i had to w/ emc) to find the > best setup to run the multipath cleanly. > > -greg From isplist at logicore.net Thu Nov 2 23:29:18 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 17:29:18 -0600 Subject: [Linux-cluster] device-mapper-multipath package In-Reply-To: <454A7CE6.6090109@netops.gvtc.com> Message-ID: <2006112172918.124721@leena> Ok, I see, these are utilities installed on each machine which needs a fail over mechanism. Although, there is mention of kernels so am nervous about having to start messing with kernels with custom needs for GFS and this. Have not read it all yet but I'm assuming it allows one to add storage devices later, etc. Mike > use the device-mapper-multipath package You would do it beore you create > your lvm volumes. Depending on the equipment it can be as easy as just > removing the blacklist in the /etc/multipath.conf, but sometimes it > might take contacting your vendor (like i had to w/ emc) to find the > best setup to run the multipath cleanly. > > -greg From gsml at netops.gvtc.com Thu Nov 2 23:36:17 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Thu, 02 Nov 2006 17:36:17 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <200611217237.537958@leena> References: <200611217237.537958@leena> Message-ID: <454A80F1.5030305@netops.gvtc.com> isplist at logicore.net wrote: > Thanks. > > I'll look into this. > > Does this have to be installed on every server in the cluster? Contacting the > vendor won't do me any good since I don't have support for 90% of the hardware > being used :). Has to be something I can do on my own with what I've got. > > Mike > every box that has multiple paths would need to have it to behave properly. Like i said before, it might not be that difficult. You can use the default /etc/multipath.conf by just commenting out the blacklist section. My default /etc/multipath.conf looks like this: devnode_blacklist { ##this block the primary local hard disk, change to meet your setup devnode "^sda" ##this ignores lots of stuph that is not relevant at all in most circumstances devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" ##this blocks ide based devices devnode "^hd[a-z][0-9]*" ##i dont remember what this blocks. devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]" } defaults { ## use user friendly names, instead of using WWIDs as names. user_friendly_names yes } I define a devices section for the emc box i work with that looks like this: devices { ## Device attributes requirements for EMC Symmetrix ## are part of the default definitions and do not require separate ## definition. 
## Device attributes for EMC CLARiiON device { vendor "DGC " product "*" path_grouping_policy group_by_prio getuid_callout "/sbin/scsi_id -g -u -s /block/%n" prio_callout "/sbin/mpath_prio_emc /dev/%n" path_checker emc_clariion path_selector "round-robin 0" features "1 queue_if_no_path" no_path_retry 300 hardware_handler "1 emc" failback immediate } } but thats not necessary. it should look at what you have and use defaults based on pre-configured assumptions (correct me if i'm wrong ppl). -greg -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? -anonymous From gsml at netops.gvtc.com Thu Nov 2 23:41:55 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Thu, 02 Nov 2006 17:41:55 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <454A80F1.5030305@netops.gvtc.com> References: <200611217237.537958@leena> <454A80F1.5030305@netops.gvtc.com> Message-ID: <454A8243.8000000@netops.gvtc.com> Greg Swift wrote: > isplist at logicore.net wrote: >> Thanks. >> >> I'll look into this. >> Does this have to be installed on every server in the cluster? >> Contacting the vendor won't do me any good since I don't have support >> for 90% of the hardware being used :). Has to be something I can do >> on my own with what I've got. >> >> Mike >> > every box that has multiple paths would need to have it to behave > properly. Like i said before, it might not be that difficult. > You can use the default /etc/multipath.conf by just commenting out the > blacklist section. My default /etc/multipath.conf looks like this: > > devnode_blacklist > { > ##this block the primary local hard disk, change to meet your setup > devnode "^sda" > ##this ignores lots of stuph that is not relevant at all in most > circumstances > devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" > ##this blocks ide based devices > devnode "^hd[a-z][0-9]*" > ##i dont remember what this blocks. > devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]" > } > defaults { > ## use user friendly names, instead of using WWIDs as names. > user_friendly_names yes > } > > I define a devices section for the emc box i work with that looks like > this: > devices { > ## Device attributes requirements for EMC Symmetrix > ## are part of the default definitions and do not require separate > ## definition. > ## Device attributes for EMC CLARiiON > device { > vendor "DGC " > product "*" > path_grouping_policy group_by_prio > getuid_callout "/sbin/scsi_id -g -u -s /block/%n" > prio_callout "/sbin/mpath_prio_emc /dev/%n" > path_checker emc_clariion > path_selector "round-robin 0" > features "1 queue_if_no_path" > no_path_retry 300 > hardware_handler "1 emc" > failback immediate > } > } > > but thats not necessary. it should look at what you have and use > defaults based on pre-configured assumptions (correct me if i'm wrong > ppl). > > -greg > wait a minute... this is the procedure i use to turn on multipath (sorry i had forgotten that i did more than just turn on the service). update /etc/multipath.conf mkinitrd -f /boot/initrd-`uname -r` `uname -r` reboot modprobe dm-multipath modprobe dm-round-robin service multipathd start at that point you should be able to do 'multipath -ll' and see your multipath setup. 
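Once the mpath devices exist, the "Found duplicate PV" warnings that keep appearing in this thread are normally silenced by telling LVM to scan only the multipath devices (plus whatever local disk actually carries PVs) instead of the underlying /dev/sd* paths, which is the lvm.conf filter Roger points to further down. A hedged sketch of the relevant line in the devices section of /etc/lvm/lvm.conf, to be adapted to the actual device layout:

  # accept multipath devices and the local disk, reject every other block device
  filter = [ "a|/dev/mapper/mpath|", "a|/dev/sda|", "r|.*|" ]

The multipath -ll example that follows is unaffected by this; the filter only changes what LVM itself scans.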
Here is an example from one of my boxes that has 3 luns through 4 paths a piece: [root at phantasos ~]# multipath -ll mpath2 (36006016003e115003ef0eb769632db11) [size=191 GB][features="1 queue_if_no_path"][hwhandler="1 emc"] \_ round-robin 0 [prio=2][active] \_ 1:0:3:1 sdg 8:96 [active][ready] \_ 2:0:3:1 sdm 8:192 [active][ready] \_ round-robin 0 [enabled] \_ 1:0:2:1 sde 8:64 [active][ready] \_ 2:0:2:1 sdk 8:160 [active][ready] mpath1 (36006016003e11500eefae69dd756da11) [size=532 GB][features="1 queue_if_no_path"][hwhandler="1 emc"] \_ round-robin 0 [prio=2][active] \_ 1:0:3:0 sdf 8:80 [active][ready] \_ 2:0:3:0 sdl 8:176 [active][ready] \_ round-robin 0 [enabled] \_ 1:0:2:0 sdd 8:48 [active][ready] \_ 2:0:2:0 sdj 8:144 [active][ready] mpath0 (360060160a9a01000a23b3bb43cfbda11) [size=532 GB][features="1 queue_if_no_path"][hwhandler="1 emc"] \_ round-robin 0 [prio=2][active] \_ 1:0:1:0 sdc 8:32 [active][ready] \_ 2:0:1:0 sdi 8:128 [active][ready] \_ round-robin 0 [enabled] \_ 1:0:0:0 sdb 8:16 [active][ready] \_ 2:0:0:0 sdh 8:112 [active][ready] the mpathX is the use_friendly_names part of multipathd.conf, else the 3600* would be the name. -greg -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? -anonymous From isplist at logicore.net Fri Nov 3 02:02:46 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 20:02:46 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <454A8243.8000000@netops.gvtc.com> Message-ID: <200611220246.295237@leena> Thanks for the info on this, checking it out now. I've installed and started the service but don't see anything yet. Assuming I have to do some configuring. Problem is that the system thinks there is some confusion about device letters. Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 not /dev/sda1 Found duplicate PV uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 not /dev/sdb1 Found duplicate PV 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 not /dev/sdc1 1 logical volume(s) in volume group "VolGroup01" now active 1 logical volume(s) in volume group "VolGroup04" now active 1 logical volume(s) in volume group "VolGroup03" now active 1 logical volume(s) in volume group "VolGroup02" now active Do I need to clean this up before I get into the multipath config? Mike > wait a minute... this is the procedure i use to turn on multipath (sorry > i had forgotten that i did more than just turn on the service). > > update /etc/multipath.conf > mkinitrd -f /boot/initrd-`uname -r` `uname -r` > > reboot > > modprobe dm-multipath > modprobe dm-round-robin > service multipathd start > > at that point you should be able to do 'multipath -ll' and see your > multipath setup. 
> > Here is an example from one of my boxes that has 3 luns through 4 paths > a piece: > > [root at phantasos ~]# multipath -ll > mpath2 (36006016003e115003ef0eb769632db11) > [size=191 GB][features="1 queue_if_no_path"][hwhandler="1 emc"] > \_ round-robin 0 [prio=2][active] > \_ 1:0:3:1 sdg 8:96 [active][ready] > \_ 2:0:3:1 sdm 8:192 [active][ready] > \_ round-robin 0 [enabled] > \_ 1:0:2:1 sde 8:64 [active][ready] > \_ 2:0:2:1 sdk 8:160 [active][ready] > > mpath1 (36006016003e11500eefae69dd756da11) > [size=532 GB][features="1 queue_if_no_path"][hwhandler="1 emc"] > \_ round-robin 0 [prio=2][active] > \_ 1:0:3:0 sdf 8:80 [active][ready] > \_ 2:0:3:0 sdl 8:176 [active][ready] > \_ round-robin 0 [enabled] > \_ 1:0:2:0 sdd 8:48 [active][ready] > \_ 2:0:2:0 sdj 8:144 [active][ready] > > mpath0 (360060160a9a01000a23b3bb43cfbda11) > [size=532 GB][features="1 queue_if_no_path"][hwhandler="1 emc"] > \_ round-robin 0 [prio=2][active] > \_ 1:0:1:0 sdc 8:32 [active][ready] > \_ 2:0:1:0 sdi 8:128 [active][ready] > \_ round-robin 0 [enabled] > \_ 1:0:0:0 sdb 8:16 [active][ready] > \_ 2:0:0:0 sdh 8:112 [active][ready] > > > the mpathX is the use_friendly_names part of multipathd.conf, else the > 3600* would be the name. > > -greg From zxvdr.au at gmail.com Fri Nov 3 04:32:54 2006 From: zxvdr.au at gmail.com (David Robinson) Date: Fri, 03 Nov 2006 14:32:54 +1000 Subject: [Linux-cluster] LVS Director + TG3 = problem ? In-Reply-To: References: <4549F975.3000706@lexum.umontreal.ca> Message-ID: <454AC676.5090909@gmail.com> Barry Brimer wrote: >> I read in the past that the TG3 is not recommended by HP. Do you have >> that kind of issue with the mix hp broadcom NIC + tg3 RH module ? > > In the past, HP recommended the bcm5700 driver. I also recommend it. > In the last several months, HP has been posting their own version of the > tg3 driver, although I have not tried it. I would assume that HP now > recommends the use of their tg3 driver above all else. > Broadcom no longer provides support for the bcm5700, their website says "the tg3 driver is now the only Linux driver that Broadcom will support". http://www.broadcom.com/support/ethernet_nic/faq_drivers.php#tg3 Problems I've come across with the tg3 driver are generally fixed by a firmware update. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123218 Dave From isplist at logicore.net Fri Nov 3 05:19:02 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 2 Nov 2006 23:19:02 -0600 Subject: [Linux-cluster] device-mapper errors Message-ID: <200611223192.297812@leena> Whew, I'm starting to wonder if this stuff is worth all of the headaches. The learning curve on this stuff is unreal. It's coming together but the constant new problems is wearing me down :). Just installed device-mapper-multipath. Seemed fine when I installed it a while ago but suddenly; fence_tool: waiting for cluster quorum Another problem above... When I start up the cluster, the first nods always get's locked out and I need to do a cman_tool expected -e 1 in another window. 
And the problem I'm posting about, as I start the cluster; Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 not /dev/sda1 Found duplicate PV uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 not /dev/sdb1 Found duplicate PV 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 not /dev/sdc1 1 logical volume(s) in volume group "VolGroup01" now active Error locking on node compdev.companions.com: Internal lvm error, check syslog 1 logical volume(s) in volume group "VolGroup04" now active Error locking on node compdev.companions.com: Internal lvm error, check syslog 1 logical volume(s) in volume group "VolGroup03" now active Error locking on node compdev.companions.com: Internal lvm error, check syslog 1 logical volume(s) in volume group "VolGroup02" now active # tail /var/log/messages Nov 2 23:12:04 compdev clvmd: Cluster LVM daemon started - connected to CMAN Nov 2 23:12:05 compdev lvm[3335]: Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 not /dev/sda1 Nov 2 23:12:05 compdev lvm[3335]: Found duplicate PV uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 not /dev/sdb1 Nov 2 23:12:05 compdev lvm[3335]: Found duplicate PV 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 not /dev/sdc1 Nov 2 23:12:06 compdev kernel: device-mapper: dm-linear: Device lookup failed Nov 2 23:12:06 compdev kernel: device-mapper: error adding target to table Nov 2 23:12:06 compdev kernel: device-mapper: dm-linear: Device lookup failed Nov 2 23:12:06 compdev kernel: device-mapper: error adding target to table Nov 2 23:12:06 compdev kernel: device-mapper: dm-linear: Device lookup failed Nov 2 23:12:06 compdev kernel: device-mapper: error adding target to table Is this because I need to do some more configuring in the /etc/multipath.conf? Here's the output of multipath -ll mpath1 (220000080e512699a) [size=1 GB][features="0"][hwhandler="0"] \_ round-robin 0 [prio=1][active] \_ 0:0:2:0 sde 8:64 [active][ready] \_ round-robin 0 [prio=1][enabled] \_ 0:0:2:1 sdf 8:80 [active][ready] \_ round-robin 0 [prio=1][enabled] \_ 0:0:2:2 sdg 8:96 [active][ready] mpath0 (220000080e511feb0) [size=1 GB][features="0"][hwhandler="0"] \_ round-robin 0 [prio=1][active] \_ 0:0:0:0 sda 8:0 [active][ready] \_ round-robin 0 [prio=1][enabled] \_ 0:0:0:1 sdb 8:16 [active][ready] \_ round-robin 0 [prio=1][enabled] \_ 0:0:0:2 sdc 8:32 [active][ready] From gsml at netops.gvtc.com Fri Nov 3 06:16:45 2006 From: gsml at netops.gvtc.com (gsml at netops.gvtc.com) Date: Fri, 3 Nov 2006 00:16:45 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <200611220246.295237@leena> References: <200611220246.295237@leena> Message-ID: <20061103001645.z8cgejj6q8co40cs@mail.gvtc.com> Quoting "isplist at logicore.net" : > Thanks for the info on this, checking it out now. > > I've installed and started the service but don't see anything yet. Assuming I > have to do some configuring. Problem is that the system thinks there is some > confusion about device letters. 
> > Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 not > /dev/sda1 > Found duplicate PV uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 not > /dev/sdb1 > Found duplicate PV 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 not > /dev/sdc1 > 1 logical volume(s) in volume group "VolGroup01" now active > 1 logical volume(s) in volume group "VolGroup04" now active > 1 logical volume(s) in volume group "VolGroup03" now active > 1 logical volume(s) in volume group "VolGroup02" now active > > Do I need to clean this up before I get into the multipath config? i'm pretty sure that multipath has to be done first, and i do not know how forgiving or easy to man-handle lvm would be about enabling the multipath after the lvm setup. -greg From orkcu at yahoo.com Fri Nov 3 13:32:37 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 3 Nov 2006 05:32:37 -0800 (PST) Subject: [Linux-cluster] device-mapper errors In-Reply-To: <200611223192.297812@leena> Message-ID: <20061103133237.49213.qmail@web50612.mail.yahoo.com> --- "isplist at logicore.net" wrote: > Whew, I'm starting to wonder if this stuff is worth > all of the headaches. The > learning curve on this stuff is unreal. It's coming > together but the constant > new problems is wearing me down :). it happens sometimes :-) > Found duplicate PV TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: > using /dev/sde1 not > /dev/sda1 > Found duplicate PV > uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 > not > /dev/sdb1 > Found duplicate PV > 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 > not > /dev/sdc1 > 1 logical volume(s) in volume group "VolGroup01" > now active > Error locking on node compdev.companions.com: > Internal lvm error, check > syslog > 1 logical volume(s) in volume group "VolGroup04" > now active > Error locking on node compdev.companions.com: > Internal lvm error, check > syslog > 1 logical volume(s) in volume group "VolGroup03" > now active > Error locking on node compdev.companions.com: > Internal lvm error, check > syslog > 1 logical volume(s) in volume group "VolGroup02" > now active > > # tail /var/log/messages > Nov 2 23:12:04 compdev clvmd: Cluster LVM daemon > started - connected to CMAN > Nov 2 23:12:05 compdev lvm[3335]: Found duplicate > PV > TB3VUn1m3CBOj8dRnRjAbRuD3ZIyBp7t: using /dev/sde1 > not /dev/sda1 > Nov 2 23:12:05 compdev lvm[3335]: Found duplicate > PV > uXjYfhj4NlvQphf1DIz28psAAkFmJRaM: using /dev/sdf1 > not /dev/sdb1 > Nov 2 23:12:05 compdev lvm[3335]: Found duplicate > PV > 3mlyNROBtWp4a3LXQEKBiCwk3pJear7u: using /dev/sdg1 > not /dev/sdc1 > Nov 2 23:12:06 compdev kernel: device-mapper: > dm-linear: Device lookup failed > Nov 2 23:12:06 compdev kernel: device-mapper: error > adding target to table > Nov 2 23:12:06 compdev kernel: device-mapper: > dm-linear: Device lookup failed > Nov 2 23:12:06 compdev kernel: device-mapper: error > adding target to table > Nov 2 23:12:06 compdev kernel: device-mapper: > dm-linear: Device lookup failed > Nov 2 23:12:06 compdev kernel: device-mapper: error > adding target to table > > Is this because I need to do some more configuring > in the /etc/multipath.conf? No, this is because you did not configure lvm.conf you need to filter the "real" devices so lvm only find PV in the multipath device look at the /etc/lvm/lvm.conf cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________________________________________________ Check out the New Yahoo! 
Mail - Fire up a more powerful email and get things done faster. (http://advision.webevents.yahoo.com/mailbeta) From isplist at logicore.net Fri Nov 3 15:46:12 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Fri, 3 Nov 2006 09:46:12 -0600 Subject: [Linux-cluster] Duplicate Volumes, GFS fail over In-Reply-To: <454B5F7F.8040306@netops.gvtc.com> Message-ID: <200611394612.319875@leena> As you might be able to tell, I started doing this but it's leading to too many other problems. I think I'll have to learn dual path's later as it's just going to end up being another level of learning on top of everything else I've taken on. I appreciate the help everyone gave me on dual path's and have kept your replies to work with at a later date. Mike > #First setup multipath.conf Done, leaving it as default. > #If you are using qlogic qla2xxx series hba's this is what i was told > to put in modprobe.conf: > alias scsi_hostadapter qla2xxx > alias scsi_hostadapter1 mptbase > alias scsi_hostadapter2 mptspi > options qla2xxx qlport_down_retry=1 ql2xretrycount=5 > options scsi_mod max_luns=255 Done, changed file. > mkinitrd -f /boot/initrd-`uname -r` `uname -r` > init 6; exit > #else just continue on > modprobe dm-multipath > modprobe dm-round-robin > service multipathd start Done. > #Setup /etc/cluster/cluster.conf Done. > #Enable the nodes by running this on each box: > ccs_tool addnodeids This command won't run. Have to look into it. > #at this point get on one of the boxes and do this: > pvcreate /dev/mapper/mpath[01] > vgcreate pri_volGrp0 /dev/mapper/mpath[01] > lvcreate --size GB --name pri_lv0 pri_volGrp0 > mkfs.gfs2 -p lock_dlm -t :data -j 4 > /dev/pri_volGrp0/pri_lv0 > > echo "/dev/pri_volGrp0/pri_lv0 gfs2 defaults 1 1" >> > /etc/fstab > mkdir > > #Now one at a time, do each of these commands, one box then the other: > service clvmd start > service gfs2 start > #if that all works enable it to run at boot: > for x in multipathd cman clvmd gfs2; do chkconfig $x on; done From lhh at redhat.com Fri Nov 3 16:53:14 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 03 Nov 2006 11:53:14 -0500 Subject: [Linux-cluster] CS and DRBD In-Reply-To: <23500065.31162496760586.JavaMail.root@lisa.it-linux.cl> References: <23500065.31162496760586.JavaMail.root@lisa.it-linux.cl> Message-ID: <1162572794.4518.493.camel@rei.boston.devel.redhat.com> On Thu, 2006-11-02 at 16:46 -0300, Patricio A. Bruna wrote: > Anyone has a working script for use DRBD with Cluster Suite? > I need to use DRBD, but i dont know how to make it play with CS. > I now CS can mount partitions, but i dont know if CS can mark a drbd > device as primary. I'm not aware of any -- but you're not the first to ask. Could you file a bugzilla against fc6 / rgmanager? It might be as simple as using an existing script to do it. -- Lon From lhh at redhat.com Fri Nov 3 16:57:14 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 03 Nov 2006 11:57:14 -0500 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <2006112114524.012422@leena> References: <2006112114524.012422@leena> Message-ID: <1162573034.4518.495.camel@rei.boston.devel.redhat.com> On Thu, 2006-11-02 at 11:45 -0600, isplist at logicore.net wrote: > > You would probably want direct routing. "Direct" means on the same network > > as the director, and able to use the same gateway to the outside world. > > Yes, it's how I've got it set up. Only problem is, the web servers need to see > the LVS as their gateways no? 
> > The last error I seem to have to conquer is; > > /// > Nov 2 11:24:49 lb52 nanny[3652]: READ to 192.168.1.94:80 timed out > Nov 2 11:24:52 lb52 nanny[3650]: READ to 192.168.1.92:80 timed out > Nov 2 11:24:54 lb52 nanny[3651]: READ to 192.168.1.93:80 timed out > /// This might help. -- Lon -------------- next part -------------- Piranha 0.7.7+ Direct Routing Mini-HOWTO v0.2.2 Scope: This only contains relevant information on how to make direct routing to work with Piranha, it does not explain how to configure Piranha services. Setting up Piranha: (1) Ensure that the following packages are installed on the LVS directors: * piranha * ipvsadm Ensure that the following packages are installed on the LVS real servers: * iptables * arptables_jf (2) Set up and log in to the Piranha web-based GUI. See the following link: http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/ch-lvs-piranha.html (3) Configure Piranha for Direct Routing. In the "GLOBAL SETTINGS" tab of the Piranha configuration tool, enter the primary server's public IP address in the box provided. The private IP address is not needed/used for Direct Routing configurations. In a direct routing configuration, all real servers as well as the LVS directors share the same virtual IP addresses and should have the same IP route configuration. Click the "Direct Routing" button to enable Direct Routing support on the Piranha LVS director node(s). (4) Configure services + real servers using the Piranha GUI. (5) Set up the each of the real servers using one of the methods below. =========================================================================== Setting up the Real Servers, method #1: Using arptables_jf How it works: Each real server has the virtual IP address(es) configured, so they can directly route the packets. ARP requests for the VIP are ignored entirely by the real servers, and any ARP packets which might otherwise be sent containing the VIPs are mangled to contain the real server's IP instead of the VIPs. Main Advantages: * Ability for applications to bind to each individual VIP/port the real server is servicing. This allows, for instance, multiple instances of Apache to be running bound explicitly to different VIPs on the system. * Performance. Disadvantages: * The VIPs can not be configured to start on boot using standard RHEL system configuration tools. How to make it work: (1) BACK UP YOUR ARPTABLES CONFIGURATION. (2) Configure each real server to ignore ARP requests for each of the virtual IP addresses the Piranha cluster will be servicing. To do this, first create the ARP table entries for each virtual IP address on each real server (the real_ip is the IP the director uses to communicate with the real server; often this is the IP bound to "eth0"): arptables -A IN -d -j DROP arptables -A OUT -d -j mangle --mangle-ip-s This will cause the real servers to ignore all ARP requests for the virtual IP addresses, and change any outbound ARP responses which might otherwise contain the virtual IP so that they contain the real IP of the server instead. The only node in the Piranha cluster which should respond to ARP requests for any of the VIPs is the current active Piranha LVS director node. Once this has been completed on each real server, we can save the ARP table entries for later. 
Run the following commands on each real server: service arptables_jf save chkconfig --level 2345 arptables_jf on The second command will cause the system to reload the arptables configuration we just made on boot - before the network is started. (3) Configure the virtual IP address on all real servers using 'ifconfig' to create an IP alias: ifconfig eth0:1 192.168.76.24 netmask 255.255.252.0 \ broadcast 192.168.79.255 up Or using the iproute2 utility "ip", for example: ip addr add 192.168.76.24/22 dev eth0 As noted previously, the virtual IP addresses can not be configured to start on boot using the Red Hat system configuration tools. One way to work around this is to simply place these commands in /etc/rc.d/rc.local. =========================================================================== Setting up the Real Servers, method #2: Use iptables to tell the real servers to handle the packets. How it works: We use an IP tables rule to create a transparent proxy so that a node will service packets sent to the virtual IP address(es), even though the virtual IP address does not exist on the system. Advantages: * Simple to configure. * Avoids the LVS "ARP problem" entirely. Because the virtual IP address(es) only exist on the active LVS director, there _is_ no ARP problem! Disadvantages: * Performance. There is overhead in forwarding/masquerading every packet. * Impossible to reuse ports. For instance, it is not possible to run two separate Apache services bound to port 80, because both must bind to INADDR_ANY instead of the virtual IP addresses. (1) BACK UP YOUR IPTABLES CONFIGURATION. (2) On each real server, run the following for every VIP / port / protocol (TCP, UDP) combination intended to be serviced for that real server: iptables -t nat -A PREROUTING -p -d \ --dport -j REDIRECT This will cause the real servers to process packets destined for the VIP which they are handed. service iptables save chkconfig --level 2345 iptables on The second command will cause the system to reload the arptables configuration we just made on boot - before the network is started. From isplist at logicore.net Fri Nov 3 19:28:45 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Fri, 3 Nov 2006 13:28:45 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <1162573034.4518.495.camel@rei.boston.devel.redhat.com> Message-ID: <2006113132845.679187@leena> WONDERFUL!!! Just what I wish I'd have found when looking all over the place. I'll check it out, thanks very much. Mike >> /// >> Nov 2 11:24:49 lb52 nanny[3652]: READ to 192.168.1.94:80 timed out >> Nov 2 11:24:52 lb52 nanny[3650]: READ to 192.168.1.92:80 timed out >> Nov 2 11:24:54 lb52 nanny[3651]: READ to 192.168.1.93:80 timed out >> /// >> > This might help. > > -- Lon A T T A C H E D F I L E S I N L I N E D I S P L A Y Attached text follows, filename: piranha-direct-routing-howto.txt Piranha 0.7.7+ Direct Routing Mini-HOWTO v0.2.2 Scope: This only contains relevant information on how to make direct routing to work with Piranha, it does not explain how to configure Piranha services. 
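The angle-bracketed placeholders in the arptables and iptables commands of the HOWTO above were stripped when the attachment was archived; reconstructed as an assumption about the original text, they presumably read:

  arptables -A IN -d <virtual_ip> -j DROP
  arptables -A OUT -d <virtual_ip> -j mangle --mangle-ip-s <real_ip>

  iptables -t nat -A PREROUTING -p <tcp|udp> -d <vip> --dport <port> -j REDIRECT

where <virtual_ip>/<vip> is the service VIP, <real_ip> is the address the director uses to reach that real server, and <tcp|udp>/<port> describe the service being balanced.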
From isplist at logicore.net Fri Nov 3 20:21:26 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Fri, 3 Nov 2006 14:21:26 -0600 Subject: [Linux-cluster] device-mapper errors In-Reply-To: <20061103194916.68329.qmail@web50611.mail.yahoo.com> Message-ID: <2006113142126.452465@leena> >> filter = [ "a/.*/" ] >> to >> filter = [ "a|/dev/hd[ab]|", "r/.*/" ] > ^^^ > you are using ATA disk names :-) > better use SCSI names :-) > because I think you are using a SAN, do you? >> using /dev/sde1 not >> /dev/sda1 Yes, I looked around for more information on setting this up and decided to put it off, too many things at once. You are correct, SAN, all SCSI devices. Mike From johnson.eric at gmail.com Fri Nov 3 20:39:45 2006 From: johnson.eric at gmail.com (Eric Johnson) Date: Fri, 3 Nov 2006 15:39:45 -0500 Subject: [Linux-cluster] What can cause dlm_pthread_init to generate an errno 13? In-Reply-To: <20061102191849.GG9431@redhat.com> References: <20061102180533.GB4283@korben.rdu.redhat.com> <20061102184531.GF9431@redhat.com> <20061102191849.GG9431@redhat.com> Message-ID: > Looking at the examples in cluster/dlm/tests/usertest/ I think that you'll > wan't dlm_ls_pthread_init() and not dlm_pthread_init() after you've > created your lockspace. Ok, so I'm further ahead. I created my own lockspace with a program like this. [ejohnson at gfsa01:~/projects/dlm-simple] cat bar.cpp #include #include #include #include #include extern "C" { #include } #include main(int argc, char **argv) { printf("hello world\n"); char lockName[] = "foobar" ; dlm_lshandle_t *lockspace=0; lockspace=(dlm_lshandle_t*)dlm_create_lockspace("EEJ_TESTING_LOCKSPACE",0777); printf("lockspace result: %X\n",lockspace); } And I can verify that it really produced something. Because I see this... [ejohnson at gfsa01:/proc/cluster] cat services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [3 2 1] DLM Lock Space: "clvmd" 2 3 run - [3 2 1] DLM Lock Space: "GFS1" 4 4 run - [3 2 1] DLM Lock Space: "EEJ_TESTING_LOCKSPACE" 9 7 run - [1] GFS Mount Group: "GFS1" 5 5 run - [3 2 1] User: "usrm::manager" 3 6 run - [3 2 1] And, I can see this too... [ejohnson at gfsa01:/dev/misc] ls -asl total 0 0 drwxr-xr-x 2 root root 100 Nov 3 13:50 ./ 0 drwxr-xr-x 13 root root 5700 Nov 3 13:50 ../ 0 crw------- 1 root root 10, 62 Aug 30 16:07 dlm-control 0 crwxr-xr-x 1 root root 10, 60 Nov 3 13:50 dlm_EEJ_TESTING_LOCKSPACE 0 crw------- 1 root root 10, 61 Aug 30 16:07 dlm_clvmd So I must be getting closer! BUT - note the permissions on the dlm_EEJ* "file". Not quite the 777 I had supplied in dlm_create_lockspace. Hmpf. Is that good? I wonder if that explains why this program still gets a null returned from dlm_open_lockspace. [ejohnson at gfsa01:~/projects/dlm-simple] cat foo.cpp #include #include #include #include #include extern "C" { #include } #include main(int argc, char **argv) { printf("hello world\n"); char lockName[] = "foobar" ; dlm_lshandle_t *lockspace=0; lockspace=(dlm_lshandle_t*)dlm_open_lockspace("EEJ_TESTING_LOCKSPACE"); printf("Lockspace: %X\n",lockspace); int initStatus = dlm_ls_pthread_init(lockspace); printf("initStatus: %d, errno: %d\n",initStatus,errno); return 1; } [ejohnson at gfsa01:~/projects/dlm-simple] ./foo hello world Lockspace: 0 Segmentation fault (core dumped) Still feels like I need root somewhere along the way. OR - my gfs arrangement is hopelessly misconfigured and I need to be told RTFM. 
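Since errno 13 is EACCES, one quick test of whether this is purely a device-node permission problem is to look at the DLM character devices libdlm opens and, as root, temporarily loosen their modes; a hedged sketch for a test run only, not a production fix:

  ls -l /dev/misc/dlm-control /dev/misc/dlm_EEJ_TESTING_LOCKSPACE
  # dlm-control above is crw------- root, so a non-root open fails with EACCES
  chmod 666 /dev/misc/dlm-control /dev/misc/dlm_EEJ_TESTING_LOCKSPACE
  # rerun ./foo as the unprivileged user and see whether the errno 13 goes away

A longer-term fix would be group ownership or a udev rule rather than world-writable device nodes.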
-Eric From gsml at netops.gvtc.com Fri Nov 3 22:36:17 2006 From: gsml at netops.gvtc.com (Greg Swift) Date: Fri, 03 Nov 2006 16:36:17 -0600 Subject: [Linux-cluster] Mount hangs mounting gfs file sytem from second node In-Reply-To: <1.3.200610120812.24989@mclink.it> References: <1.3.200610120812.24989@mclink.it> Message-ID: <454BC461.2090405@netops.gvtc.com> Guido Aulisi wrote: > Hi, > I'm using kernel 2.6.17.13 and the latest cvs stable cluster suite. > I can't mount a gfs file system from the second node because mount hangs (state D). > > /proc/cluster/services is: > Service Name GID LID State Code > Fence Domain: "default" 2 2 run - > [1 2] > > DLM Lock Space: "clvmd" 2 3 run - > [1 2] > > DLM Lock Space: "oraclefs" 2 4 join S-4,4,1 > [1 2] > > Now the firstnode can't unmount the file system. > can version is 5.0.1 > I'm using LVM2 with clvmd > > Thanks in advance for helping me > Guido Aulisi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > see thread with subject: fc6 two-node cluster with gfs2 not working. its a bug they need to submit patches to the kernel for. -greg -- http://www.gvtc.com -- ?While it is possible to change without improving, it is impossible to improve without changing.? -anonymous ?only he who attempts the absurd can achieve the impossible.? -anonymous From mederyf at LEXUM.UMontreal.CA Fri Nov 3 22:36:50 2006 From: mederyf at LEXUM.UMontreal.CA (Frederic Medery) Date: Fri, 03 Nov 2006 17:36:50 -0500 Subject: [Linux-cluster] Mount hangs mounting gfs file sytem from second node In-Reply-To: <454BC461.2090405@netops.gvtc.com> References: <1.3.200610120812.24989@mclink.it> <454BC461.2090405@netops.gvtc.com> Message-ID: --- Bonjour, Je serai absent du bureau du 6 novembre 2006 au 4 mars 2007. Pour toute urgence, veuillez contacter Mr Thomas Lanteigne : lanteignet at lexum.umontreal.ca. --- Hello, I will be absent from the office from November 6, 2006 until March 4, 2007. In case of an emergency, please contact Mr Thomas Lanteigne : lanteignet at lexum.umontreal.ca --- From orkcu at yahoo.com Sat Nov 4 02:10:24 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 3 Nov 2006 18:10:24 -0800 (PST) Subject: [Linux-cluster] device-mapper errors In-Reply-To: <2006113142126.452465@leena> Message-ID: <20061104021024.58834.qmail@web50606.mail.yahoo.com> --- "isplist at logicore.net" wrote: > >> filter = [ "a/.*/" ] > >> to > >> filter = [ "a|/dev/hd[ab]|", "r/.*/" ] > > ^^^ my point was that you should set the filter line loke this: filter = [ "a|/dev/sd[ab]|", "r/.*/" ] cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) ____________________________________________________________________________________ Low, Low, Low Rates! Check out Yahoo! Messenger's cheap PC-to-Phone call rates (http://voice.yahoo.com) From romero.cl at gmail.com Sat Nov 4 04:51:42 2006 From: romero.cl at gmail.com (romero.cl at gmail.com) Date: Sat, 4 Nov 2006 01:51:42 -0300 Subject: [Linux-cluster] gfs mounted but not working Message-ID: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> Hello. I have a two node cluster runing. Each one have /dev/sdb2 mounted as gfs on /users/home, but when I create one file y node1 not appear on node2. I use the following commands to create the file systems: on node1: gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/sdb2 on node2: gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/sdb2 What I'm doing wrong? 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Sat Nov 4 15:48:05 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Sat, 04 Nov 2006 09:48:05 -0600 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> Message-ID: <454CB635.9030806@redhat.com> romero.cl at gmail.com wrote: > Hello. > > I have a two node cluster runing. > > Each one have /dev/sdb2 mounted as gfs on /users/home, but when I > create one file y node1 not appear on node2. > > I use the following commands to create the file systems: > > on node1: gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/sdb2 > on node2: gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/sdb2 > > What I'm doing wrong? Hi, When using GFS in a clustered environment, I strongly recommend you use LVM rather than using the raw device for your GFS partition. Without a clustered LVM of some sort, there is no locking coordination between the nodes. I'm assuming, of course, that device sdb is some kind of shared storage, like a SAN. For example, assuming that your /dev/sdb2 has no valuable data yet, I recommend doing something like this: pvcreate /dev/sdb2 vgcreate your_vg /dev/sdb2 (where "your_vg" is the name you choose for your new vg) vgchange -cy your_vg (turn on the clustered bit) lvcreate -n your_lv -L 500G your_vg (where 500G is the size of your file system, and your_lv is the name you choose for your lv) gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/your_vg/your_lv (on only one node) At this point you've got to bring up the cluster infrastructure, if it isn't already up. Next, mount the logical volume from both nodes: mount -tgfs /dev/your_vg/your_lv /users/home Now when you touch a file on one node, the other node should see it. I hope this helps. Regards, Bob Peterson Red Hat Cluster Suite From riaan at obsidian.co.za Sat Nov 4 17:30:18 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Sat, 04 Nov 2006 19:30:18 +0200 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <454CB635.9030806@redhat.com> References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com> Message-ID: <454CCE2A.1000501@obsidian.co.za> Robert Peterson wrote: > romero.cl at gmail.com wrote: >> Hello. >> >> I have a two node cluster runing. >> >> Each one have /dev/sdb2 mounted as gfs on /users/home, but when I >> create one file y node1 not appear on node2. >> >> I use the following commands to create the file systems: >> >> on node1: gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/sdb2 >> on node2: gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/sdb2 >> >> What I'm doing wrong? > Hi, > > When using GFS in a clustered environment, I strongly recommend you use LVM > rather than using the raw device for your GFS partition. Without a > clustered > LVM of some sort, there is no locking coordination between the nodes. > I'm assuming, of course, that device sdb is some kind of shared storage, > like a SAN. 
> > For example, assuming that your /dev/sdb2 has no valuable data yet, I > recommend > doing something like this: > > pvcreate /dev/sdb2 > vgcreate your_vg /dev/sdb2 (where "your_vg" is the name you choose for > your new vg) > vgchange -cy your_vg (turn on the clustered bit) > lvcreate -n your_lv -L 500G your_vg (where 500G is the size of your file > system, > and your_lv is the name you choose for your lv) > gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/your_vg/your_lv > (on only one node) > At this point you've got to bring up the cluster infrastructure, if it > isn't already up. > Next, mount the logical volume from both nodes: > mount -tgfs /dev/your_vg/your_lv /users/home > > Now when you touch a file on one node, the other node should see it. > > I hope this helps. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > Romero - you should only gfs_mkfs from one node, and the other should pick it up. Also (some may feel it goes without saying) you need shared storage between these nodes - GFS does not replicate data between nodes) Bob - as far as I understand, LVM has certain advantages over native devices (dont have to worry about device names, easy expandability of LVs vs partitions). however there are people who are running GFS without clvmd, which is technically not wrong. (they do so to keep things simple - one less thing that can break). just switching from GFS on top of partitions to GFS on top of a LVM should not solve such a problem as described by Romero (assuming everything else is working and properly configured) HTH Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From isplist at logicore.net Sat Nov 4 17:34:45 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Sat, 4 Nov 2006 11:34:45 -0600 Subject: [Linux-cluster] device-mapper errors In-Reply-To: <20061104021024.58834.qmail@web50606.mail.yahoo.com> Message-ID: <2006114113445.322198@leena> I think I became overwhelmed with too many things at once. I think I'll get it all running as it is first, then in a few weeks, take on the addition of these tools on a new batch of servers. Good point, I was not sure how the filtering works however so was looking for info on the net. Mike On Fri, 3 Nov 2006 18:10:24 -0800 (PST), Pe?a wrote: > > > --- "isplist at logicore.net" > wrote: > >>>> filter = [ "a/.*/" ] >>>> to >>>> filter = [ "a|/dev/hd[ab]|", "r/.*/" ] >>> ^^^ >>> > my point was that you should set the filter line loke > this: > filter = [ "a|/dev/sd[ab]|", "r/.*/" ] > > cu > roger > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > > ___________________________________________________________________ ___________ > ______ > Low, Low, Low Rates! Check out Yahoo! Messenger's cheap PC-to-Phone call > rates > (http://voice.yahoo.com) From katriel at penguin-it.co.il Sun Nov 5 18:11:18 2006 From: katriel at penguin-it.co.il (Katriel Traum) Date: Sun, 05 Nov 2006 20:11:18 +0200 Subject: [Linux-cluster] clvmd and lvm cache file Message-ID: <454E2946.3070808@penguin-it.co.il> Hi. I've had some problems with clvmd not updating it's device list after adding a new PV. I created a new VG with "-cy", and then, when creating a new LV, it failed to enable it, whith clvmd claiming it can't find the device's uuid. An strace on clvmd showed that it simply didn't scan the newly added PV. 
A current and ugly work around is running "service clvmd restart" after each VG is created. This is good for setting up a system, but what happens when I have to add a new PV to a clustered system? A short look at the CVS sources showed that clvmd has been given a "-R" or refresh option that re-reads LVM's cache file. Can someone comment on when this patch will find it's way into RHEL? RHEL4 U4 uses lvm2 2.02.06 and the feature was added to 2.02.13. Thanks, -- Katriel Traum, PenguinIT RHCE, CLP Mobile: 054-6789953 From romero.cl at gmail.com Mon Nov 6 00:30:59 2006 From: romero.cl at gmail.com (romero.cl at gmail.com) Date: Sun, 5 Nov 2006 21:30:59 -0300 Subject: [Linux-cluster] gfs mounted but not working References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com> Message-ID: <000e01c7013a$d720f220$0100a8c0@fremen> Hi. I'm trying your method, but still have a problem: Note: /dev/db2/ is a local partition on my second SCSI hard drive (no RAID) runing on HP ProLiant On node3: # /usr/sbin/vgcreate vg01 /dev/sdb2 Volume group "vg01" successfully created # /usr/sbin/vgchange -cy vg01 Volume group "vg01" is already clustered # /usr/sbin/lvcreate -n node3_lv -L 67G vg01 Error locking on node node4: Internal lvm error, check syslog Failed to activate new LV. --->On node4 log : lvm[6361]: Volume group for uuid not found: sgufJEs53VJSJTKG0vA1dLHXTthjnFctmfjC6YddzZvY3LI6db300wqEp8H0H58H Then I can mount /dev/vg01/node3_lv as gfs on node3, but node4 can't view the new files. What i'm trying to do is to mount 2 partitions (one on node3, the other on node4) as one big shared drive using gfs and then expand this to 4 nodes. Any help is well appreciated!!! (i'm a cluster newbie) Thanks. > Hi, > > When using GFS in a clustered environment, I strongly recommend you use LVM > rather than using the raw device for your GFS partition. Without a > clustered > LVM of some sort, there is no locking coordination between the nodes. > I'm assuming, of course, that device sdb is some kind of shared storage, > like a SAN. > > For example, assuming that your /dev/sdb2 has no valuable data yet, I > recommend > doing something like this: > > pvcreate /dev/sdb2 > vgcreate your_vg /dev/sdb2 (where "your_vg" is the name you choose for > your new vg) > vgchange -cy your_vg (turn on the clustered bit) > lvcreate -n your_lv -L 500G your_vg (where 500G is the size of your file > system, > and your_lv is the name you choose for your lv) > gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 /dev/your_vg/your_lv > (on only one node) > At this point you've got to bring up the cluster infrastructure, if it > isn't already up. > Next, mount the logical volume from both nodes: > mount -tgfs /dev/your_vg/your_lv /users/home > > Now when you touch a file on one node, the other node should see it. > > I hope this helps. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From linuxr at gmail.com Mon Nov 6 01:02:16 2006 From: linuxr at gmail.com (Marc ) Date: Sun, 5 Nov 2006 20:02:16 -0500 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <000e01c7013a$d720f220$0100a8c0@fremen> References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com> <000e01c7013a$d720f220$0100a8c0@fremen> Message-ID: This is as far as I got recently, and I called RH support and got nowhere. 
GFS and RHCS are apparently very flaky and despite their supposedly being mission critical and enterprise ready. This sounds like 'split brain' configuration where you finagle with the crappy software until each node is kind of a cluster unto itself but not all together. Maybe Red Hat inc, can make a pretty red HTML applet that can convince me otherwise. In the meantime, despite a ton of hard work on my part, it just convinced my client to switch to Microsoft clustering as fast as possible. Actually I don't know which is the most disappointing - the flaky redhat software or the flaky redhat people. Something to ponder as I remove the shadowman logo from my car. Sincerely, another Ubuntu convert On 11/5/06, romero.cl at gmail.com wrote: > > Hi. > > I'm trying your method, but still have a problem: > > Note: /dev/db2/ is a local partition on my second SCSI hard drive (no > RAID) > runing on HP ProLiant > > On node3: > # /usr/sbin/vgcreate vg01 /dev/sdb2 > Volume group "vg01" successfully created > # /usr/sbin/vgchange -cy vg01 > Volume group "vg01" is already clustered > # /usr/sbin/lvcreate -n node3_lv -L 67G vg01 > Error locking on node node4: Internal lvm error, check syslog > Failed to activate new LV. > > --->On node4 log : lvm[6361]: Volume group for uuid not found: > sgufJEs53VJSJTKG0vA1dLHXTthjnFctmfjC6YddzZvY3LI6db300wqEp8H0H58H > > > Then I can mount /dev/vg01/node3_lv as gfs on node3, but node4 can't view > the new files. > > What i'm trying to do is to mount 2 partitions (one on node3, the other on > node4) as one big shared drive using gfs and then expand this to 4 nodes. > > Any help is well appreciated!!! (i'm a cluster newbie) > Thanks. > > > > Hi, > > > > When using GFS in a clustered environment, I strongly recommend you use > LVM > > rather than using the raw device for your GFS partition. Without a > > clustered > > LVM of some sort, there is no locking coordination between the nodes. > > I'm assuming, of course, that device sdb is some kind of shared storage, > > like a SAN. > > > > For example, assuming that your /dev/sdb2 has no valuable data yet, I > > recommend > > doing something like this: > > > > pvcreate /dev/sdb2 > > vgcreate your_vg /dev/sdb2 (where "your_vg" is the name you choose for > > your new vg) > > vgchange -cy your_vg (turn on the clustered bit) > > lvcreate -n your_lv -L 500G your_vg (where 500G is the size of your file > > system, > > and your_lv is the name you choose for your lv) > > gfs_mkfs -p lock_dlm -t node1_cluster:node1_gfs -j 8 > /dev/your_vg/your_lv > > (on only one node) > > At this point you've got to bring up the cluster infrastructure, if it > > isn't already up. > > Next, mount the logical volume from both nodes: > > mount -tgfs /dev/your_vg/your_lv /users/home > > > > Now when you touch a file on one node, the other node should see it. > > > > I hope this helps. > > > > Regards, > > > > Bob Peterson > > Red Hat Cluster Suite > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kanderso at redhat.com Mon Nov 6 01:12:57 2006 From: kanderso at redhat.com (Kevin Anderson) Date: Sun, 05 Nov 2006 19:12:57 -0600 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com> <000e01c7013a$d720f220$0100a8c0@fremen> Message-ID: <454E8C19.1070607@redhat.com> On 11/5/06, *romero.cl at gmail.com * > wrote: > > Hi. > > I'm trying your method, but still have a problem: > > Note: /dev/db2/ is a local partition on my second SCSI hard drive > (no RAID) > runing on HP ProLiant. > GFS requires that all storage is equally accessible by all nodes in the cluster. Your other nodes have no path to the storage you set up so it is impossible for them to share the data. Kevin From isplist at logicore.net Mon Nov 6 02:25:49 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Sun, 5 Nov 2006 20:25:49 -0600 Subject: [Linux-cluster] GFS/SCSI Lost Message-ID: <2006115202549.192581@leena> Everything seems to be working fine. The hardware will just be sitting there working as I left it, idling away. The next day, I come to use it and find that the storage is gone and I find this error. The storage is gone on all nodes by the way. Nov 5 19:01:02 cweb92 kernel: SCSI error : <0 0 2 0> return code = 0x10000 Nov 5 19:01:02 cweb92 kernel: end_request: I/O error, dev sde, sector 63 Nov 5 19:01:02 cweb92 kernel: SCSI error : <0 0 2 1> return code = 0x10000 Nov 5 19:01:02 cweb92 kernel: end_request: I/O error, dev sdf, sector 63 Nov 5 19:01:02 cweb92 kernel: SCSI error : <0 0 2 2> return code = 0x10000 Nov 5 19:01:02 cweb92 kernel: end_request: I/O error, dev sdg, sector 63 From isplist at logicore.net Mon Nov 6 03:29:40 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Sun, 5 Nov 2006 21:29:40 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <200611022131.kA2LVsAA012346@mail2.ontariocreditcorp.com> Message-ID: <2006115212940.158229@leena> Hi Chris, I'm still not getting this. Same problem, all weekend. Nov 5 21:28:46 lb52 nanny[1227]: READ to 192.168.1.94:80 timed out Nov 5 21:28:46 lb52 nanny[1226]: READ to 192.168.1.93:80 timed out Nov 5 21:28:46 lb52 nanny[1225]: READ to 192.168.1.92:80 timed out Nov 5 21:28:58 lb52 nanny[1227]: READ to 192.168.1.94:80 timed out Nov 5 21:28:58 lb52 nanny[1226]: READ to 192.168.1.93:80 timed out Nov 5 21:28:58 lb52 nanny[1225]: READ to 192.168.1.92:80 timed out > accept packets with that address. However, they do not respond to ARP > requests, therefore that IP can exist on all servers in this cluster and not > cause communication problems. Ie., if you were to ping that address, only > the director would respond. I have configured real servers before with just > these three lines of code: I understand the concept, it's just not working for me, so far. I'm missing something for sure. > ifconfig lo:1 $VIP netmask 255.255.255.255 up I have the VIP set up on my first NIC on a real server. eth0:1 Ethernet (Virtual) 192.168.1.150 255.255.255.0 Up I have 192.168.1.150 installed on LVS0 as the VIP for the real servers and real servers configured. > echo 1 > /proc/sys/net/ipv4/conf/all/hidden > echo 1 > /proc/sys/net/ipv4/conf/lo/hidden These files do not exist on RHEL4 so cannot do this. > That's it. If a real server can accept packets addressed to the virtual IP > and does not respond to ARP on that IP, it's done, all set up. If nanny > requires more, then maybe there is more, but that's the guts of the LVS > stuff on real servers. 
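(A side note on the two "hidden" lines quoted above: the 2.4 hidden-interface patch never went into the 2.6 kernels, so those proc files really do not exist on RHEL4. The usual LVS direct-routing substitute is to move the VIP onto the loopback with a /32 netmask and suppress ARP for it with the arp_ignore/arp_announce sysctls. A sketch of the real-server side, reusing the VIP from this thread:

    ifconfig lo:0 192.168.1.150 netmask 255.255.255.255 up
    echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce

Leaving the VIP on eth0:1 with a /24 netmask, as described above, lets the real server answer ARP for the VIP and race with the director -- which is a separate issue from the nanny read timeouts against the real IPs.)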
Ok, so there's my problem, where I'm misunderstanding something here. The LVS servers respond to my VIP of 192.168.1.150 and the real servers have that IP as a virtual IP but they don't respond. I think I'm not understanding something with the floating and virtual IP's. They are the same correct? Client connects to VIP on the LVS and LVS sends connection to one of the real servers which all know to use the same virtual IP? Mike > The LVS director (and it's partner if there are two directors) accept all > packets that are sent to the floating IP, whether they get there by NAT > through a firewall, or directly from a machine on the same network, or > anywhere else. Say I want a web page from this cluster. I send an http > request to your floating IP. The director gets it, chooses a real server, > and then forwards the packet there. The real server serves the request and > replies directly to the client, using the virtual IP as the source in the > packet header, such that client never realizes that it dealt with more than > one machine. From the client side, it sent an http request to your virtual > IP (hypothetically) 10.10.10.9 which got accepted by the director and it got > a packet back from 10.10.10.9 (which is bound to the loopback on the real > server, so it can use that address as its source). And voila, a transparent > load balanced cluster. I can't answer a redhat specific question because I > don't know the answer, but maybe this will help you diagnose what is not > working. > > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster On Thu, 2 Nov 2006 16:52:56 -0500, Christopher Hawkins wrote: >> -----Original Message----- >> >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of >> isplist at logicore.net >> Sent: Thursday, November 02, 2006 3:51 PM >> To: linux-cluster >> Subject: RE: [Linux-cluster] LVS: Not as a gateway? >> >> Maybe someone who's running DIRECT and all of the same >> internap IP's. can send me their lvs.cfg? I'm stumped. >> >> Yes, it's how I've got it set up. Only problem is, the web servers need to > see the LVS as their gateways no? > > No. I'm not familiar with the redhat way of doing this, but I know LVS. So > if I trample on accepted redhat wisdom someone please correct me.... But > straight, cross-platform LVS works like this: The real servers do not even > need to know they are part of an LVS cluster. They can be setup like any > typical server providing a service, like httpd or whatever, and then they > get the virtual IP bound to their loopback interface so that they will > accept packets with that address. However, they do not respond to ARP > requests, therefore that IP can exist on all servers in this cluster and not > cause communication problems. Ie., if you were to ping that address, only > the director would respond. I have configured real servers before with just > these three lines of code: > ifconfig lo:1 $VIP netmask 255.255.255.255 up > echo 1 > /proc/sys/net/ipv4/conf/all/hidden > echo 1 > /proc/sys/net/ipv4/conf/lo/hidden > > That's it. If a real server can accept packets addressed to the virtual IP > and does not respond to ARP on that IP, it's done, all set up. If nanny > requires more, then maybe there is more, but that's the guts of the LVS > stuff on real servers. 
> > The LVS director (and it's partner if there are two directors) accept all > packets that are sent to the floating IP, whether they get there by NAT > through a firewall, or directly from a machine on the same network, or > anywhere else. Say I want a web page from this cluster. I send an http > request to your floating IP. The director gets it, chooses a real server, > and then forwards the packet there. The real server serves the request and > replies directly to the client, using the virtual IP as the source in the > packet header, such that client never realizes that it dealt with more than > one machine. From the client side, it sent an http request to your virtual > IP (hypothetically) 10.10.10.9 which got accepted by the director and it got > a packet back from 10.10.10.9 (which is bound to the loopback on the real > server, so it can use that address as its source). And voila, a transparent > load balanced cluster. I can't answer a redhat specific question because I > don't know the answer, but maybe this will help you diagnose what is not > working. > > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bganeshmail at gmail.com Mon Nov 6 04:31:13 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Mon, 6 Nov 2006 10:01:13 +0530 Subject: [Linux-cluster] (no subject) Message-ID: <4e3b15f00611052031j12783d1ck81223ec272fa20df@mail.gmail.com> Linux-cluster From bganeshmail at gmail.com Mon Nov 6 04:31:35 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Mon, 6 Nov 2006 10:01:35 +0530 Subject: [Linux-cluster] Linux-cluster Message-ID: <4e3b15f00611052031t1347fc7te7f954c8717504f8@mail.gmail.com> From bganeshmail at gmail.com Mon Nov 6 04:34:36 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Mon, 6 Nov 2006 10:04:36 +0530 Subject: [Linux-cluster] Procedure for installing and configurung Redhat cluster suite 4 Message-ID: <4e3b15f00611052034j22e98a7bq461693728238c13e@mail.gmail.com> Dear Admin, Pls send the procedure for installing and confirung procedure for Redhat clustering of mail server Redhat Clustering of web server. Regds B.Ganesh From romero.cl at gmail.com Mon Nov 6 04:38:40 2006 From: romero.cl at gmail.com (romero.cl at gmail.com) Date: Mon, 6 Nov 2006 01:38:40 -0300 Subject: [Linux-cluster] gfs mounted but not working References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com><000e01c7013a$d720f220$0100a8c0@fremen> <454E8C19.1070607@redhat.com> Message-ID: <001501c7015d$702a38b0$0100a8c0@fremen> Hi. Now i'm trying this and it works! for now... Two nodes: node3 & node4 node4 export his /dev/sdb2 with gnbd_export as "node4_sdb2" node3 import node4's /dev/sdb2 with gnbd_import (new /dev/gnbd/node4_sdb2) on node3: gfs_mkfs -p lock_dlm -t node3:node3_gfs -j 4 /dev/gnbd/node4_sdb2 mount -t gfs /dev/gnbd/node4_gfs /users/home on node4: mount -t gfs /dev/sdb2 /users/home and both nodes can read an write ths same files on /users/home!!! Now i'm going for this: 4 nodes on a dedicated 3com 1Gbit ethernet switch: node2 exporting with gnbd_export /dev/sdb2 as "node2_sdb2" node3 exporting with gnbd_export /dev/sdb2 as "node3_sdb2" node4 exporting with gnbd_export /dev/sdb2 as "node4_sdb2" node1 (main) will import all "nodeX_sdb2" and create a logical volume named "main_lv" including: /dev/sdb2 (his own) /dev/gnbd/node2_sdb2 /dev/gnbd/node3_sdb2 /dev/gnbd/node4_sdb2 Next I will try to export the new big logical volume with "gnbd_export" and then do gnbd_import on each node. 
With that each node will see "main_lv", then mount it on /users/home as gfs and get a big shared filesystem to work toghether. Is this the correct way to do the work??? possibly a deadlock??? Sorry if my english isn't very good ;) ----- Original Message ----- From: "Kevin Anderson" To: "linux clustering" Sent: Sunday, November 05, 2006 10:12 PM Subject: Re: [Linux-cluster] gfs mounted but not working > On 11/5/06, *romero.cl at gmail.com * > > wrote: > > > > Hi. > > > > I'm trying your method, but still have a problem: > > > > Note: /dev/db2/ is a local partition on my second SCSI hard drive > > (no RAID) > > runing on HP ProLiant. > > > GFS requires that all storage is equally accessible by all nodes in the > cluster. Your other nodes have no path to the storage you set up so it > is impossible for them to share the data. > > Kevin > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From romero.cl at gmail.com Mon Nov 6 04:43:20 2006 From: romero.cl at gmail.com (romero.cl at gmail.com) Date: Mon, 6 Nov 2006 01:43:20 -0300 Subject: [Linux-cluster] Procedure for installing and configurung Redhatcluster suite 4 References: <4e3b15f00611052034j22e98a7bq461693728238c13e@mail.gmail.com> Message-ID: <002301c7015e$1726a270$0100a8c0@fremen> Hi. setting up httpd server can be found on: http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ch-httpd-service.html ----- Original Message ----- From: "Ganesh B" To: Sent: Monday, November 06, 2006 1:34 AM Subject: [Linux-cluster] Procedure for installing and configurung Redhatcluster suite 4 > Dear Admin, > > Pls send the procedure for installing and confirung procedure for > > Redhat clustering of mail server > Redhat Clustering of web server. > > > Regds > B.Ganesh > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From wcheng at redhat.com Mon Nov 6 05:29:18 2006 From: wcheng at redhat.com (Wendy Cheng) Date: Mon, 06 Nov 2006 00:29:18 -0500 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com> <000e01c7013a$d720f220$0100a8c0@fremen> Message-ID: <454EC82E.6050407@redhat.com> Marc wrote: > > This is as far as I got recently, and I called RH support and got > nowhere. GFS and RHCS are apparently very flaky and despite their > supposedly being mission critical and enterprise ready. This sounds > like 'split brain' configuration where you finagle with the crappy > software until each node is kind of a cluster unto itself but not all > together. Based on "as far as I got" statement above, I assume you also tried to use local SCSI partitions as GFS mount ? I'm more than surprised that Red Hat Support didn't notice a wrong configuration like that. Could you give me your ticket number and/or phone contact info so we can verify this issue ? Thanks. -- Wendy From wcheng at redhat.com Mon Nov 6 06:00:52 2006 From: wcheng at redhat.com (Wendy Cheng) Date: Mon, 06 Nov 2006 01:00:52 -0500 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <001501c7015d$702a38b0$0100a8c0@fremen> References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com><000e01c7013a$d720f220$0100a8c0@fremen> <454E8C19.1070607@redhat.com> <001501c7015d$702a38b0$0100a8c0@fremen> Message-ID: <454ECF94.7010008@redhat.com> romero.cl at gmail.com wrote: >Hi. > >Now i'm trying this and it works! for now... 
> >Two nodes: node3 & node4 >node4 export his /dev/sdb2 with gnbd_export as "node4_sdb2" >node3 import node4's /dev/sdb2 with gnbd_import (new /dev/gnbd/node4_sdb2) > >on node3: gfs_mkfs -p lock_dlm -t node3:node3_gfs -j 4 /dev/gnbd/node4_sdb2 > mount -t gfs /dev/gnbd/node4_gfs /users/home > >on node4: mount -t gfs /dev/sdb2 /users/home > >and both nodes can read an write ths same files on /users/home!!! > >Now i'm going for this: > >4 nodes on a dedicated 3com 1Gbit ethernet switch: > >node2 exporting with gnbd_export /dev/sdb2 as "node2_sdb2" >node3 exporting with gnbd_export /dev/sdb2 as "node3_sdb2" >node4 exporting with gnbd_export /dev/sdb2 as "node4_sdb2" > >node1 (main) will import all "nodeX_sdb2" and create a logical volume named >"main_lv" including: > > /dev/sdb2 (his own) > /dev/gnbd/node2_sdb2 > /dev/gnbd/node3_sdb2 > /dev/gnbd/node4_sdb2 > >Next I will try to export the new big logical volume with "gnbd_export" and >then do gnbd_import on each node. >With that each node will see "main_lv", then mount it on /users/home as gfs >and get a big shared filesystem to work toghether. > >Is this the correct way to do the work??? possibly a deadlock??? > >Sorry if my english isn't very good ;) > > I personally don't know GNBD very well. In theory, it exports block device (via network) and there is nothing wrong to glue them together to form a big LVM partition. However, most of the GNBD configurations I know of are exporting the block devices from a group of server nodes that normally have quite decent disk resources. Then another group of nodes just imports these block devices as GNBD clients. I'm not sure how well the system would work if you mix all GNBD client and server together, particularly under heavy workloads. -- Wendy From wcheng at redhat.com Mon Nov 6 06:31:40 2006 From: wcheng at redhat.com (Wendy Cheng) Date: Mon, 06 Nov 2006 01:31:40 -0500 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <454ECF94.7010008@redhat.com> References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454CB635.9030806@redhat.com><000e01c7013a$d720f220$0100a8c0@fremen> <454E8C19.1070607@redhat.com> <001501c7015d$702a38b0$0100a8c0@fremen> <454ECF94.7010008@redhat.com> Message-ID: <454ED6CC.9030908@redhat.com> > romero.cl at gmail.com wrote: > >> Hi. >> >> Now i'm trying this and it works! for now... >> >> Two nodes: node3 & node4 >> node4 export his /dev/sdb2 with gnbd_export as "node4_sdb2" >> node3 import node4's /dev/sdb2 with gnbd_import (new >> /dev/gnbd/node4_sdb2) >> >> on node3: gfs_mkfs -p lock_dlm -t node3:node3_gfs -j 4 >> /dev/gnbd/node4_sdb2 >> mount -t gfs /dev/gnbd/node4_gfs /users/home >> >> on node4: mount -t gfs /dev/sdb2 /users/home >> >> and both nodes can read an write ths same files on /users/home!!! >> >> Now i'm going for this: >> >> 4 nodes on a dedicated 3com 1Gbit ethernet switch: >> >> node2 exporting with gnbd_export /dev/sdb2 as "node2_sdb2" >> node3 exporting with gnbd_export /dev/sdb2 as "node3_sdb2" >> node4 exporting with gnbd_export /dev/sdb2 as "node4_sdb2" >> >> node1 (main) will import all "nodeX_sdb2" and create a logical volume >> named >> "main_lv" including: >> >> /dev/sdb2 (his own) >> /dev/gnbd/node2_sdb2 >> /dev/gnbd/node3_sdb2 >> /dev/gnbd/node4_sdb2 > Forgot to say if you want to proceed with above, ... node1 has to import its own GNBD sdb2 export. The key idea to keep in mind is that if you have any "local" configuration, other node would not be able to see it. 
I also assume you've installed CLVM (lvm2) - that is "cluster LVM", not the plain LVM. -- Wendy From isplist at logicore.net Mon Nov 6 14:39:41 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 6 Nov 2006 08:39:41 -0600 Subject: [Linux-cluster] GFS/SCSI Lost Message-ID: <200611683941.908313@leena> I posted about this last night but found some additional info. My first post was not very useful, showing a paste of SCSI errors after it was disconnected. I see that something just times out and the storage is lost. I find that I can just get on the node, unmount the lost mount, remount and it's back. I also notice that the mount is set as non permanent? Do I need a keep alive script or is there a configuration somewhere I've missed? Here is a snippet from where SCSCI errors started overnight. Nov 5 21:16:02 qm250 kernel: SCSI error : <0 0 2 1> return code = 0x10000 Nov 5 21:16:02 qm250 kernel: end_request: I/O error, dev sdf, sector 655 Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: fatal: I/O error Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: block = 26 Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: function = gfs_dreread Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: file = /home/xos/gen/updates-2006-08/xlrpm21122/rpm/BUILD/gfs-kerne l-2.6.9-58/up/src/gfs/dio.c, line = 576 Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: time = 1162782962 Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: about to withdraw from the cluster Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: waiting for outstanding I/O Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: telling LM to withdraw Nov 5 21:16:05 qm250 kernel: lock_dlm: withdraw abandoned memory Nov 5 21:16:05 qm250 kernel: GFS: fsid=vgcomp:qm.0: withdrawn From rpeterso at redhat.com Mon Nov 6 15:25:34 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 06 Nov 2006 09:25:34 -0600 Subject: [Linux-cluster] GFS/SCSI Lost In-Reply-To: <200611683941.908313@leena> References: <200611683941.908313@leena> Message-ID: <454F53EE.9050001@redhat.com> isplist at logicore.net wrote: > I posted about this last night but found some additional info. My first post > was not very useful, showing a paste of SCSI errors after it was disconnected. > > I see that something just times out and the storage is lost. I find that I can > just get on the node, unmount the lost mount, remount and it's back. I also > notice that the mount is set as non permanent? > > Do I need a keep alive script or is there a configuration somewhere I've > missed? Here is a snippet from where SCSCI errors started overnight. 
> > Nov 5 21:16:02 qm250 kernel: SCSI error : <0 0 2 1> return code = 0x10000 > Nov 5 21:16:02 qm250 kernel: end_request: I/O error, dev sdf, sector 655 > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: fatal: I/O error > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: block = 26 > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: function = gfs_dreread > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: file = > /home/xos/gen/updates-2006-08/xlrpm21122/rpm/BUILD/gfs-kerne > l-2.6.9-58/up/src/gfs/dio.c, line = 576 > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: time = 1162782962 > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: about to withdraw from > the cluster > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: waiting for outstanding > I/O > Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: telling LM to withdraw > Nov 5 21:16:05 qm250 kernel: lock_dlm: withdraw abandoned memory > Nov 5 21:16:05 qm250 kernel: GFS: fsid=vgcomp:qm.0: withdrawn > Hi Mike, This one can't be blamed on GFS or the cluster infrastructure. The messages indicate that GFS withdrew because of underlying SCSI errors, which could mean a number of things underneath GFS, like flaky hardware, cables, etc. Maybe even the storage adapter or possibly even its device driver. The problem is not that your mount is temporary, and you shouldn't need any kind of keepalive script, that I'm aware of. Regards, Bob Peterson Red Hat Cluster Suite From isplist at logicore.net Mon Nov 6 15:33:45 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 6 Nov 2006 09:33:45 -0600 Subject: [Linux-cluster] GFS/SCSI Lost In-Reply-To: <454F53EE.9050001@redhat.com> Message-ID: <200611693345.079643@leena> >> Do I need a keep alive script or is there a configuration somewhere I've >> missed? Here is a snippet from where SCSCI errors started overnight. > This one can't be blamed on GFS or the cluster infrastructure. > The messages indicate that GFS withdrew because of underlying SCSI errors, > which could mean a number of things underneath GFS, like flaky hardware, > cables, etc. Ok, so nothing to do with GFS or the cluster other than it pulling out due to failed storage. Thanks very much, it was not clear to me which problem came first. Mike > Maybe even the storage adapter or possibly even its device driver. > The problem is not that your mount is temporary, and you shouldn't need any > kind of keepalive script, that I'm aware of. 
>> Nov 5 21:16:02 qm250 kernel: SCSI error : <0 0 2 1> return code = 0x10000 >> Nov 5 21:16:02 qm250 kernel: end_request: I/O error, dev sdf, sector 655 >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: fatal: I/O error >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: block = 26 >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: function = >> gfs_dreread >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: file = >> /home/xos/gen/updates-2006-08/xlrpm21122/rpm/BUILD/gfs-kerne >> l-2.6.9-58/up/src/gfs/dio.c, line = 576 >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: time = 1162782962 >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: about to withdraw >> from >> the cluster >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: waiting for >> outstanding >> I/O >> Nov 5 21:16:02 qm250 kernel: GFS: fsid=vgcomp:qm.0: telling LM to >> withdraw >> Nov 5 21:16:05 qm250 kernel: lock_dlm: withdraw abandoned memory >> Nov 5 21:16:05 qm250 kernel: GFS: fsid=vgcomp:qm.0: withdrawn > Regards, > > Bob Peterson > Red Hat Cluster Suite From rpeterso at redhat.com Mon Nov 6 15:35:16 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 06 Nov 2006 09:35:16 -0600 Subject: [Linux-cluster] Procedure for installing and configurung Redhat cluster suite 4 In-Reply-To: <4e3b15f00611052034j22e98a7bq461693728238c13e@mail.gmail.com> References: <4e3b15f00611052034j22e98a7bq461693728238c13e@mail.gmail.com> Message-ID: <454F5634.7020409@redhat.com> Ganesh B wrote: > Dear Admin, > > Pls send the procedure for installing and confirung procedure for > > Redhat clustering of mail server > Redhat Clustering of web server. > > > Regds > B.Ganesh Ganesh, I recommend you start with this document: http://sources.redhat.com/cluster/doc/nfscookbook.pdf and perhaps: http://sources.redhat.com/cluster/faq.html Regards, Bob Peterson Red Hat Cluster Suite From isplist at logicore.net Mon Nov 6 15:40:26 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 6 Nov 2006 09:40:26 -0600 Subject: [Linux-cluster] LVS: Not as a gateway? In-Reply-To: <200611022131.kA2LVsAA012346@mail2.ontariocreditcorp.com> Message-ID: <200611694026.854986@leena> To not take up too much of your time, here is my lvs.cf file. LVS0's real eth0 IP is 192.168.1.52. LVS1's real eth0 IP is 192.168.1.53. CWEB92 is the only real server with the virtual IP of 192.168.1.150 on it's ether0 for now. It's real IP is 192.168.1.92. When I ping 150, I ping either LVS servers, depending on who's master but the CWEB machine never responds to 192.168.1.150. I know that's where my confusion or error is, understanding this part. Mike serial_no = 88 primary = 192.168.1.52 service = lvs backup_active = 1 backup = 0.0.0.0 heartbeat = 1 heartbeat_port = 539 keepalive = 6 deadtime = 18 network = direct nat_nmask = 255.255.255.255 debug_level = NONE monitor_links = 0 virtual HTTP { active = 1 address = 192.168.1.150 eth0:1 vip_nmask = 255.255.255.0 port = 80 send = "GET / HTTP/1.0rnrn" expect = "HTTP" use_regex = 0 load_monitor = none scheduler = wlc protocol = tcp timeout = 6 reentry = 15 quiesce_server = 0 server cweb92 { address = 192.168.1.92 active = 1 weight = 1 } server cweb93 { address = 192.168.1.93 active = 1 weight = 1 } server cweb94 { address = 192.168.1.94 active = 1 weight = 1 } } From rpeterso at redhat.com Mon Nov 6 15:39:48 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 06 Nov 2006 09:39:48 -0600 Subject: [Linux-cluster] Files are there, but not. 
In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> Message-ID: <454F5744.9040508@redhat.com> Jaap Dijkshoorn wrote: > Hi All, > > We are running a GFS / NFS cluster with 5 fileservers. Each server > exports the same storage as a different NFS server. nfs1, nfs2....nfs5 > > On nodes that mount on nfs1 we have the following problem: > > root# ls -l > ls: CHGCAR: No such file or directory > ls: CHG: No such file or directory > ls: WAVECAR: No such file or directory > total 17616 > -rw------- 1 xxxxxxxx yyy 1612 Sep 26 15:43 CONTCAR > -rw------- 1 xxxxxxxx yyy 167 Sep 26 14:11 DOSCAR > > Other fileservers display the 3 missing files normally and they are > accessible. On the fileservers we also get these kind messages: > > h_update: A.TCNQ/CHGCAR already up-to-date! > fh_update: A.TCNQ/CHGCAR already up-to-date! > fh_update: A.TCNQ/CHGCAR already up-to-date! > fh_update: A.TCNQ/CHGCAR already up-to-date! > fh_update: A.TCNQ/WAVECAR already up-to-date! > fh_update: A.TCNQ/WAVECAR already up-to-date! > fh_update: A.TCNQ/WAVECAR already up-to-date! > fh_update: A+B/CHGCAR already up-to-date! > fh_update: A+B/CHGCAR already up-to-date! > fh_update: A+B/CHGCAR already up-to-date! > fh_update: A+B/WAVECAR already up-to-date! > > I don't know where to start looking to trac this problem. If i reboot > the nfs1 server the problem is gone, but in time the problem comes back > with other files, until now on the same fileserver. > > Maybe someone has seen this problem before? > > We use GFS version CVS 1.0.3 stable. with kernel 2.6.17.11 > > > > > Met vriendelijke groet, Kind Regards, > > Jaap P. Dijkshoorn > Systems Programmer > mailto:jaap at sara.nl http://home.sara.nl/~jaapd > > SARA Computing & Networking Services > Kruislaan 415 1098 SJ Amsterdam > Tel: +31-(0)20-5923000 > Fax: +31-(0)20-6683167 > http://www.sara.nl > Hi Jaap, I'm cleaning out my backlog of emails, so I apologize for now seeing this sooner. This problem might be the same as: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190756 Regards, Bob Peterson Red Hat Cluster Suite From dbrieck at gmail.com Mon Nov 6 16:25:59 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Mon, 6 Nov 2006 11:25:59 -0500 Subject: [Linux-cluster] Storage Problems, need some advice Message-ID: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> Some of you may have noticed the problems I've been posting to the mailing with my storage. For those who haven't here a quick rundown of my setup: GNBD Servers w/ clustered SCSI enclosure GNBD Clients w/ multipath CLVM GFS and DLM I've been having crashes, lockups and poor performance with my setup almost since it was first setup. I've tried getting rid of multipath and went down to only one GNBD server but I'm still suffering from random crashes under moderate to high I/O. I'm not getting any errors in the logs at all that can help, not even kernel errors any more. So I've simpled to GNBD Server -> GNBD Clients -> CLVM -> GFS which is exactly what is specified in the manual: http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/ch-gnbd.html and it still seems to act very flaky. My GNBD Server(s) don't crash, it's always the client nodes that end up crashing. The only connecting factor seems to be I/O on the GNBD. So I have to find a solution that's going to work and I don't think anything with GNBD is going to work regardless of hardware. So that pretty much only leaves FC or iSCSI. 
I don't want to go with FC due to cost considerations and I already have a GB network setup just for storage (GNBD) and cluster functions and another for regular LAN traffic. So if I go with iSCSI then I can completely get rid of the GNBD aspect and possibly CLVM as well. However, I am concerned about performance, especially since I'd want to put MySQL data on it. The Promise 500i (http://www.promise.com/product/product_detail_eng.asp?segment=VTrak&product_id=149) seems like it nice solution (much cheaper than the Dell CX300 I would also consider) but will it deliver performance wise? Would you want an enclosure with SCSI drives or SATA drives? And if you went with SATA drives, would you go with the 10k 1.5Gb drives or the 7200 3Gb drives for maximum performance? If you wanted SATA drives is the Promise the best solution? If you wanted iSCSI with SCSI drives what would you go with? Thanks for the help, I really need to get this storage situation under control. From carlopmart at gmail.com Mon Nov 6 17:49:01 2006 From: carlopmart at gmail.com (C. L. Martinez) Date: Mon, 6 Nov 2006 17:49:01 +0000 Subject: [Linux-cluster] Kerberos server on a RHCS Message-ID: <590a9c800611060949w52842789nd97a16850552d06c@mail.gmail.com> HI all, Somebody have tried to setup a kerberos primary server on a rhcs (with two nodes)?. I need to deploy two pairs of rhcs cluster with two nodes on each one: one for primary kdc and another for secondary kdc. Is it possible?? I think that the big problem is with krb5.conf file ... Many thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaap at sara.nl Mon Nov 6 18:50:13 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Mon, 6 Nov 2006 19:50:13 +0100 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <454F5744.9040508@redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB01693D84@douwes.ka.sara.nl> Hi Bob, No problem. Yeah i think it is. I read that someone of RedHat already could reproduce it in the lab. Do you know of any progress so far? Met vriendelijke groet, Kind Regards, Jaap P. Dijkshoorn Group Leader Cluster Computing Systems Programmer mailto:jaap at sara.nl http://home.sara.nl/~jaapd SARA Computing & Networking Services Kruislaan 415 1098 SJ Amsterdam Tel: +31-(0)20-5923000 Fax: +31-(0)20-6683167 http://www.sara.nl > Hi Jaap, > > I'm cleaning out my backlog of emails, so I apologize for now seeing > this sooner. > This problem might be the same as: > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190756 > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From rpeterso at redhat.com Mon Nov 6 19:02:11 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 06 Nov 2006 13:02:11 -0600 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB01693D84@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB01693D84@douwes.ka.sara.nl> Message-ID: <454F86B3.5000009@redhat.com> Jaap Dijkshoorn wrote: > Hi Bob, > > No problem. Yeah i think it is. I read that someone of RedHat already > could reproduce it in the lab. Do you know of any progress so far? > > > Met vriendelijke groet, Kind Regards, > > Jaap P. 
Dijkshoorn > Group Leader Cluster Computing > Systems Programmer > mailto:jaap at sara.nl http://home.sara.nl/~jaapd > > SARA Computing & Networking Services > Kruislaan 415 1098 SJ Amsterdam > Tel: +31-(0)20-5923000 > Fax: +31-(0)20-6683167 > http://www.sara.nl > Hi Jaap, I was able to recreate it once in the lab, but it's not an easy thing to do. As it goes, I've been pulled off this to work on another hot problem, which, in turn, was preempted by another priority, which was interrupted by a severity one critical situation, which... You get the picture. I hope to get back to this soon though. Regards, Bob Peterson Red Hat Cluster Suite From jaap at sara.nl Mon Nov 6 19:21:51 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Mon, 6 Nov 2006 20:21:51 +0100 Subject: [Linux-cluster] order of GFS nodes in a cluster Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB01693D85@douwes.ka.sara.nl> Hi All, I have a question. We have 5 GFS fileservers in one cluster. On one node we see this: GFS Mount Group: "lisa_vg2_lv1" 10 36 run - [3 2 1 4 5] On another node we see this: GFS Mount Group: "lisa_vg2_lv1" 10 11 run - [1 2 3 4 5] The order of nodes in the GFS cluster is different. Is this a problem, or is this normal behaviour? Met vriendelijke groet, Kind Regards, Jaap P. Dijkshoorn Group Leader Cluster Computing Systems Programmer mailto:jaap at sara.nl http://home.sara.nl/~jaapd SARA Computing & Networking Services Kruislaan 415 1098 SJ Amsterdam Tel: +31-(0)20-5923000 Fax: +31-(0)20-6683167 http://www.sara.nl From jaap at sara.nl Mon Nov 6 19:23:40 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Mon, 6 Nov 2006 20:23:40 +0100 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <454F86B3.5000009@redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB01693D86@douwes.ka.sara.nl> Hi Bob, I get the picture, in fact we have the same pictures over here ;) So i understand your trouble! > Jaap Dijkshoorn wrote: > > Hi Bob, > > > > No problem. Yeah i think it is. I read that someone of > RedHat already > > could reproduce it in the lab. Do you know of any progress so far? > > > > > > Met vriendelijke groet, Kind Regards, > > > > Jaap P. Dijkshoorn > > Group Leader Cluster Computing > > Systems Programmer > > mailto:jaap at sara.nl http://home.sara.nl/~jaapd > > > > SARA Computing & Networking Services > > Kruislaan 415 1098 SJ Amsterdam > > Tel: +31-(0)20-5923000 > > Fax: +31-(0)20-6683167 > > http://www.sara.nl > > > Hi Jaap, > > I was able to recreate it once in the lab, but it's not an > easy thing to do. > As it goes, I've been pulled off this to work on another hot problem, > which, in turn, was preempted by another priority, which was > interrupted > by a severity one critical situation, which... > You get the picture. I hope to get back to this soon though. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From isplist at logicore.net Mon Nov 6 20:15:05 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 6 Nov 2006 14:15:05 -0600 Subject: [Linux-cluster] LVS, not so fun today... Message-ID: <200611614155.814749@leena> Don't know why, maybe I've taken on too many things at once but this is just confusing me now. Here's all the info I can think of as I ask for help once again from the fine folks in this list :). Same problem, all weekend, trying various combinations. Seems simple enough but nothing has worked for me so far. 
There's something that's just not sinking into my brain here. I've included my lvs.cf at the end of this message. LVS0's real eth0 IP is 192.168.1.52. LVS1's real eth0 IP is 192.168.1.53. CWEB92 is the only real server with the virtual IP of 192.168.1.150 on it's ether0 for testing. It's real IP is 192.168.1.92. When I ping 150, I reach either LVS servers, depending on who's master but the CWEB machine never responds to 192.168.1.150. So, again, I have 192.168.1.150 as a virtual IP on the first NIC of each REAL web server. I have 192.168.1.150 installed on LVS0 as the VIP for the real servers and real servers configured. Nov 6 14:00:49 lb52 pulse[2498]: STARTING PULSE AS MASTER Nov 6 14:00:49 lb52 pulse: pulse startup succeeded Nov 6 14:01:07 lb52 pulse[2498]: partner dead: activating lvs Nov 6 14:01:07 lb52 lvs[2522]: starting virtual service HTTP active: 80 Nov 6 14:01:07 lb52 nanny[2527]: starting LVS client monitor for 192.168.1.150:80 Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb92 running as pid 2527 Nov 6 14:01:07 lb52 nanny[2528]: starting LVS client monitor for 192.168.1.150:80 Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb93 running as pid 2528 Nov 6 14:01:07 lb52 nanny[2529]: starting LVS client monitor for 192.168.1.150:80 Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb94 running as pid 2529 Nov 6 14:01:12 lb52 pulse[2524]: gratuitous lvs arps finished Nov 6 14:01:13 lb52 nanny[2527]: READ to 192.168.1.92:80 timed out Nov 6 14:01:13 lb52 nanny[2528]: READ to 192.168.1.93:80 timed out Nov 6 14:01:13 lb52 nanny[2529]: READ to 192.168.1.94:80 timed out Nov 6 14:01:25 lb52 nanny[2528]: READ to 192.168.1.93:80 timed out Nov 6 14:01:25 lb52 nanny[2527]: READ to 192.168.1.92:80 timed out Nov 6 14:01:25 lb52 nanny[2529]: READ to 192.168.1.94:80 timed out serial_no = 88 primary = 192.168.1.52 (Real, physical IP of the first LVS server) service = lvs backup_active = 1 backup = 0.0.0.0 heartbeat = 1 heartbeat_port = 539 keepalive = 6 deadtime = 18 network = direct nat_nmask = 255.255.255.255 debug_level = NONE monitor_links = 0 virtual HTTP { active = 1 address = 192.168.1.150 eth0:1 (IP that LVS responds to, then sends connections to read servers, right?) vip_nmask = 255.255.255.0 port = 80 send = "GET / HTTP/1.0rnrn" expect = "HTTP" use_regex = 0 load_monitor = none scheduler = wlc protocol = tcp timeout = 6 reentry = 15 quiesce_server = 0 server cweb92 { address = 192.168.1.92 active = 1 weight = 1 } server cweb93 { address = 192.168.1.93 active = 1 weight = 1 } server cweb94 { address = 192.168.1.94 active = 1 weight = 1 } } From Matthew.Patton.ctr at osd.mil Mon Nov 6 21:46:10 2006 From: Matthew.Patton.ctr at osd.mil (Patton, Matthew F, CTR, OSD-PA&E) Date: Mon, 6 Nov 2006 16:46:10 -0500 Subject: [Linux-cluster] Storage Problems, need some advice Message-ID: <9344F1B4E9BB3E46AA0E76EDD7DCC158140D04@RSRCNEX02CN1.rsrc.osd.mil> Classification: UNCLASSIFIED > Would you want an enclosure with SCSI drives or SATA drives? how much money you got? Unless you are running a fortune 1000 or better operation SATA should more than cut it. > the 7200 3Gb drives for maximum performance? define "maximum". With a RAID controller that doesn't stink and enough spindles the 300GB drives should be quite sufficient. depends on how much $ you have. You don't have to have an external storage chassis. You can convert that GNBD server to an iscsi head just with some software. 
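(To flesh out the "just some software" remark: one common way to do this in the RHEL4 time frame is the iSCSI Enterprise Target package on the storage box plus the stock iscsi-initiator-utils on the clients. The target name, device path and IP below are illustrative assumptions, not details from this thread:

    # on the storage server, /etc/ietd.conf:
    Target iqn.2006-11.net.example:storage.lv0
            Lun 0 Path=/dev/vg_storage/main_lv,Type=fileio
    # then start the target:
    service iscsi-target start

    # on each client, /etc/iscsi.conf (RHEL4 linux-iscsi initiator):
    DiscoveryAddress=192.168.2.10
    # then:
    service iscsi start

Once the initiator logs in, the exported LUN shows up as an ordinary /dev/sd* device on every client, and CLVM plus GFS sit on top of it exactly as they would on fibre channel storage.)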
From lhh at redhat.com Mon Nov 6 22:16:35 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 06 Nov 2006 17:16:35 -0500 Subject: [Linux-cluster] order of GFS nodes in a cluster In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB01693D85@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB01693D85@douwes.ka.sara.nl> Message-ID: <1162851395.4518.738.camel@rei.boston.devel.redhat.com> On Mon, 2006-11-06 at 20:21 +0100, Jaap Dijkshoorn wrote: > Hi All, > > I have a question. We have 5 GFS fileservers in one cluster. On one node > we see this: > > GFS Mount Group: "lisa_vg2_lv1" 10 36 run - > [3 2 1 4 5] > > On another node we see this: > > GFS Mount Group: "lisa_vg2_lv1" 10 11 run - > [1 2 3 4 5] > > The order of nodes in the GFS cluster is different. Is this a problem, > or is this normal behaviour? The ordering on different nodes does not matter, as long as they all line up. Ex: [1 2 3] [2 1 3] [3 2 1] for the same service group is all fine. -- Lon From dbrieck at gmail.com Tue Nov 7 01:10:37 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Mon, 6 Nov 2006 20:10:37 -0500 Subject: [Linux-cluster] Storage Problems, need some advice In-Reply-To: <9344F1B4E9BB3E46AA0E76EDD7DCC158140D04@RSRCNEX02CN1.rsrc.osd.mil> References: <9344F1B4E9BB3E46AA0E76EDD7DCC158140D04@RSRCNEX02CN1.rsrc.osd.mil> Message-ID: <8c1094290611061710n2068fd38ib853151870f49330@mail.gmail.com> On 11/6/06, Patton, Matthew F, CTR, OSD-PA&E wrote: > Classification: UNCLASSIFIED > > > Would you want an enclosure with SCSI drives or SATA drives? > > how much money you got? Unless you are running a fortune 1000 or better operation SATA should more than cut it. > Thanks, I was just concerned about database activity on the array, I'm sure everything else would be fine. > > the 7200 3Gb drives for maximum performance? > > define "maximum". With a RAID controller that doesn't stink and enough spindles the 300GB drives should be quite sufficient. depends on how much $ you have. You don't have to have an external storage chassis. You can convert that GNBD server to an iscsi head just with some software. > Well, by maximum I mean, with SCSI we always buy 15k drives vs 10k drives so it seems counter intuitive to buy 7200 RPM drives over 10k drives for better performance. I don't have much experience with SATA drives in this type of environment. The only reason I was looking at an enclosure was to have some built in redundancy vs having to use multipath again on two servers or just a single server and no redundancy. From bmarzins at redhat.com Tue Nov 7 02:01:11 2006 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Mon, 6 Nov 2006 20:01:11 -0600 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <001501c7015d$702a38b0$0100a8c0@fremen> References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen> <454E8C19.1070607@redhat.com> <001501c7015d$702a38b0$0100a8c0@fremen> Message-ID: <20061107020111.GA8361@ether.msp.redhat.com> On Mon, Nov 06, 2006 at 01:38:40AM -0300, romero.cl at gmail.com wrote: > Hi. > > Now i'm trying this and it works! for now... > > Two nodes: node3 & node4 > node4 export his /dev/sdb2 with gnbd_export as "node4_sdb2" > node3 import node4's /dev/sdb2 with gnbd_import (new /dev/gnbd/node4_sdb2) > > on node3: gfs_mkfs -p lock_dlm -t node3:node3_gfs -j 4 /dev/gnbd/node4_sdb2 > mount -t gfs /dev/gnbd/node4_gfs /users/home > > on node4: mount -t gfs /dev/sdb2 /users/home > > and both nodes can read an write ths same files on /users/home!!! 
> > Now i'm going for this: > > 4 nodes on a dedicated 3com 1Gbit ethernet switch: > > node2 exporting with gnbd_export /dev/sdb2 as "node2_sdb2" > node3 exporting with gnbd_export /dev/sdb2 as "node3_sdb2" > node4 exporting with gnbd_export /dev/sdb2 as "node4_sdb2" > > node1 (main) will import all "nodeX_sdb2" and create a logical volume named > "main_lv" including: > > /dev/sdb2 (his own) > /dev/gnbd/node2_sdb2 > /dev/gnbd/node3_sdb2 > /dev/gnbd/node4_sdb2 > > Next I will try to export the new big logical volume with "gnbd_export" and > then do gnbd_import on each node. > With that each node will see "main_lv", then mount it on /users/home as gfs > and get a big shared filesystem to work toghether. > > Is this the correct way to do the work??? possibly a deadlock??? Sorry. This will not work. There are a couple of problems. 1. A node shouldn't ever gnbd import a device it has exported. This can cause memory deadlock. When memory pressure is high nodes try to write their buffers to disk. Once the buffer is written to disk, the node can drop it from memory, reducing memory pressure. When you do this over gnbd, for every buffer that you write out on the client, a new buffer request come into the gnbd server. If you import a device you have exported (even indirectly through the logical volume on node1 in this setup) that new request just comes back to you. This means that you suddenly double your buffers in memory, just when memory was running low. The solution is to only access the local device directly, but never through gnbd. Oh, just a note, if you are planning on accessing the local device directly, you must not use the "-c" option when you are exporting the device. This will eventually lead to corruption. The "-c" option is only for dedicated gnbd servers. 2. Theoretically, you could just have every node export the devices to every other node, and then build a logical volume on top of all the devices on each node, but you should not do this. It totally destroys the benefit of having a cluster. Since your GFS filesystem would then depend on having access to the block devices of every machine, if ANY machine in your cluster went down, the whole cluster would crash, because a piece of your filesystem would just disappear. Without shared storage, your gnbd server will be a single point of failure. The most common way that people set up gnbd is with one dedicated gnbd server machine, that is only used to serve gnbd blocks, so that it is unlikely to crash. > Sorry if my english isn't very good ;) > > ----- Original Message ----- > From: "Kevin Anderson" > To: "linux clustering" > Sent: Sunday, November 05, 2006 10:12 PM > Subject: Re: [Linux-cluster] gfs mounted but not working > > > > On 11/5/06, *romero.cl at gmail.com * > > > wrote: > > > > > > Hi. > > > > > > I'm trying your method, but still have a problem: > > > > > > Note: /dev/db2/ is a local partition on my second SCSI hard drive > > > (no RAID) > > > runing on HP ProLiant. > > > > > GFS requires that all storage is equally accessible by all nodes in the > > cluster. Your other nodes have no path to the storage you set up so it > > is impossible for them to share the data. 
> > > > Kevin > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From matthew at arts.usyd.edu.au Tue Nov 7 02:12:11 2006 From: matthew at arts.usyd.edu.au (Matthew Geier) Date: Tue, 07 Nov 2006 13:12:11 +1100 Subject: [Linux-cluster] dm_multipath and failover Message-ID: <454FEB7B.6080103@arts.usyd.edu.au> Not strictly a clustering question, but I have for RHEL CS servers connected to an EMC CX300. QLA2432 fibre channel controllers. The cards have two FC channels. I've used them to run a redundant fabric - there are two fibre channel switches, that each connect to one of the EMC's FC ports - so there are two paths to the storage. A link or switch can fail, but the hosts still can see the storage. The redundancy works fine - I can pull a fibre out and the hosts can see the storage fine. Syslog gets spammed with lots of scsi errors, but dm_multipath does its stuff and the server keeps running. However if I plug that fibre back in, the path doesn't come back. So far the only way to get the path back i've found is to reboot the server. Is there any way to tell the controller the path is back ?. I have noticed that if i'm in this state, rebooting the McData FC switch will also restore the channel - it's like the switch sends some sort of re-initialise channel signal as it reboots. From romero.cl at gmail.com Tue Nov 7 03:19:50 2006 From: romero.cl at gmail.com (romero.cl at gmail.com) Date: Tue, 7 Nov 2006 00:19:50 -0300 Subject: [Linux-cluster] gfs mounted but not working References: <001701c6ffcc$ed2d2d50$0100a8c0@fremen><454E8C19.1070607@redhat.com><001501c7015d$702a38b0$0100a8c0@fremen> <20061107020111.GA8361@ether.msp.redhat.com> Message-ID: <000a01c7021b$972f4ae0$0100a8c0@fremen> Hi. Thanks for the tips. But, if there is a deadlock problem with memory, what wold be a good solution to get each one of the drives on my 4 nodes look like one big shared drive using gfs? I think that I have a wrong idea of what GFS is, can you explain it to me please? is only a file system ? is a way to share storage with distributed lock? ----- Original Message ----- From: "Benjamin Marzinski" To: "linux clustering" Sent: Monday, November 06, 2006 11:01 PM Subject: Re: [Linux-cluster] gfs mounted but not working > On Mon, Nov 06, 2006 at 01:38:40AM -0300, romero.cl at gmail.com wrote: > > Hi. > > > > Now i'm trying this and it works! for now... > > > > Two nodes: node3 & node4 > > node4 export his /dev/sdb2 with gnbd_export as "node4_sdb2" > > node3 import node4's /dev/sdb2 with gnbd_import (new /dev/gnbd/node4_sdb2) > > > > on node3: gfs_mkfs -p lock_dlm -t node3:node3_gfs -j 4 /dev/gnbd/node4_sdb2 > > mount -t gfs /dev/gnbd/node4_gfs /users/home > > > > on node4: mount -t gfs /dev/sdb2 /users/home > > > > and both nodes can read an write ths same files on /users/home!!! 
> > > > Now i'm going for this: > > > > 4 nodes on a dedicated 3com 1Gbit ethernet switch: > > > > node2 exporting with gnbd_export /dev/sdb2 as "node2_sdb2" > > node3 exporting with gnbd_export /dev/sdb2 as "node3_sdb2" > > node4 exporting with gnbd_export /dev/sdb2 as "node4_sdb2" > > > > node1 (main) will import all "nodeX_sdb2" and create a logical volume named > > "main_lv" including: > > > > /dev/sdb2 (his own) > > /dev/gnbd/node2_sdb2 > > /dev/gnbd/node3_sdb2 > > /dev/gnbd/node4_sdb2 > > > > Next I will try to export the new big logical volume with "gnbd_export" and > > then do gnbd_import on each node. > > With that each node will see "main_lv", then mount it on /users/home as gfs > > and get a big shared filesystem to work toghether. > > > > Is this the correct way to do the work??? possibly a deadlock??? > > Sorry. This will not work. There are a couple of problems. > > 1. A node shouldn't ever gnbd import a device it has exported. This can > cause memory deadlock. When memory pressure is high nodes try to write > their buffers to disk. Once the buffer is written to disk, the node can drop > it from memory, reducing memory pressure. When you do this over gnbd, for every > buffer that you write out on the client, a new buffer request come into the gnbd > server. If you import a device you have exported (even indirectly through the > logical volume on node1 in this setup) that new request just comes back to you. > This means that you suddenly double your buffers in memory, just when memory was > running low. > > The solution is to only access the local device directly, but never through > gnbd. Oh, just a note, if you are planning on accessing the local device > directly, you must not use the "-c" option when you are exporting the device. > This will eventually lead to corruption. The "-c" option is only for dedicated > gnbd servers. > > 2. Theoretically, you could just have every node export the devices to every > other node, and then build a logical volume on top of all the devices on each > node, but you should not do this. It totally destroys the benefit of having > a cluster. Since your GFS filesystem would then depend on having access to the > block devices of every machine, if ANY machine in your cluster went down, the > whole cluster would crash, because a piece of your filesystem would just > disappear. > > > Without shared storage, your gnbd server will be a single point of failure. > The most common way that people set up gnbd is with one dedicated gnbd server > machine, that is only used to serve gnbd blocks, so that it is unlikely to > crash. > > > Sorry if my english isn't very good ;) > > > > ----- Original Message ----- > > From: "Kevin Anderson" > > To: "linux clustering" > > Sent: Sunday, November 05, 2006 10:12 PM > > Subject: Re: [Linux-cluster] gfs mounted but not working > > > > > > > On 11/5/06, *romero.cl at gmail.com * > > > > wrote: > > > > > > > > Hi. > > > > > > > > I'm trying your method, but still have a problem: > > > > > > > > Note: /dev/db2/ is a local partition on my second SCSI hard drive > > > > (no RAID) > > > > runing on HP ProLiant. > > > > > > > GFS requires that all storage is equally accessible by all nodes in the > > > cluster. Your other nodes have no path to the storage you set up so it > > > is impossible for them to share the data. 
> > > > > > Kevin > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bmarzins at redhat.com Tue Nov 7 06:03:21 2006 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 7 Nov 2006 00:03:21 -0600 Subject: [Linux-cluster] gfs mounted but not working In-Reply-To: <000a01c7021b$972f4ae0$0100a8c0@fremen> References: <20061107020111.GA8361@ether.msp.redhat.com> <000a01c7021b$972f4ae0$0100a8c0@fremen> Message-ID: <20061107060321.GB8361@ether.msp.redhat.com> On Tue, Nov 07, 2006 at 12:19:50AM -0300, romero.cl at gmail.com wrote: > Hi. > > Thanks for the tips. > > But, if there is a deadlock problem with memory, what wold be a good > solution to get each one of the drives on my 4 nodes look like one big > shared drive using gfs? The easiest way to get a cluster with GFS is to have one of your nodes export its block device with gnbd to the other nodes. There is no good way to take all the storage from all your nodes and make it look like one big file system that all the nodes can use. Like I said before, you can do this with GNBD. It's just a bad idea. If you make a big shared drive that uses all the disks from all the machines, when any machine goes down, a piece of that big shared drive will disappear, because it was located on the machine that went down. Since there's no way that any software could access a machine's local storage when the machine is crashed, there is no possible way to avoid this problem. We will shortly be releasing cluster-mirror software, that will allow you to make a mirror using all of the gnbd_exported devices from all of the machines. However, since you are mirroring the devices, instead of concatenating or striping them. the shared device will not be the size of the sum of all the devices. However this does mean that there is no gnbd server machine that is a single point of failure. This code is already available in cvs, but I'm not sure that there are any rpms yet. > I think that I have a wrong idea of what GFS is, can you explain it to me > please? is only a file system ? is a way to share storage with distributed > lock? GFS is simply a filesystem that uses locking to allow you to access shared storage from multiple machines at the same time. GFS does not give you shared storage. To use GFS in a cluster you must either have a SAN, or you must use software like GNBD or iSCSI. > > > > ----- Original Message ----- > From: "Benjamin Marzinski" > To: "linux clustering" > Sent: Monday, November 06, 2006 11:01 PM > Subject: Re: [Linux-cluster] gfs mounted but not working > > > > On Mon, Nov 06, 2006 at 01:38:40AM -0300, romero.cl at gmail.com wrote: > > > Hi. > > > > > > Now i'm trying this and it works! for now... > > > > > > Two nodes: node3 & node4 > > > node4 export his /dev/sdb2 with gnbd_export as "node4_sdb2" > > > node3 import node4's /dev/sdb2 with gnbd_import (new > /dev/gnbd/node4_sdb2) > > > > > > on node3: gfs_mkfs -p lock_dlm -t node3:node3_gfs -j 4 > /dev/gnbd/node4_sdb2 > > > mount -t gfs /dev/gnbd/node4_gfs /users/home > > > > > > on node4: mount -t gfs /dev/sdb2 /users/home > > > > > > and both nodes can read an write ths same files on /users/home!!! 
> > > > > > Now i'm going for this: > > > > > > 4 nodes on a dedicated 3com 1Gbit ethernet switch: > > > > > > node2 exporting with gnbd_export /dev/sdb2 as "node2_sdb2" > > > node3 exporting with gnbd_export /dev/sdb2 as "node3_sdb2" > > > node4 exporting with gnbd_export /dev/sdb2 as "node4_sdb2" > > > > > > node1 (main) will import all "nodeX_sdb2" and create a logical volume > named > > > "main_lv" including: > > > > > > /dev/sdb2 (his own) > > > /dev/gnbd/node2_sdb2 > > > /dev/gnbd/node3_sdb2 > > > /dev/gnbd/node4_sdb2 > > > > > > Next I will try to export the new big logical volume with "gnbd_export" > and > > > then do gnbd_import on each node. > > > With that each node will see "main_lv", then mount it on /users/home as > gfs > > > and get a big shared filesystem to work toghether. > > > > > > Is this the correct way to do the work??? possibly a deadlock??? > > > > Sorry. This will not work. There are a couple of problems. > > > > 1. A node shouldn't ever gnbd import a device it has exported. This can > > cause memory deadlock. When memory pressure is high nodes try to write > > their buffers to disk. Once the buffer is written to disk, the node can > drop > > it from memory, reducing memory pressure. When you do this over gnbd, for > every > > buffer that you write out on the client, a new buffer request come into > the gnbd > > server. If you import a device you have exported (even indirectly through > the > > logical volume on node1 in this setup) that new request just comes back to > you. > > This means that you suddenly double your buffers in memory, just when > memory was > > running low. > > > > The solution is to only access the local device directly, but never > through > > gnbd. Oh, just a note, if you are planning on accessing the local device > > directly, you must not use the "-c" option when you are exporting the > device. > > This will eventually lead to corruption. The "-c" option is only for > dedicated > > gnbd servers. > > > > 2. Theoretically, you could just have every node export the devices to > every > > other node, and then build a logical volume on top of all the devices on > each > > node, but you should not do this. It totally destroys the benefit of > having > > a cluster. Since your GFS filesystem would then depend on having access to > the > > block devices of every machine, if ANY machine in your cluster went down, > the > > whole cluster would crash, because a piece of your filesystem would just > > disappear. > > > > > > Without shared storage, your gnbd server will be a single point of > failure. > > The most common way that people set up gnbd is with one dedicated gnbd > server > > machine, that is only used to serve gnbd blocks, so that it is unlikely to > > crash. > > > > > Sorry if my english isn't very good ;) > > > > > > ----- Original Message ----- > > > From: "Kevin Anderson" > > > To: "linux clustering" > > > Sent: Sunday, November 05, 2006 10:12 PM > > > Subject: Re: [Linux-cluster] gfs mounted but not working > > > > > > > > > > On 11/5/06, *romero.cl at gmail.com * > > > > > wrote: > > > > > > > > > > Hi. > > > > > > > > > > I'm trying your method, but still have a problem: > > > > > > > > > > Note: /dev/db2/ is a local partition on my second SCSI hard > drive > > > > > (no RAID) > > > > > runing on HP ProLiant. > > > > > > > > > GFS requires that all storage is equally accessible by all nodes in > the > > > > cluster. Your other nodes have no path to the storage you set up so > it > > > > is impossible for them to share the data. 
> > > > > > > > Kevin > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From stefan at x-cellent.com Tue Nov 7 07:19:33 2006 From: stefan at x-cellent.com (Stefan Majer) Date: Tue, 7 Nov 2006 08:19:33 +0100 (CET) Subject: [Linux-cluster] dm_multipath and failover In-Reply-To: <454FEB7B.6080103@arts.usyd.edu.au> References: <454FEB7B.6080103@arts.usyd.edu.au> Message-ID: <61816.212.34.68.17.1162883973.squirrel@extern.x-cellent.com> Hi, > > Not strictly a clustering question, but > > I have for RHEL CS servers connected to an EMC CX300. QLA2432 fibre > channel controllers. > > The cards have two FC channels. I've used them to run a redundant > fabric - there are two fibre channel switches, that each connect to one > of the EMC's FC ports - so there are two paths to the storage. A link or > switch can fail, but the hosts still can see the storage. > > The redundancy works fine - I can pull a fibre out and the hosts can > see the storage fine. Syslog gets spammed with lots of scsi errors, but > dm_multipath does its stuff and the server keeps running. > > However if I plug that fibre back in, the path doesn't come back. So > far the only way to get the path back i've found is to reboot the server. > Is there any way to tell the controller the path is back ?. please check if "service multipathd" returns "running" the multipathd checks the availability of your paths and reclaims them as active after a failure. > I have noticed that if i'm in this state, rebooting the McData FC > switch will also restore the channel - it's like the switch sends some > sort of re-initialise channel signal as it reboots. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Stefan Majer From jaap at sara.nl Tue Nov 7 07:22:31 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Tue, 7 Nov 2006 08:22:31 +0100 Subject: [Linux-cluster] order of GFS nodes in a cluster In-Reply-To: <1162851395.4518.738.camel@rei.boston.devel.redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB01693D8D@douwes.ka.sara.nl> > > > The ordering on different nodes does not matter, as long as they all > line up. Ex: [1 2 3] [2 1 3] [3 2 1] for the same service > group is all > fine. Thanks Lon, so we don't have to worry about that. > > -- Lon > Best Regards, -Jaap From dbrieck at gmail.com Tue Nov 7 16:31:07 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Tue, 7 Nov 2006 11:31:07 -0500 Subject: [Linux-cluster] Re: Storage Problems, need some advice In-Reply-To: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> References: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> Message-ID: <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> On 11/6/06, David Brieck Jr. wrote: > Some of you may have noticed the problems I've been posting to the > mailing with my storage. 
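Tying back to the dm_multipath fail-back question above: a quick check sequence after re-seating a failed fibre path might look like the following. This is only a sketch for device-mapper-multipath on RHEL4; "host1" is a placeholder for whichever SCSI host the re-plugged HBA port sits on.

service multipathd status                        # multipathd must be running to reinstate failed paths
multipath -ll                                    # each path should show an active/failed state
echo "- - -" > /sys/class/scsi_host/host1/scan   # ask the HBA to rescan its targets/LUNs
multipath                                        # rebuild the maps with any rediscovered paths
multipath -ll                                    # confirm the path is back to active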
For those who haven't here a quick rundown of > my setup: > > GNBD Servers w/ clustered SCSI enclosure > GNBD Clients w/ multipath > CLVM > GFS and DLM > > I've been having crashes, lockups and poor performance with my setup > almost since it was first setup. I've tried getting rid of multipath > and went down to only one GNBD server but I'm still suffering from > random crashes under moderate to high I/O. I'm not getting any errors > in the logs at all that can help, not even kernel errors any more. > > So I've simpled to GNBD Server -> GNBD Clients -> CLVM -> GFS which is > exactly what is specified in the manual: > http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/ch-gnbd.html > and it still seems to act very flaky. My GNBD Server(s) don't crash, > it's always the client nodes that end up crashing. The only connecting > factor seems to be I/O on the GNBD. > > So I have to find a solution that's going to work and I don't think > anything with GNBD is going to work regardless of hardware. So that > pretty much only leaves FC or iSCSI. I don't want to go with FC due to > cost considerations and I already have a GB network setup just for > storage (GNBD) and cluster functions and another for regular LAN > traffic. > > So if I go with iSCSI then I can completely get rid of the GNBD aspect > and possibly CLVM as well. However, I am concerned about performance, > especially since I'd want to put MySQL data on it. > > The Promise 500i > (http://www.promise.com/product/product_detail_eng.asp?segment=VTrak&product_id=149) > seems like it nice solution (much cheaper than the Dell CX300 I would > also consider) but will it deliver performance wise? > > Would you want an enclosure with SCSI drives or SATA drives? And if > you went with SATA drives, would you go with the 10k 1.5Gb drives or > the 7200 3Gb drives for maximum performance? > > If you wanted SATA drives is the Promise the best solution? If you > wanted iSCSI with SCSI drives what would you go with? > > Thanks for the help, I really need to get this storage situation under control. > After doing much more research I came across the DS300 by IBM. It uses SCSI drives, is fully redundant, does iSCSI and doesn't cost an arm and a leg (just an arm). My question is, their site says linux clustering isn't supported, but does it have to be? Doesn't iSCSI let you do the same thing GNBD does? Also, I talked to someone on their chat who said I could use any U320 drive with it, basically I could reuse the drives I already have and just not use my old enclosure. Does that sound right? Any reason I couldn't do that other than loosing all my data? Anyone using a DS300? Seems like with 15k drives it would be pretty darn fast. From sandra-llistes at fib.upc.edu Tue Nov 7 16:58:59 2006 From: sandra-llistes at fib.upc.edu (sandra-llistes) Date: Tue, 07 Nov 2006 17:58:59 +0100 Subject: [Linux-cluster] GFS and samba problem in Fedora, again In-Reply-To: <4533B9C2.8080906@redhat.com> References: <4523A637.1060706@fib.upc.edu> <4526B4ED.9050907@redhat.com> <452B4F39.60906@fib.upc.edu> <452BFC6D.60902@redhat.com> <452D0ED6.9040605@fib.upc.edu> <452D7040.8090704@redhat.com> <4533A6EA.6030503@fib.upc.edu> <4533B9C2.8080906@redhat.com> Message-ID: <4550BB53.9070401@fib.upc.edu> Hi Ahbi, I've proved to install kernel 2.6.10 and GFS from cvs on Fedora 5 because if this software worked for RHES4 It will work also for Fedora. I have had to recompile the kernel and software, and have some problems because Fedora has gcc 4 and Red Hat gcc 3.4.6. 
Now I succesfully installed it but I'm getting odd errors with ccsd: Nov 7 16:44:39 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 120 seconds. Nov 7 16:45:09 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 150 seconds. Nov 7 16:45:40 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 180 seconds. Nov 7 16:46:10 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 210 seconds. Nov 7 16:46:40 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 240 seconds. Nov 7 16:47:10 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 270 seconds. Nov 7 16:47:40 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 300 seconds. Nov 7 16:48:10 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 330 seconds. Nov 7 16:48:40 nocilla ccsd[15826]: Unable to connect to cluster infrastructure after 360 seconds. cman starts ok and quorum is regained, but ccsd fails. I tried: [root ~]# ccs_test connect ccs_connect failed: Connection refused [root ~]# ccs_test connect force Force is set. Connect successful. Connection descriptor = 60 Also I tried to start ccsd with -4 and -I parameters with the same results: #ccsd -4 -I 127.0.0.0.1 What could the problem be? Thanks, Sandra Abhijith Das wrote: > Hi Sandra, > >> Hi Abhi, >> >> Your mail astonished me. The only difference between your environment >> and our is that you've RHEL4 and we've Fedora 5 with GFS. > > I'll get FC5 installed on a test cluster and try it out. > >> I'm sorry about this but I have to insist. Are you completelly sure >> that your samba access was simoultaneous? because if you probe one >> client, and then another it isn't the same. >> I found people that complain about the same problem we have: >> http://www.redhat.com/archives/linux-cluster/2004-November/msg00065.html >> http://www.centos.org/modules/newbb/viewtopic.php?topic_id=4997 > > Yes, the access was simultaneous. If you have a specific test, (specific > types of files or windows programs) I can try that too. I'm not denying > that you're seeing problems with GFS+samba :-). If I can reproduce your > problem, I can chase it down and we'll know why this is happening. > >> Well, I had finally compiled GFS2 but I was uncapable to start cman >> daemons. >> [root at server2 ~]# /etc/init.d/cman start >> Starting cluster: >> Loading modules... done >> Mounting configfs... done >> Starting ccsd... done >> Starting cman... failed >> /usr/sbin/cman_tool: aisexec daemon didn't start >> >> And I obtain this when I try: # strace /usr/sbin/cman_tool -t 120 -w join >> .... >> connect(5, {sa_family=AF_FILE, path="/var/run/cman_admin"}, 110) = -1 >> ENOENT (No such file or directory) >> close(5) >> write(2, "/usr/sbin/cman_tool: ", 21/usr/sbin/cman_tool: ) = 21 >> write(2, "aisexec daemon didn\'t start\n", 28aisexec daemon didn't start >> ) = 28 >> exit_group(1) >> >> So I'm stalled by that way. > > Looks like the openais component of the cluster is missing. The cluster > architecture has changed quite a bit for GFS2. Please refer to this for > help installing GFS2: > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/usage.txt?rev=1.35&content-type=text/x-cvsweb-markup&cvsroot=cluster > > >> I'm downloading RHEL 30-days evaluating version and I'm also trying >> samba 3.0.23c compiling with cluster support. >> I will tell you something as soon as I finish all test. 
>> Regards, >> >> Sandra > > Thanks, > --Abhi From DylanV at semaphore.com Tue Nov 7 17:47:45 2006 From: DylanV at semaphore.com (Dylan Vanderhoof) Date: Tue, 7 Nov 2006 09:47:45 -0800 Subject: [Linux-cluster] Ls and globbing taking ridiculously long on GFS Message-ID: I understand that a df, or ls -l that requires statting files should be slow. However, I'm seeing ridiculous performance of just an ls, or anything doing file globbing in directory reads. For example: dylanv at iscsi0 /srv/rancid/logs $ time ls [snip] real 0m4.434s user 0m0.000s sys 0m0.000s dylanv at iscsi0 /srv/rancid/logs $ ls | wc -l 412 dylanv at iscsi0 /srv/rancid/logs $ I consider 400 files a fairly small directory. 4.5 seconds to do an ls is pretty bad. Its worse in a large directory, although not porportionally to the amount of files: dylanv at iscsi0 /srv/flows/9/13 $ time ls [snip] real 0m31.693s user 0m0.070s sys 0m0.430s dylanv at iscsi0 /srv/flows/9/13 $ ls | wc -l 14274 dylanv at iscsi0 /srv/flows/9/13 $ And an even larger dir: dylanv at iscsi0 /srv/flows/6/7 $ time ls [snip] real 1m31.849s user 0m0.520s sys 0m1.450s dylanv at iscsi0 /srv/flows/6/7 $ ls | wc -l 42462 dylanv at iscsi0 /srv/flows/6/7 $ Any ideas for why this might be? Its clearly blocking on IO somewhere, but I'm not sure where. Both kernel and userland are 32-bit for now. Caching doesn't really appear to make a whole lot of difference. (A subsequent ls on the large directory above takes 1m24.274s) What concerns me even more is that I'm not even using DLM yet! This is with lock_nolock. Any suggestions or ideas are much appreciated. =) Thanks, Dylan From wcheng at redhat.com Tue Nov 7 19:00:15 2006 From: wcheng at redhat.com (Wendy Cheng) Date: Tue, 07 Nov 2006 14:00:15 -0500 Subject: [Linux-cluster] Ls and globbing taking ridiculously long on GFS In-Reply-To: References: Message-ID: <4550D7BF.9010503@redhat.com> Dylan Vanderhoof wrote: > I understand that a df, or ls -l that requires statting files should be > slow. However, I'm seeing ridiculous performance of just an ls, or > anything doing file globbing in directory reads. > My guess is that you have lots of small writes before the "ls" that generates the disk flushing. Could you pass your kernel and gfs versions ? Mind running oprofile on your node (I can pass the instructions if you like) so I can take a look ? -- Wendy From DylanV at semaphore.com Tue Nov 7 19:11:42 2006 From: DylanV at semaphore.com (Dylan Vanderhoof) Date: Tue, 7 Nov 2006 11:11:42 -0800 Subject: [Linux-cluster] Ls and globbing taking ridiculously long on GFS Message-ID: Certainly. dylanv at iscsi0 /var/www/netresponse/lib/NetResponse/Controller $ gfs_tool version gfs_tool 1.03.00 (built Oct 17 2006 15:10:45) Copyright (C) Red Hat, Inc. 2004-2005 All rights reserved. dylanv at iscsi0 /var/www/netresponse/lib/NetResponse/Controller $ uname -a Linux iscsi0 2.6.17-gentoo-r8 #8 SMP Fri Sep 15 13:57:05 PDT 2006 i686 Intel(R) Xeon(TM) CPU 3.00GHz GNU/Linux As long as its not too disruptive, I can definitely run oprofile. Instructions would be handy however. =) -Dylan > -----Original Message----- > From: Wendy Cheng [mailto:wcheng at redhat.com] > Sent: Tuesday, November 07, 2006 11:00 AM > To: linux clustering > Subject: Re: [Linux-cluster] Ls and globbing taking > ridiculously long on GFS > > > Dylan Vanderhoof wrote: > > I understand that a df, or ls -l that requires statting > files should be > > slow. However, I'm seeing ridiculous performance of just an ls, or > > anything doing file globbing in directory reads. 
> > > My guess is that you have lots of small writes before the "ls" that > generates the disk flushing. Could you pass your kernel and > gfs versions > ? Mind running oprofile on your node (I can pass the > instructions if you > like) so I can take a look ? > > -- Wendy > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From rpeterso at redhat.com Tue Nov 7 19:46:36 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 07 Nov 2006 13:46:36 -0600 Subject: [Linux-cluster] GFS+ACL In-Reply-To: <4540B9B9.40107@infor.pl> References: <4540B9B9.40107@infor.pl> Message-ID: <4550E29C.8010701@redhat.com> Marek Dabrowski wrote: > Hello > > I have 2 node cluster with GFS. Some strange problem occur - I set acl > rigths on some directorires. After about a few minutes, acls whitch I > set was deleted and appear ald acls. I dont know how this happened? > Any idea? > > Regards and sorry my english > Marek Hi Marek, What version of the cluster code were you running? RHEL3? RHEL4? CVS HEAD? CVS STABLE? Other? This sounds a lot like this bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=210369 which has been fixed in HEAD a while back. I haven't tried it in RHEL4 or other versions though. Regards, Bob Peterson Red Hat Cluster Suite From aberoham at gmail.com Tue Nov 7 20:29:59 2006 From: aberoham at gmail.com (aberoham at gmail.com) Date: Tue, 7 Nov 2006 12:29:59 -0800 Subject: [Linux-cluster] rgmanager crash, deadlock? Message-ID: <3bdb07840611071229yd9cec69u5ebefc84c8f88c92@mail.gmail.com> Last night one of my five cluster nodes suffered a hardware failure (memory, cpu?). The other nodes properly fenced the failed machine, but no matter what clusvcadm command I ran, I could not get the other cluster members to start, stop or disable the cluster resource group/service that had been running on the failed node. (the resource group/service that was running on the failed node includes an EXT3 fs, an IP address, a rsyncd and a smbd init script) The "clusvcadm -d [service]" command would just hang for minutes and not return. "clustat" intially reported the rg/service in an unknown state, then stopped reporting rgmanager status and only showed cman status. The cluster remained quorate the entire time. Resource groups/services on non-failed nodes continued to run, but no matter what I tried I could not get rgmanager status on any node. I had to reset the entire cluster to get things back to normal. (This is a heavily used operational system so I didn't have time to do further debugging.) My logs don't show any rgmanger related error messages, only fencing status: Nov 6 20:24:37 bamf02 kernel: CMAN: removing node bamf03 from the cluster : Missed too many heartbeats Nov 6 20:24:38 bamf02 fenced[5913]: fencing deferred to bamf01 --- Nov 6 20:24:37 bamf01 kernel: CMAN: node bamf03 has been removed from the cluster : Missed too many heartbeats Nov 6 20:24:38 bamf01 fenced[5756]: bamf03 not a cluster member after 0 sec post_fail_delay Nov 6 20:24:38 bamf01 fenced[5756]: fencing node "bamf03" Nov 6 20:24:46 bamf01 fenced[5756]: fence "bamf03" success Nov 6 20:30:36 bamf01 sshd(pam_unix)[27244]: session opened for user root by root(uid=0) Nov 6 20:36:29 bamf01 kernel: CMAN: node bamf03 rejoining Nov 6 20:42:55 bamf01 shutdown: shutting down for system reboot --- I'm running RHEL4U4 (cman 1.0.11-0, cman-kernel-smp 2.6.9-45.5, dlm 1.0.1-1, magma 1.0.6-0 rgmanager 1.9.53) on x86_64 hardware. 
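On the oprofile instructions requested in the "ls" performance thread above, a minimal capture around the slow directory listing might look like the sketch below. The --vmlinux path assumes the matching kernel-debuginfo package is installed; without it, "opcontrol --no-vmlinux" still gives a coarser, module-level profile.

opcontrol --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux
opcontrol --start
time ls /srv/flows/9/13          # reproduce the slow operation
opcontrol --stop
opreport -l | head -40           # top symbols by sample count
opcontrol --shutdown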
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- Nov 6 20:17:48 bamf03 clurgmgrd: [4170]: Executing /etc/init.d/rsyncd-cougar status Nov 6 20:17:51 bamf03 sshd(pam_unix)[10896]: session opened for user root by (uid=0) Nov 6 20:18:18 bamf03 clurgmgrd: [4170]: Executing /etc/init.d/rsyncd-cougar status Nov 6 20:19:18 bamf03 last message repeated 2 times Nov 6 20:20:48 bamf03 last message repeated 3 times Nov 6 20:21:18 bamf03 clurgmgrd: [4170]: Executing /etc/init.d/rsyncd-cougar status Nov 6 20:21:34 bamf03 kernel: Bad page state at prep_new_page (in process 'smbd', page 00000101fe80fec0) Nov 6 20:21:34 bamf03 kernel: flags:0x05001078 mapping:000001010c7f75e8 mapcount:0 count:2 Nov 6 20:21:34 bamf03 kernel: Backtrace: Nov 6 20:21:34 bamf03 kernel: Nov 6 20:21:34 bamf03 kernel: Call Trace:{bad_page+112} {buffered_rmqueue+520} Nov 6 20:21:34 bamf03 kernel: {sock_sendmsg+271} {__alloc_pages+211} Nov 6 20:21:34 bamf03 kernel: {__get_free_pages+11} {__pollwait+58} Nov 6 20:21:34 bamf03 kernel: {datagram_poll+39} {datagram_poll+0} Nov 6 20:21:34 bamf03 kernel: {datagram_poll+0} {do_select+656} Nov 6 20:21:34 bamf03 kernel: {__pollwait+0} {sys_select+820} Nov 6 20:21:34 bamf03 kernel: {dnotify_parent+34} {system_call+126} Nov 6 20:21:34 bamf03 kernel: Nov 6 20:21:34 bamf03 kernel: Trying to fix it up, but a reboot is needed Nov 6 20:21:48 bamf03 clurgmgrd: [4170]: Executing /etc/init.d/rsyncd-cougar status Nov 6 20:22:08 bamf03 kernel: Bad page state at prep_new_page (in process 'ip.sh', page 00000101fe80c730) Nov 6 20:22:08 bamf03 kernel: flags:0x0500102c mapping:0000010079d9a3e0 mapcount:0 count:2 Nov 6 20:22:08 bamf03 kernel: Backtrace: Nov 6 20:22:08 bamf03 kernel: Nov 6 20:22:08 bamf03 kernel: Call Trace:{bad_page+112} {buffered_rmqueue+520} Nov 6 20:22:08 bamf03 kernel: {__alloc_pages+211} {do_no_page+651} Nov 6 20:22:08 bamf03 kernel: {__generic_file_aio_read+385} {handle_mm_fault+373} Nov 6 20:22:08 bamf03 kernel: {generic_file_aio_read+48} {do_sync_read+173} Nov 6 20:22:08 bamf03 kernel: {dput+56} {do_page_fault+518} Nov 6 20:22:08 bamf03 kernel: {autoremove_wake_function+0} {dnotify_parent+34} Nov 6 20:22:08 bamf03 kernel: {vfs_read+248} {error_exit+0} Nov 6 20:22:08 bamf03 kernel: Nov 6 20:22:08 bamf03 kernel: Trying to fix it up, but a reboot is needed Nov 6 20:22:16 bamf03 kernel: Bad page state at prep_new_page (in process 'smbd', page 00000101fe816ec0) Nov 6 20:22:16 bamf03 kernel: flags:0x05001028 mapping:000001018b7eea30 mapcount:0 count:2 Nov 6 20:22:16 bamf03 kernel: Backtrace: Nov 6 20:22:16 bamf03 kernel: Nov 6 20:22:16 bamf03 kernel: Call Trace:{bad_page+112} {buffered_rmqueue+520} Nov 6 20:22:16 bamf03 kernel: {sock_sendmsg+271} {__alloc_pages+211} Nov 6 20:22:16 bamf03 kernel: {__get_free_pages+11} {__pollwait+58} Nov 6 20:22:16 bamf03 kernel: {tcp_poll+44} {do_select+656} Nov 6 20:22:16 bamf03 kernel: {__pollwait+0} {sys_select+820} Nov 6 20:22:16 bamf03 kernel: {dnotify_parent+34} {system_call+126} Nov 6 20:22:16 bamf03 kernel: Nov 6 20:22:16 bamf03 kernel: Trying to fix it up, but a reboot is needed Nov 6 20:22:18 bamf03 clurgmgrd: [4170]: Executing /etc/init.d/rsyncd-cougar status Nov 6 20:22:38 bamf03 clurgmgrd[4170]: Stopping service cougar-compout Nov 6 20:22:38 bamf03 clurgmgrd: [4170]: Executing /etc/init.d/rsyncd-cougar stop Nov 6 20:22:38 bamf03 clurgmgrd: [4170]: Removing IPv4 address 192.168.10.22 from bond0 Nov 6 20:22:41 bamf03 clurgmgrd: [4170]: Stopping Samba instance "cougar" Nov 6 
20:22:41 bamf03 nmbd[30156]: [2006/11/06 20:22:41, 0] nmbd/nmbd.c:terminate(56) Nov 6 20:22:41 bamf03 nmbd[30156]: Got SIGTERM: going down... Nov 6 20:22:41 bamf03 nmbd[30156]: [2006/11/06 20:22:41, 0] libsmb/nmblib.c:send_udp(790) Nov 6 20:22:41 bamf03 nmbd[30156]: Packet send failed to 192.168.255.255(138) ERRNO=Invalid argument Nov 6 20:23:10 bamf03 sshd(pam_unix)[13090]: session opened for user root by root(uid=0) Nov 6 20:24:16 bamf03 sshd(pam_unix)[13146]: session opened for user root by root(uid=0) Nov 6 20:24:36 bamf03 kernel: CMAN: removing node bamf01 from the cluster : Missed too many heartbeats Nov 6 20:24:38 bamf03 kernel: clustat[13184] trap stack segment rip:33512b1c13 rsp:7fbffff840 error:0 Nov 6 21:36:04 bamf03 syslogd 1.4.1: restart. From riaan at obsidian.co.za Tue Nov 7 21:46:57 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Tue, 07 Nov 2006 23:46:57 +0200 Subject: [Linux-cluster] Re: Storage Problems, need some advice In-Reply-To: <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> References: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> Message-ID: <4550FED1.1070809@obsidian.co.za> > After doing much more research I came across the DS300 by IBM. It uses > SCSI drives, is fully redundant, does iSCSI and doesn't cost an arm > and a leg (just an arm). My question is, their site says linux > clustering isn't supported, but does it have to be? Doesn't iSCSI let > you do the same thing GNBD does? > hi David do you have a link to the page with that statement? "linux clustering" is somewhat of an ambiguous term. Within the context of Red Hat software (excluding Linux Virtual Server and high-performance computing clusters), it can mean either: a) Cluster Suite (without GFS) - only one node in a cluster accesses the storage at a time. if you fail/switch over, one node unmounts an FS, another one mounts it. b) GFS (which implies/includes Cluster Suite) - multiple nodes accessing the same LUN with a (G)FS on top of it I am not familiar with entry-level iSCSI initiators. I always thought iSCSI is logically like fibre, e.g. multiple hosts in the same raidgroup can concurrently access the same LUN/FS. Perhaps these entry-level iSCSI arrays are more like regular SCSI meaning that they do not support multiple initiators accessing the same LUN behind a target (storage processor). I had a look at EMC cert matrix for the AX100/150 series arrays http://www.emc.com/interoperability/matrices/AX_Series_SupportMatrix.pdf thes entry-level EMC iSCSI arrays also only supports non-clustered Linux. iSCSI will allow you to "do the same thing" as GNDB: GNDB client and server are replaced iSCSI initiator (Linux host) and target (dedicated hardware, e.g. EMC array, or software target - not yet considered production-ready nor included with RHEL). However, if the hardware has an explicit exclusion of Linux clustering, you are stuck, not being able to have two nodes aaccess the storage at the same time.. HTH Riaan > Also, I talked to someone on their chat who said I could use any U320 > drive with it, basically I could reuse the drives I already have and > just not use my old enclosure. Does that sound right? Any reason I > couldn't do that other than loosing all my data? > > Anyone using a DS300? Seems like with 15k drives it would be pretty darn > fast. 
> > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From isplist at logicore.net Tue Nov 7 23:25:03 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 7 Nov 2006 17:25:03 -0600 Subject: [Linux-cluster] LVS, not so fun today... Message-ID: <200611717253.195087@leena> I posted about this before but, no replies so still looking for help. --- Don't know why, maybe I've taken on too many things at once but this is just confusing me now. Here's all the info I can think of as I ask for help once again from the fine folks in this list :). Same problem, all weekend, trying various combinations. Seems simple enough but nothing has worked for me so far. There's something that's just not sinking into my brain here. I've included my lvs.cf at the end of this message. LVS0's real eth0 IP is 192.168.1.52. LVS1's real eth0 IP is 192.168.1.53. CWEB92 is the only real server with the virtual IP of 192.168.1.150 on it's ether0 for testing. It's real IP is 192.168.1.92. When I ping 150, I reach either LVS servers, depending on who's master but the CWEB machine never responds to 192.168.1.150. So, again, I have 192.168.1.150 as a virtual IP on the first NIC of each REAL web server. I have 192.168.1.150 installed on LVS0 as the VIP for the real servers and real servers configured. Nov 6 14:00:49 lb52 pulse[2498]: STARTING PULSE AS MASTER Nov 6 14:00:49 lb52 pulse: pulse startup succeeded Nov 6 14:01:07 lb52 pulse[2498]: partner dead: activating lvs Nov 6 14:01:07 lb52 lvs[2522]: starting virtual service HTTP active: 80 Nov 6 14:01:07 lb52 nanny[2527]: starting LVS client monitor for 192.168.1.150:80 Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb92 running as pid 2527 Nov 6 14:01:07 lb52 nanny[2528]: starting LVS client monitor for 192.168.1.150:80 Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb93 running as pid 2528 Nov 6 14:01:07 lb52 nanny[2529]: starting LVS client monitor for 192.168.1.150:80 Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb94 running as pid 2529 Nov 6 14:01:12 lb52 pulse[2524]: gratuitous lvs arps finished Nov 6 14:01:13 lb52 nanny[2527]: READ to 192.168.1.92:80 timed out Nov 6 14:01:13 lb52 nanny[2528]: READ to 192.168.1.93:80 timed out Nov 6 14:01:13 lb52 nanny[2529]: READ to 192.168.1.94:80 timed out Nov 6 14:01:25 lb52 nanny[2528]: READ to 192.168.1.93:80 timed out Nov 6 14:01:25 lb52 nanny[2527]: READ to 192.168.1.92:80 timed out Nov 6 14:01:25 lb52 nanny[2529]: READ to 192.168.1.94:80 timed out serial_no = 88 primary = 192.168.1.52 (Real, physical IP of the first LVS server) service = lvs backup_active = 1 backup = 0.0.0.0 heartbeat = 1 heartbeat_port = 539 keepalive = 6 deadtime = 18 network = direct nat_nmask = 255.255.255.255 debug_level = NONE monitor_links = 0 virtual HTTP { active = 1 address = 192.168.1.150 eth0:1 (IP that LVS responds to, then sends connections to read servers, right?) 
vip_nmask = 255.255.255.0 port = 80 send = "GET / HTTP/1.0rnrn" expect = "HTTP" use_regex = 0 load_monitor = none scheduler = wlc protocol = tcp timeout = 6 reentry = 15 quiesce_server = 0 server cweb92 { address = 192.168.1.92 active = 1 weight = 1 } server cweb93 { address = 192.168.1.93 active = 1 weight = 1 } server cweb94 { address = 192.168.1.94 active = 1 weight = 1 } } From dbrieck at gmail.com Tue Nov 7 23:33:22 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Tue, 7 Nov 2006 18:33:22 -0500 Subject: [Linux-cluster] Re: Storage Problems, need some advice In-Reply-To: <4550FED1.1070809@obsidian.co.za> References: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> <4550FED1.1070809@obsidian.co.za> Message-ID: <8c1094290611071533n3a5bf071j43c6c50cf8272778@mail.gmail.com> On 11/7/06, Riaan van Niekerk wrote: > > After doing much more research I came across the DS300 by IBM. It uses > > SCSI drives, is fully redundant, does iSCSI and doesn't cost an arm > > and a leg (just an arm). My question is, their site says linux > > clustering isn't supported, but does it have to be? Doesn't iSCSI let > > you do the same thing GNBD does? > > > > hi David > > do you have a link to the page with that statement? "linux clustering" > is somewhat of an ambiguous term. Within the context of Red Hat software > (excluding Linux Virtual Server and high-performance computing > clusters), it can mean either: > > a) Cluster Suite (without GFS) - only one node in a cluster accesses the > storage at a time. if you fail/switch over, one node unmounts an FS, > another one mounts it. > > b) GFS (which implies/includes Cluster Suite) - multiple nodes accessing > the same LUN with a (G)FS on top of it > > I am not familiar with entry-level iSCSI initiators. I always thought > iSCSI is logically like fibre, e.g. multiple hosts in the same raidgroup > can concurrently access the same LUN/FS. Perhaps these entry-level iSCSI > arrays are more like regular SCSI meaning that they do not support > multiple initiators accessing the same LUN behind a target (storage > processor). > > I had a look at EMC cert matrix for the AX100/150 series arrays > http://www.emc.com/interoperability/matrices/AX_Series_SupportMatrix.pdf > thes entry-level EMC iSCSI arrays also only supports non-clustered Linux. > > iSCSI will allow you to "do the same thing" as GNDB: > GNDB client and server are replaced iSCSI initiator (Linux host) and > target (dedicated hardware, e.g. EMC array, or software target - not yet > considered production-ready nor included with RHEL). However, if the > hardware has an explicit exclusion of Linux clustering, you are stuck, > not being able to have two nodes aaccess the storage at the same time.. > > HTH > Riaan > > > Also, I talked to someone on their chat who said I could use any U320 > > drive with it, basically I could reuse the drives I already have and > > just not use my old enclosure. Does that sound right? Any reason I > > couldn't do that other than loosing all my data? > > > > Anyone using a DS300? Seems like with 15k drives it would be pretty darn > > fast. 
> > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > Here the PDF I was looking at: http://www-03.ibm.com/servers/storage/disk/ds/pdf/ds300400_interop.pdf It says it supports Microsoft Clusters so I don't know why it wouldn't support linux. I also found somewhere else that said more specifically that it doesn't support RedHat Cluster Manager, but as far as I'm concerned it has nothing to do with that unless you're talking about fencing, in which case it would work for me since I'm using power fencing. Anyone from RedHat mind chiming in? It's not on the hardware compatibility list, but there are many things that work that aren't on that list. From kanderso at redhat.com Wed Nov 8 03:26:58 2006 From: kanderso at redhat.com (Kevin Anderson) Date: Tue, 07 Nov 2006 21:26:58 -0600 Subject: [Linux-cluster] Re: Storage Problems, need some advice In-Reply-To: <8c1094290611071533n3a5bf071j43c6c50cf8272778@mail.gmail.com> References: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> <4550FED1.1070809@obsidian.co.za> <8c1094290611071533n3a5bf071j43c6c50cf8272778@mail.gmail.com> Message-ID: <1162956418.2827.9.camel@localhost.localdomain> We use iSCSI in our development labs for GFS clustering. Basically, GFS requires that the underlying storage be concurrently accessible from all nodes in the cluster. This rules out a shared scsi bus between nodes, but fibre channel, iSCSI and gndb all provide the ability to access the underlying storage concurrently. From the basic descriptions of the DS300, I don't see any reason it wouldn't work very well with GFS. It provides either fibre channel or iSCSI support. Kevin On Tue, 2006-11-07 at 18:33 -0500, David Brieck Jr. wrote: > On 11/7/06, Riaan van Niekerk wrote: > > > After doing much more research I came across the DS300 by IBM. It uses > > > SCSI drives, is fully redundant, does iSCSI and doesn't cost an arm > > > and a leg (just an arm). My question is, their site says linux > > > clustering isn't supported, but does it have to be? Doesn't iSCSI let > > > you do the same thing GNBD does? > > > > > > > hi David > > > > do you have a link to the page with that statement? "linux clustering" > > is somewhat of an ambiguous term. Within the context of Red Hat software > > (excluding Linux Virtual Server and high-performance computing > > clusters), it can mean either: > > > > a) Cluster Suite (without GFS) - only one node in a cluster accesses the > > storage at a time. if you fail/switch over, one node unmounts an FS, > > another one mounts it. > > > > b) GFS (which implies/includes Cluster Suite) - multiple nodes accessing > > the same LUN with a (G)FS on top of it > > > > I am not familiar with entry-level iSCSI initiators. I always thought > > iSCSI is logically like fibre, e.g. multiple hosts in the same raidgroup > > can concurrently access the same LUN/FS. Perhaps these entry-level iSCSI > > arrays are more like regular SCSI meaning that they do not support > > multiple initiators accessing the same LUN behind a target (storage > > processor). > > > > I had a look at EMC cert matrix for the AX100/150 series arrays > > http://www.emc.com/interoperability/matrices/AX_Series_SupportMatrix.pdf > > thes entry-level EMC iSCSI arrays also only supports non-clustered Linux. 
> > > > iSCSI will allow you to "do the same thing" as GNDB: > > GNDB client and server are replaced iSCSI initiator (Linux host) and > > target (dedicated hardware, e.g. EMC array, or software target - not yet > > considered production-ready nor included with RHEL). However, if the > > hardware has an explicit exclusion of Linux clustering, you are stuck, > > not being able to have two nodes aaccess the storage at the same time.. > > > > HTH > > Riaan > > > > > Also, I talked to someone on their chat who said I could use any U320 > > > drive with it, basically I could reuse the drives I already have and > > > just not use my old enclosure. Does that sound right? Any reason I > > > couldn't do that other than loosing all my data? > > > > > > Anyone using a DS300? Seems like with 15k drives it would be pretty darn > > > fast. > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > Here the PDF I was looking at: > http://www-03.ibm.com/servers/storage/disk/ds/pdf/ds300400_interop.pdf > > It says it supports Microsoft Clusters so I don't know why it wouldn't > support linux. I also found somewhere else that said more specifically > that it doesn't support RedHat Cluster Manager, but as far as I'm > concerned it has nothing to do with that unless you're talking about > fencing, in which case it would work for me since I'm using power > fencing. > > Anyone from RedHat mind chiming in? It's not on the hardware > compatibility list, but there are many things that work that aren't on > that list. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dbrieck at gmail.com Wed Nov 8 04:09:52 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Tue, 7 Nov 2006 23:09:52 -0500 Subject: [Linux-cluster] Re: Storage Problems, need some advice In-Reply-To: <1162956418.2827.9.camel@localhost.localdomain> References: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> <4550FED1.1070809@obsidian.co.za> <8c1094290611071533n3a5bf071j43c6c50cf8272778@mail.gmail.com> <1162956418.2827.9.camel@localhost.localdomain> Message-ID: <8c1094290611072009y546a27a0t8c2b94bfc538c405@mail.gmail.com> On 11/7/06, Kevin Anderson wrote: > > We use iSCSI in our development labs for GFS clustering. Basically, GFS > requires that the underlying storage be concurrently accessible from all > nodes in the cluster. This rules out a shared scsi bus between nodes, but > fibre channel, iSCSI and gndb all provide the ability to access the > underlying storage concurrently. From the basic descriptions of the DS300, > I don't see any reason it wouldn't work very well with GFS. It provides > either fibre channel or iSCSI support. > > Kevin > Thanks for your response. If I'm reading what you said correctly ANY iSCSI device would be compatible with GFS? I apologize if this is common knowledge but I've been having problems tracking down some of this info. I'm actually surprised at how little information I've been able to find on the DS300. It seems to be a very good product with SCSI drives that is comparable in price to something similar with SATA drives. 
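For anyone wiring an iSCSI target like this up to GFS, the rough sequence on RHEL4 is: log the initiator into the target, then layer CLVM and GFS on the resulting SCSI device exactly as with any other shared LUN. The sketch below assumes the RHEL4 iscsi-initiator-utils style /etc/iscsi.conf; the portal address, device and volume names are made up for illustration, and the array's own documentation governs discovery and CHAP settings.

# on every node: point the initiator at the target portal (placeholder address),
# e.g. add to /etc/iscsi.conf:   DiscoveryAddress=192.168.2.10
service iscsi start                 # LUNs then show up as ordinary /dev/sd* devices

# with the cluster and clvmd already running, from one node:
pvcreate /dev/sdc
vgcreate -c y vg_san /dev/sdc       # -c y marks the volume group as clustered
lvcreate -L 100G -n lv_data vg_san
gfs_mkfs -p lock_dlm -t mycluster:data -j 8 /dev/vg_san/lv_data

# on every node
mount -t gfs /dev/vg_san/lv_data /mnt/data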
Maybe people just don't consider Big Blue for these things? From bganeshmail at gmail.com Wed Nov 8 04:27:55 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Wed, 8 Nov 2006 09:57:55 +0530 Subject: [Linux-cluster] Redhat Cluster Suite 4 Message-ID: <4e3b15f00611072027u75c20681wb75118aec78447c@mail.gmail.com> Dear All, While reading the documents related to Redhat Cluster Suite Some doc says Quorom Partition and some docs does not say anything about it. Pls give suggstions whether is it neccessary to create quorom partitions on redhat cluster suite 4. If necessary how to create that. Regds B.Ganesh From rpeterso at redhat.com Wed Nov 8 15:02:39 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 08 Nov 2006 09:02:39 -0600 Subject: [Linux-cluster] Redhat Cluster Suite 4 In-Reply-To: <4e3b15f00611072027u75c20681wb75118aec78447c@mail.gmail.com> References: <4e3b15f00611072027u75c20681wb75118aec78447c@mail.gmail.com> Message-ID: <4551F18F.1060100@redhat.com> Ganesh B wrote: > Dear All, > > While reading the documents related to Redhat Cluster Suite Some doc > says Quorom Partition and some docs does not say anything about it. > > Pls give suggstions whether is it neccessary to create quorom > partitions on redhat cluster suite 4. > > If necessary how to create that. > > Regds > B.Ganesh Hi Ganesh, Please see the three questions in the FAQ, starting with: http://sources.redhat.com/cluster/faq.html#quorum Regards, Bob Peterson Red Hat Cluster Suite From adas at redhat.com Wed Nov 8 20:20:11 2006 From: adas at redhat.com (Abhijith Das) Date: Wed, 08 Nov 2006 14:20:11 -0600 Subject: [Linux-cluster] GFS and samba problem in Fedora, again In-Reply-To: <4550BB53.9070401@fib.upc.edu> References: <4523A637.1060706@fib.upc.edu> <4526B4ED.9050907@redhat.com> <452B4F39.60906@fib.upc.edu> <452BFC6D.60902@redhat.com> <452D0ED6.9040605@fib.upc.edu> <452D7040.8090704@redhat.com> <4533A6EA.6030503@fib.upc.edu> <4533B9C2.8080906@redhat.com> <4550BB53.9070401@fib.upc.edu> Message-ID: <45523BFB.30707@redhat.com> sandra-llistes wrote: > Hi Ahbi, > > I've proved to install kernel 2.6.10 and GFS from cvs on Fedora 5 > because if this software worked for RHES4 It will work also for Fedora. > I have had to recompile the kernel and software, and have some > problems because Fedora has gcc 4 and Red Hat gcc 3.4.6. Now I > succesfully installed it but I'm getting odd errors with ccsd: > > Nov 7 16:44:39 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 120 seconds. > Nov 7 16:45:09 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 150 seconds. > Nov 7 16:45:40 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 180 seconds. > Nov 7 16:46:10 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 210 seconds. > Nov 7 16:46:40 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 240 seconds. > Nov 7 16:47:10 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 270 seconds. > Nov 7 16:47:40 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 300 seconds. > Nov 7 16:48:10 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 330 seconds. > Nov 7 16:48:40 nocilla ccsd[15826]: Unable to connect to cluster > infrastructure after 360 seconds. > > cman starts ok and quorum is regained, but ccsd fails. I tried: > [root ~]# ccs_test connect > ccs_connect failed: Connection refused > [root ~]# ccs_test connect force > Force is set. > Connect successful. 
> Connection descriptor = 60 > > Also I tried to start ccsd with -4 and -I parameters with the same > results: > #ccsd -4 -I 127.0.0.0.1 > > What could the problem be? > Thanks, > > Sandra Hi Sandra, I'm assuming you're talking about the RHEL4 branch of CVS. I'm not 100% sure that the cluster-suite and GFS are supported for FC5. (Somebody on this list can confirm). The error however looks like ccsd is unable to talk to cman. At what point do you see this error? Are you using the init-scripts? I'd suggest that you compile ccsd with the DEBUG flag and try starting all the components by hand. Hopefully that'll give us more information. Thanks, --Abhi From ranjtech at gmail.com Thu Nov 9 03:40:16 2006 From: ranjtech at gmail.com (RR) Date: Thu, 9 Nov 2006 13:40:16 +1000 Subject: [Linux-cluster] Quorum disk: Can it be a partition? In-Reply-To: <452AC716.2040207@redhat.com> References: <45211826.1050703@redhat.com> <20061009094721.65AAB27C57@ux-mail.informatics.lk> <20061009115927.C7198@xos037.xos.nl> <452AC716.2040207@redhat.com> Message-ID: On 10/10/06, Robert Peterson wrote: > Yes. You definitely want to use shared storage. > > Regards, > > Bob Peterson Hello, does anyone know if having this quorum disk on an iSCSI SAN which the linux nodes can only access through an iscsi-initiator with non-TOE NICs would cause any significant CPU usage? The machines I have may normally have to deal with high CPU demanding tasks and if the NICs don't offload the TCP/IP checksum computations, then the CPU on the machine will have to be used for it. If the writing to the Quorum is short and sweet then I could have it on the iSCSI SAN. But I guess I do add load if I use an NFS share for it as well, right? comments? From ranjtech at gmail.com Thu Nov 9 06:53:54 2006 From: ranjtech at gmail.com (RR) Date: Thu, 9 Nov 2006 16:53:54 +1000 Subject: [Linux-cluster] safe to remove old pkgs? Message-ID: Hi all, I had initially installed CentOS 4.3 on a machine with the corresponding CSGFS pkgs. Then further went to upgrade the system to CentOS 4.4 with the corresponding CSGFS pkgs but just realised that the old kernel and CSGFS pkgs are still there. Is it safe to remove them? As in when I do, > yum remove kernel-2.6.9-34.0.1.EL kernel-smp-2.6.9-34.0.1.EL and get ============================================================================= Package Arch Version Repository Size ============================================================================= Removing: kernel i686 2.6.9-34.0.1.EL installed 26 M kernel-smp i686 2.6.9-34.0.1.EL installed 28 M Removing for dependencies: GFS-kernel-smp i686 2.6.9-49.1.1.centos4 installed 480 k cman-kernel-smp i686 2.6.9-43.8.3.centos4 installed 326 k dlm-kernel-smp i686 2.6.9-41.7.1.centos4 installed 316 k gnbd-kernel-smp i686 2.6.9-9.31.1.centos4 installed 25 k Transaction Summary ============================================================================= Install 0 Package(s) Update 0 Package(s) Remove 6 Package(s) Total download size: 0 Is this ok [y/N]: Can I safely say "y" here? Thanks \R From bganeshmail at gmail.com Thu Nov 9 10:54:29 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Thu, 9 Nov 2006 16:24:29 +0530 Subject: [Linux-cluster] Redhat cluster suite 4 ccsd service failed Message-ID: <4e3b15f00611090254m12de87eu2c9356cf3e14d76a@mail.gmail.com> Dear All, Installed the necessary rpm which came along with redhat cluster cd. But the ccsd service was not getting started without any error,. How to diagnose the error and sort out the issue. 
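For a ccsd startup failure like this, it usually helps to verify the packages and then bring the pieces up one at a time while watching /var/log/messages. A rough sequence (a sketch; the package and service names are the standard RHEL4 cluster suite ones, and ccs_test is the same check used earlier in this thread):

rpm -q ccs magma magma-plugins cman fence    # a missing magma-plugins is a common culprit (see the reply below)
service ccsd start
service cman start
cat /proc/cluster/status                     # membership and quorum state
cat /proc/cluster/nodes
service fenced start
ccs_test connect                             # should succeed once ccsd can reach cman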
This is the fresh installation.Kindly give some instructions. Installation caried as per the procedure given from Redhat site. Pls help Regds B.Ganesh -------------- next part -------------- An HTML attachment was scrubbed... URL: From jwhiter at redhat.com Thu Nov 9 12:51:58 2006 From: jwhiter at redhat.com (Josef Whiter) Date: Thu, 9 Nov 2006 07:51:58 -0500 Subject: [Linux-cluster] Redhat cluster suite 4 ccsd service failed In-Reply-To: <4e3b15f00611090254m12de87eu2c9356cf3e14d76a@mail.gmail.com> References: <4e3b15f00611090254m12de87eu2c9356cf3e14d76a@mail.gmail.com> Message-ID: <20061109125157.GB24980@korben.rdu.redhat.com> You didnt install magma-plugins, install that package and it should start working. Josef On Thu, Nov 09, 2006 at 04:24:29PM +0530, Ganesh B wrote: > Dear All, > > Installed the necessary rpm which came along with redhat cluster cd. > > But the ccsd service was not getting started without any error,. > > How to diagnose the error and sort out the issue. > > > > > > This is the fresh installation.Kindly give some instructions. > > Installation caried as per the procedure given from Redhat site. > > Pls help > > Regds > B.Ganesh > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Thu Nov 9 14:10:15 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 08:10:15 -0600 Subject: [Linux-cluster] Load balancing Message-ID: <200611981015.199640@leena> I've spent this whole week so far trying to get load balancing working. I've posted the config I'm working with and strangely, no replies. I've tried LVS/Piranha and am now trying keepalived. Nothing seems to work and I'm at a loss. Once I see it working, I'll understand it and in fact, I already understand it, it's just not working. Could be my firewalling, could be something else, I've no idea. I don't know what the rules are but can I offer someone a bit of cash to help me get this working??? I need a master/slave LB front end that handles both web and email connections to a cluster. Or, two LB front ends, one for web and one for email. Mike From dbrieck at gmail.com Thu Nov 9 14:46:00 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 09:46:00 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <200611717253.195087@leena> References: <200611717253.195087@leena> Message-ID: <8c1094290611090646s724a6ba4p9e48189a83351ab1@mail.gmail.com> On 11/7/06, isplist at logicore.net wrote: > I posted about this before but, no replies so still looking for help. > > --- > > Don't know why, maybe I've taken on too many things at once but this is just > confusing me now. Here's all the info I can think of as I ask for help once > again from the fine folks in this list :). > > Same problem, all weekend, trying various combinations. Seems simple enough > but nothing has worked for me so far. There's something that's just not > sinking into my brain here. I've included my lvs.cf at the end of this > message. > > LVS0's real eth0 IP is 192.168.1.52. > LVS1's real eth0 IP is 192.168.1.53. > CWEB92 is the only real server with the virtual IP of 192.168.1.150 on it's > ether0 for testing. It's real IP is 192.168.1.92. > When I ping 150, I reach either LVS servers, depending on who's master but the > CWEB machine never responds to 192.168.1.150. > > So, again, I have 192.168.1.150 as a virtual IP on the first NIC of each REAL > web server. 
I have 192.168.1.150 installed on LVS0 as the VIP for the real > servers and real servers configured. > > Nov 6 14:00:49 lb52 pulse[2498]: STARTING PULSE AS MASTER > Nov 6 14:00:49 lb52 pulse: pulse startup succeeded > Nov 6 14:01:07 lb52 pulse[2498]: partner dead: activating lvs > Nov 6 14:01:07 lb52 lvs[2522]: starting virtual service HTTP active: 80 > Nov 6 14:01:07 lb52 nanny[2527]: starting LVS client monitor for > 192.168.1.150:80 > Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb92 running as pid > 2527 > Nov 6 14:01:07 lb52 nanny[2528]: starting LVS client monitor for > 192.168.1.150:80 > Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb93 running as pid > 2528 > Nov 6 14:01:07 lb52 nanny[2529]: starting LVS client monitor for > 192.168.1.150:80 > Nov 6 14:01:07 lb52 lvs[2522]: create_monitor for HTTP/cweb94 running as pid > 2529 > Nov 6 14:01:12 lb52 pulse[2524]: gratuitous lvs arps finished > Nov 6 14:01:13 lb52 nanny[2527]: READ to 192.168.1.92:80 timed out > Nov 6 14:01:13 lb52 nanny[2528]: READ to 192.168.1.93:80 timed out > Nov 6 14:01:13 lb52 nanny[2529]: READ to 192.168.1.94:80 timed out > Nov 6 14:01:25 lb52 nanny[2528]: READ to 192.168.1.93:80 timed out > Nov 6 14:01:25 lb52 nanny[2527]: READ to 192.168.1.92:80 timed out > Nov 6 14:01:25 lb52 nanny[2529]: READ to 192.168.1.94:80 timed out > > serial_no = 88 > primary = 192.168.1.52 (Real, physical IP of the first LVS server) > service = lvs > backup_active = 1 > backup = 0.0.0.0 > heartbeat = 1 > heartbeat_port = 539 > keepalive = 6 > deadtime = 18 > network = direct > nat_nmask = 255.255.255.255 > debug_level = NONE > monitor_links = 0 > virtual HTTP { > active = 1 > address = 192.168.1.150 eth0:1 > (IP that LVS responds to, then sends connections to read servers, right?) > vip_nmask = 255.255.255.0 > port = 80 > send = "GET / HTTP/1.0rnrn" > expect = "HTTP" > use_regex = 0 > load_monitor = none > scheduler = wlc > protocol = tcp > timeout = 6 > reentry = 15 > quiesce_server = 0 > server cweb92 { > address = 192.168.1.92 > active = 1 > weight = 1 > } > server cweb93 { > address = 192.168.1.93 > active = 1 > weight = 1 > } > server cweb94 { > address = 192.168.1.94 > active = 1 > weight = 1 > } > } > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Can you post the results of "arping 192.168.1.150"? You might need to specify the interface and source IP, see the man page for more info. From riaan at obsidian.co.za Thu Nov 9 15:24:05 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 09 Nov 2006 17:24:05 +0200 Subject: [Linux-cluster] Load balancing In-Reply-To: <200611981015.199640@leena> References: <200611981015.199640@leena> Message-ID: <45534815.4010003@obsidian.co.za> isplist at logicore.net wrote: > I've spent this whole week so far trying to get load balancing working. > I've posted the config I'm working with and strangely, no replies. > I've tried LVS/Piranha and am now trying keepalived. Nothing seems to work and > I'm at a loss. > > Once I see it working, I'll understand it and in fact, I already understand > it, it's just not working. Could be my firewalling, could be something else, > I've no idea. > > I don't know what the rules are but can I offer someone a bit of cash to help > me get this working??? I need a master/slave LB front end that handles both > web and email connections to a cluster. Or, two LB front ends, one for web and > one for email. 
> > Mike hi Mike Have you considered contacting Red Hat for support? To be able to escalate to them, you need at least - one RHEL ES Standard ($799) and - one Red Hat Cluster Suite ($499) subscription (you only need a subscription for the LVS, not the real servers) Or is this more than what you would be willing to spend? For that price, you would get unlimited incident support for your system, for one year, btw. greetings Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From isplist at logicore.net Thu Nov 9 15:38:05 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 09:38:05 -0600 Subject: [Linux-cluster] Load balancing In-Reply-To: <45534815.4010003@obsidian.co.za> Message-ID: <20061199385.853009@leena> Yes, it's more than I'd like to spend but more importantly, it's also something that someone who needs a few bucks can help with. Mike On Thu, 09 Nov 2006 17:24:05 +0200, Riaan van Niekerk wrote: > isplist at logicore.net wrote: > >> I've spent this whole week so far trying to get load balancing working. >> I've posted the config I'm working with and strangely, no replies. >> I've tried LVS/Piranha and am now trying keepalived. Nothing seems to >> work and >> I'm at a loss. >> >> Once I see it working, I'll understand it and in fact, I already >> understand >> it, it's just not working. Could be my firewalling, could be something >> else, >> I've no idea. >> >> I don't know what the rules are but can I offer someone a bit of cash to >> help >> me get this working??? I need a master/slave LB front end that handles >> both >> web and email connections to a cluster. Or, two LB front ends, one for >> web and >> one for email. >> >> Mike >> > hi Mike > > Have you considered contacting Red Hat for support? To be able to > escalate to them, you need at least > - one RHEL ES Standard ($799) and > - one Red Hat Cluster Suite ($499) subscription > (you only need a subscription for the LVS, not the real servers) > > Or is this more than what you would be willing to spend? For that price, > you would get unlimited incident support for your system, for one year, > btw. > > greetings > Riaan From cjk at techma.com Thu Nov 9 16:06:37 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Thu, 9 Nov 2006 11:06:37 -0500 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_Load_balancing?= In-Reply-To: <20061199385.853009@leena> Message-ID: Sorry if this hs been suggested as I've been away and haven't sifted through all the cruft emails I have. If I am understanding this correctly, you have the target machine AND the LVS machines using the same IP address? If so, I don't think this is a correct LVS configuration. In my configs, the only machines that actually have the virtual address as the load balancers. They, in turn (using direct routing) forward the packets to the target nodes. The target nodes in turn need to be configured to accept packets that were not oroginally the target for those pakets. client ^ | | ---->incoming --> LVS1 (.10) (virt .15) ---> | LVS2 (.11) | packets redirected to "real server address" | | |<------------------TGT1 (.12) <--------------| TGT2 (.13) TGT3 (.14) If you are using another routing scheme (I believe there are three) this may not work. But it works for me. Hope this helps. 
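(For illustration, a minimal sketch of the real-server side of that direct-routing setup -- the part that lets a target node accept packets for a VIP it must not answer ARP for. The interface names and the 192.168.1.150 VIP are taken from elsewhere in this thread as assumptions, not a confirmed configuration:)

   # on each real server: bind the VIP to the loopback so the node accepts it
   ifconfig lo:0 192.168.1.150 netmask 255.255.255.255 broadcast 192.168.1.150 up

   # keep the real servers from answering ARP for the VIP (2.6 kernel sysctls)
   echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
   echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
   echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_ignore
   echo 2 > /proc/sys/net/ipv4/conf/eth0/arp_announce

Only the director should ever answer ARP for the VIP; arptables rules or an iptables REDIRECT on the real servers are common alternatives to the sysctl approach.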
Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Thursday, November 09, 2006 10:38 AM To: linux clustering Subject: Re: [Linux-cluster] Load balancing Yes, it's more than I'd like to spend but more importantly, it's also something that someone who needs a few bucks can help with. Mike On Thu, 09 Nov 2006 17:24:05 +0200, Riaan van Niekerk wrote: > isplist at logicore.net wrote: > >> I've spent this whole week so far trying to get load balancing working. >> I've posted the config I'm working with and strangely, no replies. >> I've tried LVS/Piranha and am now trying keepalived. Nothing seems to >> work and I'm at a loss. >> >> Once I see it working, I'll understand it and in fact, I already >> understand it, it's just not working. Could be my firewalling, could >> be something else, I've no idea. >> >> I don't know what the rules are but can I offer someone a bit of cash >> to help me get this working??? I need a master/slave LB front end >> that handles both web and email connections to a cluster. Or, two LB >> front ends, one for web and one for email. >> >> Mike >> > hi Mike > > Have you considered contacting Red Hat for support? To be able to > escalate to them, you need at least > - one RHEL ES Standard ($799) and > - one Red Hat Cluster Suite ($499) subscription (you only need a > subscription for the LVS, not the real servers) > > Or is this more than what you would be willing to spend? For that > price, you would get unlimited incident support for your system, for > one year, btw. > > greetings > Riaan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Thu Nov 9 16:10:50 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 09 Nov 2006 11:10:50 -0500 Subject: [Linux-cluster] Redhat Cluster Suite 4 In-Reply-To: <4e3b15f00611072027u75c20681wb75118aec78447c@mail.gmail.com> References: <4e3b15f00611072027u75c20681wb75118aec78447c@mail.gmail.com> Message-ID: <1163088650.7048.2.camel@rei.boston.devel.redhat.com> On Wed, 2006-11-08 at 09:57 +0530, Ganesh B wrote: > Dear All, > > While reading the documents related to Redhat Cluster Suite Some doc > says Quorom Partition and some docs does not say anything about it. > > Pls give suggstions whether is it neccessary to create quorom > partitions on redhat cluster suite 4. > > If necessary how to create that. Quorum partitions are optional in RHCS4, and were only actually introduced in RHCS4U4. -- Lon From lhh at redhat.com Thu Nov 9 16:12:17 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 09 Nov 2006 11:12:17 -0500 Subject: [Linux-cluster] rgmanager crash, deadlock? In-Reply-To: <3bdb07840611071229yd9cec69u5ebefc84c8f88c92@mail.gmail.com> References: <3bdb07840611071229yd9cec69u5ebefc84c8f88c92@mail.gmail.com> Message-ID: <1163088737.7048.5.camel@rei.boston.devel.redhat.com> On Tue, 2006-11-07 at 12:29 -0800, aberoham at gmail.com wrote: > > Last night one of my five cluster nodes suffered a hardware failure > (memory, cpu?). The other nodes properly fenced the failed machine, > but no matter what clusvcadm command I ran, I could not get the other > cluster members to start, stop or disable the cluster resource > group/service that had been running on the failed node. 
(the resource > group/service that was running on the failed node includes an EXT3 fs, > an IP address, a rsyncd and a smbd init script) > > The "clusvcadm -d [service]" command would just hang for minutes and > not return. "clustat" intially reported the rg/service in an unknown > state, then stopped reporting rgmanager status and only showed cman > status. The cluster remained quorate the entire time. Resource > groups/services on non-failed nodes continued to run, but no matter > what I tried I could not get rgmanager status on any node. > > I had to reset the entire cluster to get things back to normal. (This > is a heavily used operational system so I didn't have time to do > further debugging.) My logs don't show any rgmanger related error > messages, only fencing status: > > Nov 6 20:24:37 bamf02 kernel: CMAN: removing node bamf03 from the > cluster : Missed too many heartbeats > Nov 6 20:24:38 bamf02 fenced[5913]: fencing deferred to bamf01 > --- > Nov 6 20:24:37 bamf01 kernel: CMAN: node bamf03 has been removed from > the cluster : Missed too many heartbeats > Nov 6 20:24:38 bamf01 fenced[5756]: bamf03 not a cluster member after > 0 sec post_fail_delay > Nov 6 20:24:38 bamf01 fenced[5756]: fencing node "bamf03" > Nov 6 20:24:46 bamf01 fenced[5756]: fence "bamf03" success > Nov 6 20:30:36 bamf01 sshd(pam_unix)[27244]: session opened for user > root by root(uid=0) > Nov 6 20:36:29 bamf01 kernel: CMAN: node bamf03 rejoining > Nov 6 20:42:55 bamf01 shutdown: shutting down for system reboot > --- > > I'm running RHEL4U4 (cman 1.0.11-0, cman-kernel-smp 2.6.9-45.5, dlm > 1.0.1-1, magma 1.0.6-0 rgmanager 1.9.53) on x86_64 hardware. cman_tool status ? Did rgmanager crash (service rgmanager status reported it as dead)? Was anything in dmesg indicating a DLM error? -- Lon From lhh at redhat.com Thu Nov 9 16:13:45 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 09 Nov 2006 11:13:45 -0500 Subject: [Linux-cluster] Re: Storage Problems, need some advice In-Reply-To: <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> References: <8c1094290611060825i5125eb31m40935b1083e33f0b@mail.gmail.com> <8c1094290611070831k1c870304r8cffb870ff551613@mail.gmail.com> Message-ID: <1163088825.7048.7.camel@rei.boston.devel.redhat.com> On Tue, 2006-11-07 at 11:31 -0500, David Brieck Jr. wrote: > After doing much more research I came across the DS300 by IBM. It uses > SCSI drives, is fully redundant, does iSCSI and doesn't cost an arm > and a leg (just an arm). My question is, their site says linux > clustering isn't supported, but does it have to be? Doesn't iSCSI let > you do the same thing GNBD does? > > Also, I talked to someone on their chat who said I could use any U320 > drive with it, basically I could reuse the drives I already have and > just not use my old enclosure. Does that sound right? Any reason I > couldn't do that other than loosing all my data? > > Anyone using a DS300? Seems like with 15k drives it would be pretty darn fast. I haven't tried it, but I'm puzzled as to why IBM would say it didn't work with "Linux clustering". -- Lon From lhh at redhat.com Thu Nov 9 16:17:40 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 09 Nov 2006 11:17:40 -0500 Subject: [Linux-cluster] Quorum disk: Can it be a partition? 
In-Reply-To: References: <45211826.1050703@redhat.com> <20061009094721.65AAB27C57@ux-mail.informatics.lk> <20061009115927.C7198@xos037.xos.nl> <452AC716.2040207@redhat.com> Message-ID: <1163089060.7048.12.camel@rei.boston.devel.redhat.com> On Thu, 2006-11-09 at 13:40 +1000, RR wrote: > On 10/10/06, Robert Peterson wrote: > > Yes. You definitely want to use shared storage. > > > > Regards, > > > > Bob Peterson > > Hello, > > does anyone know if having this quorum disk on an iSCSI SAN which the > linux nodes can only access through an iscsi-initiator with non-TOE > NICs would cause any significant CPU usage? Qdisk is normally used to watch network paths and advertise via a non-network channel (i.e. a SAN) about a node's viability in the cluster... Using qdisk over iSCSI (or gnbd) is doable, but you'd have to have it on a private, iSCSI-only network for it to make any sense. (i.e. treat the iSCSI network as a SAN which only has SAN traffic). I don't know the implications of using TOE vs. non-TOE for your configuration. Someone else will have to answer that one. -- Lon From mwill at penguincomputing.com Thu Nov 9 16:28:27 2006 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 9 Nov 2006 08:28:27 -0800 Subject: [Linux-cluster] Load balancing Message-ID: <433093DF7AD7444DA65EFAFE3987879C24545A@jellyfish.highlyscyld.com> Its easy to see if its the firewalling. service iptables stop Test service iptables start -----Original Message----- From: isplist at logicore.net [mailto:isplist at logicore.net] Sent: Thu Nov 09 06:10:29 2006 To: linux-cluster Subject: [Linux-cluster] Load balancing I've spent this whole week so far trying to get load balancing working. I've posted the config I'm working with and strangely, no replies. I've tried LVS/Piranha and am now trying keepalived. Nothing seems to work and I'm at a loss. Once I see it working, I'll understand it and in fact, I already understand it, it's just not working. Could be my firewalling, could be something else, I've no idea. I don't know what the rules are but can I offer someone a bit of cash to help me get this working??? I need a master/slave LB front end that handles both web and email connections to a cluster. Or, two LB front ends, one for web and one for email. Mike -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From isplist at logicore.net Thu Nov 9 16:43:19 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 10:43:19 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611090646s724a6ba4p9e48189a83351ab1@mail.gmail.com> Message-ID: <2006119104319.386555@leena> Hi and thank you for helping. I'll post all I can think of. I have IPTables turned off, SElinux disabled so it's not a firewall issue. I can also post my lvs.cf if that will help. > Can you post the results of "arping 192.168.1.150"? You might need to > specify the interface and source IP, see the man page for more info. Sure, ping form where? All machines on my network see 192.168.1.150 which is the VIP on the load balancer. I'm back to Piranha today. 192.168.1.150 is my VIP pn the load balancer. 192.168.1.53 is the load balancer. 
# ifconfig eth0 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 inet addr:192.168.1.52 Bcast:192.168.1.255 Mask:255.255.255.0 inet6 addr: fe80::220:94ff:fe10:44a5/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:91631 errors:0 dropped:0 overruns:0 frame:0 TX packets:8142 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:6004157 (5.7 MiB) TX bytes:1402617 (1.3 MiB) eth0:1 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 inet addr:192.168.1.150 Bcast:192.168.1.150 Mask:255.255.255.255 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:8 errors:0 dropped:0 overruns:0 frame:0 TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:560 (560.0 b) TX bytes:560 (560.0 b) My real servers are at 192.168.1.93, 93 and 94. When I try a web connection, I can see this in the netstat; # netstat -a|more Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 192.168.1.52:32957 192.168.1.92:http TIME_WAIT tcp 0 0 192.168.1.52:32969 192.168.1.92:http TIME_WAIT tcp 0 0 192.168.1.52:32960 192.168.1.92:http TIME_WAIT tcp 0 0 192.168.1.52:32963 192.168.1.92:http TIME_WAIT tcp 0 0 192.168.1.52:32966 192.168.1.92:http TIME_WAIT tcp 0 0 192.168.1.52:32956 192.168.1.93:http TIME_WAIT tcp 0 0 192.168.1.52:32959 192.168.1.93:http TIME_WAIT tcp 0 0 192.168.1.52:32968 192.168.1.93:http TIME_WAIT tcp 0 0 192.168.1.52:32962 192.168.1.93:http TIME_WAIT tcp 0 0 192.168.1.52:32965 192.168.1.93:http TIME_WAIT tcp 0 0 192.168.1.52:32955 192.168.1.94:http TIME_WAIT tcp 0 0 192.168.1.52:32958 192.168.1.94:http TIME_WAIT tcp 0 0 192.168.1.52:32961 192.168.1.94:http TIME_WAIT tcp 0 0 192.168.1.52:32967 192.168.1.94:http TIME_WAIT tcp 0 0 192.168.1.52:32964 192.168.1.94:http TIME_WAIT However, I have yet to solve this problem; Nov 9 10:41:28 lb52 nanny[3311]: READ to 192.168.1.92:80 timed out Nov 9 10:41:40 lb52 nanny[3313]: READ to 192.168.1.94:80 timed out Nov 9 10:41:40 lb52 nanny[3312]: READ to 192.168.1.93:80 timed out From pbruna at it-linux.cl Thu Nov 9 17:37:35 2006 From: pbruna at it-linux.cl (Patricio Bruna V.) Date: Thu, 09 Nov 2006 14:37:35 -0300 Subject: [Linux-cluster] CS and DRBD In-Reply-To: <1162572794.4518.493.camel@rei.boston.devel.redhat.com> References: <23500065.31162496760586.JavaMail.root@lisa.it-linux.cl> <1162572794.4518.493.camel@rei.boston.devel.redhat.com> Message-ID: <4553675F.7030108@it-linux.cl> Lon Hohberger escribi?: > On Thu, 2006-11-02 at 16:46 -0300, Patricio A. Bruna wrote: > >> Anyone has a working script for use DRBD with Cluster Suite? >> I need to use DRBD, but i dont know how to make it play with CS. >> I now CS can mount partitions, but i dont know if CS can mark a drbd >> device as primary. >> > > I'm not aware of any -- but you're not the first to ask. Could you file > a bugzilla against fc6 / rgmanager? It might be as simple as using an > existing script to do it. > > I got it worked. I used the script /etc/ha.d/resource.d/drbddisk, i had to modified it a little bit so. 
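(A rough sketch of what such a wrapper has to do -- promote on start, demote on stop, report status -- follows. This is not the exact modified script used here; the resource name "r0" and the script path are illustrative assumptions:)

   #!/bin/bash
   # /etc/init.d/drbd-r0 -- hypothetical rgmanager <script> resource wrapper
   RES=r0

   case "$1" in
     start)
       # make this node Primary so a filesystem resource can mount the device
       drbdadm primary $RES
       ;;
     stop)
       # demote so the peer node may take over
       drbdadm secondary $RES
       ;;
     status)
       # succeed only while this node holds the Primary role
       # (the field is "st:" on DRBD 0.7 and "ro:" on later releases;
       # with more than one resource you would match the right device line)
       grep -qE '(st|ro):Primary' /proc/drbd
       ;;
     *)
       echo "usage: $0 {start|stop|status}"
       exit 1
       ;;
   esac

The start/stop/status semantics are what rgmanager expects from a script resource, so a wrapper like this can be referenced from cluster.conf the same way as any other init script.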
From matt at rebelbase.com Thu Nov 9 18:35:34 2006 From: matt at rebelbase.com (Matt Eagleson) Date: Thu, 09 Nov 2006 10:35:34 -0800 Subject: [Linux-cluster] ClusterFS "Generic Error" Message-ID: <455374F6.7000309@rebelbase.com> Hi, I'm having an issue with a few of my NFS on GFS clusters where the service is being failed due to a "generic error." Inspection after the fact reveals nothing obviously wrong with the host that returned the error. The GFS mounts appear fine and there are no other errors in the logs. Any ideas on what I should be looking for? Nov 9 15:32:41 file04 clurgmgrd[4123]: status on clusterfs "data" returned 1 (generic error) Nov 9 15:32:41 file04 clurgmgrd[4123]: Stopping service nfs-vip04 the resources for this cluster look like: --Matt From dbrieck at gmail.com Thu Nov 9 19:46:12 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 14:46:12 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119104319.386555@leena> References: <8c1094290611090646s724a6ba4p9e48189a83351ab1@mail.gmail.com> <2006119104319.386555@leena> Message-ID: <8c1094290611091146t42f7f734x257b63a04299d791@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > Hi and thank you for helping. > > I'll post all I can think of. I have IPTables turned off, SElinux disabled so > it's not a firewall issue. I can also post my lvs.cf if that will help. > > > Can you post the results of "arping 192.168.1.150"? You might need to > > specify the interface and source IP, see the man page for more info. > > Sure, ping form where? All machines on my network see 192.168.1.150 which is > the VIP on the load balancer. I'm back to Piranha today. > Do the ARP ping from anywhere. It might not be related at all to your problem but ARP problems will cause you no end of headaches with direct routing. The output should look something like this: arping -I eth2 -s 10.1.1.100 10.1.1.125 ARPING 10.1.1.125 from 10.1.1.100 eth2 Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.722ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.785ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.685ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.679ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.723ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.626ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.732ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.790ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.690ms Unicast reply from 10.1.1.125 [00:13:72:5D:FA:F1] 0.663ms Sent 9 probes (1 broadcast(s)) Received 10 response(s) If you have different MAC addresses responding then you've got issues. Also, from your Load Balancer, please post the output of "curl -i http://192.168.1.92" and the other real server IPs you have (93 and 94). From isplist at logicore.net Thu Nov 9 20:18:38 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 14:18:38 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611091146t42f7f734x257b63a04299d791@mail.gmail.com> Message-ID: <2006119141838.126321@leena> Hi there, > Do the ARP ping from anywhere. It might not be related at all to your > problem but ARP problems will cause you no end of headaches with > direct routing. 
The output should look something like this: Testing from ANY machine including one of the real web servers works but not from the load balancer, 192.168.1.52 (LB0); #arping 192.168.1.150 ARPING 192.168.1.150 from 192.168.1.56 eth0 Unicast reply from 192.168.1.150 [00:20:94:10:44:A5] 1.193ms Unicast reply from 192.168.1.150 [00:20:94:10:44:A5] 0.666ms > Also, from your Load Balancer, please post the output of "curl -i > http://192.168.1.92" and the other real server IPs you have (93 and > 94). They all output the same, as they should, so I'll just post the one; 192.168.1.52# curl -i http://192.168.1.92/ HTTP/1.1 200 OK Date: Thu, 09 Nov 2006 20:17:03 GMT Server: Apache Last-Modified: Mon, 28 Aug 2006 01:03:47 GMT ETag: "4ef781-1679-7f8be2c0" Accept-Ranges: bytes Content-Length: 5753 Connection: close Content-Type: text/html; charset=UTF-8 Companions.com From dbrieck at gmail.com Thu Nov 9 20:38:09 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 15:38:09 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119141838.126321@leena> References: <8c1094290611091146t42f7f734x257b63a04299d791@mail.gmail.com> <2006119141838.126321@leena> Message-ID: <8c1094290611091238r35f692c1p886c24f2462e7ee6@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > Hi there, > > > Do the ARP ping from anywhere. It might not be related at all to your > > problem but ARP problems will cause you no end of headaches with > > direct routing. The output should look something like this: > > Testing from ANY machine including one of the real web servers works but not > from the load balancer, 192.168.1.52 (LB0); > > #arping 192.168.1.150 > ARPING 192.168.1.150 from 192.168.1.56 eth0 > Unicast reply from 192.168.1.150 [00:20:94:10:44:A5] 1.193ms > Unicast reply from 192.168.1.150 [00:20:94:10:44:A5] 0.666ms > > > Also, from your Load Balancer, please post the output of "curl -i > > http://192.168.1.92" and the other real server IPs you have (93 and > > 94). > > They all output the same, as they should, so I'll just post the one; > > 192.168.1.52# curl -i http://192.168.1.92/ > > HTTP/1.1 200 OK > Date: Thu, 09 Nov 2006 20:17:03 GMT > Server: Apache > Last-Modified: Mon, 28 Aug 2006 01:03:47 GMT > ETag: "4ef781-1679-7f8be2c0" > Accept-Ranges: bytes > Content-Length: 5753 > Connection: close > Content-Type: text/html; charset=UTF-8 > > > > > Companions.com > > > > alink="666699"> >
> > > > > > > I went back and looked at this: eth0 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 inet addr:192.168.1.52 Bcast:192.168.1.255 Mask:255.255.255.0 inet6 addr: fe80::220:94ff:fe10:44a5/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:91631 errors:0 dropped:0 overruns:0 frame:0 TX packets:8142 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:6004157 (5.7 MiB) TX bytes:1402617 (1.3 MiB) eth0:1 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 inet addr:192.168.1.150 Bcast:192.168.1.150 Mask:255.255.255.255 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 and just wanted to make sure you are not setting up eth0:1 manually. It will be added automatically by lvs. From the looks of that output compared to what you have in your lvs config it looks like that is the case. From isplist at logicore.net Thu Nov 9 20:39:59 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 14:39:59 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611091238r35f692c1p886c24f2462e7ee6@mail.gmail.com> Message-ID: <2006119143959.843473@leena> I might have some left over test things here and there, trying to clean it up as I go. I was trying keepalived but went back to LVS. Mike > and just wanted to make sure you are not setting up eth0:1 manually. > It will be added automatically by lvs. From the looks of that output > compared to what you have in your lvs config it looks like that is the > case. From dbrieck at gmail.com Thu Nov 9 20:46:11 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 15:46:11 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119143959.843473@leena> References: <8c1094290611091238r35f692c1p886c24f2462e7ee6@mail.gmail.com> <2006119143959.843473@leena> Message-ID: <8c1094290611091246j20b0c312k3c8731b05f8af7b5@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > I might have some left over test things here and there, trying to clean it up > as I go. I was trying keepalived but went back to LVS. > > Mike > > > > and just wanted to make sure you are not setting up eth0:1 manually. > > It will be added automatically by lvs. From the looks of that output > > compared to what you have in your lvs config it looks like that is the > > case. > > Try removing that interface then starting up pulse. The netmask and gateway looks really off compared to eth0. Pulse will add the interface and then write back with your ifconfig output. From isplist at logicore.net Thu Nov 9 20:52:58 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 14:52:58 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611091246j20b0c312k3c8731b05f8af7b5@mail.gmail.com> Message-ID: <2006119145258.032211@leena> > Try removing that interface then starting up pulse. The netmask and > gateway looks really off compared to eth0. Pulse will add the > interface and then write back with your ifconfig output. 
Ok, stopped pulse; Nothing to remove on interface, restart pulse; # ifconfig eth0 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 inet addr:192.168.1.52 Bcast:192.168.1.255 Mask:255.255.255.0 inet6 addr: fe80::220:94ff:fe10:44a5/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:218526 errors:0 dropped:0 overruns:0 frame:0 TX packets:44604 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:14276764 (13.6 MiB) TX bytes:4034081 (3.8 MiB) eth0:1 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 inet addr:192.168.1.150 Bcast:192.168.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:8 errors:0 dropped:0 overruns:0 frame:0 TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:560 (560.0 b) TX bytes:560 (560.0 b) Pulse adds VIP to interface; # tail -f /var/log/messages Nov 9 14:51:04 lb52 pulse[3487]: STARTING PULSE AS MASTER Nov 9 14:51:04 lb52 pulse: pulse startup succeeded Nov 9 14:51:22 lb52 pulse[3487]: partner dead: activating lvs Nov 9 14:51:22 lb52 lvs[3494]: starting virtual service HTTP active: 80 Nov 9 14:51:22 lb52 nanny[3498]: starting LVS client monitor for 192.168.1.150:80 Nov 9 14:51:22 lb52 lvs[3494]: create_monitor for HTTP/cweb92 running as pid 3498 Nov 9 14:51:22 lb52 nanny[3499]: starting LVS client monitor for 192.168.1.150:80 Nov 9 14:51:22 lb52 lvs[3494]: create_monitor for HTTP/cweb93 running as pid 3499 Nov 9 14:51:22 lb52 nanny[3500]: starting LVS client monitor for 192.168.1.150:80 Nov 9 14:51:22 lb52 lvs[3494]: create_monitor for HTTP/cweb94 running as pid 3500 Nov 9 14:51:27 lb52 pulse[3501]: gratuitous lvs arps finished Nov 9 14:51:28 lb52 nanny[3498]: READ to 192.168.1.92:80 timed out Nov 9 14:51:28 lb52 nanny[3499]: READ to 192.168.1.93:80 timed out Nov 9 14:51:28 lb52 nanny[3500]: READ to 192.168.1.94:80 timed out From dbrieck at gmail.com Thu Nov 9 20:59:46 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 15:59:46 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119145258.032211@leena> References: <8c1094290611091246j20b0c312k3c8731b05f8af7b5@mail.gmail.com> <2006119145258.032211@leena> Message-ID: <8c1094290611091259r38c3e9efm973ad3eec1e41f26@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > > Try removing that interface then starting up pulse. The netmask and > > gateway looks really off compared to eth0. Pulse will add the > > interface and then write back with your ifconfig output. 
> > Ok, stopped pulse; > Nothing to remove on interface, restart pulse; > > # ifconfig > eth0 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 > inet addr:192.168.1.52 Bcast:192.168.1.255 Mask:255.255.255.0 > inet6 addr: fe80::220:94ff:fe10:44a5/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:218526 errors:0 dropped:0 overruns:0 frame:0 > TX packets:44604 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:14276764 (13.6 MiB) TX bytes:4034081 (3.8 MiB) > > eth0:1 Link encap:Ethernet HWaddr 00:20:94:10:44:A5 > inet addr:192.168.1.150 Bcast:192.168.1.255 Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:8 errors:0 dropped:0 overruns:0 frame:0 > TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:560 (560.0 b) TX bytes:560 (560.0 b) > > Pulse adds VIP to interface; > > # tail -f /var/log/messages > > Nov 9 14:51:04 lb52 pulse[3487]: STARTING PULSE AS MASTER > Nov 9 14:51:04 lb52 pulse: pulse startup succeeded > Nov 9 14:51:22 lb52 pulse[3487]: partner dead: activating lvs > Nov 9 14:51:22 lb52 lvs[3494]: starting virtual service HTTP active: 80 > Nov 9 14:51:22 lb52 nanny[3498]: starting LVS client monitor for > 192.168.1.150:80 > Nov 9 14:51:22 lb52 lvs[3494]: create_monitor for HTTP/cweb92 running as pid > 3498 > Nov 9 14:51:22 lb52 nanny[3499]: starting LVS client monitor for > 192.168.1.150:80 > Nov 9 14:51:22 lb52 lvs[3494]: create_monitor for HTTP/cweb93 running as pid > 3499 > Nov 9 14:51:22 lb52 nanny[3500]: starting LVS client monitor for > 192.168.1.150:80 > Nov 9 14:51:22 lb52 lvs[3494]: create_monitor for HTTP/cweb94 running as pid > 3500 > Nov 9 14:51:27 lb52 pulse[3501]: gratuitous lvs arps finished > Nov 9 14:51:28 lb52 nanny[3498]: READ to 192.168.1.92:80 timed out > Nov 9 14:51:28 lb52 nanny[3499]: READ to 192.168.1.93:80 timed out > Nov 9 14:51:28 lb52 nanny[3500]: READ to 192.168.1.94:80 timed out > > Here's another one that might help: Your LVS reads: send = "GET / HTTP/1.0rnrn" It should be (at least one mine it is): send = "GET / HTTP/1.0\r\n\r\n" Could be a copy/paste problem but that might be your problem in addition to the interface mentioned before. From isplist at logicore.net Thu Nov 9 21:11:59 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 15:11:59 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611091259r38c3e9efm973ad3eec1e41f26@mail.gmail.com> Message-ID: <2006119151159.821975@leena> > Here's another one that might help: > > Your LVS reads: > send = "GET / HTTP/1.0rnrn" > > It should be (at least one mine it is): > send = "GET / HTTP/1.0\r\n\r\n" I don't see GET anywhere? # curl -i http://192.168.1.92/|more % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 5753 100 5753 0 0 225k 0 --:--:-- --:--:-- --:--:-- 700k HTTP/1.1 200 OK Date: Thu, 09 Nov 2006 21:07:02 GMT Server: Apache Last-Modified: Mon, 28 Aug 2006 01:03:47 GMT ETag: "4ef781-1679-7f8be2c0" Accept-Ranges: bytes Content-Length: 5753 Connection: close Content-Type: text/html; charset=UTF-8 > Could be a copy/paste problem but that might be your problem in > addition to the interface mentioned before. Did I miss something? You mentioned that the VIP might be getting placed on the interface at boot. 
Doing a test, I saw it removed when I shut down pulse then pulse put it back in when it started. Is that not how it's supposed to work? Mike From dbrieck at gmail.com Thu Nov 9 21:21:34 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 16:21:34 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119151159.821975@leena> References: <8c1094290611091259r38c3e9efm973ad3eec1e41f26@mail.gmail.com> <2006119151159.821975@leena> Message-ID: <8c1094290611091321x16991139g1a34aea8370c499d@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > > Here's another one that might help: > > > > Your LVS reads: > > send = "GET / HTTP/1.0rnrn" > > > > It should be (at least one mine it is): > > send = "GET / HTTP/1.0\r\n\r\n" > > I don't see GET anywhere? > It's just referring to the header being sent to the real server. That is standard for all GET connections, posts would be POST. Check out a website with this tool and you'll see what I mean: http://www.rexswain.com/httpview.html send = "GET / HTTP/1.0\r\n\r\n" expect = "HTTP" is what that section of your lvs.cf file should be for a virtual webserver. From isplist at logicore.net Thu Nov 9 21:25:55 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 15:25:55 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611091321x16991139g1a34aea8370c499d@mail.gmail.com> Message-ID: <2006119152555.142989@leena> > send = "GET / HTTP/1.0\r\n\r\n" > expect = "HTTP" > > is what that section of your lvs.cf file should be for a virtual webserver. serial_no = 121 primary = 192.168.1.52 service = lvs backup_active = 1 backup = 0.0.0.0 heartbeat = 1 heartbeat_port = 539 keepalive = 6 deadtime = 18 network = direct nat_nmask = 255.255.255.255 debug_level = NONE monitor_links = 0 virtual HTTP { active = 1 address = 192.168.1.150 eth0:1 vip_nmask = 255.255.255.0 port = 80 send = "GET / HTTP/1.0rnrn" expect = "HTTP" use_regex = 0 load_monitor = none scheduler = wlc protocol = tcp timeout = 6 reentry = 15 quiesce_server = 0 server cweb92 { address = 192.168.1.92 active = 1 weight = 1 } server cweb93 { address = 192.168.1.93 active = 1 weight = 1 } server cweb94 { address = 192.168.1.94 active = 1 weight = 1 } } From dbrieck at gmail.com Thu Nov 9 22:07:48 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 17:07:48 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119152555.142989@leena> References: <8c1094290611091321x16991139g1a34aea8370c499d@mail.gmail.com> <2006119152555.142989@leena> Message-ID: <8c1094290611091407l3b047fe6sfa8b2d68e781ccd2@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > > send = "GET / HTTP/1.0\r\n\r\n" > > expect = "HTTP" > > > > is what that section of your lvs.cf file should be for a virtual webserver. 
> > serial_no = 121 > primary = 192.168.1.52 > service = lvs > backup_active = 1 > backup = 0.0.0.0 > heartbeat = 1 > heartbeat_port = 539 > keepalive = 6 > deadtime = 18 > network = direct > nat_nmask = 255.255.255.255 > debug_level = NONE > monitor_links = 0 > virtual HTTP { > active = 1 > address = 192.168.1.150 eth0:1 > vip_nmask = 255.255.255.0 > port = 80 > send = "GET / HTTP/1.0rnrn" > expect = "HTTP" > use_regex = 0 > load_monitor = none > scheduler = wlc > protocol = tcp > timeout = 6 > reentry = 15 > quiesce_server = 0 > server cweb92 { > address = 192.168.1.92 > active = 1 > weight = 1 > } > server cweb93 { > address = 192.168.1.93 > active = 1 > weight = 1 > } > server cweb94 { > address = 192.168.1.94 > active = 1 > weight = 1 > } > } > > It's in there, but not exactly the way I posted. The slashes in front of the r's and n's are critical, otherwise you're just sending a malformed header that probably won't get you a response. If HTML weren't such bad form on a mailing list I could probably point it out easier. From lshen at cisco.com Fri Nov 3 21:48:39 2006 From: lshen at cisco.com (Lin Shen (lshen)) Date: Fri, 3 Nov 2006 13:48:39 -0800 Subject: [Linux-cluster] Does GFS work well with hybrid storage devices? Message-ID: <08A9A3213527A6428774900A80DBD8D802D9645A@xmb-sjc-222.amer.cisco.com> Will GFS work well in a cluster system that hosts a wide range of storage devices such as disk array, hard disk, flash and USB devices? Since these devices have very different performance and reliability behaviors, it could be a challenge for any CFS, right? I read from somewhere that GFS doesn't support multiple writers on the same file simutaneously. Is this true? Lin Shen Cisco Systems From isplist at logicore.net Thu Nov 9 22:52:07 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 9 Nov 2006 16:52:07 -0600 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <8c1094290611091259r38c3e9efm973ad3eec1e41f26@mail.gmail.com> Message-ID: <200611916527.941612@leena> Ah, interesting. You have found an error! In the package itself? Mine is a default install, nothing custom about it so I have to believe I'm not the only one with this problem? > send = "GET / HTTP/1.0rnrn" > > It should be (at least one mine it is): > send = "GET / HTTP/1.0\r\n\r\n" Changed as suggested, restarted pulse. Nanny now working properly, no more timeouts to real servers. Nov 9 16:44:56 lb52 nanny[3616]: starting LVS client monitor for 192.168.1.150:80 Nov 9 16:44:56 lb52 lvs[3613]: create_monitor for HTTP/cweb92 running as pid 3616 Nov 9 16:44:56 lb52 nanny[3617]: starting LVS client monitor for 192.168.1.150:80 Nov 9 16:44:56 lb52 lvs[3613]: create_monitor for HTTP/cweb93 running as pid 3617 Nov 9 16:44:56 lb52 nanny[3618]: starting LVS client monitor for 192.168.1.150:80 Nov 9 16:44:56 lb52 lvs[3613]: create_monitor for HTTP/cweb94 running as pid 3618 Nov 9 16:44:56 lb52 nanny[3616]: making 192.168.1.92:80 available Nov 9 16:44:56 lb52 nanny[3617]: making 192.168.1.93:80 available Nov 9 16:44:56 lb52 nanny[3618]: making 192.168.1.94:80 available Nov 9 16:45:01 lb52 pulse[3620]: gratuitous lvs arps finished However, trying to connect to VIP of 192.168.1.150, nothing shows up. ifconfig shows all looks good. From dbrieck at gmail.com Fri Nov 10 01:47:00 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 9 Nov 2006 20:47:00 -0500 Subject: [Linux-cluster] LVS, not so fun today... 
In-Reply-To: <200611916527.941612@leena> References: <8c1094290611091259r38c3e9efm973ad3eec1e41f26@mail.gmail.com> <200611916527.941612@leena> Message-ID: <8c1094290611091747m3afddb47h5b21588cddce20c5@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > Ah, interesting. You have found an error! In the package itself? > Mine is a default install, nothing custom about it so I have to believe I'm > not the only one with this problem? > > > send = "GET / HTTP/1.0rnrn" > > > > It should be (at least one mine it is): > > send = "GET / HTTP/1.0\r\n\r\n" > > Changed as suggested, restarted pulse. Nanny now working properly, no more > timeouts to real servers. > > Nov 9 16:44:56 lb52 nanny[3616]: starting LVS client monitor for > 192.168.1.150:80 > Nov 9 16:44:56 lb52 lvs[3613]: create_monitor for HTTP/cweb92 running as pid > 3616 > Nov 9 16:44:56 lb52 nanny[3617]: starting LVS client monitor for > 192.168.1.150:80 > Nov 9 16:44:56 lb52 lvs[3613]: create_monitor for HTTP/cweb93 running as pid > 3617 > Nov 9 16:44:56 lb52 nanny[3618]: starting LVS client monitor for > 192.168.1.150:80 > Nov 9 16:44:56 lb52 lvs[3613]: create_monitor for HTTP/cweb94 running as pid > 3618 > Nov 9 16:44:56 lb52 nanny[3616]: making 192.168.1.92:80 available > Nov 9 16:44:56 lb52 nanny[3617]: making 192.168.1.93:80 available > Nov 9 16:44:56 lb52 nanny[3618]: making 192.168.1.94:80 available > Nov 9 16:45:01 lb52 pulse[3620]: gratuitous lvs arps finished > > However, trying to connect to VIP of 192.168.1.150, nothing shows up. > > ifconfig shows all looks good. > Alright, so what does the ifconfig look like for one of the real servers? What method are you using to make sure the real server responds to the VIP requests? For me, I have the VIP on the loopback adapter and use an arptable rule to make sure it doesn't respond to arp requests. You need to make sure apache is listening on the VIP, not it's actual IP address. From ranjtech at gmail.com Fri Nov 10 03:17:34 2006 From: ranjtech at gmail.com (RR) Date: Fri, 10 Nov 2006 13:17:34 +1000 Subject: [Linux-cluster] Quorum disk: Can it be a partition? In-Reply-To: <1163089060.7048.12.camel@rei.boston.devel.redhat.com> References: <45211826.1050703@redhat.com> <20061009094721.65AAB27C57@ux-mail.informatics.lk> <20061009115927.C7198@xos037.xos.nl> <452AC716.2040207@redhat.com> <1163089060.7048.12.camel@rei.boston.devel.redhat.com> Message-ID: On 11/10/06, Lon Hohberger wrote: > On Thu, 2006-11-09 at 13:40 +1000, RR wrote: > > On 10/10/06, Robert Peterson wrote: > > > Yes. You definitely want to use shared storage. > > > > > > Regards, > > > > > > Bob Peterson > > > > Hello, > > > > does anyone know if having this quorum disk on an iSCSI SAN which the > > linux nodes can only access through an iscsi-initiator with non-TOE > > NICs would cause any significant CPU usage? > > Qdisk is normally used to watch network paths and advertise via a > non-network channel (i.e. a SAN) about a node's viability in the > cluster... > > Using qdisk over iSCSI (or gnbd) is doable, but you'd have to have it on > a private, iSCSI-only network for it to make any sense. (i.e. treat the > iSCSI network as a SAN which only has SAN traffic). > > I don't know the implications of using TOE vs. non-TOE for your > configuration. Someone else will have to answer that one. > > -- Lon Hi Lon, thanks for the response. Yeah I have an isolated SAN where the iSCSI targets reside and is serviced by stacked Cisco GigE switches and there's nothing on this network besides iSCSI traffic. 
You didn't say anything about if this can be an NFS type partition (but that would probably not qualify as a non-network path) although NFS also uses a fair bit of CPU but I have seen the CPU usage on a Dual-Xeon 3.6Ghz computer go upto 44% (is there a way in Windows to see usage per virtual CPU?) when transferring a 1GB file across the network to the SAN over non-TOE NICs. From bganeshmail at gmail.com Fri Nov 10 07:13:19 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Fri, 10 Nov 2006 12:43:19 +0530 Subject: [Linux-cluster] Redhat cluster suite 4 ccsd service Message-ID: <4e3b15f00611092313p23c3e7f3m6b1b6de7bbfb5e4c@mail.gmail.com> Dear Sir, magma-plugins was installed but still probelm. Before that i will explain how the installtion was carried out. 1.REdhat has shipped the Redhat Cluster Suite 4 for Itanium ,IA64 and Ia 32 cd. The Itanium Cd was Inserted and made to run in autmode.But it gives error of cman-kernel and dlm-kernel is required. 2.Through command line installed both the rpm using #rpm -ivh cam-kernel-* --nodeps #rpm -ivh dlm-kernel-* --nodeps. 3.Again the autorun was executed and installed all the rpm which shows in selection menu.(Same process was carried out in other node also). 4.Executed the command system-config-cluster and configured Cluster name and node name.Sabed the filed. 5.Copied /etc/cluster/cluster.conf to other node. 6.Execcuted #service ccsd start .# service cman failed ,# service rgmanger start All shows failed. Pls help what is the mistake in this process. Regds B.Ganesh Message: 5 Date: Thu, 9 Nov 2006 07:51:58 -0500 From: Josef Whiter Subject: Re: [Linux-cluster] Redhat cluster suite 4 ccsd service failed To: linux clustering Message-ID: <20061109125157.GB24980 at korben.rdu.redhat.com> Content-Type: text/plain; charset=us-ascii You didnt install magma-plugins, install that package and it should start working. Josef On Thu, Nov 09, 2006 at 04:24:29PM +0530, Ganesh B wrote: > Dear All, > > Installed the necessary rpm which came along with redhat cluster cd. > > But the ccsd service was not getting started without any error,. > > How to diagnose the error and sort out the issue. > > > > > > This is the fresh installation.Kindly give some instructions. > > Installation caried as per the procedure given from Redhat site. > > Pls help > > Regds > B.Ganesh From bganeshmail at gmail.com Fri Nov 10 07:29:54 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Fri, 10 Nov 2006 12:59:54 +0530 Subject: [Linux-cluster] Redhat Cluster Service ccsd failed Message-ID: <4e3b15f00611092329h173864cy24c15c389559a41d@mail.gmail.com> Dear Sir, magma-plugins was installed but still probelm. Before that i will explain how the installtion was carried out. 1.REdhat has shipped the Redhat Cluster Suite 4 for Itanium ,IA64 and Ia 32 cd. The Itanium Cd was Inserted and made to run in autmode.But it gives error of cman-kernel and dlm-kernel is required. 2.Through command line installed both the rpm using #rpm -ivh cam-kernel-* --nodeps #rpm -ivh dlm-kernel-* --nodeps. 3.Again the autorun was executed and installed all the rpm which shows in selection menu.(Same process was carried out in other node also). 4.Executed the command system-config-cluster and configured Cluster name and node name.Sabed the filed. 5.Copied /etc/cluster/cluster.conf to other node. 6.Execcuted #service ccsd start .# service cman failed ,# service rgmanger start All shows failed. Pls help what is the mistake in this process. 
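(For reference, the startup order the Cluster Suite documentation describes, written out as commands, plus the checks seen elsewhere in this thread -- a generic checklist rather than the exact steps used on these nodes:)

   # on every node, in this order
   service ccsd start
   service cman start
   service fenced start
   service rgmanager start

   # quick checks between steps
   ccs_test connect          # should print a connection descriptor
   cman_tool status          # shows membership state and quorum
   tail -n 50 /var/log/messages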
Regds B.Ganesh From jos at xos.nl Fri Nov 10 09:27:18 2006 From: jos at xos.nl (Jos Vos) Date: Fri, 10 Nov 2006 10:27:18 +0100 Subject: [Linux-cluster] RHEL4 cluster source RPMs incomplete In-Reply-To: <1162415044.4518.400.camel@rei.boston.devel.redhat.com>; from lhh@redhat.com on Wed, Nov 01, 2006 at 04:04:04PM -0500 References: <200610121452.k9CEqYg13019@xos037.xos.nl> <1161873875.4518.82.camel@rei.boston.devel.redhat.com> <20061026165124.A10943@xos037.xos.nl> <1162415044.4518.400.camel@rei.boston.devel.redhat.com> Message-ID: <20061110102718.A28789@xos037.xos.nl> On Wed, Nov 01, 2006 at 04:04:04PM -0500, Lon Hohberger wrote: > Until it appears, I've placed it here: > > http://people.redhat.com/lhh/system-config-cluster-1.0.27-1.0.src.rpm Also still missing: piranha-0.8.3-1.src.rpm ... -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From peter.huesser at psi.ch Fri Nov 10 09:32:47 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 10 Nov 2006 10:32:47 +0100 Subject: [Linux-cluster] Why kernelmodules? Message-ID: <8E2924888511274B95014C2DD906E58A01108724@MAILBOX0A.psi.ch> Hello While comparing the redhat HA solution with linux HA (www.linux-ha.org ) we remarked that linux HA does not make use of kernelmodules. What is the advantage off a kernel module solution over a non-kernelmodule solution. What things can not be done with a non-kernelmodule solution? Thank's for any answer Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From hlawatschek at atix.de Fri Nov 10 10:13:58 2006 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Fri, 10 Nov 2006 11:13:58 +0100 Subject: [Linux-cluster] rgmanager service initialization on startup Message-ID: <200611101113.59201.hlawatschek@atix.de> Hi, we internally discussed the behaviour of the rgmanager resource group initialization, and we came across the following problem: The initialization process includes a stop operation for all defined resources in the resource group. I.e. every time a cluster member starts the rgmanager (e.g. after a reboot) all service agents are called with a stop option, although the service is not running on that node. To track problems and service transitions, our resource agents drop emails to operation everytime a service has been stopped. Now they receive email alerts caused from rgmanager startups, although everything is ok and the services are still running on another cluster node. As you might imagine, the support people would suffer regular heart attacks ;-). Is it really important to stop unstarted services after a reboot ? If so, maybe a service attribute (e.g. could be introduced to restain this behaviour. What do you think ? -- Gruss / Regards, Dipl.-Ing. Mark Hlawatschek http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX - Ges. fuer Informationstechnologie und Consulting mbH Einsteinstr. 
10 - 85716 Unterschleissheim - Germany From sandra-llistes at fib.upc.edu Fri Nov 10 13:01:10 2006 From: sandra-llistes at fib.upc.edu (sandra-llistes) Date: Fri, 10 Nov 2006 14:01:10 +0100 Subject: [Linux-cluster] GFS and samba problem in Fedora, again In-Reply-To: <45523BFB.30707@redhat.com> References: <4523A637.1060706@fib.upc.edu> <4526B4ED.9050907@redhat.com> <452B4F39.60906@fib.upc.edu> <452BFC6D.60902@redhat.com> <452D0ED6.9040605@fib.upc.edu> <452D7040.8090704@redhat.com> <4533A6EA.6030503@fib.upc.edu> <4533B9C2.8080906@redhat.com> <4550BB53.9070401@fib.upc.edu> <45523BFB.30707@redhat.com> Message-ID: <45547816.8040500@fib.upc.edu> Hi Abhi, I re-compiled ccsd with debug option, and I attach the logs of each node. CMAN works ok and quored is regained: Nov 10 13:30:54 tigreton kernel: CMAN: Waiting to join or form a Linux-cluster Nov 10 13:31:24 tigreton kernel: CMAN: sending membership request Nov 10 13:31:24 tigreton kernel: CMAN: got node nocilla.fib.upc.es Nov 10 13:31:24 tigreton kernel: CMAN: quorum regained, resuming activity Nov 10 13:36:45 tigreton kernel: CMAN: we are leaving the cluster. Removed daemon.log: ........... Nov 10 13:30:44 tigreton ccsd[3951]: [ccsd.c:87] Unable to create IPv6 socket:: Address family not supported by protocol ---> says that because I disabled IPV6 (the result with it was the same). Nov 10 13:30:44 tigreton ccsd[3951]: [ccsd.c:99] Using IPv4 Nov 10 13:30:44 tigreton ccsd[3951]: [ccsd.c:878] Entering setup_local_socket ........... In Red Hat documentation there is a daemon called clvmd. We didn't install because it wasn't in rpms, and seems unuseful because two nodes sees LVM logical partitions without it. In my test with RHES I didn't install it also, and GFS+samba worked, but I asked if it could be a relation between clvmd and locks problems with samba? Do you think that is important enough to try install it? We are planning to buy RHES4 because, in test environment GFS and samba works better, and we're interested in support. Is GFS included in RHES4 for academic use, or I have to download it from SVN? What kind of support RH gives if we buy RHES? Thanks a lot, Sandra Hern?ndez Abhijith Das wrote: > > Hi Sandra, > I'm assuming you're talking about the RHEL4 branch of CVS. I'm not 100% > sure that the cluster-suite and GFS are supported for FC5. (Somebody on > this list can confirm). The error however looks like ccsd is unable to > talk to cman. At what point do you see this error? Are you using the > init-scripts? I'd suggest that you compile ccsd with the DEBUG flag and > try starting all the components by hand. Hopefully that'll give us more > information. > > Thanks, > --Abhi -------------- next part -------------- A non-text attachment was scrubbed... Name: daemonNODE1.log.gz Type: application/x-gzip Size: 3726 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: daemonNODE2.log.gz Type: application/x-gzip Size: 3793 bytes Desc: not available URL: From pcaulfie at redhat.com Fri Nov 10 13:16:06 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 10 Nov 2006 13:16:06 +0000 Subject: [Linux-cluster] GFS and samba problem in Fedora, again In-Reply-To: <45547816.8040500@fib.upc.edu> References: <4523A637.1060706@fib.upc.edu> <4526B4ED.9050907@redhat.com> <452B4F39.60906@fib.upc.edu> <452BFC6D.60902@redhat.com> <452D0ED6.9040605@fib.upc.edu> <452D7040.8090704@redhat.com> <4533A6EA.6030503@fib.upc.edu> <4533B9C2.8080906@redhat.com> <4550BB53.9070401@fib.upc.edu> <45523BFB.30707@redhat.com> <45547816.8040500@fib.upc.edu> Message-ID: <45547B96.4030800@redhat.com> sandra-llistes wrote: > Hi Abhi, > > I re-compiled ccsd with debug option, and I attach the logs of each > node. CMAN works ok and quored is regained: > Nov 10 13:30:54 tigreton kernel: CMAN: Waiting to join or form a > Linux-cluster > Nov 10 13:31:24 tigreton kernel: CMAN: sending membership request > Nov 10 13:31:24 tigreton kernel: CMAN: got node nocilla.fib.upc.es > Nov 10 13:31:24 tigreton kernel: CMAN: quorum regained, resuming activity > Nov 10 13:36:45 tigreton kernel: CMAN: we are leaving the cluster. Removed That message indicates that the node was manually removed from the cluster with the "cman_tool leave remove" command. -- patrick From redhat at watson-wilson.ca Fri Nov 10 13:58:47 2006 From: redhat at watson-wilson.ca (Neil Watson) Date: Fri, 10 Nov 2006 08:58:47 -0500 Subject: [Linux-cluster] ClusterFS "Generic Error" In-Reply-To: <455374F6.7000309@rebelbase.com> References: <455374F6.7000309@rebelbase.com> Message-ID: <20061110135847.GB14507@watson-wilson.ca> What does the kernel say about the status of the file system? Can you still read and write to them? -- Neil Watson | Debian Linux System Administrator | Uptime 5 days http://watson-wilson.ca From lhh at redhat.com Fri Nov 10 14:34:10 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 09:34:10 -0500 Subject: [Linux-cluster] CS and DRBD In-Reply-To: <4553675F.7030108@it-linux.cl> References: <23500065.31162496760586.JavaMail.root@lisa.it-linux.cl> <1162572794.4518.493.camel@rei.boston.devel.redhat.com> <4553675F.7030108@it-linux.cl> Message-ID: <1163169250.7048.18.camel@rei.boston.devel.redhat.com> On Thu, 2006-11-09 at 14:37 -0300, Patricio Bruna V. wrote: > Lon Hohberger escribi?: > > On Thu, 2006-11-02 at 16:46 -0300, Patricio A. Bruna wrote: > > > >> Anyone has a working script for use DRBD with Cluster Suite? > >> I need to use DRBD, but i dont know how to make it play with CS. > >> I now CS can mount partitions, but i dont know if CS can mark a drbd > >> device as primary. > >> > > > > I'm not aware of any -- but you're not the first to ask. Could you file > > a bugzilla against fc6 / rgmanager? It might be as simple as using an > > existing script to do it. > > > > > I got it worked. > I used the script /etc/ha.d/resource.d/drbddisk, i had to modified it a > little bit so. Can you post the script here? -- Lon From lhh at redhat.com Fri Nov 10 14:35:09 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 09:35:09 -0500 Subject: [Linux-cluster] ClusterFS "Generic Error" In-Reply-To: <455374F6.7000309@rebelbase.com> References: <455374F6.7000309@rebelbase.com> Message-ID: <1163169309.7048.20.camel@rei.boston.devel.redhat.com> What version of rgmanager do you have installed? It sounds like a bug we fixed awhile ago. 
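For readers following the "CS and DRBD" exchange above: the heartbeat drbddisk script that was adapted essentially just promotes and demotes the DRBD resource around the service start and stop. A minimal sketch of that idea as an rgmanager script resource is below; the resource name r0 is a placeholder and this is not the poster's actual modification.

#!/bin/bash
# Sketch only: promote/demote a DRBD resource so a cluster service can
# mount the device on whichever node is currently active.
# "r0" is a placeholder resource name, not taken from the thread.
RES="r0"

case "$1" in
  start)
    # become primary before the filesystem resource on top is mounted
    drbdadm primary "$RES"
    ;;
  stop)
    # demote so the other node can take over cleanly
    drbdadm secondary "$RES"
    ;;
  status)
    # succeed only while this node holds the Primary role
    # (newer drbdadm calls this "role", older versions "state")
    drbdadm role "$RES" 2>/dev/null | grep -q '^Primary' && exit 0
    drbdadm state "$RES" 2>/dev/null | grep -q '^Primary' && exit 0
    exit 1
    ;;
  *)
    echo "usage: $0 {start|stop|status}"
    exit 1
    ;;
esac
exit $?

Hooked into cluster.conf as an ordinary script resource it behaves like any init-style agent; the script actually posted to the list may well differ.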
-- Lon From lhh at redhat.com Fri Nov 10 14:39:52 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 09:39:52 -0500 Subject: [Linux-cluster] Quorum disk: Can it be a partition? In-Reply-To: References: <45211826.1050703@redhat.com> <20061009094721.65AAB27C57@ux-mail.informatics.lk> <20061009115927.C7198@xos037.xos.nl> <452AC716.2040207@redhat.com> <1163089060.7048.12.camel@rei.boston.devel.redhat.com> Message-ID: <1163169592.7048.26.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 13:17 +1000, RR wrote: > thanks for the response. Yeah I have an isolated SAN where the iSCSI > targets reside and is serviced by stacked Cisco GigE switches and > there's nothing on this network besides iSCSI traffic. > > You didn't say anything about if this can be an NFS type partition Qdisk needs O_DIRECT access to a block device; it doesn't use file system semantics. It might work to do loopback-over-NFS, but I really think it's a poor idea at best. I suppose that technically, if it works with O_DIRECT over loopback-on-nfs, it should also work with a flat file on NFS, but neither case has been tested... > (but that would probably not qualify as a non-network path) although > NFS also uses a fair bit of CPU but I have seen the CPU usage on a > Dual-Xeon 3.6Ghz computer go upto 44% > (is there a way in Windows to > see usage per virtual CPU?) I have no idea. ;) -- Lon From lhh at redhat.com Fri Nov 10 14:40:48 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 09:40:48 -0500 Subject: [Linux-cluster] RHEL4 cluster source RPMs incomplete In-Reply-To: <20061110102718.A28789@xos037.xos.nl> References: <200610121452.k9CEqYg13019@xos037.xos.nl> <1161873875.4518.82.camel@rei.boston.devel.redhat.com> <20061026165124.A10943@xos037.xos.nl> <1162415044.4518.400.camel@rei.boston.devel.redhat.com> <20061110102718.A28789@xos037.xos.nl> Message-ID: <1163169648.7048.28.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 10:27 +0100, Jos Vos wrote: > On Wed, Nov 01, 2006 at 04:04:04PM -0500, Lon Hohberger wrote: > > > Until it appears, I've placed it here: > > > > http://people.redhat.com/lhh/system-config-cluster-1.0.27-1.0.src.rpm > > Also still missing: piranha-0.8.3-1.src.rpm ... > At this rate, I'm going to run out of space on my people page ;) -- Lon From lhh at redhat.com Fri Nov 10 15:01:05 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 10:01:05 -0500 Subject: [Linux-cluster] Why kernelmodules? In-Reply-To: <8E2924888511274B95014C2DD906E58A01108724@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58A01108724@MAILBOX0A.psi.ch> Message-ID: <1163170865.7048.49.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 10:32 +0100, Huesser Peter wrote: > Hello > > > > While comparing the redhat HA solution with linux HA > (www.linux-ha.org) we remarked that linux HA does not make use of > kernelmodules. What is the advantage off a kernel module solution over > a non-kernelmodule solution. What things can not be done with a > non-kernelmodule solution? The main reason is that Linux-HA does not include a file system, so it doesn't have (nor need) any kernel components. In contrast, linux-cluster was built around a file system originally, so there are parts which necessarily need to be in the kernel (namely GFS & DLM). CMAN was in the kernel in the RHEL4 branch, but now is in userland (and plugs in to openais). I guess you could port GFS to FUSE if you wanted... 
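As a quick illustration of the split described above: on a RHEL4 node the kernel-side pieces of linux-cluster are ordinary modules, while the daemons run in userland, so both halves are easy to inspect (module and daemon names as they appear elsewhere in this digest):

# list the cluster-related kernel modules on a RHEL4 node
lsmod | egrep '^(cman|dlm|lock_dlm|lock_harness|gfs)'

# the userland side, by contrast, is just normal processes
ps -e | egrep 'ccsd|fenced|clvmd|clurgmgrd'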
*scratches head* -- Lon From lhh at redhat.com Fri Nov 10 15:26:02 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 10:26:02 -0500 Subject: [Linux-cluster] Re: rgmanager service initialization on startup In-Reply-To: <200611101113.59201.hlawatschek@atix.de> References: <200611101113.59201.hlawatschek@atix.de> Message-ID: <1163172362.7048.66.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 11:13 +0100, Mark Hlawatschek wrote: > Is it really important to stop unstarted services after a reboot ? If so, > maybe a service attribute (e.g. could > be introduced to restain this behaviour. I disagree with the idea. Some people have put cluster-managed file systems in /etc/fstab, causing file system corruption. Anything - anything at all - that minimizes the amount of damage caused by mistakes like this is not only important. Furthermore, if it's a specific agent that's sending the email, then you're going to have to set up agent parameter inheritance for service% cleanstart. While not hard, it is not an OCF-compatible structure. > What do you think ? I think it would be better to add an environment variable which indicates that the resource manager is an initialization path, and let agents check for that. e.g. _INIT=yes (or something) ... stop) if [ -z "$_INIT" ]; then email_someone_NOW fi do_stop_stuff exit $? ;; ... (only exists during the "initialization" path) While this also isn't OCF-compatible, at least this way, it wouldn't introduce any incompatibilities between running your agent on CRM and running it on rgmanager. -- Lon From dbrieck at gmail.com Fri Nov 10 15:44:18 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Fri, 10 Nov 2006 10:44:18 -0500 Subject: [Linux-cluster] LVS, not so fun today... In-Reply-To: <2006119204457.534898@leena> References: <8c1094290611091747m3afddb47h5b21588cddce20c5@mail.gmail.com> <2006119204457.534898@leena> Message-ID: <8c1094290611100744m2efc3127mf304c16dfe839530@mail.gmail.com> On 11/9/06, isplist at logicore.net wrote: > > Alright, so what does the ifconfig look like for one of the real > > servers? > > Here's the output of the first server, cweb92. > > eth0 Link encap:Ethernet HWaddr 00:20:94:10:3B:13 > inet addr:192.168.1.92 Bcast:192.168.1.255 Mask:255.255.255.0 > inet6 addr: fe80::220:94ff:fe10:3b13/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:1525827 errors:0 dropped:0 overruns:0 frame:0 > TX packets:189047 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:108127316 (103.1 MiB) TX bytes:34679304 (33.0 MiB) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:205 errors:0 dropped:0 overruns:0 frame:0 > TX packets:205 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:13204 (12.8 KiB) TX bytes:13204 (12.8 KiB) > > >What method are you using to make sure the real server > >responds to the VIP requests? > > Ping and arping from other servers on the network. 192.168.1.150 seems to > respond to all servers. > > > For me, I have the VIP on the loopback adapter and use an arptable > > rule to make sure it doesn't respond to arp requests. You need to make > > sure apache is listening on the VIP, not it's actual IP address. > > Right but at this point, I've confused myself with what method to use. What > method do you use or what commands do you run on your real servers to set up > the IP? > > My VIP is 192.168.1.150. 
The real IP on the first web server is 192.168.1.92, > then 93 and 94 on the next servers. I suspect if web alone is this > complicated, that trying to set up my mail services are going to be murder :). > > Mike > On each real server you need to have to VIP setup on lo:0 (or 1,2,3 if needed). It should have a netmask of 255.255.255.255. Something like: cat /etc/sysconfig/network-scripts/ifcfg-lo:0 NETMASK=255.255.255.255 MTU="" BOOTPROTO=none BROADCAST=192.168.1.255 ONPARENT=yes IPADDR=192.168.1.150 NETWORK=192.168.1.1 ONBOOT=yes DEVICE=lo:0 Then, add the following to the end of /etc/rc.local: # make sure we don't respond to arp requests for the mysql cluster arptables -A IN -d 192.168.1.150 -j DROP this will fix any ARP problems. So, once you do the above, just run ifup lo:0 && arptables -A IN -d 192.168.1.150 -j DROP You need to do this on all of the servers which need to respond to the VIP. Then, in your apache config file make sure apache is listening on the VIP NOT 192.168.1.92, 93, or 94. Make any adjustments then restart apache. From lhh at redhat.com Fri Nov 10 15:47:47 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 10:47:47 -0500 Subject: [Linux-cluster] RHEL4 cluster source RPMs incomplete In-Reply-To: <1163169648.7048.28.camel@rei.boston.devel.redhat.com> References: <200610121452.k9CEqYg13019@xos037.xos.nl> <1161873875.4518.82.camel@rei.boston.devel.redhat.com> <20061026165124.A10943@xos037.xos.nl> <1162415044.4518.400.camel@rei.boston.devel.redhat.com> <20061110102718.A28789@xos037.xos.nl> <1163169648.7048.28.camel@rei.boston.devel.redhat.com> Message-ID: <1163173667.7048.69.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 09:40 -0500, Lon Hohberger wrote: > On Fri, 2006-11-10 at 10:27 +0100, Jos Vos wrote: > > On Wed, Nov 01, 2006 at 04:04:04PM -0500, Lon Hohberger wrote: > > > > > Until it appears, I've placed it here: > > > > > > http://people.redhat.com/lhh/system-config-cluster-1.0.27-1.0.src.rpm > > > > Also still missing: piranha-0.8.3-1.src.rpm ... > > > > At this rate, I'm going to run out of space on my people page ;) Hmm -- could you try grabbing it from 108? https://lon.108.redhat.com/files/documents/159/158/piranha-0.8.3-1.src.rpm If not, it's still here: http://people.redhat.com/lhh/piranha-0.8.3-1.src.rpm -- Lon From jos at xos.nl Fri Nov 10 16:11:08 2006 From: jos at xos.nl (Jos Vos) Date: Fri, 10 Nov 2006 17:11:08 +0100 Subject: [Linux-cluster] RHEL4 cluster source RPMs incomplete In-Reply-To: <1163173667.7048.69.camel@rei.boston.devel.redhat.com>; from lhh@redhat.com on Fri, Nov 10, 2006 at 10:47:47AM -0500 References: <200610121452.k9CEqYg13019@xos037.xos.nl> <1161873875.4518.82.camel@rei.boston.devel.redhat.com> <20061026165124.A10943@xos037.xos.nl> <1162415044.4518.400.camel@rei.boston.devel.redhat.com> <20061110102718.A28789@xos037.xos.nl> <1163169648.7048.28.camel@rei.boston.devel.redhat.com> <1163173667.7048.69.camel@rei.boston.devel.redhat.com> Message-ID: <20061110171108.F28789@xos037.xos.nl> On Fri, Nov 10, 2006 at 10:47:47AM -0500, Lon Hohberger wrote: > https://lon.108.redhat.com/files/documents/159/158/piranha-0.8.3-1.src.rpm This doesn't work. > http://people.redhat.com/lhh/piranha-0.8.3-1.src.rpm This works, thanks. 
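Circling back to the LVS direct-routing recipe a few messages above: the remaining piece is making httpd bind the virtual address rather than each real server's own IP. A hedged example using the VIP from that thread (adjust to your own addresses):

# on each real server, have apache listen on the VIP, then restart and
# confirm the bind; in /etc/httpd/conf/httpd.conf:
#   Listen 192.168.1.150:80
service httpd restart
netstat -tln | grep 192.168.1.150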
-- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From sandra-llistes at fib.upc.edu Fri Nov 10 17:15:26 2006 From: sandra-llistes at fib.upc.edu (sandra-llistes) Date: Fri, 10 Nov 2006 18:15:26 +0100 Subject: [Linux-cluster] GFS and samba problem in Fedora, again In-Reply-To: <45547B96.4030800@redhat.com> References: <4523A637.1060706@fib.upc.edu> <4526B4ED.9050907@redhat.com> <452B4F39.60906@fib.upc.edu> <452BFC6D.60902@redhat.com> <452D0ED6.9040605@fib.upc.edu> <452D7040.8090704@redhat.com> <4533A6EA.6030503@fib.upc.edu> <4533B9C2.8080906@redhat.com> <4550BB53.9070401@fib.upc.edu> <45523BFB.30707@redhat.com> <45547816.8040500@fib.upc.edu> <45547B96.4030800@redhat.com> Message-ID: <4554B3AE.8070001@fib.upc.edu> Yes, well, I'm sorry because the logs are not so clear.. I do a ccsd start, cman start, fenced start and rgmanager start. The quorum of cman is regained but ccsd has a lot of errors. Then finally, I stop the cluster, that's because the last sentence is that cman is leaving the cluster. As you can see, there are five minutes of difference between one sentence and another. Sorry for the misunderstanding, Sandra Patrick Caulfield wrote: > sandra-llistes wrote: >> Hi Abhi, >> >> I re-compiled ccsd with debug option, and I attach the logs of each >> node. CMAN works ok and quored is regained: >> Nov 10 13:30:54 tigreton kernel: CMAN: Waiting to join or form a >> Linux-cluster >> Nov 10 13:31:24 tigreton kernel: CMAN: sending membership request >> Nov 10 13:31:24 tigreton kernel: CMAN: got node nocilla.fib.upc.es >> Nov 10 13:31:24 tigreton kernel: CMAN: quorum regained, resuming activity >> Nov 10 13:36:45 tigreton kernel: CMAN: we are leaving the cluster. Removed > > > That message indicates that the node was manually removed from the cluster > with the "cman_tool leave remove" command. > From matt at rebelbase.com Fri Nov 10 18:18:38 2006 From: matt at rebelbase.com (Matt Eagleson) Date: Fri, 10 Nov 2006 10:18:38 -0800 Subject: [Linux-cluster] ClusterFS "Generic Error" In-Reply-To: <1163169309.7048.20.camel@rei.boston.devel.redhat.com> References: <455374F6.7000309@rebelbase.com> <1163169309.7048.20.camel@rei.boston.devel.redhat.com> Message-ID: <4554C27E.3050903@rebelbase.com> Version: rgmanager-1.9.53-0 Lon Hohberger wrote: > What version of rgmanager do you have installed? It sounds like a bug > we fixed awhile ago. > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From matt at rebelbase.com Fri Nov 10 18:27:36 2006 From: matt at rebelbase.com (Matt Eagleson) Date: Fri, 10 Nov 2006 10:27:36 -0800 Subject: [Linux-cluster] ClusterFS "Generic Error" In-Reply-To: <20061110135847.GB14507@watson-wilson.ca> References: <455374F6.7000309@rebelbase.com> <20061110135847.GB14507@watson-wilson.ca> Message-ID: <4554C498.8040807@rebelbase.com> There are no other errors in any of the system logs near the time of the failure. When I finally saw the error and logged in the file system was fine and I can read and write as normal. Neil Watson wrote: > What does the kernel say about the status of the file system? Can you > still read and write to them? 
> From robert.hatch at terebellum.co.uk Fri Nov 10 20:14:01 2006 From: robert.hatch at terebellum.co.uk (Robert Hatch) Date: Fri, 10 Nov 2006 20:14:01 -0000 Subject: [Linux-cluster] Xen Live Migration and iSCSI Message-ID: <000701c70504$c37a51d0$1500010a@orbiter> Hi, I am configuring a Xen high availability solution which backs onto a SAN via iscsi. I am not sure whether to allow the domU's to swap locally on the dom0 or have a separate swap partition on the san. My main question is: When you perform a live migration are the contents of the RAM and the swap transferred to the new dom0 or is it just the ram that is transferred? If both are then wouldn't it be best to have local swapping so that the new state is built on the new node. Would this also be faster as there would be fewer loads on the san in the long term? Or is swapping on the san the best way forward? Thanks for all your help Regards Rob From lhh at redhat.com Fri Nov 10 22:27:01 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 17:27:01 -0500 Subject: [Linux-cluster] Xen Live Migration and iSCSI In-Reply-To: <000701c70504$c37a51d0$1500010a@orbiter> References: <000701c70504$c37a51d0$1500010a@orbiter> Message-ID: <1163197621.7048.85.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 20:14 +0000, Robert Hatch wrote: > Hi, I am configuring a Xen high availability solution which backs onto a > SAN via iscsi. I am not sure whether to allow the domU's to swap > locally on the dom0 or have a separate swap partition on the san. My > main question is: When you perform a live migration are the contents of > the RAM and the swap transferred to the new dom0 or is it just the ram > that is transferred? I'm pretty sure it's RAM only; you need to have the swap partition shared between the nodes (on the SAN). If you're not doing live migration, then whether the swap disk is on the SAN or not is your choice. > If both are then wouldn't it be best to have local swapping so that the > new state is built on the new node. Would this also be faster as there > would be fewer loads on the san in the long term? Even if it does copy the swap disk, I suspect a well engineered SAN will almost always outperform a local hard disk anyway. *shrug* -- Lon From lhh at redhat.com Fri Nov 10 22:27:41 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 10 Nov 2006 17:27:41 -0500 Subject: [Linux-cluster] ClusterFS "Generic Error" In-Reply-To: <4554C27E.3050903@rebelbase.com> References: <455374F6.7000309@rebelbase.com> <1163169309.7048.20.camel@rei.boston.devel.redhat.com> <4554C27E.3050903@rebelbase.com> Message-ID: <1163197661.7048.87.camel@rei.boston.devel.redhat.com> On Fri, 2006-11-10 at 10:18 -0800, Matt Eagleson wrote: > Version: rgmanager-1.9.53-0 > Ok; looks like clusterfs needs to have better error reporting... -- Lon From bganeshmail at gmail.com Sat Nov 11 04:01:56 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Sat, 11 Nov 2006 09:31:56 +0530 Subject: [Linux-cluster] Redhat cluster suite 4 Message-ID: <4e3b15f00611102001y66ba2c5bm47eee6466cf718aa@mail.gmail.com> Dear Josef, You are 100% correct , After installing magamplugins ccsd gets started. But cman service service getting failed. When i see /var/log/messages it shows "CMAN Fatal Module not found". But when iisse rpm -qa cman it shows cman and cman-kernel rpm installed. Kindly help. 
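One hedged way to narrow down the "Module not found" report above is to check that a cman module actually exists for the kernel that is booted; the cman-kernel packages are built per kernel flavour (uniprocessor, smp, hugemem), so a flavour mismatch is a common cause:

# which kernel is running, and which cman-kernel flavours are installed?
uname -r
rpm -q cman-kernel cman-kernel-smp cman-kernel-hugemem

# is there a cman module under the running kernel's module tree?
find /lib/modules/$(uname -r) -name 'cman*.ko'

# if the file is there, this should load it cleanly:
modprobe cman && echo "cman module loaded"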
Regds B.Ganesh To: jwhiter at redhat.com, linux-cluster at redhat.com Message-ID: <4e3b15f00611092313p23c3e7f3m6b1b6de7bbfb5e4c at mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Dear Sir, magma-plugins was installed but still probelm. Before that i will explain how the installtion was carried out. 1.REdhat has shipped the Redhat Cluster Suite 4 for Itanium ,IA64 and Ia 32 cd. The Itanium Cd was Inserted and made to run in autmode.But it gives error of cman-kernel and dlm-kernel is required. 2.Through command line installed both the rpm using #rpm -ivh cam-kernel-* --nodeps #rpm -ivh dlm-kernel-* --nodeps. 3.Again the autorun was executed and installed all the rpm which shows in selection menu.(Same process was carried out in other node also). 4.Executed the command system-config-cluster and configured Cluster name and node name.Sabed the filed. 5.Copied /etc/cluster/cluster.conf to other node. 6.Execcuted #service ccsd start .# service cman failed ,# service rgmanger start All shows failed. Pls help what is the mistake in this process. Regds B.Ganesh Message: 5 Date: Thu, 9 Nov 2006 07:51:58 -0500 From: Josef Whiter Subject: Re: [Linux-cluster] Redhat cluster suite 4 ccsd service failed To: linux clustering Message-ID: <20061109125157.GB24980 at korben.rdu.redhat.com> Content-Type: text/plain; charset=us-ascii You didnt install magma-plugins, install that package and it should start working. Josef From dan at orb.cz Sun Nov 12 19:27:02 2006 From: dan at orb.cz (Daniel Stanek) Date: Sun, 12 Nov 2006 20:27:02 +0100 Subject: [Linux-cluster] samba with ad support as cluster service Message-ID: <45577586.5050505@orb.cz> Hi friends, is anybody here succesfuly running Samba with Active Directory authorisation as an failover service over several cluster nodes? I'd tried to run samba over two nodes with following configuration (part of): > [global] > workgroup = TEST-LAB > netbios name = MACHINE > server string = MACHINE > realm = TEST-LAB.CZ > socket address = 10.6.7.51 > bind interfaces only = yes > interfaces = 10.6.7.51 > security = ads > encrypt passwords = Yes > password server = * > syslog = 0 > log file = /var/log/samba/%m.log > max log size = 2000 > socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192 > load printers = No > show add printer wizard = No > preferred master = No > local master = No > domain master = No > dns proxy = No > wins server = dc1.test-lab.cz > printing = lprng > log level = 2 > ldap ssl = no > restrict anonymous = Yes > lanman auth = No > ntlm auth = No > client ntlmv2 auth = Yes > client lanman auth = No > client plaintext auth = No > idmap uid = 10000-19999 > idmap gid = 20000-29999 > winbind enum users = yes > winbind enum groups = yes > winbind separator = + > max disk size = 10000 > template shell = /bin/false > winbind use default domain = no and the idea behind start (and stop) is something like: 1) assign virtual ip (10.6.7.51) 2) mount shared gfs storage for data 3) mount shared gfs storage for /etc/samba and /var/cache/samba 3) switch system authorisation to winbind 3) start samba and winbind This is working good until first switch to other node. After that AD authorisation stops working and something like "could'nt verify kerberos ticket" appears in logs. This may be something behind kerberos libs, different hostnames etc. I think (??) I am using Centos4.4 and actual cluster suite from centos. Could anybody kick me to solve this? 
:) BTW: what is the current status of support SMB on top of GFS - the faq says, smb is not supported now? Thanks Dan From bganeshmail at gmail.com Mon Nov 13 06:13:30 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Mon, 13 Nov 2006 11:43:30 +0530 Subject: [Linux-cluster] Redhat cluster suite 4 Update 4 Message-ID: <4e3b15f00611122213w7b002a38u25d15bf3cb74dff3@mail.gmail.com> Dear All, Configured Cluster suite 4 update 4 version. I can able to browse my web page through virtual IP. When we clsustat the output looks loke below. On webserver1:#clustat webserver1 Online,local,rgmanager webserver2 offline service: httpdeservice started. On webserver2::#clustat webserver1 offline. webserver2 Online,local,rgmanager What is the problem. Regds B.Ganesh From bganeshmail at gmail.com Mon Nov 13 06:16:31 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Mon, 13 Nov 2006 11:46:31 +0530 Subject: [Linux-cluster] Redhat cluster suite 4 Active-Active Message-ID: <4e3b15f00611122216y108474c1v9e404f9a5364abf4@mail.gmail.com> Dear All, How to configure the redhat cluster suite 4 update 4 version in active-active environment. Regds B.Ganesh From peter.huesser at psi.ch Mon Nov 13 07:11:20 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 13 Nov 2006 08:11:20 +0100 Subject: [Linux-cluster] Why kernelmodules? In-Reply-To: <1163170865.7048.49.camel@rei.boston.devel.redhat.com> Message-ID: <8E2924888511274B95014C2DD906E58A01108795@MAILBOX0A.psi.ch> Thanks' for information Pedro > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Lon Hohberger > Sent: Freitag, 10. November 2006 16:01 > To: linux clustering > Subject: Re: [Linux-cluster] Why kernelmodules? > > On Fri, 2006-11-10 at 10:32 +0100, Huesser Peter wrote: > > Hello > > > > > > > > While comparing the redhat HA solution with linux HA > > (www.linux-ha.org) we remarked that linux HA does not make use of > > kernelmodules. What is the advantage off a kernel module solution over > > a non-kernelmodule solution. What things can not be done with a > > non-kernelmodule solution? > > The main reason is that Linux-HA does not include a file system, so it > doesn't have (nor need) any kernel components. > > In contrast, linux-cluster was built around a file system originally, so > there are parts which necessarily need to be in the kernel (namely GFS & > DLM). CMAN was in the kernel in the RHEL4 branch, but now is in > userland (and plugs in to openais). > > I guess you could port GFS to FUSE if you wanted... *scratches head* > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bganeshmail at gmail.com Mon Nov 13 11:36:00 2006 From: bganeshmail at gmail.com (Ganesh B) Date: Mon, 13 Nov 2006 17:06:00 +0530 Subject: [Linux-cluster] Fence device configuaration Message-ID: <4e3b15f00611130336v47fc346oc320764dfbaa768f@mail.gmail.com> We had 2 servers connected with common san storage. Do we really require fencing device.We had skipped fence deivce configuaration . *Is there any problem because of that.* ** *Pls advive.* ** *Regds* *Ganesh* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mkpai at redhat.com Mon Nov 13 13:37:20 2006 From: mkpai at redhat.com (Pai) Date: Mon, 13 Nov 2006 19:07:20 +0530 Subject: [Linux-cluster] Redhat cluster suite 4 Active-Active In-Reply-To: <4e3b15f00611122216y108474c1v9e404f9a5364abf4@mail.gmail.com> References: <4e3b15f00611122216y108474c1v9e404f9a5364abf4@mail.gmail.com> Message-ID: <20061113133720.GC4859@mkpai.users.redhat.com> Hi Ganesh, > > How to configure the redhat cluster suite 4 update 4 version in > active-active environment. > 1. You could run service foo on many servers and load-balance by using LVS. 2. You could run services foo and bar on different servers and configure the cluster to fail them around the servers in your cluster. Extensive docs at http://www.redhat.com/docs/manuals/csgfs/ . Good luck, -- Pai From cjk at techma.com Mon Nov 13 13:59:16 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Mon, 13 Nov 2006 08:59:16 -0500 Subject: [Linux-cluster] dlm_recvd + bnx2 oops Message-ID: Morning all. We've been experienceing regular cluster crashes on RHEL4u4. This system has 5 nodes and a few dozen nodes mounting shares via nfs. Periodically, nodes will panic, get fenced and all continues on. This system does have some of the HP Product Support Pack installed (not the HP bnx2 driver). Below is the section from the logs. It is hand typed but I am fairly sure it's accurrate. The machines are HP DL360-G5's. The nics are Broadcom NeXtreme II 5708's. Anyone else seeing this? Corey =========================================================== Unable to handle kernel NULL pointer dereference at virtual address 000000ac printing eip: f8f339ae *pde = 37038001 Oops: 0000 [#1] SMP Modules linked in: ipt_multiport iptable_nat ip_conntrack ip_tables ip_vs_rr ip_vs cpqci(U) ipmi_dev intf ipmi_si ipmi_msghandler xp(U) mptctl mptbase sg autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 nfsd exportfs lockd nfs_acl sunrpc joydev dm_mirror button battery ac ehci_hcd uhci_hcd bnx2 ext3 jbd dm_mod qla6312(U) qla2400(U) qla2300(U) qla2xxx_conf(U) qla2xxx(U) cciss sd_mod scsi_mod CPU: 0 EIP: 0060:[] Tainted: P VLI EFLAGS: 00010202 (2.6.9-42.0.2.ELsmp) EIP is at bnx2_tx_int+0x48/0x1d1 [bnx2] eax: f70620dc ebx: 00000ad7 ecx: 00000002 edx: 00000037 esi: 00000a37 edi: 00000000 ebp: f6a0b200 esp: c03cefa0 ds: 007b es: 007b ss: 0068 Process dlm_recvd (pid: 3973, threadinfo=c03ce000 task=f71652f0) Stack: f70620dc 00000037 f5c19000 00000000 f6a0b200 f6a0afc0 c03cefd4 f8f3431d 00000000 f6a0afc0 c201fd80 15a3182b c0280e24 000493dc 00000001 c0392c18 0000000a 00000000 c01269b8 f59d4dc4 00000046 c038b900 f59d4000 c010819f Call trace: [] bnx2_poll+0x4f/0x142 [bnx2] [] net_rx_action+0xae/0x160 [] __do_softirq+0x4c/0xb1 [] do_softirq+0x4f/0x56 =========================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From pcaulfie at redhat.com Mon Nov 13 14:13:40 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 13 Nov 2006 14:13:40 +0000 Subject: [Linux-cluster] dlm_recvd + bnx2 oops In-Reply-To: References: Message-ID: <45587D94.8060709@redhat.com> Kovacs, Corey J. wrote: > Morning all. We've been experienceing regular cluster crashes on RHEL4u4. > This system has 5 nodes and a few dozen nodes mounting shares via nfs. > Periodically, nodes will panic, get fenced and all continues on. This > system > does have some of the HP Product Support Pack installed (not the HP bnx2 > driver). Below is the section from the logs. 
It is hand typed but I am > fairly sure > it's accurrate. > > The machines are HP DL360-G5's. The nics are Broadcom NeXtreme II 5708's. > > > Anyone else seeing this? > > Corey > > =========================================================== > > Unable to handle kernel NULL pointer dereference at virtual address > 000000ac > printing eip: > f8f339ae > *pde = 37038001 > Oops: 0000 [#1] > SMP > Modules linked in: ipt_multiport iptable_nat ip_conntrack ip_tables > ip_vs_rr ip_vs cpqci(U) ipmi_dev intf ipmi_si ipmi_msghandler xp(U) > mptctl mptbase sg autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) > lock_harness(U) dlm(U) cman(U) md5 ipv6 nfsd exportfs lockd nfs_acl > sunrpc joydev dm_mirror button battery ac ehci_hcd uhci_hcd bnx2 ext3 > jbd dm_mod qla6312(U) qla2400(U) qla2300(U) qla2xxx_conf(U) qla2xxx(U) > cciss sd_mod scsi_mod > CPU: 0 > EIP: 0060:[] Tainted: P VLI > EFLAGS: 00010202 (2.6.9-42.0.2.ELsmp) > EIP is at bnx2_tx_int+0x48/0x1d1 [bnx2] > eax: f70620dc ebx: 00000ad7 ecx: 00000002 edx: 00000037 > esi: 00000a37 edi: 00000000 ebp: f6a0b200 esp: c03cefa0 > ds: 007b es: 007b ss: 0068 > Process dlm_recvd (pid: 3973, threadinfo=c03ce000 task=f71652f0) > Stack: f70620dc 00000037 f5c19000 00000000 f6a0b200 f6a0afc0 c03cefd4 > f8f3431d > 00000000 f6a0afc0 c201fd80 15a3182b c0280e24 000493dc 00000001 > c0392c18 > 0000000a 00000000 c01269b8 f59d4dc4 00000046 c038b900 f59d4000 > c010819f > Call trace: > [] bnx2_poll+0x4f/0x142 [bnx2] > [] net_rx_action+0xae/0x160 > [] __do_softirq+0x4c/0xb1 > [] do_softirq+0x4f/0x56 That looks like a driver crash to me. The fact that it's in dlm_recvd is probably just that it's a busy process doing lots of network IO. There's no DLM code in the stacktrace at all -- patrick From rpeterso at redhat.com Mon Nov 13 14:44:25 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 13 Nov 2006 08:44:25 -0600 Subject: [Linux-cluster] Fence device configuaration In-Reply-To: <4e3b15f00611130336v47fc346oc320764dfbaa768f@mail.gmail.com> References: <4e3b15f00611130336v47fc346oc320764dfbaa768f@mail.gmail.com> Message-ID: <455884C9.8080703@redhat.com> Ganesh B wrote: > We had 2 servers connected with common san storage. > Do we really require fencing device.We had skipped fence deivce > configuaration . > *Is there any problem because of that.* > ** > *Pls advive.* > ** > *Regds* > *Ganesh* Hi Ganesh, You may have problems if a fence device isn't configured properly. Please see: http://sources.redhat.com/cluster/faq.html#fence_manual3 Regards, Bob Peterson Red Hat Cluster Suite From cjk at techma.com Mon Nov 13 15:15:57 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Mon, 13 Nov 2006 10:15:57 -0500 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_dlm=5Frecvd_+_bnx2_oops?= In-Reply-To: <45587D94.8060709@redhat.com> Message-ID: Ok, that's sort of what I thought was going on but I wanted to get some feedback. There is another bug in bugzilla that looks like it might be related. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212055 Anyway, thanks Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patrick Caulfield Sent: Monday, November 13, 2006 9:14 AM To: linux clustering Subject: Re: [Linux-cluster] dlm_recvd + bnx2 oops Kovacs, Corey J. wrote: > Morning all. We've been experienceing regular cluster crashes on RHEL4u4. > This system has 5 nodes and a few dozen nodes mounting shares via nfs. > Periodically, nodes will panic, get fenced and all continues on. 
This > system does have some of the HP Product Support Pack installed (not > the HP bnx2 driver). Below is the section from the logs. It is hand > typed but I am fairly sure it's accurrate. > > The machines are HP DL360-G5's. The nics are Broadcom NeXtreme II 5708's. > > > Anyone else seeing this? > > Corey > > =========================================================== > > Unable to handle kernel NULL pointer dereference at virtual address > 000000ac printing eip: > f8f339ae > *pde = 37038001 > Oops: 0000 [#1] > SMP > Modules linked in: ipt_multiport iptable_nat ip_conntrack ip_tables > ip_vs_rr ip_vs cpqci(U) ipmi_dev intf ipmi_si ipmi_msghandler xp(U) > mptctl mptbase sg autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) > lock_harness(U) dlm(U) cman(U) md5 ipv6 nfsd exportfs lockd nfs_acl > sunrpc joydev dm_mirror button battery ac ehci_hcd uhci_hcd bnx2 ext3 > jbd dm_mod qla6312(U) qla2400(U) qla2300(U) qla2xxx_conf(U) qla2xxx(U) > cciss sd_mod scsi_mod > CPU: 0 > EIP: 0060:[] Tainted: P VLI > EFLAGS: 00010202 (2.6.9-42.0.2.ELsmp) > EIP is at bnx2_tx_int+0x48/0x1d1 [bnx2] > eax: f70620dc ebx: 00000ad7 ecx: 00000002 edx: 00000037 > esi: 00000a37 edi: 00000000 ebp: f6a0b200 esp: c03cefa0 > ds: 007b es: 007b ss: 0068 > Process dlm_recvd (pid: 3973, threadinfo=c03ce000 task=f71652f0) > Stack: f70620dc 00000037 f5c19000 00000000 f6a0b200 f6a0afc0 c03cefd4 > f8f3431d > 00000000 f6a0afc0 c201fd80 15a3182b c0280e24 000493dc 00000001 > c0392c18 > 0000000a 00000000 c01269b8 f59d4dc4 00000046 c038b900 f59d4000 > c010819f Call trace: > [] bnx2_poll+0x4f/0x142 [bnx2] [] > net_rx_action+0xae/0x160 [] __do_softirq+0x4c/0xb1 > [] do_softirq+0x4f/0x56 That looks like a driver crash to me. The fact that it's in dlm_recvd is probably just that it's a busy process doing lots of network IO. There's no DLM code in the stacktrace at all -- patrick -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From pcaulfie at redhat.com Mon Nov 13 15:25:40 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 13 Nov 2006 15:25:40 +0000 Subject: [Linux-cluster] dlm_recvd + bnx2 oops In-Reply-To: References: Message-ID: <45588E74.1010000@redhat.com> Kovacs, Corey J. wrote: > Ok, that's sort of what I thought was going on but I wanted to get some > feedback. There is another bug in bugzilla that looks like it might be > related. > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212055 > Yes, that looks pretty similar to me too :) -- patrick From jbrassow at redhat.com Mon Nov 13 17:58:35 2006 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Mon, 13 Nov 2006 11:58:35 -0600 Subject: [Linux-cluster] Does GFS work well with hybrid storage devices? In-Reply-To: <08A9A3213527A6428774900A80DBD8D802D9645A@xmb-sjc-222.amer.cisco.com> References: <08A9A3213527A6428774900A80DBD8D802D9645A@xmb-sjc-222.amer.cisco.com> Message-ID: <59830cd1ffe0e570568e4ce052bb2349@redhat.com> On Nov 3, 2006, at 3:48 PM, Lin Shen (lshen) wrote: > Will GFS work well in a cluster system that hosts a wide range of > storage > devices such as disk array, hard disk, flash and USB devices? Since > these > devices have very different performance and reliability behaviors, it > could be > a challenge for any CFS, right? Firstly, a cluster file system needs all nodes in the cluster to be able to see the storage if they are to mount the file system. So, flash and USB devices are not really suited to a cluster filesystem. 
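For readers who have not used GNBD, the export/import step referred to in this thread (and in the next paragraph) looks roughly like the sketch below; the device path, export name and server name are placeholders:

# on the node that owns the disk: start the server and export the device
gnbd_serv
gnbd_export -d /dev/sdb1 -e shared_disk

# on every other cluster node: load the module and import from the server
modprobe gnbd
gnbd_import -i storage-server
# the imported device then appears as /dev/gnbd/shared_disk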
You can have storage devices that are viewable by all cluster nodes and have vastly different characteristics (GNBD/iSCSI storage, FC disks, arrays, etc). GFS will work, but it doesn't mean it's good for performance. Taking it to the extreme, I could take a USB device, export it to all cluster nodes using GNBD and merge that with some fast array. I could then put GFS on that, but what kind of sense would that make? > I read from somewhere that GFS doesn't support multiple writers on the > same file > simutaneously. Is this true? > That's probably a mischaracterization of what GFS does. GFS does not allow multiple machines to update the metadata on the same file at the same time. It does allow multiple writers to the same file at the same time. This is no different than a single machine with two processes writing to the same file at the same time... you can't be sure of the contents of the file (unless you are doing application level locking), but you can be sure your file system won't be corrupted. brassow From tushar.patel at bofasecurities.com Mon Nov 13 18:04:47 2006 From: tushar.patel at bofasecurities.com (Patel, Tushar) Date: Mon, 13 Nov 2006 13:04:47 -0500 Subject: [Linux-cluster] AS 4.0 with GFS 6.1 cluster not working Message-ID: <5F08B160555AC946B5AB743B85FF406D07D60920@ex2k.bankofamerica.com> Hello, We have lot of AS 4.0 GFS 6.0 clusters in our firm working fine. However recently we are upgrading our clusters to GFS 6.1 >From what we have seen so far is once we upgraded to GFS 6.1 our failover test is failing. So far we have conducted failover tests using 2 scenario 1.) Manually reboot one of the node in cluster using HP-iLO interface sending fence_ilo command from one of the nodes - the command works fine and host does reboot. 2.) Pulling out network cables of one of the host in the cluster. We have 4 node cluster. Problem is with either of the above tests, gfs hangs and clustat starts reporting only member information. No service information. Clustat displays following message: "Timeout : Resource Manager not responding" The gfs just keeps hanging for untill we manually intervene and bring the halted node up. It seems fencing is not working or rgmanager is flawed and cannot process/parse information. Has anybody experienced this situation? Any remedy? -Thanks Tushar -------------- next part -------------- An HTML attachment was scrubbed... URL: From fabrizio.lippolis at aurigainformatica.it Tue Nov 14 15:44:52 2006 From: fabrizio.lippolis at aurigainformatica.it (Fabrizio Lippolis) Date: Tue, 14 Nov 2006 16:44:52 +0100 Subject: [Linux-cluster] occasional cluster crashes Message-ID: Hi list, some time ago I configured a two node cluster made of two HP servers and a HP disk array. The two nodes are connected each other by a crossover ethernet gigabit cable for heartbeat signals. The disk array is GFS formatted and connected by SCSI cables to both machines. The cluster is running MySQL, one of the machines runs the MySQL process at a time while the database files are on the disk array. I checked that if I kill the process, it will migrate on the second machine. From time to time I experience occasional lockups of one of the two machines, it doesn't happen very often and apparently without reason. The only solution in this case is to brutally switch off the machine and reboot. The problem started to be much more frequent when I tried to add another service to the cluster, a LDAP directory. The crashes happened sometimes more than once a day. 
I already wrote about this problem some time ago and somebody answered that it could be caused because of the connection of the nodes to the disk array. When a node is accessing the disk array the SCSI bus will prevent the other node from doing something. Can anybody confirm this? Does it mean this hardware architecture is not suitable for a cluster? In this case which architecture would you recommend to let a cluster like this work smoothly? Thank you in advance. Best regards, Fabrizio From mwill at penguincomputing.com Tue Nov 14 16:21:56 2006 From: mwill at penguincomputing.com (Michael Will) Date: Tue, 14 Nov 2006 08:21:56 -0800 Subject: [Linux-cluster] occasional cluster crashes Message-ID: <433093DF7AD7444DA65EFAFE3987879C245473@jellyfish.highlyscyld.com> Fibre channel instead of scsi -----Original Message----- From: Fabrizio Lippolis [mailto:fabrizio.lippolis at aurigainformatica.it] Sent: Tue Nov 14 07:51:19 2006 To: linux-cluster at redhat.com Subject: [Linux-cluster] occasional cluster crashes Hi list, some time ago I configured a two node cluster made of two HP servers and a HP disk array. The two nodes are connected each other by a crossover ethernet gigabit cable for heartbeat signals. The disk array is GFS formatted and connected by SCSI cables to both machines. The cluster is running MySQL, one of the machines runs the MySQL process at a time while the database files are on the disk array. I checked that if I kill the process, it will migrate on the second machine. From time to time I experience occasional lockups of one of the two machines, it doesn't happen very often and apparently without reason. The only solution in this case is to brutally switch off the machine and reboot. The problem started to be much more frequent when I tried to add another service to the cluster, a LDAP directory. The crashes happened sometimes more than once a day. I already wrote about this problem some time ago and somebody answered that it could be caused because of the connection of the nodes to the disk array. When a node is accessing the disk array the SCSI bus will prevent the other node from doing something. Can anybody confirm this? Does it mean this hardware architecture is not suitable for a cluster? In this case which architecture would you recommend to let a cluster like this work smoothly? Thank you in advance. Best regards, Fabrizio -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewxwang at yahoo.com.tw Tue Nov 14 17:13:51 2006 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Wed, 15 Nov 2006 01:13:51 +0800 (CST) Subject: [Linux-cluster] Three new open source modules relased to the Grid Engine project Message-ID: <20061114171351.95932.qmail@web18008.mail.tpe.yahoo.com> Now SGE gets Service Level Objectives/Service Level Agreement with the new components. See: http://gridengine.sunsource.net/project/gridengine/news/SuperComputing2006.html ------------------------------------------------------------------------------- Sun releases three new open source modules to the Grid Engine project SuperComputing 2006 - November 14, 2006 Sun is making a key contribution to advancing open source technologies by releasing three modules to the Grid Engine project. The new Grid Engine Service Domain Management module and two modules with an existing Sun IP taken from the N1 Grid Engine Product. 
These are the Grid Engine Accounting and Reporting Console and the Grid Engine Microsoft Windows submit and execution host support. Grid Engine is an ideal solution running on Solaris 10 leveraging Solaris' Dtrace monitors, eliminating bottlenecks for unmatched grid performance. Sun will demonstrate the Grid Engine Service Domain Management module that allows developers to integrate grid-enabled services which then can be deployed on a pool of shared resources. The Grid Engine Service Domain Management functionality will dynamically adjust the allocation of resources in order to meet Service Level Objectives. Reallocating a host resource to another service may include reprovisioning of the underlying virtual or actual operating system stack.1 The Grid Engine Accounting and Reporting Console is based upon standard technology such as Java[tm], SQL, JDBC. Providing source code access and thus opening up the interfaces will make Grid Engine's accounting and reporting infrastructure ideally suited for development projects in new areas opening up new possibilities for utility computing as a business model in grids operations. Furthermore, Sun has the intention to add the Sun proprietary Microsoft Windows submit and execution host support at the earliest feasible point in time to the Grid Engine open source project. http://gridengine.sunsource.net/ Andrew. ___________________________________________________ ??????? ? ???????????????? http://messenger.yahoo.com.tw/ From david at grootendorst.nu Tue Nov 14 15:53:18 2006 From: david at grootendorst.nu (David Grootendorst) Date: Tue, 14 Nov 2006 16:53:18 +0100 (CET) Subject: [Linux-cluster] cluster question Message-ID: <55159.217.140.15.102.1163519598.squirrel@www.grootendorst.nu> Redhat, Last week i installed a 3 node cluster at a bank in The Netherlands. I've got the folling problem: Start the 2 nodes. They form the cluster and node 3 is automaticly fenced on the brocade switch. If i want to join the cluster with node 3, i have to enable the 2 SAN ports on the SAN swiches and boot my machine again. ( No san... no data disks at boottime ). Is this normal or do i miss something? ( running the latest redhat software. ) David Grootendorst RHCT RHCE RHCA ( almost ) From rpeterso at redhat.com Tue Nov 14 19:47:11 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 14 Nov 2006 13:47:11 -0600 Subject: [Linux-cluster] cluster question In-Reply-To: <55159.217.140.15.102.1163519598.squirrel@www.grootendorst.nu> References: <55159.217.140.15.102.1163519598.squirrel@www.grootendorst.nu> Message-ID: <455A1D3F.2080904@redhat.com> David Grootendorst wrote: > Redhat, > > Last week i installed a 3 node cluster at a bank in The Netherlands. > I've got the folling problem: > Start the 2 nodes. > They form the cluster and node 3 is automaticly fenced on the brocade switch. > If i want to join the cluster with node 3, i have to enable the 2 SAN > ports on the SAN swiches and boot my machine again. ( No san... no data > disks at boottime ). > Is this normal or do i miss something? > ( running the latest redhat software. ) > > David Grootendorst > RHCT > RHCE > RHCA ( almost ) > Hi David, What normally is supposed to happen is that all three nodes are started at once. The first two nodes should wait for the third node to join the fence domain before they proceed. The only reason the third node would be fenced is if it hasn't reported in before the post_join_delay time specified in the cluster.conf file. 
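The startup window described here is tunable in cluster.conf; an illustrative fragment (the values are examples only, not recommendations) would be:

<!-- give slow-booting nodes more time to join the fence domain before
     they are fenced; 120 seconds is purely an example value -->
<fence_daemon post_join_delay="120" post_fail_delay="0"/>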
See: http://sources.redhat.com/cluster/faq.html#fence_startup I think the goal was to get the cluster in a known good fully-functional state at startup. Regards, Bob Peterson Red Hat Cluster Suite From jason at monsterjam.org Wed Nov 15 01:06:55 2006 From: jason at monsterjam.org (jason at monsterjam.org) Date: Tue, 14 Nov 2006 20:06:55 -0500 Subject: [Linux-cluster] failover questions after upgrade Message-ID: <20061115010655.GA18138@monsterjam.org> ok, upgraded my rpms to the following cman-kernheaders-2.6.9-45.8 cman-kernel-2.6.9-45.8 cman-1.0.11-0 cman-kernel-hugemem-2.6.9-45.8 cman-devel-1.0.11-0 cman-kernel-smp-2.6.9-45.8 GFS-6.1.6-1 GFS-kernheaders-2.6.9-60.3 GFS-kernel-smp-2.6.9-60.3 dlm-kernel-smp-2.6.9-44.3 dlm-devel-1.0.1-1 dlm-kernel-hugemem-2.6.9-44.3 dlm-kernheaders-2.6.9-44.3 dlm-1.0.1-1 dlm-kernel-2.6.9-44.3 magma-plugins-1.0.6-0 magma-devel-1.0.6-0 magma-debuginfo-1.0.6-0 magma-1.0.6-0 and when I reboot both servers of 2 node cluster, they come up fine.. [jason at tf2 ~]$ clustat Member Status: Quorate, Group Member Member Name State ID ------ ---- ----- -- tf1 Online 0x0000000000000001 tf2 Online 0x0000000000000002 Service Name Owner (Last) State ------- ---- ----- ------ ----- Apache Service tf1 started [jason at tf2 ~]$ when I reboot (shutdown -r now) tf1, tf2 never takes over [jason at tf2 ~]$ clustat Member Status: Quorate, Group Member Member Name State ID ------ ---- ----- -- tf2 Online 0x0000000000000002 Service Name Owner (Last) State ------- ---- ----- ------ ----- Apache Service ((null) ) failed [jason at tf2 ~]$ heres the logs from tf2: Nov 14 19:48:21 tf2 clurgmgrd[5345]: Logged in SG "usrm::manager" Nov 14 19:48:21 tf2 clurgmgrd[5345]: Magma Event: Membership Change Nov 14 19:48:21 tf2 clurgmgrd[5345]: State change: Local UP Nov 14 19:48:22 tf2 clurgmgrd[5345]: State change: tf1 UP Nov 14 19:48:25 tf2 snmpd[5195]: Got trap from peer on fd 13 Nov 14 19:48:44 tf2 kernel: process `omaws32' is using obsolete setsockopt SO_BSDCOMPAT Nov 14 19:48:58 tf2 Server Administrator: Storage Service EventID: 2164 See readme.txt for a list of validated controller driver versions. Nov 14 19:49:00 tf2 snmpd[5195]: Got trap from peer on fd 13 Nov 14 19:50:31 tf2 sshd(pam_unix)[6920]: session opened for user jason by (uid=0) Nov 14 19:51:03 tf2 sshd(pam_unix)[6951]: session opened for user jason by (uid=0) Nov 14 19:51:39 tf2 clurgmgrd[5345]: Magma Event: Membership Change Nov 14 19:51:39 tf2 clurgmgrd[5345]: State change: tf1 DOWN Nov 14 19:52:19 tf2 ntpd[4896]: synchronized to 193.162.159.97, stratum 2 Nov 14 19:52:19 tf2 ntpd[4896]: kernel time sync disabled 0041 Nov 14 19:52:28 tf2 kernel: e100: eth2: e100_watchdog: link down Nov 14 19:52:34 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats Nov 14 19:52:58 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex Nov 14 19:55:14 tf2 kernel: CMAN: node tf1 rejoining Nov 14 19:55:45 tf2 clurgmgrd[5345]: Magma Event: Membership Change Nov 14 19:55:45 tf2 clurgmgrd[5345]: State change: tf1 UP then when tf1 comes back up, my apache service doesnt come up correctly.. [jason at tf2 ~]$ clustat Member Status: Quorate, Group Member Member Name State ID ------ ---- ----- -- tf1 Online 0x0000000000000001 tf2 Online 0x0000000000000002 Service Name Owner (Last) State ------- ---- ----- ------ ----- Apache Service (tf1 ) failed [jason at tf2 ~]$ and I see this in the logs on tf1 as hes booting up. Nov 14 19:55:44 tf1 rhnsd[5445]: Red Hat Network Services Daemon starting up. 
Nov 14 19:55:44 tf1 rhnsd: rhnsd startup succeeded Nov 14 19:55:44 tf1 cups-config-daemon: cups-config-daemon startup succeeded Nov 14 19:55:44 tf1 haldaemon: haldaemon startup succeeded Nov 14 19:55:44 tf1 clurgmgrd[5488]: Loading Service Data Nov 14 19:55:44 tf1 rgmanager: clurgmgrd startup succeeded Nov 14 19:55:44 tf1 fstab-sync[5764]: removed all generated mount points Nov 14 19:55:45 tf1 clurgmgrd[5488]: Initializing Services Nov 14 19:55:45 tf1 fstab-sync[6152]: added mount point /media/cdrom for /dev/hda Nov 14 19:55:45 tf1 httpd: httpd shutdown failed Nov 14 19:55:45 tf1 clurgmgrd[5488]: stop on script "cluster_apache" returned 1 (generic error) Nov 14 19:55:45 tf1 clurgmgrd[5488]: Services Initialized Nov 14 19:55:45 tf1 clurgmgrd[5488]: Logged in SG "usrm::manager" Nov 14 19:55:45 tf1 clurgmgrd[5488]: Magma Event: Membership Change Nov 14 19:55:45 tf1 clurgmgrd[5488]: State change: Local UP Nov 14 19:55:46 tf1 fstab-sync[6465]: added mount point /media/floppy for /dev/fd0 Nov 14 19:55:46 tf1 clurgmgrd[5488]: State change: tf2 UP any suggestions? Jason From chawkins at veracitynetworks.com Wed Nov 15 05:47:04 2006 From: chawkins at veracitynetworks.com (Christopher Hawkins) Date: Wed, 15 Nov 2006 00:47:04 -0500 Subject: [Linux-cluster] fence_tool issue Message-ID: <200611150524.kAF5OXAA029718@mail2.ontariocreditcorp.com> Hello, I am working on a shared root gfs cluster and I'm having a problem I can't seem to figure out... It's just two nodes and I have gfs working fine when they are both running normally. I am trying to get one of them to fire up cluster services in an initramfs and pivot root into the gfs filesystem (which is running already on the other node), and I'm slightly stuck on fence_tool join. CCSD starts, cman_tool join works fine, and then fence tool errors out with waiting for ccs connection -111. The cluster is quorate, ccs_test fails, but a forced ccs_test works. cman_tool status tells all and seems to know everything it should. Networking is fine, including hostnames and all that. One thing I tried was going with ccsd -I -4 since I think that the ccsd.sock socket did not get created correctly... But no difference. Stracing fence_tool join doesn't help much, just basically says that it can't connect to the socket, which I sort of knew already. Any ideas as to how to verify the ccsd socket or how to make ccsd -I -4 work so that isn't necessary, or maybe some info on what fence_tool is trying to do here? I have not blown up all the sources from the Atix folks yet to see if they have something special going on in there, but I suppose that's next if no one knows the answer. Thanks! Chris From grimme at atix.de Wed Nov 15 06:06:27 2006 From: grimme at atix.de (Marc Grimme) Date: Wed, 15 Nov 2006 07:06:27 +0100 Subject: [Linux-cluster] fence_tool issue In-Reply-To: <200611150524.kAF5OXAA029718@mail2.ontariocreditcorp.com> References: <200611150524.kAF5OXAA029718@mail2.ontariocreditcorp.com> Message-ID: <200611150706.28994.grimme@atix.de> Hello Chris, if you like have a look at www.opensharedroot.org. This is a sharedroot cluster with gfs. Software and a Howto can be found there. Perhaps you can get some hints from there or just use the software. It should run pretty fine. Let me know if you have problems. Regards Marc. On Wednesday 15 November 2006 06:47, Christopher Hawkins wrote: > Hello, > > I am working on a shared root gfs cluster and I'm having a problem I can't > seem to figure out... 
It's just two nodes and I have gfs working fine when > they are both running normally. I am trying to get one of them to fire up > cluster services in an initramfs and pivot root into the gfs filesystem > (which is running already on the other node), and I'm slightly stuck on > fence_tool join. CCSD starts, cman_tool join works fine, and then fence > tool errors out with waiting for ccs connection -111. > > The cluster is quorate, ccs_test fails, but a forced ccs_test works. > cman_tool status tells all and seems to know everything it should. > Networking is fine, including hostnames and all that. One thing I tried was > going with ccsd -I -4 since I think that the ccsd.sock socket did not get > created correctly... But no difference. Stracing fence_tool join doesn't > help much, just basically says that it can't connect to the socket, which I > sort of knew already. Any ideas as to how to verify the ccsd socket or how > to make ccsd -I -4 work so that isn't necessary, or maybe some info on what > fence_tool is trying to do here? I have not blown up all the sources from > the Atix folks yet to see if they have something special going on in there, > but I suppose that's next if no one knows the answer. > Thanks! > Chris > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gruss / Regards, Marc Grimme Phone: +49-89 452 3538-14 http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX - Ges. fuer Informationstechnologie und Consulting mbH Einsteinstr. 10 - 85716 Unterschleissheim - Germany From mkpai at redhat.com Wed Nov 15 10:51:50 2006 From: mkpai at redhat.com (Pai) Date: Wed, 15 Nov 2006 16:21:50 +0530 Subject: [Linux-cluster] occasional cluster crashes In-Reply-To: References: Message-ID: <20061115105150.GB6938@mkpai.users.redhat.com> Hi Fabrizio, > > Does it mean this hardware architecture is not suitable for a cluster? > In this case which architecture would you recommend to let a cluster > like this work smoothly? Thank you in advance. > http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ch-hardware.html Alternatively, you could connect the storage to just one server and export the volumes using iSCSI. Regards, -- Pai From kpodesta at redbrick.dcu.ie Wed Nov 15 12:10:19 2006 From: kpodesta at redbrick.dcu.ie (Karl Podesta) Date: Wed, 15 Nov 2006 12:10:19 +0000 Subject: [Linux-cluster] RHCS 3 "could not connect to service manager" Message-ID: <20061115121019.GC18237@murphy.redbrick.dcu.ie> Hi folks, When we try to relocate a service between a basic 2-node RHCS 3 setup, we get the following error message: "Member p2 trying to relocate oracle to p1. msg-open: connection timed out. Could not connect to service manager." This also happens if a node in the cluster fails: the service does not relocate to the other node. There is only the one service on the cluster (oracle). The cluster is 2 x Dell 2850 with shared SAN storage, RHAS Update 5, kernel 2.4.21-32.EL SMP for x_64. What are the reasons that could cause this failure in relocation to happen? Many thanks for any thoughts or suggestions you have! Regards, Karl -- Karl Podesta Systems Engineer, Securelinx Ltd. 
http://www.securelinx.com/ From kpodesta at redbrick.dcu.ie Wed Nov 15 12:45:02 2006 From: kpodesta at redbrick.dcu.ie (Karl Podesta) Date: Wed, 15 Nov 2006 12:45:02 +0000 Subject: [Linux-cluster] RHCS 3 "could not connect to service manager" In-Reply-To: <20061115121019.GC18237@murphy.redbrick.dcu.ie> References: <20061115121019.GC18237@murphy.redbrick.dcu.ie> Message-ID: <20061115124502.GA3350@murphy.redbrick.dcu.ie> We have just discovered that the bonding is configured differently on each node (p1 is configured for load balancing, p2 is configured for active-backup). We're curious if this bonding configuration could be interfering with the service relocation... could it be a reason why the connection times out... something akin to: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=149867 (bonding contributing to a system hang after a period of time) Does this sound familiar to anyone? Has anyone encoutered anything like this in their experience? Many thanks! Karl On Wed, Nov 15, 2006 at 12:10:19PM +0000, Karl Podesta wrote: > Hi folks, > > When we try to relocate a service between a basic 2-node RHCS 3 setup, we > get the following error message: > > "Member p2 trying to relocate oracle to p1. msg-open: connection timed out. > Could not connect to service manager." > > This also happens if a node in the cluster fails: the service does not > relocate to the other node. There is only the one service on the cluster > (oracle). The cluster is 2 x Dell 2850 with shared SAN storage, > RHAS Update 5, kernel 2.4.21-32.EL SMP for x_64. > > What are the reasons that could cause this failure in relocation to happen? > > Many thanks for any thoughts or suggestions you have! > > Regards, > Karl > > -- > Karl Podesta > Systems Engineer, Securelinx Ltd. > http://www.securelinx.com/ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Hello. I'm Leonard Nimoy. The following tale of alien encounters is true. And by true, I mean false. It's all lies. But they're entertaining lies. And in the end, isn't that the real truth? The answer is: No. -- What was the question again?, "The Springfield Files" From fabrizio.lippolis at aurigainformatica.it Wed Nov 15 13:46:39 2006 From: fabrizio.lippolis at aurigainformatica.it (Fabrizio Lippolis) Date: Wed, 15 Nov 2006 14:46:39 +0100 Subject: [Linux-cluster] Re: occasional cluster crashes In-Reply-To: <20061115105150.GB6938@mkpai.users.redhat.com> References: <20061115105150.GB6938@mkpai.users.redhat.com> Message-ID: Pai ha scritto: > http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ch-hardware.html Pai, thank you for your answer, from this documentation looks like fibre channel is necessary to ensure data integrity. > Alternatively, you could connect the storage to just one server and > export the volumes using iSCSI. In this case, should the server connected to the storage fail all the services on the cluster will be unavailable, won't they? Regards, Fabrizio From lhh at redhat.com Wed Nov 15 16:06:09 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 11:06:09 -0500 Subject: [Linux-cluster] Redhat cluster suite 4 Update 4 In-Reply-To: <4e3b15f00611122213w7b002a38u25d15bf3cb74dff3@mail.gmail.com> References: <4e3b15f00611122213w7b002a38u25d15bf3cb74dff3@mail.gmail.com> Message-ID: <1163606770.15754.48.camel@rei.boston.devel.redhat.com> On Mon, 2006-11-13 at 11:43 +0530, Ganesh B wrote: > Dear All, > > Configured Cluster suite 4 update 4 version. 
> > I am able to browse my web page through the virtual IP. > > When we run clustat the output looks like below. > > On webserver1:#clustat > webserver1 Online,local,rgmanager > webserver2 offline > > service: httpdeservice started. > On webserver2:#clustat > webserver1 offline. > webserver2 Online,local,rgmanager > > What is the problem? Can you attach /proc/cluster/services from both nodes? -- Lon From lhh at redhat.com Wed Nov 15 16:07:22 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 11:07:22 -0500 Subject: [Linux-cluster] Fence device configuration In-Reply-To: <4e3b15f00611130336v47fc346oc320764dfbaa768f@mail.gmail.com> References: <4e3b15f00611130336v47fc346oc320764dfbaa768f@mail.gmail.com> Message-ID: <1163606843.15754.50.camel@rei.boston.devel.redhat.com> On Mon, 2006-11-13 at 17:06 +0530, Ganesh B wrote: > We had 2 servers connected with common SAN storage. > Do we really require a fencing device? We had skipped the fence device > configuration. > *Is there any problem because of that?* Yes, because in a two node cluster, fencing fails if you don't have any fencing configured. This could also explain your weird case where both nodes think the other is dead. -- Lon From lhh at redhat.com Wed Nov 15 16:18:41 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 11:18:41 -0500 Subject: [Linux-cluster] occasional cluster crashes In-Reply-To: References: Message-ID: <1163607521.15754.62.camel@rei.boston.devel.redhat.com> On Tue, 2006-11-14 at 16:44 +0100, Fabrizio Lippolis wrote: > The cluster is running MySQL, one of the machines runs the MySQL process > at a time while the database files are on the disk array. I checked that > if I kill the process, it will migrate to the second machine. From time > to time I experience occasional lockups of one of the two machines, it > doesn't happen very often and apparently without reason. The only > solution in this case is to brutally switch off the machine and reboot. > The problem started to be much more frequent when I tried to add another > service to the cluster, an LDAP directory. The crashes happened sometimes > more than once a day. :o The only problems I'm aware of related to cluster service counts are performance related (rgmanager used to slow down a lot with more services), and only on pre-U4 versions. > I already wrote about this problem some time ago and somebody answered > that it could be caused because of the connection of the nodes to the > disk array. When a node is accessing the disk array the SCSI bus will > prevent the other node from doing something. Can anybody confirm this? That's very array dependent and I don't know much about how arrays work. Even so, I do not think it should cause a lockup; unless there's some kernel bug that it exposes. Do they crash (panic), or do they just become totally unresponsive? Have you tried getting a stack trace from the console using sysrq? (echo 1 > /proc/sys/kernel/sysrq; then hit alt-sysrq-t from the console). One thing that's peculiar is that - if they are locking up, they have to be locking up at about the same time -- otherwise, one would fence the other, and life would go on.
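For reference, those sysrq steps amount to something like the following (a minimal sketch; whether /proc/sysrq-trigger exists depends on the kernel build, and the traces land on the console and in the kernel log):

# Enable the magic sysrq key (as root, on the affected node)
echo 1 > /proc/sys/kernel/sysrq
# Then press Alt-SysRq-T on the console, or trigger the same task dump
# without a keyboard if the kernel provides /proc/sysrq-trigger:
echo t > /proc/sysrq-trigger
# The stack traces go to the kernel ring buffer; save them with:
dmesg > /tmp/sysrq-tasks.txt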
-- Lon From lhh at redhat.com Wed Nov 15 16:59:35 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 11:59:35 -0500 Subject: [Linux-cluster] AS 4.0 with GFS 6.1 cluster not working In-Reply-To: <5F08B160555AC946B5AB743B85FF406D07D60920@ex2k.bankofamerica.com> References: <5F08B160555AC946B5AB743B85FF406D07D60920@ex2k.bankofamerica.com> Message-ID: <1163609975.15754.67.camel@rei.boston.devel.redhat.com> On Mon, 2006-11-13 at 13:04 -0500, Patel, Tushar wrote: > > Hello, > > We have lot of AS 4.0 GFS 6.0 clusters in our firm working fine. > However recently we are upgrading our clusters to GFS 6.1 > > From what we have seen so far is once we upgraded to GFS 6.1 our > failover test is failing. > > So far we have conducted failover tests using 2 scenario > 1.) Manually reboot one of the node in cluster using HP-iLO interface > sending fence_ilo command from one of the nodes - the command works > fine and host does reboot. > 2.) Pulling out network cables of one of the host in the cluster. > > We have 4 node cluster. > > Problem is with either of the above tests, gfs hangs and clustat > starts reporting only member information. No service information. > Clustat displays following message: > "Timeout : Resource Manager not responding" > > The gfs just keeps hanging for untill we manually intervene and bring > the halted node up. > > It seems fencing is not working or rgmanager is flawed and cannot > process/parse information. If fencing breaks, rgmanager will hang... Look at /proc/cluster/services -- if you see the fence domain in the 'recover' state, fencing needs to be fixed. -- Lon From lhh at redhat.com Wed Nov 15 17:02:23 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 12:02:23 -0500 Subject: [Linux-cluster] failover questions after upgrade In-Reply-To: <20061115010655.GA18138@monsterjam.org> References: <20061115010655.GA18138@monsterjam.org> Message-ID: <1163610143.15754.71.camel@rei.boston.devel.redhat.com> On Tue, 2006-11-14 at 20:06 -0500, jason at monsterjam.org wrote: > > and when I reboot both servers of 2 node cluster, they come up fine.. > [jason at tf2 ~]$ clustat > Member Status: Quorate, Group Member > > Member Name State ID > ------ ---- ----- -- > tf1 Online 0x0000000000000001 > tf2 Online 0x0000000000000002 > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > Apache Service tf1 started > [jason at tf2 ~]$ > > when I reboot (shutdown -r now) tf1, > tf2 never takes over > > [jason at tf2 ~]$ clustat > Member Status: Quorate, Group Member > > Member Name State ID > ------ ---- ----- -- > tf2 Online 0x0000000000000002 > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > Apache Service ((null) ) failed > [jason at tf2 ~]$ > > heres the logs from tf2: > > Nov 14 19:48:21 tf2 clurgmgrd[5345]: Logged in SG "usrm::manager" > Nov 14 19:48:21 tf2 clurgmgrd[5345]: Magma Event: Membership Change > Nov 14 19:48:21 tf2 clurgmgrd[5345]: State change: Local UP > Nov 14 19:48:22 tf2 clurgmgrd[5345]: State change: tf1 UP > Nov 14 19:48:25 tf2 snmpd[5195]: Got trap from peer on fd 13 > Nov 14 19:48:44 tf2 kernel: process `omaws32' is using obsolete setsockopt SO_BSDCOMPAT > Nov 14 19:48:58 tf2 Server Administrator: Storage Service EventID: 2164 See readme.txt for a list > of validated controller driver versions. 
> Nov 14 19:49:00 tf2 snmpd[5195]: Got trap from peer on fd 13 > Nov 14 19:50:31 tf2 sshd(pam_unix)[6920]: session opened for user jason by (uid=0) > Nov 14 19:51:03 tf2 sshd(pam_unix)[6951]: session opened for user jason by (uid=0) > > Nov 14 19:51:39 tf2 clurgmgrd[5345]: Magma Event: Membership Change > Nov 14 19:51:39 tf2 clurgmgrd[5345]: State change: tf1 DOWN > Nov 14 19:52:19 tf2 ntpd[4896]: synchronized to 193.162.159.97, stratum 2 > Nov 14 19:52:19 tf2 ntpd[4896]: kernel time sync disabled 0041 > Nov 14 19:52:28 tf2 kernel: e100: eth2: e100_watchdog: link down > Nov 14 19:52:34 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats > Nov 14 19:52:58 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex > Nov 14 19:55:14 tf2 kernel: CMAN: node tf1 rejoining > Nov 14 19:55:45 tf2 clurgmgrd[5345]: Magma Event: Membership Change > Nov 14 19:55:45 tf2 clurgmgrd[5345]: State change: tf1 UP > > > then when tf1 comes back up, my apache service doesnt come up correctly.. > > [jason at tf2 ~]$ clustat > Member Status: Quorate, Group Member > > Member Name State ID > ------ ---- ----- -- > tf1 Online 0x0000000000000001 > tf2 Online 0x0000000000000002 > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > Apache Service (tf1 ) failed > [jason at tf2 ~]$ > > > and I see this in the logs on tf1 as hes booting up. > Nov 14 19:55:44 tf1 rhnsd[5445]: Red Hat Network Services Daemon starting up. > Nov 14 19:55:44 tf1 rhnsd: rhnsd startup succeeded > Nov 14 19:55:44 tf1 cups-config-daemon: cups-config-daemon startup succeeded > Nov 14 19:55:44 tf1 haldaemon: haldaemon startup succeeded > Nov 14 19:55:44 tf1 clurgmgrd[5488]: Loading Service Data > Nov 14 19:55:44 tf1 rgmanager: clurgmgrd startup succeeded > Nov 14 19:55:44 tf1 fstab-sync[5764]: removed all generated mount points > Nov 14 19:55:45 tf1 clurgmgrd[5488]: Initializing Services > Nov 14 19:55:45 tf1 fstab-sync[6152]: added mount point /media/cdrom for /dev/hda > Nov 14 19:55:45 tf1 httpd: httpd shutdown failed > Nov 14 19:55:45 tf1 clurgmgrd[5488]: stop on script "cluster_apache" returned 1 (generic > error) > Nov 14 19:55:45 tf1 clurgmgrd[5488]: Services Initialized > Nov 14 19:55:45 tf1 clurgmgrd[5488]: Logged in SG "usrm::manager" > Nov 14 19:55:45 tf1 clurgmgrd[5488]: Magma Event: Membership Change > Nov 14 19:55:45 tf1 clurgmgrd[5488]: State change: Local UP > Nov 14 19:55:46 tf1 fstab-sync[6465]: added mount point /media/floppy for /dev/fd0 > Nov 14 19:55:46 tf1 clurgmgrd[5488]: State change: tf2 UP > > any suggestions? > http://sources.redhat.com/cluster/faq.html#rgm_wontrestart The init script probably is returning 1 for stop-after-stop (or stop-when-stopped), when it should be returning 0. 
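To illustrate the point (a sketch only, not the actual Red Hat patch; the service name and pid-file path are placeholders), a stop function that treats "already stopped" as success looks roughly like this:

# /etc/init.d/myservice (fragment) -- hypothetical example
. /etc/init.d/functions
stop() {
        # stop-when-stopped must return 0, otherwise rgmanager flags the
        # service as failed during its initial stop pass at startup
        if [ ! -f /var/run/myservice.pid ]; then
                echo "myservice is already stopped"
                return 0
        fi
        killproc myservice      # helper from /etc/init.d/functions
        RETVAL=$?
        [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/myservice
        return $RETVAL
}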
This is a bug in the initscripts package, and here's a patch to /etc/init.d/functions to make httpd work normally: https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998 -- Lon From lhh at redhat.com Wed Nov 15 17:11:00 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 12:11:00 -0500 Subject: [Linux-cluster] RHCS 3 "could not connect to service manager" In-Reply-To: <20061115124502.GA3350@murphy.redbrick.dcu.ie> References: <20061115121019.GC18237@murphy.redbrick.dcu.ie> <20061115124502.GA3350@murphy.redbrick.dcu.ie> Message-ID: <1163610660.15754.75.camel@rei.boston.devel.redhat.com> On Wed, 2006-11-15 at 12:45 +0000, Karl Podesta wrote: > We have just discovered that the bonding is configured differently on > each node (p1 is configured for load balancing, p2 is configured for > active-backup). We're curious if this bonding configuration could be > interfering with the service relocation... could it be a reason why > the connection times out... something akin to: > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=149867 > (bonding contributing to a system hang after a period of time) > > Does this sound familiar to anyone? Has anyone encoutered anything like > this in their experience? It doesn't sound familiar, but the easiest thing to do now is to first try *without* bonding, then try again with it (both nodes in active-backup mode). -- Lon From lhh at redhat.com Wed Nov 15 17:16:23 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 15 Nov 2006 12:16:23 -0500 Subject: [Linux-cluster] Re: occasional cluster crashes In-Reply-To: References: <20061115105150.GB6938@mkpai.users.redhat.com> Message-ID: <1163610983.15754.82.camel@rei.boston.devel.redhat.com> On Wed, 2006-11-15 at 14:46 +0100, Fabrizio Lippolis wrote: > Pai ha scritto: > > > http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ch-hardware.html > > Pai, > > thank you for your answer, from this documentation looks like fibre > channel is necessary to ensure data integrity. You can use iSCSI as well. Some multi-port SCSI arrays with multiple buses have been known to work, but the list is pretty slim for the known-working ones (I know that the Winchester Flashdisk G7 is known to work, as are most Dot Hill SCSI arrays). > > Alternatively, you could connect the storage to just one server and > > export the volumes using iSCSI. > > In this case, should the server connected to the storage fail all the > services on the cluster will be unavailable, won't they? Yes. -- Lon From kpodesta at redbrick.dcu.ie Wed Nov 15 17:29:53 2006 From: kpodesta at redbrick.dcu.ie (Karl Podesta) Date: Wed, 15 Nov 2006 17:29:53 +0000 Subject: [Linux-cluster] RHCS 3 "could not connect to service manager" In-Reply-To: <1163610660.15754.75.camel@rei.boston.devel.redhat.com> References: <20061115121019.GC18237@murphy.redbrick.dcu.ie> <20061115124502.GA3350@murphy.redbrick.dcu.ie> <1163610660.15754.75.camel@rei.boston.devel.redhat.com> Message-ID: <20061115172953.GB7952@murphy.redbrick.dcu.ie> On Wed, Nov 15, 2006 at 12:11:00PM -0500, Lon Hohberger wrote: > > Does this sound familiar to anyone? Has anyone encoutered anything like > > this in their experience? > > It doesn't sound familiar, but the easiest thing to do now is to first > try *without* bonding, then try again with it (both nodes in > active-backup mode). > > -- Lon Thanks for the advice Lon - unfortunately the system is in production so we'll have to wait for an outage window before we can try it. 
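For reference, active-backup bonding on a node of this vintage is usually set up along these lines (illustrative only: the device names, address, and paths are assumptions, and on RHEL 3 the module options go in /etc/modules.conf rather than /etc/modprobe.conf):

# mode=1 is the numeric form of active-backup; miimon enables link monitoring
cat >> /etc/modprobe.conf <<'EOF'
alias bond0 bonding
options bond0 mode=1 miimon=100
EOF

cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
IPADDR=192.168.0.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
EOF

# Make both NICs slaves of bond0
for nic in eth0 eth1; do
cat > /etc/sysconfig/network-scripts/ifcfg-$nic <<EOF
DEVICE=$nic
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
EOF
done
# Then restart networking during the outage window: service network restart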
However we have tried to simulate the setup using VMWare, and with one node using load balancing for bonding, we can reproduce the error ("msg-open: connection timed out. Could not connect to service manager") by disabling one of the NICs in the bond and trying to relocate the service. When we do this, 50% of packets get through (i.e. load balancing is working and we can ping the other node), but the service fails to relocate with the above error. When we have both NICs enabled, 100% of packets get through, and service relocation works fine. So this seems to establish that network activity/problems can disrupt the relocation of services if one of the nodes is using load balancing on its network bonding. Sound reasonable? We'll wait for an opportunity in the next few days to apply active-backup to the bonding, but if anyone has any other musings in the meantime it would be great to hear them of course. Thanks a lot! Karl -- Karl Podesta Systems Engineer, Securelinx Ltd. http://www.securelinx.com/ From marcos.david at efacec.pt Wed Nov 15 17:31:14 2006 From: marcos.david at efacec.pt (Marcos Gil Ferreira David) Date: Wed, 15 Nov 2006 17:31:14 -0000 Subject: [Linux-cluster] ccs_test tool Message-ID: Hello, I have a two-node cluster with several services running on it. Each of these services needs an ip address that is configured in the cluster (a Cluster IP Address). My problem is that in order for some of our applications to function properly I need to know the ip address that is assigned to a given service. Example: Extracted from /etc/cluster/cluster.conf: ....
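(The cluster.conf extract itself was scrubbed by the list archive. As a purely illustrative sketch, with an invented service name and a simple inline layout rather than the poster's actual configuration, the address attribute of a service's ip resource can be pulled out of cluster.conf from a shell script roughly like this:)

#!/bin/sh
# Hypothetical helper: print the address="..." value of the ip resource
# defined inside the <service name="..."> block of cluster.conf.
# Assumes the ip resource is declared inline in the service block; adjust
# if resources live under <resources> and are pulled in by ref=.
CONF=/etc/cluster/cluster.conf
SVC=${1:?usage: $0 <service-name>}
sed -n "/<service[^>]*name=\"$SVC\"/,/<\/service>/p" "$CONF" |
        grep -o 'address="[^"]*"' | head -1 | cut -d'"' -f2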