From rodgersr at yahoo.com Sun Oct 1 02:52:03 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sat, 30 Sep 2006 19:52:03 -0700 (PDT) Subject: [Linux-cluster] ip-tiebreaker quotes need explainging Message-ID: <20061001025203.76527.qmail@web34209.mail.mud.yahoo.com> Can anyone explain why this is so. Why is it only used on maintaining qourum and not startup? "The IP tiebreaker is typically used to *maintain* a quorum after a node failure, because there are certain network faults in which two nodes may see the tiebreaker - but not each other. -- Lon" -------------- next part -------------- An HTML attachment was scrubbed... URL: From andre at hudat.com Sun Oct 1 19:13:44 2006 From: andre at hudat.com (andre at hudat.com) Date: Sun, 1 Oct 2006 15:13:44 -0400 Subject: [Linux-cluster] Panic Message-ID: <003401c6e58d$b71dc9d0$0245450a@AndreLaptop> I have the following panic on two nodes hours apart. Each node Is in a different state ( as in states of the US ). NO I am not running a cluster over a WAN, just two separate clusters in two different locations. Files are written on one cluster and I have a script that does an SCP of the file to the other cluster. Both machines running the latest RHEL4 with the latest GFS updates. This just started happening. Happened twice since Friday morning. Any hints ? What is happening with clvmd here ? What does the global conflict message mean ? -- Andre Oct 1 13:26:47 fs1.fl.apexrad.com kernel: purged 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd mark waiting requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd marked 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd recover event 5 done Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd move flags 0,0,1 ids 2,5,5 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd process held requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd processed 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd resend marked requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd resent 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: clvmd recover event 5 finished Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 total nodes 1 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 rebuild resource directory Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 rebuilt 0 resources Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 recover event 4 done Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 move flags 0,0,1 ids 0,4,4 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 process held requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 processed 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 recover event 4 finished Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 move flags 1,0,0 ids 4,4,4 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 move flags 0,1,0 ids 4,7,4 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 move use event 7 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 recover event 7 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 add node 2 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 total nodes 2 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 rebuild resource directory Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 rebuilt 6 resources Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 purge requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 purged 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 mark waiting requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 marked 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 recover event 7 
done Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 move flags 0,0,1 ids 4,7,7 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 process held requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 processed 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 resend marked requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 resent 0 requests Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01 recover event 7 finished Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 245-253 ex 1 own 4158637196, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 254-26c ex 1 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 26d-27b ex 1 own 4158637196, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 27c-28b ex 1 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 28c-29b ex 1 own 4158637196, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 29c-2ac ex 1 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 2ad-2b9 ex 1 own 4158637196, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 2ba-2c7 ex 1 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-ff ex 0 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 c8-2c7 ex 0 own 4158638348, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 200-2c7 ex 0 own 4158638348, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1 ex 0 own 4101191756, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158638828, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-2c7 ex 0 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-fff ex 0 own 4158638348, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 2c8-fff ex 0 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158638348, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-3f ex 0 own 4158636236, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158638348, pid 444u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-fff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 70000-7ffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 80000-8ffff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 90000-9ffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 a0000-affff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 b0000-bffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 c0000-cffff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 d0000-dffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 e0000-effff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 f0000-fffff 
ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 100000-10ffff ex 1 own 4101191756, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 110000-11ffff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 120000-12ffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 130000-13ffff ex 1 own 4158638828, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 140000-14ffff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 150000-15ffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 160000-16ffff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 170000-17ffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 180000-18ffff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 190000-19ffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1a0000-1affff ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1b0000-1bffff ex 1 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c0000-1c2aa7 ex 1 own 4158636236, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-ff ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 200-2ff ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c28a8-1c2aa7 ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c26da-1c27d9 ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 44b-54a ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c0ba5-1c0ca4 ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c0780-1c087f ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c12a8-1c13a7 ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c277d-1c287c ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c276a-1c2869 ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c10eb-1c11ea ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c04-1d03 ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 fe00-ffff ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1 ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-fff ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 1c1aa8-1c2aa7 ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-fff ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158638348, pid 296u Oct 1 
13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-3f ex 0 own 4158637196, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 4424 global conflict 0 0-1ff ex 0 own 4158638348, pid 296u Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lock_dlm: Assertion failed on line 428 of file /usr/src/build/765787-i686/BUIL D/gfs-kernel-2.6.9-58/smp/src/dlm/lock.c Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lock_dlm: assertion: "!error" Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lock_dlm: time = 185852977 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: lvol01: num=2,684f0dd err=-22 cur=3 req=5 lkf=44 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Oct 1 13:26:47 fs1.fl.apexrad.com kernel: ------------[ cut here ]------------ Oct 1 13:26:47 fs1.fl.apexrad.com kernel: kernel BUG at /usr/src/build/765787-i686/BUILD/gfs-kernel-2.6.9-58/smp/src/dlm/ lock.c:428! Oct 1 13:26:47 fs1.fl.apexrad.com kernel: invalid operand: 0000 [#1] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: SMP Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Modules linked in: nfs nfsd exportfs lockd nfs_acl autofs4 i2c_dev i2c_core loc k_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc dm_mirror button battery ac uhci_hcd ehci_hcd hw_random e10 00 floppy sg ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod Oct 1 13:26:47 fs1.fl.apexrad.com kernel: CPU: 0 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: EIP: 0060:[] Not tainted VLI Oct 1 13:26:47 fs1.fl.apexrad.com kernel: EFLAGS: 00010246 (2.6.9-42.ELsmp) Oct 1 13:26:47 fs1.fl.apexrad.com kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: eax: 00000001 ebx: ffffffea ecx: d18c5dc0 edx: f8dfc221 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: esi: f8df7798 edi: c387e600 ebp: e194f780 esp: d18c5dbc Oct 1 13:26:47 fs1.fl.apexrad.com kernel: ds: 007b es: 007b ss: 0068 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Process rmdir (pid: 23174, threadinfo=d18c5000 task=f64f0b30) Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Stack: f8dfc221 20202020 32202020 20202020 20202020 34383620 64643066 32200018 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: 20202020 e194f780 00000001 00000003 e194f780 f8df7828 00000005 f8dff940 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: f8919000 f8eba936 00000000 00000001 d16c1dd4 d16c1db8 f8919000 f8eb08fe Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Call Trace: Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] lm_dlm_lock+0x49/0x52 [lock_dlm] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] gfs_lm_lock+0x35/0x4d [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] gfs_glock_xmote_th+0x130/0x172 [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] rq_promote+0xc8/0x147 [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] run_queue+0x91/0xc1 [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] gfs_glock_nq+0xcf/0x116 [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] nq_m_sync+0x44/0x64 [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] gfs_glock_nq_m+0x149/0x15d [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] gfs_rmdir+0x6a/0x168 [gfs] Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] vfs_rmdir+0x1a3/0x1f1 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] sys_rmdir+0xa1/0xf4 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] do_page_fault+0x0/0x5c6 Oct 1 13:26:47 fs1.fl.apexrad.com kernel: [] syscall_call+0x7/0xb Oct 1 13:26:47 fs1.fl.apexrad.com kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 4c c3 df f 8 e8 32 b1 32 c7 83 c4 38 68 21 c2 df f8 e8 25 b1 32 c7 <0f> 0b ac 01 5e c1 df f8 68 23 c2 df f8 e8 e0 a8 32 c7 83 c4 20 
Oct 1 13:26:47 fs1.fl.apexrad.com kernel: <0>Fatal exception: panic in 5 seconds Oct 1 13:49:59 fs1.fl.apexrad.com syslogd 1.4.1: restart. From mnapolis at redhat.com Mon Oct 2 00:12:14 2006 From: mnapolis at redhat.com (Isauro Michael Napolis) Date: Mon, 02 Oct 2006 10:12:14 +1000 Subject: [Linux-cluster] ip-tiebreaker quotes need explainging In-Reply-To: <20061001025203.76527.qmail@web34209.mail.mud.yahoo.com> References: <20061001025203.76527.qmail@web34209.mail.mud.yahoo.com> Message-ID: <1159747934.4661.4.camel@localhost.localdomain> Hi, The ip tiebreaker is primarily used during network split-brain situations in a even numbered (2,4, etc.) cluster. The ip tiebreaker provides the extra vote (to form a quorum (50% + 1)) to determine who should be the next/ new master. cheers, Michael On Sun, 2006-10-01 at 12:52, Rick Rodgers wrote: > Can anyone explain why this is so. Why is it only used on maintaining > qourum and not startup? > > > "The IP tiebreaker is typically used to *maintain* a quorum after a node > failure, because there are certain network faults in which two nodes may > see the tiebreaker - but not each other. > -- Lon" > > > > ______________________________________________________________________ > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Mon Oct 2 00:14:54 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Sun, 01 Oct 2006 19:14:54 -0500 Subject: [Linux-cluster] Panic In-Reply-To: <003401c6e58d$b71dc9d0$0245450a@AndreLaptop> References: <003401c6e58d$b71dc9d0$0245450a@AndreLaptop> Message-ID: <452059FE.3030104@redhat.com> andre at hudat.com wrote: > I have the following panic on two nodes hours apart. Each node Is in a > different state ( as in states of the US ). NO I am not running a cluster > over a WAN, just two separate clusters in two different locations. Files are > written on one cluster and I have a script that does an SCP of the file to > the other cluster. Both machines running the latest RHEL4 with the latest > GFS updates. This just started happening. Happened twice since Friday > morning. Any hints ? What is happening with clvmd here ? What does the > global conflict message mean ? > > -- > Andre > Hi Andre, This might be the same as bugzilla bug 208134. See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208134. There is a patch to try with the bugzilla. Regards, Bob Peterson Red Hat Cluster Suite From andre at hudat.com Mon Oct 2 02:24:51 2006 From: andre at hudat.com (Andre Henry) Date: Sun, 1 Oct 2006 22:24:51 -0400 Subject: [Linux-cluster] Panic In-Reply-To: <452059FE.3030104@redhat.com> Message-ID: <003501c6e5c9$f11dd850$0245450a@AndreLaptop> Says I am not authorized. Even after creating an account. Can I access bugzilla from my rhn account ? -- Thanks Andre > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Robert Peterson > Sent: Sunday, October 01, 2006 8:15 PM > To: linux clustering > Subject: Re: [Linux-cluster] Panic > > andre at hudat.com wrote: > > I have the following panic on two nodes hours apart. Each node Is in a > > different state ( as in states of the US ). NO I am not running a > cluster > > over a WAN, just two separate clusters in two different locations. Files > are > > written on one cluster and I have a script that does an SCP of the file > to > > the other cluster. 
Both machines running the latest RHEL4 with the > latest > > GFS updates. This just started happening. Happened twice since Friday > > morning. Any hints ? What is happening with clvmd here ? What does the > > global conflict message mean ? > > > > -- > > Andre > > > Hi Andre, > > This might be the same as bugzilla bug 208134. > See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208134. > There is a patch to try with the bugzilla. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rodgersr at yahoo.com Mon Oct 2 03:52:38 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 1 Oct 2006 20:52:38 -0700 (PDT) Subject: [Linux-cluster] ip-tiebreaker quotes need explainging Message-ID: <20061002035238.66323.qmail@web34203.mail.mud.yahoo.com> Thanks for replying to me especically on a Sunday. I was familiar with everything you said. The question concerned not being able to use the tiebreaker to form quarum on STARTUP. Why is this? Also Lon sent me a URL to a doc I need but it seems the URL is stale. Can you help me? Here is the URL: http://people.redhat.com/lhh/rhcm-3-internals.odt Rick ----- Original Message ---- From: Isauro Michael Napolis To: linux clustering Sent: Sunday, October 1, 2006 5:12:14 PM Subject: Re: [Linux-cluster] ip-tiebreaker quotes need explainging Hi, The ip tiebreaker is primarily used during network split-brain situations in a even numbered (2,4, etc.) cluster. The ip tiebreaker provides the extra vote (to form a quorum (50% + 1)) to determine who should be the next/ new master. cheers, Michael On Sun, 2006-10-01 at 12:52, Rick Rodgers wrote: > Can anyone explain why this is so. Why is it only used on maintaining > qourum and not startup? > > > "The IP tiebreaker is typically used to *maintain* a quorum after a node > failure, because there are certain network faults in which two nodes may > see the tiebreaker - but not each other. > -- Lon" > > > > ______________________________________________________________________ > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Mon Oct 2 03:59:42 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 1 Oct 2006 20:59:42 -0700 (PDT) Subject: [Linux-cluster] How does disk prevent split brain? Message-ID: <20061002035942.52252.qmail@web34213.mail.mud.yahoo.com> According to RH documentation and the engineers you can have a safe two node cluster without having to have an ip-tiebreaker. They say it will use the disk tie-breaker. How is this possible? Since clulockd is running separately on each of the nodes how does it prevent each node from accessing the disk at the same time and declaring himself the Active node? If it is through some sort of locking mechanism that insures this can you please be technically specific in what it is? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rodgersr at yahoo.com Mon Oct 2 05:12:28 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 1 Oct 2006 22:12:28 -0700 (PDT) Subject: [Linux-cluster] RH documentation and RH Engineering do not agree Message-ID: <20061002051228.1904.qmail@web34209.mail.mud.yahoo.com> I am using a two node cluster without a tiebreaker and find that the documentation that RH provides and some of the technical info provided by RH engineering folks do not agree with each other. Redhaat docs say that if all netowrking fails that the nodes will not failover because they still have the disk to act as heartbeat. Also the docs say that the services will continue. Yet, I tested theis and it does not aahhpen this way. The system will get stonithed. So what is the the real story here? Below are the docs I am talking about. Below are two excerpts from RH documentation that says the following about loss of network connections in a two node cluster: ---------------------------------------------------------------------------- Total Network Connection Failure A total network connection failure occurs when all the heartbeat network connections between the systems fail. This can be caused by one of the following: All the heartbeat network cables are disconnected from a system. All the serial connections and network interfaces used for heartbeat communication fail. If a total network connection failure occurs, both systems detect the problem, but they also detect that the SCSI disk connections are still active. Therefore, services remain running on the systems and are not interrupted. --------------------------------------------------------------------------------- From a RH FAQ list ---------------------------------------------------------------------------------- E.4. Common Behaviors: Two Member Cluster with Disk-basedTie-breaker Loss of network connectivity to other member, shared media still accessible Common Causes: Network connectivity lost. Test Case: Disconnect all network cables from a member. Expected Behavior: No fail-over unless disk updates are also lost. Services will not be able to be relocated in most cases, which is due to the fact that the lock server requires network connectivity. ---------------------------------------------------------------------- However this does not seem to be the case. The systems stop the service or get STONITHed. Below is some info form this message board with a reply from RH engineering that seems to confirm that the nodes will get STONITHed. THis is followed by more RH engineering that conform to the RH docs. ????/ RE: [Linux-cluster] Tiebreaker IP ------------------------------------------------------------------------ * /From/: * /To/: * /Subject/: RE: [Linux-cluster] Tiebreaker IP * /Date/: Fri, 26 Aug 2005 13:24:39 -0500 ------------------------------------------------------------------------ Rob, Heres a summary of what I have observed with this configuration. You may want to verify the accuracy of my observations on your own. Starting with RHEL3, the RHCS verified node membership via a network heartbeat rather than/in addition to a disk timestamp. The network heartbeat traffic moves over the same interface that is used to access the network resources. This means that there is no dedicated heartbeat interface like you would see in a microsoft cluster. The tiebreaker IP is used to prevent a split brain situation in a a cluster with an even number of nodes. Lets say you have 2 active cluster nodes... 
say nodeA and nodeB, and nodeA owns an NFS disk resource and IP. Then lets say nodeA fails to receive a heartbeat from nodeB over its primary interface. This could mean several things: nodeA's interface is down, nodeB's interface is down, or their shared switch is down. So if nodeA and nodeB stop communicating with eachother, they will both try to ping the tiebreaker IP, which is usually your default gateway IP. If nodeA gets a response from the tiebreaker IP, it will continue to own the resource. If it cant, it will assume its external interface is down and fence/reboot itself. The same holds true for nodeB. Unlike RHEL2.1 which used STONITH, RHEL3 cluster nodes reboot themselves. Therefor, even if nodeB can reach the tiebreaker and CANT reach nodeA, it will not get the cluster resource until nodeA releases it. This prevents the nodes from accessing the shared disk resource concommitantly. This configuration prevents split brain by ensuring the resource owner doesn't get killed accidentally by its peer. For those that remember, ping=ponging was a big problem with RHEL2.1 clusters. If they couldn't read their partners disk timestamp update in a timely manner -- due to IO latency or whatever -- they would reboot their partner node. On reboot, the rebooted node would STONITH the other node, etc. Anyway, I hope this answers your questions. It is fairly easy to test. Set up a 2 node cluster, then reboot the service owner. If the service starts on the other node, you should be configured correctly. Next disconnect the service owner from the network. The service owner should reboot itself with the watchdog or fail over the resource, depending on how its ocnfigured. Repeat this test with the non-service owner. (the resources should not move in this case.) then take turns disconnecting them from the shared storage. Cheers, jacob ----------------------------------------------------------------------------- RE: [Linux-cluster] Tiebreaker IP ------------------------------------------------------------------------ * /From/: Lon Hohberger * /To/: linux clustering * /Subject/: RE: [Linux-cluster] Tiebreaker IP * /Date/: Mon, 29 Aug 2005 15:19:40 -0400 ------------------------------------------------------------------------ On Fri, 2005-08-26 at 13:24 -0500, JACOB_LIBERMAN Dell com wrote: > If it cant, it will assume its external interface is down > and fence/reboot itself. The same holds true for nodeB. Unlike RHEL2.1 > which used STONITH, RHEL3 cluster nodes reboot themselves. Both use STONITH. RHEL3 cluster nodes are more paranoid about running without STONITH. If STONITH is configured on a RHEL3 cluster, the node will instead wait to be shot -- or for a new quorum to form -- if it loses network connectivity. > Anyway, I hope this answers your questions. It is fairly easy to test. > Set up a 2 node cluster, then reboot the service owner. If the service > starts on the other node, you should be configured correctly. Next > disconnect the service owner from the network. The service owner should > reboot itself with the watchdog or fail over the resource, depending on > how its ocnfigured. It should reboot itself because it loses quorum, really. Basically, without STONITH, a node thinks like this on RHEL3: "I was quorate and now I'm not, and no one can cut me off from shared storage... Uh, oh, REBOOT!" 
-- Lon ------------------------------------------------------------------------ more ----------------------------------------------------------------------------- >The disk tiebreaker works in a similar way, except that it lets the >cluster limp in along in a safe, semi-split-brain (split brain) in a >network outage. What I mean is that because there's state information >written to/read from the shared raw partitions, the nodes can actually >tell via other means whether or not the other node is "alive" or not as >opposed to relying solely on the network traffic. >Both nodes update state information on the shared partitions. When one >node detects that the other node has not updated its information for a >period of time, that node is "down" according to the disk subsystem.If >this coincides with a "down" status from the membership daemon, the node >is fenced and services are failed over. If the node never goes down >(and keeps updating its information on the shared partitions), then the >node is never fenced and services never fail over. -- Lon 14. What is a quorum disk/partition and what does it do for you? A quorum disk or partition is a section of a disk that's set up for use with components of the cluster project. It has a couple of purposes. Again, I'll explain with an example. Suppose you have nodes A and B, and node A fails to get several of cluster manager's "heartbeat" packets from node B. Node A doesn't know why it hasn't received the packets, but there are several possibilities: either node B has failed, the network switch or hub has failed, node A's network adapter has failed, or maybe just because node B was just too busy to send the packet. That can happen if your cluster is extremely large, your systems are extremely busy or your network is flakey. Node A doesn't know which is the case, and it doesn't know whether the problem lies within itself or with node B. This is especially problematic in a two-node cluster because both nodes, out of touch with one another, can try to fence the other. So before fencing a node, it would be nice to have another way to check if the other node is really alive, even though we can't seem to contact it. A quorum disk gives you the ability to do just that. Before fencing a node that's out of touch, the cluster software can check whether the node is still alive based on whether it has written data to the quorum partition. In the case of two-node systems, the quorum disk also acts as a tie-breaker. If a node has access to the quorum disk and the network, that counts as two votes. A node that has lost contact with the network or the quorum disk has lost a vote, and therefore may safely be fenced. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Oct 2 13:10:16 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 09:10:16 -0400 Subject: Fwd: Re: [Linux-cluster] Disk tie breaker -how does it work? In-Reply-To: <20060929214217.2230.qmail@web34210.mail.mud.yahoo.com> References: <20060929214217.2230.qmail@web34210.mail.mud.yahoo.com> Message-ID: <1159794616.3047.2.camel@localhost.localdomain> On Fri, 2006-09-29 at 14:42 -0700, Rick Rodgers wrote: > Thanks but the page can not be found. 404 error. Do i need to > download some keys? Whoops, typo. > http://people.redhat.com/lhh/rhcm-3-internals.odt http://people.redhat.com/lhh/rhcm-el3-internals.odt Tested it this time. 
-- Lon From troels at arvin.dk Mon Oct 2 13:16:37 2006 From: troels at arvin.dk (Troels Arvin) Date: Mon, 02 Oct 2006 15:16:37 +0200 Subject: [Linux-cluster] Quorum disk: Can it be a partition? Message-ID: Hello, About the new quorum disk feature in RH Cluster Suite Update 4: Does it have to be a complete, dedicated disk, or is it just as fine to use a disk partition? -- Greetings from Troels Arvin From rpeterso at redhat.com Mon Oct 2 13:46:14 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 02 Oct 2006 08:46:14 -0500 Subject: [Linux-cluster] Quorum disk: Can it be a partition? In-Reply-To: References: Message-ID: <45211826.1050703@redhat.com> Troels Arvin wrote: > Hello, > > About the new quorum disk feature in RH Cluster Suite Update 4: > Does it have to be a complete, dedicated disk, or is it just as fine to > use a disk partition? Hi Troels, It doesn't have to be a disk; a partition is just fine. http://sources.redhat.com/cluster/faq.html#quorum Regards, Bob Peterson Red Hat Cluster Suite From teigland at redhat.com Mon Oct 2 15:31:14 2006 From: teigland at redhat.com (David Teigland) Date: Mon, 2 Oct 2006 10:31:14 -0500 Subject: [Linux-cluster] Panic In-Reply-To: <452059FE.3030104@redhat.com> References: <003401c6e58d$b71dc9d0$0245450a@AndreLaptop> <452059FE.3030104@redhat.com> Message-ID: <20061002153114.GD19242@redhat.com> On Sun, Oct 01, 2006 at 07:14:54PM -0500, Robert Peterson wrote: > andre at hudat.com wrote: > >I have the following panic on two nodes hours apart. Each node Is in a > >different state ( as in states of the US ). NO I am not running a cluster > >over a WAN, just two separate clusters in two different locations. Files > >are > >written on one cluster and I have a script that does an SCP of the file to > >the other cluster. Both machines running the latest RHEL4 with the latest > >GFS updates. This just started happening. Happened twice since Friday > >morning. Any hints ? What is happening with clvmd here ? What does the > >global conflict message mean ? The clvmd and "global conflict" lines are just normal stuff from the debug buffer that was dumped on the panic, they're not relevant here. > This might be the same as bugzilla bug 208134. > See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208134. > There is a patch to try with the bugzilla. Bug 208134 is not related, but bug 199673 may be. Dave From Andre at hudat.com Mon Oct 2 15:58:50 2006 From: Andre at hudat.com (Andre Henry) Date: Mon, 2 Oct 2006 11:58:50 -0400 Subject: [Linux-cluster] Panic In-Reply-To: <20061002153114.GD19242@redhat.com> References: <003401c6e58d$b71dc9d0$0245450a@AndreLaptop> <452059FE.3030104@redhat.com> <20061002153114.GD19242@redhat.com> Message-ID: This is exactly what we are doing. The system creates and deletes several thousand files per day. I take it this has not made it to a GFS release as yet ? So I would have to manually apply the patch and rebuild the DLM kernel ? -- Thanks Andre On Oct 2, 2006, at 11:31 AM, David Teigland wrote: > On Sun, Oct 01, 2006 at 07:14:54PM -0500, Robert Peterson wrote: >> andre at hudat.com wrote: >>> I have the following panic on two nodes hours apart. Each node Is in >>> a >>> different state ( as in states of the US ). NO I am not running a >>> cluster >>> over a WAN, just two separate clusters in two different locations. >>> Files >>> are >>> written on one cluster and I have a script that does an SCP of the >>> file to >>> the other cluster. Both machines running the latest RHEL4 with the >>> latest >>> GFS updates. 
This just started happening. Happened twice since Friday >>> morning. Any hints ? What is happening with clvmd here ? What does >>> the >>> global conflict message mean ? > > The clvmd and "global conflict" lines are just normal stuff from the > debug > buffer that was dumped on the panic, they're not relevant here. > >> This might be the same as bugzilla bug 208134. >> See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208134. >> There is a patch to try with the bugzilla. > > Bug 208134 is not related, but bug 199673 may be. > > Dave > From teigland at redhat.com Mon Oct 2 16:07:25 2006 From: teigland at redhat.com (David Teigland) Date: Mon, 2 Oct 2006 11:07:25 -0500 Subject: [Linux-cluster] Panic In-Reply-To: References: <003401c6e58d$b71dc9d0$0245450a@AndreLaptop> <452059FE.3030104@redhat.com> <20061002153114.GD19242@redhat.com> Message-ID: <20061002160725.GE19242@redhat.com> On Mon, Oct 02, 2006 at 11:58:50AM -0400, Andre Henry wrote: > This is exactly what we are doing. The system creates and deletes > several thousand files per day. > > I take it this has not made it to a GFS release as yet ? So I would > have to manually apply the patch and rebuild the DLM kernel ? It'll be in the next errata release. Until then, yes, you can patch and rebuild the dlm kernel module. Here's the diff: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm-kernel/src/Attic/lkb.c.diff?r1=1.3.2.1&r2=1.3.2.1.12.1&cvsroot=cluster&only_with_tag=RHEL4U4&f=h Dave From dbrieck at gmail.com Mon Oct 2 16:10:34 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Mon, 2 Oct 2006 12:10:34 -0400 Subject: [Linux-cluster] GNBD Ports Message-ID: <8c1094290610020910s53ac1575gf6c596f4cdef54a8@mail.gmail.com> Which ports would need to be open on a firewall to use GNBD server? Nothing is mentioned in this document about them. http://sources.redhat.com/cluster/faq.html#iptables Thanks. From lhh at redhat.com Mon Oct 2 16:17:39 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 12:17:39 -0400 Subject: [Linux-cluster] ip-tiebreaker quotes need explainging In-Reply-To: <20061002035238.66323.qmail@web34203.mail.mud.yahoo.com> References: <20061002035238.66323.qmail@web34203.mail.mud.yahoo.com> Message-ID: <1159805859.22558.0.camel@rei.boston.devel.redhat.com> On Sun, 2006-10-01 at 20:52 -0700, Rick Rodgers wrote: > Thanks for replying to me especically on a Sunday. I was familiar with > everything you > said. The question concerned not being able to use the tiebreaker to > form quarum on STARTUP. > Why is this? > > Also Lon sent me a URL to a doc I need but it seems the URL is stale. > Can you help > me? Here is the URL: > > > http://people.redhat.com/lhh/rhcm-3-internals.odt http://people.redhat.com/lhh/rhcm-el3-internals.odt ^^^ Typo. -- Lon From lhh at redhat.com Mon Oct 2 16:20:14 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 12:20:14 -0400 Subject: [Linux-cluster] How does disk prevent split brain? In-Reply-To: <20061002035942.52252.qmail@web34213.mail.mud.yahoo.com> References: <20061002035942.52252.qmail@web34213.mail.mud.yahoo.com> Message-ID: <1159806014.22558.3.camel@rei.boston.devel.redhat.com> On Sun, 2006-10-01 at 20:59 -0700, Rick Rodgers wrote: > According to RH documentation and the engineers you can have a safe > two node > cluster without having to have an ip-tiebreaker. They say it will use > the disk tie-breaker. > > How is this possible? 
Since clulockd is running separately on each of > the nodes how does > it prevent each node from accessing the disk at the same time and > declaring himself the > Active node? If it is through some sort of locking mechanism that > insures this can you please > be technically specific in what it is? Only one will be the lock master, because there is knowledge that both nodes are actually still online. You can not disable/move services when the cluster network is dead, but both nodes can continue running the services they already are running. Most people should probably use the disk tiebreaker if they want failover if a node's cluster network gets disconnected. -- Lon From rodgersr at yahoo.com Mon Oct 2 16:38:04 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 2 Oct 2006 09:38:04 -0700 (PDT) Subject: [Linux-cluster] How does disk prevent split brain? Message-ID: <20061002163804.82857.qmail@web34212.mail.mud.yahoo.com> Thanks for the info. Yes I understand what you are saying. However, if you do not specify and ip-tiebreaker I assumed it used the disk tiebreaker by default. ----- Original Message ---- From: Lon Hohberger To: linux clustering Sent: Monday, October 2, 2006 9:20:14 AM Subject: Re: [Linux-cluster] How does disk prevent split brain? On Sun, 2006-10-01 at 20:59 -0700, Rick Rodgers wrote: > According to RH documentation and the engineers you can have a safe > two node > cluster without having to have an ip-tiebreaker. They say it will use > the disk tie-breaker. > > How is this possible? Since clulockd is running separately on each of > the nodes how does > it prevent each node from accessing the disk at the same time and > declaring himself the > Active node? If it is through some sort of locking mechanism that > insures this can you please > be technically specific in what it is? Only one will be the lock master, because there is knowledge that both nodes are actually still online. You can not disable/move services when the cluster network is dead, but both nodes can continue running the services they already are running. Most people should probably use the disk tiebreaker if they want failover if a node's cluster network gets disconnected. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From damian.osullivan at hp.com Mon Oct 2 16:41:37 2006 From: damian.osullivan at hp.com (O'Sullivan, Damian) Date: Mon, 2 Oct 2006 17:41:37 +0100 Subject: [Linux-cluster] LVM2 cluster problem Message-ID: <644A0966265D9D40AC7584FCE95611130308B00D@dubexc01.emea.cpqcorp.net> Hi all, A few days ago I had to reset a node in the cluster. On reboot I get the following : Loading jbd.ko module Loading ext3.ko module Loading dm-mirror.ko module Loading dm-zero.ko module Loading dm-snapshot.ko module Making device-mapper control node Scanning logical volumes Unable to open external locking library liblvm2clusterlock.so Reading all physical volumes. This may take a while... cdrom: open failed. Found volume group "VolGroup00" using metadata type lvm2 Found volume group "MSA_VOL" using metadata type lvm2 Activating logical volumes Unable to open external locking library liblvm2clusterlock.so Locking inactive: ignoring clustered volume group VolGroup00 ERROR: /bin/lvm exited abnormally! 
(pid 402) Creating root device Mounting root filesystem mount: error 6 mounting ext3 mount: error 2 mounting none Switching to new root switchroot: mount failed: 22 umount /initrd/dev failed: 2 Kernel panic - not syncing: Attempted to kill init! This is a result of the following commands in my initrd : insmod /lib/jbd.ko echo "Loading ext3.ko module" insmod /lib/ext3.ko echo "Loading dm-mirror.ko module" insmod /lib/dm-mirror.ko echo "Loading dm-zero.ko module" insmod /lib/dm-zero.ko echo "Loading dm-snapshot.ko module" insmod /lib/dm-snapshot.ko /sbin/udevstart echo Making device-mapper control node mkdmnod echo Scanning logical volumes lvm vgscan --ignorelockingfailure echo Activating logical volumes lvm vgchange -ay --ignorelockingfailure VolGroup00 echo Creating root device mkrootdev /dev/root umount /sys echo Mounting root filesystem mount -o defaults --ro -t ext3 /dev/root /sysroot mount -t tmpfs --bind /dev /sysroot/dev echo Switching to new root switchroot /sysroot umount /initrd/dev The other node is exactly the same and comes up no problem with the same initrd + kernel. I can access all the logical volumes from a rescue disk. All based on Centos 4.4 with latest updates. Any ideas? Thanks, D. Kernel : 2.6.9-42.0.2.Elsmp Cluster LVM daemon version: 2.02.06 (2006-05-12) Protocol version: 0.2.1 From lhh at redhat.com Mon Oct 2 18:14:54 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 14:14:54 -0400 Subject: [Linux-cluster] How does disk prevent split brain? In-Reply-To: <20061002163804.82857.qmail@web34212.mail.mud.yahoo.com> References: <20061002163804.82857.qmail@web34212.mail.mud.yahoo.com> Message-ID: <1159812894.3103.18.camel@rei.boston.devel.redhat.com> On Mon, 2006-10-02 at 09:38 -0700, Rick Rodgers wrote: > Thanks for the info. Yes I understand what you are saying. However, > if > you do not specify and ip-tiebreaker I assumed it used the disk > tiebreaker > by default. That's correct. -- Lon From lhh at redhat.com Mon Oct 2 18:16:03 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 14:16:03 -0400 Subject: [Linux-cluster] RH documentation and RH Engineering do not agree In-Reply-To: <20061002051228.1904.qmail@web34209.mail.mud.yahoo.com> References: <20061002051228.1904.qmail@web34209.mail.mud.yahoo.com> Message-ID: <1159812963.3103.20.camel@rei.boston.devel.redhat.com> On Sun, 2006-10-01 at 22:12 -0700, Rick Rodgers wrote: > However this does not seem to be the case. The systems stop the > service or get STONITHed. That was a bug. cludb -p cluquorumd%disk_quorum 1 In a future release, it will be set to this by default. -- Lon > From Leonardo.Mello at planejamento.gov.br Mon Oct 2 18:47:30 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Mon, 2 Oct 2006 15:47:30 -0300 Subject: RES: [Linux-cluster] RH documentation and RH Engineering do not agree Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255BF2@corp-bsa-mp01.planejamento.gov.br> Lon, Congratulations for the documentation!!! :-D -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Lon Hohberger Enviada: seg 2/10/2006 15:16 Para: linux clustering Cc: Assunto: Re: [Linux-cluster] RH documentation and RH Engineering do not agree On Sun, 2006-10-01 at 22:12 -0700, Rick Rodgers wrote: > However this does not seem to be the case. The systems stop the > service or get STONITHed. That was a bug. cludb -p cluquorumd%disk_quorum 1 In a future release, it will be set to this by default. 
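For a RHCS3 (clumanager) cluster, a minimal sketch of applying this workaround (assuming the parameter has to be set on every member, and that clumanager is restarted afterwards so cluquorumd re-reads its configuration) would be:

    # run on each cluster member (assumption)
    cludb -p cluquorumd%disk_quorum 1
    # restart the cluster software so the quorum daemon picks up the change (assumption)
    service clumanager restart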
-- Lon > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 2970 bytes Desc: not available URL: From lhh at redhat.com Mon Oct 2 20:56:03 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 16:56:03 -0400 Subject: [Linux-cluster] ip-tiebreaker quotes need explainging In-Reply-To: <20061001025203.76527.qmail@web34209.mail.mud.yahoo.com> References: <20061001025203.76527.qmail@web34209.mail.mud.yahoo.com> Message-ID: <1159822563.3103.48.camel@rei.boston.devel.redhat.com> On Sat, 2006-09-30 at 19:52 -0700, Rick Rodgers wrote: > Can anyone explain why this is so. Why is it only used on maintaining > qourum and not startup? > > > "The IP tiebreaker is typically used to *maintain* a quorum after a node > failure, because there are certain network faults in which two nodes may > see the tiebreaker - but not each other. > -- Lon" In certain situations (ex: ARP storms, switch loops, etc.), it is possible to see the IP-tiebreaker (an upstream router) but *not* your peer over the switch. If this happens to both nodes, you have a split brain. I should update the internals big to reflect the 'why'. You can change this behavior using cludb. Note that IP tiebreakers have to be in the cluster communications path - you can't use heartbeating over a private network and an IP tiebreaker on another network. -- Lon From lhh at redhat.com Mon Oct 2 21:12:57 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 17:12:57 -0400 Subject: [Linux-cluster] clurmtabd In-Reply-To: <20060929213741.97281.qmail@web34212.mail.mud.yahoo.com> References: <20060929213741.97281.qmail@web34212.mail.mud.yahoo.com> Message-ID: <1159823577.3103.56.camel@rei.boston.devel.redhat.com> On Fri, 2006-09-29 at 14:37 -0700, Rick Rodgers wrote: > it does not seem to work that way. I tested it and it only got what > was mounted on the specified directory. Not the subdirectories. > Has this changed recently (in the last 2 years?) http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=80081 It looks like it was never fixed in the clumanager-1.2.x tree. It was fixed for RHEL2.1 (clumanager-1.0.x) a long time ago, but not for RHCS3. You should file a bugzilla if you need it fixed. -- Lon From lhh at redhat.com Mon Oct 2 21:18:45 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 02 Oct 2006 17:18:45 -0400 Subject: [Linux-cluster] clurmtabd In-Reply-To: <1159823577.3103.56.camel@rei.boston.devel.redhat.com> References: <20060929213741.97281.qmail@web34212.mail.mud.yahoo.com> <1159823577.3103.56.camel@rei.boston.devel.redhat.com> Message-ID: <1159823925.3103.58.camel@rei.boston.devel.redhat.com> On Mon, 2006-10-02 at 17:12 -0400, Lon Hohberger wrote: > On Fri, 2006-09-29 at 14:37 -0700, Rick Rodgers wrote: > > it does not seem to work that way. I tested it and it only got what > > was mounted on the specified directory. Not the subdirectories. > > Has this changed recently (in the last 2 years?) > > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=80081 > > It looks like it was never fixed in the clumanager-1.2.x tree. > > It was fixed for RHEL2.1 (clumanager-1.0.x) a long time ago, but not for > RHCS3. You should file a bugzilla if you need it fixed. 
I've filed this as a clone of 80081: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208995 -- Lon From rodgersr at yahoo.com Mon Oct 2 21:45:19 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 2 Oct 2006 14:45:19 -0700 (PDT) Subject: [Linux-cluster] RH documentation and RH Engineering do not agree Message-ID: <20061002214519.56791.qmail@web34211.mail.mud.yahoo.com> Ok thanks for the info. It was all good documentation just left me a little confused. Thanks for all your feedback it has beenn a great help. Rick ----- Original Message ---- From: Lon Hohberger To: linux clustering Sent: Monday, October 2, 2006 11:16:03 AM Subject: Re: [Linux-cluster] RH documentation and RH Engineering do not agree On Sun, 2006-10-01 at 22:12 -0700, Rick Rodgers wrote: > However this does not seem to be the case. The systems stop the > service or get STONITHed. That was a bug. cludb -p cluquorumd%disk_quorum 1 In a future release, it will be set to this by default. -- Lon > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Mon Oct 2 21:53:13 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 2 Oct 2006 14:53:13 -0700 (PDT) Subject: [Linux-cluster] clurmtabd Message-ID: <20061002215313.69567.qmail@web34201.mail.mud.yahoo.com> Ok thanks.At least now I know I am not going crazy :-) ----- Original Message ---- From: Lon Hohberger To: linux clustering Sent: Monday, October 2, 2006 2:12:57 PM Subject: Re: [Linux-cluster] clurmtabd On Fri, 2006-09-29 at 14:37 -0700, Rick Rodgers wrote: > it does not seem to work that way. I tested it and it only got what > was mounted on the specified directory. Not the subdirectories. > Has this changed recently (in the last 2 years?) http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=80081 It looks like it was never fixed in the clumanager-1.2.x tree. It was fixed for RHEL2.1 (clumanager-1.0.x) a long time ago, but not for RHCS3. You should file a bugzilla if you need it fixed. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From jprats at cesca.es Tue Oct 3 06:18:45 2006 From: jprats at cesca.es (Jordi Prats) Date: Tue, 03 Oct 2006 08:18:45 +0200 Subject: [Linux-cluster] problems relocating services Message-ID: <452200C5.6080406@cesca.es> Hi all, I have a problem relocating services, sometimes fails. I have all requiered operations nested, so it shoud not be a race condition. How could add some verbosity to syslog to find out why is failing? Thanks, -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From lhh at redhat.com Tue Oct 3 13:50:33 2006 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 03 Oct 2006 09:50:33 -0400 Subject: [Linux-cluster] problems relocating services In-Reply-To: <452200C5.6080406@cesca.es> References: <452200C5.6080406@cesca.es> Message-ID: <1159883433.8020.3.camel@rei.boston.devel.redhat.com> On Tue, 2006-10-03 at 08:18 +0200, Jordi Prats wrote: > Hi all, > I have a problem relocating services, sometimes fails. 
I have all > requiered operations nested, so it shoud not be a race condition. How > could add some verbosity to syslog to find out why is failing? Did it go in to the "failed" state, or did it simply "fail" to relocate? Change /etc/syslog.conf to add the following on all nodes: local4.* /var/log/rgmanager Change your "rm" tag in cluster.conf to enable debugging: (don't forget to increment the configuration version) Run ccs_tool update /etc/cluster/cluster.conf -- Lon From lhh at redhat.com Tue Oct 3 14:30:38 2006 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 03 Oct 2006 10:30:38 -0400 Subject: [Linux-cluster] Red Hat Linux AS 4 U3 Clustering In-Reply-To: <7BED60E643BD1C4F8A84E3F0B411C14A0F3F31@srit_mail.renaissance-it.com> References: <7BED60E643BD1C4F8A84E3F0B411C14A0F3F31@srit_mail.renaissance-it.com> Message-ID: <1159885839.8020.7.camel@rei.boston.devel.redhat.com> On Sat, 2006-09-30 at 12:13 +0530, Jotheswaran M wrote: > Hi All, > > I am new to this forum, I have a problem with Red Hat Linux AS 4 U3 > Clustering I have used IBM Xseries 366 servers with two HBA's and > DS4300 SAN storage. > > I have installed and configured the OS and the clustering with out any > issues. I am running oracle9i as the database and the same has been > configured in the cluster and it works fine, I can also fail over it > works fine. > > The problem is if I shutdown one server or remove the power chord of > one server the cluster doesn't switch over but if I go through the > normal shutdown the cluster switches. > > Can you gueys help me to resolve this please. Ok, first of all - please try updating rgmanager, magma and magma-plugins to the U4 versions (you don't have to update anything else). :) -- Lon From spatuality at yahoo.ca Tue Oct 3 14:33:58 2006 From: spatuality at yahoo.ca (Brian) Date: Tue, 3 Oct 2006 07:33:58 -0700 (PDT) Subject: [Linux-cluster] fence_drac broken with DRAC/MC 1.3 Message-ID: <20061003143358.38113.qmail@web30809.mail.mud.yahoo.com> Hi group, I have submitted a bug report for this problem, but thought it might be useful to let the group know what I've found. I'm running RHEL 4 Update 4 on Dell PowerEdge 1955 blade servers in a chassis with DRAC/MC 1.3 firmware. The fence_drac is able to power off/on the blade, but the script is not returning the correct status after the power is switched off/on. Example command issued: # fence_drac -a 10.0.0.20 -l username -p password -D debug.txt -m Server-10 -v -o off detected drac version 'DRAC/MC' failed: telnet returned: pattern match timed-out Result: Server is shut off harshly (ie. about 3 services are shutdown in init 6, then power is cut to the machine). For troubleshooting, running init 6 manually results in a full, normal shutdown of the server. If I run fence_node, with fence_drac as the script to run setup in /etc/cluster/cluster.conf, the missing expected response of server off/on results in the node being power cycled repeatedly. Problem: Its great that the server is getting shut down, but the Perl Telnet interface needs a known response to feedback an expected result. I'm guessing changing the script is fairly trivial to get this working with DRAC/MC 1.3. If anyone else has this working, please pass along the fix. I will try working on this next week to see if I can kick it into working. 
Brian From damian.osullivan at hp.com Tue Oct 3 15:17:57 2006 From: damian.osullivan at hp.com (O'Sullivan, Damian) Date: Tue, 3 Oct 2006 16:17:57 +0100 Subject: [Linux-cluster] LVM2 cluster problem In-Reply-To: <644A0966265D9D40AC7584FCE95611130308B00D@dubexc01.emea.cpqcorp.net> Message-ID: <644A0966265D9D40AC7584FCE95611130308B58E@dubexc01.emea.cpqcorp.net> > ERROR: /bin/lvm exited abnormally! (pid 402) Creating root > device Mounting root filesystem > mount: error 6 mounting ext3 > mount: error 2 mounting none > Switching to new root > switchroot: mount failed: 22 > umount /initrd/dev failed: 2 > Kernel panic - not syncing: Attempted to kill init! > Just a follow up to my own mail. The local storage was marked as "clustered" for some reason. vgs showed this up. I took away the c bit and it works again. D. From spatuality at yahoo.ca Tue Oct 3 21:26:22 2006 From: spatuality at yahoo.ca (Brian) Date: Tue, 3 Oct 2006 14:26:22 -0700 (PDT) Subject: [Linux-cluster] fence_drac broken with DRAC/MC 1.3 Message-ID: <20061003212622.67382.qmail@web30807.mail.mud.yahoo.com> I have fixed the problem and posted the change to Bugzilla. It was due to the $telnet_timeout value of 5 seconds being too short for the 1.3 DRAC/MC firmware. Dell decided to make the telnet connection slower than 1.2 for some reason. /sbin/fence_drac, line 33: From: my $telnet_timeout = 10; # Seconds to wait for matching telent response To: my $telnet_timeout = 10; # Seconds to wait for matching telent response Brian ----- Original Message ---- From: Brian To: linux-cluster at redhat.com Sent: Tuesday, October 3, 2006 10:33:58 AM Subject: [Linux-cluster] fence_drac broken with DRAC/MC 1.3 Hi group, I have submitted a bug report for this problem, but thought it might be useful to let the group know what I've found. I'm running RHEL 4 Update 4 on Dell PowerEdge 1955 blade servers in a chassis with DRAC/MC 1.3 firmware. The fence_drac is able to power off/on the blade, but the script is not returning the correct status after the power is switched off/on. Example command issued: # fence_drac -a 10.0.0.20 -l username -p password -D debug.txt -m Server-10 -v -o off detected drac version 'DRAC/MC' failed: telnet returned: pattern match timed-out Result: Server is shut off harshly (ie. about 3 services are shutdown in init 6, then power is cut to the machine). For troubleshooting, running init 6 manually results in a full, normal shutdown of the server. If I run fence_node, with fence_drac as the script to run setup in /etc/cluster/cluster.conf, the missing expected response of server off/on results in the node being power cycled repeatedly. Problem: Its great that the server is getting shut down, but the Perl Telnet interface needs a known response to feedback an expected result. I'm guessing changing the script is fairly trivial to get this working with DRAC/MC 1.3. If anyone else has this working, please pass along the fix. I will try working on this next week to see if I can kick it into working. Brian -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From spatuality at yahoo.ca Tue Oct 3 21:31:12 2006 From: spatuality at yahoo.ca (Brian) Date: Tue, 3 Oct 2006 14:31:12 -0700 (PDT) Subject: [Linux-cluster] fence_drac broken with DRAC/MC 1.3 Message-ID: <20061003213112.40010.qmail@web30810.mail.mud.yahoo.com> Sorry about that. The change was supposed to be from 5 seconds to 10 seconds. 
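In other words, the intended edit (the line number comes from the earlier message and may sit elsewhere in other versions of the script) is:

/sbin/fence_drac, line 33:
From: my $telnet_timeout = 5;  # Seconds to wait for matching telnet response
To:   my $telnet_timeout = 10; # Seconds to wait for matching telnet response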
Brian ----- Original Message ---- From: Brian To: linux clustering Sent: Tuesday, October 3, 2006 5:26:22 PM Subject: Re: [Linux-cluster] fence_drac broken with DRAC/MC 1.3 I have fixed the problem and posted the change to Bugzilla. It was due to the $telnet_timeout value of 5 seconds being too short for the 1.3 DRAC/MC firmware. Dell decided to make the telnet connection slower than 1.2 for some reason. /sbin/fence_drac, line 33: From: my $telnet_timeout = 10; # Seconds to wait for matching telent response To: my $telnet_timeout = 10; # Seconds to wait for matching telent response Brian ----- Original Message ---- From: Brian To: linux-cluster at redhat.com Sent: Tuesday, October 3, 2006 10:33:58 AM Subject: [Linux-cluster] fence_drac broken with DRAC/MC 1.3 Hi group, I have submitted a bug report for this problem, but thought it might be useful to let the group know what I've found. I'm running RHEL 4 Update 4 on Dell PowerEdge 1955 blade servers in a chassis with DRAC/MC 1.3 firmware. The fence_drac is able to power off/on the blade, but the script is not returning the correct status after the power is switched off/on. Example command issued: # fence_drac -a 10.0.0.20 -l username -p password -D debug.txt -m Server-10 -v -o off detected drac version 'DRAC/MC' failed: telnet returned: pattern match timed-out Result: Server is shut off harshly (ie. about 3 services are shutdown in init 6, then power is cut to the machine). For troubleshooting, running init 6 manually results in a full, normal shutdown of the server. If I run fence_node, with fence_drac as the script to run setup in /etc/cluster/cluster.conf, the missing expected response of server off/on results in the node being power cycled repeatedly. Problem: Its great that the server is getting shut down, but the Perl Telnet interface needs a known response to feedback an expected result. I'm guessing changing the script is fairly trivial to get this working with DRAC/MC 1.3. If anyone else has this working, please pass along the fix. I will try working on this next week to see if I can kick it into working. Brian -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From jprats at cesca.es Wed Oct 4 07:51:33 2006 From: jprats at cesca.es (Jordi Prats) Date: Wed, 04 Oct 2006 09:51:33 +0200 Subject: [Linux-cluster] problems relocating services In-Reply-To: <1159883433.8020.3.camel@rei.boston.devel.redhat.com> References: <452200C5.6080406@cesca.es> <1159883433.8020.3.camel@rei.boston.devel.redhat.com> Message-ID: <45236805.60709@cesca.es> Thanks, It goes to failed state. Today I've been reloacating services but it fails to relocate (is not going to failed state): Oct 4 09:07:49 inf04 clurgmgrd[6299]: Service projectes is stopped Oct 4 09:07:49 inf04 clurgmgrd[6299]: #70: Attempting to restart service projectes locally. Oct 4 09:07:49 inf04 clurgmgrd[6299]: Starting stopped service projectes On the other node log do not appear anything related to the relocation. Maybe is a communication problem between nodes? Jordi Lon Hohberger wrote: > On Tue, 2006-10-03 at 08:18 +0200, Jordi Prats wrote: > >> Hi all, >> I have a problem relocating services, sometimes fails. I have all >> requiered operations nested, so it shoud not be a race condition. How >> could add some verbosity to syslog to find out why is failing? 
>> > > Did it go in to the "failed" state, or did it simply "fail" to relocate? > > > Change /etc/syslog.conf to add the following on all nodes: > > local4.* /var/log/rgmanager > > Change your "rm" tag in cluster.conf to enable debugging: > > > > (don't forget to increment the configuration version) > > Run ccs_tool update /etc/cluster/cluster.conf > > > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From sandra-llistes at fib.upc.edu Wed Oct 4 12:16:55 2006 From: sandra-llistes at fib.upc.edu (sandra-llistes) Date: Wed, 04 Oct 2006 14:16:55 +0200 Subject: [Linux-cluster] GFS and samba problem, again Message-ID: <4523A637.1060706@fib.upc.edu> Hi, I sent a mail a few days ago to this list related with GFS+samba problems. Since the, we have installed a sepparated test environment also with two linux servers where we have tested a samba server with an exported share in GFS. The share is read-only and only one server is exporting it. When we try to access from a single windows client it works fine, but when we try to access to the same file from 2 or more windows clients simoultaneously, windows hangs and samba also does. This seems not to happen with concurrent access to different files or with linux clients. We've also tested to export the same share without GFS and in this case it works fine. It seems to be a locking problem with samba, GFS and windows clients. Does any of you have experienced similar problems? Do you have any suggestion about this? Following is the share configuration in smb.conf: [public] comment = ShareGFS path = /public writeable = No read only = Yes write list = @admsamba force group = admsamba create mask = 0775 directory mask = 0775 oplocks = No locking = Yes strict locking = Yes # I proved with locking/Strick locking=Yes and No. Always happens the same problem I attach some samba logs (Level 3). Software Versions: Fedora 5 Samba 3.0.23 GFS 6.1.5 kernel 2.6.17-1.2187_FC5 Any help will be appreciated. Sandra Hernandez -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: GFSlog.smbd URL: From lhh at redhat.com Wed Oct 4 16:59:54 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 04 Oct 2006 12:59:54 -0400 Subject: [Linux-cluster] problems relocating services In-Reply-To: <45236805.60709@cesca.es> References: <452200C5.6080406@cesca.es> <1159883433.8020.3.camel@rei.boston.devel.redhat.com> <45236805.60709@cesca.es> Message-ID: <1159981194.12856.14.camel@rei.boston.devel.redhat.com> On Wed, 2006-10-04 at 09:51 +0200, Jordi Prats wrote: > Thanks, > It goes to failed state. Today I've been reloacating services but it > fails to relocate (is not going to failed state): > > Oct 4 09:07:49 inf04 clurgmgrd[6299]: Service projectes is stopped > Oct 4 09:07:49 inf04 clurgmgrd[6299]: #70: Attempting to > restart service projectes locally. > Oct 4 09:07:49 inf04 clurgmgrd[6299]: Starting stopped service > projectes > > On the other node log do not appear anything related to the relocation. > Maybe is a communication problem between nodes? 
It sounds like rgmanager isn't running on the other node or something; check /proc/cluster/services? -- Lon From jprats at cesca.es Wed Oct 4 18:34:57 2006 From: jprats at cesca.es (Jordi Prats) Date: Wed, 04 Oct 2006 20:34:57 +0200 Subject: [Linux-cluster] problems relocating services In-Reply-To: <1159981194.12856.14.camel@rei.boston.devel.redhat.com> References: <452200C5.6080406@cesca.es> <1159883433.8020.3.camel@rei.boston.devel.redhat.com> <45236805.60709@cesca.es> <1159981194.12856.14.camel@rei.boston.devel.redhat.com> Message-ID: <4523FED1.9010404@cesca.es> It appears to be running on both nodes: # cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [1 2] DLM Lock Space: "Magma" 3 4 run - [1 2] User: "usrm::manager" 2 3 run - [1 2] # ccs_tool lsnode Cluster name: dades, config_version: 76 Nodename Votes Nodeid Iface Fencetype inf04 1 inf05 1 # clustat Member Status: Quorate Member Name Status ------ ---- ------ inf04 Online, rgmanager inf05 Online, Local, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- projectes inf04 started local inf04 started mysql inf05 started postgres inf05 started # ps -fea | grep rg root 6362 1 0 Oct03 ? 00:00:00 clurgmgrd root 6364 6362 0 Oct03 ? 00:03:49 clurgmgrd root 19138 7151 0 09:07 pts/1 00:00:00 tail /var/log/rgmanager -f -n5000 root 7073 6954 0 20:28 pts/3 00:00:00 grep rg # ps -fea | grep rg root 6362 1 0 Oct03 ? 00:00:00 clurgmgrd root 6364 6362 0 Oct03 ? 00:03:49 clurgmgrd root 19138 7151 0 09:07 pts/1 00:00:00 tail /var/log/rgmanager -f -n5000 root 7073 6954 0 20:28 pts/3 00:00:00 grep rg The same information is displayed on both nodes. Our version is: # clustat -v clustat version 1.9.53 Connected via: CMAN/SM Plugin v1.1.7.1 Any ideas? Thanks, Jordi Lon Hohberger wrote: > On Wed, 2006-10-04 at 09:51 +0200, Jordi Prats wrote: >> Thanks, >> It goes to failed state. Today I've been reloacating services but it >> fails to relocate (is not going to failed state): >> >> Oct 4 09:07:49 inf04 clurgmgrd[6299]: Service projectes is stopped >> Oct 4 09:07:49 inf04 clurgmgrd[6299]: #70: Attempting to >> restart service projectes locally. >> Oct 4 09:07:49 inf04 clurgmgrd[6299]: Starting stopped service >> projectes >> >> On the other node log do not appear anything related to the relocation. >> Maybe is a communication problem between nodes? > > It sounds like rgmanager isn't running on the other node or something; > check /proc/cluster/services? > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- ...................................................................... __ / / Jordi Prats Catal? C E / S / C A Departament de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... pgp:0x5D0D1321 ...................................................................... From lhh at redhat.com Thu Oct 5 16:25:29 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 05 Oct 2006 12:25:29 -0400 Subject: [Linux-cluster] Xen virtual machine fencing Message-ID: <1160065529.18145.12.camel@rei.boston.devel.redhat.com> Hi, I committed an updated agent for fencing Xen virtual machines to CVS. It allows fencing of any virtual machine from any other host in the cluster, and handles the case where the VM no longer exists. 
Note that there is no 'on' function mostly due to the fact that it would require a lot of configuration knowledge about the VM which is currently not available. The README is not 100% complete, and neither are any of the features mentioned in TODO. ;) Basically, here's how to get it running: - build (requires nss, openais, cman, & nspr development stuff) - install openais + cman - generate a key file (e.g. dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4096 count=1) - scp /etc/cluster/fence_xvm.key to all dom0 cluster nodes. - start cman - start fence_xvmd with whatever options you like on all members of the dom0 cluster (must be started with same options cluster-wide - start domU nodes - scp /etc/cluster/fence_xvm.key to all domU machines. - install fence_xvm on domU nodes - fence_xvm -H || fence_xvm -u -H (boom) If anyone wants to take up the ball on anything in the TODO, let me know. (If you want to implement the SSL part, you need to use the nss/nspr libraries, and NOT openssl, due to licensing and other reasons). -- Lon From lhh at redhat.com Thu Oct 5 19:18:29 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 05 Oct 2006 15:18:29 -0400 Subject: [Linux-cluster] problems relocating services In-Reply-To: <4523FED1.9010404@cesca.es> References: <452200C5.6080406@cesca.es> <1159883433.8020.3.camel@rei.boston.devel.redhat.com> <45236805.60709@cesca.es> <1159981194.12856.14.camel@rei.boston.devel.redhat.com> <4523FED1.9010404@cesca.es> Message-ID: <1160075910.18145.18.camel@rei.boston.devel.redhat.com> On Wed, 2006-10-04 at 20:34 +0200, Jordi Prats wrote: > It appears to be running on both nodes: > > # cat /proc/cluster/services > Service Name GID LID State Code > Fence Domain: "default" 1 2 run - > [1 2] > > DLM Lock Space: "Magma" 3 4 run - > [1 2] > > User: "usrm::manager" 2 3 run - > [1 2] > > # ccs_tool lsnode > > Cluster name: dades, config_version: 76 Could you post your service blob? If you're using a script, it might not be installed on the other node. When you do a "relocate", does anything appear in the logs on the other node? -- Lon From danwest at comcast.net Thu Oct 5 19:35:14 2006 From: danwest at comcast.net (danwest) Date: Thu, 05 Oct 2006 15:35:14 -0400 Subject: [Linux-cluster] qdiskd vote not represented by cman Message-ID: <1160076914.3666.2.camel@belmont.site> Shouldn?t we expect to see the qdisk votes reported in ?Total_votes? from cman (see cman_tool below). If we have four nodes with 1 vote each and a qdisk configured with 1 vote (see cluster.conf snippet below) shouldn?t we see a total vote count of 5? With the qdisk config shown below I would expect to be able to sustain a loss of 2 out of 4 nodes and still have quorum but in fact the loss of 2 nodes dissolves quorum every time, effectively locking the cluster. Thanks, Dan Node4:~ # cman_tool status Protocol version: 5.0.1 Config version: 28 Cluster name: testcluster Cluster ID: 26387 Cluster Member: Yes Membership state: Cluster-Member Nodes: 4 Expected_votes: 4 Total_votes: 4 Quorum: 3 Active subsystems: 8 Node name: node4 Node addresses: X.X.X.X From isplist at logicore.net Fri Oct 6 01:25:26 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 5 Oct 2006 20:25:26 -0500 Subject: [Linux-cluster] Cluster.conf Message-ID: <2006105202526.771406@leena> I've been messing with GFS for a while now, learning curve kinda high but slowly getting it. 
I'm now at the point where I need to fence things better so that reliability becomes better now that I'm getting closer to actually using this on something production. Thing is, I've asked before about cluster.conf file building without replies so am still unsure about this part right now. I've seen countless variations, many with things I've not even seen before and for the most part, many seem to be custom made, like someone's recipe :). So, the question remains... where can I find VERY good details and information that will help me understand the building of this file. I'm now using a McData ED-5000 switch and need to make sure that fencing is working correctly. My (probably silly) cluster. conf file looks like; Isn't there something missing for fencing in each clusternode line? Note also that every time I start the cluster, I get quorum'd out until I log into another node and run cman_tool expected -e 1 to regain. I've seen a way to fix that in my travels but you think I can find it now? Nope :). ANY help to make this man's crummy conf file work properly would be welcome. Mike From pcaulfie at redhat.com Fri Oct 6 07:12:28 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 06 Oct 2006 08:12:28 +0100 Subject: [Linux-cluster] qdiskd vote not represented by cman In-Reply-To: <1160076914.3666.2.camel@belmont.site> References: <1160076914.3666.2.camel@belmont.site> Message-ID: <452601DC.4030206@redhat.com> danwest wrote: > Shouldn?t we expect to see the qdisk votes reported in ?Total_votes? > from cman (see cman_tool below). Yes. if the quorum disk is registered correctly with cman you should see the votes it contributes and also it's "node name" in cman_tool nodes. > If we have four nodes with 1 vote each > and a qdisk configured with 1 vote (see cluster.conf snippet below) > shouldn?t we see a total vote count of 5? With the qdisk config shown > below I would expect to be able to sustain a loss of 2 out of 4 nodes > and still have quorum but in fact the loss of 2 nodes dissolves quorum > every time, effectively locking the cluster. -- patrick From gwaters1 at csc.com Fri Oct 6 12:03:00 2006 From: gwaters1 at csc.com (Grant Waters) Date: Fri, 6 Oct 2006 13:03:00 +0100 Subject: [Linux-cluster] Fw: STONITH Message-ID: Forgot to say I also get the following msgs in syslog when I telnet to the NPS.... Oct 6 12:53:34 node1 cluquorumd[27339]: Cannot log into WTI Network/Telnet Power Switch. Oct 6 12:53:34 node1 cluquorumd[27339]: STONITH: Device at xx.xxx.xxx.xxx controlling node2-h FAILED status check: Bad configuration Oct 6 12:53:47 node1 cluquorumd[2384]: Error returned from STONITH device(s) controlling node1-h. See system logs on node2-h for more information. I obscured the IP address in there - but it is the correct address of the NPS. What could this "Bad Config" be - is it the /etc/cluster.xml? Regards, GXW :o) ----- Forwarded by Grant Waters/GIS/CSC on 06/10/2006 13:00 ----- Grant Waters/GIS/CSC 06/10/2006 12:11 To linux-cluster at redhat.com cc Subject STONITH I had a quick search through your threads but couldn't find an exact hit which includes a resolution so I thought I'd try posting this here. We have a two node RH ES 3.0 cluster which uses an MSA 500 G2 shared array with a single LUN, and a crossover cable set up as eth1 for heartbeat. Both nodes are dual fed through an NPS power switch. All works fine and has done for 18 months but we've had 2 outages recently where the following happens... 
We appear to lose eth1, and the MSA 500 G2 starts timing out, and by the time I get in in the morning I can see errors on the MSA 500 G2 LCDs saying "43 REDUNDANCY FAILED" and "POWER OK" resepctively on the secondary and primary controllers. Both servers are up, but the failover node appears to have been forcibly rebooted by STONITH, with 2 plugs in the NPS being turned off & on again. This leaves neither node able to talk to the shared array, and the service down. Powering cycling both nodes and the array fixes the problem, but I want to know whats causing it in the first place. It doesn't appear to be related to load, although I can't rule that out - both outages were at approx 04:40 on a Friday. Here are the key msgs from syslog... Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down. Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout .......~100 of these Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:44:53 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX. Sep 29 04:45:03 node1 clumembd[2411]: Membership View #3:0x00000001 Sep 29 04:45:04 node1 cluquorumd[2389]: --> Commencing STONITH <-- Sep 29 04:45:06 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /Off. Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down. Sep 29 04:45:08 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /Off. Sep 29 04:45:08 node1 cluquorumd[2389]: STONITH: node2-h has been fenced! Sep 29 04:45:10 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /On. Sep 29 04:45:12 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /On. Sep 29 04:45:12 node1 cluquorumd[2389]: STONITH: node2-h is no longer fenced off. Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:45:14 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX. Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down. Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:47:44 node1 kernel: tg3: eth1: Flow control is on for TX and on for RX. I thought it would go again this morning so I turned up the cluster daemon loglevels, and unfortunately it didn't crash but I spotted this in the debug msgs.... Oct 6 04:39:31 node1 clulockd[2462]: ioctl(fd,SIOCGARP,ar [eth1]): No such device or address Oct 6 04:39:31 node1 clulockd[2462]: Connect: Member #1 (192.168.100.101) [IPv4] Oct 6 04:39:31 node1 clulockd[2462]: Processing message on 11 Oct 6 04:39:31 node1 clulockd[2462]: Received 188 bytes from peer Oct 6 04:39:31 node1 clulockd[2462]: LOCK_LOCK | LOCK_TRYLOCK Oct 6 04:39:31 node1 clulockd[2462]: lockd_trylock: member #1 lock 0 Oct 6 04:39:31 node1 clulockd[2462]: Replying ACK The point is the cluster is working fine, and fails over and back fine. I can telnet onto the NPS from both nodes so thats OK too. As far as I can tell eth1 is set up OK, and working across 192.168 addresses. Any ideas where to start looking at this? Regards, GXW :o) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilya at nigma.ru Fri Oct 6 08:53:33 2006 From: ilya at nigma.ru (Ilya M. Slepnev) Date: Fri, 06 Oct 2006 12:53:33 +0400 Subject: [Linux-cluster] Cluster.conf In-Reply-To: <2006105202526.771406@leena> References: <2006105202526.771406@leena> Message-ID: <1160124814.5597.3.camel@localhost.localdomain> I'd like to know that also!-) I can't find a good manual, explaining me how to write cluster.conf... 
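For reference, a bare-bones two-node cluster.conf has roughly the following shape. Every hostname, address and password here is invented for illustration (this is not the scrubbed config from Mike's mail), and the fence_mcdata options should be double-checked against the agent's man page before use. The two_node/expected_votes attributes on the cman tag are what keep a two-node cluster quorate with a single member, which sounds like the "quorum'd out until I run cman_tool expected -e 1" symptom:

<?xml version="1.0"?>
<cluster name="example" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
        <clusternode name="node1" votes="1">
            <fence>
                <method name="1">
                    <!-- "port" would be the switch port the node's HBA is zoned on -->
                    <device name="ED5000" port="1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node2" votes="1">
            <fence>
                <method name="1">
                    <device name="ED5000" port="2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <!-- illustrative only: agent name and options should be verified locally -->
        <fencedevice name="ED5000" agent="fence_mcdata" ipaddr="10.0.0.5" login="admin" passwd="xxxxx"/>
    </fencedevices>
    <rm/>
</cluster>

The part people usually miss is exactly the fence/method/device nesting inside each clusternode, which ties the node to an entry in fencedevices.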
On Thu, 2006-10-05 at 20:25 -0500, isplist at logicore.net wrote: > I've been messing with GFS for a while now, learning curve kinda high but > slowly getting it. I'm now at the point where I need to fence things better so > that reliability becomes better now that I'm getting closer to actually using > this on something production. > > Thing is, I've asked before about cluster.conf file building without replies > so am still unsure about this part right now. I've seen countless variations, > many with things I've not even seen before and for the most part, many seem to > be custom made, like someone's recipe :). > > So, the question remains... where can I find VERY good details and information > that will help me understand the building of this file. > > I'm now using a McData ED-5000 switch and need to make sure that fencing is > working correctly. My (probably silly) cluster. conf file looks like; > > > > > > > > > > > > > > > > > name="ED5000" passwd="xxxxx"/> > > > > > > > > Isn't there something missing for fencing in each clusternode line? > > Note also that every time I start the cluster, I get quorum'd out until I log > into another node and run cman_tool expected -e 1 to regain. I've seen a way > to fix that in my travels but you think I can find it now? Nope :). > > ANY help to make this man's crummy conf file work properly would be welcome. > > Mike > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 191 bytes Desc: This is a digitally signed message part URL: From gwaters1 at csc.com Fri Oct 6 11:10:38 2006 From: gwaters1 at csc.com (Grant Waters) Date: Fri, 6 Oct 2006 12:10:38 +0100 Subject: [Linux-cluster] STONITH Message-ID: I had a quick search through your threads but couldn't find an exact hit which includes a resolution so I thought I'd try posting this here. We have a two node RH ES 3.0 cluster which uses an MSA 500 G2 shared array with a single LUN, and a crossover cable set up as eth1 for heartbeat. Both nodes are dual fed through an NPS power switch. All works fine and has done for 18 months but we've had 2 outages recently where the following happens... We appear to lose eth1, and the MSA 500 G2 starts timing out, and by the time I get in in the morning I can see errors on the MSA 500 G2 LCDs saying "43 REDUNDANCY FAILED" and "POWER OK" resepctively on the secondary and primary controllers. Both servers are up, but the failover node appears to have been forcibly rebooted by STONITH, with 2 plugs in the NPS being turned off & on again. This leaves neither node able to talk to the shared array, and the service down. Powering cycling both nodes and the array fixes the problem, but I want to know whats causing it in the first place. It doesn't appear to be related to load, although I can't rule that out - both outages were at approx 04:40 on a Friday. Here are the key msgs from syslog... Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down. Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout .......~100 of these Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:44:53 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX. 
Sep 29 04:45:03 node1 clumembd[2411]: Membership View #3:0x00000001 Sep 29 04:45:04 node1 cluquorumd[2389]: --> Commencing STONITH <-- Sep 29 04:45:06 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /Off. Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down. Sep 29 04:45:08 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /Off. Sep 29 04:45:08 node1 cluquorumd[2389]: STONITH: node2-h has been fenced! Sep 29 04:45:10 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /On. Sep 29 04:45:12 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /On. Sep 29 04:45:12 node1 cluquorumd[2389]: STONITH: node2-h is no longer fenced off. Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:45:14 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX. Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down. Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:47:44 node1 kernel: tg3: eth1: Flow control is on for TX and on for RX. I thought it would go again this morning so I turned up the cluster daemon loglevels, and unfortunately it didn't crash but I spotted this in the debug msgs.... Oct 6 04:39:31 node1 clulockd[2462]: ioctl(fd,SIOCGARP,ar [eth1]): No such device or address Oct 6 04:39:31 node1 clulockd[2462]: Connect: Member #1 (192.168.100.101) [IPv4] Oct 6 04:39:31 node1 clulockd[2462]: Processing message on 11 Oct 6 04:39:31 node1 clulockd[2462]: Received 188 bytes from peer Oct 6 04:39:31 node1 clulockd[2462]: LOCK_LOCK | LOCK_TRYLOCK Oct 6 04:39:31 node1 clulockd[2462]: lockd_trylock: member #1 lock 0 Oct 6 04:39:31 node1 clulockd[2462]: Replying ACK The point is the cluster is working fine, and fails over and back fine. I can telnet onto the NPS from both nodes so thats OK too. As far as I can tell eth1 is set up OK, and working across 192.168 addresses. Any ideas where to start looking at this? Regards, GXW :o) -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric at bootseg.com Fri Oct 6 14:43:21 2006 From: eric at bootseg.com (Eric Kerin) Date: Fri, 06 Oct 2006 10:43:21 -0400 Subject: [Linux-cluster] STONITH In-Reply-To: References: Message-ID: <45266B89.80705@bootseg.com> Grant Waters wrote: > > I had a quick search through your threads but couldn't find an exact > hit which includes a resolution so I thought I'd try posting this here. > We appear to lose eth1, and the MSA 500 G2 starts timing out, and by > the time I get in in the morning I can see errors on the MSA 500 G2 > LCDs saying "43 REDUNDANCY FAILED" and "POWER OK" resepctively on the > secondary and primary controllers. I've had the same problem, although I'm running RHEL 4. About 3 times in the last year I've had failures where the nodes can no longer access the MSA 500 G2, with the same errors shown on the controllers. Each time HP has told me to "Upgrade the firmware" (and this is over the course of a year or more at this point). Since the problem only happens every few months, by the time it happens again HP has a new firmware release out and they tell me to upgrade again. Not much help to fix your fencing problem. But since your MSA 500 G2 problem is the same as mine, I figured it was worth a mention. 
Thanks, Eric Kerin From filipe.miranda at gmail.com Fri Oct 6 14:50:54 2006 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Fri, 6 Oct 2006 11:50:54 -0300 Subject: [Linux-cluster] IPMI IBM x366 basics Message-ID: Hello there, I am trying to implement a 6 node cluster using RHEL4 + GFS. We will use IPMI as our fence device since we are using IBM x366 machines, they are IPMI compliant. Could someone help me out on how to set up these machines to use this IPMI functionality? Does it have a special NIC for it? This NIC must be connected to the same physical network as the private network between these nodes to communicate, right? Could I get some IPMI 101 basics? Thank you, -- --- Filipe T Miranda -------------- next part -------------- An HTML attachment was scrubbed... URL: From cjk at techma.com Fri Oct 6 15:57:03 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Fri, 6 Oct 2006 11:57:03 -0400 Subject: RE: [Linux-cluster] STONITH In-Reply-To: Message-ID: What exactly do you mean by outage? Power outage? If so, power for what? Just network gear? As far as I know the MSA500 shouldn't "time out"; it's a hard SCSI connection that's not in any way network dependent. I probably missed something but I'm not clear on your description of what happened. If it's SCSI timeouts, then see below about profiles. The MSA will not fail over correctly under Linux unless the "profile" for the connections defined in the controllers is set up correctly. Even if it's been done in the past, check it again. I've had the profile setting reset to the defaults after updating firmware. Even then, there needs to be I/O going down the pipe in order for the controllers to fail over correctly. If everything went down, then I can almost guarantee that the nodes came back online before the MSA was operational again. They're pretty slow booting and I'd bet just about any computer will boot way before the MSA will, and thus not be able to see any of the devices it presents. A reboot of the nodes then fixes that problem. Aside from all of this, you probably need to figure out why the primary controller failed in the first place. The fact that the redundancy failed on you is not good. Sounds like it failed over but you likely have other issues that are preventing the device paths from being maintained. Finally, if all else is good, try forcibly failing the controllers over by pulling the active one out and see how long it takes to recover. Then set your heartbeat timeout slightly longer than that value. Hope the ramble helps. Corey ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Grant Waters Sent: Friday, October 06, 2006 7:11 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] STONITH I had a quick search through your threads but couldn't find an exact hit which includes a resolution so I thought I'd try posting this here. We have a two node RH ES 3.0 cluster which uses an MSA 500 G2 shared array with a single LUN, and a crossover cable set up as eth1 for heartbeat. Both nodes are dual fed through an NPS power switch. All works fine and has done for 18 months but we've had 2 outages recently where the following happens... We appear to lose eth1, and the MSA 500 G2 starts timing out, and by the time I get in in the morning I can see errors on the MSA 500 G2 LCDs saying "43 REDUNDANCY FAILED" and "POWER OK" resepctively on the secondary and primary controllers.
Both servers are up, but the failover node appears to have been forcibly rebooted by STONITH, with 2 plugs in the NPS being turned off & on again. This leaves neither node able to talk to the shared array, and the service down. Powering cycling both nodes and the array fixes the problem, but I want to know whats causing it in the first place. It doesn't appear to be related to load, although I can't rule that out - both outages were at approx 04:40 on a Friday. Here are the key msgs from syslog... Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down. Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout .......~100 of these Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:44:53 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX. Sep 29 04:45:03 node1 clumembd[2411]: Membership View #3:0x00000001 Sep 29 04:45:04 node1 cluquorumd[2389]: --> Commencing STONITH <-- Sep 29 04:45:06 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /Off. Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down. Sep 29 04:45:08 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /Off. Sep 29 04:45:08 node1 cluquorumd[2389]: STONITH: node2-h has been fenced! Sep 29 04:45:10 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /On. Sep 29 04:45:12 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /On. Sep 29 04:45:12 node1 cluquorumd[2389]: STONITH: node2-h is no longer fenced off. Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:45:14 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX. Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down. Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex. Sep 29 04:47:44 node1 kernel: tg3: eth1: Flow control is on for TX and on for RX. I thought it would go again this morning so I turned up the cluster daemon loglevels, and unfortunately it didn't crash but I spotted this in the debug msgs.... Oct 6 04:39:31 node1 clulockd[2462]: ioctl(fd,SIOCGARP,ar [eth1]): No such device or address Oct 6 04:39:31 node1 clulockd[2462]: Connect: Member #1 (192.168.100.101) [IPv4] Oct 6 04:39:31 node1 clulockd[2462]: Processing message on 11 Oct 6 04:39:31 node1 clulockd[2462]: Received 188 bytes from peer Oct 6 04:39:31 node1 clulockd[2462]: LOCK_LOCK | LOCK_TRYLOCK Oct 6 04:39:31 node1 clulockd[2462]: lockd_trylock: member #1 lock 0 Oct 6 04:39:31 node1 clulockd[2462]: Replying ACK The point is the cluster is working fine, and fails over and back fine. I can telnet onto the NPS from both nodes so thats OK too. As far as I can tell eth1 is set up OK, and working across 192.168 addresses. Any ideas where to start looking at this? Regards, GXW :o) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jos at xos.nl Fri Oct 6 17:05:11 2006 From: jos at xos.nl (Jos Vos) Date: Fri, 6 Oct 2006 19:05:11 +0200 Subject: [Linux-cluster] IPMI IBM x366 basics In-Reply-To: ; from filipe.miranda@gmail.com on Fri, Oct 06, 2006 at 11:50:54AM -0300 References: Message-ID: <20061006190511.B5863@xos037.xos.nl> On Fri, Oct 06, 2006 at 11:50:54AM -0300, Filipe Miranda wrote: > We will use the IPMI as our fence device since we are using IBM x366 > machines, they are IPMI compliant. 
> Could someone help me out on how to setup tihs machines to use this IPMI > functionaliy> > Does it have a special NIC for it> AFAIK, the first NIC of an x366 can act in dual-mode: as IPMI device, with its own IP address, and as a normal NIC in Linux, with it own address. Both have their own MAC address. > I am tryingo to implement a 6 nodes Cluster using RHEL4 + GFS. > This NIC must me connected to the same physical network as the private > network between these nodes to communicate right> > You I get the some IPMI 101 basics> When you share the first NIC for IPMI and "normal" use, you need to use the same switch. If possible, I'd choose to use this NIC exclusively for IPMI, connected to a dedicated switch for. In RHCS, just configure that card as "IPMI Lan", with its IP address, user and password (all changeable from the x366 BIOS) and "password" as authentication type. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From dist-list at LEXUM.UMontreal.CA Fri Oct 6 17:37:39 2006 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Fri, 06 Oct 2006 13:37:39 -0400 Subject: [Linux-cluster] clustering and web throttling/quotas ? Message-ID: <45269463.6030307@lexum.umontreal.ca> Hello, We are using director in front of several web servers. I'm looking a way to block web client based on download quota , etc ? I know mod_cband but in a cluster/webfarm setup it does not seems to be the soltution From adas at redhat.com Fri Oct 6 19:56:29 2006 From: adas at redhat.com (Abhijith Das) Date: Fri, 06 Oct 2006 14:56:29 -0500 Subject: [Linux-cluster] GFS and samba problem, again In-Reply-To: <4523A637.1060706@fib.upc.edu> References: <4523A637.1060706@fib.upc.edu> Message-ID: <4526B4ED.9050907@redhat.com> sandra-llistes wrote: > Hi, > > I sent a mail a few days ago to this list related with GFS+samba > problems. > > Since the, we have installed a sepparated test environment also with > two linux servers where we have tested a samba server with an > exported share in GFS. The share is read-only and only one server is > exporting it. > > When we try to access from a single windows client it works fine, but > when we try to access to the same file from 2 or more windows clients > simoultaneously, windows hangs and samba also does. This seems not to > happen with concurrent access to different files or with linux clients. > > We've also tested to export the same share without GFS and in this > case it works fine. > > It seems to be a locking problem with samba, GFS and windows clients. > Does any of you have experienced similar problems? Do you have any > suggestion about this? > > Following is the share configuration in smb.conf: > > [public] > comment = ShareGFS > path = /public > writeable = No > read only = Yes > write list = @admsamba > force group = admsamba > create mask = 0775 > directory mask = 0775 > oplocks = No > locking = Yes > strict locking = Yes > # I proved with locking/Strick locking=Yes and No. Always happens the > same problem > > I attach some samba logs (Level 3). > Software Versions: > Fedora 5 > Samba 3.0.23 > GFS 6.1.5 > kernel 2.6.17-1.2187_FC5 > > Any help will be appreciated. > > Sandra Hernandez Hi Sandra, I'm not very familiar with the locking of samba, but I did try the scenario you described on my test cluster. I'm unable to reproduce your problem. I have an identical smb.conf as you've pasted above. 
Accessing (reading a txt file, or playing a video clip) from two windows clients simultaneously works just fine without any glitches. If I understood it right, the test case you describe has one node in a cluster exporting a single samba share over a GFS filesystem and you're using multiple windows clients to access the same file in this share. This is a fairly basic operation IMO and it is quite odd that you should see this failure. Maybe you can try the CVS version of cluster suite (cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -r RHEL4 cluster) to see if the problem persists. Also, I'd be interested in knowing the behavior when you mount GFS on only one node (the one that's exporting) and also when you use GFS with lock_nolock on a standalone machine. Thanks, --Abhi From filipe.miranda at gmail.com Sat Oct 7 00:05:07 2006 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Fri, 6 Oct 2006 21:05:07 -0300 Subject: [Linux-cluster] IPMI IBM x366 basics In-Reply-To: <20061006190511.B5863@xos037.xos.nl> References: <20061006190511.B5863@xos037.xos.nl> Message-ID: Jos, Thanks a lot, I entered the Bios setup and found out the BMC configurantion, where I set the IP address and User+password information. It worked just fine, thanks for the hint!!! Regards, Filipe Miranda On 10/6/06, Jos Vos wrote: > > On Fri, Oct 06, 2006 at 11:50:54AM -0300, Filipe Miranda wrote: > > > We will use the IPMI as our fence device since we are using IBM x366 > > machines, they are IPMI compliant. > > Could someone help me out on how to setup tihs machines to use this IPMI > > functionaliy> > > Does it have a special NIC for it> > > AFAIK, the first NIC of an x366 can act in dual-mode: as IPMI device, > with its own IP address, and as a normal NIC in Linux, with it own > address. > Both have their own MAC address. > > > I am tryingo to implement a 6 nodes Cluster using RHEL4 + GFS. > > This NIC must me connected to the same physical network as the private > > network between these nodes to communicate right> > > You I get the some IPMI 101 basics> > > When you share the first NIC for IPMI and "normal" use, you need to > use the same switch. If possible, I'd choose to use this NIC > exclusively for IPMI, connected to a dedicated switch for. > > In RHCS, just configure that card as "IPMI Lan", with its IP address, > user and password (all changeable from the x366 BIOS) and "password" > as authentication type. > > -- > -- Jos Vos > -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 > -- Amsterdam, The Netherlands | Fax: +31 20 6948204 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- --- Filipe T Miranda Red Hat Certified Engineer -------------- next part -------------- An HTML attachment was scrubbed... URL: From zvedavec at gmail.com Sat Oct 7 14:42:35 2006 From: zvedavec at gmail.com (Zvedavec) Date: Sat, 7 Oct 2006 16:42:35 +0200 Subject: [Linux-cluster] NFS problem ? Message-ID: Dear all I build a small linux cluster. Based on Fedora Core 5 at master node. Hardware of the nodes and master are identical identicky - mainborards: Asus M2NPV-MX :http://support.asus.com/download/do...&model=M2NPV-MX During instalation/configuration I did ->> Step by step : 1. DHCP - working well, give correct IPs to nodes 2. TFTP - file with kernel is loaded to node and boot start 3. NFS - working at master, but during the booting of system I deal with the problem with mounting NFS rootimage. 
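(For reference, an NFS-root node needs two pieces that agree with each other: the export on the master and the nfsroot argument passed to the node's kernel. The paths and addresses below are placeholders only, not taken from this setup:

# /etc/exports on the master (run exportfs -ra after editing)
/srv/nfsroot/node1   192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check)

# matching kernel arguments in the node's pxelinux.cfg entry;
# the node kernel needs CONFIG_ROOT_NFS=y plus kernel-level IP autoconfiguration
append root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot/node1 ip=dhcp rw

If the export or the nfsroot path do not match, the boot fails with exactly the "fail to mount system image" style of error described below.)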
I also tried the SLIM http://slim.cs.hku.hk/vmware/index.html installation. Everything is almost the same as in my installation. Everything works up to the last screen of the installation guide, but then comes the same error as in my installation: Error message: Error, fail to mount system image! It may be due to the network driver fail to load, NFS server export incorrect or network failure. We tried many things to fix it and finally we found that compiling the kernel with the parameter CONFIG_ROOT_NFS=y helps a little bit. We can reach the login screen. But during booting we still get a lot of error messages that the system is read-only. With a CD and a live distro on the node everything works well; I can mount the NFS disk from the master. Thanks for any advice/solution/anything that helps. Thank you. Best regards, Skeptik From jos at xos.nl Sun Oct 8 13:04:08 2006 From: jos at xos.nl (Jos Vos) Date: Sun, 08 Oct 2006 15:04:08 +0200 Subject: [Linux-cluster] Distributing cluster.conf with tag Message-ID: <200610081304.k98D48Z32299@xos037.xos.nl> Hi, After manually editing cluster.conf to add an entry, it seems to be impossible to distribute the config to the other nodes using system-config-cluster, because then a new version without the entry is distributed. Is there a way to distribute a cluster.conf *with* using ccsd to all cluster nodes? Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From jos at xos.nl Sun Oct 8 17:10:43 2006 From: jos at xos.nl (Jos Vos) Date: Sun, 08 Oct 2006 19:10:43 +0200 Subject: [Linux-cluster] Quorum partition size requirements? Message-ID: <200610081710.k98HAhd02563@xos037.xos.nl> Hi, What are the size requirements for a quorum disk (RHEL4 U4 qdisk)? Mkqdisk seems to write a fixed amount of status blocks (always for 16 nodes) and it doesn't complain when running mkqdisk on just 1 MB, so I guess the needs are minimal, but I want to be sure. (Back in RHEL 2.1 the old-style quorum disk needed to be 10+ MB.) Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From jprats at cesca.es Mon Oct 9 06:57:54 2006 From: jprats at cesca.es (Jordi Prats) Date: Mon, 09 Oct 2006 08:57:54 +0200 Subject: [Linux-cluster] problems relocating services In-Reply-To: <1160075910.18145.18.camel@rei.boston.devel.redhat.com> References: <452200C5.6080406@cesca.es> <1159883433.8020.3.camel@rei.boston.devel.redhat.com> <45236805.60709@cesca.es> <1159981194.12856.14.camel@rei.boston.devel.redhat.com> <4523FED1.9010404@cesca.es> <1160075910.18145.18.camel@rei.boston.devel.redhat.com> Message-ID: <4529F2F2.3010903@cesca.es> Hi, I'm attaching to you my services configuration. If I disable a service on node1 and enable it on node2, it successfully runs on the other node. So, apparently all scripts are installed on both nodes and functional. When relocating a service, nothing appears in the other node's log. So, it must be a communications problem. Where can I start to search any problem related to this? Network seems to be ok, and I can do ssh between nodes. Sending pings with mtr -i 0.01 does not lose any packets. Thanks, Services:
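(Jordi's attached configuration is not preserved here. Purely as an illustration of the kind of "service blob" Lon is asking for -- the node and service names below are taken from the clustat output earlier in the thread, everything else is invented -- an rgmanager service section in cluster.conf looks roughly like this:

<rm>
    <failoverdomains>
        <failoverdomain name="inf-domain" ordered="1" restricted="0">
            <failoverdomainnode name="inf04" priority="1"/>
            <failoverdomainnode name="inf05" priority="2"/>
        </failoverdomain>
    </failoverdomains>
    <resources>
        <!-- the script resource must exist at the same path on every node -->
        <script name="projectes-script" file="/etc/init.d/projectes"/>
    </resources>
    <service name="projectes" domain="inf-domain" autostart="1">
        <ip address="192.168.1.100" monitor_link="1"/>
        <script ref="projectes-script"/>
    </service>
</rm>

A relocate can only succeed if every resource referenced by the service, in particular the script file, is present and executable on the target node as well.)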